We are considering using ABBYY OCR as part of a project in which we would perform downstream processing on the results of the OCR. For this downstream processing, we would need not only the raw text produced by the OCR but also what I'm calling metadata: information on formatting, including horizontal and vertical whitespace (such as space between paragraphs and indentation), font changes (italics, bold, and ideally shifts between serif and sans-serif fonts), etc. What we do NOT need is pre-formatted output, such as a Microsoft Word version of the document.
Is this possible with ABBYY OCR? Should we be looking at the stand-alone version, or the SDK?
By the way, for our application we will be processing text in multiple languages, most of which will probably not have language models (think Navajo or Igbo, or languages even less well known than those). We will be working with Roman fonts, but they may have accent marks of various kinds.
asked 11 May '16, 02:42
You can try ABBYY FineReader Engine 11 for this scenario; neither the desktop version of FineReader nor Cloud OCR SDK can provide all the necessary information. FineReader Engine is our “big” SDK, which gives you the tools to integrate OCR technologies into your applications. You can get the recognized text and the so-called metadata through the API (the CharParams object) or from certain export formats, such as XML or ALTO. It also supports creating custom languages and training user patterns, which should be very useful in your case.
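To give a rough idea of the downstream side, here is a minimal Python sketch that reads an XML export and collects each formatting run with its font attributes, paragraph indent, and bounding box. It uses only the Python standard library, not the FineReader Engine API. The embedded sample document and every element and attribute name in it (par, line, formatting, charParams, ff, bold, italic, leftIndent, l/t/r/b) are assumptions made for illustration; please check them against the schema of the export you actually get before relying on this.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: element and attribute names are ASSUMPTIONS
# modeled on a FineReader-style XML export, not copied from real output.
SAMPLE_XML = """<document>
  <page width="2480" height="3508">
    <block blockType="Text">
      <text>
        <par leftIndent="120" lineSpacing="560">
          <line baseline="410" l="300" t="370" r="900" b="420">
            <formatting ff="Times New Roman" fs="11" italic="1">
              <charParams l="300" t="370" r="330" b="410">H</charParams>
              <charParams l="332" t="380" r="355" b="410">i</charParams>
            </formatting>
            <formatting ff="Arial" fs="11" bold="1">
              <charParams l="380" t="370" r="410" b="410">!</charParams>
            </formatting>
          </line>
        </par>
      </text>
    </block>
  </page>
</document>"""


def local_name(tag):
    """Drop any '{namespace}' prefix so lookups work with or without one."""
    return tag.rsplit("}", 1)[-1]


def is_true(value):
    """Interpret an optional boolean-like attribute value."""
    return value in ("1", "true")


def extract_runs(xml_text):
    """Return one dict per formatting run: text, font flags, indent, bbox."""
    root = ET.fromstring(xml_text)
    runs = []
    for par in (e for e in root.iter() if local_name(e.tag) == "par"):
        indent = int(par.get("leftIndent", "0"))
        for fmt in (e for e in par.iter() if local_name(e.tag) == "formatting"):
            chars = [c for c in fmt if local_name(c.tag) == "charParams"]
            if not chars:
                continue  # skip runs that carry no per-character data
            runs.append({
                "text": "".join(c.text or "" for c in chars),
                "font": fmt.get("ff"),
                "bold": is_true(fmt.get("bold")),
                "italic": is_true(fmt.get("italic")),
                "left_indent": indent,
                # Union of the character boxes: (left, top, right, bottom).
                "bbox": (
                    min(int(c.get("l")) for c in chars),
                    min(int(c.get("t")) for c in chars),
                    max(int(c.get("r")) for c in chars),
                    max(int(c.get("b")) for c in chars),
                ),
            })
    return runs


if __name__ == "__main__":
    for run in extract_runs(SAMPLE_XML):
        print(run)
```

Vertical whitespace between paragraphs or lines could then be estimated by comparing the bottom of one bounding box with the top of the next, and a serif versus sans-serif shift by looking the font name up in a table you maintain yourself.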
answered 12 May '16, 13:01
Oksana Serdyuk ♦♦