We are considering using ABBYY OCR in a project where we would perform downstream processing on the OCR results. For this downstream processing, we would need not only the raw text produced by the OCR but also what I'm calling metadata: information on formatting, including horizontal and vertical whitespace (such as indentation and the space between paragraphs), font changes (italics, bold, and ideally shifts between serif and sans-serif fonts), and so on. What we do NOT need is pre-formatted output, such as a Microsoft Word version of the document.
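To make the requirement concrete, here is a rough sketch of the kind of per-line record we would like to end up with after OCR. All of the class and field names (`Run`, `Line`, `indent`, `space_before`, and so on) are my own invention for illustration and do not come from any ABBYY product or API; the units are arbitrary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    """A stretch of text that shares one set of font attributes (hypothetical)."""
    text: str
    bold: bool = False
    italic: bool = False
    serif: bool = True


@dataclass
class Line:
    """One line of recognized text plus its layout measurements (hypothetical)."""
    runs: List[Run] = field(default_factory=list)
    indent: float = 0.0        # horizontal offset from the left margin
    space_before: float = 0.0  # vertical gap between this line and the previous one

    @property
    def text(self) -> str:
        # The raw text is just the runs joined together, with formatting dropped.
        return "".join(run.text for run in self.runs)


# Example: an indented line whose second word is italic and contains a diacritic.
line = Line(
    runs=[Run("Héllo "), Run("wörld", italic=True)],
    indent=36.0,
    space_before=12.0,
)
```

In other words, we want something from which both the plain text (`line.text`) and the formatting of each run can be recovered, in whatever concrete format the tool happens to use.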
Is this possible with ABBYY OCR? Should we be looking at the stand-alone version, or the SDK?
By the way, our application will process text in multiple languages, most of which probably have no language models available (think Navajo or Igbo, or languages even less well known than those). All of the text will be in Roman fonts, but it may carry accent marks of various kinds.