data and metadata

  • Last Post 11 May 2016
mcswell posted this 11 May 2016

We are considering using ABBYY OCR as part of a project in which we would perform downstream processing on the results of the OCR. For this downstream processing, we would need both the raw text produced by the OCR and what I'm calling metadata: information on formatting, including horizontal and vertical whitespace (such as space between paragraphs, and indentation), font changes (italics, bold, and ideally shifts between serif and sans-serif fonts), etc. What we do NOT need is a pre-formatted output, such as a Microsoft Word version of the document.

Is this possible with ABBYY OCR? Should we be looking at the stand-alone version, or the SDK?

By the way, for our application we will be processing text in multiple languages, most of which will probably not have language models (think Navajo or Igbo or less-known languages than that). We will be working with Roman fonts, but they may have accent marks of various kinds.

Vitalie posted this 11 May 2016

I'm not on the ABBYY staff, but perhaps you need the ALTO format?

Oksana Serdyuk posted this 12 May 2016

You can try ABBYY FineReader Engine 11 for your usage scenario (neither the desktop version of FineReader nor Cloud OCR SDK can provide all the necessary information). FineReader Engine is our “big” SDK, which gives you the tools to integrate OCR technologies into your applications. You can get the recognized text and the so-called metadata via the API (the CharParams object) or from certain export formats, such as XML or ALTO. Moreover, it supports creating custom languages and training user patterns, which should be very useful in your case.
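
To illustrate the export route, here is a minimal Python sketch of reading per-word text, position, and font information from an ALTO file. It relies only on the public ALTO schema (String elements with CONTENT, HPOS, VPOS, and STYLEREFS attributes; TextStyle elements with FONTTYPE and FONTSTYLE). The sample document is hand-written for illustration and is not actual FineReader Engine output, and the function name extract_strings is hypothetical. The sketch also assumes style references sit on the String element itself; styles inherited from an enclosing TextLine or TextBlock are not resolved.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO fragment (an assumption for illustration, not real engine output).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTTYPE="serif" FONTSIZE="10"/>
    <TextStyle ID="TXT_1" FONTFAMILY="Times New Roman" FONTTYPE="serif" FONTSIZE="10" FONTSTYLE="italics"/>
  </Styles>
  <Layout>
    <Page ID="P1" WIDTH="2480" HEIGHT="3508">
      <PrintSpace>
        <TextBlock ID="B1" HPOS="300" VPOS="400" WIDTH="900" HEIGHT="40">
          <TextLine HPOS="300" VPOS="400" WIDTH="900" HEIGHT="40">
            <String CONTENT="Din\u00e9" HPOS="300" VPOS="400" WIDTH="120" HEIGHT="40" STYLEREFS="TXT_0"/>
            <SP HPOS="420" VPOS="400" WIDTH="20"/>
            <String CONTENT="bizaad" HPOS="440" VPOS="400" WIDTH="160" HEIGHT="40" STYLEREFS="TXT_1"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_strings(alto_xml):
    """Return one dict per String element: text, position, and resolved style."""
    root = ET.fromstring(alto_xml)

    # Collect the style table: ID -> attributes of the TextStyle element.
    styles = {
        el.get("ID"): dict(el.attrib)
        for el in root.iter()
        if local_name(el.tag) == "TextStyle"
    }

    words = []
    for el in root.iter():
        if local_name(el.tag) != "String":
            continue
        # STYLEREFS is a space-separated list of style IDs.
        style = {}
        for ref in el.get("STYLEREFS", "").split():
            style.update(styles.get(ref, {}))
        # FONTSTYLE (on the style) and STYLE (inline on the String) are
        # space-separated lists of values such as "bold" and "italics".
        flags = set(style.get("FONTSTYLE", "").split()) | set(el.get("STYLE", "").split())
        words.append({
            "text": el.get("CONTENT"),
            "hpos": float(el.get("HPOS", "nan")),
            "vpos": float(el.get("VPOS", "nan")),
            "font_type": style.get("FONTTYPE"),  # "serif" or "sans-serif"
            "italic": "italics" in flags,
            "bold": "bold" in flags,
        })
    return words


if __name__ == "__main__":
    for word in extract_strings(SAMPLE_ALTO):
        print(word)
```

Indentation and vertical gaps are not stored explicitly in this representation; they can be derived downstream by comparing the HPOS and VPOS values of neighbouring lines and blocks.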

If you would like to try our SDK solution, please contact your regional sales manager (all contacts can be found here) or simply fill in the following form on our site.