We are considering using ABBYY OCR as part of a project in which we would perform downstream processing on the OCR results. For this downstream processing, we would need both the raw text produced by the OCR and what I'm calling metadata: formatting information such as horizontal and vertical whitespace (space between paragraphs, indentation), font changes (italics, bold, and ideally shifts between serif and sans-serif fonts), etc. What we do NOT need is pre-formatted output, for instance a Microsoft Word version of the document.

Is this possible with ABBYY OCR? Should we be looking at the stand-alone version, or the SDK?

By the way, our application will process text in multiple languages, most of which will probably not have language models (think Navajo or Igbo, or languages even less well known). We will be working with Roman fonts, but they may carry accent marks of various kinds.

asked 11 May '16, 02:42

mcswell

I'm not on the ABBYY staff, but perhaps you need the ALTO format? https://abbyy.technology/en:features:ocr:alto

(11 May '16, 15:12) Vitalie

You can try ABBYY FineReader Engine 11 for your usage scenario (the desktop version of FineReader and Cloud OCR SDK cannot provide all the necessary information). FineReader Engine is our "big" SDK, which gives you the tools to integrate OCR technologies into your applications. You can get the recognized data and the so-called metadata via the API (the CharParams object) or from certain export formats, such as XML or ALTO. It also supports creating custom languages and training user patterns, which should be very useful in your case.
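To give a rough idea of what the downstream processing could look like with the ALTO export, here is a minimal Python sketch that reads word text, font information, and block-level whitespace from an ALTO file. The sample document is hand-written for illustration and is not real FineReader Engine output; which attributes the engine actually fills in (for example FONTTYPE or FONTSTYLE) is an assumption you would need to check against a real export. The function names `extract_words` and `block_layout` are hypothetical helpers, not part of any ABBYY API.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the ALTO v2 namespace (illustrative only,
# NOT actual FineReader Engine output).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTTYPE="serif" FONTSIZE="10"/>
    <TextStyle ID="TXT_1" FONTFAMILY="Arial" FONTTYPE="sans-serif" FONTSIZE="10" FONTSTYLE="bold"/>
  </Styles>
  <Layout>
    <Page ID="P1" WIDTH="2480" HEIGHT="3508">
      <PrintSpace>
        <TextBlock ID="B1" HPOS="300" VPOS="200" WIDTH="1800" HEIGHT="60">
          <TextLine HPOS="300" VPOS="200" WIDTH="900" HEIGHT="60">
            <String CONTENT="Chapter" HPOS="300" VPOS="200" WIDTH="400" HEIGHT="60" STYLEREFS="TXT_1"/>
            <SP HPOS="700" VPOS="200" WIDTH="30"/>
            <String CONTENT="caf&#233;" HPOS="730" VPOS="200" WIDTH="250" HEIGHT="60" STYLEREFS="TXT_0"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="B2" HPOS="380" VPOS="340" WIDTH="1720" HEIGHT="60">
          <TextLine HPOS="380" VPOS="340" WIDTH="500" HEIGHT="60">
            <String CONTENT="Indented" HPOS="380" VPOS="340" WIDTH="500" HEIGHT="60" STYLEREFS="TXT_0"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return one dict per String element: text, position, and font info."""
    root = ET.fromstring(alto_xml)
    # Map style IDs to their attributes (font family, serif/sans-serif, style).
    styles = {el.get("ID"): dict(el.attrib)
              for el in root.iter() if local(el.tag) == "TextStyle"}
    words = []
    for el in root.iter():
        if local(el.tag) != "String":
            continue
        style = {}
        # Only String-level STYLEREFS are resolved here; styles inherited
        # from enclosing TextLine/TextBlock elements are ignored in this sketch.
        for ref in el.get("STYLEREFS", "").split():
            style.update(styles.get(ref, {}))
        words.append({
            "text": el.get("CONTENT"),
            "hpos": float(el.get("HPOS")),
            "vpos": float(el.get("VPOS")),
            "font_family": style.get("FONTFAMILY"),
            "font_type": style.get("FONTTYPE"),    # "serif" / "sans-serif"
            "font_style": style.get("FONTSTYLE"),  # e.g. "bold", "italics"
        })
    return words


def block_layout(alto_xml):
    """Return indentation and vertical gap above each TextBlock, in page units."""
    root = ET.fromstring(alto_xml)
    blocks = [el for el in root.iter() if local(el.tag) == "TextBlock"]
    if not blocks:
        return []
    left_margin = min(float(b.get("HPOS")) for b in blocks)
    layout, prev_bottom = [], None
    for b in blocks:
        top = float(b.get("VPOS"))
        layout.append({
            "id": b.get("ID"),
            "indent": float(b.get("HPOS")) - left_margin,
            "gap_above": None if prev_bottom is None else top - prev_bottom,
        })
        prev_bottom = top + float(b.get("HEIGHT"))
    return layout


if __name__ == "__main__":
    for word in extract_words(SAMPLE_ALTO):
        print(word)
    for block in block_layout(SAMPLE_ALTO):
        print(block)
```

The sketch assumes blocks appear in reading order and that every String and TextBlock carries position attributes; a real export may need extra handling for missing attributes, multiple pages, or multi-column layouts.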

If you would like to try our SDK solution, please contact your regional sales manager (all contacts can be found here) or simply fill out the form on our site.

answered 12 May '16, 13:01

Oksana Serdyuk ♦♦



© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal