I'm seeing inconsistencies between the layout of an image and the text output from Abbyy.

I'm testing with an image of an invoice that has a tabular layout. My post-processing logic treats runs of whitespace as separators and uses them to extract sections of the data for input into another program.
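A minimal sketch of this kind of whitespace-based splitting, assuming fields are separated by runs of at least two spaces. The function name `split_columns` and the `min_gap` threshold are illustrative and not taken from the actual post-processing code:

```python
import re


def split_columns(line, min_gap=2):
    """Split one line of OCR text into fields.

    Assumption: a run of at least `min_gap` spaces separates two fields,
    while a single space belongs to the field itself.
    """
    pattern = r" {%d,}" % min_gap
    return [field for field in re.split(pattern, line.strip()) if field]


if __name__ == "__main__":
    # Hypothetical invoice row, not taken from the real document.
    print(split_columns("Widget A      2     9.99     19.98"))
```

This only works if the OCR output keeps the horizontal gaps between columns, which is exactly why the lost whitespace is a problem.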

However, the horizontal whitespace is not always preserved, and Abbyy seems to have its own idea of how to format the text as a table: it seems to recognize that there is a table, but it aligns and groups some of the wrong data together horizontally.

Here's an image of the before and after: http://screencast.com/t/wYQI8h48I - the output from Abbyy is in the notepad document at the top, the source image is below.

I know that I can export as XML, which apparently records every character position accurately... but then I need to write a program to recompose the document into a text format... which is what the text output from Abbyy should already be providing!
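For reference, a rough sketch of what such a recomposition program might look like. It assumes the positions have already been read from the XML into `(text, left, top)` tuples in pixels; the XML parsing itself is left out because the schema is not shown here. The names `recompose`, `char_width` and `line_height` are hypothetical, and the fixed cell size is a simplifying assumption:

```python
from collections import defaultdict


def recompose(items, char_width=10, line_height=20):
    """Rebuild plain text from positioned fragments.

    items: iterable of (text, left, top) tuples, coordinates in pixels.
    char_width / line_height: assumed size of one text cell in pixels.
    """
    rows = defaultdict(dict)
    for text, left, top in items:
        row = int(top // line_height)   # which text line the fragment is on
        col = int(left // char_width)   # which character column it starts at
        rows[row][col] = text
    if not rows:
        return ""

    lines = []
    for row in range(min(rows), max(rows) + 1):
        cells = rows.get(row, {})       # missing rows become empty lines
        line = ""
        for col in sorted(cells):
            if line and len(line) >= col:
                line += " "             # fragments collide: keep one space
            else:
                line = line.ljust(col)  # pad with spaces up to the column
            line += cells[col]
        lines.append(line.rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical fragments, not taken from the real invoice.
    sample = [("Qty", 0, 0), ("Price", 100, 0), ("2", 0, 40), ("9.99", 100, 40)]
    print(recompose(sample))
```

A real document would also need handling for proportional fonts and skewed scans, which this sketch ignores.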

Another point, though not an important one for me, is that all of the horizontal lines go missing when processed by Abbyy. As you can see from the image, they are made up of hyphens and equals signs, which are valid characters. Why are they all stripped?

This output most likely results from how the document analysis works. For example, the upper part of your document is recognized as a few separate blocks and the lower part as a table. However, because the table is implicit (it has no borders), it is not always split into columns correctly.

In your case, I would recommend using the textExtraction profile. With it, the document's appearance and structure are ignored, pictures and tables are not detected, and you get the output as plain text. In addition, the recognized text is exported to the file line by line, from left to right, and the output simulates the original layout of the source document by means of inserted spaces and empty lines.
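As an illustration only, the sketch below builds a request URL that selects this profile. It assumes you are calling the ABBYY Cloud OCR SDK `processImage` method with its `language`, `profile` and `exportFormat` query parameters; the base URL and the helper name `build_process_url` are assumptions, so check them against the documentation for the product you actually use. No request is sent:

```python
from urllib.parse import urlencode

# Assumed service endpoint; replace with the one from your own account.
BASE_URL = "https://cloud.ocrsdk.com/processImage"


def build_process_url(language="English", export_format="txt"):
    """Return a processImage URL that requests the textExtraction profile."""
    params = {
        "language": language,
        "profile": "textExtraction",    # profile recommended above
        "exportFormat": export_format,  # plain text output
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_process_url())
```

The image itself would then be posted to that URL with your application credentials.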

As for the horizontal lines, they are detected as table borders. Because you export the recognition result to TXT, a format that by definition cannot represent tables, they do not appear in the output. This is expected behavior.


