Processing receipts with ProcessImage

  • Last Post 23 July 2013
Mani posted this 03 June 2013

On some receipt images, why does the OCR extract the numbers separately from the text description of each item? For example, if I have:

Butter 2.00$

Milk 3.00$

Total 5.00$

I would want Butter 2.00$ to stay together in the extracted text. Instead I end up with Butter, Milk, Total and 2.00, 3.00, 5.00 separately. I then have the difficult task of trying to match Butter with 2.00, Milk with 3.00, etc. The OCR should simply extract each line of text from the image (like the Tesseract engine does).

Anastasia Galimova posted this 13 July 2013

Dear Mani,

The issue occurs because the column with names and the column with prices are recognized as separate text blocks. At the moment, to process receipts we recommend exporting the recognized words with their coordinates to an XML file and then extracting the necessary information from this XML file on your side. In the future we are going to implement a special module for receipt recognition.

Please read the details here.
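To illustrate the coordinate-based approach, here is a minimal Python sketch that regroups recognized words into receipt lines. It is not ABBYY code: the `Word` tuple layout and the `group_into_lines` helper are hypothetical, and reading the words out of the exported XML is left out because the XML schema is not covered in this thread. The sketch assumes each word comes with its bounding box in pixels and that the receipt is not skewed.

```python
from typing import List, Tuple

# Hypothetical word record: (text, left, top, right, bottom), in pixels.
Word = Tuple[str, int, int, int, int]


def group_into_lines(words: List[Word]) -> List[str]:
    """Group words whose vertical centers fall in the same band into one line."""
    lines = []  # each entry: {"top": int, "bottom": int, "words": [Word, ...]}

    # Visit words from top to bottom, ordered by vertical center.
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        _, _, top, _, bottom = word
        center = (top + bottom) / 2
        for line in lines:
            # The band of a line is the box height of its first word.
            if line["top"] <= center <= line["bottom"]:
                line["words"].append(word)
                break
        else:
            # No existing band contains this word, so it starts a new line.
            lines.append({"top": top, "bottom": bottom, "words": [word]})

    # Within each line, order words left to right and join them with spaces.
    return [
        " ".join(w[0] for w in sorted(line["words"], key=lambda w: w[1]))
        for line in lines
    ]
```

With the word boxes from a receipt like the one above, the names and prices that sit at the same height end up on the same output line, regardless of which text block they were recognized in.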

vivek posted this 23 July 2013

Anastasia: We have a similar problem, as described in another thread, "Ignore Layout". We can NOT use the XML and/or the coordinates. One method we are trying is converting the image to PDF, which ignores spaces and borders, and then converting the PDF to text. It would be nice if the Cloud SDK could do this when converting to text. Thanks, vivek.

SDK_support posted this 23 July 2013

Dear Vivek,

Thank you for your question. ABBYY Cloud OCR SDK can export the recognition results to TXT format, in which case you will obtain nothing but the text. So you can skip the steps of converting to PDF and then from PDF to text. If we understand correctly, this is your main intent. Please find more details in our API reference.

Best regards, Anastasiya.

vivek posted this 23 July 2013

Anastasiya: We were using image-to-text export, but ran into the problems Mani describes at the top of this thread. We tried different things and discovered that exporting to PDF ignores the spaces, borders and layout, and that the text in the PDF runs left to right (and top to bottom). This circumvents the problems that Mani describes above (also see our problem as described in the thread "Ignore Layout").

So we are toying with Image -> PDF and then doing PDF -> Text on our end as a workaround for the problem described above. Hopefully ABBYY will address this in a future release. Thanks, vivek.