
On some receipt images, why does the OCR extract the numbers separately from the text description of each item? For example, if I have:

Butter 2.00$

Milk 3.00$

Total 5.00$

I would want "Butter 2.00$" to stay together in the extracted text. Instead I end up with Butter, Milk, Total and 2.00, 3.00, 5.00 separately, and then have the difficult task of matching Butter with 2.00, Milk with 3.00, etc. The OCR should simply extract each line of text from the image (like the Tesseract engine does).

asked 03 Jun '13, 22:55


Mani


Dear Mani,

The issue occurs because the column of names and the column of prices are recognized as separate text blocks. At the moment, to process receipts we recommend exporting the recognized words with their coordinates to an XML file and then extracting the necessary information from that XML file on your side. In the future we are going to implement a special module for receipt recognition.

Please read the details here.
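As a rough illustration of the "extract on your side" step, here is a minimal Python sketch that regroups words into receipt lines by their vertical position. It assumes the words and their bounding boxes have already been read from the exported XML; the `Word` class, the `group_into_lines` function, and the sample coordinates are hypothetical and are not part of the ABBYY API or its XML schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical container for one recognized word and its bounding box."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def group_into_lines(words: List[Word], overlap: float = 0.5) -> List[str]:
    """Put words whose vertical extents overlap onto the same text line."""
    lines: List[List[Word]] = []
    # Visit words from top to bottom, ordered by vertical center.
    for w in sorted(words, key=lambda w: (w.top + w.bottom) / 2):
        for line in lines:
            ref = line[0]  # first word of the line is the height reference
            shared = min(ref.bottom, w.bottom) - max(ref.top, w.top)
            shortest = min(ref.bottom - ref.top, w.bottom - w.top)
            if shortest > 0 and shared / shortest >= overlap:
                line.append(w)
                break
        else:
            lines.append([w])  # no existing line overlaps enough
    # Within each line, order words from left to right.
    return [
        " ".join(x.text for x in sorted(line, key=lambda x: x.left))
        for line in lines
    ]


# Made-up coordinates imitating a receipt with a name column and a price column.
words = [
    Word("Butter", 10, 10, 80, 30),
    Word("Milk", 10, 40, 60, 60),
    Word("Total", 10, 70, 70, 90),
    Word("2.00$", 200, 12, 250, 32),
    Word("3.00$", 200, 41, 250, 61),
    Word("5.00$", 200, 71, 250, 91),
]
lines = group_into_lines(words)
print(lines)
```

The `overlap` threshold is a judgment call: a skewed or crumpled receipt may need a lower value, or deskewing before the words are grouped.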


answered 13 Jul '13, 10:31


Anastasia Ga... ♦♦

Dear Vivek,

Thank you for your question. Actually, ABBYY Cloud OCR SDK can export the recognition results to TXT format, i.e. you will obtain nothing but text. So you can skip the steps of converting to PDF and then from PDF to text. If we understand correctly, this is your main intent. Please find more details in our API reference.
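For illustration only, a request for plain-text output might be built as in the Python sketch below. The base URL, the `processImage` method name, and the `language` and `exportFormat` parameter names are assumptions based on the public Cloud OCR SDK documentation and should be checked against the API reference; `build_process_image_url` is a hypothetical helper, and authentication and the image upload itself are omitted.

```python
from urllib.parse import urlencode


def build_process_image_url(
    base_url: str = "http://cloud.ocrsdk.com",  # assumed service address
    language: str = "English",
    export_format: str = "txt",                 # assumed value for plain text
) -> str:
    """Build the URL of a recognition request that asks for TXT output."""
    query = urlencode({"language": language, "exportFormat": export_format})
    return f"{base_url}/processImage?{query}"


url = build_process_image_url()
print(url)
```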

Best regards, Anastasiya.


answered 23 Jul '13, 17:00


SDK_support ♦♦

Anastasia: We have a similar problem, described in another thread, "Ignore Layout". We can NOT use the XML and/or the coordinates. One method we are trying is converting the image to PDF, which ignores spaces and borders, and then converting the PDF to text. It would be nice if the Cloud SDK could do this when converting to text. Thanks, vivek.


answered 23 Jul '13, 04:44


vivek

Anastasiya: We were using Image to Text, but ran into the problems Mani describes at the top of this thread. We tried different things and discovered that exporting to PDF ignores the spaces, borders, and layout: the PDF has its text ordered left to right and top to bottom, which circumvents the problems Mani describes (also see our problem as described in the thread "Ignore Layout").

So we are toying with Image -> PDF and then doing PDF -> Text on our end as a workaround for the problem described above. Hopefully ABBYY will address this in a future release. Thanks, vivek.


answered 23 Jul '13, 22:40


vivek


Asked: 03 Jun '13, 22:55

Seen: 2,999 times

Last updated: 23 Jul '13, 22:40

© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal