On some images of receipts, why does the OCR extract the numbers separately from the text description of the item? For example, if I have:
I would want Butter 2.00$ to be together in the extracted text. Instead I end up with Butter, Milk, Total and 2.00, 3.00, 5.00 separately, which leaves me with the difficult task of matching Butter with 2.00, Milk with 3.00, etc. The OCR should simply extract each line of text from the image (like the Tesseract engine does).
asked 03 Jun '13, 22:55
The issue occurs because the column with names and the column with prices are recognized as separate text blocks. At the moment, to process receipts we recommend exporting the recognized words with their coordinates to an XML file and then extracting the necessary information from that XML file on your side. In the future we are going to implement a special module for receipt recognition.
Please read the details here.
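For readers who go the coordinates route, here is a minimal sketch of the "extract on your side" step. It assumes the words and their bounding boxes have already been read out of the exported XML into plain tuples; the `Word` type, its field names, and `group_into_lines` are hypothetical names made up for this illustration and are not part of the ABBYY API or its XML schema. The idea is simply to put words whose vertical centres fall within the same band on one line, then order each line from left to right.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    # Hypothetical container: one recognized word plus its bounding box
    # in pixels, as it might be read from the exported XML.
    text: str
    left: int
    top: int
    right: int
    bottom: int


def group_into_lines(words: List[Word]) -> List[str]:
    """Rebuild text lines from word boxes, ignoring text-block boundaries."""
    lines: List[List[Word]] = []
    # Visit words from top to bottom, ordered by vertical centre.
    for word in sorted(words, key=lambda w: (w.top + w.bottom) / 2):
        centre = (word.top + word.bottom) / 2
        for line in lines:
            first = line[0]
            # Same line if the centre falls inside the first word's band.
            if first.top <= centre <= first.bottom:
                line.append(word)
                break
        else:
            lines.append([word])
    # Within each line, order words from left to right.
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.left))
        for line in lines
    ]


# Made-up coordinates imitating the receipt from the question:
# names in one block on the left, prices in another block on the right.
words = [
    Word("Butter", 10, 10, 80, 30),
    Word("Milk", 10, 40, 60, 60),
    Word("Total", 10, 80, 70, 100),
    Word("2.00", 200, 12, 240, 32),
    Word("3.00", 200, 41, 240, 61),
    Word("5.00", 200, 79, 240, 99),
]

for line in group_into_lines(words):
    print(line)
# Butter 2.00
# Milk 3.00
# Total 5.00
```

This simple band test assumes the receipt is scanned reasonably straight; a skewed image would need deskewing first or a more tolerant overlap rule.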
answered 13 Jul '13, 10:31
Anastasia Ga... ♦♦
Thank you for your question. Actually, ABBYY Cloud OCR SDK can export the recognition results to TXT format, i.e. you will get nothing but text, so you can skip the steps of converting to PDF and then from PDF to text. If we understand correctly, this is your main intent. Please find more details in our API reference.
Best regards, Anastasiya.
answered 23 Jul '13, 17:00
Anastasia: We have a similar problem, as described in another thread, "Ignore Layout". We can NOT use the XML and/or the coordinates. One method we are trying is converting the image to PDF, which ignores spaces and borders, and then converting the PDF to text. It would be nice if the Cloud SDK could do this when converting to text. Thanks, vivek.
answered 23 Jul '13, 04:44
Anastasiya: We were using image-to-text export, but ran into the problems described at the top of this thread by Mani. We tried different things and discovered that exporting to PDF ignores the spaces, borders and layout: the PDF has the text from left to right (and top to bottom), which circumvents the problems Mani describes above (also see our problem as described in the thread "Ignore Layout").
So we are toying with Image -> PDF and then doing PDF -> Text on our end as a workaround for the problem described above. Hopefully ABBYY will address this in a future release. Thanks, vivek.
answered 23 Jul '13, 22:40