Hi guys,

I use the ABBYY OCR SDK with profile "textExtraction", language "english", and output format "pdfSearchable". After I submit an invoice (see attached screenshot), all words are recognized correctly, except that "Description" and "UnitPrice" in the table header come out with whitespace in the middle.

So in the resulting PDF I get "Descr ip t ion" and "Un i tP r i ce". I don't understand why. Is this a bug?

The quality of the original document is higher than that of the attached screenshot. I am posting only a screenshot because I cannot post this invoice to the forum (but if you want, I can email the original invoice to you).

Thank you very much, Vitalie

[Attached screenshot of the invoice]

asked 08 Mar '13, 14:06

Vitalie

edited 08 Mar '13, 14:06


Am I right that you are getting this result when you copy and paste from the PDF? This is a well-known "feature" of Acrobat: it sometimes "invents" spaces even if they do not exist in the output text. ABBYY OCR applies some heuristics to work around this "nice feature" of Acrobat by adjusting font metrics, but apparently they did not work in this case. If you can share your original image, we will make sure to test our technologies against it in future releases.

Update: an option to export tagged PDFs is now available, explained here. Tagging solves this "creativity" problem of the PDF viewer: the viewer does not have to guess where the spaces are, it just takes the actual text from the tagged text layer.
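
To illustrate why a viewer or extractor can "invent" spaces, here is a minimal Python sketch of gap-based text reconstruction. It is not the actual algorithm of Acrobat, PDFBox, or ABBYY; the `Glyph` type, the `place` helper, the `gap_ratio` threshold, and all coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # advance width reported by the font metrics


def rebuild_text(glyphs, gap_ratio=0.3):
    """Join glyphs left to right, inserting a space when the gap looks large."""
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            # Distance between the end of the previous glyph and this one.
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


def place(word, advance, width):
    """Lay out one glyph per character at a fixed advance (illustrative only)."""
    return [Glyph(c, i * advance, width) for i, c in enumerate(word)]


# Font metrics agree with the glyph positions: no gaps, no spaces.
tight = place("Description", advance=5.0, width=5.0)
# Font metrics are narrower than the glyph positions: every gap looks like a space.
loose = place("Description", advance=5.0, width=3.0)

print(rebuild_text(tight))
print(rebuild_text(loose))
```

In this sketch the text stored in the file is the same in both cases; only the mismatch between glyph positions and font widths makes the second layout come out with spaces between the letters.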

answered 11 Mar '13, 17:13

Andrey Isaev ♦♦

edited 30 Jul '13, 13:31

Hi Andrey, I sent an email containing the invoices for your attention to cloudocrsdk@abbyy.com. Please help us solve this issue.

(11 Mar '13, 17:38) Vitalie

Hello Andrey, do you have any news on this issue? I think the solution may be a direct "normalization" of words. For example, if all the symbols of a word such as "description" lie within a short distance of one another, you could make the bounds of each symbol's rectangle a little bigger/wider, or something similar. It is very important for us to have the words "normalized". We do not use Acrobat; we use PDFBox to extract the words, but we see the same issue. Vitalie

(05 Apr '13, 16:16) Vitalie

Hello Vitalie,

Thank you for the files. The issue has a different cause in each case:

  1. For the word "Description": it was a bug in the recognition. We are happy to report that it is already fixed; the fix will be available after the technology update around May 2013.

  2. For the words "Qty | UnitPrice": it is a bug in the recognition; the developers are still investigating this issue.

  3. For the words "HAMPTON & CO. LTD.": the issue is caused by the PDF viewer (as Andrey described above); we hope it will be fixed soon.

answered 08 Apr '13, 14:00

Anastasia Ga... ♦♦

Thank you, Anastasia, you gave me good news.

(08 Apr '13, 14:29) Vitalie

Hello Anastasia,

I tried ABBYY OCR with the "writeTags" option, but I still get a very unsatisfactory result! :( "Description" still comes out as "Descr ip t ion", whether I use writeTags=dontWrite or writeTags=write. I examined the resulting PDF file and found that it still has whitespace in the middle of the words, whether I extract the text with PDFBox or edit it with Adobe Acrobat.

Unfortunately, I think the problem has not been solved. How can I get the text without these spaces? Or am I mistaken?

I submit documents using these options: "language=English,Spanish&profile=textExtraction&exportFormat=pdfSearchable&imageSource=scanner&pdf:writeTags=dontWrite"

or

the same options with "writeTags=write", but the results remain the same! :(
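
For reference, the two option strings compared above can be generated from one place, which makes it easier to be sure that only `pdf:writeTags` differs between runs. This is a minimal sketch: the parameter names and values are copied from the post, while the `build_options` helper is hypothetical and the code only builds the string without sending any request.

```python
from urllib.parse import urlencode


def build_options(write_tags):
    """Build the option string from the post, varying only pdf:writeTags."""
    params = {
        "language": "English,Spanish",
        "profile": "textExtraction",
        "exportFormat": "pdfSearchable",
        "imageSource": "scanner",
        "pdf:writeTags": write_tags,
    }
    # Keep ':' and ',' unescaped so the result matches the string in the post.
    return urlencode(params, safe=":,")


for value in ("dontWrite", "write"):
    print(build_options(value))
```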

PLEASE, can you help us? We have a big problem with these whitespaces... Can I email you the documents that show this problem?

Thank you very much!

Waiting for your reply...

PS: if I send you my files, can you send me back 2-3 generated PDFs without the whitespace?

(14 Aug '13, 11:23) Vitalie
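
As a possible stopgap on the extraction side, the text pulled out of the PDF could be post-processed to rejoin fragments that form a known word. The sketch below is an assumption, not an ABBYY or PDFBox feature: the `rejoin_fragments` function, the `max_parts` limit, and the vocabulary of expected header words are all hypothetical, and it works on the extracted text rather than on glyph rectangles.

```python
def rejoin_fragments(text, vocabulary, max_parts=8):
    """Merge runs of adjacent tokens whose concatenation is a known word."""
    known = {word.lower() for word in vocabulary}
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        merged = False
        # Try the longest run first; a run must contain at least two tokens.
        for j in range(min(len(tokens), i + max_parts), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate.lower() in known:
                out.append(candidate)
                i = j
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Hypothetical vocabulary of words expected in the invoice table header.
header_words = ["Description", "UnitPrice"]
print(rejoin_fragments("Qty Descr ip t ion Un i tP r i ce Total", header_words))
```

This only helps for words that are known in advance (such as fixed table headers) and it collapses all whitespace to single spaces, so it does not replace a fix in the recognition itself.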

Asked: 08 Mar '13, 14:06

Seen: 2,194 times

Last updated: 20 Aug '13, 17:03

© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal