I use ABBYY OCR SDK with profile: "textExtraction", language: "english" and output format: "pdfSearchable". After I submit an invoice (see attached screenshot), all words are recognized correctly, except that the words "Description" and "UnitPrice" in the table header come out with whitespace in the middle.
So in the resulting PDF I get "Descr ip t ion" and "Un i tP r i ce". I don't understand why. Is this a bug?
The quality of the original document is higher than that of the attached screenshot. I am posting a screenshot only because I cannot post this invoice to the forum (but if you want, I can send you the original invoice by email).
Thank you very much, Vitalie
Am I right that you are getting this result when you copy and paste from the PDF? This is a well-known "feature" of Acrobat: it sometimes "invents" spaces even if they do not exist in the output text. ABBYY OCR uses some heuristics to work around this "nice feature" of Acrobat by adjusting font metrics, but apparently they did not work in this case. If you can share your original image, we will make sure to test our technologies against it in future releases.
Update: an option to export tagged PDFs is now available, explained here. Tagging solves this "creativity" problem of the PDF viewer: it no longer has to guess where the spaces are, it just takes the actual text from the tagged text layer.
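To illustrate why a viewer can "invent" spaces, here is a minimal Python sketch of a gap-based heuristic. It is not Acrobat's actual algorithm and not ABBYY code: the `extract_text` and `layout` functions, the glyph tuple layout and the `gap_ratio` threshold are all hypothetical, chosen only to show how a mismatch between where glyphs are placed and how wide the font metrics say they are can produce spurious spaces.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x position, width from font metrics)
Glyph = Tuple[str, float, float]


def layout(word: str, step: float, metric_width: float) -> List[Glyph]:
    """Place each character `step` units apart, as positioned on the page,
    while the font metrics claim each glyph is `metric_width` units wide."""
    return [(ch, i * step, metric_width) for i, ch in enumerate(word)]


def extract_text(glyphs: List[Glyph], gap_ratio: float = 0.2) -> str:
    """Rebuild a line of text the way a viewer without a tagged text layer
    might: guess a space whenever the gap to the previous glyph is wide.
    The 0.2 threshold is an assumption made for this illustration."""
    out = []
    prev_end = None
    for ch, x, width in glyphs:
        if prev_end is not None and x - prev_end > gap_ratio * width:
            out.append(" ")  # the viewer "invents" a space here
        out.append(ch)
        prev_end = x + width
    return "".join(out)


# Metrics match the placement: no gaps, so no invented spaces.
print(extract_text(layout("Description", step=10.0, metric_width=10.0)))

# Metrics are narrower than the placement: every gap looks like a space.
print(extract_text(layout("Description", step=10.0, metric_width=6.0)))
```

In this sketch the first call returns the word intact and the second returns it with a space after every letter. A tagged PDF avoids the guessing altogether, because the viewer reads the text from the tagged layer instead of reconstructing it from glyph positions.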
Thank you for the files. This issue has a different cause in each case:
answered 08 Apr '13, 14:00
Anastasia Ga... ♦♦