Bug: whitespace inserted in the middle of words

  • Last Post 20 August 2013
Vitalie posted this 08 March 2013

Hi guys,

I use ABBYY OCR SDK with profile "textExtraction", language "english" and output format "pdfSearchable". After I submit an invoice (see the attached screenshot), all words come out fine except "Description" and "UnitPrice" in the table header, which have whitespace in the middle after recognition.

So in the resulting PDF I get "Descr ip t ion" and "Un i tP r i ce". I don't understand why. Is this a bug?

The quality of the original document is higher than that of the attached screenshot. I am posting a screenshot only because I cannot post this invoice to the forum (but if you want, I can email the original invoice to you).

Thank you very much, Vitalie

Andrey Isaev posted this 11 March 2013

Am I right that you are getting this result when you copy and paste from the PDF? This is a well-known "feature" of Acrobat: it sometimes "invents" spaces even if they do not exist in the output text. ABBYY OCR applies some heuristics to work around this "nice feature" of Acrobat by adjusting font metrics, but apparently they did not work in this case. If you can share your original image, we will make sure to test our technologies against it in future releases.

Update: there is now an option to export tagged PDFs, explained here. Tagging solves this "creativity" problem of the PDF viewer: it does not have to guess where the spaces are, it just takes the actual text from the tagged text layer.

Vitalie posted this 11 March 2013

Hi Andrey, I sent an email containing the invoices for your attention. Please help us to solve this issue.

Vitalie posted this 05 April 2013

Hello Andrey, do you have any news on this issue? I think one way to solve this problem may be direct "normalization" of words. For example, if all the symbols of a word such as "description" lie within some small distance of one another, you could make the bounds of each symbol rectangle a little wider, or something similar. It is very important for us to have the words "normalized". We do not use Acrobat; we use PDFBox to extract the words, but we have the same issue as described. Vitalie
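
To illustrate the "normalization" idea above, here is a minimal sketch in Python. It is not ABBYY's heuristic and not the PDFBox API: the `Glyph` structure, the `join_glyphs` function and the `gap_ratio` threshold are all hypothetical names chosen for the example. It assumes the extractor can report each character together with its horizontal bounds, and it decides where the spaces go purely from the gaps between neighbouring characters, ignoring any space characters already present in the text layer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One extracted character with its horizontal extent (hypothetical structure)."""
    char: str
    x_start: float
    x_end: float


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.5) -> str:
    """Rebuild one text line from positioned glyphs.

    A space is emitted only where the gap between two neighbouring glyphs
    exceeds gap_ratio * average glyph width (an assumed, tunable threshold).
    """
    # Drop space glyphs from the text layer; spacing is re-derived from geometry.
    visible = [g for g in glyphs if not g.char.isspace()]
    if not visible:
        return ""
    visible.sort(key=lambda g: g.x_start)

    avg_width = sum(g.x_end - g.x_start for g in visible) / len(visible)
    threshold = gap_ratio * avg_width

    out = [visible[0].char]
    for prev, cur in zip(visible, visible[1:]):
        if cur.x_start - prev.x_end > threshold:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)
```

A single fixed ratio is a simplification: real documents mix font sizes and letter spacing on one line, so a practical version would probably compute the threshold per font run rather than per line.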

Anastasia Galimova posted this 08 April 2013

Hello Vitalie,

Thank you for the files. The issue has a different cause in each case:

  1. For the word "Description": it was a bug in the recognition. We are happy to inform you that it is already fixed; the fix will be available after the technologies update around May 2013.

  2. For the words "Qty | UnitPrice": it is a bug in the recognition; the developers are still investigating this issue.

  3. For the words "HAMPTON & CO. LTD.": the issue is caused by the PDF viewer (as Andrey described above); we hope it will be fixed soon.

Vitalie posted this 08 April 2013

Thank you, Anastasia, you gave me good news.

Vitalie posted this 14 August 2013

Hello Anastasia,

I tried to use ABBYY OCR with the "writeTags" option, but I still get a very unsatisfactory result! :( "Description" still comes out as "Descr ip t ion", whether I use writeTags=dontWrite or writeTags=write. I examined the resulting PDF file and found that it still has whitespace in the middle of the words, whether I extract the text with PDFBox or edit it with Adobe Acrobat.

Unfortunately I think the problem was not solved. How can I have text without these spaces? Or am I mistaken?
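
For anyone who wants to check their own output for this symptom, a rough sketch in Python follows. It assumes you already have the extracted text as a string (from PDFBox or any other tool) and a list of words you expect to appear; `find_split_words` is a hypothetical helper name, not part of any library. It reports the expected words that are missing as written but reappear once all whitespace is removed.

```python
import re
from typing import Iterable, List


def find_split_words(extracted: str, expected: Iterable[str]) -> List[str]:
    """Return expected words that are absent from the extracted text as written
    but present once every whitespace character is stripped out."""
    collapsed = re.sub(r"\s+", "", extracted)
    return [
        word for word in expected
        if word not in extracted and word in collapsed
    ]


# Example with the header text reported in this thread.
header = "Qty Descr ip t ion Un i tP r i ce"
print(find_split_words(header, ["Qty", "Description", "UnitPrice"]))
```

Because the check collapses all whitespace, it can also match a word that only forms across the boundary of two unrelated words, so treat a hit as a hint to inspect the PDF, not as proof.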

I submit documents using these options: "language=English,Spanish&profile=textExtraction&exportFormat=pdfSearchable&imageSource=scanner&pdf:writeTags=dontWrite"


I also tried the same options with "writeTags=write", but the results remain the same! :(
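
For reference, the two option strings can be generated side by side with a small Python sketch like the one below. The parameter names and values are the ones quoted in this post; the `build_query` helper and the placeholder `BASE_URL` are assumptions made for the example, since the thread does not state the endpoint. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Placeholder endpoint (hypothetical); substitute the real service URL.
BASE_URL = "https://ocr.example.invalid/processImage"


def build_query(write_tags: str) -> str:
    """Build the option string from this post for a given pdf:writeTags value."""
    params = {
        "language": "English,Spanish",
        "profile": "textExtraction",
        "exportFormat": "pdfSearchable",
        "imageSource": "scanner",
        "pdf:writeTags": write_tags,
    }
    # safe=",:" leaves the comma and colon unescaped so the result matches
    # the option string exactly as written in the post.
    return urlencode(params, safe=",:")


for value in ("dontWrite", "write"):
    print(f"{BASE_URL}?{build_query(value)}")
```

Generating both variants from one parameter set makes it easier to confirm that the only difference between the two submissions is the pdf:writeTags value.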

PLEASE, can you help us? We have a big problem with these whitespaces. Can I email you the documents that show this problem?

Thank you very much!

Waiting for your reply...

PS: If I send you my files, can you send me back two or three generated PDFs without the whitespace?

Anastasia Galimova posted this 20 August 2013

Hello Vitalie,

The spaces in the word "Description" should be fixed by the technologies update. Unfortunately, the update was delayed because of the backward compatibility test results; it is currently planned for November-December 2013. We apologize for the inconvenience.