Some parts of a specific PDF are not OCR-ed by ABBYY FineReader Engine

Koen de Leijer posted this 3 weeks ago

We received a PDF from a supplier that needs to be OCR-ed: `esco_original.pdf`


After processing it in Java through the `com.abbyy.FREngine` API, not all parts of the PDF are OCR-ed.

See: esco_abbyy.pdf

FYI, we use `LoadPredefinedProfile("DocumentConversion_Accuracy")`.


When I manually print the PDF, scan it and process it through the API again, it is fully OCR-ed

See: esco_rescanned.pdf and the fully-OCR-ed variant: esco_rescanned_abbyy.pdf


The question is: why is the original PDF not fully OCR-ed by ABBYY?

Nikolay Krivchanskiy posted this 3 days ago

Hi Koen,

There are a number of ways to improve recognition quality in FineReader Engine. For example, we achieved much better results by setting the options ObjectsExtractionParams::EnableAggressiveTextExtraction and ObjectsExtractionParams::DetectTextOnPictures to true.

Also, you should manually set the recognition language or languages of the document you are recognizing. You can do this with RecognizerParams::SetPredefinedTextLanguage.

For more information about object extraction options, please refer to Help → API Reference → Parameter Objects → Preprocessing, Analysis, Recognition, and Synthesis Parameters → ObjectsExtractionParams.
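
As a rough illustration of the settings suggested above, here is a minimal Java sketch. It does not use the real `com.abbyy.FREngine` classes: the nested classes, the `suggestedParams` helper and the field names are hypothetical stand-ins that only mirror the property names from this reply, so the example can run without the SDK. The comma-separated language list and the "English" value are assumptions as well; use the actual language(s) of the document.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Stand-in model of the settings named in the reply. These classes are NOT
 * the com.abbyy.FREngine types; they are hypothetical and only mirror the
 * property names so the example runs without the SDK.
 */
final class ExtractionOptionsSketch {

    /** Hypothetical stand-in for ObjectsExtractionParams. */
    static final class ObjectsExtractionParams {
        boolean enableAggressiveTextExtraction; // false until set below
        boolean detectTextOnPictures;           // false until set below
    }

    /** Hypothetical stand-in for RecognizerParams. */
    static final class RecognizerParams {
        final List<String> textLanguages = new ArrayList<>();

        // Mirrors RecognizerParams::SetPredefinedTextLanguage. Accepting a
        // comma-separated list of names is an assumption of this sketch.
        void setPredefinedTextLanguage(String names) {
            textLanguages.clear();
            for (String name : names.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    textLanguages.add(trimmed);
                }
            }
        }
    }

    /** Hypothetical container grouping the two parameter objects. */
    static final class ProcessingParams {
        final ObjectsExtractionParams objectsExtraction = new ObjectsExtractionParams();
        final RecognizerParams recognizer = new RecognizerParams();
    }

    /** Applies the suggested settings; the caller supplies the language list. */
    static ProcessingParams suggestedParams(String languages) {
        ProcessingParams params = new ProcessingParams();
        params.objectsExtraction.enableAggressiveTextExtraction = true;
        params.objectsExtraction.detectTextOnPictures = true;
        params.recognizer.setPredefinedTextLanguage(languages);
        return params;
    }

    public static void main(String[] args) {
        // "English" is a placeholder, not the known language of the document.
        ProcessingParams params = suggestedParams("English");
        System.out.println("EnableAggressiveTextExtraction = "
                + params.objectsExtraction.enableAggressiveTextExtraction);
        System.out.println("DetectTextOnPictures = "
                + params.objectsExtraction.detectTextOnPictures);
        System.out.println("Text languages = " + params.recognizer.textLanguages);
    }
}
```

In the real engine, the equivalent values would be set on the parameter objects that are passed to document processing; the exact accessor names in the Java wrapper should be taken from the API reference section mentioned above.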


Koen de Leijer posted this 3 days ago

Hi Nikolay

Thank you for your reply,

Just to make sure: my issue is not about the text-extraction itself, but about the way a PDF is OCR-ed.

The PDF returned by the PDFExport module of ABBYY FineReader Engine will be processed by our own software, and the problem is that this PDF is not fully OCR-ed.