Some parts of a specific PDF are not OCR-ed by ABBYY FineReader Engine

  • Last Post 25 September 2017
  • Topic Is Solved
Koen de Leijer posted this 08 September 2017

We received a PDF from a supplier that needs to be OCR-ed: `esco_original.pdf`

After processing it in Java through the `com.abbyy.FREngine` API, not all parts of the PDF are OCR-ed.

See: esco_abbyy.pdf

FYI, we use `LoadPredefinedProfile("DocumentConversion_Accuracy")`.

When I manually print the PDF, scan it, and process it through the API again, it is fully OCR-ed.

See: esco_rescanned.pdf and the fully-OCR-ed variant: esco_rescanned_abbyy.pdf

The question is: why is the original PDF not fully OCR-ed by ABBYY?

Nikolay Krivchanskiy posted this 20 September 2017

Hi Koen,

There are a number of ways to improve recognition quality in FineReader Engine. For example, we managed to achieve much better results by setting the options ObjectsExtractionParams::EnableAggressiveTextExtraction and ObjectsExtractionParams::DetectTextOnPictures to true.

Also, you should manually set the recognition language or languages of the document you are recognizing. You can do this with RecognizerParams::SetPredefinedTextLanguage.

For more information about object extraction options, please refer to Help → API Reference → Parameter Objects → Preprocessing, Analysis, Recognition, and Synthesis Parameters → ObjectsExtractionParams.
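As a minimal sketch of the language setting mentioned above: to the best of my knowledge, SetPredefinedTextLanguage accepts one or more predefined language names in a single comma-separated string. The helper below only builds that string and runs without the engine. The class name `LanguageList` and the languages "English" and "Dutch" are hypothetical and chosen for illustration; the engine call in the closing comment is an assumption based on the names used in this thread, so check it against the API Reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the FineReader Engine API.
final class LanguageList {
    private LanguageList() {}

    // Joins language names into one comma-separated string,
    // trimming whitespace and skipping null or blank entries.
    static String join(String... names) {
        List<String> cleaned = new ArrayList<>();
        for (String name : names) {
            if (name != null && !name.trim().isEmpty()) {
                cleaned.add(name.trim());
            }
        }
        if (cleaned.isEmpty()) {
            throw new IllegalArgumentException("at least one language name is required");
        }
        return String.join(",", cleaned);
    }
}

// Assumed usage with an existing IDocumentProcessingParams object named dpp:
// dpp.getPageProcessingParams().getRecognizerParams()
//    .SetPredefinedTextLanguage(LanguageList.join("English", "Dutch"));
```

Restricting recognition to the languages that actually occur in the document generally gives the recognizer fewer candidates to confuse, which is why setting them explicitly is recommended here.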

Koen de Leijer posted this 20 September 2017

Hi Nikolay

Thank you for your reply.

Just to make sure: my issue is not about the text extraction itself, but about the way a PDF is OCR-ed.

The PDF returned from the PDFExport module of ABBYY FineReader Engine will be processed by our own software, and the problem is that this PDF is not fully OCR-ed.

Koen de Leijer posted this 25 September 2017

Hi Nikolay

I eventually figured it out with the help of your reply.

This actually worked for me:

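    // Start from the default document processing parameters.
    IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();
    // Detect and correct the page orientation before analysis.
    dpp.getPageProcessingParams().getPagePreprocessingParams().setCorrectOrientation(true);
    // Extract text more aggressively, including text located on pictures.
    dpp.getPageProcessingParams().getObjectsExtractionParams().setEnableAggressiveTextExtraction(true);
    dpp.getPageProcessingParams().getObjectsExtractionParams().setDetectTextOnPictures(true);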

Thanks!
