Which DocumentProcessingParams settings give the most accurate extraction from a PDF file?

  • Last Post 21 June 2019
Javier Salas posted this 20 June 2019

Hi All

I'm very new to using the FineReader SDK with .NET/C#. I have to extract data from scanned PDF files (image-only, not native text). The problem I'm facing is that some of the PDF files are of low quality and some of the text is not very clear, so when I run the OCR I get some garbage and some words come out changed. The configuration I'm currently using is:

    

    // Object extraction
    dpp.PageProcessingParams.ObjectsExtractionParams.EnableAggressiveTextExtraction = true;
    dpp.PageProcessingParams.ObjectsExtractionParams.DetectTextOnPictures = true;
    dpp.PageProcessingParams.ObjectsExtractionParams.RemoveGarbage = true;

    // Page analysis
    dpp.PageProcessingParams.PageAnalysisParams.AggressiveTableDetection = true;
    dpp.PageProcessingParams.PageAnalysisParams.DetectPictures = false;
    dpp.PageProcessingParams.PageAnalysisParams.DetectText = true;
    dpp.PageProcessingParams.PageAnalysisParams.DetectTables = true;
    dpp.PageProcessingParams.PageAnalysisParams.EnableExhaustiveAnalysisMode = true;

    // Recognition language
    dpp.PageProcessingParams.RecognizerParams.SetPredefinedTextLanguage("English");

 

Are there better parameters I could use to extract the data more accurately? I'm also wondering whether the retrieved text can keep the formatting it has in the PDF.

Thanks!

Koen de Leijer posted this 21 June 2019

Hi Javier

I think you have a proper setup.
Keep in mind that each extra parameter decreases performance, which can become an issue.
 
The documentation describes all of them, including some that you are not using.
See this post for where to find the documentation:
https://forum.ocrsdk.com/thread/5178-abby-fine-reader-engine-11-sdk-c-documentation/

We use the following, which covers pretty much everything I think is needed (it's Java, but you should be able to read it):

/*
  If orientation detection is performed during document processing
  (IPagePreprocessingParams::CorrectOrientation property is TRUE), you can select fast
  orientation detection mode: set the OrientationDetectionMode property of the
  OrientationDetectionParams object to ODM_Fast.
 */
IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();   
dpp.getPageProcessingParams().getPagePreprocessingParams().setCorrectOrientation(true);

// Aggressive text extraction
dpp.getPageProcessingParams().getObjectsExtractionParams().setEnableAggressiveTextExtraction(true);
dpp.getPageProcessingParams().getObjectsExtractionParams().setDetectTextOnPictures(true);

// Set the recognition languages and turn on language detection
dpp.getPageProcessingParams().getRecognizerParams().SetPredefinedTextLanguage(languages);
dpp.getPageProcessingParams().getRecognizerParams().setLanguageDetectionMode(com.abbyy.FREngine.ThreeStatePropertyValueEnum.TSPV_Yes);

As you can see, we also use the CorrectOrientation parameter, which tries to detect and correct the orientation of scanned pages (as far as I know, deskewing is a separate preprocessing setting, CorrectSkew).
The other thing we do that you don't is test the recognition languages and/or pass the language in as a parameter before processing (see the last few lines of the code block), rather than hardcoding "English".
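
To illustrate passing the language in as a parameter, here is a minimal sketch of how the `languages` string used above could be built. It assumes that SetPredefinedTextLanguage accepts a comma-separated list of predefined language names (for example "English,German"); please check that against the documentation for your Engine version. The class name `LanguageParam` and the `build` helper are hypothetical, not part of the SDK, and the sketch runs without the SDK on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the FineReader Engine API.
public class LanguageParam {

    // Joins the requested language names with commas, skipping blanks and
    // duplicates. Falls back to a single default language when nothing usable
    // was requested.
    static String build(List<String> requested, String fallback) {
        List<String> names = new ArrayList<>();
        for (String name : requested) {
            if (name == null) {
                continue;
            }
            String trimmed = name.trim();
            if (!trimmed.isEmpty() && !names.contains(trimmed)) {
                names.add(trimmed);
            }
        }
        if (names.isEmpty()) {
            names.add(fallback);
        }
        return String.join(",", names);
    }

    public static void main(String[] args) {
        // Language names come from the command line instead of being hardcoded.
        String languages = build(List.of(args), "English");
        System.out.println(languages);
        // With the SDK available, the value would then be passed on as in the
        // code block above (assumption: comma-separated names are accepted):
        // dpp.getPageProcessingParams().getRecognizerParams().SetPredefinedTextLanguage(languages);
    }
}
```

Running it with the arguments `English German` would give you a string to hand to the recognizer parameters; with no arguments it falls back to "English", which matches the hardcoded setup in the question.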

Best regards
Koen de Leijer

Javier Salas posted this 21 June 2019

Hi Koen!

Thanks for your quick reply. I think the problem, then, is the PDF itself: it is hard to read even for the human eye, so I can imagine how hard it is for the computer to recognize something that a person can barely read.

I'll try all the tips you gave me, thank you very much Koen!
