Some PDFs become unreadable after OCR with ABBYY FREngine 11

Koen de Leijer posted this 3 weeks ago

Hi

We are currently using ABBYY FineReader Engine 11.1.14.707470 on Linux with the Java API (com.abbyy.FREngine.jar).
Almost all PDFs are processed correctly, but the attached PDF becomes unreadable after OCR.

We use the following code to perform the OCR:


import com.abbyy.FREngine.Engine;
import com.abbyy.FREngine.FileExportFormatEnum;
import com.abbyy.FREngine.IDocumentProcessingParams;
import com.abbyy.FREngine.IEngine;
import com.abbyy.FREngine.IFRDocument;
import com.abbyy.FREngine.IFRPage;
import com.abbyy.FREngine.IFRPages;
import com.abbyy.FREngine.IPDFExportParams;
import com.abbyy.FREngine.PDFExportScenarioEnum;

public class ABBYY {

    public ABBYY() {}
    private IEngine engine = null;

    public void Run(String inputfilename, String dllFolder, String developerSn, String languages) throws Exception {
       
        // Load ABBYY FineReader Engine
        engine = Engine.GetEngineObject(dllFolder, developerSn);

        try {
            // Setup ABBYY FineReader Engine
            String profile = "DocumentConversion_Accuracy";
            engine.LoadPredefinedProfile(profile);

            // Process PDF
            processPDF(inputfilename, languages);
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            // Unload ABBYY FineReader Engine
            engine = null;
            Engine.DeinitializeEngine();
        }
    }

    private void processPDF(String inputfilename, String languages) {
        String imagePath = inputfilename;

        try {
            // Create document
            IFRDocument document = engine.CreateFRDocument();

            /*
                If orientation detection is performed during document processing
                (IPagePreprocessingParams::CorrectOrientation property is TRUE), you can select fast
                orientation detection mode: set the OrientationDetectionMode property of the
                OrientationDetectionParams object to ODM_Fast.
             */
            IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();   
            dpp.getPageProcessingParams().getPagePreprocessingParams().setCorrectOrientation(true);
            // Aggressive text extraction
            dpp.getPageProcessingParams().getObjectsExtractionParams().setEnableAggressiveTextExtraction(true);
            dpp.getPageProcessingParams().getObjectsExtractionParams().setDetectTextOnPictures(true);
            // Set language
            dpp.getPageProcessingParams().getRecognizerParams().SetPredefinedTextLanguage(languages);
            dpp.getPageProcessingParams().getRecognizerParams().setLanguageDetectionMode(com.abbyy.FREngine.ThreeStatePropertyValueEnum.TSPV_Yes);

            try {
                // Add image file to document
                document.AddImageFile( imagePath, null, null );

                // Remove empty pages from inputfile
                boolean hasEmptyPages = false;
                IFRPages pages = document.getPages();
                for (int p = (pages.getCount() - 1); p >= 0; p--) {
                    IFRPage page = pages.getElement(p);
                    if (page.IsEmptyEx(null, null, null)) {
                        pages.DeleteAt(p);
                        hasEmptyPages = true;
                    }
                }
                if (hasEmptyPages) document.Synthesize(null);

                // Process document
                document.Process(dpp);

                // Save results to pdf using 'balanced' scenario
                IPDFExportParams pdfParams = engine.CreatePDFExportParams();
                pdfParams.setScenario( PDFExportScenarioEnum.PES_Balanced );

                /*
                    Specifies whether a linearized PDF file should be created. Linearized PDF files have internal data
                    arranged in a page order. A page of a linearized PDF file can be read in a web browser plug-in
                    without waiting for the whole file to be downloaded. Non-linearized PDFs have the data
                    necessary to assemble a document page scattered through the whole file. Non-linearized
                    PDF files are smaller, but they are slower to access.
                    Note: This property makes sense only for multipage PDF files. If the property is set to TRUE and
                    a one-page document is exported, a nonlinearized file is created.
                    This property is FALSE by default.
                 */
                pdfParams.getPDFFeatures().setEnableLinearization(true);

                String pdfExportPath = inputfilename + "_ocrred.pdf";
                document.Export( pdfExportPath, FileExportFormatEnum.FEF_PDF, pdfParams );

            } finally {
                // Close document
                document.Close();
            }
        } catch( Exception ex ) {
            ex.printStackTrace();
        }
    }
}

Which parameters do we need to set in our Java code to prevent this issue?
Are there any suggestions for the settings of FREngine itself?
Or is this a known issue in FREngine 11 that has been, or will be, fixed in a more recent version?

Many thanks in advance

Koen de Leijer

Nadezhda A. Solovyeva posted this 3 weeks ago

Hi Koen,

This is a known issue when FineReader Engine is used on Linux. Please try opening the input PDF in your system's default PDF viewer on the same computer that runs the OCR. It will appear broken there as well.

A PDF without embedded fonts can be opened and read correctly only on systems that have the referenced fonts installed. On Windows machines, we can be sure that fonts such as "Arial" will be found and the files will be processed successfully, because those fonts ship with Windows. On Linux machines, reading such a PDF requires an additional font pack.
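
The font dependency described above can be checked from Java before running OCR. The following is only a minimal sketch: the class name FontCheck and the helper missingFonts are hypothetical, the list of required families is an illustrative assumption ("Arial" is the only font named in this thread), and the font list reported by the JVM is merely an approximation of what FineReader Engine itself will find on the system.

```java
import java.awt.GraphicsEnvironment;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the FREngine API.
public class FontCheck {

    // Returns the required font families that are absent from the
    // available set. The comparison ignores case.
    static List<String> missingFonts(List<String> required, Set<String> available) {
        Set<String> lower = new HashSet<>();
        for (String name : available) {
            lower.add(name.toLowerCase(Locale.ROOT));
        }
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            if (!lower.contains(name.toLowerCase(Locale.ROOT))) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Font families visible to the JVM on this machine.
    static Set<String> installedFontFamilies() {
        String[] names = GraphicsEnvironment
                .getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames();
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // Assumption: these are the families the input PDFs reference.
        List<String> required = Arrays.asList("Arial", "Times New Roman", "Courier New");
        List<String> missing = missingFonts(required, installedFontFamilies());
        if (missing.isEmpty()) {
            System.out.println("All required font families were found.");
        } else {
            System.out.println("Missing font families: " + missing);
        }
    }
}
```

If the check reports missing families on the OCR machine, PDFs that reference those fonts without embedding them are likely candidates for the broken output described in this thread.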


For compatibility, we strongly advise against creating PDFs without embedded fonts when they are intended to be distributed across different environments. To continue working with existing files, please choose one of the following options:

· Install the fonts and set up the compatibility options on your Linux system. You can read more about this on the Ask Ubuntu page: https://askubuntu.com/questions/651441/how-to-install-arial-font-in-ubuntu

· Alternatively, you may repair the PDF file and embed the missing fonts, as described on this Stack Overflow page: https://stackoverflow.com/questions/12857849/how-to-repair-a-pdf-file-and-embed-missing-fonts/13131101#13131101

Koen de Leijer posted this 3 weeks ago

Hi Nadezhda

Many thanks for your response.
Installing the MS Core Fonts ("apt-get install ttf-mscorefonts-installer") solved the problem.

Best regards
Koen de Leijer
