[FREngine] How to limit the number of pages for processing

  • Last Post 23 November 2016
  • Topic Is Solved
Koen de Leijer posted this 17 November 2016

Hi,

I want to process only a few pages of a large PDF. I can't get IFRDocument.ProcessPages to work because I'm not sure how to create and fill the IIntsCollection it expects.

For now I have the following snippet to OCR only the first and last page:

// Create document
IFRDocument document = engine.CreateFRDocument();

// Add image file to document
document.AddImageFile( imagePath, null, null );

// Get page-count
int pagesCount = document.getPages().getCount();
if (pagesCount > 2) {
    //only first and last page
    IIntsCollection indices=engine.CreateIntsCollection();
    indices.Add(0);
    indices.Add(pagesCount-1);
    document.ProcessPages(indices, null);
} else {
    //process full document
    document.Process( null );
}

But that gives me an error: Document synthesis has not been performed for the page with index 1

Regards

Oksana Serdyuk posted this 23 November 2016

Hi Koen,

Sorry for the long silence. I've moved your question to a separate post, since it concerns our offline FREngine product, not Cloud OCR SDK.

The error occurs because a multipage export requires document synthesis to have been performed on every page before export (the Process… methods include document synthesis). So every page has to be processed in some way. You can still OCR only the first and last pages: process the remaining pages using the visible text layer of the source PDF file instead of recognizing them. To do this, set the SourceContentReuseMode property of the ObjectsExtractionParams object. Below is a code snippet in C# (sorry that it is not in Java, but the idea should be clear) that implements this scenario:

// Get page-count
int pagesCount = document.Pages.Count;

if (pagesCount > 2)
{
        //only first and last page
        FREngine.IntsCollection indicesToOCR= engineLoader.Engine.CreateIntsCollection();
        indicesToOCR.Add(0);
        indicesToOCR.Add(pagesCount - 1);
        document.ProcessPages(indicesToOCR, null);

        FREngine.DocumentProcessingParams docProcessingParams = engineLoader.Engine.CreateDocumentProcessingParams();
        docProcessingParams.PageProcessingParams.ObjectsExtractionParams.SourceContentReuseMode = FREngine.SourceContentReuseModeEnum.CRM_ContentOnly;

        for (int i = 1; i < pagesCount - 1; i++)
        {
                FREngine.IntsCollection indicesWithContent = engineLoader.Engine.CreateIntsCollection();
                indicesWithContent.Add(i);
                document.ProcessPages(indicesWithContent, docProcessingParams);
        }
}
else
{
        //process full document
        document.Process(null);
}
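
Since the question was in Java, here is a minimal Java sketch of just the page-splitting logic from the snippet above. The class PageSplit and its two methods are hypothetical helpers, not part of FREngine. The FREngine calls themselves are left out, because the exact Java names for the SourceContentReuseMode property depend on the Java wrapper and should be checked against your FREngine version.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of FREngine): decides which page indices
 * get full OCR and which ones reuse the text layer of the source PDF.
 */
final class PageSplit {
    private PageSplit() {}

    /** Pages to OCR: first and last if there are more than two pages, otherwise all pages. */
    static List<Integer> ocrIndices(int pagesCount) {
        List<Integer> indices = new ArrayList<>();
        if (pagesCount > 2) {
            indices.add(0);
            indices.add(pagesCount - 1);
        } else {
            for (int i = 0; i < pagesCount; i++) {
                indices.add(i);
            }
        }
        return indices;
    }

    /** Pages processed from the existing text layer: everything strictly between first and last. */
    static List<Integer> reuseIndices(int pagesCount) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 1; i < pagesCount - 1; i++) {
            indices.add(i);
        }
        return indices;
    }
}
```

Each index from ocrIndices would be added to an IIntsCollection and passed to ProcessPages with null parameters, as in the question. Each index from reuseIndices would be passed to ProcessPages with the content-reuse parameters. For a 5-page document the split is pages 0 and 4 for OCR, and pages 1, 2 and 3 for text-layer reuse.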

Hope this will be useful!
