Hi,

I want to process only a few pages of a large PDF. I can't get IFRDocument.ProcessPages to work because I'm not sure how to set up the IIntsCollection it takes.

For now I have the following snippet to OCR only the first and last page:

// Create document
IFRDocument document = engine.CreateFRDocument();

// Add image file to document
document.AddImageFile( imagePath, null, null );

// Get page-count
int pagesCount = document.getPages().getCount();
if (pagesCount > 2) {
    //only first and last page
    IIntsCollection indices=engine.CreateIntsCollection();
    indices.Add(0);
    indices.Add(pagesCount-1);
    document.ProcessPages(indices, null);
} else {
    //process full document
    document.Process( null );
}

But that gives me an error: Document synthesis has not been performed for the page with index 1

Regards

asked 17 Nov '16, 19:25

Koen de Leijer

converted to question 22 Nov '16, 13:25

Oksana Serdyuk ♦♦


Hi Koen,

Sorry for the long silence. I've converted your question into a separate post, since it concerns our offline FREngine product rather than Cloud OCR SDK.

The error occurs because multipage output requires document synthesis for every page before export, and it is the Process… methods that perform this synthesis. So every page has to be processed, not just the two you want recognized. You can still limit OCR to the first and last pages: process the other pages using the visible text layer of the source PDF file instead. Use the SourceContentReuseMode property of the ObjectsExtractionParams object for this. The C# snippet below shows how to implement this scenario (sorry that it is not in Java, but the idea carries over):

// Get page count
int pagesCount = document.Pages.Count;

if (pagesCount > 2)
{
    // Full OCR for the first and last pages only
    FREngine.IntsCollection indicesToOCR = engineLoader.Engine.CreateIntsCollection();
    indicesToOCR.Add(0);
    indicesToOCR.Add(pagesCount - 1);
    document.ProcessPages(indicesToOCR, null);

    // For the remaining pages, reuse the text layer of the source PDF instead of running OCR
    FREngine.DocumentProcessingParams docProcessingParams = engineLoader.Engine.CreateDocumentProcessingParams();
    docProcessingParams.PageProcessingParams.ObjectsExtractionParams.SourceContentReuseMode = FREngine.SourceContentReuseModeEnum.CRM_ContentOnly;

    // Process each middle page with these parameters so that every page gets synthesized
    for (int i = 1; i < pagesCount - 1; i++)
    {
        FREngine.IntsCollection indicesWithContent = engineLoader.Engine.CreateIntsCollection();
        indicesWithContent.Add(i);
        document.ProcessPages(indicesWithContent, docProcessingParams);
    }
}
else
{
    // Two pages or fewer: process the full document
    document.Process(null);
}
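
Since the question was asked in Java, here is a rough Java sketch of just the page split used above. It does not call FREngine at all: the class PagePlan and its two methods are hypothetical names made up for illustration, and they only compute which 0-based indices would go into the collection for full OCR and which would be processed with the content-reuse parameters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FREngine: it only plans the page split.
final class PagePlan {

    // Indices to recognize in full: the first and the last page (0-based).
    public static List<Integer> ocrIndices(int pagesCount) {
        List<Integer> result = new ArrayList<>();
        if (pagesCount <= 0) {
            return result;
        }
        result.add(0);
        if (pagesCount > 1) {
            result.add(pagesCount - 1);
        }
        return result;
    }

    // Indices of the pages in between, which would reuse the PDF text layer.
    public static List<Integer> reuseIndices(int pagesCount) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < pagesCount - 1; i++) {
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        int pagesCount = 5;
        System.out.println(ocrIndices(pagesCount));   // prints [0, 4]
        System.out.println(reuseIndices(pagesCount)); // prints [1, 2, 3]
    }
}
```

Each value from ocrIndices would be added to the collection passed to ProcessPages with null parameters, and each value from reuseIndices to a collection passed together with the content-reuse parameters, mirroring the C# snippet.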

Hope this will be useful!

answered 23 Nov '16, 13:55

Oksana Serdyuk ♦♦
