[FREngine] How to limit the number of pages for processing

  • Last Post 05 September 2018
  • Topic Is Solved
Koen de Leijer posted this 17 November 2016

Hi,

I want to process only a few pages of a large PDF. I can't get IFRDocument.ProcessPages to work because I'm not sure how to set up the IIntsCollection it expects.

For now I have the following snippet to OCR only the first and last page:

// Create document
IFRDocument document = engine.CreateFRDocument();

// Add image file to document
document.AddImageFile( imagePath, null, null );

// Get page-count
int pagesCount = document.getPages().getCount();
if (pagesCount > 2) {
    //only first and last page
    IIntsCollection indices=engine.CreateIntsCollection();
    indices.Add(0);
    indices.Add(pagesCount-1);
    document.ProcessPages(indices, null);
} else {
    //process full document
    document.Process( null );
}

But that gives me an error: Document synthesis has not been performed for the page with index 1

Regards

Oksana S. posted this 23 November 2016

Hi Koen,

Sorry for the long silence. I've moved your question to a separate post, since you are asking about our offline FREngine product, not Cloud OCR SDK.

The issue occurs because multipage output requires document synthesis for all pages before export (the Process… methods include document synthesis). So every page has to be processed. You can OCR only the first and last pages of your document and process the other pages using the visible text layer of the source PDF file. To do this, use the SourceContentReuseMode property of the ObjectsExtractionParams object. Below is a C# code snippet (sorry that it is not in Java, but the idea should be clear) that implements this scenario:

// Get page-count
int pagesCount = document.Pages.Count;

if (pagesCount > 2)
{
        //only first and last page
        FREngine.IntsCollection indicesToOCR= engineLoader.Engine.CreateIntsCollection();
        indicesToOCR.Add(0);
        indicesToOCR.Add(pagesCount - 1);
        document.ProcessPages(indicesToOCR, null);

        FREngine.DocumentProcessingParams docProcessingParams = engineLoader.Engine.CreateDocumentProcessingParams();
        docProcessingParams.PageProcessingParams.ObjectsExtractionParams.SourceContentReuseMode = FREngine.SourceContentReuseModeEnum.CRM_ContentOnly;

        for (int i = 1; i < pagesCount - 1; i++)
        {
                FREngine.IntsCollection indicesWithContent = engineLoader.Engine.CreateIntsCollection();
                indicesWithContent.Add(i);
                document.ProcessPages(indicesWithContent, docProcessingParams);
        }
}
else
{
        //process full document
        document.Process(null);
}
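
Since the original question was in Java, the index bookkeeping from the C# snippet above can be sketched in plain Java. This is only a sketch: the class and method names (FirstLastSplit, ocrIndices, reuseIndices) are made up for illustration, and it uses ordinary Java lists rather than any engine type. Each resulting index would presumably be copied into an IIntsCollection with Add, as in the first post. The Java call for setting SourceContentReuseMode is not shown anywhere in this thread, so it is left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FineReader Engine: it only computes
// which zero-based page indices go to OCR and which reuse the text layer.
public class FirstLastSplit {

    // Indices to OCR: the first page and, if it is a different page, the last one.
    public static List<Integer> ocrIndices(int pagesCount) {
        List<Integer> result = new ArrayList<>();
        if (pagesCount <= 0) {
            return result;
        }
        result.add(0);
        if (pagesCount > 1) {
            result.add(pagesCount - 1);
        }
        return result;
    }

    // Indices of the pages in between, to be processed from the
    // existing visible text layer instead of being recognized.
    public static List<Integer> reuseIndices(int pagesCount) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < pagesCount - 1; i++) {
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ocrIndices(5));   // prints [0, 4]
        System.out.println(reuseIndices(5)); // prints [1, 2, 3]
    }
}
```

Together the two lists cover every page exactly once, which matches the requirement that all pages are processed before export.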

Hope this will be useful!

Aman Gupta posted this 08 June 2018

How do I use the GetPagesToProcess function of IFileAdapter in the Hello C# code sample of FineReader Engine 12? And can you explain why I have to do document synthesis in the above code?

Daria Zvereva posted this 08 June 2018

Hi! 

As we already answered in your post, please see our standard BatchProcessing code sample in C#.

Document processing in ABBYY FineReader Engine consists of several steps: page preprocessing, analysis, recognition, page synthesis, document synthesis, and export. At the document synthesis stage, the font styles and the logical structure of the document are recreated. This stage is required before the export stage. During export, recognized documents are saved to files in suitable formats.

Hope this information will be useful.

Rama Reddy posted this 30 July 2018

How can we do the same in Java? I have a multi-page PDF document that I need to convert into an editable format with FineReader.

 

Aman Gupta posted this 31 July 2018

Hi, we are trying to limit processing to a page range with the code below. We pass the start and end of the page range to digitize, but for some PDFs a range of 1-2 digitizes the whole document, and for others a range of 2-3 digitizes pages 1 to 3 when it should only digitize pages 2 to 3. I don't know what is going wrong; please review the code below.

document.AddImageFile(inPutFilePath, null, null);

// Get page-count
int pagesCount = document.Pages.Count;
FREngine.DocumentProcessingParams docProcessingParams = engine.CreateDocumentProcessingParams();

// Configure/Setup processing parameters for accuracy 
docProcessingParams.PageProcessingParams.ObjectsExtractionParams.EnableAggressiveTextExtraction = true;
docProcessingParams.PageProcessingParams.ObjectsExtractionParams.DetectTextOnPictures = true;
docProcessingParams.PageProcessingParams.PagePreprocessingParams.CorrectOrientation = true;
docProcessingParams.PageProcessingParams.PageAnalysisParams.EnableExhaustiveAnalysisMode = true;
docProcessingParams.PageProcessingParams.RecognizerParams.TextTypes = (int)TextTypeEnum.TT_Normal | (int)TextTypeEnum.TT_Matrix | (int)TextTypeEnum.TT_Typewriter; 

// Process pages based on Page range.
if (request.page_range != null) {
    FREngine.IIntsCollection pageIndices = engine.CreateIntsCollection();
    string[] arrayPageRange = request.page_range.Split('-');
    int[] digpages = new int[] { };
    int startRange = Int32.Parse(arrayPageRange[0]);
    int endRange = Int32.Parse(arrayPageRange[1]);
    for (int i = startRange; i <= endRange; i++) {
        int rangeValue = i - 1;
        digpages = digpages.Concat(new int[] { rangeValue }).ToArray();
        pageIndices.Add(rangeValue);
        document.ProcessPages(pageIndices, docProcessingParams);
    }
    FREngine.IIntsCollection indicesWithContent = engine.CreateIntsCollection();
    FREngine.DocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();
    dpp.PageProcessingParams.ObjectsExtractionParams.SourceContentReuseMode = FREngine.SourceContentReuseModeEnum.CRM_ContentOnly;
    for (int i = 0; i < pagesCount; i++) {
        if (!digpages.Contains(i)) {
            indicesWithContent.Add(i);
            document.ProcessPages(indicesWithContent, dpp);
        }
    }
}
else {
    //process full document
    document.Process(docProcessingParams);
}
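
One thing that stands out in the code above is that ProcessPages is called inside each loop while the collection passed to it keeps growing, so pages added in earlier iterations are submitted again on every later iteration. The thread does not confirm whether this explains the reported behavior. As a sketch of an alternative, the plain Java below builds both index lists completely before anything is processed, so that each list could be handed to ProcessPages once. The names PageRange, toIndices, and complement are hypothetical, and no engine calls are made.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FineReader Engine.
public class PageRange {

    // Converts a 1-based inclusive range such as "2-3" into zero-based indices.
    public static List<Integer> toIndices(String range, int pagesCount) {
        String[] parts = range.trim().split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected start-end, got: " + range);
        }
        int start = Integer.parseInt(parts[0].trim());
        int end = Integer.parseInt(parts[1].trim());
        if (start < 1 || end < start || end > pagesCount) {
            throw new IllegalArgumentException("Range outside 1.." + pagesCount + ": " + range);
        }
        List<Integer> indices = new ArrayList<>();
        for (int page = start; page <= end; page++) {
            indices.add(page - 1);
        }
        return indices;
    }

    // Every zero-based index below pagesCount that is not in the selected list.
    public static List<Integer> complement(List<Integer> selected, int pagesCount) {
        List<Integer> rest = new ArrayList<>();
        for (int i = 0; i < pagesCount; i++) {
            if (!selected.contains(i)) {
                rest.add(i);
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        List<Integer> ocr = toIndices("2-3", 5);
        System.out.println(ocr);                // prints [1, 2]
        System.out.println(complement(ocr, 5)); // prints [0, 3, 4]
    }
}
```

The range check also rejects inputs such as "0-2" or a range that runs past the last page, which the original parsing would pass straight through as indices.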

 

 

Rama Reddy posted this 05 September 2018

Hi,

What is the use of SourceContentReuseMode? 

Koen de Leijer posted this 05 September 2018

Hi Rama

SourceContentReuseMode is described in the documentation of ABBYY FineReader:
https://knowledgebase.abbyy.com/article/1581

SourceContentReuseModeEnum

SourceContentReuseModeEnum enumeration constants describe available modes of source PDF file contents reusing.

typedef enum {
    CRM_Auto,
    CRM_DoNotReuse,
    CRM_ContentOnly
} SourceContentReuseModeEnum; 

Elements

Name: Description

CRM_Auto: ABBYY FineReader Engine uses both text and image layer of the source PDF file.

CRM_ContentOnly: Only visible text layer of the source PDF file is used, the image layer is not used. Do not use this setting if the source file contains only raster information: for example, for image-only PDFs. To find out if the file contains any text layer use the IsPdfWithTextualContent method. However, note that if the document contains only invisible text layer detected by the IsPdfWithTextualContent method, this text layer will not be used in this mode.

CRM_DoNotReuse: Text layer of the source PDF file is not used, the image layer is recognized by ABBYY FineReader Engine.
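
Read together with the earlier posts, these descriptions suggest a simple per-page decision, sketched below in plain Java. Everything here is an assumption made for illustration: the ReuseMode enum is a local mirror of the constant names quoted above and not the engine's own type, the flags mustOcr and hasVisibleTextLayer are hypothetical inputs the caller would have to determine, and falling back to CRM_Auto when no visible text layer exists is a guess, not something the documentation excerpt prescribes.

```java
// Hypothetical decision helper, not part of FineReader Engine.
public class ReuseModeChoice {

    // Local mirror of the constant names quoted above.
    public enum ReuseMode { CRM_Auto, CRM_DoNotReuse, CRM_ContentOnly }

    public static ReuseMode choose(boolean mustOcr, boolean hasVisibleTextLayer) {
        if (mustOcr) {
            // The page should be recognized from its image, ignoring any text layer.
            return ReuseMode.CRM_DoNotReuse;
        }
        if (hasVisibleTextLayer) {
            // The page can be taken from its visible text layer without recognition.
            return ReuseMode.CRM_ContentOnly;
        }
        // Assumed fallback: CRM_ContentOnly is not suitable for image-only pages.
        return ReuseMode.CRM_Auto;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true));   // prints CRM_DoNotReuse
        System.out.println(choose(false, true));  // prints CRM_ContentOnly
        System.out.println(choose(false, false)); // prints CRM_Auto
    }
}
```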


Best regards

Koen de Leijer

 
