Temporary files when OCRing large PDFs

  • Last Post 22 March 2017
Itze posted this 21 February 2017

We use ABBYY SDK 11 for Linux in our project with a C++ command-line tool. Sometimes we have to OCR large scanned PDF documents of more than 500 pages to PDF. The problem is that ABBYY stores a lot of data in the given temp directory for each page and only deletes this data when DeinitializeEngine() is called. I have tried to separate the OCR process for each page by calling frDocument->PreprocessPages, frDocument->AnalyzePages, and frDocument->RecognizePages, but the temporary files for each page remain in the temp folder until DeinitializeEngine() is called. Is it possible to delete the temporary files for a page after the OCR process for that page has run?

Thanks Jan

IvanPopov posted this 20 March 2017

Could you please clarify whether that 500-page PDF is actually a single 500-page document, or rather multiple smaller documents stored in one file?

Information about pages, including the information stored in temporary files, is used on each processing step. Having information about all pages is especially important during the document synthesis stage when separate pages are combined into a single document. Therefore, deleting temporary files would cause issues.

Now, if you are dealing with 500-page documents, you could try to set the value of the IFRDocument::PageFlushingPolicy property to PFP_KeepInMemory. This should reduce the amount of data stored in the temporary folder.

If, on the other hand, you are dealing with multiple documents stored in a single file, you could process them one by one. First, use the IFRDocument::AddImageFile() method to add the pages corresponding to a single document (the PageIndices parameter of the method can be used to add only the specified pages of the source file to the FRDocument object). Then, process and export the document, e.g. by calling IFRDocument::Process() and IFRDocument::Export(). Finally, call the IFRDocument::Close() method to release all resources used by the FRDocument object, including the temporary files. Repeat these steps for all documents contained in the source PDF file.
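
The bookkeeping for that loop could be sketched roughly as follows. This is only an illustration under stated assumptions, not SDK code: pageIndicesPerDocument and forEachDocument are hypothetical helper names, page indices are assumed to be zero-based, and the number of pages in each sub-document is assumed to be known in advance. The callback is a stand-in for the place where the AddImageFile() / Process() / Export() / Close() sequence described above would go.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical helper (not part of the ABBYY SDK): given the page count of
// each sub-document stored in one source file, return the list of source
// page indices belonging to each sub-document. Indices are assumed zero-based.
std::vector<std::vector<int>> pageIndicesPerDocument(
    const std::vector<int>& pagesPerDocument) {
    std::vector<std::vector<int>> result;
    int nextPage = 0;  // index of the first page not yet assigned
    for (int count : pagesPerDocument) {
        assert(count > 0);  // every sub-document needs at least one page
        std::vector<int> indices;
        for (int i = 0; i < count; ++i) {
            indices.push_back(nextPage++);
        }
        result.push_back(indices);
    }
    return result;
}

// Hypothetical driver: calls processOne once per sub-document and returns
// how many sub-documents were handled. In a real tool the callback would
// add the listed pages, process, export, and close the document so that
// its temporary files are released before the next one starts.
int forEachDocument(
    const std::vector<int>& pagesPerDocument,
    const std::function<void(const std::vector<int>&)>& processOne) {
    int handled = 0;
    for (const auto& indices : pageIndicesPerDocument(pagesPerDocument)) {
        processOne(indices);
        ++handled;
    }
    return handled;
}
```

With this layout only one sub-document is open at a time, so the temporary data on disk should be bounded by the largest sub-document rather than by the whole source file.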

Itze posted this 22 March 2017

Thanks for your answer.

I mean a single 500-page document. For export to PDF I have found a solution: I use the new ExportFileWriter, so I can loop over the pages and export page by page, and the temporary directory does not grow. So far so good. Unfortunately, the ExportFileWriter only works for PDF export.