I have PDF files, some of which have already been OCR'ed. I want to replace the existing text with the ABBYY OCR result. The code is something like this:
FREngine.FRDocument document = Engine.ProcessStream(fs);
document.Export(newFilePath, FREngine.FileExportFormatEnum.FEF_PDF, null);
The resulting PDF has the new text layer (I don't know whether it also includes the old one).
The issue is the size of the new PDF file. The output file is 10-15 times bigger than the input, while its image quality is worse.
What am I missing? How can I reduce the output file size while improving the quality?
Thanks & regards,
asked 09 Jun '16, 19:18
When a recognized document is exported to PDF, the images used in the image layer of the exported document may differ from the original ones. Depending on the processing parameters, FineReader Engine might alter images, e.g. change contrast and brightness or even remove objects of a certain color. Furthermore, FRE does not support export of vector images and converts them to raster images, which might make the resulting image appear less crisp. However, depending on your exact scenario, there are a number of ways to deal with this.
First of all, instead of processing PDF files that already have a text layer, you could simply copy them. You can detect such files using the IEngine::IsPDFWithTextualContent() method: if it returns True for one of your PDF files, you could copy the whole file or just skip it. For details about this method, please refer to the ABBYY FineReader Engine 11 Developer’s Help article API Reference→Engine Object→Processing Methods, section Methods for working with images; the Hello code sample shows how it is used.
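The copy-or-process decision could be sketched roughly as follows. This is only an illustration: PdfRouter, CopyOrProcess, hasTextLayer and runOcr are hypothetical names, and the two delegates stand in for the engine's IsPDFWithTextualContent() check and your existing OCR/export code, so no particular FineReader Engine signature is assumed here.

```csharp
using System;
using System.IO;

static class PdfRouter
{
    // hasTextLayer: hypothetical hook, to be wired to the engine's
    //               IsPDFWithTextualContent() check (see the Developer's Help
    //               for the exact signature).
    // runOcr:       hypothetical hook for the existing recognition + export code.
    // Returns true if the file went through OCR, false if it was copied as-is.
    public static bool CopyOrProcess(string inputPath, string outputPath,
                                     Func<string, bool> hasTextLayer,
                                     Action<string, string> runOcr)
    {
        if (hasTextLayer(inputPath))
        {
            // The file already has a text layer: keep it byte-for-byte,
            // so neither its size nor its image quality changes.
            File.Copy(inputPath, outputPath, overwrite: true);
            return false;
        }

        // No text layer yet: recognize and export as usual.
        runOcr(inputPath, outputPath);
        return true;
    }
}
```

Keeping the engine calls behind delegates means the routing logic can be tried out without the engine loaded.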
Secondly, if you would actually like to replace the text layer in PDF files that have already been OCR’ed, you could use the IEngine::InjectTextLayer() (FineReader Engine 11 R6) or IEngine::InjectTextLayerEx() (FineReader Engine 11 R7) methods. These methods allow you to create PDF files that use the same images as the original document, but also contain a text layer with the recognized text. This should solve the issue with lower image quality in exported documents. Please note that this approach does not utilize the FRDocument object or its methods, e.g. IFRDocument::Process(), so you will be unable to work with streams. Please refer to the ABBYY FineReader Engine 11 Developer’s Help article API Reference→Engine Object→Processing Methods, section Analysis, recognition and synthesis methods for more information about these methods.
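If your input arrives as a stream, one possible workaround is to spill it to a temporary file first and then run the file-based call on that. The sketch below assumes exactly that: TextLayerHelper and ReplaceTextLayer are hypothetical names, and the injectTextLayer delegate is a placeholder for your own InjectTextLayer()/InjectTextLayerEx() call, whose real parameter list should be taken from the Developer's Help.

```csharp
using System;
using System.IO;

static class TextLayerHelper
{
    // injectTextLayer: hypothetical hook (sourcePath, resultPath) wrapping the
    //                  engine's InjectTextLayer()/InjectTextLayerEx() call.
    public static void ReplaceTextLayer(Stream input, string resultPath,
                                        Action<string, string> injectTextLayer)
    {
        // Unique temporary file name so parallel runs do not collide.
        string tempPath = Path.Combine(Path.GetTempPath(),
                                       Guid.NewGuid().ToString("N") + ".pdf");
        try
        {
            // Write the incoming stream to disk, since the injection
            // methods operate on files rather than streams.
            using (FileStream temp = File.Create(tempPath))
            {
                input.CopyTo(temp);
            }

            injectTextLayer(tempPath, resultPath);
        }
        finally
        {
            // Clean up the temporary copy even if the injection fails.
            if (File.Exists(tempPath))
            {
                File.Delete(tempPath);
            }
        }
    }
}
```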
Finally, you could try to improve the image quality of exported documents by changing the value of the Scenario property of the PDFExportParams object passed to the IFRDocument::Export() method:
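A minimal sketch of that change, assuming the Engine object's CreatePDFExportParams() factory method and the PES_MaxQuality value of PDFExportScenarioEnum (both names are given as recalled from the FineReader Engine 11 API and should be checked against the Developer's Help); the wrapper class and method are only for illustration:

```csharp
// Requires a reference to the FineReader Engine interop assembly (FREngine).
static class PdfExportSketch
{
    public static void ExportWithMaxQuality(FREngine.IEngine engine,
                                            FREngine.FRDocument document,
                                            string newFilePath)
    {
        // Assumption: export parameter objects are created through the engine.
        FREngine.PDFExportParams pdfParams = engine.CreatePDFExportParams();

        // Assumption: PES_MaxQuality is the scenario that favors image quality.
        pdfParams.Scenario = FREngine.PDFExportScenarioEnum.PES_MaxQuality;

        // Same Export call as in the question, with the parameters object
        // passed instead of null.
        document.Export(newFilePath, FREngine.FileExportFormatEnum.FEF_PDF, pdfParams);
    }
}
```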
Although this might improve the quality of the images, the size of the document would most likely increase. You could also try changing the values of other properties of the PDFExportParams object and pick the ones that work better for you. Please refer to the ABBYY FineReader Engine 11 Developer’s Help article API Reference→Parameter Objects→Export Parameters→PDFExportParams for detailed information.
answered 14 Jun '16, 13:56