PDF recognition, extraction of text and individual images

  • Last Post 25 December 2018
  • Topic Is Solved
Olivier von Dach posted this 23 December 2018

Hi,

I am evaluating ABBYY FineReader Engine 12 for Linux, and I would like to know how to extract the text and the individual image areas from a given PDF document, and then how to export the text to an HTML file and each image area to a JPG or PNG file. This operation is available in the desktop application for macOS. Is it also possible with this SDK?

Also, is there any sample source code available?

Many thanks for an answer.

Kind regards

Nadezhda A. Solovyeva posted this 25 December 2018

Hi Olivier,

In FineReader Engine, you can use the following source code (based on the "Hello" sample):

void processImage()
{
    // Create document from image file
    displayMessage( L"Loading image..." );
    CBstr imagePath = Concatenate( GetSamplesFolder(), L"/SampleImages/Demo.tif" );
    CSafePtr<IFRDocument> frDocument = 0;
    CheckResult( FREngine->CreateFRDocumentFromImage( imagePath, 0, frDocument.GetBuffer() ) );

    // Recognize document
    displayMessage( L"Recognizing..." );
    CheckResult( frDocument->Process() );

    // Save results
    displayMessage( L"Saving results..." );
    CBstr exportPath = Concatenate( GetSamplesFolder(), L"/SampleImages/Demo.html" );
    CheckResult( frDocument->Export( exportPath, FEF_HTMLUnicodeDefaults, 0 ) );
}

 

 

Olivier von Dach posted this 25 December 2018

Hi Nadezhda,

Thanks for your answer.

I suppose I should continue my investigation using your sample source code.

I am still wondering whether the image areas detected during recognition are also exported as individual files and saved in the samples folder beside Demo.html, with Demo.html then referencing these image files. I cannot find any specific instructions for this, nor any way to specify the image file format.

Kind regards.

 

Nadezhda A. Solovyeva posted this 25 December 2018

Hi Olivier,

Please use the HTMLExportParams.PictureExportParams object to adjust the format of exported pictures. The HTMLExportParams object is passed as the third parameter of the frDocument.Export method.
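
To check which picture files an exported HTML file actually references, one option is to scan it for img tags after the export. The following is a minimal sketch using only the C++ standard library. It does not use the FineReader Engine API, the helper name extractImageSources is hypothetical, and the regular expression is a simple approximation rather than a full HTML parser.

```cpp
#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of FineReader Engine): collects the value of
// the src attribute of every <img> tag found in an HTML string.
// Assumption: src values are quoted and contain no quote characters.
std::vector<std::string> extractImageSources( const std::string& html )
{
    static const std::regex imgSrc(
        R"rx(<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["'])rx",
        std::regex::icase );

    std::vector<std::string> sources;
    for( std::sregex_iterator it( html.begin(), html.end(), imgSrc ), end; it != end; ++it ) {
        sources.push_back( ( *it )[1].str() );
    }
    return sources;
}

// Prints each referenced picture path on its own line.
void printImageSources( const std::string& html )
{
    for( const std::string& src : extractImageSources( html ) ) {
        std::cout << src << '\n';
    }
}
```

Reading the contents of Demo.html into a std::string and passing it to printImageSources lists the picture paths the HTML refers to, which can then be compared with the files present in the export folder.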
