Layout analysis of a PDF in Java

  • Last Post 13 June 2018
divyavijay posted this 07 June 2018

Hi,

I am new to the ABBYY SDK. I went through the documentation, but it does not provide much information about Java.

My task is to load a PDF with multiple pages. For each page, I want to run layout analysis and have the output state whether a particular block is text, a table, an image, a signature, a header, a footer, etc. I also want to OCR each block and then save each one separately as text.

I am able to load the PDF, extract text from each page, and save the complete file as XML.

I could not find:

- How can I process each page separately? Also, is multithreading possible for this?

- If pieces of text in the image are on the same line but far apart, for example:

"name: abc                                      date:1/1/1"

ABBYY gives: "name: abc date:1/1/1"

I want: "name: abc"; "date:1/1/1"

Is it possible to do so?

 

Thanks for any kind of help.

 

Daria Zvereva posted this 13 June 2018

Hello!

Sorry for the delay in response.

When processing multipage documents with a large number of pages, you can recognize the pages in parallel using the FRDocument object (see Guided Tour → Advanced Techniques → Parallel Processing). You can then access the layout of the recognized document and post-process each page separately as needed. For details, see Guided Tour → Advanced Techniques → Working with Layout and Blocks in the Developer’s Help.
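To illustrate the "post-process each page separately" part, here is a minimal sketch in plain Java. It is not the engine's built-in parallel processing described in the help article, and it makes no SDK calls: it assumes you have already copied each page's recognized text into an ordinary String. The class name PagePostProcessor and the placeholder step (counting non-blank lines) are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: post-processes already-extracted page texts on a thread pool. */
final class PagePostProcessor {

    /** Placeholder per-page step: counts the non-blank lines in one page's text. */
    static int countNonBlankLines(String pageText) {
        int count = 0;
        for (String line : pageText.split("\\R")) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** Runs the per-page step for every page in parallel; results keep the page order. */
    static List<Integer> countAll(List<String> pageTexts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String text : pageTexts) {
                Callable<Integer> task = () -> countNonBlankLines(text);
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that page is done
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Page post-processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countAll(List.of("a\n\nb\n", "", "x"), 2));
    }
}
```

Working on plain strings keeps the worker threads away from engine objects, whose threading rules you would need to check in the Parallel Processing article.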

As for your second question, you can split particular blocks manually using the ILayoutBlocks interface. You can delete the old block (use the DeleteAt method of the LayoutBlocks object) and create two new blocks in its place (use the AddNew method of the LayoutBlocks object). After processing, you will then get the separate values you want ("name: abc"; "date:1/1/1").
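To decide where one block should be cut in two, you first need the horizontal spans of the pieces. The sketch below shows one possible way to find them: walk the words of a line from left to right and start a new segment whenever the gap to the previous word is wider than a threshold. It contains no SDK calls; the types GapSplitter, Word and Segment, the pixel values and the threshold of 50 are all assumptions for illustration. The resulting left/right spans could then serve as the regions for the two new blocks.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits one recognized line into segments at wide horizontal gaps. */
final class GapSplitter {

    /** A word with its left and right coordinates (illustrative type, not an SDK class). */
    static final class Word {
        final String text;
        final int left;
        final int right;

        Word(String text, int left, int right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /** A run of neighbouring words and the horizontal span they cover. */
    static final class Segment {
        final String text;
        final int left;
        final int right;

        Segment(String text, int left, int right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Starts a new segment whenever the gap to the previous word exceeds maxGap.
     * The words must already be sorted by their left edge.
     */
    static List<Segment> split(List<Word> words, int maxGap) {
        List<Segment> segments = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int segmentLeft = 0;
        int previousRight = 0;
        for (Word word : words) {
            if (text.length() > 0 && word.left - previousRight > maxGap) {
                segments.add(new Segment(text.toString(), segmentLeft, previousRight));
                text.setLength(0);
            }
            if (text.length() == 0) {
                segmentLeft = word.left;
            } else {
                text.append(' ');
            }
            text.append(word.text);
            previousRight = word.right;
        }
        if (text.length() > 0) {
            segments.add(new Segment(text.toString(), segmentLeft, previousRight));
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Word> line = List.of(
            new Word("name:", 10, 60),
            new Word("abc", 70, 100),
            new Word("date:1/1/1", 400, 500));
        for (Segment s : split(line, 50)) {
            System.out.println(s.text + " [" + s.left + ".." + s.right + "]");
        }
    }
}
```

A fixed threshold is the simplest choice; in practice it probably needs to scale with the font size or the average character width of the line.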

Another way to get separate values after recognition is to post-process the XML output and extract the data you need. See the Specifications → Export Formats → XMLSchema Description article in the Developer’s Help for a detailed description of the XML format. Because the XML output includes the coordinates of each element, you can group the words by their coordinates on your side and extract the necessary data.
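As a rough sketch of reading coordinates out of the XML, the example below uses the standard Java DOM parser on a simplified, hand-written sample. The element name "line" and the attribute names "baseline", "l" and "r" are assumptions here and should be checked against the XMLSchema Description article; the real export is more deeply nested than this sample. XmlLineReader and LineInfo are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads line coordinates from a simplified XML export. */
final class XmlLineReader {

    /** Hand-written stand-in for exported XML; element and attribute names are assumed. */
    static final String SAMPLE =
        "<page>"
        + "<line baseline=\"120\" l=\"10\" t=\"100\" r=\"500\" b=\"125\">name: abc date:1/1/1</line>"
        + "<line baseline=\"160\" l=\"10\" t=\"140\" r=\"200\" b=\"165\">second line</line>"
        + "</page>";

    /** Text of one line plus its baseline and horizontal extent. */
    static final class LineInfo {
        final String text;
        final int baseline;
        final int left;
        final int right;

        LineInfo(String text, int baseline, int left, int right) {
            this.text = text;
            this.baseline = baseline;
            this.left = left;
            this.right = right;
        }
    }

    /** Parses the XML string and returns one LineInfo per "line" element. */
    static List<LineInfo> readLines(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("line");
            List<LineInfo> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element line = (Element) nodes.item(i);
                lines.add(new LineInfo(
                    line.getTextContent(),
                    Integer.parseInt(line.getAttribute("baseline")),
                    Integer.parseInt(line.getAttribute("l")),
                    Integer.parseInt(line.getAttribute("r"))));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (LineInfo line : readLines(SAMPLE)) {
            System.out.println(line.baseline + " " + line.left + "-" + line.right + " " + line.text);
        }
    }
}
```

For large exports a streaming parser may be a better fit than DOM, but DOM keeps the example short.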

The basic idea is that a field value and its field name are usually close to each other in a document. You could, for example, search the XML for keywords such as the field names, get those keywords’ coordinates, and then find other text blocks with the field values located near those keywords (right below them, on the same level to the right, etc.). To find words that belong to the same line, use the words’ baseline coordinates: if the baseline coordinates are close, the words are on the same line.
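The keyword-and-baseline idea could look roughly like this in Java. It is only a sketch on plain data, assuming the words and their coordinates have already been read from the XML. The names FieldFinder and Word, the sample coordinates and the tolerance of 3 are invented for the example, and only the "same level to the right" case is covered.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: finds the value that sits to the right of a keyword on the same line. */
final class FieldFinder {

    /** A word with its horizontal extent and baseline (illustrative type, not an SDK class). */
    static final class Word {
        final String text;
        final int left;
        final int right;
        final int baseline;

        Word(String text, int left, int right, int baseline) {
            this.text = text;
            this.left = left;
            this.right = right;
            this.baseline = baseline;
        }
    }

    /** Two words count as being on the same line when their baselines differ by at most tolerance. */
    static boolean sameLine(Word a, Word b, int tolerance) {
        return Math.abs(a.baseline - b.baseline) <= tolerance;
    }

    /** Returns the nearest word to the right of the first match for keyword, if there is one. */
    static Optional<String> valueRightOf(List<Word> words, String keyword, int tolerance) {
        Word key = null;
        for (Word word : words) {
            if (word.text.equals(keyword)) {
                key = word;
                break;
            }
        }
        if (key == null) {
            return Optional.empty();
        }
        Word best = null;
        for (Word word : words) {
            boolean candidate = word != key
                && sameLine(key, word, tolerance)
                && word.left >= key.right;
            if (candidate && (best == null || word.left < best.left)) {
                best = word;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.text);
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
            new Word("name:", 10, 60, 120),
            new Word("abc", 70, 100, 121),
            new Word("date:", 400, 450, 119),
            new Word("1/1/1", 455, 500, 120),
            new Word("other", 70, 100, 160));
        System.out.println(valueRightOf(words, "name:", 3));
        System.out.println(valueRightOf(words, "date:", 3));
    }
}
```

The "right below" case would follow the same pattern, comparing left edges for alignment and picking the nearest larger baseline instead.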

If you would like to recognize forms, you can also try FlexiCapture Engine.

 
