Which version to extract data from tables?

  • Topic Is Solved
james.thomas@ucl.ac.uk posted this 3 weeks ago

Hi - I'm interested in extracting tables from pdf files - you could imagine the problem as wanting to have a list of tables for each file processed (e.g. as csv / HTML).

I'm not sure whether I should be using the Cloud OCR SDK or the FineReader application. Please could you advise?

Thanks, James.

Nadezhda A. Solovyeva posted this 3 weeks ago

Hi James,

Both products will work for your purpose. With Cloud OCR SDK, you can choose the XML result format, which marks up tables with a TABLE tag; you can then read this XML into CSV or convert it to HTML. With FineReader Engine, you can choose the HTML export format to export your tables directly. The recognition result will be the same in both cases. The difference is that FineReader Engine supports more output formats, so if our HTML export result works for you, it will save you coding time.
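
As a rough illustration of the XML-to-CSV route, here is a minimal C# sketch using only LINQ to XML. It assumes that each table appears in the XML as a `block` element with `blockType="Table"`, containing `row` and `cell` elements whose text sits in `line` elements; these names are assumptions and should be checked against the XML export schema. The class and method names (`AbbyyTableToCsv`, `ToCsv`) are hypothetical, and merged cells are written as a single field with their spans ignored.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AbbyyTableToCsv
{
    // Returns one CSV string per table found in the XML.
    // Element and attribute names here are assumptions, not confirmed API.
    public static List<string> ToCsv(string abbyyXml)
    {
        XDocument doc = XDocument.Parse(abbyyXml);
        var result = new List<string>();
        foreach (XElement table in doc.Descendants().Where(IsTableBlock))
        {
            var rows = table.Elements()
                .Where(e => e.Name.LocalName == "row")
                .Select(row => string.Join(",", row.Elements()
                    .Where(e => e.Name.LocalName == "cell")
                    .Select(cell => Quote(CellText(cell)))));
            result.Add(string.Join("\r\n", rows));
        }
        return result;
    }

    // Matching on LocalName ignores whatever XML namespace the export uses.
    private static bool IsTableBlock(XElement e) =>
        e.Name.LocalName == "block" && (string)e.Attribute("blockType") == "Table";

    // Joins the text of every line in the cell with single spaces.
    private static string CellText(XElement cell) =>
        string.Join(" ", cell.Descendants()
            .Where(e => e.Name.LocalName == "line")
            .Select(l => l.Value.Trim()));

    // Quotes a field only when it contains a comma, quote, or line break.
    private static string Quote(string field) =>
        field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0
            ? "\"" + field.Replace("\"", "\"\"") + "\""
            : field;
}
```

Calling `AbbyyTableToCsv.ToCsv(xmlText)` gives one string per table, which can then be written to separate .csv files.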

Please check our knowledge base article on how to force table detection: https://abbyy.zendesk.com/knowledge/articles/360002905660

james.thomas@ucl.ac.uk posted this 3 weeks ago

Thanks for your reply, Nadezhda.

I'm afraid I can't follow the knowledge base link - it just redirects to the home page.

Re using the XML and converting to HTML - this sounds doable. Do you have example code re how to do this? I can estimate some of the possible variants that I'd need to cover (e.g. 'merged' cells), but if you have an example, that would no doubt be a great help! (I'm writing in .net - so mostly C#)

thanks for your help, James.

Nadezhda A. Solovyeva posted this 2 weeks ago

Dear James,

Unfortunately, we don't have an XSLT sample for the ABBYY XML → HTML transformation. Please check our XML export schema for a full description of the ABBYY XML format.
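
In the absence of an XSLT sample, the transformation can also be done directly in C#. The sketch below is only an illustration under stated assumptions: it assumes tables are `block` elements with `blockType="Table"`, made of `row` and `cell` elements, that merged cells carry `colSpan` / `rowSpan` attributes, and that cell text sits in `line` elements. All of these names should be verified against the XML export schema. The class and method names (`AbbyyTableToHtml`, `ToHtml`) are hypothetical.

```csharp
using System.Linq;
using System.Net;
using System.Text;
using System.Xml.Linq;

public static class AbbyyTableToHtml
{
    // Converts every table block in the XML into an HTML <table>.
    // Element and attribute names here are assumptions, not confirmed API.
    public static string ToHtml(string abbyyXml)
    {
        XDocument doc = XDocument.Parse(abbyyXml);
        var html = new StringBuilder();

        // Matching on LocalName ignores whatever XML namespace the export uses.
        var tables = doc.Descendants()
            .Where(e => e.Name.LocalName == "block"
                     && (string)e.Attribute("blockType") == "Table");

        foreach (XElement table in tables)
        {
            html.Append("<table>");
            foreach (XElement row in table.Elements().Where(e => e.Name.LocalName == "row"))
            {
                html.Append("<tr>");
                foreach (XElement cell in row.Elements().Where(e => e.Name.LocalName == "cell"))
                {
                    html.Append("<td");
                    // Merged cells: emit a span attribute only when it exceeds 1.
                    AppendSpan(html, "colspan", (int?)cell.Attribute("colSpan"));
                    AppendSpan(html, "rowspan", (int?)cell.Attribute("rowSpan"));
                    html.Append('>');

                    // Join the text of every line in the cell with single spaces.
                    var lines = cell.Descendants()
                        .Where(e => e.Name.LocalName == "line")
                        .Select(l => l.Value.Trim());
                    html.Append(WebUtility.HtmlEncode(string.Join(" ", lines)));
                    html.Append("</td>");
                }
                html.Append("</tr>");
            }
            html.Append("</table>");
        }
        return html.ToString();
    }

    private static void AppendSpan(StringBuilder html, string name, int? span)
    {
        if (span.HasValue && span.Value > 1)
            html.Append(' ').Append(name).Append("=\"").Append(span.Value).Append('"');
    }
}
```

Because the cell text is HTML-encoded and spans are copied only when present, the output should stay valid for both plain and merged cells, provided the attribute names match the schema.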

Here is a copy of the knowledge base article referenced above.

In some cases, you may get a corrupted layout because the tables in the document were not detected.

1. First of all, make sure that your images are of sufficient quality. We recommend 300 dpi color or grayscale images.

2. If your images have good quality, then make sure that you did not use any of the parameters below because they turn off table detection:

  • IPageAnalysisParams.EnableTextExtractionMode = true;
  • IPageAnalysisParams.DetectTables = false;
  • FREngine.LoadPredefinedProfile("TextExtraction_Accuracy");

3. You may use the following parameter to make table detection a priority for the Analyser:

  • IPageAnalysisParams.AggressiveTableDetection = true;

4. In rare cases, FineReader Engine cannot detect tables even when forced. For example, this happens if your table has a lot of decorative formatting, does not have clear separators, or uses decorative fonts that are not detected clearly.

There is one last method of table recognition, applicable only to pages that consist of the table alone (no pictures or text blocks outside the table). You may create a table block covering the whole page area and force analysis of that block. A C# code sample is below:

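// Build a region that covers the whole page image.
FREngine.IRegion wholePageRegion = engineLoader.Engine.CreateRegion();
wholePageRegion.AddRect(0, 0, document.Pages[0].ImageDocument.BlackWhiteImage.Width, document.Pages[0].ImageDocument.BlackWhiteImage.Height);

// Add a table block over that region.
FREngine.IBlock block = document.Pages[0].Layout.Blocks.AddNew(FREngine.BlockTypeEnum.BT_Table, wholePageRegion);
FREngine.ITableBlock tableBlock = block.GetAsTableBlock();
// Analyze the table structure of block 0 (assumes the new block is the only block in the layout).
document.Pages[0].AnalyzeTable(0);

// Recognize and synthesize the document as usual.
document.Recognize();
document.Synthesize();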

james.thomas@ucl.ac.uk posted this 2 weeks ago

Dear Nadezhda,

Thanks - that's really helpful. So it sounds as though I should head towards the FineReader Engine library, rather than using the Cloud OCR SDK API, then?

thanks, James.

Nadezhda A. Solovyeva posted this 2 weeks ago

Dear James,

Yes, this is correct. The kind of processing described above goes far beyond the general configuration options available in Cloud OCR SDK. Only FineReader Engine allows this level of configuration.
