Reorder recurring pdf tables & lines

  • Last Post 11 March 2019
hoshour posted this 08 March 2019

I have a 1,000-page PDF in which every page has the same layout: 10 small tables, most of them 2 or 3 lines long and a few cells wide. Some tables have a merged cell or two. Two of the tables vary in length, but I don't need those.

To make this useful, I need the OCR Editor to convert the PDF to Excel format with all the information from each page rearranged onto a single row.

Further, I need FineReader to do this automatically as it comes to each new page so that instead of 1,000 pages of tables I end up with 1,000 rows.

Is this possible?

Nadezhda A. Solovyeva posted this 11 March 2019

FineReader Engine can convert a PDF with tables to Excel/CSV format. To implement your scenario, I would suggest the following:

1) Read all pages with DocumentProcessingParams.PageProcessingParams.PageAnalysisParams.AggressiveTableDetection = true so that FREngine detects as many tables as possible.

2) If your table structure allows it, set TableAnalysisParams.SplitOnlyBySeparators = true and TableAnalysisParams.SingleLinePerCell = true to make cell detection more accurate.

If these steps extract your tables correctly, continue with the steps below. Otherwise, please use a more advanced structured-document OCR tool, such as FlexiCapture Engine.

3) Export the result to CSV.

4) Post-process the CSV using text manipulation functions to merge each page's tables into a single row.
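
As a rough illustration of step 4, here is a minimal Python sketch of the flattening. It is not part of FineReader Engine. It assumes you already have the exported CSV text for each page separately (for example, one CSV file per page), which may not match how your export is set up. The function names flatten_page and pages_to_rows are made up for this example.

```python
import csv
import io


def flatten_page(csv_text):
    """Collapse all non-empty cells of one page's CSV text into a single list.

    Assumption: csv_text holds the exported tables of exactly one page.
    Empty cells (for example from merged cells) and blank lines are dropped.
    """
    row = []
    for record in csv.reader(io.StringIO(csv_text)):
        row.extend(cell.strip() for cell in record if cell.strip())
    return row


def pages_to_rows(page_csv_texts):
    """Return CSV text containing one output row per input page."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for text in page_csv_texts:
        writer.writerow(flatten_page(text))
    return out.getvalue()


if __name__ == "__main__":
    # Two made-up pages, each with two small tables separated by a blank line.
    page1 = "Name,Alice\nID,17\n\nCity,Oslo\n"
    page2 = "Name,Bob\nID,42\n\nCity,Rome\n"
    print(pages_to_rows([page1, page2]))
```

Because blank cells are dropped, columns will only line up across rows if every page yields the same number of filled cells. If some cells are empty on some pages, or if the two variable-length tables are included in the export, you would need to keep the blanks or filter those tables out before flattening.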