Tables only from PDF

  • Last Post 27 August 2018
Rama Reddy posted this 09 August 2018

 Hi,

 

I have a PDF with two tables and some text. I want to extract only the tables and skip the text using the ABBYY FineReader Java API. How can I do that? Can you suggest some Java code?

Helen Osetrova posted this 10 August 2018

Hello!

 

Please create the PageAnalysisParams object and tune its properties: set IPageAnalysisParams::DetectText = false and IPageAnalysisParams::DetectTables = true. You can learn more about the PageAnalysisParams object in Developer’s Help → API Reference → Parameter Objects → Preprocessing, Analysis, Recognition, and Synthesis Parameters → PageAnalysisParams.

 

You can also create a user profile with the required settings, save it as an .ini file, and then load it using the IEngine::LoadProfile method:

private IEngine engine = null;
...
engine = Engine.GetEngineObject( SamplesConfig.GetDllFolder(), SamplesConfig.GetDeveloperSN() );
...
engine.LoadProfile( "../profile.ini" );

 

The “profile.ini” file should contain the following lines:

[PageAnalysisParams]
DetectText = false
DetectTables = true
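
If it is more convenient to generate the profile from code, the same three lines can be written with standard Java file I/O. This is only a sketch: the class TableOnlyProfileWriter and its methods are hypothetical names introduced for illustration and are not part of the FineReader Engine API; the sketch uses nothing but java.nio.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the FineReader Engine API): writes the
// table-only profile so that its path can later be passed to LoadProfile.
class TableOnlyProfileWriter {

    // The three profile lines, exactly as listed in the post.
    static List<String> profileLines() {
        return Arrays.asList(
                "[PageAnalysisParams]",
                "DetectText = false",
                "DetectTables = true");
    }

    // Writes the profile to the given path and returns its absolute path.
    static Path write(Path target) throws IOException {
        Files.write(target, profileLines(), StandardCharsets.UTF_8);
        return target.toAbsolutePath();
    }

    public static void main(String[] args) throws IOException {
        Path written = write(Paths.get("profile.ini"));
        System.out.println("Profile written to " + written);
    }
}
```

The path returned by write() is the value you would then hand to LoadProfile.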

 

You can also specify the recognition language and many other options in the profile. To learn more about using profiles, see Developer’s Help → Guided Tour → Advanced Techniques → Working with Profiles.

 

You can find comprehensive Java samples for different scenarios in the %ABBYY FineReader Engine folder%/Samples/Java directory.

 

Rama Reddy posted this 13 August 2018

How do I use that profile.ini file to extract only the tables from a PDF?

Helen Osetrova posted this 14 August 2018

Hello,

 

Please note that the use of profiles is described in detail in Developer’s Help → Guided Tour → Advanced Techniques → Working with Profiles.

 

When new objects are created, their properties are usually set to reasonable defaults. But default values are not always optimal for all usage scenarios, so you may need to change these properties in some cases. This can be done either via the API or with the help of a profile. A profile contains a list of new default values for object properties. The LoadProfile() method of the Engine object allows you to load a user profile file (profile.ini). After this file is loaded, newly created objects will have the new default values specified in the file.

 

So, to use the profile for processing, you should call the LoadProfile( String FileName ) method of the Engine object. Its only parameter, FileName, contains the path to the profile file. You can specify either a full path or a path relative to the current directory.
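
Since a relative path is resolved against the current directory, it may help to resolve and check the path before handing it to LoadProfile. Below is a minimal sketch using only java.nio. ProfilePathCheck and resolveProfile are hypothetical names, not FineReader Engine API, and the sketch assumes that the Engine's current directory is the same as the JVM's working directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the FineReader Engine API.
class ProfilePathCheck {

    // Resolves a possibly relative profile path against the JVM's working
    // directory and fails early if no readable regular file is found there.
    static Path resolveProfile(String fileName) {
        Path absolute = Paths.get(fileName).toAbsolutePath().normalize();
        if (!Files.isRegularFile(absolute) || !Files.isReadable(absolute)) {
            throw new IllegalArgumentException("Profile not found or unreadable: " + absolute);
        }
        return absolute;
    }

    public static void main(String[] args) {
        String fileName = args.length > 0 ? args[0] : "profile.ini";
        try {
            Path profile = resolveProfile(fileName);
            // profile.toString() is the value that would be passed to LoadProfile.
            System.out.println("Profile resolved to " + profile);
        } catch (IllegalArgumentException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```

Failing early with the absolute path in the message makes a wrong working directory easier to spot than a load error reported later.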

 

Please find below a Java code snippet based on the Java sample included in the FineReader Engine distribution pack:

 

    private void processImage() {

        String imagePath = SamplesConfig.GetSamplesFolder() + "\\SampleImages\\Demo.tif";
        String profilePath = SamplesConfig.GetSamplesFolder() + "\\SampleImages\\profile.ini";  // you should put profile.ini to the specified directory

        try {

            // Load Engine
            engine = Engine.GetEngineObject( SamplesConfig.GetDllFolder(), SamplesConfig.GetDeveloperSN() ); // you should specify a valid Developer Serial Number in SamplesConfig.java        

            // Load profile.ini
            engine.LoadProfile( profilePath );

            // Create document
            IFRDocument document = engine.CreateFRDocument();            

            try {

                // Add image file to document
                displayMessage( "Loading image...");
                document.AddImageFile( imagePath, null, null );

                // Process document
                displayMessage( "Process...");
                document.Process();
            
                // Save results
                displayMessage( "Saving results...");

                // Save results to DOCX with default parameters
                String docxExportPath = SamplesConfig.GetSamplesFolder() + "\\SampleImages\\Demo.docx";
                document.Export( docxExportPath, FileExportFormatEnum.FEF_DOCX, null );

            } finally {

                // Close document
                document.Close();

                displayMessage("Done ...");

                // Unload Engine
                engine = null;
                Engine.DeinitializeEngine();

           }

        } catch( Exception ex ) {

            displayMessage( ex.getMessage() );

        }
    }

 

Rama Reddy posted this 16 August 2018

Can we do this using Layout and Blocks? If so, how do I implement that?

Rama Reddy posted this 17 August 2018

How can I extract this table properly? The approach we discussed above extracts the table, but it splits 'Firefox 1.0' into two cells and outputs them as two columns. How can I avoid that and get the table right?

Oksana Serdyuk posted this 17 August 2018

Hi Rama,

Please try adding the following lines to the profile.ini file:

[RTFExportParams]
KeepLines = true
PageSynthesisMode = PSM_RTFEditableCopy

Rama Reddy posted this 20 August 2018


 

If I use Blocks, I am able to identify blocks of type Table. But how can I collect all of those blocks, and how can I export them?

Rama Reddy posted this 21 August 2018

Why am I not getting any blocks? It is showing zero blocks.

 

IFRDocument document = engine.CreateFRDocument();

try {
    // Add image file to document
    displayMessage( "Loading image..." );
    IRegionsCollection reg = engine.CreateRegionsCollection();

    document.AddImageFile( imagePath, null, null );
    IFRPages pages = document.getPages();
    IRegion region = engine.CreateRegion();
    System.out.println( pages.getCount() );

    if( pages != null && pages.getCount() > 0 ) {
        for( int i = 0; i < pages.getCount(); i++ ) {
            IFRPage page = pages.Item( i );
            ILayout lay_out = page.getLayout();
            System.out.println( page.getLayout() );
            ILayoutBlocks blocks = lay_out.getBlocks();
            System.out.println( lay_out.getBlocks() );

            System.out.println( document.getPages().Item( i ).getLayout().getBlocks().getCount() );
            document.getPages().Item( i ).getLayout().getBlocks().DeleteAll();
            int c = 0;
            System.out.println( blocks.getCount() );

            if( blocks != null && blocks.getCount() > 0 ) {
                for( int j = 0; j < blocks.getCount(); j++ ) {
                    IBlock block = blocks.Item( j );
                    System.out.println( block.getType() );
                    if( block.getType() == BlockTypeEnum.BT_Table ) {
                        System.out.println( c );
                        ITableBlock tblock = block.GetAsTableBlock();
                        region = block.getRegion();
                        document.getPages().Item( i ).getLayout().getBlocks().AddNew( block.getType(), region, c );
                        c++;
                    }
                }
            }
        }
    }

    //document.ProcessPages(null,null,reg);

    document.Recognize( null, null );
    document.Synthesize( null );

    String texExportPath = SamplesConfig.GetSamplesFolder() + "images/Emely_11111.xls";
    document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null );

 

Helen Osetrova posted this 21 August 2018

Hi Rama!

 

Document processing in ABBYY FineReader Engine consists of several steps: page preprocessing, analysis, recognition, page synthesis, document synthesis, and export. Getting access to the document layout is possible after the analysis stage. 

Please learn more about processing steps on our Technology Portal and in the Developer’s Help → Guided Tour → Advanced Techniques → Tuning Parameters of Page Preprocessing, Analysis, Recognition, and Synthesis. Please note that the IFRDocument::Process() method includes all stages of processing except the export.
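
To make the ordering concrete, here is a small stand-alone model of the stages listed above. It does not call the Engine at all: StageOrderDemo, ProcessingStage and layoutAvailableAfter are hypothetical names used only to illustrate that blocks can be expected once the analysis stage has completed.

```java
// Illustrative model only: these names are hypothetical and are not
// FineReader Engine types. The order follows the stages listed above.
class StageOrderDemo {
    public static void main(String[] args) {
        for (ProcessingStage stage : ProcessingStage.values()) {
            System.out.println(stage + " completed, layout blocks available: "
                    + ProcessingStage.layoutAvailableAfter(stage));
        }
    }
}

enum ProcessingStage {
    PAGE_PREPROCESSING,
    ANALYSIS,
    RECOGNITION,
    PAGE_SYNTHESIS,
    DOCUMENT_SYNTHESIS,
    EXPORT;

    // The layout is filled in by the analysis stage, so blocks can be
    // expected only when the last completed stage is ANALYSIS or later.
    static boolean layoutAvailableAfter(ProcessingStage lastCompleted) {
        return lastCompleted != null && lastCompleted.compareTo(ANALYSIS) >= 0;
    }
}
```

In this model a document that has only had its image added corresponds to a null stage, which is the situation where the blocks count comes back as zero.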

 

So, before working with the page blocks you should apply the IFRDocument::Analyze() method for the whole document or the IFRPage::Analyze() method for each page. Otherwise, the Blocks collection will be empty.

 

 

After this, please do the following for each document page:

1. Create a new Layout instance and add to it every TableBlock that you are interested in.

2. Set the newly created layout as an actual layout for the page and perform page synthesis.

 

Please see the Java code sample below:

try {

                // Add image file to document
                displayMessage( "Loading image..." );
                document.AddImageFile( imagePath, null, null );
                document.Preprocess( null, null, null, null);
                document.Analyze( null, null, null);

                IFRPages frPages = document.getPages();
                int pagesCount = frPages.getCount();

                for (int j =0; j < pagesCount; j++) {
                    ILayout layout = engine.CreateLayout();
                    ILayoutBlocks layBlocks = layout.getBlocks();

                    IFRPage page = frPages.getElement(j);
                    ILayout pageLayout = page.getLayout();
                    ILayoutBlocks blocks = pageLayout.getBlocks();
                    int blocksCount = blocks.getCount();

                    for (int i = 0; i < blocksCount; i++) {
                        IBlock block = blocks.getElement(i);
                        BlockTypeEnum blockType = block.getType();

                        if (blockType == BlockTypeEnum.BT_Table) {
                            IRegion region = block.getRegion();
                            layBlocks.AddNew(BlockTypeEnum.BT_Table, region, 0);

                        }

                        page.setLayout(layout);
                        page.Recognize(null, null);
                     

                    }
                }

                document.Synthesize(null);
...
                document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);

}

 

You can learn more about working with the document layout in the Developer’s Help → Guided Tour → Advanced Techniques → Working with Layout and Blocks section.

 

Rama Reddy posted this 22 August 2018

It is giving an error at engine.CreateLayout(), and I am still getting zero blocks.

Helen Osetrova posted this 22 August 2018

Hello Rama!

 

Could you give us some more details about the issue you are facing? What kind of error is it, and what exactly does it look like?

 

In addition, I would like to apologize for a mistake that slipped into the code sample. Please place the following lines outside the inner for (int i = 0; i < blocksCount; i++) { ... } block:

page.setLayout(layout);
page.Recognize(null, null);

 

So, the changed code should look like this:

...
for (int j = 0; j < pagesCount; j++) {
    ...
    for (int i = 0; i < blocksCount; i++) {
        ...
        if (blockType == BlockTypeEnum.BT_Table) {
            ...
        }
    } // end of the inner for block

    page.setLayout(layout);
    page.Recognize(null, null);
} // end of the outer for block

document.Synthesize(null);
...
document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);
...

 

If you still have difficulties after the source code modification, please describe the issue as fully as possible to help us to assist you better. 

 

Rama Reddy posted this 22 August 2018

I am able to check the blocks, but how can we leave out all the other blocks, keep only the text blocks in the document, and export them?

private void processImage() {
    String imagePath = SamplesConfig.GetSamplesFolder() + "images/interest-notice.jpg";

    try {
        // Don't recognize PDF file with a textual content, just copy it
        if( engine.IsPdfWithTextualContent( imagePath, null ) ) {
            displayMessage( "Copy results..." );
            String resultPath = SamplesConfig.GetSamplesFolder() + "interest-notice.pdf";
            Files.copy( Paths.get( imagePath ), Paths.get( resultPath ), StandardCopyOption.REPLACE_EXISTING );
        }

        // Create document
        IFRDocument document = engine.CreateFRDocument();

        try {
            // Add image file to document
            displayMessage( "Loading image..." );
            //IRegionsCollection reg=engine.CreateRegionsCollection();

            document.AddImageFile( imagePath, null, null );
            document.Preprocess( null, null, null, null );
            document.Analyze( null, null, null );

            IFRPages pages = document.getPages();
            int page_cnt = pages.getCount();
            //IRegion region=engine.CreateRegion();
            //System.out.println(pages.getCount());

            if( pages != null && page_cnt > 0 ) {
                for( int i = 0; i < page_cnt; i++ ) {
                    ILayout layout = engine.CreateLayout();
                    ILayoutBlocks layblocks = layout.getBlocks();
                    IFRPage page = pages.getElement( i );
                    page.Analyze( null, null, null );
                    page.Recognize( null, null );
                    page.Synthesize( null );

                    ILayout lay_out = page.getLayout();
                    System.out.println( page.getLayout() );
                    ILayoutBlocks blocks = lay_out.getBlocks();
                    System.out.println( lay_out.getBlocks() );
                    layblocks.DeleteAll();
                    int blocks_cnt = blocks.getCount();
                    System.out.println( blocks.getCount() );

                    if( blocks != null && blocks_cnt > 0 ) {
                        for( int j = 0; j < blocks_cnt; j++ ) {
                            IBlock block = blocks.getElement( j );
                            System.out.println( block.getType() );
                            if( block.getType() == BlockTypeEnum.BT_Table ) {
                                IRegion region = block.getRegion();
                                layblocks.AddNew( BlockTypeEnum.BT_Table, region, 0 );
                            }
                            page.setLayout( layout );
                            page.Recognize( null, null );
                        }
                    }
                }
            }

            document.Synthesize( null );
            String texExportPath = SamplesConfig.GetSamplesFolder() + "images/interest-notice34.xls";
            document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null );
        } finally {
            // Close document
            document.Close();
        }
    } catch( Exception ex ) {
        displayMessage( ex.getMessage() );
    }
}

 

 

 

Helen Osetrova posted this 23 August 2018

Hi Rama,

To help us assist you better, could you please clarify which version of the ABBYY product you are using?

 

If creating a new ILayout instance does not work for you, please try the following:

  • obtain the actual page layout;
  • check its blocks one by one and remove those whose type is not BT_Table.

 

Please find below the code snippet that illustrates the suggested approach: 

...

IFRPages frPages = document.getPages();
int pagesCount = frPages.getCount();

for (int j = 0; j < pagesCount; j++) {

    IFRPage page = frPages.getElement(j);
    ILayout pageLayout = page.getLayout();
    ILayoutBlocks blocks = pageLayout.getBlocks();
    int blocksCount = blocks.getCount();
    int i = 0;

    while (i < blocksCount) {

        IBlock block = blocks.getElement(i);
        displayMessage( "Checking blocks");
        BlockTypeEnum blockType = block.getType();

        if (blockType != BlockTypeEnum.BT_Table) {

            displayMessage( "Delete block");
            blocks.DeleteAt(i);
            blocksCount = blocks.getCount();
            continue;

        }

        i++;
    } // iterating blocks

    displayMessage( "Recognize page...");
    page.Recognize(null, null);

} // iterating pages

...
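
The index handling in the loop above (advance i only when nothing was deleted) can be tried out without the Engine. In the sketch below a plain java.util.List of strings stands in for the blocks collection. KeepOnlyTables and keepOnlyTables are hypothetical names, and the sketch assumes, as the snippet above does, that deleting an element shifts the following ones down by one index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in demo: a plain list of strings plays the role of the blocks
// collection. None of these names come from the FineReader Engine API.
class KeepOnlyTables {

    // Removes every element that is not "Table". The index advances only
    // when nothing was removed, because removal shifts later elements left.
    static void keepOnlyTables(List<String> blockTypes) {
        int i = 0;
        while (i < blockTypes.size()) {
            if (!"Table".equals(blockTypes.get(i))) {
                blockTypes.remove(i); // the next element now sits at index i
                continue;
            }
            i++;
        }
    }

    public static void main(String[] args) {
        List<String> types = new ArrayList<>(
                Arrays.asList("Text", "Table", "Picture", "Text", "Table"));
        keepOnlyTables(types);
        System.out.println(types); // prints [Table, Table]
    }
}
```

Incrementing the index after every step, including after a removal, would skip the element that moved into the freed position, which is why the loop uses continue there.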

 

 

Christopher Nolan posted this 25 August 2018

Hi, I have two questions about FineReader 14 for Windows (Corporate):

1) When FineReader converts an HTML document to PDF, is there a way to avoid page breaks in the PDF document? My documents contain both text and tables, and I want to avoid tables being split across two pages in the PDF document.

2) Can I associate pre-formatted table templates with a specific document in Hot Folder, so that when FineReader scans and OCRs that document it automatically finds the table associated with the template and applies the template to it?

Rama Reddy posted this 27 August 2018

Hi Helen,

Thank you, it is working. However, some tables are not analyzed properly, and some parts of a table are not identified as a table. Which parameters should I set to improve the analysis of the document?

Helen Osetrova posted this 27 August 2018

Hi Rama,

 

Please try tuning the parameters of the IFRDocument::Analyze() method. For example, create an IPageAnalysisParams object and set its AggressiveTableDetection property to true. If part of the table appears as a picture in the result file, also try setting the DetectPictures property of IPageAnalysisParams to false. Then pass the newly created object to the IFRDocument::Analyze() method:

...
IPageAnalysisParams pageAnalysisParams = engine.CreatePageAnalysisParams();
pageAnalysisParams.setAggressiveTableDetection(true);
pageAnalysisParams.setDetectPictures(false); 
document.Analyze(pageAnalysisParams,null,null);
...

 

You can also specify the particular block on a page as a table block and analyze its structure with the help of the IFRPage::AnalyzeTable() method. Please learn more about this method from the Developer's Help. 
