Analyzing two tables without whitespace between them

  • Last Post 03 December 2016
  • Topic Is Solved
akrenovo posted this 22 November 2016

I have documents that tend to have two tables but no whitespace between them. It could also be viewed as one table without uniform columns for every row. The FineReader SDK gets confused since it tries to treat this as a table, and when I try to extract the data I can't tell where one row ends and the next begins. For example:

[image: example document with the two adjoining tables]

The first two rows are divided into 7 columns. The second set of rows is divided into 8 columns. The SDK appears to treat it as one large table, adding invisible separators in the areas where the columns don't extend. Like this:

[image: the same table with the separators added by the SDK drawn in]

It is obvious why the engine would get confused by this. If I split the tables visually using Photoshop, they get parsed perfectly. Any tips on how to handle this situation? I could hardcode the number of columns per document type, but that seems messy and I'd like to keep it more generic.

Anna Fedyushkina posted this 24 November 2016

Could you please specify your scenario: after recognizing the image, what do you want to do? Do you want to extract the data from the cells? In this case please use the TableBlock::Cells property. Or do you want to export the whole document with the table? In this case please clarify to what format you want to export it and what exactly the issue with export is.

If you could also provide any code sample showing how you process the image, that would be very helpful.

Thank you in advance!

akrenovo posted this 28 November 2016

Yes, I want to extract the data from the cells and process it further. I'm using the TableBlock.Cells property, but the SDK gets confused by the use case above and reports 14 VSeparators for the entire table, which makes parsing the table hard unless I hardcode the number of columns for each row (which I'd rather avoid if possible).

Anna Fedyushkina posted this 02 December 2016

You are absolutely right that FineReader Engine creates 14 separators, as you have drawn in the second image. To handle this situation, note that the Type property of a separator is TST_Absent where the separator crosses merged cells. The separator type is not an attribute of the whole separator but of a single separator segment between adjacent intersections with perpendicular separators.

If you need to count the separators to find out how many columns there are in each row, you can use the code sample below:

//tableBlock is a block with BT_Table type
CheckResult( block->GetAsTableBlock( &tableBlock ) );

CSafePtr<ITableSeparators> hSeparators, vSeparators;
CheckResult( tableBlock->get_HSeparators( &hSeparators ) );
CheckResult( tableBlock->get_VSeparators( &vSeparators ) );

int hSeparatorCount;
int vSeparatorCount;

CheckResult( hSeparators->get_Count( &hSeparatorCount ) );
CheckResult( vSeparators->get_Count( &vSeparatorCount ) );

//Each pair of adjacent horizontal separators bounds one table row
for ( int i = 0; i < hSeparatorCount - 1; i++ ) {
    int columnCount = 0;

    //The last vertical separator is skipped: it only closes the last cell
    for ( int j = 0; j < vSeparatorCount - 1; j++ ) {
        //Get current vertical separator
        CSafePtr<ITableSeparator> vSeparator;
        CheckResult( vSeparators->get_Element( j, &vSeparator ) );

        //Get the type of the segment of this separator that lies in row i
        TableSeparatorTypeEnum type;
        CheckResult( vSeparator->get_Type( i, &type ) );

        //If type == TST_Absent, the cells are merged and columnCount must not be incremented
        if ( type != TST_Absent ) {
            columnCount++;
        }
    }
    //columnCount is now the number of cells in row i
}
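
The per-segment idea can also be tried out without the SDK. The sketch below is only an illustration: SegmentType, SeparatorGrid, cellSpans and columnCount are hypothetical names, not part of FineReader Engine, and the grid stands in for the values that get_Type( i, &type ) returns for each vertical separator and row. It assumes the outer table borders are always present and returns, for one row, the pair of separator indices that bound each cell, so the same loop serves both for counting columns and for locating cells.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the SDK's separator types; only the
// distinction "absent vs. present" matters for this sketch.
enum class SegmentType { Absent, Present };

// grid[j][i] is the type of vertical separator j within table row i,
// i.e. the value the SDK sample reads per separator and row.
using SeparatorGrid = std::vector<std::vector<SegmentType>>;

using CellSpan = std::pair<std::size_t, std::size_t>;

// Returns the (left, right) vertical separator indices of every cell in a row.
// Assumption: the first and last separators are the table borders and are
// treated as present regardless of their stored type.
std::vector<CellSpan> cellSpans(const SeparatorGrid& grid, std::size_t row)
{
    assert(grid.size() >= 2);  // a table needs at least a left and a right border

    std::vector<CellSpan> spans;
    std::size_t left = 0;  // left border of the cell currently being built

    for (std::size_t j = 1; j < grid.size(); ++j) {
        assert(row < grid[j].size());
        const bool isLast = (j + 1 == grid.size());

        // An absent segment means the cell continues across this separator.
        if (isLast || grid[j][row] != SegmentType::Absent) {
            spans.emplace_back(left, j);
            left = j;
        }
    }
    return spans;
}

// The number of cells in a row is the number of spans found for it.
std::size_t columnCount(const SeparatorGrid& grid, std::size_t row)
{
    return cellSpans(grid, row).size();
}
```

Calling columnCount once per row gives the per-row column count without hardcoding it per document type; a change in that count from one row to the next is one possible signal that a second table has started.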

akrenovo posted this 03 December 2016

Thanks, this worked great. I had to modify the code a little (since I'm extracting data, not just counting columns), but this pointed me in the right direction.