I have documents that often contain two tables with no whitespace between them. This could also be viewed as one table whose rows do not all have the same columns. The FineReader SDK gets confused because it tries to treat this as a single table, and when I try to extract the data I can't tell where one row ends and the next begins. For example:

[image]

The first two rows are divided into 7 columns. The second set of rows is divided into 8 columns. It appears that the SDK is trying to treat it as one large table, adding invisible separators in the areas where the columns don't extend. Like this:

[image]

It's obvious why the engine would get confused by this. If I split the tables visually using Photoshop, they get parsed perfectly. Any tips on how to handle this situation? I could hardcode the number of columns per document type, but that seems messy and I'd like to keep it more generic.

asked 22 Nov '16, 22:04

akrenovo

Could you please specify your scenario: after recognizing the image, what do you want to do? Do you want to extract the data from the cells? In this case please use the TableBlock::Cells property. Or do you want to export the whole document with the table? In this case please clarify to what format you want to export it and what exactly the issue with export is.

If you could also provide any code sample showing how you process the image, that would be very helpful.

Thank you in advance!

(24 Nov '16, 18:36) Anna Fedyush... ♦♦

Yes, I want to extract the data from the cells and process it further. I'm using the TableBlock.Cells property, but the SDK gets confused with the use case above and there are 14 VSeparators for the entire table, which makes parsing the table hard unless I hard code the number of columns for each row (which I'd rather avoid if possible).

(28 Nov '16, 19:23) akrenovo

You are absolutely right that FineReader Engine creates 14 separators, as you have drawn in the second image. To handle this situation, note that the Type property of separators that cross merged cells is TST_Absent. The separator type is not an attribute of the whole separator but of a single separator segment, that is, the part between adjacent intersections with perpendicular separators.

If you need to count the separators to find out how many columns there are in each row, you can use the code sample below:

// block is a block of type BT_Table; obtain its table interface
CSafePtr<ITableBlock> tableBlock;
CheckResult( block->GetAsTableBlock( &tableBlock ) );

CSafePtr<ITableSeparators> hSeparators, vSeparators;
CheckResult( tableBlock->get_HSeparators( &hSeparators ) );
CheckResult( tableBlock->get_VSeparators( &vSeparators ) );

int hSeparatorCount;
int vSeparatorCount;

CheckResult( hSeparators->get_Count( &hSeparatorCount ) );
CheckResult( vSeparators->get_Count( &vSeparatorCount ) );

for ( int i = 0; i < hSeparatorCount - 1; i++ )
{
    int columnCount = 0;

    for ( int j = 0; j < vSeparatorCount - 1; j++ )
    {
        //Get current vertical separator
        CSafePtr<ITableSeparator> vSeparator;
        CheckResult( vSeparators->get_Element(j, &vSeparator) );

        //Get current vertical separator type
        TableSeparatorTypeEnum type;
        CheckResult( vSeparator->get_Type( i, &type ) );

        // A TST_Absent segment lies inside a merged cell, so it does not start a new cell in row i
        if ( type != TST_Absent ) columnCount++;
    }
    // columnCount is now the number of cells in row i
    // (the last separator, the right border, is skipped on purpose by the loop bound)
}
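
If you need the cells themselves rather than just a count, the same per-segment check can be turned into cell boundaries. The following is only a minimal sketch and not SDK code: `SegmentType`, `SeparatorGrid` and `cellSpansInRow` are hypothetical names, and the grid is assumed to have been filled beforehand from the segment types returned by `get_Type` as in the sample above. It also assumes that the first and last vertical separators are the table borders.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the SDK's separator segment type; only the
// absent/present distinction matters for this sketch.
enum class SegmentType { Absent, Present };

// segments[j][i] is the type of vertical separator j within row i.
// Separators are ordered left to right; first = left border, last = right border.
using SeparatorGrid = std::vector<std::vector<SegmentType>>;

// For one row, returns the (left, right) bounding separator indices of every cell.
std::vector<std::pair<std::size_t, std::size_t>>
cellSpansInRow(const SeparatorGrid& segments, std::size_t row)
{
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    if (segments.size() < 2) return spans;  // two borders are needed to form a cell
    assert(row < segments.front().size());

    std::size_t left = 0;  // assumption: the left border always opens the first cell
    for (std::size_t j = 1; j < segments.size(); ++j) {
        assert(row < segments[j].size());
        const bool isRightBorder = (j + 1 == segments.size());
        // A present segment (or the right border) closes the current cell.
        if (isRightBorder || segments[j][row] == SegmentType::Present) {
            spans.emplace_back(left, j);
            left = j;
        }
    }
    return spans;
}
```

The number of spans for a row matches the column count computed above, and each pair tells you which two separators bound the cell, so merged cells show up as spans that skip one or more separator indices.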

answered 02 Dec '16, 17:29

Anna Fedyush... ♦♦

Thanks, this worked great. I had to modify the code a little (since I'm extracting data, not just counting columns), but this pointed me in the right direction.

(03 Dec '16, 00:21) akrenovo
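
For readers facing the original problem of two stacked tables read as one, the same segment information could also be used to find where the column layout changes. This is a hedged sketch only: `SegmentType`, `SeparatorGrid` and `layoutGroupStarts` are hypothetical names rather than SDK types, and the grid is assumed to hold the per-row segment type of each vertical separator. A row starts a new group whenever its pattern of present and absent segments differs from the row above it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the SDK's separator segment type.
enum class SegmentType { Absent, Present };

// segments[j][i] is the type of vertical separator j within row i.
using SeparatorGrid = std::vector<std::vector<SegmentType>>;

// Returns the indices of rows that begin a new column layout.
// Row 0 always begins the first group.
std::vector<std::size_t>
layoutGroupStarts(const SeparatorGrid& segments, std::size_t rowCount)
{
    std::vector<std::size_t> starts;
    for (std::size_t row = 0; row < rowCount; ++row) {
        bool sameAsPrevious = (row > 0);
        // Compare this row's segment pattern with the previous row's pattern.
        for (std::size_t j = 0; sameAsPrevious && j < segments.size(); ++j) {
            assert(row < segments[j].size());
            sameAsPrevious = (segments[j][row] == segments[j][row - 1]);
        }
        if (!sameAsPrevious) starts.push_back(row);
    }
    return starts;
}
```

For a table like the one in the question, this would be expected to report one group for the 7-column rows and another for the 8-column rows, without hardcoding the column count per document type. It will also split on rows that merely contain an extra merged cell, so treat the result as a hint rather than a definitive table boundary.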


© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal