Field identification

  • Last Post 07 June 2012
IgetOra posted this 06 June 2012

I was thinking of using the service to generate records from received invoices, which carry the same information in many possible formats, including free-text documents written by consultants. Since they all share the fields required by law, I planned to use regExp queries, but I ran into trouble: field extraction seems to work well only when the field position is accurately specified.

Are you going to work in the direction of identifying fields by their expected characteristics, such as nearby titles or expected type and size, or should I think of mapping the entire page to a database of words and their positions and managing the search myself?
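To make the second option concrete, here is a minimal sketch of what "managing the search myself" could look like. It assumes the recognition step can return each word with its pixel position; the `Word` record, the `value_right_of` helper, and the `max_dy` tolerance are all hypothetical names for illustration, not part of any ABBYY API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    # Hypothetical record: one recognized word and the top-left corner of its box.
    text: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels


def value_right_of(words: List[Word], label: str, max_dy: int = 10) -> Optional[str]:
    """Return the nearest word to the right of `label` on roughly the same line."""
    # Find the first word that matches the label, ignoring case and a trailing colon.
    anchors = [w for w in words if w.text.rstrip(":").lower() == label.lower()]
    if not anchors:
        return None
    anchor = anchors[0]
    # Keep words to the right of the label whose vertical offset is within max_dy.
    candidates = [
        w for w in words
        if w.x > anchor.x and abs(w.y - anchor.y) <= max_dy
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.x).text
```

A real page would probably also need multi-word labels, values placed below the label, and tolerance for skewed scans, so this only shows the basic idea of locating a field by its nearby title.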

Thanks

Chudik79 posted this 07 June 2012

Hi!

The current Cloud OCR SDK API covers text recognition only, whether of the full text or of a single field. It does nothing to locate a zone on an image.

We have a data capture SDK (ABBYY FlexiCapture Engine) which has the required ability. It is not available in the Cloud yet, but we are thinking about it. That will take some time.

Right now I see two possible ways of doing what you want:

  1. Do full-text recognition and then apply a regExp search to the recognized text.
  2. If the data you work with is structured or semi-structured, you can pre-sort it and then apply known field layouts. Pre-sorting may be done using full-text recognition and a keyword search. To save time and effort, only part of a document needs to be OCRed (the first page of a multi-page document, or a zone of a single-page document).
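Both options can be sketched in a few lines of Python, assuming the recognized text is already available as a plain string. The field names, regular expressions, layout names, and keywords below are hypothetical placeholders and would need tuning to the actual invoices.

```python
import re

# Hypothetical patterns for option 1: each captures the value that follows a label.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-/]+)", re.IGNORECASE
    ),
    "date": re.compile(
        r"\bdate\s*[:\-]?\s*(\d{1,2}[./-]\d{1,2}[./-]\d{2,4})", re.IGNORECASE
    ),
    "total": re.compile(
        r"\btotal\s*(?:due|amount)?\s*[:\-]?\s*([0-9]+(?:[.,][0-9]{2})?)", re.IGNORECASE
    ),
}

# Hypothetical keywords for option 2: used to guess which known layout a document follows.
LAYOUT_KEYWORDS = {
    "consultant_freetext": ["consulting", "hours"],
    "supplier_form": ["purchase order", "vat"],
}


def extract_fields(text):
    """Option 1: apply every pattern to the full recognized text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def presort(text):
    """Option 2: pick the layout whose keywords occur most often, or None if none match."""
    lowered = text.lower()
    scores = {
        layout: sum(keyword in lowered for keyword in keywords)
        for layout, keywords in LAYOUT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For pre-sorting, `presort` could be run on the text of the first page only, and its result used to choose which set of field zones to send for field-level recognition.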

Best regards, Dmitry. ABBYY, Lead Product Analyst, SDK products.

Andrey Isaev posted this 07 June 2012

Actually, ABBYY has been working in that direction for a long time. We have a product called FlexiCapture and an SDK called FlexiCapture Engine. Both solve the task you have just described: they help extract particular data from semi-structured documents. Using FlexiLayout Studio, you can define the fields you want to extract and the rules for locating them on the image. These are not just regular expressions; they can define complicated dependencies with voting among different layout hypotheses, and even field cross-checking and database look-ups for values.

Unfortunately, this is not yet available in the Cloud, since it requires special training in FlexiLayout programming.

So please contact your nearest ABBYY representative to talk about the FlexiCapture product or Engine.
