Human Verification after Cloud SDK processing

  • Last Post 31 December 2015
Steph posted this 30 December 2015

I'd much appreciate some advice on identifying FineReader scanned elements that might require human verification. I am working on a web-based application in which tax forms are sent to the Cloud OCR. The output is then scraped for export to a database.

100% accuracy is required for numbers in the tax forms, so I will have to supplement the ABBYY Cloud OCR output with human intervention. Since the ABBYY FineReader Engine Visual Components module is only available for Windows installations, I plan to code up a web-based human verification engine.

For this verification engine, I had planned to rely on the charConfidence attribute from the ABBYY XML Export to identify numbers requiring human verification, based on some pre-specified confidence threshold. However, I'm not sure this is the right approach, since many numbers that FineReader reads correctly (100% accuracy in my sample forms, incidentally) have low charConfidence values.

Can anyone give advice on this? I see two possible options:

  1. Only require human verification for very low (say, <20%) charConfidence values. My impression is that charConfidence is, in any case, lower on average for numbers than for letters (even though ABBYY FineReader is so far giving me 100% accuracy for numbers in sample documents).
  2. Use the suspicious attribute to identify numbers requiring subsequent human verification. Essentially, I would be deferring to the binary judgment of ABBYY's algorithms to identify potentially misread numbers.

Many thanks.
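
The two options above can be compared side by side with a short script. What follows is a minimal sketch, not ABBYY's own method: the sample XML is hand-written for illustration (not real OCR output), the names `flag_digits` and `local_name` are hypothetical, and the 20% threshold is simply the figure mentioned in option 1. It assumes the export contains `charParams` elements carrying `charConfidence` and `suspicious` attributes, as described in this thread.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of an ABBYY XML export; NOT real OCR output.
SAMPLE_XML = """<document>
  <page>
    <line>
      <charParams charConfidence="92">1</charParams>
      <charParams charConfidence="15">7</charParams>
      <charParams charConfidence="48" suspicious="true">3</charParams>
      <charParams charConfidence="60">A</charParams>
    </line>
  </page>
</document>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def flag_digits(xml_text, threshold=20):
    """Return one record per recognized digit with the verdict of each option."""
    records = []
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "charParams":
            continue
        char = (elem.text or "").strip()
        if not char.isdigit():
            continue  # only numbers need verification in this scenario
        # A missing attribute is treated as "unknown" (-1), not as low confidence.
        confidence = int(elem.get("charConfidence", "-1"))
        records.append({
            "char": char,
            "confidence": confidence,
            # Option 1: confidence below a pre-specified threshold.
            "low_confidence": 0 <= confidence < threshold,
            # Option 2: defer to the engine's binary 'suspicious' flag.
            "suspicious": elem.get("suspicious", "false").lower() in ("true", "1"),
        })
    return records


if __name__ == "__main__":
    for record in flag_digits(SAMPLE_XML):
        print(record)
```

Running both checks over a set of forms with known correct values would show how many true misreads each option catches and how many correct digits each one sends to a human unnecessarily.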

Oksana Serdyuk posted this 30 December 2015

Please see my answer to a similar question here.

Also, when processing numbers, it can be useful to limit the alphabet of the recognition language, for example by using only the Digits language.

Steph posted this 31 December 2015

Oksana,

Many thanks for your rapid response! I do appreciate it. What I gather from your answer is that the charConfidence value is NOT sufficient to identify whether a character requires subsequent human verification.

Based on this, I have 2 follow-up questions:

  1. If humans subsequently verify characters deemed suspicious, do you believe this will get me close to a 100% accuracy level? Would doing so be considered "best practice" among ABBYY FineReader users?
  2. Since my forms are a mix of letters and numbers, if I am going to properly use the 'Digits' language, I believe I will have to use the processFields method, yes? That way, I can identify which parts of the document should use Language=English, and which parts of the document should use Language=Digits.

Many thanks.

Oksana Serdyuk posted this 31 December 2015

  1. Yes, you can use the suspicious attribute to implement the verification step, but you should understand that a suspicious character may turn out to be either correctly or incorrectly recognized.
  2. If your usage scenario allows you to use the processTextField or processFields methods, then you can significantly improve the recognition result by using the letterSet and regExp parameters. Please also refer to the following topic for more information: How to improve handprinted recognition? All of its recommendations apply to printed text, too.
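
As an illustration of point 2, a request for a single numeric field might be assembled as below. This is only a sketch under stated assumptions: the endpoint URL, the comma-separated `left,top,right,bottom` form of `region`, and the helper name `build_text_field_url` are not taken from this thread and should be checked against the Cloud OCR SDK documentation. The sketch only builds the URL; it does not send the image or handle authentication. A `regExp` value could be added in the same way, using the syntax given in the ABBYY documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Cloud OCR SDK documentation.
BASE_URL = "https://cloud.ocrsdk.com/processTextField"


def build_text_field_url(region, language="English", letter_set=None):
    """Build a processTextField URL for one field of the form.

    region     -- (left, top, right, bottom) in pixels (assumed format)
    language   -- recognition language, e.g. "Digits" for numeric fields
    letter_set -- optional string of allowed characters (letterSet parameter)
    """
    params = {
        "region": ",".join(str(value) for value in region),
        "language": language,
    }
    if letter_set is not None:
        params["letterSet"] = letter_set
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical coordinates of a numeric box on a tax form.
    print(build_text_field_url((100, 200, 400, 240),
                               language="Digits",
                               letter_set="0123456789"))
```

Restricting each numeric field to digits in this way addresses the mixed letters-and-numbers concern raised above, since text fields elsewhere on the form can be requested separately with language set to English.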
