I'd much appreciate some advice on identifying FineReader scanned elements that might require human verification. I am working on a web-based application in which tax forms are sent to the Cloud OCR. The output is then scraped for export to a database.

100% accuracy is required for numbers in the tax forms, so I will have to supplement the ABBYY Cloud OCR output with human intervention. Since the ABBYY FineReader Engine Visual Components module is only available for Windows installations, I plan to code up a web-based human verification engine.

For this verification engine, I had planned to rely on the charConfidence attribute from the ABBYY XML Export to identify numbers requiring human verification, based on some pre-specified confidence threshold. However, I'm not sure this is sound, since many numbers that FineReader reads correctly (100% accuracy in my sample forms, incidentally) have low charConfidence values.

Can anyone give advice on this? I see two possible options:

  1. Only require human verification for very low (say, <20%) charConfidence values. My impression is that charConfidence is, on average, lower for numbers than for letters in any case (even though ABBYY FineReader has so far given me 100% accuracy for numbers in sample documents).
  2. Use the suspicious attribute to identify numbers requiring subsequent human verification. Essentially, I'm deferring to ABBYY's algorithms' binary judgment to identify potentially misread numbers.
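
To make the two options concrete, here is a minimal Python sketch of both filters. It assumes the XML export wraps each recognized character in a `charParams` element carrying `charConfidence` and, where applicable, `suspicious` attributes; the sample document, the namespace, the function name `flag_digits` and the threshold of 20 are illustrative assumptions, not taken from real ABBYY output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; real exports will be larger and may differ.
SAMPLE_XML = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
  <page><block><text><par><line><formatting>
    <charParams l="10" t="5" r="20" b="25" charConfidence="18">1</charParams>
    <charParams l="22" t="5" r="32" b="25" charConfidence="35" suspicious="true">7</charParams>
    <charParams l="34" t="5" r="44" b="25" charConfidence="62">A</charParams>
    <charParams l="46" t="5" r="56" b="25" charConfidence="9">4</charParams>
  </formatting></line></par></text></block></page>
</document>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def flag_digits(xml_text, threshold=None):
    """Return the digits that should go to a human reviewer.

    threshold=None -> option 2: trust the 'suspicious' attribute.
    threshold=N    -> option 1: flag digits whose charConfidence is below N
                      (digits with no charConfidence at all are flagged too).
    """
    flagged = []
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "charParams":
            continue
        char = (elem.text or "").strip()
        if not (char.isascii() and char.isdigit()):
            continue  # only numbers need verification in this scenario
        raw_conf = elem.get("charConfidence")
        confidence = int(raw_conf) if raw_conf is not None else None
        suspicious = elem.get("suspicious", "false").lower() in ("true", "1")
        if threshold is None:
            needs_review = suspicious
        else:
            needs_review = confidence is None or confidence < threshold
        if needs_review:
            flagged.append({
                "char": char,
                "confidence": confidence,
                "suspicious": suspicious,
                # l, t, r, b let a web UI highlight the character on the scan
                "box": tuple(int(elem.get(k, "0")) for k in ("l", "t", "r", "b")),
            })
    return flagged


option_1 = flag_digits(SAMPLE_XML, threshold=20)  # low-confidence digits
option_2 = flag_digits(SAMPLE_XML)                # digits marked suspicious
```

On this sample the two options select different characters, which is exactly the ambiguity behind the question: a digit can have low confidence without being marked suspicious, and vice versa.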

Many thanks.

asked 30 Dec '15, 03:25

Steph


Please see my answer to the similar question here.

Also for processing of numbers it can be useful to limit the alphabet of the recognition language, for example, using only the Digits language.

answered 30 Dec '15, 16:04

Oksana Serdyuk

Oksana,

Many thanks for your rapid response! I do appreciate it. What I gather from your answer is that the charConfidence value is NOT sufficient to identify whether a character requires subsequent human verification.

Based on this, I have 2 follow-up questions:

  1. If humans subsequently verify the characters deemed suspicious, do you believe this will get me close to a 100% accuracy level? Would doing so be considered "best practice" among ABBYY FineReader users?
  2. Since my forms are a mix of letters and numbers, if I am going to properly use the 'Digits' language, I believe I will have to use the processFields method, yes? That way, I can identify which parts of the document should use Language=English, and which parts of the document should use Language=Digits.

Many thanks.

answered 31 Dec '15, 03:42

Steph

  1. Yes, you can use the suspicious attribute for implementation of the verification step, but you should understand that the suspicious character may be either correctly recognized or not.
  2. If your usage scenario allows you to use the processTextField or the processFields methods, then you can significantly improve the recognition result by using the letterSet and regExp parameters. Please also refer to the following topic for more information - How to improve handprinted recognition? All of the recommendations there apply to printed text, too.
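
As a rough illustration of point 2, the sketch below only assembles the query string for a processTextField request that restricts a numeric field to digits; it sends nothing. The parameter names `region`, `language`, `letterSet` and `regExp` follow the method's documented parameters as I understand them, while the base URL, the helper name `text_field_query` and the field coordinates are assumptions made for the example. Note that `regExp` expects ABBYY's own pattern syntax, so none is guessed at here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint; check the address given for your own Cloud OCR account.
BASE_URL = "https://cloud.ocrsdk.com/processTextField"


def text_field_query(region, language="English", letter_set=None, reg_exp=None):
    """Build a processTextField URL for one field of the form.

    region     -- (left, top, right, bottom) of the field in image pixels
    letter_set -- the only characters the engine may output for this field
    reg_exp    -- optional pattern in ABBYY's own syntax, passed through as-is
    """
    params = {
        "region": ",".join(str(v) for v in region),
        "language": language,
    }
    if letter_set is not None:
        params["letterSet"] = letter_set
    if reg_exp is not None:
        params["regExp"] = reg_exp
    return BASE_URL + "?" + urlencode(params)


# Hypothetical coordinates of a numeric box on a tax form.
amount_url = text_field_query((100, 200, 400, 240),
                              language="Digits",
                              letter_set="0123456789")

# Decode the query again to show what would be sent.
amount_params = parse_qs(urlsplit(amount_url).query)
print(amount_params)
```

Text fields elsewhere on the same form would be requested separately with `language="English"`, so each field gets the alphabet that suits it.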

answered 31 Dec '15, 12:19

Oksana Serdyuk

© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal