I'd much appreciate some advice on identifying FineReader-scanned elements that might require human verification. I am working on a web-based application in which tax forms are sent to ABBYY Cloud OCR; the output is then scraped for export to a database.
100% accuracy is required for numbers in the tax forms, so I will have to supplement the ABBYY Cloud OCR output with human intervention. Since the ABBYY FineReader Engine Visual Components module is only available for Windows installations, I plan to code up a web-based human verification engine.
For this verification engine, I had planned to rely on the charConfidence attribute in the ABBYY XML export to identify numbers requiring human verification, based on some pre-specified confidence threshold. However, I'm not sure this is appropriate, since many numbers that FineReader reads correctly (100% accuracy in my sample forms, incidentally) have low charConfidence values.
Can anyone give advice on this? I see two possible options:
- Only require human verification for very low (say, <20%) charConfidence values. My impression is that charConfidence is, on average, lower for numbers than for letters anyway (even though ABBYY FineReader has so far given me 100% accuracy for numbers in sample documents).
- Use the suspicious attribute to identify numbers requiring subsequent human verification. Essentially, I'm deferring to ABBYY's algorithms' binary judgment to identify potentially misread numbers.
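For what it's worth, here is a rough sketch of how either option could be implemented against the XML export. The element/attribute names (charParams, charConfidence, suspicious) follow my reading of the ABBYY XML schema, and the threshold value is just a placeholder to be tuned against sample forms:

```python
# Hypothetical sketch: parse an ABBYY FineReader XML export and flag digits
# that may need human verification, under either strategy described above.
import xml.etree.ElementTree as ET

CONFIDENCE_THRESHOLD = 20  # assumed cutoff; tune against sample forms


def flag_chars(xml_text, use_suspicious=False):
    """Return (char, reason) pairs for digits that should be verified."""
    root = ET.fromstring(xml_text)
    flagged = []
    # The export may declare a namespace; match on the local tag name only.
    for elem in root.iter():
        if not elem.tag.endswith("charParams"):
            continue
        char = (elem.text or "").strip()
        if not char.isdigit():
            continue  # only numbers need 100% accuracy here
        if use_suspicious:
            # Option 2: defer to ABBYY's binary judgment
            if elem.get("suspicious") == "1":
                flagged.append((char, "suspicious"))
        else:
            # Option 1: threshold on charConfidence (0-100, -1 = unknown)
            conf = int(elem.get("charConfidence", "-1"))
            if conf < CONFIDENCE_THRESHOLD:
                flagged.append((char, "charConfidence=%d" % conf))
    return flagged
```

The same pass could of course record coordinates from each charParams element so the verification UI can show the operator the original image region alongside the flagged digit.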