I am evaluating the ABBYY Cloud OCR SDK to extract data from PDFs. Right now I am using the ProcessTextField method to extract data from a specified region. In the output I get the complete text as well as XML for each character, with its Confidence and Suspicious value (true or false). In my tests, the OCR sometimes reads a character incorrectly and does not mark it as suspicious, even though its confidence value is below 50%; in other cases it marks a character as suspicious even though its confidence value is higher.
So can you tell me exactly how I can use the Confidence/Suspicious values to indicate to the user that a particular character may be incorrect?
Any guidance would be helpful.
Thanks Nayan Parekh
asked 29 Dec '15, 09:09
During layout analysis, the coordinates of text areas, lines, and individual characters are detected. After character separation, each character is recognized by different text recognition classifiers.
The recognition confidence of a character image is a numerical estimate of the probability that the image does in fact represent this character. When recognizing a character, the program provides several recognition variants which are ranked by their confidence values. For example, an image of the letter "e" may be recognized
The hypothesis with the highest confidence rating is selected as the recognition result. But the choice also depends on the context (i.e. the word to which the character belongs) and the results of a differential comparison. For example, if the word with the "e" hypothesis is not a dictionary word while the word with the "c" hypothesis is a dictionary word, the latter will be selected as the recognition result, even though its confidence rating will still be 85. The rest of the recognition variants can be obtained as hypotheses.
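To make the role of context concrete, here is a toy sketch of choosing between character hypotheses with a dictionary check. It is not ABBYY's actual algorithm, which is not public in this thread; the function name `pick_word`, the word "clock", and the confidence of 90 for "e" are made-up values for illustration (only the 85 for "c" comes from the text above).

```python
from typing import Dict, Set


def pick_word(prefix: str, suffix: str,
              hypotheses: Dict[str, int], dictionary: Set[str]) -> str:
    """Pick a character hypothesis, preferring ones that form a dictionary word.

    hypotheses maps each candidate character to its recognition confidence.
    This is an illustrative rule only, not the engine's real logic.
    """
    # Rank candidates from highest to lowest confidence.
    ranked = sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)

    # Context check: the first candidate that yields a dictionary word wins,
    # even if a non-dictionary candidate has a higher confidence.
    for char, _confidence in ranked:
        word = prefix + char + suffix
        if word.lower() in dictionary:
            return word

    # No candidate forms a dictionary word: fall back to raw confidence.
    return prefix + ranked[0][0] + suffix


# Hypothetical values: "e" scores 90, "c" scores 85, and only "clock" is a word.
chosen = pick_word("", "lock", {"e": 90, "c": 85}, {"clock"})
print(chosen)
```

In this sketch the selected character keeps its own confidence of 85, which is one way a correctly chosen character can end up with a lower confidence than a rejected hypothesis.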
Suspicious is a Boolean property. When it is set to TRUE, the character was recognized unreliably. Its value is determined by an algorithm that takes into account a number of parameters, such as the recognition confidence of the character, nearby characters and their recognition confidence, the hypotheses and their recognition confidence, the geometric parameters of the character, and context (i.e. the word to which the character belongs).
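Since the two values measure different things, one conservative option is to flag a character for review when either signal fires. The sketch below assumes the result XML contains one `char` element per character with `confidence` and `suspicious` attributes; those element and attribute names, the sample XML, and the function `doubtful_chars` are assumptions for illustration, so check them against the XML your request actually returns. The threshold of 50 is taken from the question and should be tuned on your own documents.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical sample; the real output may use different names or a namespace.
SAMPLE_XML = """<field>
  <line>
    <char confidence="96" suspicious="false">A</char>
    <char confidence="42" suspicious="false">8</char>
    <char confidence="71" suspicious="true">O</char>
  </line>
</field>"""


def doubtful_chars(xml_text: str, threshold: float = 50,
                   tag: str = "char",
                   conf_attr: str = "confidence",
                   susp_attr: str = "suspicious"
                   ) -> List[Tuple[int, str, float, bool]]:
    """Return (index, char, confidence, suspicious) for characters to review.

    A character is flagged if it is marked suspicious OR its confidence
    is below the threshold (the union of both signals).
    """
    root = ET.fromstring(xml_text)
    flagged = []
    index = 0
    for elem in root.iter():
        # Drop any "{namespace}" prefix so namespaced output also matches.
        if elem.tag.rsplit("}", 1)[-1] != tag:
            continue
        # Assumption: a missing confidence is treated as 0, i.e. doubtful.
        confidence = float(elem.get(conf_attr, "0"))
        suspicious = elem.get(susp_attr, "false").strip().lower() == "true"
        if suspicious or confidence < threshold:
            flagged.append((index, elem.text or "", confidence, suspicious))
        index += 1
    return flagged


for position, char, confidence, suspicious in doubtful_chars(SAMPLE_XML):
    print(position, repr(char), confidence, suspicious)
```

Flagging on the union errs on the side of asking the user to check more characters; if that produces too many false alarms, raising or lowering the threshold on a sample of verified documents is a reasonable next step.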
Also please refer to the OCR Accuracy Measurement article for some more details.
This answer is marked "community wiki".