29 December 2015
- Last edited 30 December 2015
During layout analysis, the coordinates of text areas, lines, and individual characters are detected. After character separation, each character is recognized by several text recognition classifiers.
The recognition confidence of a character image is a numerical estimate of the probability that the image does in fact represent this character. When recognizing a character, the program provides several recognition variants which are ranked by their confidence values. For example, an image of the letter "e" may be recognized
- as the letter "e" with a confidence of 95,
- as the letter "c" with a confidence of 85,
- as the letter "o" with a confidence of 65, etc.
By default, the hypothesis with the highest confidence rating is selected as the recognition result. However, the choice also depends on the context (i.e. the word to which the character belongs) and on the results of a differential comparison. For example, if the word formed with the "e" hypothesis is not a dictionary word while the word formed with the "c" hypothesis is, the latter is selected as the recognition result, even though its confidence rating remains 85. The remaining recognition variants can be obtained as hypotheses.
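The selection logic described above can be illustrated with a short Python sketch. It is not the program's actual implementation: the `Variant` class, the `choose_variant` function, and the tiny dictionary are hypothetical names introduced only for this example, and the differential comparison step is left out. The sketch ranks the variants by confidence and prefers the highest-ranked variant that forms a dictionary word, falling back to the top-ranked variant otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One recognition variant of a character image (hypothetical type)."""
    char: str
    confidence: int


def rank_variants(variants):
    """Return the variants ordered from highest to lowest confidence."""
    return sorted(variants, key=lambda v: v.confidence, reverse=True)


def choose_variant(prefix, variants, suffix, dictionary):
    """Pick the highest-ranked variant that yields a dictionary word.

    Assumption: context is modelled as a plain dictionary lookup of
    prefix + variant + suffix. If no variant forms a dictionary word,
    the variant with the highest confidence is returned.
    """
    ranked = rank_variants(variants)
    for variant in ranked:
        if (prefix + variant.char + suffix).lower() in dictionary:
            return variant
    return ranked[0]


# Variants from the example in the text: "e" (95), "c" (85), "o" (65).
variants = [Variant("o", 65), Variant("e", 95), Variant("c", 85)]
dictionary = {"clock", "block"}  # toy dictionary, for illustration only

# The unknown character is followed by "lock": "elock" is not a word,
# "clock" is, so "c" wins although its confidence stays at 85.
chosen = choose_variant("", variants, "lock", dictionary)
others = [v for v in rank_variants(variants) if v != chosen]
```

Here `chosen` holds the selected result and `others` holds the remaining variants, which correspond to the hypotheses mentioned in the text.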
The suspicious property is a Boolean property. When it is set to TRUE, the character was recognized unreliably. The value is determined by an algorithm that takes into account a number of parameters, such as the recognition confidence of the character, nearby characters and their recognition confidence, hypotheses and their recognition confidence, the geometric parameters of the character, and the context (i.e. the word to which the character belongs).
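To give a feel for how several of these parameters might be combined, here is a deliberately simplified Python sketch. The actual algorithm is not documented here, so everything in the example is an assumption: the function name `is_suspicious`, the three thresholds, and the order of the checks are hypothetical, and geometric parameters are ignored entirely.

```python
def is_suspicious(best, runner_up, neighbor_confidences, in_dictionary,
                  min_confidence=70, min_margin=15, min_neighbor=60):
    """Toy heuristic for flagging an unreliably recognized character.

    best                  -- confidence of the selected variant
    runner_up             -- confidence of the next hypothesis, or None
    neighbor_confidences  -- confidences of nearby characters
    in_dictionary         -- whether the containing word is a dictionary word
    All thresholds are illustrative assumptions, not real defaults.
    """
    # A low confidence on its own is enough to flag the character.
    if best < min_confidence:
        return True
    # A close competing hypothesis is tolerated only with dictionary support.
    if runner_up is not None and best - runner_up < min_margin:
        return not in_dictionary
    # Weakly recognized neighbors also need dictionary support.
    if neighbor_confidences and min(neighbor_confidences) < min_neighbor:
        return not in_dictionary
    return False
```

For instance, with the confidences from the earlier example (95 for the selected variant, 85 for the next hypothesis), this sketch flags the character only when the containing word is not found in the dictionary.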
For more details, please refer to the OCR Accuracy Measurement article.