Hi,

I am evaluating the ABBYY Cloud OCR SDK to extract data from PDFs. Right now I am using the ProcessTextField method to extract data from a specified region. In the output I get the complete text as well as XML for each character with its Confidence and Suspicious value (true or false). In my testing, the OCR sometimes reads a character incorrectly and does not mark it as suspicious, even though its confidence value is below 50%; in other cases it marks a character as suspicious even though its confidence value is higher.

So can you tell me exactly how I can use the Confidence/Suspicious values to inform the user that a particular character may be incorrect?

Any guidance would be helpful.

Thanks,
Nayan Parekh

asked 29 Dec '15, 09:09


Nayan_32


During layout analysis, the coordinates of text areas, lines, and single characters are detected. After character separation, each character is recognized with different text recognition classifiers.

The recognition confidence of a character image is a numerical estimate of the probability that the image does in fact represent this character. When recognizing a character, the program provides several recognition variants which are ranked by their confidence values. For example, an image of the letter "e" may be recognized

  • as the letter "e" with a confidence of 95,
  • as the letter "c" with a confidence of 85,
  • as the letter "o" with a confidence of 65, etc.

Normally, the hypothesis with the highest confidence rating is selected as the recognition result. However, the choice also depends on the context (i.e. the word to which the character belongs) and on the results of a differential comparison. For example, if the word with the "e" hypothesis is not a dictionary word while the word with the "c" hypothesis is a dictionary word, the latter will be selected as the recognition result, even though its confidence rating will still be 85. The remaining recognition variants can be obtained as hypotheses.
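To illustrate how context can override the top-ranked hypothesis, here is a minimal Python sketch. It is not the engine's actual algorithm: the function name `pick_word`, the brute-force enumeration, and the "dictionary word first, then highest total confidence" rule are assumptions made purely for illustration.

```python
from itertools import product


def pick_word(hypotheses_per_char, dictionary):
    """Choose one hypothesis per character position.

    hypotheses_per_char: list (one entry per character image) of lists of
        (character, confidence) pairs.
    dictionary: set of known words.

    Illustrative rule only: prefer candidate words found in the dictionary,
    and break ties by the sum of the per-character confidences.
    """
    candidates = []
    for combo in product(*hypotheses_per_char):
        word = "".join(char for char, _ in combo)
        total = sum(conf for _, conf in combo)
        candidates.append((word in dictionary, total, word, combo))

    # Compare on (is dictionary word, total confidence) only.
    _, _, word, combo = max(candidates, key=lambda c: (c[0], c[1]))

    # Each chosen character keeps its own confidence value unchanged.
    return word, [conf for _, conf in combo]


if __name__ == "__main__":
    hypotheses = [
        [("e", 95), ("c", 85), ("o", 65)],  # first character image
        [("a", 98)],
        [("t", 97)],
    ]
    # With a dictionary containing only "cat", the "c" hypothesis wins
    # although "e" has the higher confidence.
    print(pick_word(hypotheses, {"cat"}))
    # With no dictionary support, the highest-confidence variant is kept.
    print(pick_word(hypotheses, set()))
```

Note that in the first call the selected "c" still reports a confidence of 85, which is why a correctly chosen character can carry a lower confidence value than a rejected variant.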

The suspicious property is a Boolean property. When it is set to TRUE, the character was recognized unreliably. This property is determined by an algorithm which takes into account a number of parameters, such as the recognition confidence of the character, nearby characters and their recognition confidence, hypotheses and their recognition confidence, the geometric parameters of the character, and context (i.e. the word to which the character belongs).
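Because the two values are computed differently, one possible way to warn the user is to treat a character as doubtful when either signal fires. The sketch below assumes the per-character values have already been read from the output XML into simple objects; the `OcrChar` class, the function names, and the threshold of 50 are hypothetical choices for illustration, not an ABBYY recommendation, and the threshold should be tuned on your own documents.

```python
from dataclasses import dataclass


@dataclass
class OcrChar:
    """One recognized character (hypothetical container, not an SDK type)."""
    text: str
    confidence: int   # recognition confidence reported for the character
    suspicious: bool  # the engine's own "recognized unreliably" flag


def doubtful_positions(chars, min_confidence=50):
    """Return indexes of characters that should be shown to the user for review.

    A character is flagged if the engine marked it suspicious OR its
    confidence is below the application-chosen threshold.
    """
    return [
        i for i, ch in enumerate(chars)
        if ch.suspicious or ch.confidence < min_confidence
    ]


def highlight(chars, min_confidence=50):
    """Render the text with doubtful characters wrapped in square brackets."""
    flagged = set(doubtful_positions(chars, min_confidence))
    return "".join(
        f"[{ch.text}]" if i in flagged else ch.text
        for i, ch in enumerate(chars)
    )


if __name__ == "__main__":
    field = [
        OcrChar("1", 96, False),
        OcrChar("O", 42, False),  # low confidence, not marked suspicious
        OcrChar("5", 88, True),   # marked suspicious despite high confidence
    ]
    print(highlight(field))  # 1[O][5]
```

Combining the two with OR errs on the side of asking the user to review more characters; if that produces too many false alarms, the threshold can be lowered or the rule restricted to the suspicious flag alone.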

Also please refer to the OCR Accuracy Measurement article for some more details.


answered 29 Dec '15, 14:18


Oksana Serdyuk ♦♦

edited 30 Dec '15, 15:54



© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal