charConfidence randomness

  • Last Post 10 January 2018
evoflo posted this 21 December 2017

Please take a look at the following row from my table.

The cells contain either HG1/+1° or HG1/0°.

Although the text quality is good and looks the same in each cell, I get many different results.

hgi/i° is recognized but HGI/+I° is shown (in the text editor of the visual components). Looking at the charConfidence of the charRecVariants, I can see that 1, i, I and l all have charConfidence = 100.

  • I understand that "h" may be chosen because it "fits" better but why does it have charConfidence = 100 in the first place (which is the same as "H" which obviously has higher similarity)?
  • The same goes for the letters "l", "I" and "i" which have charConfidence = 100 but are not as similar as "1" to what is shown in the image.
  • If I copy the displayed "H" from the editor and paste it, I get "h". Why?

HGi/+r and HG1/+T are recognized instead of HG1/+1°

  • T, t and r all have a confidence value of 12 which is very low. Why does the recognition of ° work in some cells but not here?

HG1/00 is recognized instead of HG1/0°.

  • I understand that the engine might think that another 0 would make more sense. However the confidence values differ very much. "0" has a value of 13 while ° has 77...

I guess my main questions are:

  • Is this a known issue? Is someone working on it?
  • Can I do something about this?
  • Would it help to send the file to you?

 

Oksana Serdyuk posted this 25 December 2017

We wish you a Merry Christmas and a Happy New Year!

There are a few different notions in FineReader Engine that represent the recognition confidence of each character:

- ICharacterRecognitionVariant::CharConfidence is a numerical estimate of the probability that the image does in fact represent this character. Every character recognition variant has this property.

- You can also set the IRecognizerParams::ExactConfidenceCalculation property to TRUE to have the character confidence calculated more accurately, but please note that recognition may become slower in this case.

In terms of implementation, the recognizer has several classifiers: cache, raster, omnifont and outline. A character is recognized by running the classifiers sequentially. If any classifier returns a high internal confidence, the remaining classifiers are not started, because the recognizer already has enough information. So, very roughly, a character may carry weights from one of the following sets of classifiers: cache; cache + raster; cache + raster + omnifont; or cache + raster + omnifont + outline.

However, our customers often want to compare recognition results, and for that purpose an external confidence is needed. It is calculated, very roughly, as (cache + raster + omnifont + outline)/4. If the weight of a classifier is missing, the external confidence is inaccurate, because some average value is substituted for the absent classifier. The ExactConfidenceCalculation flag forces the program to calculate the weights of all classifiers for the character before computing the external confidence. This makes the external confidence the same regardless of which classifiers actually recognized the character inside the recognizer.
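To make the description above more concrete, here is a minimal Python sketch of a sequential cascade with early exit and the "very conditional" averaging. It is not FineReader Engine code: the threshold, the fallback value, the sample weights and all function names are assumptions chosen only for illustration.

```python
from statistics import mean

# Classifier order as described in the post above.
CLASSIFIERS = ["cache", "raster", "omnifont", "outline"]

HIGH_CONFIDENCE = 90  # hypothetical early-exit threshold
FALLBACK_WEIGHT = 50  # hypothetical "average value" for a classifier that did not run


def run_cascade(weights, exact=False):
    """Run classifiers in order; stop after a high weight unless exact=True."""
    seen = {}
    for name in CLASSIFIERS:
        seen[name] = weights[name]
        if not exact and weights[name] >= HIGH_CONFIDENCE:
            break  # enough information for recognition, skip the rest
    return seen


def external_confidence(seen):
    """Average over all four classifiers, substituting a fallback for absent ones."""
    return mean(seen.get(name, FALLBACK_WEIGHT) for name in CLASSIFIERS)


# Made-up weights for one character variant.
weights = {"cache": 95, "raster": 40, "omnifont": 30, "outline": 35}

fast = external_confidence(run_cascade(weights))               # only "cache" ran
exact = external_confidence(run_cascade(weights, exact=True))  # all four ran
print(fast, exact)
```

In this toy setup the early exit yields 61.25, while the exact calculation yields 50, so the two values for the same character are only comparable when every classifier has contributed a weight.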

- ICharParams::IsSuspicious is a property of a single character. If it is set to TRUE, the character was recognized unreliably; otherwise it is set to FALSE. Note, however, that IsSuspicious set to TRUE does not always mean that the character was recognized incorrectly; it only means that it was recognized with uncertainty.

Please find more information about the difference between the CharConfidence and IsSuspicious properties at http://knowledgebase.abbyy.com/article/712 or in the Developer’s Help, under Frequently Asked Questions: "What is the difference between the CharConfidence and the IsSuspicious properties?" Once again, please note that the CharConfidence property is not directly connected with recognition accuracy. Most of the recognition relies on context. So the correct recognition variant can have a low confidence and still be the right one, as long as the wrong variants have the same or a lower confidence.
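The role of context can be illustrated with a small Python sketch. It assumes a plain dictionary lookup as a stand-in for the engine's context check, which is not described in detail in this thread; the function name, the sample word and the confidence values are hypothetical.

```python
from itertools import product


def best_reading(positions, dictionary):
    """Pick one variant per position.

    positions: list of lists of (char, confidence) variants.
    A reading found in the dictionary beats any reading that is not;
    otherwise the higher total confidence wins. Ties keep the first
    reading encountered.
    """
    candidates = []
    for combo in product(*positions):
        text = "".join(ch for ch, _ in combo)
        total = sum(conf for _, conf in combo)
        candidates.append((text in dictionary, total, text))
    return max(candidates, key=lambda c: (c[0], c[1]))[2]


# "1" and "i" have the same confidence at the second position.
positions = [
    [("f", 100)],
    [("1", 100), ("i", 100)],
    [("l", 100)],
    [("e", 100)],
]

print(best_reading(positions, {"file"}))  # context can separate the variants
print(best_reading(positions, set()))     # no context: the tie is resolved arbitrarily
```

With the dictionary available the sketch returns "file"; without it the variants cannot be told apart by confidence and the first one, "f1le", is kept.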

evoflo posted this 09 January 2018

Thank you for the detailed answer. I think it's very interesting to know how the algorithm works.

Unfortunately, this does not answer my question. I will try to make my issue clearer.

If I have two identical cells (HG1/+1°), how can it be that I receive two completely different results? I guess that in the case of "+T" the part "1°" is analysed as one character and therefore the variants 1, I, i, l do not occur.

The questions from my first post mainly refer to this issue.

Oksana Serdyuk posted this 10 January 2018

It is possible that there is a little noise near a character which, during binarization, is treated as part of the recognized character. Depending on the total amount of black and white regions, the characters are then separated or cropped differently, and as a result one character recognition variant can be more confident than another in some cases. Moreover, in your situation the strings are not words from the language dictionary, so a wrong recognition variant cannot be rejected during context checking.
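As a rough illustration of how a single noise pixel can change character separation, here is a Python sketch of a simple column-projection split on a binarized bitmap. This is a textbook technique used as an assumption, not the engine's actual segmentation; the tiny bitmaps and the function name are made up.

```python
def segment_columns(bitmap):
    """Split a binary bitmap (rows of 0/1) into runs of columns that contain ink."""
    width = len(bitmap[0])
    ink = [any(row[x] for row in bitmap) for x in range(width)]
    segments, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                        # a new run of inked columns begins
        elif not has_ink and start is not None:
            segments.append((start, x - 1))  # an empty column closes the run
            start = None
    if start is not None:
        segments.append((start, width - 1))
    return segments


# A tall stroke ("1") in column 1 and a small blob ("°") in columns 3-4.
clean = [
    [0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]

# The same cell with one noise pixel in the gap between the two shapes.
noisy = [row[:] for row in clean]
noisy[1][2] = 1

print(segment_columns(clean))  # two separate characters
print(segment_columns(noisy))  # one merged region
```

In the clean bitmap the split gives two regions, (1, 1) and (3, 4); with the noise pixel it gives a single region (1, 4), which a classifier would then have to interpret as one character.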

For more details, we would need to look at the actual image and analyze the case. Could you please send your source image to your local Technical Support at TechSupport_eu@abbyy.com?
