confidence attribute, processing profile

stephenab posted this 18 May 2012

I am processing mobile photos.

I have noticed that the "confidence" attribute is not provided for chars when I use processImage. Is this only provided for processing text fields?

Also, I usually get the best results for processImage with a profile of "documentConversion" -- this usually includes correct text, and skips incorrect text. When I switch to a "textExtraction" profile I expect better text, but instead it just adds a lot of noise. Is this unexpected?

Vasily Panferov posted this 18 May 2012

The only output format that provides confidence information for processImage is XML. You need to parse the XML; uncertain characters carry a suspicious="1" attribute.

E.g.:

<charParams b="64" r="214" t="51" l="205">T</charParams> 
<charParams b="64" r="229" t="52" l="216" suspicious="1">H</charParams>
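A minimal sketch of pulling the flagged characters out of such XML, using only Python's standard library. The function name `suspicious_chars` and the wrapping `<line>` element in the sample are assumptions made for illustration; a real result file has more structure around the `charParams` elements and may use an XML namespace, which is why the tag is compared by local name.

```python
import xml.etree.ElementTree as ET


def suspicious_chars(xml_text):
    """Yield (char, (l, t, r, b)) for each charParams element flagged suspicious."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Compare on the local name so this works with or without a namespace.
        if elem.tag.rsplit("}", 1)[-1] != "charParams":
            continue
        # The flag is either present or absent; "1" is the value shown above,
        # "true" is accepted as an assumed alternative spelling.
        if elem.get("suspicious") in ("1", "true"):
            box = tuple(int(elem.get(k)) for k in ("l", "t", "r", "b"))
            yield elem.text, box


# Hypothetical wrapper element so the two sample lines form well-formed XML.
sample = """<line>
<charParams b="64" r="214" t="51" l="205">T</charParams>
<charParams b="64" r="229" t="52" l="216" suspicious="1">H</charParams>
</line>"""

print(list(suspicious_chars(sample)))  # [('H', (216, 52, 229, 64))]
```

Each result carries the character's l/t/r/b values from the XML, so a flagged character can be traced back to its position in the image.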

The "textExtraction" profile is optimized to extract as much text from the document as possible. The recognized text is intended for search scenarios, e.g. when you need to add an image to a full-text search database so that you can later find the document by typing one or more words from it. More noise is therefore normal, because noise is not considered very harmful in this scenario.

The "documentConversion" profile is optimized for text reuse. It allows reconstruction of the page layout, formatting and other page elements. That is why it is the default processing profile.

stephenab posted this 18 May 2012

Thanks for your answer, that is helpful. Regarding confidence, I am wondering about the difference between "suspicious" and "confidence." In your example here you provide confidence as a number between 1 and 100:

http://ocrsdk.com/documentation/quick-start/text-fields/

However, suspicious seems to be 1 or not-present. What is the reason for the difference?

Vasily Panferov posted this 18 May 2012

"Suspicious" is a bit flag. It is either present or not. If it is present, it means the recognition engine is not sure whether the character was recognized correctly.

Confidence is an integer from 1 to 100. It represents the degree of similarity between the recognized character and how the recognizer expects it to look.

The "confidence" attribute is quite confusing; we plan to replace it with "suspicious" in all text-field processing.
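Until then, client code that handles both kinds of output may want to treat the two signals uniformly. A rough sketch, assuming the character attributes have already been parsed into a dict of strings; the helper name `is_uncertain` and the cut-off of 50 are hypothetical choices for illustration, not values recommended by the SDK.

```python
from typing import Mapping


def is_uncertain(attrs: Mapping[str, str], min_confidence: int = 50) -> bool:
    """Return True if a character's attributes mark it as uncertain.

    attrs is a dict of XML attribute names to string values for one character.
    min_confidence is an arbitrary, assumed threshold on the 1..100 scale.
    """
    # Flag style: the attribute is simply present ("1") or absent.
    if attrs.get("suspicious") in ("1", "true"):
        return True
    # Numeric style: an integer from 1 to 100, compared against the threshold.
    confidence = attrs.get("confidence")
    if confidence is not None:
        return int(confidence) < min_confidence
    # Neither attribute present: nothing marks the character as uncertain.
    return False


print(is_uncertain({"suspicious": "1"}))   # True
print(is_uncertain({"confidence": "90"}))  # False
```

The threshold would need tuning per use case, since the numeric score measures similarity to the expected glyph shape and does not map directly onto the flag.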

rokahn posted this 07 October 2013

How feasible is it to annotate PDF output with confidence metrics? For example, by producing both XML and PDF, may one reasonably extract low-confidence ranges from the XML and figure out where this attribute should be inserted into the PDF? Do I assume correctly that the XML only tells you on which page the text appears (not where on the page), or does layout analysis break the page down into text blocks so that recognition confidence issues are associated with a text block? Thanks for any help.
