I am processing mobile photos.

I have noticed that the "confidence" attribute is not provided for chars when I use processImage. Is this only provided for processing text fields?

Also, I usually get the best results from processImage with the "documentConversion" profile: it usually returns the correct text and skips incorrect text. When I switch to the "textExtraction" profile I expect better text, but instead it just adds a lot of noise. Is this expected?

asked 18 May '12, 06:55

stephenab

edited 09 Jun '12, 11:27

Vasily Panferov ♦♦


XML is the only output format that exposes confidence information for processImage. You need to parse the XML; uncertain characters carry a suspicious="1" attribute.

E.g.:

<charParams b="64" r="214" t="51" l="205">T</charParams> 
<charParams b="64" r="229" t="52" l="216" suspicious="1">H</charParams>
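A minimal sketch of that parsing step in Python, using only the standard library. It assumes the l/t/r/b attributes are the character's bounding box and treats any charParams element carrying a suspicious attribute as uncertain. The wrapping `<line>` element in the sample and the helper name `suspicious_chars` are made up for illustration; a real result file has more structure around these elements and may use an XML namespace, which the code strips before comparing tag names.

```python
import xml.etree.ElementTree as ET

# Sample built from the two charParams elements above; the <line> wrapper
# is only here so the snippet is a well-formed document.
SAMPLE = """<line>
<charParams b="64" r="214" t="51" l="205">T</charParams>
<charParams b="64" r="229" t="52" l="216" suspicious="1">H</charParams>
</line>"""


def suspicious_chars(xml_text):
    """Return (character, (l, t, r, b)) for every charParams flagged suspicious."""
    root = ET.fromstring(xml_text)
    flagged = []
    for element in root.iter():
        # Drop a "{namespace}" prefix if the document declares one.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag != "charParams":
            continue
        # The flag is either present or absent, so only test for presence.
        if "suspicious" not in element.attrib:
            continue
        box = tuple(int(element.get(name)) for name in ("l", "t", "r", "b"))
        flagged.append((element.text, box))
    return flagged


if __name__ == "__main__":
    for char, box in suspicious_chars(SAMPLE):
        print(char, box)
```

Only the "H" is reported for the sample, since the "T" has no suspicious attribute.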

The "textExtraction" profile is optimized to extract as much text from the document as possible. The recognized text is intended for search scenarios, e.g. when you need to add an image to a full-text search database and later find the document by typing one or more words from it. Extra noise is normal here, because noise is not considered very harmful in this scenario.

The "documentConversion" profile is optimized for text reuse. It allows reconstruction of the page layout, formatting and other page elements. That is why it is the default processing profile.
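To compare the two profiles on the same photo, it can help to build both request URLs side by side. The sketch below only assembles the URL and sends nothing. The base URL, the query parameter names `profile` and `exportFormat`, and the helper `build_process_image_url` are assumptions for illustration; check them against the processImage documentation for your account.

```python
from urllib.parse import urlencode

# Assumed endpoint; replace with the one given in your account's documentation.
BASE_URL = "https://cloud.ocrsdk.com/processImage"


def build_process_image_url(profile="documentConversion", export_format="xml"):
    """Build a processImage URL for the given profile (parameter names assumed)."""
    query = urlencode({"profile": profile, "exportFormat": export_format})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    # Same image, two profiles: compare the XML results afterwards.
    for profile in ("documentConversion", "textExtraction"):
        print(build_process_image_url(profile))
```

Requesting XML for both runs keeps the suspicious flags available, so the extra noise from "textExtraction" can be inspected character by character.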

answered 18 May '12, 08:00

Vasily Panferov ♦♦

Thanks for your answer, that is helpful. Regarding confidence, I am wondering about the difference between "suspicious" and "confidence." In your example here you provide confidence as a number between 1 and 100:

http://ocrsdk.com/documentation/quick-start/text-fields/

However, suspicious seems to be 1 or not-present. What is the reason for the difference?

(18 May '12, 08:12) stephenab

"Suspicious" is a bit flag: it is either present or not. If it is present, the recognition engine is not sure whether the character was recognized correctly.

Confidence is an int from 1 to 100. It represents the degree of similarity between the recognized character and how the recognizer expects it to look.

The "confidence" attribute is quite confusing; we plan to replace it with "suspicious" in all text-field processing.

(18 May '12, 08:20) Vasily Panferov ♦♦

How feasible is it to annotate PDF output with confidence metrics? For example, by producing both XML and PDF, could one reasonably extract the low-confidence ranges from the XML and work out where this attribute should be inserted into the PDF? Am I right to assume that the XML only tells you which page the text appears on (not where on the page), or does layout analysis break the page down into text blocks, so that recognition confidence issues are associated with a text block? Thanks for any help.

answered 07 Oct '13, 23:58

rokahn

Asked: 18 May '12, 06:55

Seen: 4,313 times

Last updated: 07 Oct '13, 23:58

© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal