I am processing mobile photos.
I have noticed that the "confidence" attribute is not provided for characters when I use processImage. Is it only provided when processing text fields?
Also, I usually get the best results from processImage with the "documentConversion" profile: the output usually includes the correct text and skips the incorrect text. When I switch to the "textExtraction" profile I expect better text, but instead it just adds a lot of noise. Is this expected?
The only output format that carries confidence information for processImage is XML. You need to parse the XML; uncertain characters are marked with the attribute suspicious="1".
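As a rough illustration of that parsing step, here is a minimal Python sketch that collects every element marked as suspicious. Only the suspicious="1" attribute comes from the answer above. The sample document, the element name charParams, and the l/t/r/b coordinate attributes are assumptions made for the example, not confirmed service output. The check for "true" is also only a precaution, in case a schema version writes the flag that way.

```python
import xml.etree.ElementTree as ET


def suspicious_chars(xml_text):
    """Return (text, attributes) for every element flagged as suspicious."""
    root = ET.fromstring(xml_text)
    flagged = []
    # Walk every element so the code does not depend on a particular
    # element name or XML namespace (both are assumptions here).
    for elem in root.iter():
        # "1" is the value named in the answer; "true" is an assumed variant.
        if elem.get("suspicious") in ("1", "true"):
            flagged.append(((elem.text or "").strip(), dict(elem.attrib)))
    return flagged


# Hypothetical sample, NOT real service output: element and attribute
# names other than "suspicious" are illustrative assumptions.
SAMPLE = """<document>
  <page>
    <line>
      <charParams l="10" t="5" r="18" b="20">H</charParams>
      <charParams l="19" t="5" r="27" b="20" suspicious="1">e</charParams>
      <charParams l="28" t="5" r="36" b="20">y</charParams>
    </line>
  </page>
</document>"""

if __name__ == "__main__":
    for char, attrs in suspicious_chars(SAMPLE):
        print(char, attrs)
```

Because the sketch matches on the attribute rather than on the element name, it should keep working if the real output wraps its elements in a namespace, although that has to be checked against an actual result file.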
The "textExtraction" profile is optimized to extract as much text from the document as possible. The recognized text is intended for search scenarios, e.g. when you add an image to a full-text search database so that the document can later be found by typing one or more words from it. More noise is normal with this profile, because noise is not considered very harmful in this scenario.
The "documentConversion" profile is optimized for text reuse. It allows reconstruction of the page layout, formatting and other page elements. That is why it is the default processing profile.
answered 18 May '12, 08:00
Vasily Panferov ♦♦
How feasible is it to annotate PDF output with confidence metrics? For example, by producing both XML and PDF, could one reasonably extract the low-confidence ranges from the XML and work out where this attribute should be inserted into the PDF? Am I right to assume that the XML only tells you which page the text appears on (not where on the page), or does layout analysis break the page down into text blocks, so that recognition confidence issues are associated with a text block? Thanks for any help.
answered 07 Oct '13, 23:58