An image of one phonebook page column is at http://digitalfire.com/culiacan/pictures/309.jpg and was scanned at 600 dpi, then resized in Photoshop to 300 dpi (without resampling). We are passing a parameter to read Spanish.

The recognizer is not getting the phone numbers correct on the last 30 or 40 lines: they are chopped off on the right, and more digits are missing on numbers nearer the bottom. Also, across the 60 or so OCR runs we have tested so far, we are getting a high frequency of errors where it reads '-0' as '4)' (for example, 7494)211 instead of 749-0211), 'LI' as 'U' and '96' as '%'. The same image reads with far fewer errors using another recognition service. It is also failing to interpret the period-tab as a tab.

Any ideas? Thanks.

asked 07 Apr '12, 08:54

thansen

edited 09 Jun '12, 11:40

Vasily Panferov ♦♦


Hi there! We've thoroughly examined your case. Your scanned image looks very much like a photo taken with a camera, so our image preprocessing engine tries to enhance it with photo preprocessing algorithms, which occasionally corrupt some of the details.

The first thing you need to do is to add &imageSource=scanner to your processImage call. That would disable photo preprocessing and slightly improve your results.
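As a rough illustration of adding that parameter, here is a minimal Python sketch that sets `imageSource=scanner` on a URL that already carries other query parameters. The host `ocr.example.com` and the helper name `with_query_param` are placeholders made up for this example; only `processImage`, `imageSource=scanner` and `language=Spanish` come from this thread.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def with_query_param(url, name, value):
    """Return url with name=value in its query string, replacing any old value."""
    parts = urlparse(url)
    # Keep every existing parameter except the one being set.
    params = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != name
    ]
    params.append((name, value))
    return urlunparse(parts._replace(query=urlencode(params)))


# Hypothetical endpoint: substitute the processImage URL your application already calls.
base = "https://ocr.example.com/processImage?language=Spanish"
url = with_query_param(base, "imageSource", "scanner")
print(url)
```

Going through a helper like this, rather than concatenating strings by hand, makes it easier to confirm that the parameter really ends up in the request that is sent.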

We're also currently working on another option that should improve results further for your type of images. I'll let you know as soon as it's released to production (that should take several days, I think). Please let me know if you have any additional questions.

answered 09 Apr '12, 12:00

Nikolay_Kh ♦♦

edited 09 Apr '12, 12:01

I have done this and it does seem better; I will check more pages. However, the phone numbers on the last 50 or so lines are still being cut off. The page is at a slight angle. Is the recognizer sensitive to this?

(09 Apr '12, 22:08) thansen

You might want to try http://digitalfire.com/culiacan/pictures/270.jpg as well. We are getting strange ZZZZZ sequences where the period-tabs are.

(10 Apr '12, 07:24) thansen

The ZZZZ sequences appear because there is a slight skew in the document. The engine tries to fix it, makes a wrong guess because of the uncommon shape of the image, and produces ZZZZ and other recognition errors. There will soon be an option to disable automatic deskewing, and it should improve your results.

However, we are unable to reproduce your problem with missing numbers on the last lines. If you set "imageSource=scanner", all the lines and numbers from top to bottom appear in the result text file. Our server logs show that your application submitted no tasks with the imageSource=scanner option during the last day.

(10 Apr '12, 11:35) Vasily Panferov ♦♦

There is now an option to disable automatic skew correction, so you can get the most out of your images.

Specify the following parameters: "?language=Spanish&imageSource=scanner&correctSkew=false"

(12 Apr '12, 15:56) Vasily Panferov ♦♦
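For anyone wiring these parameters into a client, here is a minimal Python sketch that builds (but does not send) a request carrying all three options. The host `ocr.example.com`, the credentials, the helper name, the use of HTTP Basic authentication and the raw `application/octet-stream` body are all assumptions made for illustration; check them against the service documentation. Only the three query parameters are taken from the comment above.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_process_image_request(endpoint, app_id, password, image_bytes, options):
    """Build a POST request for a processImage-style endpoint without sending it."""
    query = urlencode(options)
    # Assumption: the service accepts HTTP Basic authentication.
    token = base64.b64encode(f"{app_id}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        url=f"{endpoint}?{query}",
        data=image_bytes,  # assumption: the image is sent as the raw request body
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )


# Parameters quoted in this thread; note the lowercase string "false".
options = {"language": "Spanish", "imageSource": "scanner", "correctSkew": "false"}

request = build_process_image_request(
    "https://ocr.example.com/processImage",  # hypothetical endpoint
    "my-app-id",                             # placeholder credentials
    "my-password",
    b"placeholder image bytes",
    options,
)
print(request.full_url)
```

Printing `request.full_url` before sending is a cheap way to check that `imageSource=scanner` and `correctSkew=false` are actually part of the outgoing request. The request could then be sent with `urllib.request.urlopen(request)`.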

I tried this and it seems a lot better (309.jpg). We got about 6 or 8 hyphens interpreted as bullets. Shouldn't the fact that there are a hundred other hyphens on the page influence the recognizer's judgement when deciding whether to interpret a mark as a hyphen or a bullet? It is also still failing to recognize some of the spaces between words, even where they seem obvious to the eye.

(13 Apr '12, 08:45) thansen
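Until the recognizer handles this itself, one possible client-side workaround is to post-process the result text. The sketch below is only an illustration and not part of the recognition service: it assumes the stray bullets come back as one of a few bullet-like Unicode characters (the set in `BULLETS` is a guess) and replaces them with a hyphen only when they sit between two digits, as in a phone number.

```python
import re

# Assumed bullet-like characters; adjust to what actually appears in your output.
BULLETS = "\u2022\u00b7\u2219"

# A bullet (with optional surrounding spaces) that sits directly between two digits.
_BULLET_BETWEEN_DIGITS = re.compile(rf"(?<=\d)\s*[{BULLETS}]\s*(?=\d)")


def fix_phone_hyphens(text):
    """Replace bullet characters found between digits with a hyphen."""
    return _BULLET_BETWEEN_DIGITS.sub("-", text)


sample = "GARCIA LOPEZ JUAN . 749\u20220211"
print(fix_phone_hyphens(sample))
```

Restricting the replacement to digit-bullet-digit contexts leaves any genuine bullets elsewhere in the text untouched.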

© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal