The image of a phonebook page column is here: http://digitalfire.com/culiacan/pictures/326.jpg It was scanned at 600 dpi, auto-leveled and resized in Photoshop to 300 dpi (without resampling). The parameters are: language=Spanish&exportFormat=txt&imageSource=scanner&correctSkew=false
The 18th line from the end is missing; it starts 'Clz Heroico ...'. Also, we are still getting a lot of errors with '-0' (e.g. '7144)556' instead of '714-0556').
I opened your image in FineReader 11 and for some reason the text lines are skewed (even though the image looks correct in the browser); I have no idea why. After deskewing within FineReader the recognition was much better.
Before: 11% (624/5877) uncertain characters
After: 7% (431/6244) uncertain characters
Because of the skew, some areas were not correctly identified as text, which is why the absolute numbers differ so much.
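The percentages above are just the uncertain-character count divided by the total count. A minimal sketch of that arithmetic, using only the counts quoted in this answer (the function name `uncertain_share` is made up for illustration):

```python
def uncertain_share(uncertain: int, total: int) -> float:
    """Return the share of uncertain characters as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * uncertain / total


# Counts taken from the FineReader runs quoted above.
runs = {
    "before deskew": (624, 5877),
    "after deskew": (431, 6244),
}

for label, (uncertain, total) in runs.items():
    share = uncertain_share(uncertain, total)
    print(f"{label}: {share:.1f}% ({uncertain}/{total})")
```

Because the totals differ between runs, comparing the percentages is more meaningful than comparing the raw counts.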
I also changed the resolution from 600 dpi (4.78 cm x 24.45 cm) to 300 dpi (9.55 cm x 48.9 cm), without recalculating the pixels, and saved it as a BMP. The resolution change (a virtual enlargement) and the no longer skewed text lines made it easier for the OCR to analyse the image. The 8% (472/6207) uncertain characters is at the same level as the deskewed image in FineReader.
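Changing the dpi value without resampling leaves the pixels untouched and only changes the physical size the file claims to have. A small sketch of that relation, assuming the 1128x5776 pixel dimensions quoted later in this thread (the helper name `physical_size_cm` is hypothetical):

```python
CM_PER_INCH = 2.54


def physical_size_cm(width_px: int, height_px: int, dpi: int) -> tuple[float, float]:
    """Physical size in cm implied by a pixel size and a dpi setting."""
    return (width_px / dpi * CM_PER_INCH, height_px / dpi * CM_PER_INCH)


# Same pixels, two different dpi labels.
width_px, height_px = 1128, 5776
for dpi in (600, 300):
    w_cm, h_cm = physical_size_cm(width_px, height_px, dpi)
    print(f"{dpi} dpi: {w_cm:.2f} cm x {h_cm:.2f} cm")
```

Halving the dpi doubles the nominal size, so the OCR treats the characters as twice as large even though no pixel has changed.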
Note: I am not recommending scanning at 300 dpi; you need 600 dpi to get enough pixels from the rather small characters.
The number of characters is not always the same because the ....... areas were sometimes interpreted as an image snippet.
answered 17 Apr '12, 10:52
Next step to improve:
Double the pixel size of the image and set it to 600 dpi (from 1128x5776 to 2256x11552). When submitting to the server, use all the previous options: scanner and deskew.
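A rough sketch of this step, under two assumptions that are not confirmed in the thread: that "deskew" means sending `correctSkew=true` (the original request used `correctSkew=false`), and that the other parameters stay as in the question. The code only computes the target pixel size and builds the query string; the resampling itself and the upload are left out, and the function name `upscaled_size` is made up for illustration.

```python
from urllib.parse import urlencode


def upscaled_size(width_px: int, height_px: int, factor: int = 2) -> tuple[int, int]:
    """Pixel dimensions after enlarging the image by an integer factor."""
    return (width_px * factor, height_px * factor)


# Target size for resampling the original 1128x5776 scan.
new_width, new_height = upscaled_size(1128, 5776)
print(f"resample to {new_width}x{new_height} and tag as 600 dpi")

# Parameters from the question; correctSkew=true is an assumption
# about what "deskew" means here.
params = {
    "language": "Spanish",
    "exportFormat": "txt",
    "imageSource": "scanner",
    "correctSkew": "true",
}
query = urlencode(params)
print(query)
```

At 600 dpi the enlarged image has the same nominal size as the 300 dpi relabelled version, but with four times as many pixels for the recognizer to work with.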
answered 17 Apr '12, 16:55
Vasily Panferov ♦♦