junk chars are detected

  • Last Post 07 February 2014
Anastasia Galimova posted this 03 February 2014

mpraining: Some junk characters are detected, e.g. "PINOT NOIR", which is the first line of the result for the attached image. Another example is "Joan d’Anguera". We need the text with such junk characters removed. Is there any option to avoid such characters?

Image

Anastasia Galimova posted this 03 February 2014

The issue could not be reproduced on our side. We recommend recognizing your image with the URL "http://cloud.ocrsdk.com/processImage?language=english,french&profile=textextraction&exportFormat=txt". In this case the result is:

PINOT NOIR
BURGUNDY
A1020 Roblet-Monnot “Vieilles Vignes" 2010
72
Al 021 Paul Pernot et ses Fils 2008
122
Pommard-Noizons
A1022 Domaine Antonin Guyon 2009
Clos de la Chaume Gaufriot, Beaune
A1023 Domaine Ardhuy 2009
Gevrey-Chambertin
U5
172
C10-24 Domaine de Lambrays Grand Cru 2009
Clos des Lambrays, Morey
260
C1025 Camille Giroud Grand Cru 2008
Chapelle-Chambertin
430
an 18% gratuity is included on all checks
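
For reference, the request URL recommended above can be assembled from its three query parameters. This is only a rough sketch: the function name `build_process_image_url` is made up for illustration, and authentication and the image upload itself are left out because the thread does not cover them. Only the base URL and the parameter names `language`, `profile` and `exportFormat` come from the post.

```python
from urllib.parse import urlencode

# Base URL exactly as quoted in the reply above.
BASE_URL = "http://cloud.ocrsdk.com/processImage"


def build_process_image_url(languages, profile="textextraction", export_format="txt"):
    """Build the processImage URL (hypothetical helper, not part of any SDK)."""
    params = {
        "language": ",".join(languages),  # e.g. "english,french"
        "profile": profile,
        "exportFormat": export_format,
    }
    # safe="," keeps the comma between languages unescaped, as in the post.
    return BASE_URL + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_process_image_url(["english", "french"]))
```

Listing both languages matters here because the wine list mixes English and French words.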

mpraining posted this 05 February 2014

Hello Anastasia, thanks for your feedback. I got it working better, but there is still one thing I do not understand. Please check the following entry from my result:

A1022 Domaine Antonin Guyon 2009
Clos de la Chaume Gaufriot, Beaune
A1023 Domaine Ardhuy 2009
Gevrey-Chambertin
145
172

Here we actually expect something like this:

A1022 Domaine Antonin Guyon 2009
Clos de la Chaume Gaufriot, Beaune
145
A1023 Domaine Ardhuy 2009
Gevrey-Chambertin
172

But the result is not in this order. Can you please check why this is happening? Otherwise my algorithm for detecting these lines will fail because of this OCR mistake. I also checked the XML format, and it is not suitable for us. I just expect the contents in the same order as in the image. Please check and help me.

Anastasia Galimova posted this 07 February 2014

The automatic analysis recognizes this picture as several separate areas, which is why the text order is not left to right and top to bottom. Unfortunately, it is currently impossible to export text in this order automatically. So the only way to get this order is to sort the words by their coordinates on your side.
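
A minimal sketch of such client-side sorting, assuming each word's bounding box has already been read from an export that includes coordinates. The `Word` type, its field names and the `tolerance` parameter are hypothetical and are not part of the OCR service. Words whose vertical centers are close together are grouped into one row, and each row is then ordered left to right.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical container: text plus bounding box in pixels.
    text: str
    left: int
    top: int
    right: int
    bottom: int


def _center_y(word):
    return (word.top + word.bottom) / 2


def reading_order_lines(words, tolerance=0.5):
    """Group words into rows by vertical position, then sort each row by x."""
    rows = []
    for word in sorted(words, key=_center_y):
        if rows:
            ref = rows[-1][0]  # first word of the current row
            max_height = max(word.bottom - word.top, ref.bottom - ref.top)
            # Same row if the centers differ by less than a fraction of the height.
            if abs(_center_y(word) - _center_y(ref)) <= tolerance * max_height:
                rows[-1].append(word)
                continue
        rows.append([word])
    return [" ".join(w.text for w in sorted(row, key=lambda w: w.left)) for row in rows]


if __name__ == "__main__":
    # Made-up coordinates: names in a left area, prices in a right area.
    sample = [
        Word("A1022", 10, 100, 60, 120), Word("Guyon", 70, 100, 140, 120),
        Word("A1023", 10, 160, 60, 180), Word("Ardhuy", 70, 160, 140, 180),
        Word("145", 400, 102, 430, 122), Word("172", 400, 161, 430, 181),
    ]
    for line in reading_order_lines(sample):
        print(line)
```

With this approach a price ends up on the same output line as the text printed on its row, rather than on a line of its own, which may or may not suit the line-detection algorithm mentioned earlier. The tolerance would probably need tuning for skewed or tightly spaced scans.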
