Cloud OCR SDK - how to disable dictionary "auto correction"

  • Last Post 07 April 2017
  • Topic Is Solved
Licho posted this 31 March 2017

Hello,

I'm trying to recognize a text field (using http://ocrsdk.com/documentation/apireference/processTextField/ ) that contains a simple word printed in a very clear font:

"TRKA"

However, if the language is set to Czech, it gets automatically "corrected" to "TRKÁ".

If I set the language to English, no auto-correction occurs.

How do I disable dictionary-based auto-correction, or at least extract the confidence of characters before the dictionary pass, in Cloud OCR SDK?

 

This feature is very annoying and currently makes it impossible for us to use ABBYY for OCR.

 

Thank you.  

Oksana Serdyuk posted this 03 April 2017

Returning the collection of character recognition variants and their confidence values is supported only for full-page recognition with the XML export format. It is not supported in field-level recognition mode.

When you use the English language, you do not get the "Á" character in the result because the English alphabet does not include this letter. You can do the same with the Czech recognition language: for example, limit the characters used during recognition with the letterSet parameter. Additionally, you could check potential field values against appropriate regular expressions (e.g. a date format for dates, words starting with capital letters for names, etc.) to accept or reject different variants. To specify the regular expression that defines which words are allowed in the field, use the regExp parameter of the processTextField method. Please also see the How to Recognize Text Fields article for more details.
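As a rough illustration of the regular-expression check on the client side, here is a minimal Python sketch. The pattern and the function name `accept_variants` are hypothetical and not part of the Cloud OCR SDK; the pattern simply assumes a field holding one word in uppercase Czech letters.

```python
import re

# Hypothetical pattern: a single word made only of uppercase Czech letters.
UPPERCASE_CZECH_WORD = re.compile(r"[A-ZÁÉÍÓÚÝČĎĚŇŘŠŤŮŽ]+")

def accept_variants(variants, pattern=UPPERCASE_CZECH_WORD):
    """Return only the candidate field values that fully match the pattern."""
    return [value for value in variants if pattern.fullmatch(value)]

if __name__ == "__main__":
    # Mixed-case and digit-containing candidates are rejected.
    print(accept_variants(["TRKA", "TRKÁ", "Trka", "TRK4"]))
```

Note that a check like this can only reject values with the wrong shape; it cannot tell "TRKA" from "TRKÁ", since both are valid uppercase Czech words.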

Licho posted this 03 April 2017

Thank you, but the Czech alphabet normally contains characters like "Á", so I cannot restrict it.

In this case, the scanned image DOES NOT contain it, but the word is auto-corrected using a dictionary to the version with "Á".

If I use the English language and set the allowed characters to include "Á", the word is still recognized correctly as "TRKA", and other words that really contain "Á" are recognized correctly too.

 

So the BUG is obviously in overly eager dictionary checks, which convert the word "TRKA" to "TRKÁ" for no reason at all.

 

To illustrate:

The field input: http://i.imgur.com/FsyeaRv.png

OCR with Czech language, result: "TRKÁ" (fail); "Á" has confidence 100 even though it is not there.

OCR with English language, with "Á" in alphabet, result: "TRKA" (correct)

 

The field input: http://i.imgur.com/4cSSlvs.png

OCR with Czech language, result: "MALEGOVÁ" (correct)

OCR with English language, with "Á" in alphabet, result: "MALEGOVÁ" (correct)

 

So it is obvious that some dictionary-based processing transforms a correctly recognized word into something else. "TRKA" is a surname (not in the dictionary), while "TRKÁ" is a verb that is likely to be in a dictionary.
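One way to spot this kind of discrepancy automatically is to run the field with both languages and flag results that differ only in diacritics. The sketch below is a client-side diagnostic using only the Python standard library; the function names are hypothetical, and it assumes both recognition results are already available as strings.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks, e.g. 'Á' becomes 'A'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def differs_only_by_diacritics(first, second):
    """True when the two results are different but share the same base letters."""
    return first != second and strip_diacritics(first) == strip_diacritics(second)

if __name__ == "__main__":
    # Czech result vs. English result for each of the two fields above.
    print(differs_only_by_diacritics("TRKÁ", "TRKA"))
    print(differs_only_by_diacritics("MALEGOVÁ", "MALEGOVÁ"))
```

A flagged field would still need a manual look or a second check, because the comparison cannot say which of the two results is the right one.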

 

Oksana Serdyuk posted this 05 April 2017

Thank you for the images! I have reproduced the issue, we shall analyze the situation, and then I will return with our comments.

Oksana Serdyuk posted this 07 April 2017

Sorry for the delay. Do I understand correctly that you process passports and the fields are always printed in capital letters? If so, please try the following recognition settings for the processTextField method:

Language = "Czech";

TextType = TextType.Normal; //Use the TextType.OcrB for extracting MRZ data

Letterset = "ABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÓÚÝČĎĚŇŘŠŤŮŽ";

In this case both your images are recognized accurately.
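For reference, the same settings could be expressed as query parameters of a processTextField request roughly as follows. This is only a sketch: the base URL and the lowercase spelling of the parameter names and of the `normal` value are assumptions to be checked against the API reference, and authentication and the image upload are left out entirely.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the processTextField API reference.
BASE_URL = "http://cloud.ocrsdk.com/processTextField"

# Uppercase Latin letters plus the uppercase Czech accented letters from the settings above.
CZECH_UPPERCASE = "ABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÓÚÝČĎĚŇŘŠŤŮŽ"

def build_text_field_url(language="Czech", text_type="normal",
                         letter_set=CZECH_UPPERCASE):
    """Build the request URL with the recognition settings as a query string."""
    params = {
        "language": language,
        "textType": text_type,    # assumed query spelling of TextType.Normal
        "letterSet": letter_set,  # non-ASCII letters are percent-encoded as UTF-8
    }
    return BASE_URL + "?" + urlencode(params)

if __name__ == "__main__":
    print(build_text_field_url())
```

The image itself would then be sent as the body of an authenticated POST to this URL.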

 

 

Licho posted this 07 April 2017

Thank you!   

 
