language versus character set

  • Last Post 14 August 2014
JuanTanamera posted this 21 July 2014

I'm looking at some XML that was produced by (I guess) ABBYY 6, as it includes this schema information at the start:

The encoding is UTF-8 and the lang throughout is given as EnglishUnitedStates.

I realise that there is an interplay between character-level recognition and a dictionary, but can anyone please explain to what extent the lang affects recognition results? I'm particularly interested in how ligatured characters might be treated, since none appear in the output, though there are many in the text.

SDK_support posted this 22 July 2014

Can you please describe your scenario in more detail? Do you use Cloud OCR SDK or FineReader Engine?

Do we understand correctly that you are recognizing Arabic and that some characters, such as ligatured ones, are not present in the output? If so, can you please send some examples of such images and your serial number to

JuanTanamera posted this 27 July 2014

Thank you for your response.

I'm looking at output that I didn't produce and which I won't be able to reproduce, as given here, e.g.

(See "Download Contents", in this case

All I really have to go on is the information in the xml file.

SDK_support posted this 28 July 2014

Unfortunately, the second link you provided is not working. Can you please be more specific about your scenario? Do we understand correctly that you need to recognize the file and save it to XML?

JuanTanamera posted this 14 August 2014

The URL needs to have the right bracket removed from it...

No, I am not trying to recognize a file; the first link shows an example page of a journal that has been scanned and recognised with ABBYY, as shown by the XML output at the second link. My question was about ABBYY's failure to recognise ligatured (English) characters in these files: is it connected with the selected lang indicated in the XML file (EnglishUnitedStates), is it a limitation of ABBYY in general, or is it the result of a setting that was used when scanning? Note, I am not interested in running the OCR myself. I am just trying to understand why ligatures aren't recognised in these files, and the extent of that issue.
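
For anyone wanting to gauge the extent of this in similar output, here is a minimal Python sketch. It assumes the recognised text has already been pulled out of the XML into a plain string, and it only detects ligatures that were emitted as single Unicode presentation-form characters (U+FB00 to U+FB06). If the engine wrote a ligature as separate letters, or misread it as something else, this check will not see it. The function names are my own, not part of any ABBYY API.

```python
import unicodedata
from collections import Counter

# Latin ligature presentation forms: ff, fi, fl, ffi, ffl, long-s-t, st
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def count_ligatures(text):
    """Count each single-character Unicode ligature found in text."""
    return Counter(ch for ch in text if ch in LIGATURES)


def expand_ligatures(text):
    """Replace ligature characters with their plain-letter equivalents.

    Only the ligature code points are normalised (NFKC); every other
    character is passed through unchanged.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in LIGATURES else ch
        for ch in text
    )


if __name__ == "__main__":
    sample = "\ufb01nd the \ufb02ag"  # "find the flag" typed with fi and fl ligatures
    print(count_ligatures(sample))
    print(expand_ligatures(sample))
```

A count of zero across the whole file would be consistent with what I see, but by itself it does not say whether the ligatures were split into ordinary letters correctly or misrecognised.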

Anastasia Galimova posted this 15 August 2014

The issue is caused by a general limitation of ABBYY FineReader 6, not by wrong settings. It can recognize ligatures, but in this book the text quality made them too difficult to recognize.

Note that ABBYY FineReader 6 is an old product. With current technologies the recognition quality should be better. You can test it here: