I'm looking at some XML that was produced by (I guess) ABBYY FineReader 6, as it includes this schema reference at the start: http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml
The encoding is UTF-8 and the lang throughout is given as EnglishUnitedStates.
I realise that there is an interplay between character level recognition and a dictionary, but can anyone explain please to what extent the lang affects recognition results? I'm particularly interested in how ligatured characters might be treated, since none appear in the output, though there are many in the text.
asked 21 Jul '14, 13:22
The issue is caused by a general limitation of ABBYY FineReader 6, not by wrong settings. It can recognize ligatures, but the text quality in this book made them too difficult to recognize.
Note that ABBYY FineReader 6 is an old product. With current technology the recognition quality should be better. You can test it here: http://finereaderonline.com/
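
If you want to check what an XML output file actually contains, whether from FineReader 6 or from a newer engine, a small script can count the lang values and any Unicode ligature code points (U+FB00 to U+FB06). This is only a sketch in Python: the `summarize` function and the `SAMPLE` string are made up for illustration, and it deliberately scans every element instead of assuming particular element names from the FineReader schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Latin ligatures in the Unicode "Alphabetic Presentation Forms" block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "long s t",
    "\ufb06": "st",
}


def summarize(xml_text):
    """Count 'lang' attribute values and ligature characters in an XML string."""
    root = ET.fromstring(xml_text)
    langs = Counter()
    ligs = Counter()
    for elem in root.iter():
        # Record the lang value wherever an element carries one.
        lang = elem.get("lang")
        if lang is not None:
            langs[lang] += 1
        # text + tail of every element covers all character data exactly once.
        for chunk in (elem.text, elem.tail):
            if chunk:
                for ch in chunk:
                    if ch in LIGATURES:
                        ligs[LIGATURES[ch]] += 1
    return langs, ligs


# Hypothetical fragment, not real FineReader output.
SAMPLE = (
    '<document><page>'
    '<formatting lang="EnglishUnitedStates">of\ufb01ce and office</formatting>'
    '</page></document>'
)

langs, ligs = summarize(SAMPLE)
print("lang values:", dict(langs))
print("ligatures found:", dict(ligs))
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the result to `summarize`. An empty ligature count does not by itself mean the words were misread, since an engine may output the component letters (f followed by i) instead of the single ligature character.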