I'm looking at some XML that was produced by (I guess) ABBYY FineReader 6, as it includes this schema information at the start: http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml

The encoding is UTF-8 and the lang throughout is given as EnglishUnitedStates.

I realise that there is an interplay between character level recognition and a dictionary, but can anyone explain please to what extent the lang affects recognition results? I'm particularly interested in how ligatured characters might be treated, since none appear in the output, though there are many in the text.

asked 21 Jul '14, 13:22

JuanTanamera

Can you please describe your scenario in more detail? Do you use Cloud OCR SDK or FineReader Engine?

Do we understand correctly that you are recognizing Arabic and that some characters, such as ligatured ones, are not present in the output? If so, can you please send some example images and your serial number to SDK_Support@abbyy.com?

(22 Jul '14, 16:33) SDK_support ♦♦

Thank you for your response.

I'm looking at output that I didn't produce and won't be able to reproduce, for example:

http://www.biodiversitylibrary.org/item/14589#page/15/mode/1up

(See "Download Contents", in this case http://ia700602.us.archive.org/28/items/mobotbca_03_01_00/mobotbca_03_01_00_abbyy.gz)

All I really have to go on is the information in the xml file.

(27 Jul '14, 20:12) JuanTanamera

Unfortunately, the second link you provided is not working. Can you please be more specific about your scenario? Do we understand correctly that you need to recognize the file http://www.biodiversitylibrary.org/item/14589#page/15/mode/1up and save it to XML?

(28 Jul '14, 12:53) SDK_support ♦♦

The URL needs to have the right bracket removed from it...

http://ia700602.us.archive.org/28/items/mobotbca_03_01_00/mobotbca_03_01_00_abbyy.gz

No, I am not trying to recognize a file. The first link shows an example page of a journal that has been scanned and recognised with ABBYY, as shown by the XML output at the second link. My question was about ABBYY's failure to recognise ligatured (English) characters in these files: is it connected with the selected lang indicated in the XML file (EnglishUnitedStates), is it a limitation of ABBYY in general, or is it down to a setting that was used when scanning? Note that I am not interested in running the OCR myself; I am just trying to understand why ligatures aren't recognised in these files, and the extent of that issue.
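To put a number on "the extent of that issue", something like the following rough sketch could be run over the downloaded file. It assumes only that the file is well-formed XML and that the language is carried in attributes named `lang`, as described above; it does not depend on any particular element names in the FineReader schema. The function names `scan_abbyy_xml` and `scan_gzip_file` are made up for this illustration, and the whole document is loaded into memory, which may be slow for a large volume.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter

# Unicode "Alphabetic Presentation Forms" Latin ligatures, U+FB00..U+FB06
# (ff, fi, fl, ffi, ffl, long-s-t, st as single code points).
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def scan_abbyy_xml(source):
    """Return (ligature counts, lang attribute counts) for an XML file or file object.

    Hypothetical helper: it looks at every element, whatever its name,
    so it makes no assumption about the schema beyond a `lang` attribute.
    """
    root = ET.parse(source).getroot()
    langs = Counter(
        elem.attrib["lang"] for elem in root.iter() if "lang" in elem.attrib
    )
    ligatures = Counter(
        ch for text in root.itertext() for ch in text if ch in LIGATURES
    )
    return ligatures, langs


def scan_gzip_file(path):
    """Convenience wrapper for a gzip-compressed file such as *_abbyy.gz."""
    with gzip.open(path, "rb") as fh:
        return scan_abbyy_xml(fh)
```

Calling `scan_gzip_file("mobotbca_03_01_00_abbyy.gz")` on a local copy would return two counters. An empty ligature counter only shows that no ligature code points were emitted; it does not by itself say whether the OCR wrote the plain letter pairs correctly or misread them.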

(14 Aug '14, 13:24) JuanTanamera

This is a general limitation of ABBYY FineReader 6, not the result of wrong settings. FineReader 6 can recognize ligatures, but the text quality of this book made them too difficult to recognize.

Note that ABBYY FineReader 6 is an old product. With current technology the recognition quality should be better. You can test it here: http://finereaderonline.com/

answered 15 Aug '14, 16:27

Anastasia Ga... ♦♦

edited 15 Aug '14, 16:28


Asked: 21 Jul '14, 13:22

Seen: 1,254 times

Last updated: 15 Aug '14, 16:28

© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal