In Latin, long vowels are often marked with a macron: "arma virumque canō, Troiae quī prīmus ab ōrīs"
I set the input language of my PDF to Latin, but the macrons are lost in the output sent to Word: "arma virumque cano, Troiae qui primus ab oris"
Is there a way to retain the macrons? Many thanks!
asked 28 Apr '15, 19:40
If you are using FineReader Engine, it is possible to recognize any Unicode character. However, you may first need to define your own language with a custom alphabet that includes all the necessary symbols. The CustomLanguage sample in the Code Samples Library can help you do just that. If you look at the source code of that sample, you will find a function called makeTextLanguage(). If you modify it to look something like this, it will return a TextLanguage object with a custom alphabet (I use C# syntax in the code fragments below):
You should then pass this custom language to a recognition method, e.g. the Process() method of an FRDocument object:
This should allow you to recognize vowels with macrons. If you need to recognize other symbols as well, make sure to add them to the alphabet of your custom language too. To improve recognition of these symbols further, you might also try recognizing with training (see Developer’s Help -> Guided Tour -> Advanced Techniques -> Using GUI Elements -> Recognizing with Training).
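To illustrate the "add them to the alphabet" step, here is a minimal sketch in Python that is independent of FineReader Engine. It only assembles an alphabet string containing the macron vowels and checks whether a sample text is fully covered by it. The helper names `build_alphabet` and `missing_from_alphabet` are hypothetical, and the resulting string would still have to be handed to the engine through the custom language described above.

```python
import unicodedata

BASE_LATIN = "abcdefghijklmnopqrstuvwxyz"
MACRON = "\u0304"  # COMBINING MACRON


def build_alphabet(base=BASE_LATIN, long_vowels="aeiouy"):
    """Return base letters (both cases) plus precomposed macron vowels."""
    letters = list(base + base.upper())
    for vowel in long_vowels + long_vowels.upper():
        composed = unicodedata.normalize("NFC", vowel + MACRON)
        # Keep only vowels that have a single-code-point macron form.
        if len(composed) == 1:
            letters.append(composed)
    return "".join(letters)


def missing_from_alphabet(text, alphabet):
    """Return the sorted letters of `text` that `alphabet` does not cover."""
    normalized = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalized if ch.isalpha() and ch not in alphabet})


if __name__ == "__main__":
    alphabet = build_alphabet()
    sample = "arma virumque can\u014d, Troiae qu\u012b pr\u012bmus ab \u014dr\u012bs"
    print(alphabet)
    print(missing_from_alphabet(sample, alphabet))
```

Running the same coverage check on the text you get back from recognition is a quick way to tell whether a symbol was never in the alphabet or was in the alphabet but recognized as its plain counterpart.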
Hi! I cannot work out the following issue: I need to OCR texts containing pinyin diacritics ( o ā ɑ̄ ē ī ō ū ǖ / Ā Ē Ī Ō Ū Ǖ /á ɑ́ é í ó ú ǘ / Á É Í Ó Ú Ǘ / ǎ ɑ̌ ě ǐ ǒ ǔ ǚ / Ǎ Ě Ǐ Ǒ Ǔ Ǚ / à ɑ̀ è ì ò ù ǜ / À È Ì Ò Ù Ǜ / a ɑ e i o u ü / A E I O U o ā ɑ̄ ē ī ō ū ǖ / á ɑ́ é í ó ú ǘ /ǎ ɑ̌ ě ǐ ǒ ǔ ǚ / à ɑ̀ è ì ò ù ǜ / a ɑ e i o u ü), which the software either does not recognize or mixes up. In previous versions of this software, and of the others included in the comparison table, I tried training, creating user-specific languages, adding every character to their dictionaries, etc., with no success at all. I have even asked the companies for a solution, which does not seem to exist. Therefore, I think this should really be mentioned as the Achilles’ heel of the OCR field. I would really appreciate some advice on how to solve this problem, if that is possible, or a correction if I am wrong.
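One thing worth checking before adding these letters to any custom alphabet: not every pinyin letter in the list is a single Unicode character. The sketch below (Python, not tied to any OCR product; the function name `classify_pinyin_letters` is hypothetical) separates tone-marked vowels that have a precomposed code point from those that only exist as a base letter plus a combining mark. Whether this is related to the recognition problem described above is an assumption, not something established in this thread.

```python
import unicodedata

# Combining marks for the four pinyin tones.
TONE_MARKS = {
    "macron": "\u0304",
    "acute": "\u0301",
    "caron": "\u030c",
    "grave": "\u0300",
}
# a, Latin alpha (U+0251), e, i, o, u, u with diaeresis (U+00FC)
PINYIN_VOWELS = "a\u0251eiou\u00fc"


def classify_pinyin_letters(vowels=PINYIN_VOWELS, marks=TONE_MARKS):
    """Split vowel+tone combinations into single code points and sequences."""
    single, sequences = [], []
    for vowel in vowels:
        for mark in marks.values():
            composed = unicodedata.normalize("NFC", vowel + mark)
            if len(composed) == 1:
                single.append(composed)
            else:
                # No precomposed form exists; the letter stays two code points.
                sequences.append(composed)
    return single, sequences


if __name__ == "__main__":
    single, sequences = classify_pinyin_letters()
    print("single code points:", " ".join(single))
    print("combining sequences:", " ".join(sequences))
```

In this set, only the forms built on ɑ (U+0251) remain combining sequences, so an alphabet defined as a list of single characters cannot contain them as one entry each.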
This answer is marked "community wiki".
answered 22 May '15, 15:33