Can ABBYY FineReader OCR Latin macrons?

Magister posted this 28 April 2015

In Latin, long vowels are often marked with a macron: "arma virumque canō, Troiae quī prīmus ab ōrīs"

I set the input language of my PDF to Latin, but the macrons are lost in the output sent to Word: "arma virumque cano, Troiae qui primus ab oris"

Is there a way to retain the macrons? Many thanks!

IvanPopov posted this 06 May 2015

If you are using FineReader Engine, it is possible to recognize all Unicode characters. However, you might first need to define your own language with a custom alphabet that includes all the necessary symbols. The CustomLanguage sample in the Code Samples Library might help you do just that. If you look at the source code of that sample, you will find a function called makeTextLanguage(). If you modify it to look something like this, it will return a TextLanguage object with a custom alphabet (I use C# syntax in the code fragments below):

private FREngine.TextLanguage makeTextLanguage()
{
    FREngine.LanguageDatabase languageDatabase = engineLoader.Engine.CreateLanguageDatabase();
    FREngine.TextLanguage textLanguage = languageDatabase.CreateTextLanguage();

    // Copy all attributes from the predefined Latin language
    FREngine.TextLanguage latinLanguage = engineLoader.Engine.PredefinedLanguages.Find("Latin").TextLanguage;
    textLanguage.CopyFrom( latinLanguage );
    textLanguage.InternalName = "SampleTextLanguage";

    // Add necessary symbols to the first (and single) BaseLanguage object within TextLanguage
    FREngine.BaseLanguage baseLanguage = textLanguage.BaseLanguages[0];
    baseLanguage.InternalName = "SampleBaseLanguage";
    baseLanguage.set_LetterSet(
        FREngine.BaseLanguageLetterSetEnum.BLLS_Alphabet,
        baseLanguage.get_LetterSet( FREngine.BaseLanguageLetterSetEnum.BLLS_Alphabet ) + "ĀĒĪŌŪȲāēīōūȳ" );

    return textLanguage;
}

You should then pass this custom language to a recognition method, e.g. the Process() method of an FRDocument object:

FREngine.FRDocument document = engineLoader.Engine.CreateFRDocument();
…
// Create a custom TextLanguage
FREngine.TextLanguage textLanguage = makeTextLanguage();
// Pass your custom language to the Process() method
FREngine.DocumentProcessingParams documentProcessingParams = engineLoader.Engine.CreateDocumentProcessingParams();
documentProcessingParams.PageProcessingParams.RecognizerParams.TextLanguage = textLanguage;
document.Process( documentProcessingParams );

This should allow you to recognize macrons. If you need to recognize symbols other than macrons, make sure to add them to the alphabet of your custom language as well. To improve recognition of these symbols further, you might also try recognizing with training (see Developer’s Help -> Guided Tour -> Advanced Techniques -> Using GUI Elements -> Recognizing with Training).
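To illustrate the point about extending the alphabet: the following is a minimal sketch in plain C# (no FineReader Engine calls) of a helper that lists which letters a sample of the expected text uses beyond a given alphabet, so that the result could be appended to the letter set in makeTextLanguage(). The names AlphabetHelper and MissingLetters are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AlphabetHelper
{
    // Hypothetical helper, not part of the FineReader Engine API.
    // Returns the distinct letters of `sample` that are missing from
    // `baseAlphabet`, in order of first appearance.
    public static string MissingLetters(string sample, string baseAlphabet)
    {
        // Letters already covered by the base alphabet are never reported.
        var seen = new HashSet<char>(baseAlphabet);
        var result = new StringBuilder();

        // NFC turns "letter + combining mark" into one precomposed
        // character wherever Unicode defines one (e.g. o + macron -> ō).
        foreach (char c in sample.Normalize(NormalizationForm.FormC))
        {
            // seen.Add returns false for characters already collected.
            if (char.IsLetter(c) && seen.Add(c))
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

For the verse quoted in the question, with the 52 basic Latin letters as the starting alphabet, MissingLetters returns "ōī". Combining marks that remain after normalization, because no precomposed character exists, are not letters in the Unicode sense; they are skipped here and would need separate handling.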

GEORGEJUNG posted this 22 May 2015

Hi! I cannot work out the following issue: I need to OCR texts containing pinyin diacritics (ā ɑ̄ ē ī ō ū ǖ / Ā Ē Ī Ō Ū Ǖ / á ɑ́ é í ó ú ǘ / Á É Í Ó Ú Ǘ / ǎ ɑ̌ ě ǐ ǒ ǔ ǚ / Ǎ Ě Ǐ Ǒ Ǔ Ǚ / à ɑ̀ è ì ò ù ǜ / À È Ì Ò Ù Ǜ / a ɑ e i o u ü / A E I O U), which the software either does not recognize or even mixes up. With previous versions of this software, and with the others included in the comparison table, I tried training, creating user-specific languages, adding every character to their dictionaries, etc., with no success at all. I have even asked the companies for a solution, which does not seem to exist. Therefore, I think this situation should really be mentioned as the Achilles’ heel of the OCR field. I would really appreciate some advice on how to solve this problem, if that is possible, or to be corrected if I am wrong.

IvanPopov posted this 18 June 2015

Recognition quality depends heavily on the quality of both the printing and the scanning of the original document. For example, if diacritic marks are too small, they might be treated as garbage and not regarded as meaningful content. Similarly, printing defects might lead to different diacritics being treated as the same one. More often than not, therefore, OCR results are only as good as the images being recognized. So far, time-consuming and mundane as they may be, pattern training and custom dictionaries are still the best way to improve OCR results.
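When judging whether training or a custom dictionary has actually helped, it can be useful to separate errors where only the diacritic is wrong or missing from errors in the base letter. Below is a minimal sketch in plain C#, assuming you have a hand-corrected reference text to compare the OCR output with. DiacriticCheck and its methods are hypothetical names, not part of any ABBYY API.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticCheck
{
    // Removes all non-spacing marks (macrons, accents, carons, ...)
    // by decomposing to NFD and dropping the combining characters.
    public static string StripMarks(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // True when the two strings differ, but become identical
    // once every diacritic has been stripped from both.
    public static bool DiffersOnlyInDiacritics(string expected, string actual)
    {
        string e = expected.Normalize(NormalizationForm.FormC);
        string a = actual.Normalize(NormalizationForm.FormC);
        return e != a && StripMarks(e) == StripMarks(a);
    }
}
```

With this check, a lost macron ("canō" read as "cano") and a confused tone mark ("ǎ" read as "à") both count as diacritic-only errors, while "cano" read as "cane" does not.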

IvanPopov posted this 18 June 2015

Do we understand correctly that you are using FineReader 12? In that case, you can contact the FineReader support team with this question. If you are using the OCR SDK, you should contact the SDK support team instead. You can find their contact information on this page: http://www.abbyy.com/support/contacts/
