I need to define two regions in an image: one restricted by a regular expression and one restricted by a custom dictionary. For the regex region, I tried to implement the logic from the section `How to attach a dictionary to a recognition language` in the user guide, as shown below, but it does not affect the recognition result at all. Is the following code snippet correct?
IRecognizerParams recognizerParams = engine.CreateRecognizerParams();
ILanguageDatabase languageDatabase = engine.CreateLanguageDatabase();
ITextLanguage textLanguage = languageDatabase.CreateTextLanguage();
IBaseLanguages baseLanguages = textLanguage.getBaseLanguages();
IBaseLanguage baseLanguage = baseLanguages.AddNew();
IDictionaryDescriptions dictionaryDescriptions = baseLanguage.getDictionaryDescriptions();
IDictionaryDescription dictionaryDescription = dictionaryDescriptions.AddNew(DictionaryTypeEnum.DT_RegularExpression);
IRegExpDictionaryDescription regExpDictionaryDescription = dictionaryDescription.GetAsRegExpDictionaryDescription();
regExpDictionaryDescription.SetText("(((|0)[1-9])|([12][0-9])|(30)|(31))\\-(((|0)[1-9])|(10)|(11)|(12))\\-((((19)|(20))[0-9][0-9])|([0-9][0-9]))");
baseLanguage.setAllowWordsFromDictionaryOnly(true);
// baseLanguage.setLetterSet(type, result); // no idea what the result parameter should be
recognizerParams.setTextLanguage(textLanguage);
region.AddRect(0, 100, 500, 125);
region.AddRect(0, 200, 500, 225);
document.getPages().getElement(0).getLayout().getBlocks().AddNew(BlockTypeEnum.BT_Text, region, 0);
document.Recognize( null, null );
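To make the intent of the pattern concrete, here is a minimal sketch that checks the same expression with the standard `java.util.regex` package. It only shows which strings the pattern is meant to accept (dates such as dd-mm-yyyy or d-m-yy). It assumes the engine's regular expression dialect interprets the pattern the same way as Java does, which I have not verified, and the class name `DatePatternCheck` is made up for this illustration.

```java
import java.util.regex.Pattern;

class DatePatternCheck {
    // Same expression as passed to SetText above, here compiled by java.util.regex.
    // Assumption: the OCR engine's dialect may differ from Java's.
    static final Pattern DATE = Pattern.compile(
        "(((|0)[1-9])|([12][0-9])|(30)|(31))\\-"      // day: 1-31, optional leading zero
        + "(((|0)[1-9])|(10)|(11)|(12))\\-"           // month: 1-12, optional leading zero
        + "((((19)|(20))[0-9][0-9])|([0-9][0-9]))");  // year: 19xx, 20xx, or two digits

    // Returns true only if the whole string is a date in the expected form.
    static boolean isDate(String text) {
        return DATE.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"31-12-1999", "1-1-99", "32-12-1999", "31-13-1999"}) {
            System.out.println(s + " : " + isDate(s));
        }
    }
}
```

In plain Java the first two strings are accepted and the last two are rejected, which is the behavior I expect from the regex region.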
For the custom dictionary region, we have a word list in Tesseract's .user-words format (one word per line). What is the proper way to load the .user-words file into the engine?
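For reference, this is a minimal sketch of reading such a file into memory with standard Java only. It assumes the file is UTF-8 encoded and that blank lines can be skipped. The class name `UserWords` and the method `load` are hypothetical, and the sketch deliberately stops before any engine call, since how to hand the words to the engine is exactly the open question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class UserWords {
    // Reads a one-word-per-line file and returns the non-empty, trimmed words.
    // Assumption: the file is UTF-8 encoded.
    static List<String> load(Path file) throws IOException {
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```

Calling `UserWords.load(Paths.get("my.user-words"))` (the file name is a placeholder) returns the words in file order, ready to be passed to whatever dictionary API is appropriate.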
Thanks very much.
Related topic: https://forum.ocrsdk.com/thread/how-to-only-recognize-specified-region-of-the-image-in-java/