Setting Regex and Custom Dictionary for Different Regions in an Image in Java

Barry Choi posted this 4 weeks ago

I need to define one region with a regex and another region with a custom dictionary in an image. For the regex region, I tried to implement the logic based on the section `How to attach a dictionary to a recognition language` in the user guide, as shown below, but it does not affect the result at all. Is the following code snippet correct?

IRecognizerParams recognizerParams = engine.CreateRecognizerParams();
ILanguageDatabase languageDatabase = engine.CreateLanguageDatabase();
ITextLanguage textLanguage = languageDatabase.CreateTextLanguage();
IBaseLanguages baseLanguages = textLanguage.getBaseLanguages();
IBaseLanguage baseLanguage = baseLanguages.AddNew();
IDictionaryDescriptions dictionaryDescriptions = baseLanguage.getDictionaryDescriptions();
IDictionaryDescription dictionaryDescription = dictionaryDescriptions.AddNew(DictionaryTypeEnum.DT_RegularExpression);
IRegExpDictionaryDescription regExpDictionaryDescription = dictionaryDescription.GetAsRegExpDictionaryDescription();
// baseLanguage.setLetterSet(type, result); // no idea what the result parameter should be
region.AddRect(0, 100, 500, 125);
region.AddRect(0, 200, 500, 225);
document.getPages().getElement(0).getLayout().getBlocks().AddNew(BlockTypeEnum.BT_Text, region, 0);
document.Recognize( null, null );

For the custom dictionary, we have a word list in Tesseract's .user-words format (one word per line). What is the proper way to consume the .user-words file?

Thanks very much.

Barry Choi posted this 4 weeks ago

From the interface file IFRDocument, it seems that only document.Analyze accepts recognizerParams as input, so I added

document.Analyze(null, null, recognizerParams);

before the document.Recognize( null, null ); statement.

During execution, the following error occurred:

The page cannot be analyzed. No basic languages with an alphabet are available. Please specify an alphabet.

Having searched for 'alphabet' in the user manual and interface files, I am unable to find any clue to resolve this.

May I know if I'm on the right track?

Thanks very much.

Denis Gusak posted this 3 weeks ago


Firstly, if you add regions manually using the AddNew() method, you also have to specify recognition parameters for each of them manually.


Secondly, when creating a new BaseLanguage object, it is necessary not only to create and set dictionaries, but also to set an alphabet via the setLetterSet method:

baseLanguage.setLetterSet(BaseLanguageLetterSetEnum.BLLS_Alphabet, "abcdefghi123456");
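As for the .user-words file, one possible starting point (an assumption, not something confirmed in this thread) is to read the one-word-per-line content in plain Java and derive the alphabet string from the words themselves. The sketch below uses only the standard library; the class name UserWordsSketch and the helpers parseWords and deriveLetterSet are hypothetical names made up for illustration. It does not show how the words are then added to an engine dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class UserWordsSketch {

    // Parses .user-words style content (one word per line):
    // trims each line and skips blank ones.
    public static List<String> parseWords(String content) {
        List<String> words = new ArrayList<>();
        for (String line : content.split("\\R")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Collects every distinct character used by the words, in sorted order.
    // Works on UTF-16 chars, so characters outside the BMP are not handled.
    public static String deriveLetterSet(List<String> words) {
        TreeSet<Character> letters = new TreeSet<>();
        for (String word : words) {
            for (char c : word.toCharArray()) {
                letters.add(c);
            }
        }
        StringBuilder alphabet = new StringBuilder();
        for (char c : letters) {
            alphabet.append(c);
        }
        return alphabet.toString();
    }

    public static void main(String[] args) {
        // Inline sample content; in practice the text could come from
        // Files.readString(Path.of(...)) pointing at the .user-words file.
        List<String> words = parseWords("abc\n\n  fig12 \r\nbad3\n");
        System.out.println(words);
        System.out.println(deriveLetterSet(words));
    }
}
```

The string returned by deriveLetterSet could presumably be passed as the second argument of the setLetterSet call shown above instead of a hand-written literal, though that has not been tested against the engine here.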

Hope it helps!