Setting Regex and Custom Dictionary for Different Regions in an Image in Java

Barry Choi posted this 21 February 2018

I need to define one region that uses a regex and another region that uses a custom dictionary in the same image. For the regex region, I tried to implement the logic from the section `How to attach a dictionary to a recognition language` in the user guide, as shown below, but it does not affect the result at all. Is the following code snippet correct?

IRecognizerParams recognizerParams = engine.CreateRecognizerParams();
ILanguageDatabase languageDatabase = engine.CreateLanguageDatabase();
ITextLanguage textLanguage = languageDatabase.CreateTextLanguage();
IBaseLanguages baseLanguages = textLanguage.getBaseLanguages();
IBaseLanguage baseLanguage = baseLanguages.AddNew();
IDictionaryDescriptions dictionaryDescriptions = baseLanguage.getDictionaryDescriptions();
IDictionaryDescription dictionaryDescription = dictionaryDescriptions.AddNew(DictionaryTypeEnum.DT_RegularExpression);
IRegExpDictionaryDescription regExpDictionaryDescription = dictionaryDescription.GetAsRegExpDictionaryDescription();
regExpDictionaryDescription.SetText("(((|0)[1-9])|([12][0-9])|(30)|(31))\\-(((|0)[1-9])|(10)|(11)|(12))\\-((((19)|(20))[0-9][0-9])|([0-9][0-9]))");
baseLanguage.setAllowWordsFromDictionaryOnly(true);
// baseLanguage.setLetterSet(type, result); // no idea what the result parameter should be
recognizerParams.setTextLanguage(textLanguage);
region.AddRect(0, 100, 500, 125);
region.AddRect(0, 200, 500, 225);
document.getPages().getElement(0).getLayout().getBlocks().AddNew(BlockTypeEnum.BT_Text, region, 0);
document.Recognize( null, null );
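
As a quick sanity check of the pattern itself, independent of the OCR engine, the same pattern text can be run through java.util.regex. This is only a sketch: the engine's regular-expression dialect may differ from Java's, so it verifies the intent of the pattern (dd-mm-yyyy or dd-mm-yy dates), not how the engine will interpret it. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class DatePatternCheck {
    // Same pattern text as passed to SetText above.
    // java.util.regex is only a stand-in for the engine's own regex dialect.
    static final Pattern DATE = Pattern.compile(
        "(((|0)[1-9])|([12][0-9])|(30)|(31))\\-(((|0)[1-9])|(10)|(11)|(12))\\-((((19)|(20))[0-9][0-9])|([0-9][0-9]))");

    // True only if the whole string is a date in the intended format.
    static boolean isDate(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"01-02-2018", "1-2-18", "32-01-2018", "31-13-2018"};
        for (String s : samples) {
            System.out.println(s + " -> " + isDate(s));
        }
    }
}
```

The first two samples satisfy the pattern; the last two are rejected because of the out-of-range day and month.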

For the custom dictionary, we have a word list in Tesseract's .user-words format (one word per line). What is the proper way to consume the .user-words file?
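
Whatever the engine-side call turns out to be, the one-word-per-line file first has to be read into memory. A minimal sketch of that step in plain Java follows. It assumes the file is UTF-8 and that blank lines and duplicates can be dropped. The class name, method names, and the default file name `eng.user-words` are hypothetical, and handing the words to the OCR engine is deliberately left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UserWordsLoader {
    // Trims each line, skips blanks, and removes duplicates while keeping file order.
    static List<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return new ArrayList<>(words);
    }

    // Reads a one-word-per-line file (assumed to be UTF-8).
    static List<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default file name; pass the real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "eng.user-words");
        List<String> words = load(file);
        System.out.println(words.size() + " words loaded");
    }
}
```

The resulting list can then be passed, word by word, to whichever dictionary-building interface the engine provides.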

Thanks very much.

Related topic: https://forum.ocrsdk.com/thread/how-to-only-recognize-specified-region-of-the-image-in-java/

Barry Choi posted this 22 February 2018

From the interface file IFRDocument, it seems that only document.Analyze accepts recognizerParams as input, so I added

document.Analyze(null, null, recognizerParams);

before the document.Recognize( null, null ); statement.

During execution, the following error occurred:

The page cannot be analyzed. No basic languages with an alphabet are available. Please specify an alphabet.

Having searched for 'alphabet' in the user manual and interface files, I could not find any clue to resolve this.

May I know if I'm on the right track?

Thanks very much.

Denis Gusak posted this 01 March 2018

Hi!

Firstly, if you add regions manually using the AddNew() method, you also have to specify recognition parameters for each of them manually:

document.getPages().getElement(0).getLayout().getBlocks().getElement(0).GetAsTextBlock().getRecognizerParams().setTextLanguage(textLanguage);

Secondly, when creating a new BaseLanguage object, it is necessary not only to create and set dictionaries but also to set an alphabet via the setLetterSet method:

baseLanguage.setLetterSet(BaseLanguageLetterSetEnum.BLLS_Alphabet, "abcdefghi123456");
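
The alphabet string above is only an illustration. One way to arrive at a suitable string is to collect every distinct character from a few sample values (or from the dictionary words) that the region is expected to contain. The sketch below does that in plain Java. It assumes the letter set should cover every character that can appear in the block, which is inferred from the error message rather than confirmed by the documentation. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class AlphabetBuilder {
    // Collects the distinct characters of all samples, sorted by character code.
    static String alphabetOf(List<String> samples) {
        TreeSet<Character> chars = new TreeSet<>();
        for (String sample : samples) {
            for (char c : sample.toCharArray()) {
                chars.add(c);
            }
        }
        StringBuilder alphabet = new StringBuilder();
        for (char c : chars) {
            alphabet.append(c);
        }
        return alphabet.toString();
    }

    public static void main(String[] args) {
        // Two sample dates in the dd-mm-yyyy format from the question.
        List<String> samples = Arrays.asList("01-02-2018", "31-12-1999");
        System.out.println(alphabetOf(samples)); // prints "-012389"
    }
}
```

For the date pattern in the question, samples covering all ten digits would yield the digits plus the hyphen, so a string such as "0123456789-" is the likely candidate to pass as the second argument of setLetterSet.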

Hope it helps!
