Regular expressions in Arabic

Andreea OLaru posted this 4 weeks ago

Hello,

How can I write regular expressions for Arabic, specifically for numbers with a point and a comma, e.g. 1,021.00? (I used the processFields method with the regex in the XML file.) I tried to write it with Unicode code points, but it doesn't seem to work. Here's the regex:

 [U\+0660-U\+0669]+,[U\+0660-U\+0669]+\.[U\+0660-U\+0669]+

When I use this regex, the point and the comma are recognized, but for the digits there are only some "<<>><>" characters.

Thanks :) 

 

Oksana Serdyuk posted this 4 weeks ago

Please try the following regular expression:

(\d{1,3})((\,\d{3})*)(\.\d{1,2})?

I would also recommend limiting the characters used during recognition by means of the Digits recognition language and the letterSet element, for example:

<language>Digits</language>

<textType>normal</textType>

<letterSet>0123456789.,</letterSet>
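
The suggested expression can be checked locally before it goes into the XML file. Below is a minimal Python sketch, assuming the field must match the pattern as a whole; the helper name `is_amount` is made up for illustration, and `re.ASCII` is added so that `\d` covers only the digits 0-9, mirroring the letterSet above. The OCR service may interpret the expression differently from Python's `re` module.

```python
import re

# Pattern from the reply above: 1-3 leading digits, optional ",ddd" groups,
# and an optional fractional part of 1-2 digits.
# re.ASCII restricts \d to 0-9 (by default Python's \d also matches other
# Unicode decimal digits).
AMOUNT = re.compile(r"(\d{1,3})((\,\d{3})*)(\.\d{1,2})?", re.ASCII)


def is_amount(text: str) -> bool:
    """Return True when the whole string is a grouped amount such as 1,021.00."""
    return AMOUNT.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["1,021.00", "12", "1,02.00", "1021.00"]:
        print(sample, is_amount(sample))
```

Only the first two samples conform: "1,02.00" has a two-digit group after the comma, and "1021.00" has four digits without a separator.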

Oksana Serdyuk posted this 4 weeks ago

The Regex101 service should be helpful for checking your regular expressions.

Andreea OLaru posted this 4 weeks ago

Hey, 

For the Arabic digits ( ٠١٢٣٤٥٦٧٨٩ ) it doesn't work. I replaced the Western digits with the Arabic ones, and no luck either.

Oksana Serdyuk posted this 4 weeks ago

Then please try the following settings:

<language>Arabic</language>

<textType>normal</textType>

<letterSet>٠١٢٣٤٥٦٧٨٩,.</letterSet>

<regExp>(\p{N}{1,3})((\,\p{N}{3})*)(\.\p{N}{1,2})?</regExp>
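
For comparison, here is a small Python sketch of why the `[U\+0660-U\+0669]` form from the first post cannot match Arabic-Indic digits in a typical regex engine, and how an explicit range for U+0660 to U+0669 behaves. Python's built-in `re` module does not support `\p{N}`, so the explicit range stands in for it here; the variable names are illustrative only, and how the OCR service parses its own regExp syntax is not covered by this sketch.

```python
import re

# Inside [...], the text "U\+0660-U\+0669" is read character by character:
# 'U', '+', '0', '6', the range '0'-'U' (ASCII), and '9'.
# It therefore matches ASCII characters, not Arabic-Indic digits.
literal_class = re.compile(r"[U\+0660-U\+0669]+")

# Explicit range for the Arabic-Indic digits U+0660 .. U+0669.
AR = "[\u0660-\u0669]"
arabic_amount = re.compile(rf"{AR}{{1,3}}(,{AR}{{3}})*(\.{AR}{{1,2}})?")

# "1,021.00" written with Arabic-Indic digits, spelled with escapes
# to avoid bidirectional display issues in editors.
sample = "\u0661,\u0660\u0662\u0661.\u0660\u0660"

if __name__ == "__main__":
    print(literal_class.fullmatch(sample) is not None)    # Arabic digits vs. literal class
    print(literal_class.fullmatch("U+0660") is not None)  # the literal text itself
    print(arabic_amount.fullmatch(sample) is not None)    # explicit range
```

The literal class accepts the text "U+0660" itself but rejects the Arabic-Indic sample, while the explicit range accepts the sample.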

Andreea OLaru posted this 3 weeks ago

Nope, doesn't work. Where there is a comma, the OCR reads it as a م in most cases.

Andreea OLaru posted this 3 weeks ago

Also, could you suggest a solution for the same problem using ABBYY FineReader 14? I've tried pattern training and also creating a new custom language with regular expressions to work alongside the program's Arabic language.

The problem is that the program does not distinguish between a point ( . ) and a zero ( which in Arabic is ٠ ).

Thank you!

Oksana Serdyuk posted this 3 weeks ago

Could you please share your images for which the issues can be reproduced?

Andreea OLaru posted this 3 weeks ago

 

This would be the image. All the others have the same template. The only issue is that the output doesn't distinguish between a 0 and a point; besides this, it is very accurate.

Andreea OLaru posted this 3 weeks ago

The PDF has better quality, though. An example of a number would be this:

 

Oksana Serdyuk posted this 3 weeks ago

Thank you for this information!

Regarding the first image, its quality is very poor and it cannot be used for OCR: the resolution is low, the image is blurred, and the text is fuzzy. Even a human cannot read the text in it. Possibly the image was degraded when it was attached to this post.

If you manage to improve the quality of the input images in accordance with the Best Practices article, the recognition results might be better. Please try it.

Concerning the second text fragment, I will test it in Cloud OCR SDK and write to you later. The support specialists for ABBYY desktop products should send you some recommendations about using FR 14 by email.

Oksana Serdyuk posted this 6 days ago

Sorry for the delay. I've reproduced the issue with a point and an Arabic zero using the image fragment. I've created the corresponding defect report and sent the information to our R&D Department for further investigation. This is a really difficult case, because these characters are very similar, and our OCR technology mixes them up.

Regular expressions do not help here, because they do not strictly limit the set of characters in the output, i.e. the recognized value may contain characters that are not included in the regular expression. During recognition, all hypotheses for a word are checked against the specified regular expression. If a given recognition variant conforms to the expression, it has a higher probability of being selected as the final recognition output. But if no variant matches the regular expression, the result will not conform to the expression.
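
Because the expression only influences which hypothesis is chosen, the returned text may still need to be checked on the client side. A minimal Python sketch of such a check follows, assuming the field is expected to contain only Arabic-Indic digits, a comma and a point; `check_field` and its return shape are hypothetical and not part of any ABBYY API.

```python
import re

# Characters the field is expected to contain: Arabic-Indic digits, ',' and '.'.
ALLOWED = {chr(code) for code in range(0x0660, 0x066A)} | {",", "."}

AR = "[\u0660-\u0669]"
EXPECTED = re.compile(rf"{AR}{{1,3}}(,{AR}{{3}})*(\.{AR}{{1,2}})?")


def check_field(text: str) -> tuple[bool, list[str]]:
    """Return (conforms, unexpected_chars) for a recognized field value.

    conforms is True only when the whole value matches the expected pattern;
    unexpected_chars lists characters outside the allowed set, in order.
    """
    unexpected = [ch for ch in text if ch not in ALLOWED]
    return EXPECTED.fullmatch(text) is not None, unexpected


if __name__ == "__main__":
    good = "\u0661,\u0660\u0662\u0661.\u0660\u0660"    # digits, comma, point
    bad = "\u0661\u0645\u0660\u0662\u0661.\u0660\u0660"  # comma read as the letter meem (U+0645)
    print(check_field(good))
    print(check_field(bad))
```

Values that fail the check could be routed to manual review. Note that this does not resolve the point versus Arabic zero confusion itself, since both characters are in the allowed set.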

 

Andreea OLaru posted this 6 days ago

Thank you very much for the answer and effort Oksana.  I'm looking forward to your solution! :) 
