Regular Expressions in Arabic

  • Last Post 12 October 2017
Andreea OLaru posted this 20 September 2017

Hello,

How can I write regular expressions for the Arabic language, specifically for numbers containing a period and a comma, e.g. 1,021.00? (I used the processFields method with the regex in the XML file.) I tried to write it with Unicode code points, but it doesn't seem to work. Here's the regex:

 [U\+0660-U\+0669]+,[U\+0660-U\+0669]+\.[U\+0660-U\+0669]+

When I use this regex, the period and the comma are recognized, but for the digits there were only some "<<>><>" characters.

Thanks :) 
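Editor's sketch: one possible reason the posted class misbehaves is that `U\+0660` inside square brackets is not a code point escape in common regex dialects. The Python check below is only an illustration, on the assumption that the SDK's dialect treats the class similarly, which the thread does not confirm. In Python's `re`, the class is read as the literal characters `U`, `+`, `0`, `6`, `9` plus the ASCII range `0`-`U`, which happens to include `<` and `>`. The names `posted_class`, `arabic_indic_digit` and `matches` are hypothetical.

```python
import re

# The class exactly as posted. Python's `re` reads it as literal
# characters plus the ASCII range '0'..'U', not as U+0660..U+0669.
posted_class = re.compile(r"[U\+0660-U\+0669]")

# The intended range, written with Python's \uXXXX escapes
# (ARABIC-INDIC DIGIT ZERO through ARABIC-INDIC DIGIT NINE).
arabic_indic_digit = re.compile(r"[\u0660-\u0669]")


def matches(pattern: re.Pattern, ch: str) -> bool:
    """Return True if the single character `ch` is accepted by `pattern`."""
    return pattern.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in ["\u0663", "7", "<", ">", "U", "+"]:
        print(repr(ch), matches(posted_class, ch), matches(arabic_indic_digit, ch))
```

The escape syntax for code points differs between regex dialects, so the `\u0660` form shown here is Python's and may not be what the SDK expects.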

 

Oksana Serdyuk posted this 21 September 2017

Please try the following regular expression:

(\d{1,3})((\,\d{3})*)(\.\d{1,2})?

I would also recommend limiting the characters that can be used during recognition by setting the Digits recognition language and the letterSet element, for example:

<language>Digits</language>

<textType>normal</textType>

<letterSet>0123456789.,</letterSet>
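Editor's sketch: the suggested expression can be checked offline before it goes into the XML file. The snippet below uses Python's `re`, which is an assumption made for illustration only, since the SDK's regex dialect may differ. In Python 3, `\d` on text patterns also accepts Arabic-Indic digits unless `re.ASCII` is set. The helper name `is_amount` is hypothetical.

```python
import re

# The expression suggested above, unchanged.
AMOUNT = r"(\d{1,3})((\,\d{3})*)(\.\d{1,2})?"


def is_amount(text: str, ascii_only: bool = False) -> bool:
    """Return True if the whole string matches the amount pattern.

    With ascii_only=True, \\d is limited to the ASCII digits 0-9.
    """
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(AMOUNT, text, flags) is not None


if __name__ == "__main__":
    samples = ["1,021.00", "1021.00", "\u0661,\u0660\u0662\u0661.\u0660\u0660"]
    for sample in samples:
        print(sample, is_amount(sample), is_amount(sample, ascii_only=True))
```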

Oksana Serdyuk posted this 21 September 2017

The Regex101 service should be helpful for checking your regular expressions.

Andreea OLaru posted this 22 September 2017

Hey, 

For the Arabic digits ( ٠١٢٣٤٥٦٧٨٩ ) it doesn't work. I replaced the normal digits with the Arabic ones, and no luck either.

Oksana Serdyuk posted this 22 September 2017

Then please try the following settings:

<language>Arabic</language>

<textType>normal</textType>

<letterSet>٠١٢٣٤٥٦٧٨٩,.</letterSet>

<regExp>(\p{N}{1,3})((\,\p{N}{3})*)(\.\p{N}{1,2})?</regExp>
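Editor's sketch: in Unicode-aware regex dialects, `\p{N}` usually means the general category Number (Nd, Nl and No), which is broader than the ten Arabic-Indic digits. Whether the SDK interprets `\p{N}` this way is an assumption. Python's built-in `re` has no `\p{...}` support, so the check below uses `unicodedata` instead, and the helper name `in_number_category` is hypothetical.

```python
import unicodedata


def in_number_category(ch: str) -> bool:
    """Return True if `ch` has a Unicode general category starting with N
    (Nd = decimal digit, Nl = letter number, No = other number)."""
    return unicodedata.category(ch).startswith("N")


if __name__ == "__main__":
    # Arabic-Indic three, ASCII seven, superscript two, period, Arabic meem
    for ch in ["\u0663", "7", "\u00b2", ".", "\u0645"]:
        print(repr(ch), unicodedata.category(ch), in_number_category(ch))
```

Under this reading, ASCII digits and characters such as superscript two would also satisfy `\p{N}`, so the letterSet element is what actually narrows the alphabet to Arabic-Indic digits.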

Andreea OLaru posted this 02 October 2017

Nope, it doesn't work. Where there is a comma, the OCR reads it as a م in most cases.

Andreea OLaru posted this 02 October 2017

Also, could you suggest a solution for the same problem using ABBYY FineReader 14? I've tried pattern training and also creating a new custom language with regular expressions to work alongside the program's Arabic language.

The problem is that the program does not distinguish between a period ( . ) and a zero ( which in Arabic is ٠ ).

Thank you!

Oksana Serdyuk posted this 03 October 2017

Could you please share your images for which the issues can be reproduced?

Andreea OLaru posted this 03 October 2017

 

This is the image. All the others use the same template. The only issue is that the output doesn't distinguish between 0 and a period; apart from this, it is very accurate.

Andreea OLaru posted this 03 October 2017

The PDF has better quality, though. An example of a number would be this:

 

Oksana Serdyuk posted this 03 October 2017

Thank you for this information!

Regarding the first image, its quality is very poor, and it cannot be used for OCR. The resolution is low, the image is blurred, and the text is fuzzy. Even a human cannot read the text in it. Possibly the image quality degraded when it was attached to this post.

If you manage to improve the quality of the input images in accordance with the Best Practices article, the recognition results might be better. Please try it.

Concerning the second text fragment, I will test it in Cloud OCR SDK and write to you later. The support specialists for ABBYY desktop products should send you some recommendations about using FR 14 by email.

Oksana Serdyuk posted this 12 October 2017

Sorry for the delay. I've reproduced the issue with a period and an Arabic zero using the image fragment. I've filed the corresponding issue report and sent the information to our R&D Department for further investigation. This is really a difficult case, because these characters are very similar, and our OCR technology mixes them up.

The regular expressions do not help, because they do not strictly limit the set of characters in the output, i.e. the recognized value may contain characters that are not included in the regular expression. During recognition, all recognition hypotheses for a word are checked against the specified regular expression. If a given recognition variant conforms to the expression, it has a higher probability of being selected as the final recognition output. But if no variant matches the regular expression, the result will not conform to it.
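Editor's sketch: since the expression only biases the choice between hypotheses, one possible workaround is to validate the returned text yourself and send non-conforming values to manual review. The snippet below is a minimal illustration in Python, not an SDK feature. The pattern `STRICT_AMOUNT` and the helper `needs_review` are hypothetical names, and the pattern assumes amounts shaped like the 1,021.00 example from the first post.

```python
import re

# Strict client-side pattern: Arabic-Indic digits only, comma as the
# thousands separator, optional period followed by one or two decimals.
DIGIT = r"[\u0660-\u0669]"
STRICT_AMOUNT = re.compile(
    rf"{DIGIT}{{1,3}}(?:,{DIGIT}{{3}})*(?:\.{DIGIT}{{1,2}})?"
)


def needs_review(recognized: str) -> bool:
    """Return True if the recognized text does not fully match the
    expected amount format and should be checked by a person."""
    return STRICT_AMOUNT.fullmatch(recognized.strip()) is None


if __name__ == "__main__":
    samples = [
        "\u0661,\u0660\u0662\u0661.\u0660\u0660",  # well-formed amount
        "\u0661\u0645\u0660\u0662\u0661.\u0660\u0660",  # comma read as meem
        "\u0661,\u0660\u0662\u0661\u0660\u0660\u0660",  # period read as zero
    ]
    for sample in samples:
        print(sample, needs_review(sample))
```

A check like this only catches confusions that break the format. A period and a zero swapped in a way that still produces a well-formed amount would pass unnoticed.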

 

Andreea OLaru posted this 12 October 2017

Thank you very much for the answer and the effort, Oksana. I'm looking forward to your solution! :)
