[ProcessTextField] OCR Regex doesn't work

  • Last Post 17 January 2017
Dang Vinh posted this 23 December 2016

Hello,

We currently use the Process Text Fields API of the Cloud OCR API to recognize our application form. I defined template settings and regions for OCR, but the results returned by the API do not match the regex in my templates. My settings are below; please take a look and help us.

Thanks in advance for your help!

Ex:

  1. Settings:

<text id="phone">
  <language>English</language>
  <letterset>0123456789</letterset>
  <regexp>([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])</regexp>
  <texttype>handprinted</texttype>
  <placeholderscount>11</placeholderscount>
  <markingtype>partitionedFrame</markingtype>
  <onetextline>true</onetextline>
  <onewordpertextline>true</onewordpertextline>
</text>

  2. Results:
     Phone: 12832537427212 (Expected: 8325342122)
     Date: 4217212056/ (Expected: 12/21/2016)

Oksana Serdyuk posted this 23 December 2016

Please share your image, the processing settings you used, and your Application ID. Kindly send this information to CloudOCRSDK@abbyy.com.

Dang Vinh posted this 24 December 2016

Hi Oksana Serdyuk, I have already sent you the email. Thank you!

Oksana Serdyuk posted this 27 December 2016

Hi, I have received your message. Your settings are fine. I have reproduced the issue and am now consulting with the developers. I will let you know as soon as I get their answer.

Oksana Serdyuk posted this 27 December 2016

Could you please explain how critical this issue is for you?

Also, please specify what volumes you plan to process using ABBYY Cloud OCR SDK.

What is your usage scenario?

Dang Vinh posted this 13 January 2017

Hi Oksana Serdyuk, sorry for the late reply. We developed a system for our client, so this is our LIVE product. Please help us get this resolved as soon as possible.

Here is our purchase history: "Volume Pack L (5000 pages) for Application TLS-Enrollment 14 Nov 2016 42686-00003 $199.99"

Thanks in advance!

Oksana Serdyuk posted this 16 January 2017

Hi, I am consulting with the developers regarding this issue now. I will let you know about the progress.

Oksana Serdyuk posted this 16 January 2017

Sorry for the delay. Our team has investigated the issue and concluded that this is not a bug; the behavior is due to the peculiarities of our recognition technology.

Note that the regular expression and the placeholdersCount parameter do not strictly limit the characters in the output, i.e. the recognized value may contain characters that are not covered by the regular expression, and it may contain more or fewer characters than specified in placeholdersCount. These parameters serve to make detection and recognition of the text field more accurate.
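Since these parameters only guide recognition, a client that needs a strict format has to check the returned value itself. Below is a minimal sketch of such a check in Python. The helper name `validate_field` and its return shape are hypothetical and not part of ABBYY Cloud OCR SDK; the 11-digit pattern mirrors the phone field settings from this thread.

```python
import re


def validate_field(value, pattern, expected_length=None):
    """Return a list of problems found in a recognized value (empty if none).

    Hypothetical client-side helper; not part of ABBYY Cloud OCR SDK.
    """
    problems = []
    if expected_length is not None and len(value) != expected_length:
        problems.append(
            "expected %d characters, got %d" % (expected_length, len(value))
        )
    # fullmatch requires the whole value to match, not just a prefix.
    if re.fullmatch(pattern, value) is None:
        problems.append("value does not match pattern %r" % pattern)
    return problems


if __name__ == "__main__":
    # Illustrative 11-digit value: no problems reported.
    print(validate_field("01234567890", r"[0-9]{11}", 11))
    # Over-long phone value reported in this thread: flagged for review.
    print(validate_field("12832537427212", r"[0-9]{11}", 11))
```

A value that fails the check could be routed to manual review instead of being stored as-is.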

In this particular case, the issue arises because the field markup is destroyed during binarization and therefore is not detected properly. As a result, the recognized value contains extra characters, most of which are "1" (a markup border is recognized as "1" if it was not properly removed).

The image after binarization is the following:

[Image: the field after binarization]

Our developers recommend increasing the brightness during scanning so that the resulting image is lighter.

It is also recommended to set the field region as tightly as possible around the field. For example, if we process the "credit_card_number" text field with the following settings:

...
  <fieldTemplates>
    <text id="credit_card_number" bottom="0" left="0" right="0" top="0">
      <language>Digits</language>
      <letterSet>0123456789</letterSet>
      <textType>handprinted</textType>
      <oneTextLine>true</oneTextLine>
      <oneWordPerTextLine>true</oneWordPerTextLine>
      <markingType>partitionedFrame</markingType>
      <placeholdersCount>16</placeholdersCount>
    </text>
  </fieldTemplates>
  <page applyTo="0">
    <!--Credit Card-->
    <text id="credit_card_number" bottom="562" right="1361" top="493" left="72" template="credit_card_number"/>
    <!--End Credit Card-->
  </page>
</document>


it is recognized accurately:

<text bottom="562" right="1361" top="493" left="72" id="credit_card_number">
    <value>4373740000796405</value>
    <line bottom="551" right="1344" top="494" left="86">
        <char bottom="551" right="141" top="499" left="86">4</char>
        <char bottom="551" right="198" top="497" left="173">3</char>
        <char bottom="546" right="295" top="494" left="243" suspicious="true">7</char>
        <char bottom="551" right="374" top="501" left="336">3</char>
        <char bottom="545" right="476" top="502" left="415" suspicious="true">7</char>
        <char bottom="551" right="535" top="503" left="499">4</char>
        <char bottom="539" right="604" top="503" left="577">0</char>
        <char bottom="540" right="685" top="499" left="657">0</char>
        <char bottom="541" right="771" top="508" left="745">0</char>
        <char bottom="540" right="849" top="511" left="818">0</char>
        <char bottom="551" right="944" top="505" left="889" suspicious="true">7</char>
        <char bottom="550" right="1017" top="501" left="978">9</char>
        <char bottom="551" right="1092" top="506" left="1062">6</char>
        <char bottom="551" right="1190" top="511" left="1135">4</char>
        <char bottom="551" right="1254" top="509" left="1225">0</char>
        <char bottom="551" right="1344" top="507" left="1299">5</char>
    </line>
</text>
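The suspicious="true" attribute in this output marks characters that the engine was unsure about, so a client can use it to decide which values need a second look. The following is a rough sketch in Python using only the standard library. The function name `suspicious_chars` is hypothetical, the sample is an abbreviated fragment in the same shape as the output above, and the code assumes the fragment carries no XML namespace.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment shaped like the result above (first three characters only).
SAMPLE = """
<text bottom="562" right="1361" top="493" left="72" id="credit_card_number">
    <value>437</value>
    <line bottom="551" right="1344" top="494" left="86">
        <char bottom="551" right="141" top="499" left="86">4</char>
        <char bottom="551" right="198" top="497" left="173">3</char>
        <char bottom="546" right="295" top="494" left="243" suspicious="true">7</char>
    </line>
</text>
"""


def suspicious_chars(text_xml):
    """Return (value, [(index, char), ...]) for characters flagged as suspicious.

    Hypothetical helper; assumes a single <text> element without a namespace.
    """
    field = ET.fromstring(text_xml.strip())
    value = field.findtext("value", default="")
    flagged = []
    index = 0  # position of the character within the field, across all lines
    for line in field.iter("line"):
        for ch in line.iter("char"):
            if ch.get("suspicious") == "true":
                flagged.append((index, ch.text))
            index += 1
    return value, flagged


if __name__ == "__main__":
    value, flagged = suspicious_chars(SAMPLE)
    print(value, flagged)
```

Fields with one or more flagged positions could be shown to an operator for confirmation before the value is accepted.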


Dang Vinh posted this 17 January 2017

Hi Oksana Serdyuk. Thanks for your help! I will work with my team to improve the image quality and field settings.
