Extracted text often not accurate

prw56 posted this 17 October 2017

Source Code:

 

engineLoader.Engine.LoadPredefinedProfile("DocumentConversion_Accuracy");

//create document
FR.FRDocument document = engineLoader.Engine.CreateFRDocument();

//get and add screenshot
System.Drawing.Image screenShot = this.GetScreenShot();

using (MemoryStream m = new MemoryStream())
{
    screenShot.Save(m, System.Drawing.Imaging.ImageFormat.Png);
    m.Position = 0;
    document.AddImageFileFromStream(new ABBYReadStream(m));
}

//process and synthesize
document.Process();
document.Synthesize();

//find the text
int posX = 0;
int posY = 0;
for (int x = 0; x < document.Pages.Count; x++)
{
    FR.LayoutBlocks blocks = document.Pages[x].Layout.Blocks;
    for (int y = 0; y < blocks.Count; y++)
    {
        FR.IBlock block = blocks[y];
        if (block.Type == FR.BlockTypeEnum.BT_Text)
        {
            FR.TextBlock textBlock = block.GetAsTextBlock();
            for (int z = 0; z < textBlock.Text.Paragraphs.Count; z++)
            {
                //need to use the options & regex in UIAutomationHelper
                FR.Paragraph paragraph = textBlock.Text.Paragraphs[z];
                if (paragraph.Text != text)
                    continue;

                //find middle point of text
                posX = paragraph.Left + (paragraph.Right - paragraph.Left) / 2;
                posY = paragraph.Top + (paragraph.Bottom - paragraph.Top) / 2;
            }
        }
    }
}

I've been having issues with the accuracy of extracted text. I exported the attached PDF, which shows one example of text being extracted incorrectly (instead of "Click", it always extracts "3233").

I think the highlighting on the text could be causing this. I have also noticed that surrounding the word "Click" with random letters often causes incorrect text to be extracted (for example, with "awdawdaClickfawd", "Click" does not appear anywhere in the extracted text string).

Also, once in a while no text is extracted at all, only RasterImage blocks, even though I always process and synthesize the document.

I have tried the TextExtraction_Accuracy profile, and I have tried setting EnableAggressiveTextExtraction to false.

Is there some way to increase the accuracy of extracted text?

Oksana Serdyuk posted this 20 October 2017

The reason for this issue is that the quality of your image is very poor. Please look at the binarized image:

"Click" is unreadable, so the program cannot recognize this word either.

Kindly review our tips on scanning or photographing documents to achieve the best recognition results, in the following articles of our Help file: Help→Best practices→Source Image Recommendations and Help→Best practices→Improving Recognition Quality.

prw56 posted this 20 October 2017

Thank you for the response! This image was actually a screenshot of an application; it was not scanned from anything physical. Is it possible that the export to PDF degraded it?

Here is the image itself that was OCRed:

Does this image have the same issue when binarized?

Oksana Serdyuk posted this 4 weeks ago

This image is still of poor quality, and this is not a common scenario for our OCR technologies. Unfortunately, because of this I cannot find settings that extract all the text from the screenshot accurately: either "Form1" is extracted or "Click" is recognized, but never both together.

Generally, ABBYY OCR technologies are mostly used for recognizing scans of documents and photos containing text. They have not been specially trained for processing screenshots.

prw56 posted this 4 weeks ago

What settings did you use to get Click to be recognized?

Oksana Serdyuk posted this 4 weeks ago

I used the following combination of settings:

  • predefined profile: TextExtraction_Accuracy
  • profile.ini:

    [PrepareImageMode]
    InvertImage = TRUE
    UseFastBinarization = TRUE
    EnhanceLocalContrast = TRUE

and received the following recognition result:

»       lr3HIO 
               
Click          
               
         L    

prw56 posted this 4 weeks ago

I assume that InvertImage and EnhanceLocalContrast are there to make up for the image quality and the text being highlighted, but what is the purpose of UseFastBinarization?

Oksana Serdyuk posted this 4 weeks ago

When you use the UseFastBinarization property, the binarization is different:

In my experience, this property often works well with low-quality color images such as screenshots. However, there is no guarantee that these settings will also be acceptable for another image. Basically, you should find your optimal quality and recognition settings by experimenting with the most typical images you are going to recognize.
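One way to organize that kind of experimenting is a small sweep over the three [PrepareImageMode] flags mentioned in this thread. The sketch below is only an illustration under stated assumptions: `PrepareSettings`, `SettingsSweep`, and the `recognize` delegate are hypothetical names, not part of the ABBYY API. The delegate is where the actual engine calls would go (apply the flags, process the document, return the extracted text); the sketch itself only enumerates combinations and scores the returned text against strings you expect to find.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for the three [PrepareImageMode] flags discussed in this thread.
public sealed class PrepareSettings
{
    public bool InvertImage { get; set; }
    public bool UseFastBinarization { get; set; }
    public bool EnhanceLocalContrast { get; set; }

    public override string ToString() =>
        $"InvertImage={InvertImage}, UseFastBinarization={UseFastBinarization}, EnhanceLocalContrast={EnhanceLocalContrast}";
}

public static class SettingsSweep
{
    // Enumerates all 2^3 = 8 on/off combinations of the three flags.
    public static IEnumerable<PrepareSettings> AllCombinations()
    {
        for (int mask = 0; mask < 8; mask++)
        {
            yield return new PrepareSettings
            {
                InvertImage = (mask & 1) != 0,
                UseFastBinarization = (mask & 2) != 0,
                EnhanceLocalContrast = (mask & 4) != 0
            };
        }
    }

    // Counts how many of the expected strings occur in the recognized text (case-sensitive).
    public static int Score(string recognizedText, IEnumerable<string> expected) =>
        expected.Count(word => recognizedText.IndexOf(word, StringComparison.Ordinal) >= 0);

    // Calls the caller-supplied OCR delegate once per combination and returns the
    // highest-scoring one. On a tie, the combination enumerated first wins.
    public static PrepareSettings FindBest(
        Func<PrepareSettings, string> recognize,
        IReadOnlyCollection<string> expected)
    {
        PrepareSettings best = null;
        int bestScore = -1;
        foreach (PrepareSettings settings in AllCombinations())
        {
            int score = Score(recognize(settings) ?? string.Empty, expected);
            if (score > bestScore)
            {
                best = settings;
                bestScore = score;
            }
        }
        return best;
    }
}
```

For the screenshot in this thread, the expected strings could be "Form1" and "Click". Running the sweep over several typical screenshots and summing the scores per combination would give a less image-specific answer than a single run.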
