Text is ignored half the time

prw56 posted this 2 weeks ago

version: 11.1.19.72

I run this code with the EngineLoader from the sample code:

engineLoader.Engine.LoadPredefinedProfile("DocumentConversion_Accuracy");

//create document
FR.FRDocument document = engineLoader.Engine.CreateFRDocument();

//get and add screenshot
System.Drawing.Image screenShot = this.GetScreenShot();

using (MemoryStream m = new MemoryStream())
{
    screenShot.Save(m, System.Drawing.Imaging.ImageFormat.Png);
    m.Position = 0;
    document.AddImageFileFromStream(new ABBYReadStream(m));
}

//process and synthesize
document.Process();
document.Synthesize();

//find the text
int posX = 0;
int posY = 0;
for (int x = 0; x < document.Pages.Count; x++)
{
    FR.LayoutBlocks blocks = document.Pages[x].Layout.Blocks;
    for (int y = 0; y < blocks.Count; y++)
    {
        FR.IBlock block = blocks[y];
        if (block.Type == FR.BlockTypeEnum.BT_Text)
        {
            FR.TextBlock textBlock = block.GetAsTextBlock();
            for (int z = 0; z < textBlock.Text.Paragraphs.Count; z++)
            {
                //need to use the options & regex in UIAutomationHelper
                FR.Paragraph paragraph = textBlock.Text.Paragraphs[z];
                if (paragraph.Text != text)
                    continue;

                //find middle point of text
                posX = paragraph.Left + (paragraph.Right - paragraph.Left) / 2;
                posY = paragraph.Top + (paragraph.Bottom - paragraph.Top) / 2;
            }
        }
    }
}

The image I add to the document is the screenshot provided, but half the time the parts circled in red are not found in the text after document synthesis takes place. I have also tried using the "Default" engine profile.

Any ideas why these parts of the image are sometimes ignored?

Edit: I have also verified that the whole image is added to the document by exporting the document afterwards as a PDF, so it's not cutting off part of the image.

Koen de Leijer posted this 2 weeks ago

This looks like the same question I asked recently.

Have you already checked the answer here? https://forum.ocrsdk.com/thread/some-parts-of-a-specific-pdf-are-not-ocr-ed-by-abbyy-finereader-engine/

prw56 posted this 6 days ago

I had not seen that answer, thank you for pointing it out. Switching to a profile where the parameters mentioned in that thread are set to true (in my case DocumentArchiving_Accuracy) solved the issue.
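
For anyone landing here with the same symptom, the change amounts to swapping the profile name in the first line of the original snippet. A minimal sketch, assuming the same engineLoader object as in the first post; DocumentArchiving_Accuracy is simply the profile that worked in this case, and other inputs may need a different one:

```csharp
// Assumes engineLoader is the EngineLoader instance from the sample code,
// exactly as in the original post.

// Before: with this profile, some text regions were skipped intermittently.
// engineLoader.Engine.LoadPredefinedProfile("DocumentConversion_Accuracy");

// After: the profile reported in this thread to resolve the issue.
engineLoader.Engine.LoadPredefinedProfile("DocumentArchiving_Accuracy");
```

The rest of the code (creating the document, adding the screenshot, Process and Synthesize) stays unchanged.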
