I'm processing some PDF documents. In some, all the important content is text and I don't use ABBYY; in others, everything is graphical and I do use ABBYY.

The problem is the hybrid.

By that I mean that a lot of what I need is proper, easily extracted text (I'm on Linux, using Poppler/pdftotext). But some of the textual information is graphical, a column heading of "Quantity" for example. This is probably produced by someone using a scan of their old form as a background image for the PDF pages.

What I'd like to do is send up the PDF, have ABBYY convert the graphics, leave the text alone, and get an "improved" searchable PDF back. I could then discard the original and go through my normal processing. Unfortunately the library does something that seems rather odd: it renders the textual content to graphics and then OCRs that as well.

Apart from this seeming a waste of resources, the big problem is that it sometimes gets things wrong. The fields that were pristine, proper text become slightly garbled.

I hope I've explained that well enough.

I'm getting round this at the moment as follows: I use Poppler/pdftohtml to convert to HTML, scan the files to find the background images of the generated HTML (which don't contain the text), ship those background images up, OCR them using processDocument, and pull them back as searchable PDFs (not using the XML output because I need words, not letters). I then do my normal processing on both my original PDF and the new one and merge the results.
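For anyone wanting to reproduce the first half of that workaround, here is a minimal Python sketch. It assumes `pdftohtml` is on the PATH and that its `-c` (complex output) mode writes the page images next to the generated HTML. The function names and the `page` output prefix are my own, and which image files are actually backgrounds may depend on the Poppler version, so check the output before shipping anything to the OCR service.

```python
import subprocess
from pathlib import Path

# Image types we expect pdftohtml to write (assumption: PNG or JPEG).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pdftohtml_command(pdf_path, out_prefix):
    """Build the pdftohtml call; -c requests complex output with page images."""
    return ["pdftohtml", "-c", str(pdf_path), str(out_prefix)]


def find_page_images(out_dir):
    """Return the image files found in out_dir, sorted by name."""
    return sorted(
        p for p in Path(out_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def extract_backgrounds(pdf_path, out_dir):
    """Run pdftohtml into out_dir and list the images it produced."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # "page" is a hypothetical prefix for the generated files.
    subprocess.run(pdftohtml_command(pdf_path, out_dir / "page"), check=True)
    return find_page_images(out_dir)
```

The images returned by `extract_backgrounds()` are the candidates to send for OCR; the original PDF stays untouched for the normal pdftotext pass.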

It sort of works but it's far from ideal.

Is this a bug, or should I raise it as a suggestion?

asked 21 Feb '13, 20:18

AndyA

Can you please provide a sample image and the name and version of the OCR product you use?

(21 Feb '13, 21:24) Andrey Isaev ♦♦

Hi, yes. Client confidentiality unfortunately prevents me from shipping the PDF in its entirety in any way, but I can show you a snippet.

The following grab from the original PDF shows the word "Details" and a black border, both of which are part of a large background image; the text beneath it, starting "DRIVERS ...", is real text in the PDF.

[Screenshot: snippet of the original PDF page]

Just as further confirmation, if I use the Poppler tool pdftohtml to get just the page background image, the same snippet is as follows:

[Screenshot: the same snippet in the background image extracted by pdftohtml]

A cut and paste of this area from the original PDF returns, as expected:

DRIVERS PLEASE COLLECT EMPTIES

I then use the Cloud OCR SDK with profile = textExtraction and outputFile = searchablePDF, and call submitImage(). It seems to OCR both the graphic "Details" (which it gets perfectly correct, no complaints) and the other text. Although the output PDF looks the same as the original, a cut and paste of the same area now returns the following:

Details
d r iv e r s please COLLECT EMPTIES
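This also suggests how the merge step of my workaround can be approximated. The sketch below is only an illustration with helper names of my own: it keeps every line from the original text layer and adds an OCR line only when it is not a re-read of an original line, comparing with whitespace and case removed. It assumes the garbling is limited to spacing and case, as in the example above; character substitutions would need a fuzzier comparison.

```python
def squash(text):
    """Remove all whitespace and fold case, so spacing/case garbling is ignored."""
    return "".join(text.split()).casefold()


def merge_lines(original_lines, ocr_lines):
    """Keep the original text-layer lines and append only genuinely new OCR lines."""
    merged = [line for line in original_lines if line.strip()]
    known = {squash(line) for line in merged}
    for line in ocr_lines:
        key = squash(line)
        # Skip blank lines and lines that duplicate original text.
        if key and key not in known:
            merged.append(line)
            known.add(key)
    return merged


original = ["DRIVERS PLEASE COLLECT EMPTIES"]
ocr = ["Details", "d r iv e r s please COLLECT EMPTIES"]
print(merge_lines(original, ocr))
```

With the two extracts above, the pristine "DRIVERS PLEASE COLLECT EMPTIES" line survives and only "Details" is taken from the OCR output. Note that the merged list does not preserve the on-page order of the lines.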

I hope that's clear.

answered 25 Feb '13, 18:31

AndyA