I'm processing some PDF documents. In some, all the important content is text and I don't use ABBYY; in others, everything is graphical and I do use ABBYY.
The problem is the hybrid.
By that I mean that much of what I need is proper, easily extracted text (I'm on Linux, using Poppler/pdftotext). But some of the textual information is graphical, for example a column heading of "Quantity". This is probably produced by someone using a scan of their old form as a background image for the PDF pages.
What I'd like to do is send up the PDF, have ABBYY convert the graphics and leave the text alone, get an "improved" searchable PDF back, discard the original, and then go through my normal processing. Unfortunately the library is doing something that seems rather odd: it renders the textual content to graphics and then OCRs that as well.
Apart from this seeming a waste of resources, the big problem is that it sometimes gets things wrong. Fields that were pristine, proper text become slightly garbled.
I hope I've explained that well enough.
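To make "slightly garbled" concrete, here is a rough sketch of how the two text layers could be compared. It assumes Poppler's pdftotext is on the PATH; the function names `pdf_text` and `changed_lines` are made up for illustration and are not part of any ABBYY or Poppler API.

```python
import difflib
import subprocess


def pdf_text(pdf_path):
    """Return the text layer of a PDF via pdftotext ('-' sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def changed_lines(original_text, ocr_text):
    """List (tag, original_lines, ocr_lines) for every region that differs."""
    # Compare non-empty, stripped lines so layout whitespace is ignored.
    a = [ln.strip() for ln in original_text.splitlines() if ln.strip()]
    b = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical usage (file names are placeholders):
# for tag, before, after in changed_lines(pdf_text("original.pdf"),
#                                         pdf_text("ocr_result.pdf")):
#     print(tag, before, after)
```

Lines reported as "insert" would be text that OCR recovered from the background image, while "replace" entries point at real text that came back different.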
At the moment I'm getting round this as follows. I use Poppler/pdftohtml to convert the PDF to HTML, scan the generated files, and find the background images of the created HTML (which don't contain the text). I ship those background images up, OCR them using processDocument, and pull them back as searchable PDF (not the XML output, because I need words, not letters). I then do my normal processing on both my original PDF and the new one, and merge the results.
It sort of works but it's far from ideal.
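The image-extraction half of that workaround could look roughly like the sketch below. It assumes pdftohtml is on the PATH and is run in complex mode (`-c`), and that every image it writes to the output directory is a page background. The helper names and the `page` output prefix are hypothetical, and the upload to the OCR service is left out.

```python
import subprocess
from pathlib import Path

# Background image file names vary between Poppler versions,
# so images are matched by suffix only (an assumption).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def build_pdftohtml_cmd(pdf_path, out_prefix):
    """Build the pdftohtml call; -c requests 'complex' output with page backgrounds."""
    return ["pdftohtml", "-c", str(pdf_path), str(out_prefix)]


def find_background_images(out_dir):
    """Return the image files in out_dir, sorted by name."""
    return sorted(
        p for p in Path(out_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def extract_backgrounds(pdf_path, out_dir):
    """Convert the PDF to HTML and return the images written alongside it."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdftohtml_cmd(pdf_path, out_dir / "page"), check=True)
    return find_background_images(out_dir)


# Hypothetical usage: each returned image would then be sent for OCR.
# images = extract_backgrounds("form.pdf", "html_out")
```

Using a fresh output directory per PDF keeps images from different documents from being mixed up.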
Is this a bug, or should I treat it as a feature suggestion?
asked 21 Feb '13, 20:18
Hi, yes. Client confidentiality unfortunately prevents me from shipping the PDF in its entirety in any way, but I can show you a snippet.
The following grab from the original PDF shows the word "Details" and a black border, both of which are part of a large background image; the text beneath it, starting "DRIVERS ...", is real text in the PDF.
As further confirmation, if I use the Poppler tool pdftohtml to get just the page background image, the same snippet is as follows:
A cut and paste of this area on the original PDF obviously returns:
I then use the Cloud OCR SDK with profile = textExtraction and outputFile = searchablePDF, and call submitImage(). It seems to OCR both the graphic "Details" (which it gets perfectly correct, no complaints) and the other text. Although the appearance of the output PDF is the same as the original, a cut and paste of the same area now returns the following:
I hope that's clear.
answered 25 Feb '13, 18:31