Check if PDF is scanned image or contains text

Jacob posted this 24 April 2012

We have a large number of PDF files. Some are scanned images saved as PDF, and some are PDFs that already contain full or partial text.

Is there a way to check these files so that we only process the scanned-image PDFs and skip those that already contain full or partial text?

We are using C# .NET 4.

Thanks

Nikolay_Kh posted this 24 April 2012

ABBYY Cloud OCR SDK does not currently provide an API for the task you describe. You can try using the Adobe Reader COM API to save the PDF as text, or look for another solution. Have a look at this for example. Please let me know if you have any more questions.

mattopson posted this 26 April 2012

You can try an OCR program and see how well it works. Hopefully you will get a good result. Good luck.

Nikolay_Kh posted this 27 April 2012

Hello Mat, please avoid discussing non-ABBYY OCR software unless you provide a solution for the described task. The link you provided doesn't clearly state how Jacob could look for a text layer in his PDF files.

You can refer to our FAQ page for details: http://forum.ocrsdk.com/faq/

AJW posted this 15 May 2012

Use iTextSharp to pre-process/check your PDF. We do this before we send anything to OCR with our own servers, because it saves a lot of time and reduces our queue.

(I am looking at this service as a replacement for our standard installation, but that is what we do right now.)

-AJ

ibr.a posted this 2 weeks ago

Hi, AJW
That is exactly what I'm looking for right now. Do you know of a link or some documentation on how to use iTextSharp for this pre-processing?
I have downloaded "itextsharp-all-5.5.10" and found zipped files inside it containing a bunch of libraries.
Thanks in advance
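
To illustrate the kind of pre-check AJW describes, here is a minimal sketch in plain Python (standard library only). It is not iTextSharp and not AJW's actual code; the function name `has_text_layer` and both regular expressions are made up for this example. It assumes that page content streams are either uncompressed or Flate-compressed, and it only looks for the `Tj` and `TJ` text-showing operators, so treat it as a rough heuristic rather than a real PDF parser. Note that a scanned PDF that has already been OCR'd (invisible text layer) will count as having text.

```python
import re
import zlib

# One indirect object that carries a stream: its dictionary, then the payload.
# The (?!endobj) guard keeps the dictionary from spanning earlier objects.
_STREAM_OBJ = re.compile(
    rb"obj\s*<<((?:(?!endobj).)*?)>>\s*stream\r?\n(.*?)endstream",
    re.DOTALL,
)
# A string or array operand followed by a text-showing operator (Tj or TJ).
_SHOW_TEXT = re.compile(rb"(?:[)>]\s*Tj|\]\s*TJ)\b")


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any non-image stream contains a text-showing operator."""
    for match in _STREAM_OBJ.finditer(pdf_bytes):
        dictionary, payload = match.group(1), match.group(2)
        if re.search(rb"/Subtype\s*/Image", dictionary):
            continue  # pixel data, not page content
        if b"/FlateDecode" in dictionary:
            try:
                # decompressobj tolerates the end-of-line bytes before "endstream"
                payload = zlib.decompressobj().decompress(payload)
            except zlib.error:
                continue  # corrupt or differently encoded stream; skip it
        elif b"/Filter" in dictionary:
            continue  # a filter this sketch does not decode
        if _SHOW_TEXT.search(payload):
            return True
    return False


if __name__ == "__main__":
    import sys

    # Usage: python check_pdf.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            kind = "has text" if has_text_layer(handle.read()) else "image only, send to OCR"
        print(f"{path}: {kind}")
```

With iTextSharp 5.x in C#, the equivalent idea would be to open the file with `PdfReader`, loop over pages 1 to `reader.NumberOfPages`, call `PdfTextExtractor.GetTextFromPage(reader, page)` (namespace `iTextSharp.text.pdf.parser`), and treat the file as already containing text if any page returns non-whitespace output.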
