We have a large number of PDF files. Some are scanned images saved as PDF, and some are PDFs that already contain full or partial text.
Is there a way to check these files so that we process only the scanned-image PDFs and skip those that already contain full or partial text?
We are using C# .NET 4.
ABBYY Cloud OCR SDK does not currently provide an API for the task you describe. You can try using the Adobe Reader COM API to save the PDF as text, or look for another solution. Have a look at this for example. Please let me know if you have any more questions.
answered 24 Apr '12, 13:05
Use iTextSharp to pre-process and check your PDFs. We do this before we send anything to OCR on our own servers, because it saves a lot of time and shortens our queue.
(I am looking at this service as a replacement for our standard installation, but that is what we do right now.)
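A minimal sketch of what such a pre-check could look like, assuming iTextSharp 5.x and its `PdfTextExtractor` class. The class name `PdfTextCheck`, the method `LooksLikeScanOnly`, and the `MinCharsPerPage` threshold are hypothetical names chosen for illustration, not part of iTextSharp or of the setup described above. The idea is to try extracting text from each page and treat the file as a scan only if no page yields a meaningful amount of text.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class PdfTextCheck
{
    // Hypothetical threshold: pages with fewer extractable characters than
    // this are treated as having no real text. Tune it for your documents.
    const int MinCharsPerPage = 20;

    // Returns true when no page yields a meaningful amount of extractable
    // text, i.e. the file is probably a pure scan and should go to OCR.
    // A file with text on even one page (partial text) returns false.
    public static bool LooksLikeScanOnly(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page);
                if (text != null && text.Trim().Length >= MinCharsPerPage)
                {
                    return false; // found a text layer on this page
                }
            }
            return true;
        }
        finally
        {
            reader.Close(); // release the file handle in all cases
        }
    }
}

static class Program
{
    static void Main(string[] args)
    {
        // Each command-line argument is taken to be a path to a PDF file.
        foreach (string path in args)
        {
            string verdict = PdfTextCheck.LooksLikeScanOnly(path)
                ? "send to OCR"
                : "already has text, skip";
            Console.WriteLine("{0}: {1}", path, verdict);
        }
    }
}
```

This is a heuristic, not a guarantee. A scanned PDF that already carries a hidden OCR text layer will be reported as having text, and encrypted or damaged files may make `PdfReader` throw, so you would probably want to catch exceptions and decide per file whether to fall back to OCR.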
answered 15 May '12, 01:53