PDF to HTML in any way.

  • Last Post 03 September 2014
ZaferK posted this 20 September 2013


I'm trying to use Cloud OCR SDK to convert a PDF file to text so that I get structural HTML instead of XML, which contains tag/position information for (almost) every character.

I'd like to ask:

  - Do you know of an XSLT map that we can use to convert the XML to HTML?
  - Does ABBYY intend to provide such a direct feature in the near future?

Thanks! Zaf.

Anastasia Galimova posted this 25 September 2013

It is possible to add support for an HTML export format without pictures.

To design a solution, our analyst has asked for the following information:

  1. Do we understand correctly that you do not need to export pictures (the XML format does not contain them either)?
  2. What is the purpose of converting PDF to HTML?
  3. Do you need to preserve the text formatting (font size etc.)?
  4. Do you need to preserve the document formatting (margins etc.)?

ZaferK posted this 26 September 2013

Hello Anastasia,

We are not interested in the image parts of the PDFs, such as backgrounds, logos, footers, and separators. What matters to us is the text, which we can use for text-based information extraction.

  1. Yes, we don't need to export images. Structural HTML with tables and paragraphs is fine.
  2. The main purpose is to prepare PDF data for information extraction and data mining.
  3. No, we mostly don't need to preserve font sizes or any other CSS. (They could be good separators for data extraction, but we can detect the same boundaries from structured, ordered HTML tags too.)
  4. Same answer as for the previous question.
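
As a stopgap for the requirements above (text only, paragraphs, no styling), the per-character XML can be collapsed into plain paragraphs with a short script. This is only a minimal sketch: it assumes the XML export nests `page`, `par`, `line` and `charParams` elements, with one `charParams` element per recognized character, so please check those names against the XML you actually receive. The helpers `local`, `line_text` and `xml_to_html` and the `SAMPLE` document are made up for illustration, and tables are not handled specially (their text comes out as ordinary paragraphs).

```python
import xml.etree.ElementTree as ET
from html import escape


def local(tag):
    """Strip an XML namespace from a tag name: '{ns}par' -> 'par'."""
    return tag.rsplit("}", 1)[-1]


def line_text(line):
    """Join the per-character elements of one line into a string."""
    # Assumption: each character is the text of a <charParams> element.
    return "".join(
        el.text or "" for el in line.iter() if local(el.tag) == "charParams"
    )


def xml_to_html(xml_string):
    """Turn per-character OCR XML into one <section> per page, one <p> per paragraph."""
    root = ET.fromstring(xml_string)
    parts = []
    for page in (el for el in root.iter() if local(el.tag) == "page"):
        parts.append("<section>")
        for par in (el for el in page.iter() if local(el.tag) == "par"):
            lines = [line_text(l) for l in par.iter() if local(l.tag) == "line"]
            text = " ".join(s.strip() for s in lines if s.strip())
            if text:  # skip empty paragraphs
                parts.append("<p>" + escape(text) + "</p>")
        parts.append("</section>")
    return "\n".join(parts)


# Hypothetical input: one page, one paragraph, two lines.
SAMPLE = """<document xmlns="http://example.com/ocr">
<page><block blockType="Text"><text><par>
<line><formatting>
<charParams l="1" t="1" r="2" b="2">H</charParams><charParams l="2" t="1" r="3" b="2">i</charParams>
</formatting></line>
<line><formatting>
<charParams l="1" t="3" r="2" b="4">&lt;</charParams><charParams l="2" t="3" r="3" b="4">3</charParams>
</formatting></line>
</par></text></block></page></document>"""
```

Calling `xml_to_html(SAMPLE)` yields a single `<section>` with one `<p>` whose text is HTML-escaped. Position attributes are ignored on purpose, since only the reading order matters for extraction.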

Thank you.

ZaferK posted this 08 October 2013


Is there any progress or development that you can share on this subject?

Thank you.

Anastasia Galimova posted this 10 October 2013

The analyst said that an HTML export format should be added, but it will take some time, so in the meantime he recommends the following workaround:

  1. Recognize your file and export it to PDF using ABBYY Cloud OCR SDK.
  2. Convert the PDF with the recognized text to HTML as described in this post.

ukrainecmk posted this 25 August 2014

Hello. Can you please tell me whether PDF to HTML conversion is implemented yet? If so, can you point me to documentation, samples, or any other information that will help me perform such a conversion?

Regards, Alexey.

ukrainecmk posted this 25 August 2014

"convert the pdf with the recognized text to HTML as it is described in this post." But that post says it's impossible.

ukrainecmk posted this 25 August 2014

Any progress on this issue?

Julia Anikushina posted this 27 August 2014

You can convert PDF TextAndImages to HTML5 by means of PDF to HTML5 Converter.

Is this method appropriate for you?

ukrainecmk posted this 30 August 2014

Hello. Thank you for the response. So ABBYY itself does not have such a service, am I right?

Julia Anikushina posted this 02 September 2014

Unfortunately, at the moment we don't have such functionality.

Please create a feature request and describe your scenario there. Do you need to preserve formatting and pictures?

ukrainecmk posted this 02 September 2014

Hello. Yes. What I need is to send a PDF document to the service and get back HTML code for each page, formatted just as in the PDF, with the following constraints:

  1. No JavaScript.
  2. No global selectors: if there are CSS styles, they should apply only to the page's HTML and not touch any HTML elements outside the page (I will insert this HTML code into my own HTML page rather than use it as a separate page).
  3. Elements should have unique IDs, or no IDs at all.
  4. I should then be able to insert the HTML of all pages into one final HTML page of my own.
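
Until such an export exists, constraints like these could be applied as a post-processing step on whatever per-page HTML a converter produces. The following is a rough sketch using only Python's standard `html.parser`. The class `PageScoper`, the function `combine_pages`, the `pdf-page` class name and the `pN-` id prefix are all hypothetical. It sidesteps CSS scoping by dropping `<style>` blocks entirely (inline `style` attributes are kept, since they cannot leak outside their element), and it does not rewrite references to ids such as `href="#..."`.

```python
from html import escape
from html.parser import HTMLParser

DROPPED = {"script", "style"}  # elements removed together with their content


class PageScoper(HTMLParser):
    """Re-emit an HTML fragment without scripts/styles and with prefixed ids."""

    def __init__(self, prefix):
        super().__init__(convert_charrefs=True)
        self.prefix = prefix
        self.out = []
        self.skip = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            self.skip += 1
            return
        if self.skip:
            return
        rendered = []
        for name, value in attrs:
            if name.startswith("on"):  # drop inline event handlers (onclick, ...)
                continue
            if value is None:  # boolean attribute such as "hidden"
                rendered.append(" " + name)
                continue
            if name == "id":  # make ids unique per page
                value = self.prefix + value
            rendered.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(rendered)))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax (<br/>): emit as a plain start tag.
        if tag not in DROPPED:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROPPED:
            self.skip = max(0, self.skip - 1)
        elif not self.skip:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def combine_pages(pages):
    """Wrap each cleaned page fragment in its own container, in page order."""
    sections = []
    for number, fragment in enumerate(pages, start=1):
        scoper = PageScoper(prefix="p%d-" % number)
        scoper.feed(fragment)
        scoper.close()
        sections.append(
            '<div class="pdf-page" id="pdf-page-%d">%s</div>'
            % (number, "".join(scoper.out))
        )
    return "\n".join(sections)
```

The result of `combine_pages([...])` is a string that can be pasted into a host page. Any styling would then be written by hand against `.pdf-page` so that it cannot affect elements outside the inserted pages.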

Julia Anikushina posted this 03 September 2014

I have created a feature request for HTML export. Please vote there. I hope this functionality will be added in the future.

ukrainecmk posted this 03 September 2014

Thank you. I couldn't find how to vote there, so I just posted a new comment. I hope this helps.