Hello,

I'm trying to use Cloud OCR SDK to convert PDF files to text. I'd like to get structural HTML instead of the XML output, which contains tag/position information for (almost) every character.

I'd like to ask:

  1. Do you know of an XSLT stylesheet that we can use to convert the XML to HTML?
  2. Does ABBYY intend to provide such a feature directly in the near future?

Thanks! Zaf.

asked 20 Sep '13, 19:34

ZaferK


It is possible to add support for the HTML export format without pictures.

To design a solution, our analyst has asked for the following information:

  1. Do we understand correctly that you do not need to export pictures (since the XML format does not contain them either)?
  2. What is the purpose of converting PDF to HTML?
  3. Do you need to preserve the text formatting (font size etc.)?
  4. Do you need to preserve the document formatting (margins etc.)?

answered 25 Sep '13, 21:00

Anastasia Ga... ♦♦

Hello Anastasia,

We are not interested in the image parts of the PDFs, such as backgrounds, logos, footers, and separators. What matters to us is the text, which we can use for text-based information extraction.

  1. Yes, we don't need to export images. Structural HTML with tables and paragraphs is fine.
  2. The main purpose is to prepare the PDF data for information extraction and data mining.
  3. No, we mostly don't need to preserve font sizes or any other CSS. (They could be useful separators for data extraction, but we can also detect structure from structured/ordered HTML tags.)
  4. Same answer as above.

Thank you.

(26 Sep '13, 11:58) ZaferK
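
For readers with the same goal, here is a minimal sketch (not an ABBYY feature) of turning per-character XML into paragraph-level HTML with Python's standard library. It assumes the XML export nests `par`, `line`, and `charParams` elements and that each `charParams` holds one character; check these names against your own output. The function name `xml_to_html` is hypothetical. Tables are not reconstructed: paragraphs inside table cells would simply come out as `<p>` elements.

```python
import html
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def xml_to_html(xml_text: str) -> str:
    """Collapse per-character elements into one <p> per paragraph.

    Assumed layout (verify against your export): par > line > ... > charParams.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []
    for par in root.iter():
        if local_name(par.tag) != "par":
            continue
        lines = []
        for line in par.iter():
            if local_name(line.tag) != "line":
                continue
            # An empty charParams element is treated as a space.
            chars = [c.text or " " for c in line.iter()
                     if local_name(c.tag) == "charParams"]
            lines.append("".join(chars))
        # Lines are joined with a space; hyphenated words are not rejoined.
        text = " ".join(lines).strip()
        if text:
            paragraphs.append(f"<p>{html.escape(text)}</p>")
    return "\n".join(paragraphs)
```

Position attributes are ignored entirely here, which matches the stated need for structure rather than layout.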

Hello,

Is there any progress or development on this subject that you can share?

Thank you.

(08 Oct '13, 12:21) ZaferK

Any progress on this issue?

(25 Aug '14, 21:07) ukrainecmk

The analyst said that the HTML export format should be added, but it will take some time, so he recommends the following workaround:

  1. Recognize your file and export it to PDF using ABBYY Cloud OCR SDK.
  2. Convert the PDF with the recognized text to HTML as described in this post.
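
Step 1 of the workaround could be sketched as follows. This only builds the HTTP request and sends nothing. The endpoint URL, the `processImage` method, the `language` and `exportFormat` parameters, and the `pdfSearchable` value are assumptions based on the Cloud OCR SDK API as I understand it, so verify them against the current documentation. The helper name `build_process_request` is hypothetical.

```python
import base64
import urllib.parse
import urllib.request

# Assumed service endpoint; confirm against the Cloud OCR SDK documentation.
API_BASE = "http://cloud.ocrsdk.com"


def build_process_request(app_id: str, password: str, pdf_bytes: bytes,
                          export_format: str = "pdfSearchable",
                          language: str = "English") -> urllib.request.Request:
    """Build (but do not send) a request asking for OCR with PDF export."""
    query = urllib.parse.urlencode(
        {"language": language, "exportFormat": export_format}
    )
    # HTTP Basic authentication with the application ID and password.
    token = base64.b64encode(f"{app_id}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{API_BASE}/processImage?{query}",
        data=pdf_bytes,
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(req)` should return a task description to poll until the result is ready. That part is omitted here because it depends on the service's response format.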
This answer is marked "community wiki".

answered 10 Oct '13, 17:30

Anastasia Ga... ♦♦

"Convert the PDF with the recognized text to HTML as described in this post." But that post says it's impossible.

(25 Aug '14, 21:06) ukrainecmk

Hello. Can you please tell me whether PDF to HTML conversion is implemented now? If so, can you point me to documentation, samples, or any other information that will help me do such a conversion?

Regards, Alexey.

answered 25 Aug '14, 21:01

ukrainecmk

You can convert PDF TextAndImages to HTML5 by means of PDF to HTML5 Converter.

Is this method appropriate for you?

answered 27 Aug '14, 14:36

Julia Anikus... ♦♦

edited 02 Sep '14, 16:15

Hello. Thank you for the response. So ABBYY itself does not have such a service, am I right?

answered 30 Aug '14, 12:32

ukrainecmk

Unfortunately, we don't have such functionality at the moment.

Please create a feature request and describe your scenario there. Do you need to preserve formatting and pictures?

answered 02 Sep '14, 16:46

Julia Anikus... ♦♦

edited 02 Sep '14, 18:16

Hello. Yes, what I actually need is to send a PDF document to the service and get back HTML code for each page, formatted just as in the PDF, but with these constraints:

  1. No JavaScript.
  2. No global selectors. If there are CSS styles, they should apply only to the page's HTML and not touch any HTML elements outside the page (I will insert this HTML code into my own HTML page rather than use it as a separate page).
  3. Elements should have unique IDs, or no IDs at all.
  4. Afterwards I should be able to insert the HTML of all pages into one final HTML page of my own.
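
Until such an export exists, some of these constraints can be approximated by post-processing per-page HTML from any converter. The sketch below is only an illustration under stated assumptions: the helper names `scope_page` and `merge_pages`, the `pdf-page` class, and the `pN-` ID prefix are all hypothetical, and regular expressions are a fragile way to edit HTML. It removes `<script>` elements, prefixes every double-quoted `id` with the page number, and wraps each page in its own container. It does not rewrite CSS selectors or references to IDs such as `href="#x"`.

```python
import re

# Remove whole <script>...</script> elements (constraint 1).
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Match id="..." but not attributes such as data-id="...".
ID_RE = re.compile(r'(?<![\w-])id="([^"]+)"')


def scope_page(fragment: str, page_no: int) -> str:
    """Strip scripts, make IDs page-unique, and wrap the page in a container."""
    without_js = SCRIPT_RE.sub("", fragment)
    prefixed = ID_RE.sub(
        lambda m: f'id="p{page_no}-{m.group(1)}"', without_js
    )
    return f'<div class="pdf-page" id="pdf-page-{page_no}">{prefixed}</div>'


def merge_pages(fragments) -> str:
    """Join per-page fragments into one block for embedding in a host page."""
    return "\n".join(
        scope_page(fragment, number)
        for number, fragment in enumerate(fragments, start=1)
    )
```

Styles could then be limited to a page by writing rules under its container, for example `#pdf-page-1 p { ... }`, instead of using global selectors.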

answered 02 Sep '14, 16:50

ukrainecmk

I have created a feature request for HTML export. Please vote there. Hope this functionality will be added in the future.

(03 Sep '14, 12:18) Julia Anikus... ♦♦

Thank you. I couldn't find how to vote there, so I just posted a new comment. I hope this helps.

(03 Sep '14, 12:43) ukrainecmk

Seen: 3,022 times

Last updated: 05 May '15, 10:05

© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal