I'm looking for an output format that preserves

  • text formatting,
  • layout (rudimentary), and
  • images

and that can also be post-processed without tremendous effort.

As far as I can judge right now, the options are as follows:

  • XML - nicely provides processable layout, but omits text formatting and images (if any)
  • Alto XML - same here (does not make use of the FILEID attribute of type IllustrationType)
  • docx, xlsx, pptx - proprietary formats that are hard to process
  • txt - does not preserve layout, text formatting or images
  • rtf - does not preserve any images
  • PDF (pdfSearchable or pdfa) - does not provide any layout information
  • PDF (pdfTextAndImages) - preserves layout, text formatting and images, but extracting any information (especially layout) from the resulting PDF is nearly impossible

Unfortunately, none of the formats mentioned satisfies my needs, for the reasons given.

Am I missing something here? Any help is highly appreciated.

Thanks, Nico

asked 03 May '13, 19:45



edited 10 Mar, 12:55


Oksana Serdyuk ♦♦

Hello, Nico,

Have you tried using the XML export together with the searchable PDF export? You could get the layout info from the XML, then get the images and font info from the PDF.

As mentioned here, "setting multiple export formats does not affect the cost of task processing".
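
For illustration, requesting both formats in a single task might look roughly like the sketch below. It only builds the request URL and sends nothing. The endpoint, the `exportFormat` parameter with comma-separated values, and the helper name `build_process_image_url` are assumptions here, so please check them against the current API documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint of the processImage method; verify against the current docs.
BASE_URL = "http://cloud.ocrsdk.com/processImage"


def build_process_image_url(export_formats, language="English"):
    """Build a request URL asking for several export formats in one task.

    Hypothetical helper: it only assembles the URL and does not call the service.
    """
    if not export_formats:
        raise ValueError("at least one export format is required")
    params = {
        "language": language,
        # Assumption: multiple formats are passed as one comma-separated value.
        "exportFormat": ",".join(export_formats),
    }
    # Keep the commas readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_process_image_url(["xml", "pdfSearchable"]))
```

The image itself would then be sent in the body of an authenticated POST request to that URL, and each requested format comes back as a separate result file.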


answered 05 Sep '14, 11:44


Natalia Kara...

That might work, provided I spent a lot of extra effort on a post-processing step that merges both output files. What's more, I suspect this would sometimes turn out to be quite shaky. Hence, at least for me, it's not the way to go. What I'm looking for is a single output file that meets all the requirements mentioned above.

(20 Sep '14, 14:50) Nico

Could you please specify why you can’t use the RTF export format? When you use our RTF export format, pictures are embedded in the output file.

(24 Sep '14, 16:25) Oksana Serdyuk ♦♦

When I evaluated ABBYY OCR SDK in April 2013, I observed the behavior described in my initial question above: the RTF output file did not preserve any of the images contained in the submitted input. Has that changed in the meantime? Do RTF output files now contain pictures as well? In all cases?

(09 Feb '15, 14:18) Nico
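
One quick way to check whether a given RTF result actually contains embedded pictures is to look for the `\pict` control word, which RTF uses to introduce a picture group. The following is only a rough heuristic sketch, not a full RTF parser; the function name `count_rtf_pictures` and the file name in the usage note are hypothetical.

```python
import re

# Match the \pict control word:
#  - not preceded by another backslash (that would be an escaped literal backslash),
#  - not followed by a letter (that would be a different, longer control word).
PICT_RE = re.compile(rb"(?<!\\)\\pict(?![a-zA-Z])")


def count_rtf_pictures(rtf_bytes):
    """Return the number of \\pict control words found in raw RTF bytes."""
    return len(PICT_RE.findall(rtf_bytes))


if __name__ == "__main__":
    # Tiny hand-written RTF fragments, used for illustration only.
    with_pictures = (
        b"{\\rtf1 text {\\pict\\pngblip\\picw10\\pich10 89504e47}"
        b" more {\\pict\\jpegblip ffd8}}"
    )
    without_pictures = b"{\\rtf1 no images here}"
    print(count_rtf_pictures(with_pictures))
    print(count_rtf_pictures(without_pictures))
```

For a downloaded result, something like `count_rtf_pictures(open("result.rtf", "rb").read())` would give a first indication; a count of zero for an input that clearly contains images would be worth reporting along with the sample file.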

Could you please provide us with the images you're processing? We tested our system's RTF export on the sample documents containing images and they were OK. If it is more convenient, you can contact us at cloudocrsdk@abbyy.com.

(03 Mar '15, 11:57) Eugenia Mesh... ♦♦