Processable output format preserving text formatting, layout, and images

  • Last Post 03 March 2015
Nico posted this 03 May 2013

I'm looking for an output format preserving

  • text formatting,
  • layout (rudimentary), and
  • images

which also allows for being processed afterwards without tremendous effort.

As far as I can tell, the options right now are as follows:

  • XML - nicely provides processable layout, but omits text formatting and images (if any)
  • Alto XML - same here (does not make use of the FILEID attribute of type IllustrationType)
  • docx, xlsx, pptx - proprietary formats that are hard to process
  • txt - does not preserve layout, text formatting, or images
  • rtf - does not preserve any images
  • PDF (pdfSearchable or pdfa) - does not provide any layout information
  • PDF (pdfTextAndImages) - preserves layout, text formatting and images, but extracting any information (especially layout) from the resulting PDF is nearly impossible

Unfortunately, none of the formats mentioned satisfies my needs, for the reasons given.

Am I missing something here? Any help is highly appreciated.

Thanks, Nico

Natalia Karaseva posted this 05 September 2014

Hello, Nico,

Have you tried using the XML export together with the searchable PDF export? You could get the layout info from the XML, then get the images and font info from the PDF.

As mentioned here, "setting multiple export formats does not affect the cost of task processing".
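For illustration, requesting both formats in a single task might look like the sketch below. It only builds the request URL and sends nothing. The endpoint URL and the `exportFormat` parameter name are assumptions based on the Cloud OCR SDK's `processImage` method and should be checked against the current documentation; `build_process_image_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the current Cloud OCR SDK documentation.
BASE_URL = "https://cloud.ocrsdk.com/processImage"


def build_process_image_url(export_formats, language="English", base_url=BASE_URL):
    """Build a processImage URL that asks for several export formats at once.

    The formats are passed as one comma-separated `exportFormat` value
    (parameter name assumed, see the note above).
    """
    if not export_formats:
        raise ValueError("at least one export format is required")
    query = urlencode(
        {"language": language, "exportFormat": ",".join(export_formats)},
        safe=",",  # keep the comma readable instead of encoding it as %2C
    )
    return f"{base_url}?{query}"


# Ask for the XML layout and the searchable PDF in one task.
url = build_process_image_url(["xml", "pdfSearchable"])
```

The image itself would still have to be sent in the request body with your application credentials; that part is left out here.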

Nico posted this 20 September 2014

That might work -- provided I spend a lot of extra effort on a post-processing step that merges both output files. What's more, I suspect the result would sometimes be quite shaky. Hence, at least for me, it's not the way to go. What I'm looking for is a single output file that meets all of the requirements mentioned above.
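To give an idea of what the first half of such a merging step involves, here is a minimal sketch that reads block positions from the XML output. It assumes the XML contains `page` elements with nested `block` elements carrying a `blockType` attribute and `l`, `t`, `r`, `b` pixel coordinates; the sample document and its namespace are made up for illustration, and `extract_blocks` is a hypothetical helper. Matching these boxes to images and fonts in the PDF is the harder, more fragile part and is not shown.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}block' down to 'block'."""
    return tag.rsplit("}", 1)[-1]


def extract_blocks(xml_text):
    """Return one dict per block: page index, block type, and bounding box.

    Assumes every block has integer l/t/r/b attributes (see the note above).
    """
    root = ET.fromstring(xml_text)
    pages = [e for e in root.iter() if local_name(e.tag) == "page"]
    blocks = []
    for page_index, page in enumerate(pages):
        for elem in page.iter():
            if local_name(elem.tag) != "block":
                continue
            box = tuple(int(elem.get(key)) for key in ("l", "t", "r", "b"))
            blocks.append(
                {"page": page_index, "type": elem.get("blockType"), "box": box}
            )
    return blocks


# Hypothetical sample input, not real OCR output.
SAMPLE = """<document xmlns="http://example.com/hypothetical-schema">
  <page width="1000" height="1400">
    <block blockType="Text" l="100" t="50" r="900" b="200"/>
    <block blockType="Picture" l="100" t="250" r="500" b="600"/>
  </page>
</document>"""

blocks = extract_blocks(SAMPLE)
# Boxes of picture blocks are the regions one would then look up in the PDF.
picture_boxes = [b["box"] for b in blocks if b["type"] == "Picture"]
```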

Oksana S. posted this 24 September 2014

Could you please specify why you can’t use the RTF export format? When you use our RTF export format, pictures are embedded in the output file.

Nico posted this 09 February 2015

When I was evaluating ABBYY OCR SDK in April 2013, I observed the situation described in my initial question above: the RTF output file did not preserve any of the images contained in the submitted input. Does that mean this has changed in the meantime? Do RTF output files now contain pictures as well? In all cases?

Eugenia Meshcheryakova posted this 03 March 2015

Could you please provide us with the images you're processing? We tested our system's RTF export on sample documents containing images, and the results were OK. If it is more convenient, you can contact us at