I'm looking for an output format preserving
- text formatting,
- layout (rudimentary),
and which can also be processed afterwards without tremendous effort.
As far as I can judge right now, the options are as follows:
- XML - nicely provides processable layout, but omits text formatting and images (if any)
- Alto XML - same here (does not make use of the FILEID attribute of type
- docx, xlsx, pptx - proprietary formats that are hard to process
- txt - does not preserve layout, text formatting, or images
- rtf - does not preserve any images
- PDF (pdfa) - does not provide any layout information
- PDF (pdfTextAndImages) - preserves layout, text formatting and images, but extracting any information (especially layout) from the resulting PDF is nearly impossible
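To make "processable layout" concrete, this is roughly the kind of post-processing I have in mind, sketched here for the Alto XML case. The sample fragment, the function name extract_words, and the namespace version are assumptions made up for illustration, not actual output of the OCR service. The HPOS, VPOS, WIDTH, HEIGHT and CONTENT attributes are standard Alto attributes on String elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical Alto fragment, invented for illustration; real OCR output will differ.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1" WIDTH="2480" HEIGHT="3508">
      <PrintSpace>
        <TextBlock ID="b1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="60">
          <TextLine HPOS="100" VPOS="200" WIDTH="800" HEIGHT="60">
            <String CONTENT="Hello" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="60"/>
            <SP HPOS="280" VPOS="200" WIDTH="20"/>
            <String CONTENT="world" HPOS="300" VPOS="200" WIDTH="200" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def extract_words(alto_xml):
    """Return one dict per Alto String element with its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Compare the local tag name only, so the Alto namespace version does not matter.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append({
            "text": element.get("CONTENT", ""),
            "x": float(element.get("HPOS", 0)),
            "y": float(element.get("VPOS", 0)),
            "width": float(element.get("WIDTH", 0)),
            "height": float(element.get("HEIGHT", 0)),
        })
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_ALTO):
        print(word)
```

Getting word positions this way is straightforward, but the result carries no text formatting or image information, which is exactly the gap described above.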
Unfortunately, none of the formats mentioned satisfies my needs, for the reasons given.
Am I missing something here? Any help is highly appreciated.