Converting from Image PDF to text with page breaks

  • Last Post 12 September 2017
Viktor Mikho posted this 06 September 2017


We are converting image PDFs to text, and the result is quite good, but the output contains no page break (form feed) characters. We need those because our documents are quite large and we want to know which page each piece of text is located on.
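As a minimal sketch of the downstream use, assuming the text output did contain a form feed character (U+000C) between pages, the text could be split into an array of pages and searched per page as shown here. The class and method names (PageSplitter, splitPages, findPage) and the sample text are hypothetical and are not part of any ABBYY, PDFBox, or Tika API.

```java
import java.util.ArrayList;
import java.util.List;

public class PageSplitter {
    // Form feed (U+000C), the conventional page break character in plain text.
    static final char FORM_FEED = '\f';

    /** Splits text into pages on form feed characters. Empty pages are kept. */
    public static List<String> splitPages(String text) {
        List<String> pages = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == FORM_FEED) {
                pages.add(text.substring(start, i));
                start = i + 1;
            }
        }
        // Text after the last form feed is the final page.
        pages.add(text.substring(start));
        return pages;
    }

    /** Returns the 1-based number of the first page containing needle, or -1 if none does. */
    public static int findPage(List<String> pages, String needle) {
        for (int i = 0; i < pages.size(); i++) {
            if (pages.get(i).contains(needle)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical three-page OCR output with form feeds between pages.
        String text = "Invoice 001\nTotal: 10\fInvoice 002\nTotal: 20\fAppendix";
        List<String> pages = splitPages(text);
        System.out.println(pages.size());                   // prints 3
        System.out.println(findPage(pages, "Invoice 002")); // prints 2
    }
}
```

Keeping empty pages in the list means the page numbers stay aligned with the original document even when a page contains no recognized text.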

We have tried a workaround: converting to searchable PDF and then extracting the text with PDFBox or Tika. This way we get the pages, but all tables are misaligned. This applies to any text in the original file that looks like a table. Internally, ABBYY creates a PDF with a table, and when we run PDFBox we get the columns one after another instead of plain text.

Again, how do we convert an image PDF containing some table-like text into a plain array of text pages?





Nikolay Krivchanskiy posted this 12 September 2017

Hi Viktor,

For your purpose, please try using the documentConversion profile and exporting the files to .docx. This way the page breaks match those of the original document and, at the same time, the table alignment is preserved.

If this does not help, please attach an example of your document so that we can perform additional testing. You can also send us examples to if you find it more convenient.