We are converting an image PDF to text. The result is quite good, but the output contains no page break (form feed) characters. We need those because our documents are quite large and we want to know which page each piece of text is on.
We have tried a workaround: converting to a searchable PDF and then extracting the text with PDFBox or Tika. This way we get the pages, but all tables are misaligned. This applies to any text in the original file that looks like a table. Internally, ABBYY creates a PDF with a table, and when we run PDFBox we get the columns one after another instead of plain text.
Again: how do we convert an image PDF containing some table-like text into a plain array of text pages?
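
To illustrate the result we are after, here is a minimal sketch of how the text output could be split into pages if it did contain form feed characters. It assumes one form feed (U+000C) between consecutive pages; the class name `PageSplitter`, the method `splitPages`, and the sample text are made up for illustration and are not part of ABBYY, PDFBox, or Tika.

```java
import java.util.ArrayList;
import java.util.List;

public class PageSplitter {

    // Splits plain text into pages on the form feed character (U+000C).
    // Assumption: exactly one form feed separates two consecutive pages.
    // A trailing form feed therefore produces an empty last page.
    public static List<String> splitPages(String text) {
        List<String> pages = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\f') {
                pages.add(text.substring(start, i));
                start = i + 1;
            }
        }
        // Whatever follows the last form feed is the final page.
        pages.add(text.substring(start));
        return pages;
    }

    public static void main(String[] args) {
        // Hypothetical two-page sample with table-like rows kept on one line each.
        String sample = "Name  Qty\nBolt  10\fName  Qty\nNut   25";
        List<String> pages = splitPages(sample);
        for (int p = 0; p < pages.size(); p++) {
            System.out.println("--- page " + (p + 1) + " ---");
            System.out.println(pages.get(p));
        }
    }
}
```

The point is that each table row stays on one line within its page, and the index into the list gives the page number (index 0 is page 1).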