Hi,

We just noticed that when exporting to a UTF-8 text file, FineReader Engine adds a BOM (Byte Order Mark) at the beginning of the file.

page.Export(tempTxtFile.getAbsolutePath(), FileExportFormatEnum.FEF_TextUnicodeDefaults, exportParams);

This BOM character (EF BB BF) indicates the Unicode representation of the text.

But with UTF-8 the BOM is optional and not recommended (ref. Unicode Standard 5.0). This matters especially for Java, which assumes that UTF-8 files don't have a BOM: its UTF-8 decoder does not strip it, so when the file is read the BOM ends up as a stray character at the start of the text (often displayed as ?), which is really annoying.

More info here: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html

Currently we have a workaround (http://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker), but it would be nice to consider removing the BOM in the future, or making it optional ;)
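
For anyone hitting the same issue, here is a minimal sketch of that kind of workaround on the reading side. It assumes the whole exported file fits in memory. The class and method names (`BomStripper`, `stripBom`, `decodeUtf8`, `readUtf8File`) are made up for illustration and are not part of FineReader Engine or of the linked answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any ABBYY API.
final class BomStripper {
    // The UTF-8 bytes EF BB BF decode to the single character U+FEFF.
    private static final char BOM = '\uFEFF';

    private BomStripper() {}

    /** Removes one leading BOM character, if present; otherwise returns the text unchanged. */
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    /** Decodes UTF-8 bytes and drops a leading BOM. */
    static String decodeUtf8(byte[] bytes) {
        // Java's UTF-8 decoder keeps the BOM as U+FEFF, so strip it after decoding.
        return stripBom(new String(bytes, StandardCharsets.UTF_8));
    }

    /** Reads a whole UTF-8 text file into a String, ignoring a leading BOM. */
    static String readUtf8File(Path file) throws IOException {
        return decodeUtf8(Files.readAllBytes(file));
    }
}
```

With this in place, something like `BomStripper.readUtf8File(tempTxtFile.toPath())` should return the exported text without the leading marker. Only a BOM at position 0 is removed; a U+FEFF elsewhere in the text is left untouched.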

asked 20 Apr '15, 11:41

maol


Sorry for the delayed response.

We have passed your suggestion to our analysts and created a request to make the BOM character optional. Unfortunately, we do not yet have information on when this feature will be available; we hope it will be implemented in a future version.

answered 01 Jul '15, 12:44

Julia Anikus... ♦♦



© 2016 ABBYY. All rights Reserved. www.ABBYY.com | Privacy Policy | Legal