I need to perform some text extraction. After selecting the predefined recognition language, I would like to exclude some characters.

Example: I would like to exclude the & ( ) | characters.

How can I do that?

After reading the documentation, I'm not sure whether I need to create a dictionary or modify the letter set of the language. The documentation is not very clear on this subject.

asked 27 Mar '15, 17:45

maol



Yes, you can specify the letter set. Please see this article on how to do it with the help of regular expressions: http://knowledgebase.ocrsdk.com/article/1188.
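To illustrate the idea of a restricted letter set, here is a minimal Python sketch that derives the allowed characters by removing the unwanted ones from a base alphabet. It does not call the OCR SDK: `BASE_ALPHABET` and `build_letter_set` are hypothetical names chosen for this example, and the real alphabet depends on the recognition language you selected. The resulting string is what you would then supply as the letter set, following the article above.

```python
import string

# Hypothetical base alphabet for illustration only; a real recognition
# language defines its own set of characters.
BASE_ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "


def build_letter_set(base: str, excluded: str) -> str:
    """Return `base` with every character of `excluded` removed, order preserved."""
    banned = set(excluded)
    return "".join(ch for ch in base if ch not in banned)


# Exclude the four characters from the question: & ( ) |
letter_set = build_letter_set(BASE_ALPHABET, "&()|")
```

Because the engine is never offered the excluded characters, this approach filters them during recognition rather than afterwards.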

Alternatively, you can iterate over the layout and remove all unwanted characters by means of the Remove method of the Paragraph object. A similar example (though only for replacing curly quotes) can be found here: http://knowledgebase.ocrsdk.com/article/1468. This approach looks simpler; however, if the documents contain a large number of characters, iterating may take more time.
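The iteration approach can be sketched in plain Python as follows. This is only an illustration of the logic, not SDK code: paragraphs are modelled as ordinary strings, and `UNWANTED` and `remove_unwanted` are hypothetical names. With the real API you would call the Paragraph object's Remove method at the point where this sketch deletes a list element.

```python
UNWANTED = set("&()|")


def remove_unwanted(paragraphs: list[str], unwanted: set[str] = UNWANTED) -> list[str]:
    """Return copies of the paragraphs with every unwanted character deleted."""
    cleaned = []
    for text in paragraphs:
        chars = list(text)
        # Walk backwards so that deleting a character does not shift
        # the indices of the characters still to be visited.
        for i in range(len(chars) - 1, -1, -1):
            if chars[i] in unwanted:
                del chars[i]
        cleaned.append("".join(chars))
    return cleaned
```

Every character of every paragraph is visited once, which is why this post-processing pass gets slower as the amount of recognized text grows.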

Hope it helps!


answered 30 Mar '15, 09:31

Natalia Karaseva


