I need to perform some text extraction. After selecting the predefined recognition language, I would like to exclude some characters.
Example: I would like to exclude the & ( ) | characters.
How can I do that?
After reading the documentation, I'm not sure whether I need to create a dictionary or modify the letter set of the language. The documentation is not very clear on this subject.
asked 27 Mar '15, 17:45
Yes, you can specify the letter set. Please see this article on how to do it with the help of regular expressions: http://knowledgebase.ocrsdk.com/article/1188.
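To illustrate the letter-set idea in general terms, here is a small Python sketch that builds a restricted alphabet by dropping the unwanted characters from a base set. The base alphabet and the `build_letter_set` helper are assumptions made up for this example, not part of the SDK; how the resulting string is actually handed to the engine is described in the linked article.

```python
import string

# Characters the question asks to exclude from recognition.
EXCLUDED = "&()|"


def build_letter_set(base_alphabet, excluded):
    """Return base_alphabet without the excluded characters, keeping order.

    Hypothetical helper for illustration; not an SDK function.
    """
    banned = set(excluded)
    return "".join(ch for ch in base_alphabet if ch not in banned)


# Assumed base alphabet: ASCII letters, digits, punctuation and the space.
base = string.ascii_letters + string.digits + string.punctuation + " "
letter_set = build_letter_set(base, EXCLUDED)
```

The point is only that the allowed set is the language's alphabet minus the characters you do not want recognized.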
Alternatively, you can iterate over the layout and remove all unwanted characters with the Remove method of the Paragraph object. A similar example (which only replaces curly quotes) can be found here: http://knowledgebase.ocrsdk.com/article/1468. This approach looks simpler; however, if the documents contain a large number of characters, iterating may take more time.
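If you end up working with the exported text rather than the layout objects, the same clean-up can be sketched in plain Python. This is only a post-processing illustration under the assumption that each paragraph is already available as a string; `strip_unwanted` and the sample paragraphs are hypothetical and do not use the SDK's Paragraph object.

```python
# Characters to drop from the recognized text.
UNWANTED = "&()|"

# Translation table that maps every unwanted character to "deleted".
_REMOVE_TABLE = str.maketrans("", "", UNWANTED)


def strip_unwanted(text):
    """Return text with every character listed in UNWANTED removed."""
    return text.translate(_REMOVE_TABLE)


# Made-up sample paragraphs standing in for recognized text.
paragraphs = ["Smith & Sons (Ltd) | Invoice", "Total: 42"]
cleaned = [strip_unwanted(p) for p in paragraphs]
```

Note that this removes the characters after recognition and leaves the surrounding spaces in place, whereas restricting the letter set keeps the engine from producing them in the first place.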
Hope it helps!
answered 30 Mar '15, 09:31