Extract English and Thai text

  • Last Post 11 October 2017
Deepak Hariharan posted this 10 October 2017

I want to convert PDFs and images to text files and extract data from them. All of my documents contain both English and Thai text. I have tried the following options to extract the text.

Option 1 : --lang=English,Thai --profile=textExtraction

Option 2 : --lang=Thai --profile=textExtraction

Option 3 : --lang=English,Thai --profile=documentConversion

Option 4 : --lang=Thai --profile=documentConversion

There were a lot of mismatches between the input data and the output text. Option 4 gives the most accurate conversion, but the English text is lost in this case. Is there any way I can upload a single file and receive two output files, one for English and one for Thai? Otherwise I will have to upload the file twice.

Oksana S. posted this 11 October 2017

If you need to extract the English text from your image first and then the Thai text, you can call the task twice with different settings. In this case, the re-recognition will be performed for free.
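As a rough illustration of "calling the task twice with different settings", here is a minimal Python sketch that only builds the two request URLs, one per language. The endpoint name `processImage` and the query parameters `language`, `profile` and `exportFormat` are assumptions based on the Cloud OCR SDK HTTP API and should be checked against the current documentation; `build_request_url` is a hypothetical helper, and authentication, the file upload and how the second call is tied to the first task are not shown.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify the host and method name in the API documentation.
BASE_URL = "https://cloud.ocrsdk.com/processImage"


def build_request_url(language, profile="textExtraction",
                      export_format="txt", base_url=BASE_URL):
    """Build a processing URL for one language setting (hypothetical helper)."""
    query = urlencode(
        {
            "language": language,          # e.g. "English", "Thai" or "English,Thai"
            "profile": profile,            # e.g. "textExtraction" or "documentConversion"
            "exportFormat": export_format, # assumed parameter name
        },
        safe=",",  # keep the comma in "English,Thai" readable
    )
    return f"{base_url}?{query}"


# One request per language, mirroring the "call the task twice" advice.
for lang in ("English", "Thai"):
    print(build_request_url(lang))
```

Each URL would then be used for a separate request against the same source image, giving one output per language.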

In any case, the issue may be connected with the quality of the source image. Please check that your source image is of suitable quality for OCR, and review our Best Practices section, where you can find tips on how to scan or photograph documents to achieve the best recognition results.
If the structure of your documents is not very important, it is better to use the textExtraction profile and export the result in TXT or XML format (if you need to perform further processing on your side).
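As one possible form of "further processing on your side", the sketch below splits a plain-text result into an English part and a Thai part by checking each whitespace-separated token against the Unicode Thai block (U+0E00 to U+0E7F). This is not a feature of the service, only a hypothetical post-processing step; `split_by_script` is an illustrative name, and the rule that tokens without letters (numbers, punctuation) go to both outputs is an arbitrary choice.

```python
from typing import Tuple


def is_thai(ch: str) -> bool:
    """True if the character lies in the Unicode Thai block (U+0E00..U+0E7F)."""
    return "\u0e00" <= ch <= "\u0e7f"


def split_by_script(text: str) -> Tuple[str, str]:
    """Return (english_text, thai_text), preserving the original line breaks."""
    english_lines, thai_lines = [], []
    for line in text.splitlines():
        english_tokens, thai_tokens = [], []
        for token in line.split():
            if any(is_thai(ch) for ch in token):
                # Tokens mixing both scripts are treated as Thai.
                thai_tokens.append(token)
            elif any(ch.isascii() and ch.isalpha() for ch in token):
                english_tokens.append(token)
            else:
                # Digits and punctuation carry no script; keep them in both.
                english_tokens.append(token)
                thai_tokens.append(token)
        english_lines.append(" ".join(english_tokens))
        thai_lines.append(" ".join(thai_tokens))
    return "\n".join(english_lines), "\n".join(thai_lines)


english, thai = split_by_script("Hello สวัสดี world 123\nขอบคุณ")
print(english)
print(thai)
```

This only helps if a single recognition pass with both languages enabled is accurate enough; if it is not, running the task once per language as described above remains the safer route.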

And to get our additional recommendations, please send the images for which the issue can be reproduced to CloudOCRSDK@abbyy.com.