XML parse error in processImage results file

  • Last Post 12 August 2015
nwixsom posted this 12 August 2015

I'm running processImage on a PDF file using the following URL:

http://cloud.ocrsdk.com/processImage?correctOrientation=true&language=English&exportFormat=xml&profile=textExtraction

The PDF is rotated, so I'm passing correctOrientation (which seems to work just fine). I have AsyncProcessTask processImage outPutFormat = xml. A snippet of the output file is attached as an image (it sure would be nice if one could attach an XML file!).

It is running on an Android device, so I'm using the XmlPullParser class. I get the following error from the parser:

AsyncProcessTask.parseOCRResults exception = org.xmlpull.v1.XmlPullParserException: Unexpected token (position:TEXT ?@1:2 in java.io.FileReader@4210f5d0)

I then loaded the full XML file into XMLPad, chose XML->Validate, and got the errors shown in the attached screenshots. What am I missing? Is the namespace incorrect? It's set to xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml"

Thanks for any help.

Oksana Serdyuk posted this 12 August 2015

Thank you for this information. We've reproduced the issue and now we are consulting with our developers in order to clarify the situation.

Oksana Serdyuk posted this 14 August 2015

I would like to inform you that we have just updated the version of the XML schema, so the issue should now be solved.

nwixsom posted this 18 August 2015

I just tried to run it again and got the same error. The XML file that results from processImage contains a rotation attribute on the page tag. I looked at your schema (http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml) and it does not define rotation, which is causing my parser to fail.

<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" version="1.0" producer="ABBYY FineReader Engine 11" languages="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
<page width="7200" height="10800" resolution="300" originalCoords="1" rotation="RotatedCounterclockwise">

</page>
</document>

Oksana Serdyuk posted this 19 August 2015

I've just processed your Restrooms1.pdf file using the same recognition settings and got a correct XML file. WMHelp XMLPad validates it successfully (see the attached screenshot).

Would you mind processing the file once again?

nwixsom posted this 19 August 2015

Thanks, I found the problem was in the XmlPullParser. If I pass it a FileReader, it generates that error (and I have no idea why). If I use a FileInputStream instead, I'm able to parse the file. Thanks again for all your help.
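
The thread never establishes why the FileReader failed. One plausible cause, offered here as an assumption rather than a confirmed diagnosis, is a byte order mark (BOM) at the start of the results file. A FileReader decodes with the platform default charset and passes any BOM to the parser as a stray character, which would be consistent with "Unexpected token" at position 1:2. By contrast, XmlPullParser.setInput(InputStream, null) leaves encoding detection to the parser. If a Reader is still required, a minimal sketch of a helper that skips a leading BOM might look like this (BomSafeReader and open are hypothetical names, and UTF-8 is assumed because the thread does not state the file's encoding):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the OCR SDK or of XmlPullParser.
final class BomSafeReader {

    private BomSafeReader() {
    }

    /**
     * Opens the file as UTF-8 text (an assumption) and skips a leading
     * byte order mark (U+FEFF) if one is present.
     */
    static Reader open(String path) throws IOException {
        PushbackReader reader = new PushbackReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8), 1);
        try {
            int first = reader.read();
            if (first != -1 && first != 0xFEFF) {
                // Not a BOM: put the character back so the parser sees it.
                reader.unread(first);
            }
            return reader;
        } catch (IOException e) {
            reader.close();
            throw e;
        }
    }
}
```

The returned Reader could then be handed to XmlPullParser.setInput(Reader) in place of a plain FileReader. Passing the FileInputStream directly, as described above, remains the simpler option.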

Dell Mercant posted this 02 May 2016

More ways to Parse XML