How to detect bold/italic characters in ABBYY Cloud OCR SDK?

amirkhan posted this 06 November 2015

I am using the Cloud OCR SDK in my application for OCR. I want to know how to detect bold/italic characters in the XML returned by the cloud service. I am using the following code to get the XML:

String url = string.Format("http://cloud.ocrsdk.com/processImage?language={0}&exportFormat={1},{2}", language, exportFormat, "xml");

var request = CreateRequest(url, "POST", Credentials, Proxy);

In this link, the following answer was posted by a user:

The parameter xcf (--xmlWriteCharFormatting) is necessary to get the font size.

I want to know whether this parameter also works in the Cloud OCR service, or whether there is some other parameter or way to detect bold/italic characters.

Oksana Serdyuk posted this 06 November 2015

The XML output format that ABBYY Cloud OCR SDK creates is the same one that ABBYY FineReader Engine creates with the default options. It contains information on text and characters, but it does not include character formatting information such as bold, italic, or underlined font styles. Unfortunately, at present the XML schema in Cloud OCR SDK can only be extended by setting the parameter xml:writeRecognitionVariants to true (it specifies whether the character recognition variants should be written to the output file).

If you want information about text font styles, please vote for the feature request: http://forum.ocrsdk.com/questions/3693/feature-request-font-info-in-xml. We hope it will be implemented in the future.

Oksana Serdyuk posted this 22 January 2018

Hi,

We are happy to inform you that the requested functionality has recently been implemented in ABBYY Cloud OCR SDK. It is now possible to get information about paragraph and character styles in the XML export format. To do this, use the xml:writeFormatting parameter of the processImage or processDocument methods and set it to true (it is false by default).
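
As a rough illustration of the parameter described above, here is a minimal Python sketch that builds a processImage URL with xml:writeFormatting=true. The endpoint and parameter names come from this thread; the helper name build_process_image_url and the default language value "English" are assumptions made for the example, and sending the request (credentials, POST body) is left out.

```python
from urllib.parse import urlencode

# Endpoint as quoted in this thread; the helper below is a hypothetical convenience.
BASE_URL = "http://cloud.ocrsdk.com/processImage"


def build_process_image_url(language="English", **extra):
    """Return a processImage URL that asks for XML output with formatting info."""
    params = {
        "language": language,            # assumed default, adjust as needed
        "exportFormat": "xml",
        "xml:writeFormatting": "true",   # false by default, per the post above
    }
    params.update(extra)                 # e.g. profile or imageSource
    # Keep ':' readable in the parameter name instead of percent-encoding it.
    return BASE_URL + "?" + urlencode(params, safe=":")


url = build_process_image_url()
print(url)
```

Extra query parameters can be passed as keyword arguments, for example build_process_image_url(imageSource="scanner").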

Vishnu posted this 19 September 2018

Hi,

I'm trying to use the xml:writeFormatting parameter to get paragraph or line format information, but I came across a style attribute in the par and formatting tags. What does it represent?

Example:

<par align="Justified" style="{FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF}">
   <line>..
       <formatting lang="EnglishUnitedStates" ff="Arial" fs="10." underline="1" style="{99A6515E-DE65-4325-9F9D-99D8674C0010}">
   ..
..
   </line>
</par>

Thanks,

Vishnu

Vishnu posted this 19 September 2018

I can see that some italic attributes are detected, but I can't find any bold attribute in the formatting/charParams tags of the XML response, even though the uploaded documents (clear ones) contain bold words.

Helen Osetrova posted this 4 weeks ago

Hi Vishnu,

 

The information about the font attributes can be found under the fontStyle tag of the output XML document. If there are no bold attributes in the output XML, it means that the text of the source document was not recognized as bold.

For more specific recommendations, could you post the source document here?
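
To check which font styles were recognized as bold or italic, something like the following Python sketch could be used. It walks the XML and collects every fontStyle element. The sample XML is hypothetical, not real service output: the attribute names id, bold, and italic, the values "1"/"true", and the nesting are assumptions for illustration, so compare them with an actual response before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT real service output: attribute names and values
# on fontStyle are assumptions; the real nesting may differ.
SAMPLE_XML = """
<document>
  <fontStyle id="{F1}" ff="Arial" fs="10." italic="1"/>
  <fontStyle id="{F2}" ff="Arial" fs="10." bold="1"/>
</document>
"""


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def is_set(value):
    """Treat '1' or 'true' as an enabled flag (assumed encodings)."""
    return value in ("1", "true")


def font_styles(xml_text):
    """Map each fontStyle id to its (bold, italic) flags."""
    styles = {}
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) == "fontStyle":
            styles[elem.get("id")] = (
                is_set(elem.get("bold")),
                is_set(elem.get("italic")),
            )
    return styles


styles = font_styles(SAMPLE_XML)
bold_ids = [sid for sid, (bold, _) in styles.items() if bold]
print(bold_ids)
```

If bold_ids comes back empty for a real response, that matches the case described above: no text was recognized as bold.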

 

Vishnu posted this 4 weeks ago


Hi Helen,

Here is a sample image in which only italic is detected.

However, I was able to retrieve bold words when I tried what Oksana suggested here.

Is it possible to get formatting information (bold, etc.) using the "text detection" profile itself?

Thanks,

Vishnu

Helen Osetrova posted this 4 weeks ago

Hi Vishnu,

 

It is possible to get the information about formatting attributes using the textExtraction profile. For your document, we also suggest using the imageSource=scanner option, so the request to the server will look as follows:

string url = "http://cloud.ocrsdk.com/processImage?profile=textExtraction&imageSource=scanner" +
             "&exportFormat=xml&xml:writeFormatting=true";

 

Please find attached the XML file obtained using these settings.

 


Vishnu posted this 3 weeks ago

Hi Helen,

The documentation says that "auto" mode can detect the image source of the document automatically, but it sometimes treats a scanned image as a photo/captured image.

Thanks,

vishnu

Helen Osetrova posted this 3 weeks ago

Hi Vishnu,

 

 

Cloud OCR SDK is designed on the assumption that most users upload photographed documents. For this reason, with the imageSource=auto parameter, Cloud OCR SDK sometimes treats scanned documents as photos. To avoid this behavior, please apply the imageSource=scanner setting.

 

 

Hope this information will be helpful!

Vishnu posted this 2 weeks ago

Hi Helen,

I see a line spacing attribute in some paragraph tags, like:

<par lineSpacing="3600" style="{FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF}">

What does 3600 represent? I would also like to know how it can be useful and why it is not available for every par tag.

Thanks,

Vishnu

Helen Osetrova posted this 3 days ago

Hi Vishnu,

 

The lineSpacing attribute represents the space between two lines in the paragraph. You can find the descriptions of the main XML tags used in Cloud OCR SDK in the Output XML document article.
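
Since lineSpacing is not present on every par tag, code that reads it needs to handle the missing case. The Python sketch below shows one way to do that; the sample fragment is hypothetical and only modelled on the par tags quoted in this thread, and the unit of the value is not stated here, so the number is returned as-is.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the par tags quoted in this thread.
SAMPLE_XML = """
<block>
  <par align="Justified" style="{FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF}"/>
  <par lineSpacing="3600" style="{FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF}"/>
</block>
"""


def line_spacings(xml_text):
    """Return lineSpacing for each par as int, or None where it is absent."""
    result = []
    for elem in ET.fromstring(xml_text).iter():
        # Compare the local tag name so a namespaced document also matches.
        if elem.tag.rsplit("}", 1)[-1] == "par":
            raw = elem.get("lineSpacing")
            result.append(int(raw) if raw is not None else None)
    return result


spacings = line_spacings(SAMPLE_XML)
print(spacings)
```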

 
