PDF to JSON

  • 73 Views
  • Last Post 12 February 2020
joshi.pankaj112@gmail.com posted this 31 December 2019

Hi Team,

How can I convert a PDF file into JSON format using a Python script?

I am already able to convert a PDF file into XML format using a Python script.

Koen de Leijer posted this 31 December 2019

Hi

Take a look at this post, https://forum.ocrsdk.com/thread/how-to-use-api/
If you're using the Cloud Wrapper, you should be able to set the outputFormat to XML.

JSON is not an option: https://www.ocrsdk.com/documentation/specifications/export-formats/
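
Since XML is an available export format, one possible workaround is to export XML and convert it to JSON in Python afterwards. Here is a minimal sketch using only the standard library. The sample XML and the "@attributes" / "#text" key names are assumptions for illustration, not the schema the OCR service actually returns, and namespaced tags in real output would show up with their namespace prefix in the keys.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively turn an XML element into plain dicts, lists and strings."""
    node = {}
    if elem.attrib:
        # Keep attributes under a dedicated key (naming is an arbitrary choice)
        node["@attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    for child in elem:
        # Children with the same tag are collected into one list
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


def xml_to_json(xml_string):
    """Parse an XML string and return it as an indented JSON string."""
    root = ET.fromstring(xml_string)
    return json.dumps({root.tag: element_to_dict(root)}, indent=4)


# Hypothetical sample; replace it with the XML produced by your conversion step.
sample_xml = (
    '<document><page index="1">'
    "<line>Invoice No: 42</line><line>Total: 10.00</line>"
    "</page></document>"
)
json_text = xml_to_json(sample_xml)
print(json_text)
```

This keeps the structure of the XML rather than interpreting it, so the resulting JSON is only as tidy as the XML it came from.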

How far did you get? Please post some of your code.

Best regards

Koen de Leijer

JonsonSmith posted this 12 February 2020

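It's not especially pretty, but I think this could get the job done. It assumes page_content holds the text you have already extracted from a page and that the text consists of "key: value" lines. You get a dictionary, which the json module then prints in a nicely indented format.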

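import json

def get_data(page_content):
    """Build a dict from the 'key: value' lines of one page of extracted text."""
    _dict = {}
    for line in page_content.splitlines():
        # Skip lines that do not look like 'key: value'
        if ':' not in line:
            continue
        # Split on the first colon only, so values such as '10:30' stay intact
        key, value = line.split(':', 1)
        _dict[key.strip()] = value.strip()
    return _dict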

page_data = get_data(page_content)
json_data = json.dumps(page_data, indent=4)
print(json_data)

Or, instead of those last three lines, simply do this:

print(json.dumps(get_data(page_content), indent=4))
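
To make the expected input concrete, here is a small self-contained sketch that applies the same idea to several pages at once. The sample_pages strings are made up for illustration; in practice they would be the text you extracted from each PDF page. Lines are split on the first colon only, so values that contain colons themselves are kept whole.

```python
import json


def get_data(page_content):
    """Collect the 'key: value' lines of one page into a dict."""
    data = {}
    for line in page_content.splitlines():
        if ':' not in line:
            continue  # not a key/value line
        key, value = line.split(':', 1)  # split on the first colon only
        data[key.strip()] = value.strip()
    return data


# Hypothetical extracted text, one string per page.
sample_pages = [
    "Invoice No: 42\nDate: 2020-02-12\nTime: 10:30",
    "Total: 10.00\nThank you for your order",
]

# One dict per page, serialized as a JSON array.
all_pages = [get_data(text) for text in sample_pages]
json_text = json.dumps(all_pages, indent=4)
print(json_text)
```

Lines without a colon, such as the closing thank-you line, are simply dropped, so this approach only suits documents that are mostly key/value text.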
