I'd like to read multiple JSON objects from a file/stream in Python, one at a time. Unfortunately json.load() just .read()s until end-of-file; there doesn't seem to be any way to use it to read a single object or to lazily iterate over the objects.
Is there any way to do this? Using the standard library would be ideal, but if there's a third-party library I'd use that instead.
At the moment I'm putting each object on a separate line and using json.loads(f.readline()), but I would really prefer not to need to do this.
Example Use
example.py
import my_json as json
import sys

for o in json.iterload(sys.stdin):
    print("Working on a", type(o))
in.txt
{"foo": ["bar", "baz"]} 1 2 [] 4 5 6
example session
$ python3.2 example.py < in.txt
Working on a dict
Working on a int
Working on a int
Working on a list
Working on a int
Working on a int
Working on a int
11 Answers
Answers 1
Here's a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse correctly. The only limitation is the file must be seekable.
def stream_read_json(fn):
    import json
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)
                start_pos += e.pos
                yield obj
Edit: just noticed that this will only work for Python >=3.5. On earlier versions, a failed parse raises a plain ValueError, and you have to parse the position out of the error message, e.g.
def stream_read_json(fn):
    import json
    import re
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except ValueError as e:
                f.seek(start_pos)
                # Raw string so that \d and \( reach the regex engine unchanged
                end_pos = int(re.match(r'Extra data: line \d+ column \d+ .*\(char (\d+).*\)',
                                       e.args[0]).groups()[0])
                json_str = f.read(end_pos)
                obj = json.loads(json_str)
                start_pos += end_pos
                yield obj
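To see the idea without touching the filesystem, here is a minimal sketch of the same try/fail/retry trick on an in-memory string. It assumes Python 3.5+ (for json.JSONDecodeError and its msg and pos attributes); split_concatenated is a name made up for this illustration, and the repeated slicing makes it quadratic, so treat it as a demonstration rather than something tuned for big inputs.

```python
import json


def split_concatenated(text):
    """Yield each JSON value from a string of concatenated JSON values.

    Hypothetical helper: it relies on json.loads() reporting "Extra data"
    at the position where the next value starts.
    """
    pos = 0
    while pos < len(text):
        chunk = text[pos:]
        if not chunk.strip():
            break  # only trailing whitespace is left
        try:
            # Succeeds only when exactly one value remains.
            yield json.loads(chunk)
            return
        except json.JSONDecodeError as e:
            if e.msg != "Extra data":
                raise  # genuinely malformed JSON, not just a second value
            # Everything before e.pos is one complete value (plus whitespace).
            yield json.loads(chunk[:e.pos])
            pos += e.pos


if __name__ == "__main__":
    for value in split_concatenated('{"foo": ["bar", "baz"]} 1 2 [] 4 5 6'):
        print("Working on a", type(value).__name__)
```

Any error other than "Extra data" is re-raised, so truncated or malformed input still fails loudly instead of being silently skipped.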
Answers 2
JSON generally isn't very good for this sort of incremental use; there's no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.
The object per line solution that you're using is seen elsewhere too. Scrapy calls it 'JSON lines':
- http://doc.scrapy.org/topics/exporters.html#jsonlinesitemexporter
- http://www.enricozini.org/2011/tips/python-stream-json/
You can do it slightly more Pythonically:
for jsonline in f:
    yield json.loads(jsonline)   # or do the processing in this loop
I think this is about the best way - it doesn't rely on any third party libraries, and it's easy to understand what's going on. I've used it in some of my own code as well.
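For completeness, a self-contained sketch of the one-object-per-line approach might look like this. The function name iter_json_lines is just an illustration, and skipping blank lines is my own assumption about what is convenient, not part of any JSON Lines tooling mentioned above.

```python
import io
import json


def iter_json_lines(fp):
    """Lazily yield one decoded JSON value per non-empty line of fp."""
    for line in fp:
        line = line.strip()
        if line:  # assumption: blank lines are ignored rather than treated as errors
            yield json.loads(line)


if __name__ == "__main__":
    sample = io.StringIO('{"foo": ["bar", "baz"]}\n1\n\n[]\n')
    for value in iter_json_lines(sample):
        print("Working on a", type(value).__name__)
```

Because the file object is iterated line by line, only one line needs to be in memory at a time.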
Answers 3
Sure you can do this. You just have to call raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files you can modify it to only read from the file as necessary without much difficulty.
import json
from json.decoder import WHITESPACE

def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    # Duck-type the check: the `file` builtin only exists on Python 2
    if hasattr(string_or_fp, 'read'):
        string = string_or_fp.read()
    else:
        string = str(string_or_fp)

    decoder = cls(**kwargs)
    # Skip leading whitespace, then decode one value at a time
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)
        yield obj
        idx = WHITESPACE.match(string, end).end()
Usage: just as you requested, it's a generator.
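If the whole input won't fit in memory, one way to read "only as necessary" is to keep a buffer, try raw_decode on it, and pull in another chunk whenever the buffer doesn't yet hold a complete value. The sketch below is one possible take under stated assumptions: iterload_chunked, _looks_complete and the chunk_size parameter are names invented here, it targets Python 3.5+, and the completeness check is a heuristic of mine for numbers and literals cut off at a chunk boundary.

```python
import io
import json


def _looks_complete(buf, end):
    """Heuristic: was the value ending at buf[:end] really finished?"""
    if buf[end - 1] in '}]"':
        return True  # objects, arrays and strings carry their own terminator
    # Numbers and literals might continue in the next chunk.
    return end < len(buf) and buf[end] not in ".eE+-0123456789"


def iterload_chunked(fp, chunk_size=4096):
    """Yield JSON values from a text stream, reading chunk_size chars at a time."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False
    while True:
        buf = buf.lstrip()  # raw_decode does not skip leading whitespace
        if buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise  # no more data will arrive, so this is a real error
            else:
                if eof or _looks_complete(buf, end):
                    yield obj
                    buf = buf[end:]
                    continue
        elif eof:
            return
        chunk = fp.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


if __name__ == "__main__":
    stream = io.StringIO('{"foo": ["bar", "baz"]} 1 2 [] 4 5 6')
    for value in iterload_chunked(stream, chunk_size=8):
        print("Working on a", type(value).__name__)
```

The buffer only grows until one value parses, so memory use is bounded by the largest single value plus one chunk. The flip side is that malformed input is not detected until end-of-file.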
Answers 4
This is a pretty nasty problem actually, because you have to stream in lines, pattern match across multiple lines against braces, and also pattern match JSON. It's a sort of JSON pre-parse followed by a JSON parse. JSON is, in comparison to other formats, easy to parse, so it's not always necessary to go for a parsing library. Nevertheless, how should we reconcile these conflicting requirements?
Generators to the rescue!
The beauty of generators for a problem like this is you can stack them on top of each other gradually abstracting away the difficulty of the problem whilst maintaining laziness. I also considered using the mechanism for passing back values into a generator (send()) but fortunately found I didn't need to use that.
To solve the first of the problems you need some sort of streamingfinditer, as a streaming version of re.finditer. My attempt at this below pulls in lines as needed (uncomment the debug statement to see) whilst still returning matches. I actually then modified it slightly to yield non-matched lines as well as matches (marked as 0 or 1 in the first part of the yielded tuple).
import re

def streamingfinditer(pat,stream):
    for s in stream:
        # print "Read next line: " + s
        while 1:
            m = re.search(pat,s)
            if not m:
                yield (0,s)
                break
            yield (1,m.group())
            s = re.split(pat,s,1)[1]
With that, it's then possible to match up until braces, account each time for whether the braces are balanced, and then return either simple or compound objects as appropriate.
braces='{}[]'
whitespaceesc=' \t'
bracesesc='\\'+'\\'.join(braces)
balancemap=dict(zip(braces,[1,-1,1,-1]))
bracespat='['+bracesesc+']'
nobracespat='[^'+bracesesc+']*'
untilbracespat=nobracespat+bracespat

def simpleorcompoundobjects(stream):
    obj = ""
    unbalanced = 0
    for (c,m) in streamingfinditer(re.compile(untilbracespat),stream):
        if (c == 0): # remainder of line returned, nothing interesting
            if (unbalanced == 0):
                yield (0,m)
            else:
                obj += m
        if (c == 1): # match returned
            if (unbalanced == 0):
                yield (0,m[:-1])
                obj += m[-1]
            else:
                obj += m
            unbalanced += balancemap[m[-1]]
            if (unbalanced == 0):
                yield (1,obj)
                obj=""
This returns tuples as follows:
(0,"String of simple non-braced objects easy to parse")
(1,"{ 'Compound' : 'objects' }")
Basically that's the nasty part done. We now just have to do the final level of parsing as we see fit. For example we can use Jeremy Roman's iterload function (Thanks!) to do parsing for a single line:
def streamingiterload(stream):
    for c,o in simpleorcompoundobjects(stream):
        for x in iterload(o):
            yield x
Test it:
of = open("test.json","w")
of.write("""[ "hello" ] { "goodbye" : 1 } 1 2 { } 2 9 78 4 5 { "animals" : [ "dog" , "lots of mice" , "cat" ] } """)
of.close()

# open & stream the json
f = open("test.json","r")
for o in streamingiterload(f.readlines()):
    print o
f.close()
I get these results (and if you turn on that debug line, you'll see it pulls in the lines as needed):
[u'hello']
{u'goodbye': 1}
1
2
{}
2
9
78
4
5
{u'animals': [u'dog', u'lots of mice', u'cat']}
This won't work for all situations. Due to the implementation of the json library, it is impossible to work entirely correctly without reimplementing the parser yourself.
Answers 5
A little late maybe, but I had this exact problem (well, more or less). My standard solution for these problems is usually to just do a regex split on some well-known root object, but in my case it was impossible. The only feasible way to do this generically is to implement a proper tokenizer.
After not finding a generic-enough and reasonably well-performing solution, I ended up doing this myself, writing the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you though). To get some kind of performance out of it, it is written as a C module.
Example:
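import json
import sys
from splitstream import splitfile

def iterload(stream=sys.stdin):
    # splitfile hands back one raw JSON document at a time; parsing is up to us
    for jsonstr in splitfile(stream, format="json"):
        yield json.loads(jsonstr)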
Answers 6
I'd like to provide a solution. The key idea is to "try" to decode: if it fails, feed the decoder more input; otherwise use the offset information to prepare the next decoding.
However, the current json module's raw_decode can't tolerate whitespace at the start of the string to be decoded, so I have to strip it off.
import sys
import json

def iterload(file):
    buffer = ""
    dec = json.JSONDecoder()
    for line in file:
        buffer = buffer.strip(" \n\r\t") + line.strip(" \n\r\t")
        while(True):
            try:
                r = dec.raw_decode(buffer)
            except ValueError:
                # Incomplete value so far: go and read another line
                break
            yield r[0]
            buffer = buffer[r[1]:].strip(" \n\r\t")

for o in iterload(sys.stdin):
    print("Working on a", type(o), o)
I have tested it on several txt files, and it works fine. (in1.txt)
{"foo": ["bar", "baz"] } 1 2 [ ] 4 {"foo1": ["bar1", {"foo2":{"A":1, "B":3}, "DDD":4}] } 5 6
(in2.txt)
{"foo" : ["bar", "baz"] } 1 2 [ ] 4 5 6
(in.txt, your initial)
{"foo": ["bar", "baz"]} 1 2 [] 4 5 6
(output for Benedict's testcase)
python test.py < in.txt
('Working on a', <type 'list'>, [u'hello'])
('Working on a', <type 'dict'>, {u'goodbye': 1})
('Working on a', <type 'int'>, 1)
('Working on a', <type 'int'>, 2)
('Working on a', <type 'dict'>, {})
('Working on a', <type 'int'>, 2)
('Working on a', <type 'int'>, 9)
('Working on a', <type 'int'>, 78)
('Working on a', <type 'int'>, 4)
('Working on a', <type 'int'>, 5)
('Working on a', <type 'dict'>, {u'animals': [u'dog', u'lots of mice', u'cat']})
Answers 7
I used @wuilang's elegant solution. The simple approach -- read a byte, try to decode, read a byte, try to decode, ... -- worked, but unfortunately it was very slow.
In my case, I was trying to read "pretty-printed" JSON objects of the same object type from a file. This allowed me to optimize the approach; I could read the file line-by-line, only decoding when I found a line that contained exactly "}":
import json

def iterload(stream):
    buf = ""
    dec = json.JSONDecoder()
    for line in stream:
        line = line.rstrip()
        buf = buf + line
        if line == "}":
            # Note: raw_decode returns an (object, end_index) tuple
            yield dec.raw_decode(buf)
            buf = ""
If you happen to be working with one-per-line compact JSON that escapes newlines in string literals, then you can safely simplify this approach even more:
import json

def iterload(stream):
    dec = json.JSONDecoder()
    for line in stream:
        # Note: raw_decode returns an (object, end_index) tuple
        yield dec.raw_decode(line)
Obviously, these simple approaches only work for very specific kinds of JSON. However, if these assumptions hold, these solutions work correctly and quickly.
Answers 8
I believe a better way of doing it would be to use a state machine. Below is sample code that I worked out by converting the NodeJS code at the link below to Python 3 (it uses the nonlocal keyword, which is only available in Python 3, so that version won't work on Python 2).
Edit-1: Updated and made code compatible with Python 2
Edit-2: Updated and added a Python3 only version as well
https://gist.github.com/creationix/5992451
Python 3 only version
# A streaming byte oriented JSON parser. Feed it a single byte at a time and
# it will emit complete objects as it comes across them. Whitespace within and
# between objects is ignored. This means it can parse newline delimited JSON.
import math


def json_machine(emit, next_func=None):
    def _value(byte_data):
        if not byte_data:
            return

        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _value  # Ignore whitespace

        if byte_data == 0x22:  # "
            return string_machine(on_value)

        if byte_data == 0x2d or (0x30 <= byte_data < 0x40):  # - or 0-9
            return number_machine(byte_data, on_number)

        if byte_data == 0x7b:  # {
            return object_machine(on_value)

        if byte_data == 0x5b:  # [
            return array_machine(on_value)

        if byte_data == 0x74:  # t
            return constant_machine(TRUE, True, on_value)

        if byte_data == 0x66:  # f
            return constant_machine(FALSE, False, on_value)

        if byte_data == 0x6e:  # n
            return constant_machine(NULL, None, on_value)

        if next_func == _value:
            raise Exception("Unexpected 0x" + str(byte_data))

        return next_func(byte_data)

    def on_value(value):
        emit(value)
        return next_func

    def on_number(number, byte):
        emit(number)
        return _value(byte)

    next_func = next_func or _value
    return _value


TRUE = [0x72, 0x75, 0x65]
FALSE = [0x61, 0x6c, 0x73, 0x65]
NULL = [0x75, 0x6c, 0x6c]


def constant_machine(bytes_data, value, emit):
    i = 0
    length = len(bytes_data)

    def _constant(byte_data):
        nonlocal i
        if byte_data != bytes_data[i]:
            i += 1
            raise Exception("Unexpected 0x" + str(byte_data))

        i += 1
        if i < length:
            return _constant
        return emit(value)

    return _constant


def string_machine(emit):
    string = ""

    def _string(byte_data):
        nonlocal string

        if byte_data == 0x22:  # "
            return emit(string)

        if byte_data == 0x5c:  # \
            return _escaped_string

        if byte_data & 0x80:  # UTF-8 handling
            return utf8_machine(byte_data, on_char_code)

        if byte_data < 0x20:  # ASCII control character
            raise Exception("Unexpected control character: 0x" + str(byte_data))

        string += chr(byte_data)
        return _string

    def _escaped_string(byte_data):
        nonlocal string

        if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # " \ /
            string += chr(byte_data)
            return _string

        if byte_data == 0x62:  # b
            string += "\b"
            return _string

        if byte_data == 0x66:  # f
            string += "\f"
            return _string

        if byte_data == 0x6e:  # n
            string += "\n"
            return _string

        if byte_data == 0x72:  # r
            string += "\r"
            return _string

        if byte_data == 0x74:  # t
            string += "\t"
            return _string

        if byte_data == 0x75:  # u
            return hex_machine(on_char_code)

    def on_char_code(char_code):
        nonlocal string
        string += chr(char_code)
        return _string

    return _string


# Nestable state machine for UTF-8 Decoding.
def utf8_machine(byte_data, emit):
    left = 0
    num = 0

    def _utf8(byte_data):
        nonlocal num, left
        if (byte_data & 0xc0) != 0x80:
            # format(..., "x") replaces the JavaScript-only .toString(16)
            raise Exception("Invalid byte in UTF-8 character: 0x" + format(byte_data, "x"))

        left = left - 1

        num |= (byte_data & 0x3f) << (left * 6)
        if left:
            return _utf8
        return emit(num)

    if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
        left = 1
        num = (byte_data & 0x1f) << 6
        return _utf8

    if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
        left = 2
        num = (byte_data & 0xf) << 12
        return _utf8

    if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
        left = 3
        num = (byte_data & 0x07) << 18
        return _utf8

    raise Exception("Invalid byte in UTF-8 string: 0x" + str(byte_data))


# Nestable state machine for hex escaped characters
def hex_machine(emit):
    left = 4
    num = 0

    def _hex(byte_data):
        nonlocal num, left

        if 0x30 <= byte_data < 0x40:
            i = byte_data - 0x30
        elif 0x61 <= byte_data <= 0x66:
            i = byte_data - 0x57
        elif 0x41 <= byte_data <= 0x46:
            i = byte_data - 0x37
        else:
            raise Exception("Expected hex char in string hex escape")

        left -= 1
        num |= i << (left * 4)

        if left:
            return _hex
        return emit(num)

    return _hex


def number_machine(byte_data, emit):
    sign = 1
    number = 0
    decimal = 0
    esign = 1
    exponent = 0

    def _mid(byte_data):
        if byte_data == 0x2e:  # .
            return _decimal

        return _later(byte_data)

    def _number(byte_data):
        nonlocal number
        if 0x30 <= byte_data < 0x40:
            number = number * 10 + (byte_data - 0x30)
            return _number

        return _mid(byte_data)

    def _start(byte_data):
        if byte_data == 0x30:
            return _mid

        if 0x30 < byte_data < 0x40:
            return _number(byte_data)

        raise Exception("Invalid number: 0x" + str(byte_data))

    def _decimal(byte_data):
        nonlocal decimal
        if 0x30 <= byte_data < 0x40:
            decimal = (decimal + byte_data - 0x30) / 10
            return _decimal

        return _later(byte_data)

    def _later(byte_data):
        if byte_data == 0x45 or byte_data == 0x65:  # E e
            return _esign

        return _done(byte_data)

    def _esign(byte_data):
        nonlocal esign
        if byte_data == 0x2b:  # +
            return _exponent

        if byte_data == 0x2d:  # -
            esign = -1
            return _exponent

        return _exponent(byte_data)

    def _exponent(byte_data):
        nonlocal exponent
        if 0x30 <= byte_data < 0x40:
            exponent = exponent * 10 + (byte_data - 0x30)
            return _exponent

        return _done(byte_data)

    def _done(byte_data):
        value = sign * (number + decimal)
        if exponent:
            value *= math.pow(10, esign * exponent)

        return emit(value, byte_data)

    # Checked after all the inner states are defined, so that a negative
    # number can still reach _decimal, _later and friends.
    if byte_data == 0x2d:  # -
        sign = -1
        return _start

    return _start(byte_data)


def array_machine(emit):
    array_data = []

    def _array(byte_data):
        if byte_data == 0x5d:  # ]
            return emit(array_data)

        return json_machine(on_value, _comma)(byte_data)

    def on_value(value):
        array_data.append(value)

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return json_machine(on_value, _comma)

        if byte_data == 0x5d:  # ]
            return emit(array_data)

        raise Exception("Unexpected byte: 0x" + str(byte_data) + " in array body")

    return _array


def object_machine(emit):
    object_data = {}
    key = None

    def _object(byte_data):
        if byte_data == 0x7d:  # }
            return emit(object_data)

        return _key(byte_data)

    def _key(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _object  # Ignore whitespace

        if byte_data == 0x22:
            return string_machine(on_key)

        raise Exception("Unexpected byte: 0x" + str(byte_data))

    def on_key(result):
        nonlocal key
        key = result
        return _colon

    def _colon(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _colon  # Ignore whitespace

        if byte_data == 0x3a:  # :
            return json_machine(on_value, _comma)

        raise Exception("Unexpected byte: 0x" + str(byte_data))

    def on_value(value):
        object_data[key] = value

    def _comma(byte_data):
        if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
            return _comma  # Ignore whitespace

        if byte_data == 0x2c:  # ,
            return _key

        if byte_data == 0x7d:  # }
            return emit(object_data)

        raise Exception("Unexpected byte: 0x" + str(byte_data))

    return _object
Python 2 compatible version
# A streaming byte oriented JSON parser. Feed it a single byte at a time and # it will emit complete objects as it comes across them. Whitespace within and # between objects is ignored. This means it can parse newline delimited JSON. import math def json_machine(emit, next_func=None): def _value(byte_data): if not byte_data: return if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _value # Ignore whitespace if byte_data == 0x22: # " return string_machine(on_value) if byte_data == 0x2d or (0x30 <= byte_data < 0x40): # - or 0-9 return number_machine(byte_data, on_number) if byte_data == 0x7b: #: return object_machine(on_value) if byte_data == 0x5b: # [ return array_machine(on_value) if byte_data == 0x74: # t return constant_machine(TRUE, True, on_value) if byte_data == 0x66: # f return constant_machine(FALSE, False, on_value) if byte_data == 0x6e: # n return constant_machine(NULL, None, on_value) if next_func == _value: raise Exception("Unexpected 0x" + str(byte_data)) return next_func(byte_data) def on_value(value): emit(value) return next_func def on_number(number, byte): emit(number) return _value(byte) next_func = next_func or _value return _value TRUE = [0x72, 0x75, 0x65] FALSE = [0x61, 0x6c, 0x73, 0x65] NULL = [0x75, 0x6c, 0x6c] def constant_machine(bytes_data, value, emit): local_data = {"i": 0, "length": len(bytes_data)} def _constant(byte_data): # nonlocal i, length if byte_data != bytes_data[local_data["i"]]: local_data["i"] += 1 raise Exception("Unexpected 0x" + byte_data.toString(16)) local_data["i"] += 1 if local_data["i"] < local_data["length"]: return _constant return emit(value) return _constant def string_machine(emit): local_data = {"string": ""} def _string(byte_data): # nonlocal string if byte_data == 0x22: # " return emit(local_data["string"]) if byte_data == 0x5c: # \ return _escaped_string if byte_data & 0x80: # UTF-8 handling return utf8_machine(byte_data, on_char_code) if byte_data < 0x20: # ASCII 
control character raise Exception("Unexpected control character: 0x" + byte_data.toString(16)) local_data["string"] += chr(byte_data) return _string def _escaped_string(byte_data): # nonlocal string if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f: # " \ / local_data["string"] += chr(byte_data) return _string if byte_data == 0x62: # b local_data["string"] += "\b" return _string if byte_data == 0x66: # f local_data["string"] += "\f" return _string if byte_data == 0x6e: # n local_data["string"] += "\n" return _string if byte_data == 0x72: # r local_data["string"] += "\r" return _string if byte_data == 0x74: # t local_data["string"] += "\t" return _string if byte_data == 0x75: # u return hex_machine(on_char_code) def on_char_code(char_code): # nonlocal string local_data["string"] += chr(char_code) return _string return _string # Nestable state machine for UTF-8 Decoding. def utf8_machine(byte_data, emit): local_data = {"left": 0, "num": 0} def _utf8(byte_data): # nonlocal num, left if (byte_data & 0xc0) != 0x80: raise Exception("Invalid byte in UTF-8 character: 0x" + byte_data.toString(16)) local_data["left"] -= 1 local_data["num"] |= (byte_data & 0x3f) << (local_data["left"] * 6) if local_data["left"]: return _utf8 return emit(local_data["num"]) if 0xc0 <= byte_data < 0xe0: # 2-byte UTF-8 Character local_data["left"] = 1 local_data["num"] = (byte_data & 0x1f) << 6 return _utf8 if 0xe0 <= byte_data < 0xf0: # 3-byte UTF-8 Character local_data["left"] = 2 local_data["num"] = (byte_data & 0xf) << 12 return _utf8 if 0xf0 <= byte_data < 0xf8: # 4-byte UTF-8 Character local_data["left"] = 3 local_data["num"] = (byte_data & 0x07) << 18 return _utf8 raise Exception("Invalid byte in UTF-8 string: 0x" + str(byte_data)) # Nestable state machine for hex escaped characters def hex_machine(emit): local_data = {"left": 4, "num": 0} def _hex(byte_data): # nonlocal num, left i = 0 # Parse the hex byte if 0x30 <= byte_data < 0x40: i = byte_data - 0x30 elif 0x61 <= 
byte_data <= 0x66: i = byte_data - 0x57 elif 0x41 <= byte_data <= 0x46: i = byte_data - 0x37 else: raise Exception("Expected hex char in string hex escape") local_data["left"] -= 1 local_data["num"] |= i << (local_data["left"] * 4) if local_data["left"]: return _hex return emit(local_data["num"]) return _hex def number_machine(byte_data, emit): local_data = {"sign": 1, "number": 0, "decimal": 0, "esign": 1, "exponent": 0} def _mid(byte_data): if byte_data == 0x2e: # . return _decimal return _later(byte_data) def _number(byte_data): # nonlocal number if 0x30 <= byte_data < 0x40: local_data["number"] = local_data["number"] * 10 + (byte_data - 0x30) return _number return _mid(byte_data) def _start(byte_data): if byte_data == 0x30: return _mid if 0x30 < byte_data < 0x40: return _number(byte_data) raise Exception("Invalid number: 0x" + byte_data.toString(16)) if byte_data == 0x2d: # - local_data["sign"] = -1 return _start def _decimal(byte_data): # nonlocal decimal if 0x30 <= byte_data < 0x40: local_data["decimal"] = (local_data["decimal"] + byte_data - 0x30) / 10 return _decimal return _later(byte_data) def _later(byte_data): if byte_data == 0x45 or byte_data == 0x65: # E e return _esign return _done(byte_data) def _esign(byte_data): # nonlocal esign if byte_data == 0x2b: # + return _exponent if byte_data == 0x2d: # - local_data["esign"] = -1 return _exponent return _exponent(byte_data) def _exponent(byte_data): # nonlocal exponent if 0x30 <= byte_data < 0x40: local_data["exponent"] = local_data["exponent"] * 10 + (byte_data - 0x30) return _exponent return _done(byte_data) def _done(byte_data): value = local_data["sign"] * (local_data["number"] + local_data["decimal"]) if local_data["exponent"]: value *= math.pow(10, local_data["esign"] * local_data["exponent"]) return emit(value, byte_data) return _start(byte_data) def array_machine(emit): local_data = {"array_data": []} def _array(byte_data): if byte_data == 0x5d: # ] return emit(local_data["array_data"]) return 
json_machine(on_value, _comma)(byte_data) def on_value(value): # nonlocal array_data local_data["array_data"].append(value) def _comma(byte_data): if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _comma # Ignore whitespace if byte_data == 0x2c: # , return json_machine(on_value, _comma) if byte_data == 0x5d: # ] return emit(local_data["array_data"]) raise Exception("Unexpected byte: 0x" + str(byte_data) + " in array body") return _array def object_machine(emit): local_data = {"object_data": {}, "key": ""} def _object(byte_data): # nonlocal object_data, key if byte_data == 0x7d: # return emit(local_data["object_data"]) return _key(byte_data) def _key(byte_data): if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _object # Ignore whitespace if byte_data == 0x22: return string_machine(on_key) raise Exception("Unexpected byte: 0x" + byte_data.toString(16)) def on_key(result): # nonlocal object_data, key local_data["key"] = result return _colon def _colon(byte_data): # nonlocal object_data, key if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _colon # Ignore whitespace if byte_data == 0x3a: # : return json_machine(on_value, _comma) raise Exception("Unexpected byte: 0x" + str(byte_data)) def on_value(value): # nonlocal object_data, key local_data["object_data"][local_data["key"]] = value def _comma(byte_data): # nonlocal object_data if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20: return _comma # Ignore whitespace if byte_data == 0x2c: # , return _key if byte_data == 0x7d: # return emit(local_data["object_data"]) raise Exception("Unexpected byte: 0x" + str(byte_data)) return _object
Testing it
if __name__ == "__main__":
    test_json = """[1,2,"3"] {"name": "tarun"} 1 2 3 [{"name":"a", "data": [1, null,2]}] """

    def found_json(data):
        print(data)

    state = json_machine(found_json)

    for char in test_json:
        state = state(ord(char))
The output of the same is
[1, 2, '3']
{'name': 'tarun'}
1
2
3
[{'name': 'a', 'data': [1, None, 2]}]
Answers 9
Here's mine:
import simplejson as json
from simplejson import JSONDecodeError

class StreamJsonListLoader():
    """
    When you have a big JSON file containing a list, such as

    [{
        ...
    },
    {
        ...
    },
    {
        ...
    },
    ...
    ]

    And it's too big to be practically loaded into memory and parsed by json.load,
    This class comes to the rescue. It lets you lazy-load the large json list.
    """

    def __init__(self, filename_or_stream):
        if type(filename_or_stream) == str:
            self.stream = open(filename_or_stream)
        else:
            self.stream = filename_or_stream

        if not self.stream.read(1) == '[':
            raise NotImplementedError('Only JSON-streams of lists (that start with a [) are supported.')

    def __iter__(self):
        return self

    def next(self):
        read_buffer = self.stream.read(1)
        while True:
            try:
                json_obj = json.loads(read_buffer)

                if not self.stream.read(1) in [',',']']:
                    raise Exception('JSON seems to be malformed: object is not followed by comma (,) or end of list (]).')
                return json_obj
            except JSONDecodeError:
                # Not a complete object yet: read up to the next closing brace and retry
                next_char = self.stream.read(1)
                read_buffer += next_char
                while next_char != '}':
                    next_char = self.stream.read(1)
                    if next_char == '':
                        raise StopIteration
                    read_buffer += next_char

    # Python 3 looks up __next__ instead of next
    __next__ = next
Answers 10
If you use a json.JSONDecoder instance you can use its raw_decode method. It returns a tuple of the Python representation of the JSON value and the index where parsing stopped. This makes it easy to slice off (or seek past, in a stream object) the part already parsed and continue with the remaining JSON values. I'm not so happy about the extra while loop needed to skip over the whitespace between the different JSON values in the input, but in my opinion it gets the job done.
import json

def yield_multiple_value(f):
    '''
    parses multiple JSON values from a file.
    '''
    vals_str = f.read()
    decoder = json.JSONDecoder()
    try:
        nread = 0
        while nread < len(vals_str):
            val, n = decoder.raw_decode(vals_str[nread:])
            nread += n
            # Skip over whitespace because of bug, below.
            while nread < len(vals_str) and vals_str[nread].isspace():
                nread += 1
            yield val
    except json.JSONDecodeError as e:
        pass
    return
The next version is much shorter and eats the part of the string that is already parsed. For some reason a second call to json.JSONDecoder.raw_decode() fails when the first character in the string is whitespace; that is also why I skip over the whitespace in the while loop above.
def yield_multiple_value(f):
    '''
    parses multiple JSON values from a file.
    '''
    vals_str = f.read()
    decoder = json.JSONDecoder()
    while vals_str:
        val, n = decoder.raw_decode(vals_str)
        #remove the read characters from the start.
        vals_str = vals_str[n:]
        # remove leading white space because a second call to decoder.raw_decode()
        # fails when the string starts with whitespace, and
        # I don't understand why...
        vals_str = vals_str.lstrip()
        yield val
    return
The documentation of the raw_decode method of the json.JSONDecoder class (https://docs.python.org/3/library/json.html#encoders-and-decoders) contains the following:
This can be used to decode a JSON document from a string that may have extraneous data at the end.
And this extraneous data can easily be another JSON value. In other words the method might be written with this purpose in mind.
Running the first function above on the in.txt from the question, I obtain the example output presented there.
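On the whitespace puzzle: as far as I can tell from CPython's json module, decode() (and therefore json.loads) skips leading whitespace itself before handing over to raw_decode(), whereas raw_decode() starts scanning exactly at the index it is given. A leading space therefore produces "Expecting value" rather than being skipped. The sketch below illustrates this and uses raw_decode's idx argument so no slicing is needed. The names _WS, raw_decode_accepts and iter_values are made up here, and the regex is my own stand-in for the whitespace set JSON allows.

```python
import json
import re

# Assumption: the four whitespace characters permitted between JSON tokens.
_WS = re.compile(r"[ \t\n\r]*")


def raw_decode_accepts(text):
    """Return True if raw_decode can parse a value at the very start of text."""
    try:
        json.JSONDecoder().raw_decode(text)
    except json.JSONDecodeError:
        return False
    return True


def iter_values(text):
    """Yield every JSON value in text, skipping whitespace by index."""
    decoder = json.JSONDecoder()
    idx = _WS.match(text, 0).end()
    while idx < len(text):
        # raw_decode returns the value and the index just past it.
        value, idx = decoder.raw_decode(text, idx)
        yield value
        idx = _WS.match(text, idx).end()


if __name__ == "__main__":
    print(raw_decode_accepts("1 2"))   # value starts at index 0
    print(raw_decode_accepts(" 2"))    # leading space is not skipped
    print(list(iter_values('{"foo": ["bar", "baz"]} 1 2 [] 4 5 6')))
```

Advancing an index instead of re-slicing the string also avoids copying the remaining input after every value.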
Answers 11
For production code, it is better (really faster !!!) to use ujson:
import ujson as json
Then do your job as usual !