A Brief Look at the Python json Source Code: load - ChrisJaunes


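Functions involved in load

Sequence diagram of load

Default flow for parsing a JSON string

The load function

json.load decodes a file that contains JSON data.

The source code is on GitHub.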

def load(fp, *, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    return loads(fp.read(),
        cls=cls, object_hook=object_hook,
        parse_float=parse_float, parse_int=parse_int,
        parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

This function reads the file object fp and hands the work of parsing fp.read() to loads.
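To make the delegation concrete, here is a minimal sketch that compares json.load with json.loads(fp.read()). It uses io.StringIO as a stand-in for a real file, and the sample JSON text is made up for illustration.

```python
import io
import json

# Any object with a .read() method works; StringIO stands in for an open file.
fp = io.StringIO('{"name": "json", "tags": ["load", "loads"]}')

from_file = json.load(fp)          # internally: loads(fp.read(), ...)

fp.seek(0)                         # rewind so the text can be read again
from_text = json.loads(fp.read())  # the same call, written out by hand

print(from_file == from_text)
```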

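The loads function

json.loads decodes JSON data and returns the corresponding Python object (dict, list, str, int, float, bool, or None).

The source code is on GitHub.

loads accepts the object s to be parsed, which may be a str, bytes, or bytearray.

If s is the only argument supplied, the default decoder _default_decoder is used and _default_decoder.decode(s) is called.

Otherwise a decoder is constructed:

If no decoder class is given, JSONDecoder is used as the decoder class; otherwise the custom class is used, and it should be a subclass of JSONDecoder.

The cls argument is consumed and the remaining arguments are passed to the decoder class's constructor; they are described in the discussion of py_make_scanner.

Special note: The encoding argument is ignored and deprecated.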

def loads(s, *, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    if isinstance(s, str):
        if s.startswith('\ufeff'):
            raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",
                                  s, 0)
    else:
        if not isinstance(s, (bytes, bytearray)):
            raise TypeError(f'the JSON object must be str, bytes or bytearray, '
                            f'not {s.__class__.__name__}')
        s = s.decode(detect_encoding(s), 'surrogatepass')
    if (cls is None and object_hook is None and
            parse_int is None and parse_float is None and
            parse_constant is None and object_pairs_hook is None and not kw):
        return _default_decoder.decode(s)
    if cls is None:
        cls = JSONDecoder
    if object_hook is not None:
        kw['object_hook'] = object_hook
    if object_pairs_hook is not None:
        kw['object_pairs_hook'] = object_pairs_hook
    if parse_float is not None:
        kw['parse_float'] = parse_float
    if parse_int is not None:
        kw['parse_int'] = parse_int
    if parse_constant is not None:
        kw['parse_constant'] = parse_constant
    return cls(**kw).decode(s)
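The branches above can be exercised with a short sketch. The inputs are invented for illustration; the calls themselves are the public json API.

```python
import json
from decimal import Decimal

# str input with no extra arguments: the shared _default_decoder is used.
plain = json.loads('{"a": [1, 2.5]}')

# bytes input: the encoding is detected and the bytes are decoded first.
from_utf8 = json.loads(b'{"a": 1}')
from_utf16 = json.loads('{"a": 1}'.encode('utf-16'))

# Any extra argument forces a new decoder: cls(**kw).decode(s).
exact = json.loads('1.10', parse_float=Decimal)

# A str that starts with a BOM is rejected up front.
try:
    json.loads('\ufeff{}')
    bom_error = None
except json.JSONDecodeError as err:
    bom_error = err.msg
```

Passing bytes avoids guessing the encoding by hand, while the BOM check only applies to text that has already been decoded.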

The decode function

decode is a method of the JSONDecoder class.

The source code is on GitHub.

class JSONDecoder(object):
    def decode(self, s, _w=WHITESPACE.match):
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
        end = _w(s, end).end()
        if end != len(s):
            raise JSONDecodeError("Extra data", s, end)
        return obj

decode hands the parsing to the raw_decode method, then checks that nothing but whitespace follows the parsed value.
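A small sketch of that behavior: surrounding whitespace is accepted, while a second value after the first triggers the "Extra data" error. The sample strings are illustrative only.

```python
import json

decoder = json.JSONDecoder()

# Leading and trailing whitespace is skipped by the WHITESPACE regex.
value = decoder.decode('  [1, 2]  \n')

# Anything other than whitespace after the first value is an error.
try:
    decoder.decode('[1, 2] [3]')
    message, position = None, None
except json.JSONDecodeError as err:
    message, position = err.msg, err.pos

print(value, message, position)
```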

The raw_decode function

raw_decode is a method of the JSONDecoder class.

The source code is on GitHub.

class JSONDecoder(object):
    def raw_decode(self, s, idx=0):
        try:
            obj, end = self.scan_once(s, idx)
        except StopIteration as err:
            raise JSONDecodeError("Expecting value", s, err.value) from None
        return obj, end

raw_decode hands the parsing to the scan_once function.

scan_once is created in the JSONDecoder constructor as scanner.make_scanner(self).
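Because raw_decode returns the end index together with the object and does not complain about trailing data, it can be used to read several JSON values from one string. The helper below is a hypothetical sketch (iter_values is not part of the json module) that assumes the values are separated only by whitespace.

```python
import json


def iter_values(text):
    """Yield every JSON value found in text, one after another."""
    decoder = json.JSONDecoder()
    idx = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            return
        obj, idx = decoder.raw_decode(text, idx)
        yield obj


values = list(iter_values('{"a": 1} [2, 3]  "x"'))
print(values)
```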

The make_scanner function

make_scanner is a function in the scanner module.

The source code is on GitHub.

make_scanner = c_make_scanner or py_make_scanner

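make_scanner actually points to either c_make_scanner or py_make_scanner.

c_make_scanner is written in C; its source code is on GitHub.

py_make_scanner is written in Python; its source code is on GitHub.

These two do the real work of parsing a JSON string. To analyze them, we first need a closer look at the JSON grammar.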
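Which implementation is active can be checked at runtime. The sketch below relies on the module-level names in json.scanner shown above; they are internal to CPython and may differ on other interpreters.

```python
import json
import json.scanner as scanner

# c_make_scanner is None when the C accelerator is not available.
using_c = scanner.c_make_scanner is not None
selected = scanner.make_scanner

# Both implementations should agree on the result for the same input.
decoder = json.JSONDecoder()
py_scan = scanner.py_make_scanner(decoder)
same = py_scan('[1, 2]', 0) == decoder.scan_once('[1, 2]', 0)

print(using_c, same)
```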

The grammar of the JSON format

JSON grammar

JSONObject -> { ObjectMembers }

ObjectMembers -> Members | ε

Members -> Member A

A -> , Member A | ε

Member -> Key : Value

Key -> str

Value -> JSON

JSONArray -> [ ArrayElements ]

ArrayElements -> Elements | ε

Elements -> Element B

B -> , Element B | ε

Element -> JSON
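The productions above map almost one to one onto recursive functions. The following is a minimal sketch, not the standard library's code: the names skip_ws, parse_value, parse_object, and parse_array are hypothetical, and to stay short it only supports strings without escapes and integers as leaf values.

```python
def skip_ws(s, i):
    """Advance i past spaces, tabs, and line breaks."""
    while i < len(s) and s[i] in " \t\n\r":
        i += 1
    return i


def parse_value(s, i):
    """JSON: return (value, next_index) for the value starting at s[i]."""
    i = skip_ws(s, i)
    ch = s[i]
    if ch == "{":
        return parse_object(s, i + 1)
    if ch == "[":
        return parse_array(s, i + 1)
    if ch == '"':
        j = s.index('"', i + 1)      # simplification: no escape sequences
        return s[i + 1:j], j + 1
    j = i + 1                        # simplification: integers only
    while j < len(s) and s[j].isdigit():
        j += 1
    return int(s[i:j]), j


def parse_object(s, i):
    """JSONObject -> { ObjectMembers }; i points just past '{'."""
    result = {}
    i = skip_ws(s, i)
    if s[i] == "}":                  # ObjectMembers -> ε
        return result, i + 1
    while True:                      # Members -> Member A
        key, i = parse_value(s, i)   # Key -> str
        if not isinstance(key, str):
            raise ValueError("object keys must be strings")
        i = skip_ws(s, i)
        if s[i] != ":":
            raise ValueError("expected ':'")
        result[key], i = parse_value(s, i + 1)   # Value -> JSON
        i = skip_ws(s, i)
        if s[i] == "}":              # A -> ε
            return result, i + 1
        if s[i] != ",":              # A -> , Member A
            raise ValueError("expected ',' or '}'")
        i += 1


def parse_array(s, i):
    """JSONArray -> [ ArrayElements ]; i points just past '['."""
    result = []
    i = skip_ws(s, i)
    if s[i] == "]":                  # ArrayElements -> ε
        return result, i + 1
    while True:                      # Elements -> Element B
        value, i = parse_value(s, i)
        result.append(value)
        i = skip_ws(s, i)
        if s[i] == "]":              # B -> ε
            return result, i + 1
        if s[i] != ",":              # B -> , Element B
            raise ValueError("expected ',' or ']'")
        i += 1


text = '{"a": [1, 2, {"b": 3}], "c": []}'
parsed, end = parse_value(text, 0)
print(parsed, end == len(text))
```

The loops play the role of the tail productions A and B: each iteration consumes one member or element and then decides between the comma case and ε.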

Writing a JSON parser

The _scan_once function

This is a closure whose free variables are defined in py_make_scanner:

parse_object = context.parse_object = parse_object in JSONDecoder = JSONObject;
JSONObject is located on GitHub.

parse_array = context.parse_array = parse_array in JSONDecoder = JSONArray;
JSONArray is located on GitHub.

parse_string = context.parse_string = parse_string in JSONDecoder = scanstring;
scanstring is located on GitHub.

match_number = NUMBER_RE.match, where NUMBER_RE = re.compile(r'(-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?', (re.VERBOSE | re.MULTILINE | re.DOTALL))

strict = context.strict = strict in JSONDecoder = True, unless a strict argument was passed to loads through kw

parse_float = context.parse_float = parse_float in JSONDecoder = float, or the parse_float passed to loads

parse_int = context.parse_int = parse_int in JSONDecoder = int, or the parse_int passed to loads

parse_constant = context.parse_constant = parse_constant in JSONDecoder = _CONSTANTS.__getitem__, or the parse_constant passed to loads

object_hook = context.object_hook = object_hook in JSONDecoder = the object_hook passed to loads

object_pairs_hook = context.object_pairs_hook = object_pairs_hook in JSONDecoder = the object_pairs_hook passed to loads

memo = context.memo = memo in JSONDecoder = {}

It uses recursive descent parsing.

def _scan_once(string, idx):
    try:
        nextchar = string[idx]
    except IndexError:
        raise StopIteration(idx) from None
    if nextchar == '"':
        return parse_string(string, idx + 1, strict)
    elif nextchar == '{':
        return parse_object((string, idx + 1), strict,
            _scan_once, object_hook, object_pairs_hook, memo)
    elif nextchar == '[':
        return parse_array((string, idx + 1), _scan_once)
    elif nextchar == 'n' and string[idx:idx + 4] == 'null':
        return None, idx + 4
    elif nextchar == 't' and string[idx:idx + 4] == 'true':
        return True, idx + 4
    elif nextchar == 'f' and string[idx:idx + 5] == 'false':
        return False, idx + 5
    m = match_number(string, idx)
    if m is not None:
        integer, frac, exp = m.groups()
        if frac or exp:
            res = parse_float(integer + (frac or '') + (exp or ''))
        else:
            res = parse_int(integer)
        return res, m.end()
    elif nextchar == 'N' and string[idx:idx + 3] == 'NaN':
        return parse_constant('NaN'), idx + 3
    elif nextchar == 'I' and string[idx:idx + 8] == 'Infinity':
        return parse_constant('Infinity'), idx + 8
    elif nextchar == '-' and string[idx:idx + 9] == '-Infinity':
        return parse_constant('-Infinity'), idx + 9
    else:
        raise StopIteration(idx)
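To watch the dispatch on the first character, the pure-Python scanner can be built directly from a JSONDecoder instance. This is a sketch that leans on py_make_scanner, an internal CPython name; the sample inputs are made up.

```python
import json
import math
from json.scanner import py_make_scanner

# The decoder instance is the "context" that supplies the free variables.
scan_once = py_make_scanner(json.JSONDecoder())

samples = ['"text"', '{"k": null}', '[true, false]', '-12', '3.5e2', 'NaN']
# Each call returns (value, index just past the value).
results = [scan_once(sample, 0) for sample in samples]

# An unrecognized first character ends in StopIteration carrying the index.
try:
    scan_once('?', 0)
    failed_at = None
except StopIteration as stop:
    failed_at = stop.value

for sample, result in zip(samples, results):
    print(sample, '->', result)
```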

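The parse_string function

By default, parse_string is c_scanstring or py_scanstring.

Both parse strings; c_scanstring is written in C and py_scanstring is written in Python.

The following discusses py_scanstring; its source code is on GitHub.

py_scanstring uses a regular expression:

_m = re.compile(r'(.*?)(["\\\x00-\x1f])', re.VERBOSE | re.MULTILINE | re.DOTALL).match

It matches the shortest run of characters that ends with ", \, or a control character in the range \x00-\x1f. groups() returns the captured groups; two groups are defined here, so terminator receives the second one.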
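The two groups are easy to inspect in isolation. The sketch below rebuilds the same pattern under the hypothetical name chunk_match and then calls json.decoder.py_scanstring, an internal function whose second argument is the index just past the opening quote.

```python
import re
from json.decoder import py_scanstring

# Same pattern as above; chunk_match is a name chosen for this example.
chunk_match = re.compile(r'(.*?)(["\\\x00-\x1f])',
                         re.VERBOSE | re.MULTILINE | re.DOTALL).match

# Group 1 is the literal text, group 2 is the character that stopped the match.
ends_with_quote = chunk_match('abc"rest', 0).groups()
ends_with_backslash = chunk_match('ab\\ncd"', 0).groups()

# The source text is "a\tb" (with a literal backslash), followed by more data.
source = '"a\\tb" tail'
decoded, end = py_scanstring(source, 1)

print(ends_with_quote, ends_with_backslash, repr(decoded), end)
```

A quote terminator ends the string, a backslash starts escape handling, and a control character is an error in strict mode.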

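The JSONObject function

By default, parse_object is the JSONObject function.

The following discusses JSONObject; its source code is on GitHub.

JSONObject has to skip whitespace characters such as spaces, tabs, carriage returns, and newlines; a regular expression does this.

A JSONObject key must be a string; for the reason, see stackoverflow.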

JSONObject -> { ObjectMembers }

ObjectMembers -> Members | ε

Handle the case JSONObject = {}

Member -> Key : Value

Key -> str

Value -> JSON | null

If object_pairs_hook is not None, the pairs are processed with object_pairs_hook.

object_pairs_hook receives a list in which every element is a (key, value) tuple.

Here is a demo:

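import json

class Test(object):
    def __init__(self, li):
        # li is the list of (key, value) tuples collected by JSONObject
        self.li = li
    def __str__(self):
        return "[Test object_pairs_hook] : " + self.li.__str__()
def test():
    t = json.loads('{"123": 234}', object_pairs_hook=Test)
    print(t)  # [Test object_pairs_hook] : [('123', 234)]
    print(type(t))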

If object_hook is not None (and object_pairs_hook is None), dict(pairs) is processed with object_hook.

import json

class Test(object):
    def __init__(self, di):
        # di is the dict built from the parsed pairs
        self.di = di
    def __str__(self):
        return "[Test object_hook] : " + self.di.__str__()
def test():
    t = json.loads('{"123": 234}', object_hook=Test)
    print(t)  # [Test object_hook] : {'123': 234}
    print(type(t))
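One practical difference between the two hooks is that object_pairs_hook still sees duplicate keys, while object_hook only receives the dict in which later keys have already overwritten earlier ones. The following sketch illustrates this; reject_duplicates is a hypothetical helper, not part of the json module.

```python
import json


def reject_duplicates(pairs):
    """Build a dict from pairs, refusing objects with repeated keys."""
    keys = [key for key, _ in pairs]
    duplicates = {key for key in keys if keys.count(key) > 1}
    if duplicates:
        raise ValueError(f"duplicate keys: {sorted(duplicates)}")
    return dict(pairs)


doc = '{"a": 1, "a": 2}'

pairs_seen = json.loads(doc, object_pairs_hook=list)  # raw (key, value) list
plain = json.loads(doc)                               # last value wins

try:
    json.loads(doc, object_pairs_hook=reject_duplicates)
    rejected = False
except ValueError:
    rejected = True

# When both hooks are given, only object_pairs_hook is called.
winner = json.loads('{"a": 1}',
                    object_hook=lambda d: "object_hook",
                    object_pairs_hook=lambda p: "object_pairs_hook")
```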

The JSONArray function

By default, parse_array is the JSONArray function.

The following discusses JSONArray; its source code is on GitHub.

JSONArray -> [ ArrayElements ]

ArrayElements -> Elements | ε

Elements -> Element B

B -> , Element B | ε

Element -> JSON

First check whether the list is empty, using a regular expression to skip whitespace.

Then call scan_once repeatedly to obtain each JSON value.
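These two steps can be observed by calling JSONArray directly. This is a sketch that assumes the internal CPython signature JSONArray((string, index_after_bracket), scan_once); the inputs are illustrative.

```python
import json
from json.decoder import JSONArray
from json.scanner import py_make_scanner

scan_once = py_make_scanner(json.JSONDecoder())

# The index points just past the opening '[', as _scan_once passes it.
text = '[1, [2, 3], "x"]'
values, end = JSONArray((text, 1), scan_once)

# The empty-list shortcut, with whitespace between the brackets.
empty, empty_end = JSONArray(('[  ]', 1), scan_once)

print(values, end, empty, empty_end)
```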

Checking for null, true, and false

def _scan_once(string, idx):
    nextchar = string[idx]  # excerpt: only the literal branches are shown
    if nextchar == 'n' and string[idx:idx + 4] == 'null':
        return None, idx + 4
    elif nextchar == 't' and string[idx:idx + 4] == 'true':
        return True, idx + 4
    elif nextchar == 'f' and string[idx:idx + 5] == 'false':
        return False, idx + 5

Numbers are matched with a regular expression

NUMBER_RE = re.compile(
    r'(-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?',
    (re.VERBOSE | re.MULTILINE | re.DOTALL))

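groups() then returns the captured groups, and the value is converted to int or float depending on which groups matched. If the text does not match as a number, it is checked against NaN, Infinity, and -Infinity.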
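The effect of the three groups, and of the parse_float and parse_constant hooks, can be seen in a short sketch. NUMBER_RE is imported from json.scanner, which is an internal module; the inputs are made up.

```python
import json
from decimal import Decimal
from json.scanner import NUMBER_RE

# Groups are (integer, frac, exp); a missing part is None.
groups_int = NUMBER_RE.match('42').groups()
groups_float = NUMBER_RE.match('-12.5e+3').groups()

# No frac or exp group means parse_int; otherwise parse_float.
as_int = json.loads('42')
as_float = json.loads('42.0')
as_decimal = json.loads('0.1', parse_float=Decimal)

# parse_constant receives the literal name as a string.
constants = json.loads('[NaN, Infinity, -Infinity]', parse_constant=str)

print(groups_int, groups_float, as_int, as_float, as_decimal, constants)
```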