Experts teach you to write a simple JSON parser

Writing a JSON parser is one of the easiest ways to become familiar with parsing techniques. The format is very simple: it's defined recursively, so you get a slight challenge compared to parsing something like Brainfuck, and you're probably already working with JSON regularly. Aside from that last point, parsing S-expressions for Scheme might be an even simpler task.

Parsing is usually divided into two phases: lexical analysis and syntactic analysis. Lexical analysis breaks the source input down into the simplest decomposable elements of the language, called "tokens". Syntactic analysis (often itself called "parsing") takes the list of tokens and tries to find patterns in it that match the language being parsed.

Parsing does not determine the semantic viability of the input source. Semantic viability might include whether a variable is defined before it is used, whether a function is called with the correct arguments, or whether a variable can be declared a second time in some scope.

Of course, the way people choose to parse and apply semantic rules always varies, but I am assuming a "traditional" approach in order to explain the core concepts.

Interface to JSON library

Ultimately, there should be a from_string method that takes a JSON-encoded string and returns the equivalent Python dictionary.

For example:

assert_equal(from_string('{"foo": 1}'),
             {"foo": 1})

Lexical analysis

Lexical analysis breaks the input string into tokens. Comments and whitespace are often discarded during lexical analysis, so that syntactic analysis is left to search for grammatical matches in a simpler input.

Assuming a simple lexer, you iterate over all the characters in the input string and split it into tokens for the non-recursively defined language constructs such as number, string, and boolean literals. Strings in particular must be handled during lexical analysis, because whitespace cannot be discarded without knowing that it is not inside a string.

In a useful lexer, you keep track of the whitespace and comments you skipped, along with the current line number and file, so that any later stage of source analysis can refer to them in the errors it produces. The V8 JavaScript engine recently became able to reproduce the exact source code of a function; this requires, at minimum, help from the lexer to be possible.

Implementing a JSON lexer

The gist of the JSON lexer is to iterate over the input source and try to match patterns for strings, numbers, booleans, nulls, and JSON syntax such as opening and closing braces and brackets, ultimately returning each of these elements in a list of tokens.

Here's what the lexer should return for an example input:

assert_equal(lex('{"foo": [1, 2, {"bar": 2}]}'),
             ['{', 'foo', ':', '[', 1, ',', 2, ',', '{', 'bar', ':', 2, '}', ']', '}'])
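The snippets below lean on a handful of constants that this excerpt never defines. A minimal set of definitions (names assumed, following the conventions the code uses) might look like:

```python
# Assumed constant definitions referenced by the lexer and parser snippets.
JSON_QUOTE = '"'
JSON_WHITESPACE = [' ', '\t', '\b', '\n', '\r']
JSON_COMMA = ','
JSON_COLON = ':'
JSON_LEFTBRACKET = '['
JSON_RIGHTBRACKET = ']'
JSON_LEFTBRACE = '{'
JSON_RIGHTBRACE = '}'
JSON_SYNTAX = [JSON_COMMA, JSON_COLON, JSON_LEFTBRACKET, JSON_RIGHTBRACKET,
               JSON_LEFTBRACE, JSON_RIGHTBRACE]

# Lengths of the literal keywords, used when lexing booleans and nulls.
TRUE_LEN = len('true')    # 4
FALSE_LEN = len('false')  # 5
NULL_LEN = len('null')    # 4
```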

Here's what this logic might start to look like:

def lex(string):
    tokens = []

    while len(string):
        json_string, string = lex_string(string)
        if json_string is not None:
            tokens.append(json_string)
            continue

        # TODO: lex booleans, nulls, numbers

        if string[0] in JSON_WHITESPACE:
            string = string[1:]
        elif string[0] in JSON_SYNTAX:
            tokens.append(string[0])
            string = string[1:]
        else:
            raise Exception('Unexpected character: {}'.format(string[0]))

    return tokens

The goal here is to try to match strings, numbers, booleans, and nulls and add them to the list of tokens. If none of these match, check whether the character is whitespace and, if so, discard it. Otherwise, append it as a token if it is part of JSON syntax (like a left brace). Finally, raise an exception if the character doesn't match any of these patterns.

Let's extend the core logic here to support all types and add function stubs.

def lex_string(string):
    return None, string

def lex_number(string):
    return None, string

def lex_bool(string):
    return None, string

def lex_null(string):
    return None, string

def lex(string):
    tokens = []

    while len(string):
        json_string, string = lex_string(string)
        if json_string is not None:
            tokens.append(json_string)
            continue

        json_number, string = lex_number(string)
        if json_number is not None:
            tokens.append(json_number)
            continue

        json_bool, string = lex_bool(string)
        if json_bool is not None:
            tokens.append(json_bool)
            continue

        json_null, string = lex_null(string)
        if json_null is not None:
            # lex_null returns True as a "found it" flag; the token itself is None.
            tokens.append(None)
            continue

        if string[0] in JSON_WHITESPACE:
            string = string[1:]
        elif string[0] in JSON_SYNTAX:
            tokens.append(string[0])
            string = string[1:]
        else:
            raise Exception('Unexpected character: {}'.format(string[0]))

    return tokens

Lexing strings

For the lex_string function, the gist is to check whether the first character is a quote. If it is, iterate over the input string until you find the closing quote. If you don't find an initial quote, return None and the original input. If you find both the initial quote and the closing quote, return the string within the quotes along with the rest of the unchecked input.

def lex_string(string):
    json_string = ''

    if string[0] == JSON_QUOTE:
        string = string[1:]
    else:
        return None, string

    for c in string:
        if c == JSON_QUOTE:
            return json_string, string[len(json_string)+1:]
        else:
            json_string += c

    raise Exception('Expected end-of-string quote')
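To see the slicing at the closing quote in action, here is a standalone check (repeating the definition, with JSON_QUOTE assumed to be the double-quote character, since the rest of the lexer isn't needed for it):

```python
JSON_QUOTE = '"'

def lex_string(string):
    # Returns (parsed_string, rest_of_input), or (None, input) if no leading quote.
    json_string = ''
    if string[0] == JSON_QUOTE:
        string = string[1:]
    else:
        return None, string
    for c in string:
        if c == JSON_QUOTE:
            return json_string, string[len(json_string)+1:]
        json_string += c
    raise Exception('Expected end-of-string quote')

print(lex_string('"foo": 1'))  # ('foo', ': 1')
print(lex_string('[1, 2]'))    # (None, '[1, 2]')
```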

For the lex_number function, the gist is to iterate over the input until you find a character that cannot be part of a number. (This is, of course, a gross simplification; being more accurate is left as an exercise to the reader.) After finding such a character, return a float or int if you accumulated more than zero characters. Otherwise, return None and the original string input.

def lex_number(string):
    json_number = ''

    number_characters = [str(d) for d in range(0, 10)] + ['-', 'e', '.']

    for c in string:
        if c in number_characters:
            json_number += c
        else:
            break

    rest = string[len(json_number):]

    if not len(json_number):
        return None, string

    if '.' in json_number:
        return float(json_number), rest

    return int(json_number), rest
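A quick standalone check of the number matcher (repeating the definition, since it depends on nothing else):

```python
def lex_number(string):
    # Accumulate characters that could appear in a number, then convert.
    json_number = ''
    number_characters = [str(d) for d in range(0, 10)] + ['-', 'e', '.']
    for c in string:
        if c in number_characters:
            json_number += c
        else:
            break
    rest = string[len(json_number):]
    if not len(json_number):
        return None, string
    if '.' in json_number:
        return float(json_number), rest
    return int(json_number), rest

print(lex_number('123, "abc"'))  # (123, ', "abc"')
print(lex_number('1.5]'))        # (1.5, ']')
print(lex_number('true'))        # (None, 'true')
```

Note that because '-', 'e', and '.' are accepted in any position, a malformed input like '1.2.3' reaches float() and raises ValueError; stricter validation is the exercise the text mentions.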

Lexing booleans and nulls

Finding boolean and null values is a very simple string match.

def lex_bool(string):
    string_len = len(string)

    if string_len >= TRUE_LEN and \
       string[:TRUE_LEN] == 'true':
        return True, string[TRUE_LEN:]
    elif string_len >= FALSE_LEN and \
         string[:FALSE_LEN] == 'false':
        return False, string[FALSE_LEN:]

    return None, string

def lex_null(string):
    string_len = len(string)

    if string_len >= NULL_LEN and \
       string[:NULL_LEN] == 'null':
        # True here is only a "found it" flag; lex() appends None as the token.
        return True, string[NULL_LEN:]

    return None, string
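A quick standalone check of both matchers (condensed slightly: slicing past the end of a Python string is safe, so the explicit length guards are optional):

```python
TRUE_LEN = len('true')
FALSE_LEN = len('false')
NULL_LEN = len('null')

def lex_bool(string):
    # Returns (True/False, rest) on a match, or (None, input) otherwise.
    if string[:TRUE_LEN] == 'true':
        return True, string[TRUE_LEN:]
    elif string[:FALSE_LEN] == 'false':
        return False, string[FALSE_LEN:]
    return None, string

def lex_null(string):
    # True here is only a "found it" flag; the caller appends None as the token.
    if string[:NULL_LEN] == 'null':
        return True, string[NULL_LEN:]
    return None, string

print(lex_bool('true, 1'))  # (True, ', 1')
print(lex_bool('[1]'))      # (None, '[1]')
print(lex_null('null}'))    # (True, '}')
```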

The lexer code is now complete!

Syntactic analysis

The job of a (basic) syntactic analyzer is to iterate over a one-dimensional list of tokens and match groups of tokens to pieces of the language, according to the language's definition. If at any point during syntactic analysis the parser cannot match the current set of tokens to valid grammar for the language, it will fail, ideally with useful information about what you provided, where, and what it expected.

Implementing a JSON parser

The gist of the JSON parser is to iterate over the tokens received from a call to lex and try to match them to objects, lists, or plain values.

Here's what the parser should return for an example input:

tokens = lex('{"foo": [1, 2, {"bar": 2}]}')

assert_equal(tokens,
             ['{', 'foo', ':', '[', 1, ',', 2, ',', '{', 'bar', ':', 2, '}', ']', '}'])

assert_equal(parse(tokens)[0],
             {'foo': [1, 2, {'bar': 2}]})

Here's what this logic might start to look like:

def parse_array(tokens):
    return [], tokens

def parse_object(tokens):
    return {}, tokens

def parse(tokens):
    t = tokens[0]

    if t == JSON_LEFTBRACKET:
        return parse_array(tokens[1:])
    elif t == JSON_LEFTBRACE:
        return parse_object(tokens[1:])
    else:
        return t, tokens[1:]

One key structural difference between this lexer and the parser is that the lexer returns a one-dimensional array of tokens. Parsers are often defined recursively and return recursive, tree-like objects. Since JSON is a data serialization format rather than a language, the parser should produce Python objects rather than a syntax tree on which you could perform more analysis (or code generation, in the case of a compiler).

And the benefit of having lexical analysis happen independently of the parser is that both pieces of code are simpler and concerned only with specific elements.

Parsing arrays

Parsing arrays is a matter of parsing array members and expecting a comma token between them, or a closing square bracket indicating the end of the array.

def parse_array(tokens):
    json_array = []

    t = tokens[0]
    if t == JSON_RIGHTBRACKET:
        return json_array, tokens[1:]

    while True:
        json, tokens = parse(tokens)
        json_array.append(json)

        t = tokens[0]
        if t == JSON_RIGHTBRACKET:
            return json_array, tokens[1:]
        elif t != JSON_COMMA:
            raise Exception('Expected comma after object in array')
        else:
            tokens = tokens[1:]

Parsing objects

Parsing objects is a matter of parsing key-value pairs, separated internally by a colon and externally by a comma, until you reach the end of the object.

def parse_object(tokens):
    json_object = {}

    t = tokens[0]
    if t == JSON_RIGHTBRACE:
        return json_object, tokens[1:]

    while True:
        json_key = tokens[0]
        if type(json_key) is str:
            tokens = tokens[1:]
        else:
            raise Exception('Expected string key, got: {}'.format(json_key))

        if tokens[0] != JSON_COLON:
            raise Exception('Expected colon after key in object, got: {}'.format(tokens[0]))

        json_value, tokens = parse(tokens[1:])
        json_object[json_key] = json_value

        t = tokens[0]
        if t == JSON_RIGHTBRACE:
            return json_object, tokens[1:]
        elif t != JSON_COMMA:
            raise Exception('Expected comma after pair in object, got: {}'.format(t))

        tokens = tokens[1:]

The parser code is now complete! Check out pj/parser.py for the code as a whole. To provide the ideal interface, create a from_string function that wraps the lex and parse functions.

def from_string(string):
    tokens = lex(string)

    return parse(tokens)[0]
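As a sanity check, here is a condensed, self-contained sketch of the whole pipeline described above, run against the example from the start. The keyword literals are lexed inline rather than through separate lex_bool/lex_null helpers, and the syntax constants are inlined; otherwise the logic follows the article:

```python
JSON_QUOTE = '"'
JSON_WHITESPACE = [' ', '\t', '\b', '\n', '\r']
JSON_SYNTAX = ['{', '}', '[', ']', ',', ':']

def lex_string(string):
    if string[0] != JSON_QUOTE:
        return None, string
    string = string[1:]
    json_string = ''
    for c in string:
        if c == JSON_QUOTE:
            return json_string, string[len(json_string) + 1:]
        json_string += c
    raise Exception('Expected end-of-string quote')

def lex_number(string):
    json_number = ''
    for c in string:
        if c in '0123456789-e.':
            json_number += c
        else:
            break
    if not json_number:
        return None, string
    rest = string[len(json_number):]
    return (float(json_number), rest) if '.' in json_number else (int(json_number), rest)

def lex(string):
    tokens = []
    while string:
        json_string, string = lex_string(string)
        if json_string is not None:
            tokens.append(json_string)
            continue
        json_number, string = lex_number(string)
        if json_number is not None:
            tokens.append(json_number)
            continue
        # Keyword literals: the token for 'null' is Python's None.
        for literal, value in (('true', True), ('false', False), ('null', None)):
            if string[:len(literal)] == literal:
                tokens.append(value)
                string = string[len(literal):]
                break
        else:
            if string[0] in JSON_WHITESPACE:
                string = string[1:]
            elif string[0] in JSON_SYNTAX:
                tokens.append(string[0])
                string = string[1:]
            else:
                raise Exception('Unexpected character: {}'.format(string[0]))
    return tokens

def parse_array(tokens):
    json_array = []
    if tokens[0] == ']':
        return json_array, tokens[1:]
    while True:
        json, tokens = parse(tokens)
        json_array.append(json)
        if tokens[0] == ']':
            return json_array, tokens[1:]
        if tokens[0] != ',':
            raise Exception('Expected comma after object in array')
        tokens = tokens[1:]

def parse_object(tokens):
    json_object = {}
    if tokens[0] == '}':
        return json_object, tokens[1:]
    while True:
        json_key = tokens[0]
        if type(json_key) is not str:
            raise Exception('Expected string key, got: {}'.format(json_key))
        if tokens[1] != ':':
            raise Exception('Expected colon after key in object')
        json_value, tokens = parse(tokens[2:])
        json_object[json_key] = json_value
        if tokens[0] == '}':
            return json_object, tokens[1:]
        if tokens[0] != ',':
            raise Exception('Expected comma after pair in object')
        tokens = tokens[1:]

def parse(tokens):
    t = tokens[0]
    if t == '[':
        return parse_array(tokens[1:])
    if t == '{':
        return parse_object(tokens[1:])
    return t, tokens[1:]

def from_string(string):
    return parse(lex(string))[0]

print(from_string('{"foo": [1, 2, {"bar": 2}]}'))  # {'foo': [1, 2, {'bar': 2}]}
```

One caveat of this flat token representation: a lexed string value that happens to look like syntax (say, the value "[") is indistinguishable from a syntax token, which is one reason real lexers tag tokens with a type rather than using bare values.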


All done! Article source: 黑客周刊 (Hacker Weekly).

