Teaching you how to implement a JSON parser!

1. Background

JSON (JavaScript Object Notation) is a lightweight data interchange format. Compared with XML, another common data exchange format, JSON has several advantages: for example, it is easier to read, takes up less space, and so on.

In web application development, thanks to JavaScript's excellent support for JSON, JSON is far more popular with developers than XML. So as a developer, if you are interested, it is well worth digging deeper into JSON-related knowledge.

In the spirit of exploring how JSON works, this article will walk in detail through the parsing process and implementation details of a simple JSON parser.

Since JSON itself is fairly simple, parsing it is not complicated either. So if you are interested, I encourage you to implement a JSON parser yourself after reading this article. Without further ado, let's get to the main content.

2. JSON parser implementation principle

A JSON parser is essentially a state machine built from the JSON grammar rules: the input is a JSON string, and the output is a JSON object. In general, the parsing process consists of two stages: lexical analysis and syntactic analysis.

The goal of the lexical analysis stage is to break the JSON string into a stream of Tokens according to the tokenization rules. For example, given the following JSON string:

{
    "name" : "小明",
    "age": 18
}

After lexical analysis, we obtain the following group of Tokens:

{、 name、 :、 小明、 ,、 age、 :、 18、 }

Once lexical analysis has produced the Token sequence, the next step is syntactic analysis. The goal of syntactic analysis is to check, against the JSON grammar, whether the Token sequence produced above forms legal JSON.

For example, the JSON grammar requires that a non-empty JSON object take the form of key-value pairs, as in object = {string : value}. If you pass in a malformed string such as:

{
    "name", "小明"
}

then in the syntactic analysis phase, after analyzing the Token name, the parser finds that it is a legal Token and treats it as a key.

Next, the parser reads the next Token, expecting it to be :. But when it reads that Token, it finds a , instead of the expected :, so the parser reports a syntax error.

To briefly summarize the two stages: lexical analysis breaks the string into a sequence of Tokens, and syntactic analysis checks that the resulting Token sequence forms legal JSON. Now that we have a rough picture of the JSON parsing process, let's analyze each stage in detail.

2.1 Lexical analysis

At the start of this chapter, I said that the purpose of lexical analysis is to parse a JSON string into a Token stream according to the "tokenization rules". Note the quotation marks: the so-called tokenization rules are the rules that the lexical analysis module consults when breaking the string into Tokens.

In JSON, the tokenization rules correspond to several data types. When the lexer reads a word and that word matches one of the data types defined by JSON, the lexer considers the word to conform to the tokenization rules and turns it into a corresponding Token.

Here we can refer to the definition of JSON at http://www.json.org/ and list the data types defined in JSON:

  • BEGIN_OBJECT ({)

  • END_OBJECT (})

  • BEGIN_ARRAY ([)

  • END_ARRAY (])

  • NULL (null)

  • NUMBER (number)

  • STRING (string)

  • BOOLEAN (true/false)

  • SEP_COLON (:)

  • SEP_COMMA (,)

Whenever the lexer reads a word of one of the above types, it can parse it into a Token. We can define an enum class to represent these data types, as follows:

public enum TokenType {
    BEGIN_OBJECT(1),
    END_OBJECT(2),
    BEGIN_ARRAY(4),
    END_ARRAY(8),
    NULL(16),
    NUMBER(32),
    STRING(64),
    BOOLEAN(128),
    SEP_COLON(256),
    SEP_COMMA(512),
    END_DOCUMENT(1024);

    TokenType(int code) {
        this.code = code;
    }

    private int code;

    public int getTokenCode() {
        return code;
    }
}
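Note that the codes assigned in TokenType are powers of two. This is deliberate: it lets the parser later OR several expected types into one int mask and test membership with a bitwise AND, as the checkExpectToken method does in section 2.2. Here is a minimal standalone sketch of the idea (the class and method names are my own, for illustration only):

```java
// Why the enum codes are powers of two: several expectations can be
// combined into a single int bitmask and tested with a bitwise AND.
public class TokenCodeDemo {
    enum TokenType {
        BEGIN_OBJECT(1), END_OBJECT(2), STRING(64), SEP_COLON(256);

        private final int code;
        TokenType(int code) { this.code = code; }
        int getTokenCode() { return code; }
    }

    static boolean matches(TokenType actual, int expected) {
        return (actual.getTokenCode() & expected) != 0;
    }

    public static void main(String[] args) {
        // After '{', the parser expects a STRING key or an END_OBJECT.
        int expected = TokenType.STRING.getTokenCode() | TokenType.END_OBJECT.getTokenCode();
        System.out.println(matches(TokenType.STRING, expected));     // true
        System.out.println(matches(TokenType.SEP_COLON, expected));  // false
    }
}
```

With distinct bits, a single int can encode any set of expected Token types, which is exactly how the parser tracks "what may come next" later on.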

During parsing, the TokenType alone is not enough. Besides a word's type, we also want to save its literal value. So we also need to define a Token class to wrap a word's type and literal together, as follows:

public class Token {
    private TokenType tokenType;
    private String value;
    // constructor and getters omitted
}

With the Token class defined, next let's look at the class that reads characters from the input. It is as follows:

public class CharReader {
    private static final int BUFFER_SIZE = 1024;

    private Reader reader;
    private char[] buffer;
    private int pos;
    private int size;

    public CharReader(Reader reader) {
        this.reader = reader;
        buffer = new char[BUFFER_SIZE];
    }

    /**
     * Returns the most recently read character without advancing the position
     * @return 
     * @throws IOException
     */
    public char peek() throws IOException {
        if (pos - 1 >= size) {
            return (char) -1;
        }

        return buffer[Math.max(0, pos - 1)];
    }

    /**
     * Returns the character at index pos, then increments pos
     * @return 
     * @throws IOException
     */
    public char next() throws IOException {
        if (!hasMore()) {
            return (char) -1;
        }

        return buffer[pos++];
    }

    public void back() {
        pos = Math.max(0, --pos);
    }

    public boolean hasMore() throws IOException {
        if (pos < size) {
            return true;
        }

        fillBuffer();
        return pos < size;
    }

    void fillBuffer() throws IOException {
        int n = reader.read(buffer);
        if (n == -1) {
            return;
        }

        pos = 0;
        size = n;
    }
}
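To make the behavior of CharReader concrete, here is a small usage sketch. The class body is repeated inside the demo so the example compiles on its own (the field declarations and a BUFFER_SIZE of 1024 are assumptions, since the original snippet omits them):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Usage sketch: next() consumes a character, peek() re-reads the
// character just returned, and back() pushes one character back.
public class CharReaderDemo {
    static class CharReader {
        private static final int BUFFER_SIZE = 1024;
        private final Reader reader;
        private final char[] buffer = new char[BUFFER_SIZE];
        private int pos;
        private int size;

        CharReader(Reader reader) { this.reader = reader; }

        char peek() throws IOException {
            if (pos - 1 >= size) return (char) -1;
            return buffer[Math.max(0, pos - 1)];
        }

        char next() throws IOException {
            if (!hasMore()) return (char) -1;
            return buffer[pos++];
        }

        void back() { pos = Math.max(0, --pos); }

        boolean hasMore() throws IOException {
            if (pos < size) return true;
            fillBuffer();
            return pos < size;
        }

        void fillBuffer() throws IOException {
            int n = reader.read(buffer);
            if (n == -1) return;
            pos = 0;
            size = n;
        }
    }

    static String demo(String input) {
        try {
            CharReader cr = new CharReader(new StringReader(input));
            char first = cr.next();   // consume '{'
            char again = cr.peek();   // still '{': peek does not advance
            cr.back();                // push '{' back
            char retry = cr.next();   // '{' once more
            char second = cr.next();  // 'a'
            return "" + first + again + retry + second;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("{a")); // prints {{{a
    }
}
```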

With the three helper classes TokenType, Token, and CharReader in place, we can now implement the lexer itself.

public class Tokenizer {
    private CharReader charReader;
    private TokenList tokens;

    public TokenList tokenize(CharReader charReader) throws IOException {
        this.charReader = charReader;
        tokens = new TokenList();
        tokenize();

        return tokens;
    }

    private void tokenize() throws IOException {
        // use do-while so that empty input is handled correctly
        Token token;
        do {
            token = start();
            tokens.add(token);
        } while (token.getTokenType() != TokenType.END_DOCUMENT);
    }

    private Token start() throws IOException {
        char ch;
        for(;;) {
            if (!charReader.hasMore()) {
                return new Token(TokenType.END_DOCUMENT, null);
            }

            ch = charReader.next();
            if (!isWhiteSpace(ch)) {
                break;
            }
        }

        switch (ch) {
            case '{':
                return new Token(TokenType.BEGIN_OBJECT, String.valueOf(ch));
            case '}':
                return new Token(TokenType.END_OBJECT, String.valueOf(ch));
            case '[':
                return new Token(TokenType.BEGIN_ARRAY, String.valueOf(ch));
            case ']':
                return new Token(TokenType.END_ARRAY, String.valueOf(ch));
            case ',':
                return new Token(TokenType.SEP_COMMA, String.valueOf(ch));
            case ':':
                return new Token(TokenType.SEP_COLON, String.valueOf(ch));
            case 'n':
                return readNull();
            case 't':
            case 'f':
                return readBoolean();
            case '"':
                return readString();
            case '-':
                return readNumber();
        }

        if (isDigit(ch)) {
            return readNumber();
        }

        throw new JsonParseException("Illegal character");
    }

    private Token readNull() throws IOException {...}
    private Token readBoolean() throws IOException {...}
    private Token readString() throws IOException {...}
    private Token readNumber() throws IOException {...}
}

The code above implements the lexer. Part of the code is not shown here; it will be posted when it is analyzed later.

Let's look at start, the lexer's core method. It is short and not complicated: an infinite loop reads characters until a non-whitespace character is found, and then different parsing logic is executed depending on that character's type.

As mentioned above, parsing JSON is relatively simple. The reason is that, during lexing, the Token type of each word can be determined from its first character alone. For example:

  • If the first character is {, }, [, ], , or :, it is wrapped directly into the corresponding Token and returned

  • If the first character is n, the expected word is null, and the Token type is NULL

  • If the first character is t or f, the expected word is true or false, and the Token type is BOOLEAN

  • If the first character is ", the expected word is a string, and the Token type is STRING

  • If the first character is 0~9 or -, the expected word is a number, and the Token type is NUMBER

As described above, from just the first character of each word, the lexer knows what it expects to read next. If the rest of the input meets that expectation, it returns a Token; otherwise it reports an error.
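The dispatch described above can be condensed into a small standalone sketch; this helper is purely illustrative and is not part of the parser's actual code:

```java
// Condensed sketch of the first-character dispatch: the Token type a
// word will have is fully determined by its first character.
public class FirstCharDemo {
    static String tokenTypeFor(char ch) {
        switch (ch) {
            case '{': return "BEGIN_OBJECT";
            case '}': return "END_OBJECT";
            case '[': return "BEGIN_ARRAY";
            case ']': return "END_ARRAY";
            case ',': return "SEP_COMMA";
            case ':': return "SEP_COLON";
            case 'n': return "NULL";      // expects "null" to follow
            case 't':
            case 'f': return "BOOLEAN";   // expects "true" / "false"
            case '"': return "STRING";
            case '-': return "NUMBER";
        }
        if (ch >= '0' && ch <= '9') return "NUMBER";
        return "ILLEGAL";
    }

    public static void main(String[] args) {
        System.out.println(tokenTypeFor('{'));  // BEGIN_OBJECT
        System.out.println(tokenTypeFor('n'));  // NULL
        System.out.println(tokenTypeFor('7'));  // NUMBER
        System.out.println(tokenTypeFor('x'));  // ILLEGAL
    }
}
```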

Let's now look at what the lexer does when it encounters the character n:

private Token readNull() throws IOException {
    if (!(charReader.next() == 'u' && charReader.next() == 'l' && charReader.next() == 'l')) {
        throw new JsonParseException("Invalid json string");
    }

    return new Token(TokenType.NULL, "null");
}

The code above is very simple: after reading the character n, the lexer expects the next three characters to be u, l, l, forming the word null together with n. If that expectation is met, a Token of type NULL is returned; otherwise an exception is thrown. The logic of readNull is simple, so there is not much more to say.

Next, let's take a look at how string data is handled:

private Token readString() throws IOException {
    StringBuilder sb = new StringBuilder();
    for (;;) {
        char ch = charReader.next();
        // handle escape characters
        if (ch == '\\') {
            if (!isEscape()) {
                throw new JsonParseException("Invalid escape character");
            }
            sb.append('\\');
            ch = charReader.peek();
            sb.append(ch);
            // handle Unicode escapes of the form \u4e2d; only \u0000 ~ \uFFFF is supported
            if (ch == 'u') {
                for (int i = 0; i < 4; i++) {
                    ch = charReader.next();
                    if (isHex(ch)) {
                        sb.append(ch);
                    } else {
                        throw new JsonParseException("Invalid character");
                    }
                }
            }
        } else if (ch == '"') { // another double quote ends the string; return the Token
            return new Token(TokenType.STRING, sb.toString());
        } else if (ch == '\r' || ch == '\n') { // raw line breaks are not allowed inside a JSON string
            throw new JsonParseException("Invalid character");
        } else {
            sb.append(ch);
        }
    }
}

private boolean isEscape() throws IOException {
    char ch = charReader.next();
    return (ch == '"' || ch == '\\' || ch == 'u' || ch == 'r'
                || ch == 'n' || ch == 'b' || ch == 't' || ch == 'f');
}

private boolean isHex(char ch) {
    return ((ch >= '0' && ch <= '9') || ('a' <= ch && ch <= 'f')
            || ('A' <= ch && ch <= 'F'));
}

Parsing string data is slightly more complicated, mainly because some special characters must be handled. The special characters allowed in JSON are as follows:

\"
\\
\b
\f
\n
\r
\t
\u four-hex-digits
\/

Of these, only the special character \/ is left unprocessed by the code; all the others are checked, with the checking logic in the isEscape method. A string in the input JSON may only contain the escape characters listed above. If it contains an invalid escape character, parsing will report an error.

A STRING-type word begins with the character " and ends with another ". So during parsing, when readString encounters a second ", it considers the string finished and returns a Token of the appropriate type.

The above covers the parsing of null and string data. Neither process is complicated, and they should not be hard to understand. As for parsing boolean and number data, if you are interested you can check the source code; I won't go over it here.
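For readers who don't want to consult the source, here is one possible sketch of the omitted readBoolean logic. To keep the example self-contained it reads from a plain String rather than the CharReader used in the real tokenizer, so treat it as an illustration, not the project's actual implementation:

```java
// One possible sketch of readBoolean. pos points at the character
// *after* the initial 't' or 'f', mirroring how the tokenizer's
// start() method has already consumed the first character.
public class ReadBooleanDemo {
    static String readBoolean(String input, int pos, char first) {
        String rest = (first == 't') ? "rue" : "alse";
        for (int i = 0; i < rest.length(); i++) {
            if (pos + i >= input.length() || input.charAt(pos + i) != rest.charAt(i)) {
                throw new RuntimeException("Invalid json string");
            }
        }
        return (first == 't') ? "true" : "false";
    }

    public static void main(String[] args) {
        System.out.println(readBoolean("true,", 1, 't'));   // true
        System.out.println(readBoolean("false}", 1, 'f'));  // false
    }
}
```

The readNumber logic follows the same pattern: keep consuming characters while they look like part of a number, and throw if the result is malformed.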


2.2 Syntactic analysis

When lexical analysis finishes without throwing an error, we can move on to syntactic analysis. Syntactic analysis takes the Token sequence produced by the lexical analysis stage as input, and outputs a JSON Object or JSON Array.

The parser is implemented according to the following grammar:

object = {} | { members }
members = pair | pair , members
pair = string : value
array = [] | [ elements ]
elements = value | value , elements
value = string | number | object | array | true | false | null

Implementing the parser requires two auxiliary classes, JsonObject and JsonArray, which are the parser's output types.

The code is as follows:

public class JsonObject {

    private Map<String, Object> map = new HashMap<String, Object>();

    public void put(String key, Object value) {
        map.put(key, value);
    }

    public Object get(String key) {
        return map.get(key);
    }

    public List<Map.Entry<String, Object>> getAllKeyValue() {
        return new ArrayList<>(map.entrySet());
    }

    public JsonObject getJsonObject(String key) {
        if (!map.containsKey(key)) {
            throw new IllegalArgumentException("Invalid key");
        }

        Object obj = map.get(key);
        if (!(obj instanceof JsonObject)) {
            throw new JsonTypeException("Type of value is not JsonObject");
        }

        return (JsonObject) obj;
    }

    public JsonArray getJsonArray(String key) {
        if (!map.containsKey(key)) {
            throw new IllegalArgumentException("Invalid key");
        }

        Object obj = map.get(key);
        if (!(obj instanceof JsonArray)) {
            throw new JsonTypeException("Type of value is not JsonArray");
        }

        return (JsonArray) obj;
    }

    @Override
    public String toString() {
        return BeautifyJsonUtils.beautify(this);
    }
}

public class JsonArray implements Iterable<Object> {

    private List<Object> list = new ArrayList<Object>();

    public void add(Object obj) {
        list.add(obj);
    }

    public Object get(int index) {
        return list.get(index);
    }

    public int size() {
        return list.size();
    }

    public JsonObject getJsonObject(int index) {
        Object obj = list.get(index);
        if (!(obj instanceof JsonObject)) {
            throw new JsonTypeException("Type of value is not JsonObject");
        }

        return (JsonObject) obj;
    }

    public JsonArray getJsonArray(int index) {
        Object obj = list.get(index);
        if (!(obj instanceof JsonArray)) {
            throw new JsonTypeException("Type of value is not JsonArray");
        }

        return (JsonArray) obj;
    }

    @Override
    public String toString() {
        return BeautifyJsonUtils.beautify(this);
    }

    public Iterator<Object> iterator() {
        return list.iterator();
    }
}

The parser's core logic is encapsulated in two methods, parseJsonObject and parseJsonArray. Below I analyze parseJsonObject in detail; you can analyze parseJsonArray on your own.

The parseJsonObject method is implemented as follows:

private JsonObject parseJsonObject() {
    JsonObject jsonObject = new JsonObject();
    int expectToken = STRING_TOKEN | END_OBJECT_TOKEN;
    String key = null;
    Object value = null;
    while (tokens.hasMore()) {
        Token token = tokens.next();
        TokenType tokenType = token.getTokenType();
        String tokenValue = token.getValue();
        switch (tokenType) {
        case BEGIN_OBJECT:
            checkExpectToken(tokenType, expectToken);
            jsonObject.put(key, parseJsonObject()); // recursively parse the nested json object
            expectToken = SEP_COMMA_TOKEN | END_OBJECT_TOKEN;
            break;
        case END_OBJECT:
            checkExpectToken(tokenType, expectToken);
            return jsonObject;
        case BEGIN_ARRAY: // parse a json array
            checkExpectToken(tokenType, expectToken);
            jsonObject.put(key, parseJsonArray());
            expectToken = SEP_COMMA_TOKEN | END_OBJECT_TOKEN;
            break;
        case NULL:
            checkExpectToken(tokenType, expectToken);
            jsonObject.put(key, null);
            expectToken = SEP_COMMA_TOKEN | END_OBJECT_TOKEN;
            break;
        case NUMBER:
            checkExpectToken(tokenType, expectToken);
            if (tokenValue.contains(".") || tokenValue.contains("e") || tokenValue.contains("E")) {
                jsonObject.put(key, Double.valueOf(tokenValue));
            } else {
                Long num = Long.valueOf(tokenValue);
                if (num > Integer.MAX_VALUE || num < Integer.MIN_VALUE) {
                    jsonObject.put(key, num);
                } else {
                    jsonObject.put(key, num.intValue());
                }
            }
            expectToken = SEP_COMMA_TOKEN | END_OBJECT_TOKEN;
            break;
        case BOOLEAN:
            checkExpectToken(tokenType, expectToken);
            jsonObject.put(key, Boolean.valueOf(token.getValue()));
            expectToken = SEP_COMMA_TOKEN | END_OBJECT_TOKEN;
            break;
        case STRING:
            checkExpectToken(tokenType, expectToken);
            Token preToken = tokens.peekPrevious();
            /*
             * In JSON, a string can serve either as a key or as a value.
             * As a key, the only expected next Token type is SEP_COLON.
             * As a value, the expected next Token type is SEP_COMMA or END_OBJECT.
             */
            if (preToken.getTokenType() == TokenType.SEP_COLON) {
                value = token.getValue();
                jsonObject.put(key, value);
                expectToken = SEP_COMMA_TOKEN | END_OBJECT_TOKEN;
            } else {
                key = token.getValue();
                expectToken = SEP_COLON_TOKEN;
            }
            break;
        case SEP_COLON:
            checkExpectToken(tokenType, expectToken);
            expectToken = NULL_TOKEN | NUMBER_TOKEN | BOOLEAN_TOKEN | STRING_TOKEN
                    | BEGIN_OBJECT_TOKEN | BEGIN_ARRAY_TOKEN;
            break;
        case SEP_COMMA:
            checkExpectToken(tokenType, expectToken);
            expectToken = STRING_TOKEN;
            break;
        case END_DOCUMENT:
            checkExpectToken(tokenType, expectToken);
            return jsonObject;
        default:
            throw new JsonParseException("Unexpected Token.");
        }
    }

    throw new JsonParseException("Parse error, invalid Token.");
}

private void checkExpectToken(TokenType tokenType, int expectToken) {
    if ((tokenType.getTokenCode() & expectToken) == 0) {
        throw new JsonParseException("Parse error, invalid Token.");
    }
}

The parsing procedure of parseJsonObject is as follows:

  1. Read a Token and check whether its type is one of the expected types

  2. If so, update the expected Token types. Otherwise, throw an exception and exit

  3. Repeat steps 1 and 2 until all Tokens have been parsed, or an exception occurs

The steps above are not complicated, but some parts may be hard to follow, so here is an example. Suppose we have the following Token sequence:

{、 id、 :、 1、 }

After parseJsonObject has parsed the { Token, it expects the next Token to be of type STRING or END_OBJECT. It then reads a new Token and finds that its type is STRING, which meets the expectation.

So parseJsonObject updates the expected Token type to SEP_COLON, i.e. :. The loop continues like this until the Token sequence has been fully parsed, or the process exits by throwing an exception.

Although the parsing process above is not very complicated, there are still some details worth noting in the concrete implementation. For example:

In JSON, a string can serve either as a key or as a value. As a key, the parser expects the next Token to be of type SEP_COLON. As a value, it expects the next Token to be of type SEP_COMMA or END_OBJECT.

So the parser must determine whether a given string is a key or a value. The determination method is fairly simple: check the type of the previous Token. If the previous Token is SEP_COLON, i.e. :, the string here can only be a value. Otherwise, it can only be a key.

For Tokens of integer type, the handling could be kept simple by parsing them all into the Long type. However, out of consideration for the space they occupy, it is more appropriate to parse integers in the range [Integer.MIN_VALUE, Integer.MAX_VALUE] into Integer, so the parsing code needs to take care of this as well.
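The narrowing rule just described can be sketched as a small standalone method (the class and method names are mine, for illustration):

```java
// Sketch of the number-narrowing rule: values containing '.', 'e' or
// 'E' become Double; whole numbers become Integer when they fit in
// the int range, otherwise Long.
public class NumberNarrowingDemo {
    static Object parseNumber(String tokenValue) {
        if (tokenValue.contains(".") || tokenValue.contains("e") || tokenValue.contains("E")) {
            return Double.valueOf(tokenValue);
        }
        long num = Long.parseLong(tokenValue);
        if (num > Integer.MAX_VALUE || num < Integer.MIN_VALUE) {
            return num;        // autoboxed as Long
        }
        return (int) num;      // autoboxed as Integer
    }

    public static void main(String[] args) {
        System.out.println(parseNumber("18").getClass().getSimpleName());          // Integer
        System.out.println(parseNumber("3000000000").getClass().getSimpleName());  // Long
        System.out.println(parseNumber("3.14").getClass().getSimpleName());        // Double
    }
}
```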

3. Testing and results

To verify the correctness of the code, I ran a simple test on it. The test data comes from NetEase Cloud Music and is about 45,000 characters long. To avoid test failures caused by the data changing, I did not download it anew for each run.

Instead, I saved the downloaded data in a file named music.json, and each subsequent test reads the data from that file.

As for the test code itself, I won't post it or screenshot it here. If you are interested, you can download the source code and play with it yourself.

With testing covered, let's look at the JSON beautification output. Here I made up some mock data: the profile of Di Renjie, a hero from Honor of Kings (yes, a hero I often play).

I won't explain the JSON beautification code here; it is not the focus of this article, just a little bonus.

4. Final words

This article is coming to an end here. The code for this article has been put on GitHub; if you need it, feel free to download it: https://github.com/code4wt/JSONParser.

One thing should be stated: the code accompanying this article implements a relatively simple JSON parser, whose purpose is to explore the principles of JSON parsing. JSONParser can only be considered a practice project; its implementation is not elegant, and it lacks adequate testing.

At the same time, my abilities are limited (my grasp of basic compiler principles is negligible), so I cannot guarantee that this article and the accompanying code are free of mistakes. If you find errors or poorly written parts while reading the code, please point them out and I will fix them. If these errors cause you any trouble, I apologize in advance.

Finally, the writing and implementation mainly drew on the two articles "Let's write a JSON parser together" and "How to write a JSON parser", along with their accompanying code. My thanks to the authors of both posts. That's all; I wish you all a happy life. Goodbye!

Author: Tian Xiaobo

www.cnblogs.com/nullllun/p/8358146.html

References

Let's write a JSON parser together
http://www.cnblogs.com/absfree/p/5502705.html

How to write a JSON parser
https://www.liaoxuefeng.com/article/994977272296736

Introducing JSON
http://json.org/json-zh.html

写一个 JSON、XML 或 YAML 的 Parser 的思路是什么? (What is the approach to writing a parser for JSON, XML, or YAML?)
www.zhihu.com/question/24640264/answer/80500016



Origin blog.csdn.net/zl1zl2zl3/article/details/104625803