A Brief Look at Implementing a JSON Parser in C++

In this article I use jsonxx, an open-source project by GitHub user hjiang, as an example to briefly explain how a JSON parser is implemented.

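JSON (JavaScript Object Notation) is a lightweight data-interchange format. It represents data as text that is easy for people to read and write. JSON is built from three kinds of data:

Object: an unordered collection of key-value pairs. Pairs are separated by commas, and each key is separated from its value by a colon. An object is enclosed in curly braces "{}". For example: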
```
{
  "name": "John",
  "age": 30,
  "city": "New York"
}
```
Array: an ordered collection of values separated by commas. An array is enclosed in square brackets "[]". For example:
```
[10, 20, 30, 40, 50]
```
Value: a string, a number, a boolean, null, an object, or an array. For example:
```
"John"
30
true
null
{"name": "John", "age": 30}
[10, 20, 30]
```
These three kinds of data can be combined into complex structures. For example, the following JSON object contains both an array and a nested object:
```
{
  "name": "John",
  "age": 30,
  "cities": ["New York", "Paris", "London"],
  "contact": {
    "email": "[email protected]",
    "phone": "555-555-5555"
  }
}
```
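Here are some of the rules JSON text must follow:

Data format: a JSON document is typically an object, that is, a set of key-value pairs enclosed in curly braces {} and separated by commas.
Keys: a key must be a string, enclosed in double quotes "".
Values: a value can be a string, a number, a boolean, an array, an object, or null.
Strings: a string must be enclosed in double quotes "".
Numbers: a number can be an integer or a floating-point value, and exponent (scientific) notation such as 1.5e3 is allowed. Hexadecimal, leading zeros, NaN, and Infinity are not.
Booleans: a boolean has exactly two possible values, true and false.
Arrays: an array is written with square brackets [], and its elements are separated by commas.
Objects: an object is written with curly braces {}. Its properties are separated by commas, and each property consists of a key and a value separated by a colon.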

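An object is a set of key-value pairs whose keys are strings and whose values are "value"s, so a std::map<std::string, Value*> is a natural choice for its underlying storage. The class needs constructors, copy operations, and operator overloads, plus operations for looking up a key, adding entries, resetting, parsing, and producing a string.

The declaration is as follows: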

```cpp
class Object {
public:
    Object();
    ~Object();

    template <typename T>
    bool has(const std::string& key) const;

    // Always call has<>() first. If the key doesn't exist, consider
    // the behavior undefined.
    // Returns the value stored under key as a T. If the stored value is
    // not actually a T, the call fails at run time.
    template <typename T>
    T& get(const std::string& key);
    template <typename T>
    const T& get(const std::string& key) const;

    template <typename T>
    const T& get(const std::string& key, const typename identity<T>::type& default_value) const;

    size_t size() const;
    bool empty() const;
    // Keys are always strings; storing values by pointer keeps the map
    // entries small and cheap to move.
    typedef std::map<std::string, Value*> container;
    const container& kv_map() const; // returns value_map_
    // Serializes the object as JSON text.
    std::string json() const;
    // Serializes the object as XML text.
    std::string xml(unsigned format = JSONx, const std::string &header = std::string(), const std::string &attrib = std::string()) const;
    // Returns either the JSON or the XML string, depending on format.
    std::string write(unsigned format) const;

    void reset(); // clears all entries
    bool parse(std::istream &input);
    bool parse(const std::string &input); // parses from a string
    void import(const Object &other); // adds the entries of another object
    void import(const std::string &key, const Value &value);
    Object &operator<<(const Value &value);
    Object &operator<<(const Object &value);
    Object &operator=(const Object &value);
    Object(const Object &other);
    Object(const std::string &key, const Value &value);
    // Accepts keys given as char arrays (string literals).
    template<size_t N>
    Object(const char(&key)[N], const Value &value) {
        import(key, value);
    }
    template<typename T>
    Object &operator<<(const T &value);

protected:
    // Static parse helper: fills the given object, no instance needed to call it.
    static bool parse(std::istream& input, Object& object);
    container value_map_; // the underlying map
    std::string odd;
};
```
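The default-value overload of get declares its second parameter as `const typename identity<T>::type&`. The definition of `identity` is not shown in this article; a common reason for such a wrapper is to keep that parameter out of template argument deduction, so a string literal can serve as the default for a std::string. The sketch below illustrates the idea under that assumption. `identity_sketch` and `get_or` are hypothetical names, not part of jsonxx.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for jsonxx's identity<T> (assumed shape).
template <typename T>
struct identity_sketch {
    typedef T type;
};

// T is deduced from the map only. Because the default value's type is
// written as identity_sketch<T>::type, it sits in a non-deduced context,
// so the argument is simply converted to T instead of taking part in
// deduction.
template <typename T>
T get_or(const std::map<std::string, T>& m, const std::string& key,
         const typename identity_sketch<T>::type& default_value) {
    typename std::map<std::string, T>::const_iterator it = m.find(key);
    return it == m.end() ? default_value : it->second;
}
```

With a std::map<std::string, std::string> m, a call such as get_or(m, "city", "nowhere") compiles because the literal is converted to std::string. If the third parameter were declared as plain const T&, deduction would see both std::string and a char array and fail.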

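Likewise, an array has to store a sequence of values, so we use a std::vector<Value*> as its underlying container.

This class also needs constructors, copy operations, and operator overloads, plus operations for looking up a value by index, appending, resetting, parsing, and producing a string.

The declaration is as follows: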

```cpp
class Array {
public:
    Array();
    ~Array();

    size_t size() const;
    bool empty() const;

    template <typename T>
    bool has(unsigned int i) const;

    template <typename T>
    T& get(unsigned int i);
    template <typename T>
    const T& get(unsigned int i) const;

    template <typename T>
    const T& get(unsigned int i, const typename identity<T>::type& default_value) const;

    const std::vector<Value*>& values() const {
        return values_;
    }
    std::string json() const;
    std::string xml(unsigned format = JSONx, const std::string &header = std::string(), const std::string &attrib = std::string()) const;

    std::string write(unsigned format) const { return format == JSON ? json() : xml(format); }
    void reset();
    bool parse(std::istream &input);
    bool parse(const std::string &input);
    typedef std::vector<Value*> container;
    void append(const Array &other);
    void append(const Value &value) { import(value); }
    void import(const Array &other);
    void import(const Value &value);
    Array &operator<<(const Array &other);
    Array &operator<<(const Value &value);
    Array &operator=(const Array &other);
    Array &operator=(const Value &value);
    Array(const Array &other);
    Array(const Value &value);
protected:
    static bool parse(std::istream& input, Array& array);
    container values_; // std::vector<Value*>
};
```
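Because both containers hold raw Value* pointers, the copy constructor, assignment operator, reset, and destructor have to manage ownership by hand. The following is a minimal sketch of that pattern, not jsonxx's actual implementation. `Node` and `OwningList` are hypothetical names standing in for Value and Array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element type standing in for jsonxx::Value.
struct Node {
    int payload;
    explicit Node(int p) : payload(p) {}
};

// A container that owns heap-allocated elements through raw pointers.
class OwningList {
public:
    OwningList() {}
    OwningList(const OwningList& other) { import(other); }
    OwningList& operator=(const OwningList& other) {
        if (this != &other) {   // guard against self-assignment
            reset();
            import(other);
        }
        return *this;
    }
    ~OwningList() { reset(); }

    // Append a deep copy of one element.
    void import(const Node& n) { items_.push_back(new Node(n)); }

    // Append deep copies of every element of another list.
    void import(const OwningList& other) {
        // Capture the size first so that importing a list into itself ends.
        const std::size_t count = other.items_.size();
        for (std::size_t i = 0; i < count; ++i)
            import(*other.items_[i]);
    }

    // Delete every owned element and empty the container.
    void reset() {
        for (std::size_t i = 0; i < items_.size(); ++i)
            delete items_[i];
        items_.clear();
    }

    std::size_t size() const { return items_.size(); }
    Node& at(std::size_t i) { return *items_[i]; }

private:
    std::vector<Node*> items_;
};
```

Copying an OwningList yields independent elements, so changing an element of the copy leaves the original untouched, and each list deletes only what it allocated.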

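The value class is a little more involved. A value can be a string, a number, a boolean, an array, an object, or null, so the class needs a kind of "polymorphic" storage. In C++ we can get this with a union: at any moment only one member of a union holds a value, and once one member has been assigned, the other members must not be read. For that reason we also need an enum that records the current type of the value.

As before, the class needs constructors, copy operations, and operator overloads, plus operations for importing data, resetting, parsing, producing a string, and querying the type.

The declaration is as follows: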

```cpp
class Value {
public:

    Value();
    ~Value() { reset(); }
    void reset();

    template<typename T>
    void import(const T &) {
        reset();
        type_ = INVALID_;
        // debug
        // std::cout << "[WARN] No support for " << typeid(t).name() << std::endl;
    }
    void import(const bool &b) {
        reset();
        type_ = BOOL_;
        bool_value_ = b;
    }
#define $number(TYPE) \
  void import( const TYPE &n ) { \
    reset(); \
    type_ = NUMBER_; \
    number_value_ = static_cast<long double>(n); \
  }
    $number(char)
    $number(int)
    $number(long)
    $number(long long)
    $number(unsigned char)
    $number(unsigned int)
    $number(unsigned long)
    $number(unsigned long long)
    $number(float)
    $number(double)
    $number(long double)
#undef $number
#if JSONXX_COMPILER_HAS_CXX11 > 0
    void import(const std::nullptr_t &) {
        reset();
        type_ = NULL_;
    }
#endif
    void import(const Null &) {
        reset();
        type_ = NULL_;
    }
    void import(const String &s) {
        reset();
        type_ = STRING_;
        *(string_value_ = new String()) = s;
    }
    void import(const char* s) {
        reset();
        type_ = STRING_;
        *(string_value_ = new String()) = s;
    }
    void import(const Array &a) {
        reset();
        type_ = ARRAY_;
        *(array_value_ = new Array()) = a;
    }
    void import(const Object &o) {
        reset();
        type_ = OBJECT_;
        *(object_value_ = new Object()) = o;
    }
    void import(const Value &other) {
        if (this != &other)
            switch (other.type_) {
            case NULL_:
                import(Null());
                break;
            case BOOL_:
                import(other.bool_value_);
                break;
            case NUMBER_:
                import(other.number_value_);
                break;
            case STRING_:
                import(*other.string_value_);
                break;
            case ARRAY_:
                import(*other.array_value_);
                break;
            case OBJECT_:
                import(*other.object_value_);
                break;
            case INVALID_:
                type_ = INVALID_;
                break;
            default:
                JSONXX_ASSERT(!"not implemented");
            }
    }
    template<typename T>
    Value &operator <<(const T &t) {
        import(t);
        return *this;
    }
    template<typename T>
    Value &operator =(const T &t) {
        reset();
        import(t);
        return *this;
    }
    Value(const Value &other);
    template<typename T>
    Value(const T&t) : type_(INVALID_) { import(t); }
    template<size_t N>
    Value(const char(&t)[N]) : type_(INVALID_) { import(std::string(t)); }

    bool parse(std::istream &input);
    bool parse(const std::string &input);

    template<typename T>
    bool is() const;
    template<typename T>
    T& get();
    template<typename T>
    const T& get() const;

    bool empty() const;

public:
    enum {
        NUMBER_,
        STRING_,
        BOOL_,
        NULL_,
        ARRAY_,
        OBJECT_,
        INVALID_
    } type_;
    // Tagged-union storage: type_ says which member is currently valid.
    // This acts as a hand-rolled form of polymorphism over the stored data.
    union {
        Number number_value_;
        Boolean bool_value_;
        String* string_value_;
        Array* array_value_;
        Object* object_value_;
    };

protected:
    static bool parse(std::istream& input, Value& value);
};
```
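To make the enum-plus-union idea easier to see in isolation, here is a much smaller sketch of the same technique. It is not jsonxx's Value: `TinyValue` and its member names are hypothetical, it supports only three types, and copying is disabled to keep it short.

```cpp
#include <string>

// Hypothetical miniature of a tagged union; not jsonxx's actual Value.
class TinyValue {
public:
    enum Type { NUMBER_, BOOL_, STRING_, INVALID_ };

    TinyValue() : type_(INVALID_) {}
    ~TinyValue() { reset(); }

    void set_number(double n) { reset(); type_ = NUMBER_; number_ = n; }
    void set_bool(bool b)     { reset(); type_ = BOOL_;   bool_ = b; }
    void set_string(const std::string& s) {
        // Copy before reset() in case s refers to our own string.
        std::string* copy = new std::string(s);
        reset();
        type_ = STRING_;
        string_ = copy;
    }

    Type type() const { return type_; }
    // Each accessor is only meaningful when type() reports the matching tag.
    double number() const { return number_; }
    bool boolean() const { return bool_; }
    const std::string& string() const { return *string_; }

    // Frees heap storage if the active member owns any, then clears the tag.
    void reset() {
        if (type_ == STRING_)
            delete string_;
        type_ = INVALID_;
    }

private:
    TinyValue(const TinyValue&);            // copying disabled in this sketch
    TinyValue& operator=(const TinyValue&);

    Type type_;
    union {                 // only the member named by type_ may be read
        double number_;
        bool bool_;
        std::string* string_;
    };
};
```

The setters have distinct names on purpose. With a single overloaded setter, an int argument would be ambiguous between the double and bool overloads, and a string literal would prefer bool over std::string. That overload behavior is a likely reason the Value class above spells out an import overload for every numeric type and for const char*.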

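With these three basic classes in place, we need a set of helper functions for parsing the input stream. The rest of the article assumes some familiarity with istream. Let's continue.

The first function, match, checks whether the input, starting at the current position, matches a given string. If it does, the stream's read position is advanced to just past the pattern. With this function we can ask questions of an istream. Is it an object? Then it must start with {. Is it a comment? Then its first two characters must be //.

Here is its implementation: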

```cpp
bool match(const char* pattern, std::istream& input) {
    input >> std::ws; // skip leading whitespace
    const char* cur(pattern);
    char ch(0);
    while (input && !input.eof() && *cur != 0) {
        input.get(ch);
        if (ch != *cur) {
            input.putback(ch); // mismatch: put the character back
            // If a comment starts here, skip it and try again.
            if (parse_comment(input))
                continue;
            // No match: return the characters consumed so far to the
            // stream, in reverse order.
            while (cur > pattern) {
                cur--;
                input.putback(*cur);
            }
            return false;
        }
        else {
            // Matched so far; advance to the next pattern character.
            cur++;
        }
    }
    return *cur == 0;
}
```

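match first uses std::ws to discard leading whitespace in input, then takes characters from input one at a time and compares each with the character at cur. On a mismatch it puts the offending character back and, unless a comment starts at that position, loops to put back the characters it has already consumed and returns false. On a match it moves on to the next character, until the whole pattern has been matched.

Note: when match returns true, the read position of input has been advanced past the pattern.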
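The put-back behavior is easier to see on a concrete stream. The sketch below is a simplified variant of match with the comment handling left out, plus a small helper that returns whatever is still unread. `match_sketch` and `rest_of` are hypothetical names used only for this illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Simplified variant of match(): same put-back logic, no comment handling.
bool match_sketch(const char* pattern, std::istream& input) {
    input >> std::ws;                 // skip leading whitespace
    const char* cur = pattern;
    char ch = 0;
    while (input && !input.eof() && *cur != 0) {
        input.get(ch);
        if (ch != *cur) {
            input.putback(ch);        // return the mismatching character
            while (cur > pattern) {   // then everything consumed before it
                --cur;
                input.putback(*cur);
            }
            return false;
        }
        ++cur;
    }
    return *cur == 0;
}

// Reads and returns everything that is still unread in the stream.
std::string rest_of(std::istream& input) {
    std::string rest;
    std::getline(input, rest, '\0');
    return rest;
}
```

On a stream holding "  true, 1", a failed attempt such as match_sketch("tru3", in) leaves the text "true, 1" readable again (only the leading whitespace is gone), while a successful match_sketch("true", in) leaves just ", 1" in the stream.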

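Next comes a family of parse functions. We will look at two representative ones.

The first is parse_string. It checks whether the istream argument holds a valid JSON string and, if so, fills in the String argument, which is passed by reference.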

```cpp
bool parse_string(std::istream& input, String& value) {
    char ch = '\0', delimiter = '\"'; // the delimiter defaults to a double quote
    if (!match("\"", input)) {
        if (parser_is_strict()) {
            // In strict mode, a missing opening double quote (after any
            // whitespace) is an immediate failure.
            return false;
        }
        delimiter = '\'';
        if (input.peek() != delimiter) {
            return false;
        }
        input.get(ch); // consume the opening single quote
    }
    while (!input.eof() && input.good()) {
        input.get(ch);
        if (ch == delimiter) {
            break;
        }
        if (ch == '\\') {
            input.get(ch);
            switch (ch) {
            case '\\':
            case '/':
                value.push_back(ch);
                break;
            case 'b':
                value.push_back('\b');
                break;
            case 'f':
                value.push_back('\f');
                break;
            case 'n':
                value.push_back('\n');
                break;
            case 'r':
                value.push_back('\r');
                break;
            case 't':
                value.push_back('\t');
                break;
            case 'u': {
                // Read four hex digits and convert them to an integer.
                // Note: the cast below keeps only the low byte.
                int i;
                std::stringstream ss;
                for (i = 0; (!input.eof() && input.good()) && i < 4; ++i) {
                    input.get(ch);
                    ss << std::hex << ch;
                }
                if (input.good() && (ss >> i))
                    value.push_back(static_cast<char>(i));
                break;
            }
            default:
                if (ch != delimiter) {
                    value.push_back('\\');
                    value.push_back(ch);
                }
                else value.push_back(ch);
                break;
            }
        }
        else {
            value.push_back(ch);
        }
    }
    if (input && ch == delimiter) {
        return true;
    }
    else {
        return false;
    }
}
```

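As before, leading whitespace is discarded first (match does this with std::ws), and then the first character of the input is checked. In strict mode a string must begin with a double quote; in permissive mode it may also begin with a single quote. If neither is found, the function returns false. A loop then takes characters from input one at a time. If the character is the delimiter, the loop ends. If it is a backslash, the next character is read and the character that the escape sequence stands for is appended to the string (an unrecognized escape is appended unchanged, backslash included). An ordinary character is appended directly. After the loop, the function returns true only if the stream is still in a good state and the last character read was the delimiter. Otherwise it returns false.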
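One detail of the \u branch is worth noting: the four hex digits are converted to an integer, but static_cast<char> keeps only its low byte, so code points above U+007F are not preserved. If UTF-8 output were wanted instead, the conversion could look like the sketch below. This is not what jsonxx does; `append_utf8` is a hypothetical helper, it assumes the code point fits in 16 bits, and it does not combine surrogate pairs.

```cpp
#include <string>

// Hypothetical helper (not part of jsonxx): appends the UTF-8 encoding of
// a code point in the range U+0000..U+FFFF. Surrogate pairs are not combined.
void append_utf8(std::string& out, unsigned int cp) {
    if (cp < 0x80) {
        // One byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}
```

For example, the escape \u4E2D would become the three bytes E4 B8 AD rather than the single byte 2D.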

Next is parse_comment, which parses comments. Standard JSON does not allow comments, but JSON5 relaxes the format. Its features include the following:

Objects may have a single trailing comma.
Arrays may have a single trailing comma.
Strings may be single quoted.
Strings may include character escapes.
Strings may span multiple lines by escaping new line characters.
Numbers may be hexadecimal.
Numbers may have a leading or trailing decimal point.
Numbers may be IEEE 754 positive infinity, negative infinity, and NaN.
Numbers may begin with an explicit plus sign.
Single and multi-line comments are allowed.
Additional white space characters are allowed.

JSON5 is more permissive and more complex than JSON, but in the spirit of keeping up with the times, let's look at how comments are recognized.

Here is the code:

```cpp
bool parse_comment(std::istream &input) {
    if (parser_is_permissive())
        if (!input.eof() && input.peek() == '/')
        {
            char ch0(0);
            input.get(ch0);

            if (!input.eof())
            {
                char ch1(0);
                input.get(ch1);

                if (ch0 == '/' && ch1 == '/')
                {
                    // trim chars till \r or \n
                    for (char ch(0); !input.eof() && (input.peek() != '\r' && input.peek() != '\n'); )
                        input.get(ch);

                    // consume spaces, tabs, \r or \n, in case no eof is found
                    if (!input.eof())
                        input >> std::ws;
                    return true;
                }

                input.unget();
                input.clear();
            }

            input.unget();
            input.clear();
        }

    return false;
}
```

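If you followed the previous functions, this one is straightforward. Comments are only parsed when parser_is_permissive() holds. If input is not at eof and the next character is '/', the function gets one character and then another. If both are '/', this part of the input is a comment, so the function consumes characters until it reaches a carriage return or line feed, skips the whitespace that follows, and returns true. If the two characters are not both '/', it calls unget to move the read position back to where it started and returns false. As shown here, the function only handles single-line // comments.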
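The peek, get, and unget sequence can be tried out on its own. The following sketch reduces the idea to its core: it has no permissive-mode check and uses a hypothetical name, `skip_line_comment`, so it should be read as an illustration rather than as jsonxx's code.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Simplified take on parse_comment(): skips one "//" comment if present.
bool skip_line_comment(std::istream& input) {
    const int eof = std::char_traits<char>::eof();
    if (input.peek() != '/')
        return false;
    input.get();                      // consume the first '/'
    if (input.peek() != '/') {
        input.clear();                // peek() may have set eofbit
        input.unget();                // a lone '/', so restore it
        return false;
    }
    input.get();                      // consume the second '/'
    // Discard the rest of the line.
    while (input.peek() != eof && input.peek() != '\n' && input.peek() != '\r')
        input.get();
    // Skip the line break and following whitespace unless we hit the end.
    if (input.good())
        input >> std::ws;
    return true;
}
```

After skip_line_comment runs on a stream holding "// note\n  42", the next formatted read yields 42. On a stream holding "/ 2" it returns false and the '/' is still the next character.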

Next, let's look at tag, a function that takes a value and returns it as a correctly formatted JSON string.

Here is the code:

```cpp
// Returns the value as a correctly formatted string.
// Note: the returned string always ends with an extra comma.
std::string tag(unsigned format, unsigned depth, const std::string &name, const jsonxx::Value &t) {
    std::stringstream ss;
    // One '\t' per nesting level.
    const std::string tab(depth, '\t');

    if (!name.empty())
        ss << tab << '\"' << escape_string(name) << '\"' << ':' << ' ';
    else
        ss << tab;

    switch (t.type_)
    {
    default:
    case jsonxx::Value::NULL_:
        ss << "null";
        return ss.str() + ",\n";

    case jsonxx::Value::BOOL_:
        ss << (t.bool_value_ ? "true" : "false");
        return ss.str() + ",\n";

    case jsonxx::Value::ARRAY_:
        ss << "[\n";
        for (Array::container::const_iterator it = t.array_value_->values().begin(),
            end = t.array_value_->values().end(); it != end; ++it)
            ss << tag(format, depth + 1, std::string(), **it);
        return remove_last_comma(ss.str()) + tab + "]" ",\n";

    case jsonxx::Value::STRING_:
        ss << '\"' << escape_string(*t.string_value_) << '\"';
        return ss.str() + ",\n";

    case jsonxx::Value::OBJECT_:
        ss << "{\n";
        for (Object::container::const_iterator it = t.object_value_->kv_map().begin(),
            end = t.object_value_->kv_map().end(); it != end; ++it)
            ss << tag(format, depth + 1, it->first, *it->second);
        return remove_last_comma(ss.str()) + tab + "}" ",\n";

    case jsonxx::Value::NUMBER_:
        // max precision
        ss << std::setprecision(std::numeric_limits<long double>::digits10 + 1);
        ss << t.number_value_;
        return ss.str() + ",\n";
    }
}
} // presumably closes an enclosing scope of the original source that is not shown here
```

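name is the name of the value. If it is not empty, it is written to the stringstream first, quoted and followed by a colon. Then, depending on the type of the value, the function appends the matching text to the stringstream and returns the resulting string. Most types are simple; the ones to watch are object and array. For an object, we first write a { to the stringstream, to be matched by the } added at the end. We then iterate over the object's map and call tag recursively on each entry, with depth increased by one so that nested objects are indented correctly. Arrays are handled in much the same way. In both cases the accumulated string is passed through remove_last_comma before the closing bracket is appended, because every call to tag ends its output with a comma.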
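The "always emit a trailing comma, then strip the last one" technique can be shown without any of the jsonxx types. In the sketch below, `strip_last_comma` and `render` are hypothetical names; jsonxx's own remove_last_comma is not shown in this article and may be implemented differently.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: removes the comma of a trailing ",\n", if present.
std::string strip_last_comma(std::string s) {
    const std::size_t n = s.size();
    if (n >= 2 && s[n - 2] == ',' && s[n - 1] == '\n')
        s.erase(n - 2, 1);
    return s;
}

// Every rendered value ends with ",\n"; the caller strips the last one.
std::string render(int value, unsigned depth) {
    std::ostringstream ss;
    ss << std::string(depth, '\t') << value;
    return ss.str() + ",\n";
}

// Renders a (possibly nested) vector as a JSON-style array.
template <typename T>
std::string render(const std::vector<T>& items, unsigned depth) {
    const std::string tab(depth, '\t');
    std::string out = tab + "[\n";
    for (std::size_t i = 0; i < items.size(); ++i)
        out += render(items[i], depth + 1);   // children are one level deeper
    return strip_last_comma(out) + tab + "]" + ",\n";
}
```

The top-level caller applies strip_last_comma once more to drop the final comma. Each level only has to fix up the last child it appended, which keeps the recursion simple.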

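The remaining functions mostly deal with adding to, looking up in, and overloading operators for the three classes that correspond to JSON's three kinds of data. Once you understand the functions above, they are not difficult. I encourage you to download the source from GitHub and study it at your own pace; this article is only a starting point. If you find any mistakes, please point them out.

Source code: GitHub - hjiang/jsonxx: A JSON parser in C++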
