C++ jsoncpp: fixing garbled Chinese and \uXXXX escapes in strings generated by toStyledString

Table of contents

1. Fixing garbled Chinese output

1.1. What the garbled output looks like

1.2. Cause and solution

2. Fixing garbled output when parsing \uXXXX escapes

2.1. What the garbled output looks like

2.2. Cause of the garbled characters

2.3. Solution

1. Fixing garbled Chinese output

1.1. What the garbled output looks like

When jsoncpp parses a string that contains Chinese, the string produced by toStyledString() has every Chinese character replaced by \u followed by four hexadecimal digits, so the output looks garbled. For example, 中 (U+4E2D) is written as \u4e2d.

1.2. Cause and solution

The cause can be found in the jsoncpp source code (official download address: http://sourceforge.net/projects/jsoncpp/files/). The writeValue function of StyledWriter passes every string through the valueToQuotedStringN function, which escapes it:

static String valueToQuotedStringN(const char* value, unsigned length) {
  if (value == nullptr)
    return "";

  if (!isAnyCharRequiredQuoting(value, length))
    return String("\"") + value + "\"";
  // We have to walk value and escape any special characters.
  // Appending to String is not efficient, but this should be rare.
  // (Note: forward slashes are *not* rare, but I am not escaping them.)
  String::size_type maxsize = length * 2 + 3; // allescaped+quotes+NULL
  String result;
  result.reserve(maxsize); // to avoid lots of mallocs
  result += "\"";
  char const* end = value + length;
  for (const char* c = value; c != end; ++c) {
    switch (*c) {
    case '\"':
      result += "\\\"";
      break;
    case '\\':
      result += "\\\\";
      break;
    case '\b':
      result += "\\b";
      break;
    case '\f':
      result += "\\f";
      break;
    case '\n':
      result += "\\n";
      break;
    case '\r':
      result += "\\r";
      break;
    case '\t':
      result += "\\t";
      break;
    // case '/':
    // Even though \/ is considered a legal escape in JSON, a bare
    // slash is also legal, so I see no reason to escape it.
    // (I hope I am not misunderstanding something.)
    // blep notes: actually escaping \/ may be useful in javascript to avoid </
    // sequence.
    // Should add a flag to allow this compatibility mode and prevent this
    // sequence from occurring.
    default: {
      unsigned int cp = utf8ToCodepoint(c, end);
      // don't escape non-control characters
      // (short escape sequence are applied above)
      if (cp < 0x80 && cp >= 0x20)
        result += static_cast<char>(cp);
      else if (cp < 0x10000) { // codepoint is in Basic Multilingual Plane
        result += "\\u";
        result += toHex16Bit(cp);
      } else { // codepoint is not in Basic Multilingual Plane
        // convert to surrogate pair first
        cp -= 0x10000;
        result += "\\u";
        result += toHex16Bit((cp >> 10) + 0xD800);
        result += "\\u";
        result += toHex16Bit((cp & 0x3FF) + 0xDC00);
      }
    } break;
    }
  }
  result += "\"";
  return result;
}
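To make the effect of the default: branch concrete, here is a minimal stand-alone sketch of the same decision structure. It does not use jsoncpp: decodeUtf8 and hexEscape are hypothetical stand-ins for the library's utf8ToCodepoint and toHex16Bit, the input is assumed to be well-formed UTF-8, and the short escapes (\n, \t and so on) handled earlier in the real switch are left out.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not jsoncpp's utf8ToCodepoint): decode the UTF-8
// sequence starting at s[i] and advance i past it. Assumes well-formed input.
static unsigned decodeUtf8(const std::string& s, std::size_t& i) {
  unsigned char b0 = static_cast<unsigned char>(s[i++]);
  if (b0 < 0x80) return b0;                        // single byte: ASCII
  int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
  unsigned cp = b0 & (0x3F >> extra);              // payload bits of the lead byte
  while (extra-- > 0 && i < s.size())
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  return cp;
}

// Hypothetical helper: format a 16-bit value as \uXXXX with lowercase hex.
static std::string hexEscape(unsigned v) {
  char buf[8];
  std::snprintf(buf, sizeof buf, "\\u%04x", v & 0xFFFFu);
  return buf;
}

// Same three-way decision as the default: branch shown above.
std::string escapeLikeDefaultBranch(const std::string& utf8) {
  std::string out;
  for (std::size_t i = 0; i < utf8.size();) {
    unsigned cp = decodeUtf8(utf8, i);
    if (cp >= 0x20 && cp < 0x80) {
      out += static_cast<char>(cp);                // printable ASCII stays as-is
    } else if (cp < 0x10000) {
      out += hexEscape(cp);                        // BMP: a single \uXXXX
    } else {
      cp -= 0x10000;                               // beyond the BMP: surrogate pair
      out += hexEscape((cp >> 10) + 0xD800);
      out += hexEscape((cp & 0x3FF) + 0xDC00);
    }
  }
  return out;
}
```

Calling escapeLikeDefaultBranch("a\xE4\xB8\xAD\xE6\x96\x87"), where the bytes after "a" are the UTF-8 encoding of 中文, returns a\u4e2d\u6587: the ASCII letter survives and each Chinese character becomes one escape.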

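The code shows that the default: branch handles every character not covered by the short escapes above it, including Chinese characters, and writes anything outside printable ASCII as \uXXXX. One fix is therefore to modify the source code and recompile the library. Change: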

    default: {
      unsigned int cp = utf8ToCodepoint(c, end);
      // don't escape non-control characters
      // (short escape sequence are applied above)
      if (cp < 0x80 && cp >= 0x20)
        result += static_cast<char>(cp);
      else if (cp < 0x10000) { // codepoint is in Basic Multilingual Plane
        result += "\\u";
        result += toHex16Bit(cp);
      } else { // codepoint is not in Basic Multilingual Plane
        // convert to surrogate pair first
        cp -= 0x10000;
        result += "\\u";
        result += toHex16Bit((cp >> 10) + 0xD800);
        result += "\\u";
        result += toHex16Bit((cp & 0x3FF) + 0xDC00);
      }
    } break;

To:

    default: {
      result += *c;
    } break;
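One caveat with the one-line version: it also copies through control characters below 0x20 that the original branch escaped, and JSON does not allow those raw inside a string. The following stand-alone sketch shows a more conservative variant of the same idea. It is a hypothetical function written for illustration, not jsoncpp code, and it assumes the input is already valid UTF-8.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-alone function, not part of jsoncpp.
// Quotes a UTF-8 string for JSON: non-ASCII bytes are copied through
// unchanged, while quotes, backslashes and control characters are escaped.
std::string quoteKeepingUtf8(const std::string& value) {
  std::string result = "\"";
  for (unsigned char ch : value) {
    switch (ch) {
    case '"':  result += "\\\""; break;
    case '\\': result += "\\\\"; break;
    case '\b': result += "\\b";  break;
    case '\f': result += "\\f";  break;
    case '\n': result += "\\n";  break;
    case '\r': result += "\\r";  break;
    case '\t': result += "\\t";  break;
    default:
      if (ch < 0x20) {
        // Remaining control characters must not appear raw in a JSON string.
        char buf[8];
        std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(ch));
        result += buf;
      } else {
        // Printable ASCII and every byte of a multi-byte UTF-8 sequence.
        result += static_cast<char>(ch);
      }
    }
  }
  result += "\"";
  return result;
}
```

In a patched copy of the library, the equivalent change would be to keep the escape for values below 0x20 in the default: branch and append *c for everything else.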

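After the library is rebuilt with this change, toStyledString() emits the Chinese characters as they are instead of as \uXXXX escapes.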

2. Fixing garbled output when parsing \uXXXX escapes

2.1. What the garbled output looks like

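The JSON file stores its Chinese text as \uXXXX escapes. After the file is parsed and the result of toStyledString() is printed, the Chinese text is displayed as garbled characters.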

2.2. Cause of the garbled characters

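Section 1 changed the valueToQuotedStringN function so that it no longer escapes non-ASCII characters. When jsoncpp parses a \uXXXX escape, it stores the character in the string as UTF-8, so after that change toStyledString() returns raw UTF-8 bytes. Printing those bytes directly to a console that does not use UTF-8 (presumably a Chinese-locale Windows console here) shows them as garbled text. The string therefore needs an extra conversion from UTF-8 to Unicode (a wide string) before it is printed.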
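To illustrate why a parsed \uXXXX value ends up as several bytes, this sketch encodes a single code point as UTF-8. The function name is hypothetical and the code is written for illustration rather than taken from jsoncpp; it assumes the code point is a valid Unicode scalar value.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::string codePointToUtf8Sketch(unsigned cp) {
  std::string out;
  if (cp < 0x80) {                                  // 1 byte:  0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {                          // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {                        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                                          // 4 bytes: 11110xxx and three 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For \u4e2d, codePointToUtf8Sketch(0x4E2D) yields the three bytes E4 B8 AD. A console that reads them in a different multibyte encoding shows unrelated characters, which is the garbling described above.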

2.3. Solution

UTF-8 to Unicode (wide string):

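#include <windows.h>
#include <string>

// Convert a UTF-8 string to a UTF-16 wide string (Windows only).
std::wstring UTF8ToUnicode(const std::string& str)
{
	// First call: query the required length in wchar_t. Because the input
	// length is passed as -1, the count includes the terminating null.
	int unicodeLen = ::MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
	if (unicodeLen <= 0)
		return std::wstring(); // conversion failed
	std::wstring rt(unicodeLen, L'\0');
	// Second call: convert directly into the string's buffer, which avoids
	// the manual new[] / delete[] pair (the original used plain delete).
	::MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, &rt[0], unicodeLen);
	rt.resize(unicodeLen - 1); // drop the terminating null
	return rt;
}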

Add this function to the program, along with a helper that converts the wide string to the locale's multibyte encoding, then call both:

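#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a wide string to a multibyte string using the "chs"
// (Simplified Chinese) locale name of the Microsoft C runtime.
std::string ws2s(const std::wstring& ws)
{
	// Remember the current locale so it can be restored afterwards.
	const char* cur = setlocale(LC_ALL, NULL);
	std::string curLocale = cur ? cur : "C";
	setlocale(LC_ALL, "chs");
	// As in the original sizing: at most 2 bytes per wide character,
	// plus the terminating null.
	std::vector<char> dest(2 * ws.size() + 1, '\0');
	size_t written = wcstombs(dest.data(), ws.c_str(), dest.size());
	setlocale(LC_ALL, curLocale.c_str());
	if (written == static_cast<size_t>(-1))
		return std::string(); // a character could not be converted
	return std::string(dest.data(), written);
}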


// Usage
std::string content = root["Cnki"][i]["content"].toStyledString();
std::wstring wstr = UTF8ToUnicode(content); // convert UTF-8 to a wide (Unicode) string
std::cout << ws2s(wstr) << std::endl;

Result: the Chinese text is displayed correctly.

Origin blog.csdn.net/wxplol/article/details/109505854