1. Fixing garbled Chinese output
1.1 Symptom
When jsoncpp serializes a string that contains Chinese text, the output of toStyledString() replaces each Chinese character with \u followed by four hexadecimal digits, so the text appears garbled when displayed.
For example:
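As a rough illustration of the symptom, the sketch below mimics what the writer does to non-ASCII text: it decodes each UTF-8 sequence and emits it as \u plus four hexadecimal digits. It is a simplified stand-in, not jsoncpp's own code. The name escapeNonAscii is hypothetical, and the function assumes well-formed UTF-8 sequences of at most three bytes (the Basic Multilingual Plane).

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, not part of jsoncpp: escape every non-ASCII
// character of a well-formed UTF-8 string as \uXXXX (BMP only).
std::string escapeNonAscii(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        unsigned int cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }  // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte sequence
        else               { cp = b & 0x0F; len = 3; }  // 3-byte sequence
        // Fold in the continuation bytes (6 payload bits each).
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\u%04x", cp);
            out += buf;
        }
        i += len;
    }
    return out;
}
```

With this sketch, the UTF-8 text "中文" comes out as \u4e2d\u6587, which is the kind of output described above.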
1.2 Cause and solution
To find the cause, look at the jsoncpp source code (official download address: http://sourceforge.net/projects/jsoncpp/files/ ). The writeValue function of StyledWriter escapes every string through the valueToQuotedStringN function:
static String valueToQuotedStringN(const char* value, unsigned length) {
    if (value == nullptr)
        return "";
    if (!isAnyCharRequiredQuoting(value, length))
        return String("\"") + value + "\"";
    // We have to walk value and escape any special characters.
    // Appending to String is not efficient, but this should be rare.
    // (Note: forward slashes are *not* rare, but I am not escaping them.)
    String::size_type maxsize = length * 2 + 3; // allescaped+quotes+NULL
    String result;
    result.reserve(maxsize); // to avoid lots of mallocs
    result += "\"";
    char const* end = value + length;
    for (const char* c = value; c != end; ++c) {
        switch (*c) {
        case '\"':
            result += "\\\"";
            break;
        case '\\':
            result += "\\\\";
            break;
        case '\b':
            result += "\\b";
            break;
        case '\f':
            result += "\\f";
            break;
        case '\n':
            result += "\\n";
            break;
        case '\r':
            result += "\\r";
            break;
        case '\t':
            result += "\\t";
            break;
        // case '/':
        // Even though \/ is considered a legal escape in JSON, a bare
        // slash is also legal, so I see no reason to escape it.
        // (I hope I am not misunderstanding something.)
        // blep notes: actually escaping \/ may be useful in javascript to avoid </
        // sequence.
        // Should add a flag to allow this compatibility mode and prevent this
        // sequence from occurring.
        default: {
            unsigned int cp = utf8ToCodepoint(c, end);
            // don't escape non-control characters
            // (short escape sequence are applied above)
            if (cp < 0x80 && cp >= 0x20)
                result += static_cast<char>(cp);
            else if (cp < 0x10000) { // codepoint is in Basic Multilingual Plane
                result += "\\u";
                result += toHex16Bit(cp);
            }
            else { // codepoint is not in Basic Multilingual Plane
                // convert to surrogate pair first
                cp -= 0x10000;
                result += "\\u";
                result += toHex16Bit((cp >> 10) + 0xD800);
                result += "\\u";
                result += toHex16Bit((cp & 0x3FF) + 0xDC00);
            }
        } break;
        }
    }
    result += "\"";
    return result;
}
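The surrogate-pair arithmetic in the last else branch can be checked in isolation. The sketch below repeats the same three steps for a single code point. toSurrogatePair is a hypothetical helper name used only for this illustration, not a jsoncpp function.

```cpp
#include <cassert>
#include <utility>

// Split a code point above U+FFFF into a UTF-16 surrogate pair,
// using the same arithmetic as the final else branch above.
std::pair<unsigned int, unsigned int> toSurrogatePair(unsigned int cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                              // 20 bits remain
    unsigned int high = (cp >> 10) + 0xD800;    // top 10 bits
    unsigned int low = (cp & 0x3FF) + 0xDC00;   // bottom 10 bits
    return std::make_pair(high, low);
}
```

For U+1F600 this gives the pair 0xD83D, 0xDE00, so such a character is written as two consecutive \u escapes.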
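The code shows that the default: branch handles every character without a short escape, including Chinese characters: it decodes the UTF-8 sequence into a code point and writes anything outside printable ASCII as a \uXXXX escape. To keep the original characters, modify the source code and recompile the library. Change: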
default: {
    unsigned int cp = utf8ToCodepoint(c, end);
    // don't escape non-control characters
    // (short escape sequence are applied above)
    if (cp < 0x80 && cp >= 0x20)
        result += static_cast<char>(cp);
    else if (cp < 0x10000) { // codepoint is in Basic Multilingual Plane
        result += "\\u";
        result += toHex16Bit(cp);
    }
    else { // codepoint is not in Basic Multilingual Plane
        // convert to surrogate pair first
        cp -= 0x10000;
        result += "\\u";
        result += toHex16Bit((cp >> 10) + 0xD800);
        result += "\\u";
        result += toHex16Bit((cp & 0x3FF) + 0xDC00);
    }
    //result += *c;
} break;
To:
default: {
    result += *c;
} break;
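One caveat: with the default branch reduced to result += *c, control characters below 0x20 that have no short escape (for example 0x01) are also copied through unescaped, which JSON does not allow inside strings. A more conservative variant keeps the \u escape for those and passes everything else through. The standalone sketch below shows the idea. quoteKeepingUtf8 is a hypothetical name, not a jsoncpp function, and only some of the short escapes are included for brevity.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the patched writer logic: quote a string
// for JSON, escaping only what JSON requires and leaving UTF-8 bytes
// (values >= 0x80) untouched.
std::string quoteKeepingUtf8(const std::string& value) {
    std::string result = "\"";
    for (std::size_t i = 0; i < value.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(value[i]);
        switch (c) {
        case '"':  result += "\\\""; break;
        case '\\': result += "\\\\"; break;
        case '\n': result += "\\n";  break;
        case '\r': result += "\\r";  break;
        case '\t': result += "\\t";  break;
        default:
            if (c < 0x20) {
                // Remaining control characters must stay escaped.
                char buf[8];
                std::snprintf(buf, sizeof buf, "\\u%04x",
                              static_cast<unsigned int>(c));
                result += buf;
            } else {
                // Printable ASCII and UTF-8 lead/continuation bytes.
                result += static_cast<char>(c);
            }
        }
    }
    result += "\"";
    return result;
}
```

Newer jsoncpp releases also appear to offer an emitUTF8 setting on Json::StreamWriterBuilder that serves the same purpose without patching the library. Whether it is available depends on the version in use, so treat that as something to verify.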
The final result is:
Reference link:
C++ jsoncpp uses toStyledString to generate string Chinese garbled solution
2. Fixing garbled output when parsing JSON that contains \uXXXX
2.1 Symptom
The JSON file is as follows:
Parsing result:
2.2 Cause
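Section 1 changed the valueToQuotedStringN function so that it no longer converts non-ASCII characters into \uXXXX escapes. When the parser reads a \uXXXX escape, it decodes it into a UTF-8 string, and toStyledString() now writes those UTF-8 bytes out unchanged. A console that expects the local multibyte encoding (for example GBK on Chinese Windows) shows UTF-8 bytes as garbage. So an extra conversion is needed here: first from UTF-8 to wide characters (UTF-16, which the Windows API calls "Unicode"), then to the local encoding.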
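To make the mismatch concrete, the sketch below shows how a code point taken from a \uXXXX escape becomes UTF-8 bytes. It is an illustrative re-implementation limited to the Basic Multilingual Plane, and the name codePointToUtf8 is an assumption made for this example rather than a reference to the library's internals.

```cpp
#include <string>

// Encode a BMP code point (U+0000..U+FFFF) as UTF-8.
std::string codePointToUtf8(unsigned int cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For the escape \u4e2d this yields the three bytes E4 B8 AD. A console that does not interpret its input as UTF-8 will not render those three bytes as the single character "中".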
2.3 Solution
Convert UTF-8 to UTF-16 ("Unicode" in Windows terminology):
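#include <windows.h>
#include <string>

// Convert a UTF-8 string to a UTF-16 wide string using the Win32 API.
std::wstring UTF8ToUnicode(const std::string& str)
{
    // First call: query the required length in wchar_t units. Because the
    // input length is -1, the count includes the terminating null.
    int unicodeLen = ::MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
    if (unicodeLen <= 0)
        return std::wstring();   // conversion failed

    // Second call: convert directly into the string's buffer, which avoids
    // the manual new/delete pair (the original used delete instead of delete[]).
    std::wstring rt(unicodeLen, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, &rt[0], unicodeLen);
    rt.resize(unicodeLen - 1);   // drop the terminating null
    return rt;
}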
Add this function to the program, together with a helper that converts the wide string to the local multibyte encoding, then call both:
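#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a wide string to the multibyte encoding of the "chs" locale
// (Simplified Chinese on Windows), then restore the previous locale.
std::string ws2s(const std::wstring& ws)
{
    std::string curLocale = setlocale(LC_ALL, NULL);   // remember the current locale
    setlocale(LC_ALL, "chs");
    // Assumes at most 2 bytes per wide character, as in GBK, plus the null.
    size_t destSize = 2 * ws.size() + 1;
    std::vector<char> dest(destSize, 0);
    size_t written = wcstombs(&dest[0], ws.c_str(), destSize);
    setlocale(LC_ALL, curLocale.c_str());
    if (written == static_cast<size_t>(-1))
        return std::string();                          // a character could not be converted
    return std::string(&dest[0], written);
}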
// Usage
std::string content = root["Cnki"][i]["content"].toStyledString();
wstring wstr = UTF8ToUnicode(content);  // convert UTF-8 to UTF-16 ("Unicode")
cout << ws2s(wstr) << endl;
Result: