Original text link: http://hexo.liferecords.top/post/3147335795.html
Introduction
Character encoding is a common pitfall in C++ projects, because the default encoding often differs between platforms. If text stored in one encoding is read as though it were in another, garbled characters appear. The usual remedy is to convert everything to UTF-8, which is universal and well supported across platforms.
Character encoding knowledge
First, a brief introduction to Unicode, UTF-8, and ANSI.
- Unicode: the international universal character set. It assigns a numeric code point (U+0000 to U+10FFFF) to every character and symbol, so it can represent text in any of the world's writing systems. Unicode is only a symbol set: it stipulates the code of each symbol, but not how that code should be stored. On Windows, "Unicode" usually means UTF-16, in which most common characters occupy two bytes.
- UTF-8: one of the implementations of Unicode. It is a variable-length encoding (one to four bytes per character) that can represent any character in the Unicode standard, and its single-byte range is compatible with ASCII. English characters occupy one byte and are encoded exactly as in ASCII, while Chinese characters usually require three bytes. Those bytes differ from the ones a local encoding such as GBK uses for the same character, so reading UTF-8 text as GBK (or the reverse) produces garbled characters.
- ANSI: the local encoding, which depends on the default code page of the system:
  - Simplified Chinese Windows: GBK
  - English Windows: ASCII (in practice Windows-1252, a superset of ASCII)
  - Traditional Chinese Windows: BIG5
- GB2312/BIG5/GBK: encoding specifications for Chinese text:
  - GB2312: Simplified Chinese, the national standard of mainland China
  - BIG5: Traditional Chinese, originating in Taiwan
  - GBK: an extension of GB2312 that stays compatible with it and adds many more characters, including traditional ones; ASCII characters take one byte and Chinese characters take two
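To make the variable-length property of UTF-8 concrete, here is a minimal sketch that counts characters (code points) rather than bytes. The function name `CountCodePoints` is hypothetical, and the sketch assumes the input is already well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string.
// Every code point starts with exactly one byte that is NOT a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx), so counting the
// non-continuation bytes counts the code points.
// Assumption: the input is well-formed UTF-8.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a string holding one three-byte Chinese character followed by one ASCII letter, `size()` reports four bytes while this function reports two characters.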
The following table was obtained with the online encoding lookup tool mentioned in an article on detecting character encodings in C++. All values are hexadecimal.
| Character | GB2312 | Unicode | UTF-8 | GBK |
|---|---|---|---|---|
| 中 | D6D0 | 00004E2D | E4B8AD | D6D0 |
| H | 48 | 00000048 | 48 | 48 |
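The UTF-8 column can be reproduced from the Unicode column by hand. The sketch below is one way to do it, assuming nothing beyond the standard UTF-8 bit layout; `EncodeUtf8` is a hypothetical helper name, not part of any library.

```cpp
#include <cstdint>
#include <string>

// Encodes a single Unicode code point as UTF-8.
// Returns an empty string for values that are not valid Unicode scalar
// values (surrogates U+D800..U+DFFF or anything above U+10FFFF).
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;
    }
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `EncodeUtf8(0x4E2D)` yields the three bytes E4 B8 AD, and `EncodeUtf8(0x48)` yields the single byte 48, matching the table.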
Character encoding conversion
Because Unicode can represent every character and symbol, it serves as the bridge for encoding conversion: text can be converted freely among Unicode, UTF-8, and ANSI by passing through Unicode, as shown in the following figure:
On the Windows platform, you can use either the C++ standard library or the Windows API.
The C++11 standard library supports the international standards reasonably well and provides interfaces for converting between Unicode and UTF-8 and between Unicode and ANSI. Note that std::wstring_convert and the <codecvt> facets used for the UTF-8 conversion are deprecated since C++17. The standard library does not provide an interface for converting directly between UTF-8 and ANSI, so that conversion has to be wrapped by hand.
With the Windows API, conversion between ANSI, UTF-8, and Unicode relies mainly on two functions, WideCharToMultiByte and MultiByteToWideChar.
Windows API functions come in "A" and "W" versions: the "A" version works with the current Windows code page, and the "W" version works with Unicode (UTF-16) characters.
On Windows it is therefore wise to use the "W" version.
- Unicode to UTF-8: call WideCharToMultiByte with the CodePage parameter set to CP_UTF8;
- UTF-8 to Unicode: call MultiByteToWideChar with the CodePage parameter set to CP_UTF8;
- Unicode to ANSI: call WideCharToMultiByte with the CodePage parameter set to CP_ACP;
- ANSI to Unicode: call MultiByteToWideChar with the CodePage parameter set to CP_ACP;
- UTF-8 to ANSI: first convert UTF-8 to Unicode, and then convert Unicode to ANSI;
- ANSI to UTF-8: first convert ANSI to Unicode, and then convert Unicode to UTF-8.
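The two-step route through Unicode can be illustrated without any platform API. The sketch below uses ISO-8859-1 (Latin-1) as a stand-in for an ANSI code page, purely because each Latin-1 byte value equals its Unicode code point; a real code page such as GBK needs a lookup table or the Windows functions listed above. All function names here are hypothetical.

```cpp
#include <string>

// Step 1: legacy encoding -> Unicode code points.
// Assumption: the input is ISO-8859-1, where byte value == code point.
std::u32string Latin1ToCodePoints(const std::string& latin1)
{
    std::u32string out;
    for (unsigned char byte : latin1) {
        out += static_cast<char32_t>(byte);
    }
    return out;
}

// Step 2: Unicode code points -> UTF-8.
// Only code points below U+0800 are handled, which covers everything
// step 1 can produce; anything larger is replaced with '?'.
std::string CodePointsToUtf8(const std::u32string& cps)
{
    std::string out;
    for (char32_t cp : cps) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += '?';
        }
    }
    return out;
}

// The composed conversion never goes directly from one byte encoding to
// the other; Unicode is always the intermediate form.
std::string Latin1ToUtf8(const std::string& latin1)
{
    return CodePointsToUtf8(Latin1ToCodePoints(latin1));
}
```

The same shape applies to the ANSI and UTF-8 pair: only the two inner steps change, while the outer function remains a simple composition.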
Unicode and ANSI
{% tabs unicode2ansi %}
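// Standard-library version. Requires <cwchar>, <memory>, and <string>.
// The result depends on the current C locale: in the default "C" locale,
// non-ASCII input fails and an empty string is returned.
std::string UnicodeToAnsi(const std::wstring & wstr)
{
    std::string ret;
    std::mbstate_t state = {};
    const wchar_t *src = wstr.c_str();
    // First call: measure the required number of bytes (terminator excluded).
    size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (static_cast<size_t>(-1) != len) {
        std::unique_ptr< char [] > buff(new char[len + 1]);
        // Second call: perform the conversion.
        len = std::wcsrtombs(buff.get(), &src, len, &state);
        if (static_cast<size_t>(-1) != len) {
            ret.assign(buff.get(), len);
        }
    }
    return ret;
}

std::wstring AnsiToUnicode(const std::string & str)
{
    std::wstring ret;
    std::mbstate_t state = {};
    const char *src = str.c_str();
    // First call: measure the required number of wide characters.
    size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (static_cast<size_t>(-1) != len) {
        std::unique_ptr< wchar_t [] > buff(new wchar_t[len + 1]);
        // Second call: perform the conversion.
        len = std::mbsrtowcs(buff.get(), &src, len, &state);
        if (static_cast<size_t>(-1) != len) {
            ret.assign(buff.get(), len);
        }
    }
    return ret;
}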
// Windows API version. Requires <windows.h> and <string>.
std::wstring AnsiToUnicode(const std::string &strAnsi)
{
    // Get the size of the receiving buffer needed for the conversion.
    // With an input length of -1 the count includes the terminating null.
    int nUnicodeLen = ::MultiByteToWideChar(CP_ACP, 0, strAnsi.c_str(), -1, NULL, 0);
    if (nUnicodeLen <= 0) {
        return std::wstring();   // conversion failed
    }
    // Allocate a buffer of the required size.
    std::wstring strUnicode(nUnicodeLen, L'\0');
    // Convert.
    ::MultiByteToWideChar(CP_ACP, 0, strAnsi.c_str(), -1, &strUnicode[0], nUnicodeLen);
    strUnicode.resize(nUnicodeLen - 1);   // drop the terminating null
    return strUnicode;
}

std::string UnicodeToAnsi(const std::wstring& strUnicode)
{
    // Get the size of the receiving buffer, in bytes, including the terminating null.
    int nAnsiLen = ::WideCharToMultiByte(CP_ACP, 0, strUnicode.c_str(), -1, NULL, 0, NULL, NULL);
    if (nAnsiLen <= 0) {
        return std::string();   // conversion failed
    }
    std::string strAnsi(nAnsiLen, '\0');
    ::WideCharToMultiByte(CP_ACP, 0, strUnicode.c_str(), -1, &strAnsi[0], nAnsiLen, NULL, NULL);
    strAnsi.resize(nAnsiLen - 1);   // drop the terminating null
    return strAnsi;
}
{% endtabs %}
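The standard-library versions above rely on std::mbsrtowcs and std::wcsrtombs, whose notion of "ANSI" is whatever multibyte encoding the current C locale uses. A program starts in the "C" locale, so a reasonable assumption is that the locale should be switched once at startup. The sketch below shows that idea; `UseSystemLocale` and `LocaleToWide` are hypothetical names, and the resulting behavior for non-ASCII text depends on the locale the environment actually provides.

```cpp
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Adopts the locale configured in the user's environment.
// Returns false if the environment's locale cannot be honored.
bool UseSystemLocale()
{
    return std::setlocale(LC_ALL, "") != nullptr;
}

// Converts a string in the current locale's multibyte encoding to wide
// characters. Returns an empty string if the bytes are invalid there.
std::wstring LocaleToWide(const std::string& str)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = str.c_str();
    // First pass: count the wide characters needed (terminator excluded).
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) {
        return std::wstring();
    }
    std::vector<wchar_t> buf(len + 1);
    src = str.c_str();
    state = std::mbstate_t();
    // Second pass: convert into the buffer, including the terminator.
    len = std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    if (len == static_cast<std::size_t>(-1)) {
        return std::wstring();
    }
    return std::wstring(buf.data(), len);
}
```

A typical program would call `UseSystemLocale()` once at the top of `main` and only then convert any text. Plain ASCII input converts the same way in every ASCII-compatible locale, which makes it a convenient smoke test.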
UTF-8 and Unicode
{% tabs %}
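// Standard-library version. Requires <codecvt>, <locale>, <iostream>, and <string>.
// std::wstring_convert and std::codecvt_utf8 are deprecated since C++17.
// On Windows, where wchar_t is 16 bits, std::codecvt_utf8<wchar_t> handles
// only UCS-2; std::codecvt_utf8_utf16<wchar_t> is needed for characters
// outside the Basic Multilingual Plane.
std::string UnicodeToUtf8(const std::wstring & wstr)
{
    std::string ret;
    try {
        std::wstring_convert< std::codecvt_utf8<wchar_t> > wcv;
        ret = wcv.to_bytes(wstr);
    } catch (const std::exception & e) {
        // to_bytes throws std::range_error on unconvertible input.
        std::cerr << e.what() << std::endl;
    }
    return ret;
}

std::wstring Utf8ToUnicode(const std::string & str)
{
    std::wstring ret;
    try {
        std::wstring_convert< std::codecvt_utf8<wchar_t> > wcv;
        ret = wcv.from_bytes(str);
    } catch (const std::exception & e) {
        // from_bytes throws std::range_error on malformed UTF-8.
        std::cerr << e.what() << std::endl;
    }
    return ret;
}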
// Windows API version. Requires <windows.h> and <string>.
std::wstring Utf8ToUnicode(const std::string& str)
{
    // Get the required buffer size in wide characters, including the
    // terminating null (the input length is -1).
    int nUnicodeLen = ::MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
    if (nUnicodeLen <= 0) {
        return std::wstring();   // conversion failed
    }
    std::wstring strUnicode(nUnicodeLen, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, &strUnicode[0], nUnicodeLen);
    strUnicode.resize(nUnicodeLen - 1);   // drop the terminating null
    return strUnicode;
}

std::string UnicodeToUtf8(const std::wstring& strUnicode)
{
    // Get the required buffer size in bytes, including the terminating null.
    int nUtf8Length = ::WideCharToMultiByte(CP_UTF8, 0, strUnicode.c_str(), -1, NULL, 0, NULL, NULL);
    if (nUtf8Length <= 0) {
        return std::string();   // conversion failed
    }
    std::string strUtf8(nUtf8Length, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, strUnicode.c_str(), -1, &strUtf8[0], nUtf8Length, NULL, NULL);
    strUtf8.resize(nUtf8Length - 1);   // drop the terminating null
    return strUtf8;
}
{% endtabs %}
UTF-8 and ANSI
// Neither API converts directly between ANSI and UTF-8, so both
// directions go through Unicode using the functions defined above.
std::string AnsiToUtf8(const std::string &strAnsi)
{
    std::wstring strUnicode = AnsiToUnicode(strAnsi);
    return UnicodeToUtf8(strUnicode);
}

std::string Utf8ToAnsi(const std::string &strUtf8)
{
    std::wstring strUnicode = Utf8ToUnicode(strUtf8);
    return UnicodeToAnsi(strUnicode);
}
References
- https://www.jianshu.com/p/c23f3ea5443d
- https://blog.csdn.net/bladeandmaster88/article/details/54849660
- https://blog.csdn.net/bajianxiaofendui/article/details/83302855
- https://blog.csdn.net/Fengfgg/article/details/115539849