Understanding GBK and UTF-8 encoding

GBK and UTF-8 are two different character encodings.

1. The main differences are as follows:

(1) Different character ranges: GBK covers simplified and traditional Chinese characters plus a limited set of other symbols (it includes Japanese kana, for example, but not Korean Hangul), while UTF-8 can encode every Unicode character;

(2) Different encoding methods: GBK uses 1 byte for ASCII characters and 2 bytes for everything else, including Chinese characters, while UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character;

(3) Different compatibility: GBK is widely used in China but rarely elsewhere, while UTF-8 has much better international compatibility;

(4) Different storage size: a Chinese character takes 2 bytes in GBK but usually 3 bytes in UTF-8, so Chinese-heavy text is smaller in GBK. ASCII characters take 1 byte in both encodings, so mostly-English text is the same size in either.

In short, GBK is suited to Chinese-language environments, while UTF-8 is suited to text in any language.
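
The byte counts above can be checked with a short, platform-independent sketch. The function names encodeUtf8 and countGbkChars are hypothetical helpers made up for this illustration, not part of any library. The first one builds the 1 to 4 byte UTF-8 sequence for a Unicode code point; the second counts characters in a GBK byte string using only the lead-byte range 0x81-0xFE, without validating the trail byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: most Chinese characters
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                    // surrogates are not valid characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: count characters in a GBK byte string.
// Bytes 0x00-0x7F are single-byte ASCII; a lead byte in 0x81-0xFE
// starts a two-byte character. The trail byte is not validated.
std::size_t countGbkChars(const std::string& gbk)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < gbk.size(); ++count) {
        unsigned char b = static_cast<unsigned char>(gbk[i]);
        i += (b >= 0x81 && b <= 0xFE && i + 1 < gbk.size()) ? 2 : 1;
    }
    return count;
}
```

For example, encodeUtf8(0x4E2D) yields the three bytes E4 B8 AD for the character 中, while the same character in GBK is the two bytes D6 D0.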

2. Transcoding examples (Windows API):

(1) Convert UTF-8 to GBK

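#include <windows.h>
#include <cstring>
#include <string>

std::string utf8ToGbk(const char *pszSrc)
{
	if (nullptr == pszSrc)
		return "";

	// MultiByteToWideChar is a Windows API function that converts a multi-byte
	// string (here UTF-8) to a wide-character (UTF-16) string. Called with a
	// NULL output buffer, it returns the required length including the terminator.
	int nLen = MultiByteToWideChar(CP_UTF8, 0, pszSrc, -1, NULL, 0);
	if (nLen <= 0)
		return "";
	wchar_t* pwszGBK = new wchar_t[nLen + 1];
	memset(pwszGBK, 0, (nLen + 1) * sizeof(wchar_t));
	MultiByteToWideChar(CP_UTF8, 0, pszSrc, -1, pwszGBK, nLen);

	// CP_ACP is the system ANSI code page. It is GBK (code page 936) only on
	// Simplified Chinese Windows; pass 936 instead to force GBK elsewhere.
	nLen = WideCharToMultiByte(CP_ACP, 0, pwszGBK, -1, NULL, 0, NULL, NULL);

	char* pszGBK = new char[nLen + 1];
	memset(pszGBK, 0, nLen + 1);
	WideCharToMultiByte(CP_ACP, 0, pwszGBK, -1, pszGBK, nLen, NULL, NULL);
	std::string strTemp(pszGBK);

	delete[] pwszGBK;
	pwszGBK = nullptr;

	delete[] pszGBK;
	pszGBK = nullptr;
	return strTemp;
}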

(2) Convert GBK to UTF-8

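// Requires <windows.h> and <string>.
std::string gbk2Utf8(const std::string& strData)
{
	// Step 1: GBK -> UTF-16. CP_ACP is the system ANSI code page, which is
	// GBK (code page 936) only on Simplified Chinese Windows.
	// The first call returns the required length including the terminator.
	int nLen = MultiByteToWideChar(CP_ACP, 0, strData.c_str(), -1, NULL, 0);
	if (nLen <= 0)
		return "";

	WCHAR *pWStr1 = new WCHAR[nLen];
	MultiByteToWideChar(CP_ACP, 0, strData.c_str(), -1, pWStr1, nLen);

	// Step 2: UTF-16 -> UTF-8.
	nLen = WideCharToMultiByte(CP_UTF8, 0, pWStr1, -1, NULL, 0, NULL, NULL);
	if (nLen <= 0)
	{
		delete[] pWStr1;
		return "";
	}

	char *pStr2 = new char[nLen];
	WideCharToMultiByte(CP_UTF8, 0, pWStr1, -1, pStr2, nLen, NULL, NULL);

	std::string strOutUtf8 = pStr2;

	delete[] pWStr1;
	pWStr1 = NULL;

	delete[] pStr2;
	pStr2 = NULL;

	return strOutUtf8;
}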
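
The two functions above depend on the Windows API. For comparison, here is a minimal sketch of the same conversion on a POSIX system using iconv instead. This is an alternative technique, not the method used above. It assumes the platform's iconv accepts the encoding names "GBK" and "UTF-8" (glibc does), and convertEncoding is a hypothetical helper name chosen for this illustration.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert `input` from encoding `from` to encoding `to`
// with POSIX iconv. Throws std::runtime_error if the conversion fails.
std::string convertEncoding(const std::string& input, const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);   // note the argument order: (to, from)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported encoding name");

    std::string output;
    char buffer[256];
    char* src = const_cast<char*>(input.data());   // iconv does not modify the input
    size_t srcLeft = input.size();

    while (srcLeft > 0) {
        char* dst = buffer;
        size_t dstLeft = sizeof(buffer);
        size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
        int err = errno;                            // save before any other call
        output.append(buffer, sizeof(buffer) - dstLeft);
        // E2BIG only means the output buffer filled up; loop and continue.
        if (rc == (size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }
    // GBK and UTF-8 are stateless, so no final flush call is made here.
    iconv_close(cd);
    return output;
}
```

With this helper, convertEncoding(text, "UTF-8", "GBK") plays the role of utf8ToGbk and convertEncoding(text, "GBK", "UTF-8") plays the role of gbk2Utf8. Unlike CP_ACP, the encoding is named explicitly, so the result does not depend on the system locale.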

3. You can also choose the character set in Visual Studio (VS) when coding

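Note: CString follows the project's character-set setting. In a project built with the Unicode character set, CString stores wide characters, so it cannot be used as a multi-byte string.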

Origin blog.csdn.net/bigger_belief/article/details/131442841