ANSI, UTF-8, and Unicode mutual conversion

Original text link: http://hexo.liferecords.top/post/3147335795.html

Introduction

In C++ projects, character encoding is a common pitfall, and the default encoding often differs from one platform to another. If text in one encoding is read as if it were in another, garbled characters appear. For that reason text is generally converted to UTF-8, a universal encoding that is well supported across platforms.

Character encoding knowledge

First, a brief introduction to Unicode, UTF-8, and ANSI.

  • Unicode: an international character set that assigns a unique number (code point) to every character and symbol in the world's writing systems.

    Unicode is only a symbol set: it specifies the code of each symbol but not how that code should be stored in bytes. On Windows, "Unicode" usually means UTF-16 stored in wchar_t, which uses two bytes for most characters.

  • UTF-8: UTF-8 is one of the Unicode implementations.

    It is a variable-length encoding that can represent any character in the Unicode standard, and its single-byte range is compatible with ASCII. An English character occupies one byte in UTF-8, so its encoding is the same as in ASCII, whereas a Chinese character usually requires three bytes. That differs from its ANSI encoding (GBK, for example), so reading the text with the wrong encoding produces garbled characters.

  • ANSI: the local code page, determined by the default encoding of the system

    Simplified Chinese Windows: GBK
    English Windows: Windows-1252 (a superset of ASCII)
    Traditional Chinese Windows: BIG5


  • GB2312/BIG5/GBK: encoding standards for Chinese text

    GB2312: Simplified Chinese
    BIG5: Traditional Chinese
    GBK: an extension of GB2312 that stays compatible with it and adds more Chinese characters, including traditional forms
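
To make the variable-length property concrete, here is a minimal sketch that reads the length of a UTF-8 sequence from its first byte. The function name Utf8SequenceLength is a hypothetical helper introduced for illustration, not part of any library.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence (a continuation
// byte or a byte that never appears in valid UTF-8).
std::size_t Utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;                   // 0xxxxxxx: same byte as ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx: most Chinese characters
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
    return 0;                                    // 0x80-0xC1 and 0xF5-0xFF
}
```

For the letter H (0x48) the function reports one byte, and for 0xE4, the first byte of a typical Chinese character, it reports three.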

The following table was obtained with the online character-encoding lookup mentioned in the article on detecting character encoding in C++. All values are hexadecimal.

Character  GB2312  Unicode   UTF-8   GBK
中         D6D0    00004E2D  E4B8AD  D6D0
H          48      00000048  48      48
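
The UTF-8 column of the table can be reproduced from the Unicode column with a few bit operations. The following is a minimal sketch, assuming the input is a single code point; EncodeUtf8 is a hypothetical name chosen for this example.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string EncodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

EncodeUtf8(0x4E2D) yields the bytes E4 B8 AD and EncodeUtf8(0x48) yields the single byte 48, matching the two table rows.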

Character encoding conversion

Unicode can represent every character and symbol, so it serves as the bridge for encoding conversion: text can be converted freely among Unicode, UTF-8, and ANSI by going through Unicode, as shown in the following figure:

[Figure: conversion paths among ANSI, Unicode, and UTF-8]

On the Windows platform, the conversions can be implemented with either the C++ standard library or the Windows API.

C++11 supports international standards reasonably well and provides interfaces for these conversions (note that std::wstring_convert and the <codecvt> facets were deprecated in C++17). The standard library has no interface for converting directly between UTF-8 and ANSI, so that conversion has to be wrapped by hand.

With the Windows API, conversion among ANSI, UTF-8, and Unicode relies mainly on two functions, WideCharToMultiByte and MultiByteToWideChar.

Windows API functions come in "A" and "W" versions: the "A" version works with the Windows code page, and the "W" version works with Unicode (UTF-16) characters.
Using the "W" versions is therefore the sensible choice on Windows.

  • Unicode to UTF-8: call WideCharToMultiByte with the CodePage parameter set to CP_UTF8;
  • UTF-8 to Unicode: call MultiByteToWideChar with the CodePage parameter set to CP_UTF8;
  • Unicode to ANSI: call WideCharToMultiByte with the CodePage parameter set to CP_ACP;
  • ANSI to Unicode: call MultiByteToWideChar with the CodePage parameter set to CP_ACP;
  • UTF-8 to ANSI: first convert UTF-8 to Unicode, and then convert Unicode to ANSI;
  • ANSI to UTF-8: first convert ANSI to Unicode, and then convert Unicode to UTF-8.
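
The six recipes above reduce to two building blocks parameterized by code page. The sketch below is one possible way to express that, writing directly into the result string instead of a temporary buffer. The names ToWide, FromWide, and Utf8ToAnsiSketch are hypothetical, and the `_WIN32` guard is only there because the code is Windows-specific.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper: bytes in code page `cp` (CP_ACP or CP_UTF8) -> UTF-16.
std::wstring ToWide(const std::string& in, UINT cp)
{
    if (in.empty()) return std::wstring();
    const int inLen = static_cast<int>(in.size());
    // With an explicit input length, the returned count excludes any terminating null.
    const int n = ::MultiByteToWideChar(cp, 0, in.data(), inLen, NULL, 0);
    if (n <= 0) return std::wstring();              // conversion failed
    std::wstring out(static_cast<size_t>(n), L'\0');
    ::MultiByteToWideChar(cp, 0, in.data(), inLen, &out[0], n);
    return out;
}

// Hypothetical helper: UTF-16 -> bytes in code page `cp`.
std::string FromWide(const std::wstring& in, UINT cp)
{
    if (in.empty()) return std::string();
    const int inLen = static_cast<int>(in.size());
    const int n = ::WideCharToMultiByte(cp, 0, in.data(), inLen, NULL, 0, NULL, NULL);
    if (n <= 0) return std::string();               // conversion failed
    std::string out(static_cast<size_t>(n), '\0');
    ::WideCharToMultiByte(cp, 0, in.data(), inLen, &out[0], n, NULL, NULL);
    return out;
}

// UTF-8 -> ANSI goes through UTF-16, as in the fifth recipe.
std::string Utf8ToAnsiSketch(const std::string& utf8)
{
    return FromWide(ToWide(utf8, CP_UTF8), CP_ACP);
}
#endif
```

Passing the string length rather than -1 avoids counting the terminating null, so no trimming is needed afterwards. Characters that the ANSI code page cannot represent are replaced by a default character during the last step.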

Unicode and ANSI

{% tabs unicode2ansi %}

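// Standard library version.
// The multibyte encoding is the one selected by the current C locale, so call
// std::setlocale(LC_CTYPE, "") once at startup to use the system code page.
#include <cwchar>
#include <memory>
#include <string>

std::string UnicodeToAnsi(const std::wstring & wstr)
{
    std::string ret;
    std::mbstate_t state = {};
    const wchar_t *src = wstr.data();
    // First pass: query the number of bytes needed, excluding the terminator
    size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (static_cast<size_t>(-1) != len) {
        std::unique_ptr< char [] > buff(new char[len + 1]);
        // Second pass: perform the conversion into the buffer
        len = std::wcsrtombs(buff.get(), &src, len, &state);
        if (static_cast<size_t>(-1) != len) {
            ret.assign(buff.get(), len);
        }
    }
    return ret;   // empty if a character cannot be converted
}

std::wstring AnsiToUnicode(const std::string & str)
{
    std::wstring ret;
    std::mbstate_t state = {};
    const char *src = str.data();
    // First pass: query the number of wide characters needed
    size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (static_cast<size_t>(-1) != len) {
        std::unique_ptr< wchar_t [] > buff(new wchar_t[len + 1]);
        // Second pass: perform the conversion into the buffer
        len = std::mbsrtowcs(buff.get(), &src, len, &state);
        if (static_cast<size_t>(-1) != len) {
            ret.assign(buff.get(), len);
        }
    }
    return ret;   // empty if the input is not valid in the current locale
}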

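// Windows API version (requires <windows.h>, <cstring>, <string>)
std::wstring AnsiToUnicode(const std::string &strAnsi)
{
	// Query the required buffer size in wide characters; because the input
	// length is -1, the count includes the terminating null
	int  nUnicodeLen = ::MultiByteToWideChar(CP_ACP,
		0,
		strAnsi.c_str(),
		-1,
		NULL,
		0);
	if (nUnicodeLen <= 0)
		return std::wstring();	// conversion failed

	// Allocate a zero-filled buffer of that size
	wchar_t* pUnicode = new wchar_t[nUnicodeLen + 1];
	memset((void*)pUnicode, 0, (nUnicodeLen + 1) * sizeof(wchar_t));

	// Convert
	::MultiByteToWideChar(CP_ACP,
		0,
		strAnsi.c_str(),
		-1,
		(LPWSTR)pUnicode,
		nUnicodeLen);

	std::wstring  strUnicode;
	strUnicode = (wchar_t*)pUnicode;
	delete[]pUnicode;

	return strUnicode;
}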

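std::string UnicodeToAnsi(const std::wstring& strUnicode)
{
	// Query the required buffer size in bytes, including the terminating null
	int nAnsiLen = ::WideCharToMultiByte(CP_ACP,
		0,
		strUnicode.c_str(),
		-1,
		NULL,
		0,
		NULL,
		NULL);
	if (nAnsiLen <= 0)
		return std::string();	// conversion failed

	// Allocate a zero-filled buffer of that size
	char *pAnsi = new char[nAnsiLen + 1];
	memset((void*)pAnsi, 0, (nAnsiLen + 1) * sizeof(char));

	// Convert; characters missing from the ANSI code page become a default character
	::WideCharToMultiByte(CP_ACP,
		0,
		strUnicode.c_str(),
		-1,
		pAnsi,
		nAnsiLen,
		NULL,
		NULL);

	std::string strAnsi;
	strAnsi = pAnsi;
	delete[]pAnsi;

	return strAnsi;
}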

{% endtabs %}
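
One detail the standard-library versions depend on is the C locale: std::mbsrtowcs and std::wcsrtombs use the multibyte encoding of the current LC_CTYPE category, and the default "C" locale is only guaranteed to handle plain ASCII. The sketch below shows one way to select the system locale and a variant of the conversion that fills a std::vector. UseSystemLocale and LocaleToWide are hypothetical names used only here.

```cpp
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: switch character handling to the user's default
// locale. Returns false if that locale is not available.
bool UseSystemLocale()
{
    return std::setlocale(LC_CTYPE, "") != nullptr;
}

// Hypothetical helper: multibyte string in the current locale -> wide string.
// Returns an empty string if the input is not valid in that locale.
std::wstring LocaleToWide(const std::string& str)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = str.c_str();
    // First pass: count the wide characters needed (terminator excluded).
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) return std::wstring();

    std::vector<wchar_t> buf(len + 1);   // room for the terminating null
    state = std::mbstate_t();            // restart from the initial shift state
    src = str.c_str();
    std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    return std::wstring(buf.data(), len);
}
```

A program would typically call UseSystemLocale() once at startup. On a Chinese Windows system that makes GBK input decode correctly, while under the default "C" locale only ASCII input is guaranteed to convert.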

UTF-8 and Unicode

{% tabs %}

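// Standard library version.
// std::wstring_convert and std::codecvt_utf8 are deprecated since C++17.
// Where wchar_t is 16 bits (Windows), codecvt_utf8<wchar_t> covers only UCS-2;
// std::codecvt_utf8_utf16<wchar_t> is needed for characters outside the BMP.
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

std::string UnicodeToUtf8(const std::wstring & wstr)
{
    std::string ret;
    try {
        std::wstring_convert< std::codecvt_utf8<wchar_t> > wcv;
        ret = wcv.to_bytes(wstr);
    } catch (const std::exception & e) {
        // to_bytes throws std::range_error on failure
        std::cerr << e.what() << std::endl;
    }
    return ret;
}

std::wstring Utf8ToUnicode(const std::string & str)
{
    std::wstring ret;
    try {
        std::wstring_convert< std::codecvt_utf8<wchar_t> > wcv;
        ret = wcv.from_bytes(str);
    } catch (const std::exception & e) {
        // from_bytes throws std::range_error on malformed UTF-8
        std::cerr << e.what() << std::endl;
    }
    return ret;
}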

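// Windows API version (requires <windows.h>, <cstring>, <string>)
std::wstring Utf8ToUnicode(const std::string& str)
{
	// Query the required buffer size in wide characters, including the terminating null
	int nUnicodeLen = ::MultiByteToWideChar(CP_UTF8,
		0,
		str.c_str(),
		-1,
		NULL,
		0);
	if (nUnicodeLen <= 0)
		return std::wstring();	// conversion failed

	// Allocate a zero-filled buffer of that size
	wchar_t*  pUnicode;
	pUnicode = new wchar_t[nUnicodeLen + 1];
	memset((void*)pUnicode, 0, (nUnicodeLen + 1) * sizeof(wchar_t));

	// Convert
	::MultiByteToWideChar(CP_UTF8,
		0,
		str.c_str(),
		-1,
		(LPWSTR)pUnicode,
		nUnicodeLen);

	std::wstring  strUnicode;
	strUnicode = (wchar_t*)pUnicode;
	delete []pUnicode;

	return strUnicode;
}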

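std::string UnicodeToUtf8(const std::wstring& strUnicode)
{
	// Query the required buffer size in bytes, including the terminating null
	int nUtf8Length = ::WideCharToMultiByte(CP_UTF8,
		0,
		strUnicode.c_str(),
		-1,
		NULL,
		0,
		NULL,
		NULL);
	if (nUtf8Length <= 0)
		return std::string();	// conversion failed

	// Allocate a zero-filled buffer of that size
	char* pUtf8 = new char[nUtf8Length + 1];
	memset((void*)pUtf8, 0, sizeof(char) * (nUtf8Length + 1));

	// Convert; the last two arguments must be NULL for CP_UTF8
	::WideCharToMultiByte(CP_UTF8,
		0,
		strUnicode.c_str(),
		-1,
		pUtf8,
		nUtf8Length,
		NULL,
		NULL);

	std::string strUtf8;
	strUtf8 = pUtf8;
	delete[] pUtf8;

	return strUtf8;
}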


{% endtabs %}
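
Because std::wstring_convert is deprecated in newer standards, it can help to see what the UTF-8 to Unicode step does at the byte level. The following is a hand-written decoder, not the standard-library or Windows implementation. It is a minimal sketch that produces UTF-32 code points and rejects malformed input, and the name DecodeUtf8 is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-32 code points.
// Throws std::runtime_error on malformed input.
std::u32string DecodeUtf8(const std::string& in)
{
    // Smallest code point allowed for each sequence length (rejects overlong forms).
    static const char32_t kMin[] = {0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes that follow
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);   // append the low six bits
        }

        if (cp < kMin[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Decoding the bytes E4 B8 AD gives the single code point U+4E2D. Turning the result into a Windows wstring would additionally require encoding code points above U+FFFF as UTF-16 surrogate pairs, which this sketch leaves out.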

UTF-8 and ANSI

// ANSI -> UTF-8: go through Unicode as the intermediate form
std::string AnsiToUtf8(const std::string &strAnsi)
{
	std::wstring strUnicode = AnsiToUnicode(strAnsi);
	return UnicodeToUtf8(strUnicode);
}

// UTF-8 -> ANSI: go through Unicode as the intermediate form
std::string Utf8ToAnsi(const std::string &strUtf8)
{
	std::wstring strUnicode = Utf8ToUnicode(strUtf8);
	return UnicodeToAnsi(strUnicode);
}




Reposted from: blog.csdn.net/liferecords/article/details/125872059