Exploring the garbled-text problem in Visual Studio

Nobody enjoys running into garbled text, and it gives me a headache every time. When a program produces garbled text in Visual Studio, there are three concepts to get straight; once they are clear, the problem is easy to solve.

Three character set concepts

Source character set

MSVC: /source-charset

This is the character set of the source code text file. In text editors such as Notepad++, Notepad, and VS Code, you can open the source file and check its character set (file encoding).

The source code text file is stored on disk as bytes, whether it contains Chinese or English. When you type a Chinese character and save the file, the character is encoded into bytes according to the character set you specified. When you open the file, those bytes are decoded back into characters according to the character set you specify at that point. If the two steps use different character sets, garbled characters appear.
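To see why decoding with the wrong character set goes wrong, it helps to look at the raw bytes. The following is a minimal sketch, not a full validator: the function name looks_like_utf8 is made up for this illustration, and it only checks the lead-byte/continuation-byte structure of UTF-8.

```cpp
#include <cstddef>
#include <string>

// Structural check only: each lead byte must be followed by the right number
// of continuation bytes (10xxxxxx). Overlong forms and surrogates are not
// rejected, so this is an illustration rather than a full validator.
bool looks_like_utf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;                       // continuation bytes expected
        if (lead < 0x80)                extra = 0;   // ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;   // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) extra = 2;   // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) extra = 3;   // 11110xxx
        else return false;                           // stray continuation or invalid lead
        if (extra > bytes.size() - i - 1) return false;  // sequence cut short
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

The UTF-8 bytes of "我" (E6 88 91) pass this check, while its GBK bytes (CE D2) do not, because D2 is not a valid continuation byte. That is roughly what happens when an editor opens a GBK file as UTF-8.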

Execution character set

MSVC: /execution-charset

In C++, given const char* str = "我";, the execution character set determines which bytes the compiler stores in str when it compiles this line. You may ask: hasn't the source character set already determined the binary representation of this "我"? Yes, but the execution character set lets you re-encode it at this point. For example, my source character set may be UTF-8, but I can set the execution character set so that str ends up stored as GBK-encoded bytes.
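A quick way to find out which bytes the compiler actually stored is to dump them in hexadecimal. This is a small sketch; hex_bytes is a hypothetical helper name, not part of any library.

```cpp
#include <cstdio>
#include <string>

// Returns the bytes of s as space-separated uppercase hex pairs.
std::string hex_bytes(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        char buf[3];  // two hex digits plus the terminating '\0'
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_bytes(str) on the str above should give E6 88 91 when the execution character set is UTF-8 and CE D2 when it is GBK, assuming the compiler read the source file with the correct source character set.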

Parsing character set

You need this when you finally want to turn those bytes back into displayed characters. For example, when printf() sends the earlier str to the console, the bytes are interpreted according to the parsing character set, and the corresponding characters are looked up and displayed.

Character set analysis in Visual Studio

By default, Visual Studio detects the byte order mark to determine whether the source file is in an encoded Unicode format, such as UTF-16 or UTF-8. If no byte order mark is found, the source file is assumed to be encoded in the current code page, unless the /source-charset or /utf-8 option is used to specify a character set name or code page. Visual Studio allows C++ source code to be saved in any of several character encodings.

A code page is a character set that can include numbers, punctuation, and other symbols. Different languages and locales may use different code pages. For example, ANSI code page 1252 is suitable for English and most European languages; OEM code page 932 is suitable for Japanese kanji.

The above is Microsoft's official description, which is a bit convoluted. In short, by default Visual Studio chooses the execution character set from the system locale. Most readers here run a Chinese edition of Windows with the locale set to China, so the result is GBK (code page 936). As for the parsing character set, unless it is changed manually, standard output (printf, cout) to the command line also follows the system locale, which is again GBK.

How to use UTF-8

In order for the whole process to be displayed normally without garbled characters, all three stages should be set to UTF-8.

Setting the source and execution character sets to UTF-8

You can use the /utf-8 option to set both the source and execution character sets to UTF-8. It is equivalent to specifying /source-charset:utf-8 /execution-charset:utf-8 on the command line.
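If you want the build to fail when the option has been forgotten, one possible approach is a compile-time check on a narrow string literal. This is a sketch under assumptions: execution_charset_is_utf8 is a made-up name, and the check only covers the execution character set, not the source character set.

```cpp
// "\u6211" is the universal character name for U+6211, so the check does not
// depend on how this source file itself is encoded.
constexpr bool execution_charset_is_utf8() {
    return sizeof("\u6211") == 4 &&  // 3 UTF-8 bytes plus the terminating '\0'
           static_cast<unsigned char>("\u6211"[0]) == 0xE6 &&
           static_cast<unsigned char>("\u6211"[1]) == 0x88 &&
           static_cast<unsigned char>("\u6211"[2]) == 0x91;
}

// Stops the build if narrow string literals are not encoded as UTF-8.
static_assert(execution_charset_is_utf8(),
              "narrow string literals are not UTF-8; build with /utf-8");
```

With a GBK execution character set the literal is only two bytes long, so the sizeof test fails and the static_assert reports the problem at compile time.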

Set this compiler option in the Visual Studio development environment

  1. Open the project's Property Pages dialog box. For more information, see Setting C++ Compiler and Build Properties in Visual Studio.
  2. Select the Configuration Properties > C/C++ > Command Line property page.
  3. In Additional Options, add the /utf-8 option to specify your preferred encoding.
  4. Select OK to save changes.

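Setting the parsing character set to UTF-8

The SetConsoleCP function sets the input code page used by the console associated with the calling process. The console uses its input code page to convert keyboard input into corresponding character values. Its counterpart for output, SetConsoleOutputCP, sets the output code page, which is the one that affects what printf and cout display.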

BOOL WINAPI SetConsoleCP(
  _In_ UINT wCodePageID
);

Add the following code to the main() function (it needs <windows.h> and <iostream>):

std::cout << "GetConsoleOutputCP: " << GetConsoleOutputCP() << std::endl;
SetConsoleOutputCP(65001); // 65001 is UTF-8; see Code Page Identifiers
std::cout << "GetConsoleOutputCP: " << GetConsoleOutputCP() << std::endl;
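Changing the console code page affects the console the program runs in, so it can be polite to restore the previous value on exit. The sketch below wraps the calls in a small RAII class. Utf8ConsoleScope is a hypothetical name, and the #ifdef guards are only there so the code also compiles on non-Windows systems, where it does nothing.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Switches the console input and output code pages to UTF-8 for the lifetime
// of the object and restores the previous code pages afterwards.
class Utf8ConsoleScope {
public:
    Utf8ConsoleScope() {
#ifdef _WIN32
        old_in_ = GetConsoleCP();         // returns 0 if there is no console
        old_out_ = GetConsoleOutputCP();
        ok_ = SetConsoleCP(CP_UTF8) && SetConsoleOutputCP(CP_UTF8);
#endif
    }
    ~Utf8ConsoleScope() {
#ifdef _WIN32
        if (old_in_ != 0) SetConsoleCP(old_in_);
        if (old_out_ != 0) SetConsoleOutputCP(old_out_);
#endif
    }
    Utf8ConsoleScope(const Utf8ConsoleScope&) = delete;
    Utf8ConsoleScope& operator=(const Utf8ConsoleScope&) = delete;

    // True if both code pages were switched (always true off Windows).
    bool ok() const { return ok_; }

private:
    bool ok_ = true;
#ifdef _WIN32
    UINT old_in_ = 0;
    UINT old_out_ = 0;
#endif
};
```

Declaring Utf8ConsoleScope scope; as the first statement of main() keeps UTF-8 in effect until main() returns. CP_UTF8 is the Windows constant for code page 65001.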

qDebug() << QTextCodec::codecForLocale()->name();
QTextCodec::setCodecForLocale(QTextCodec::codecForName("utf-8")); // Qt output
qDebug() << QTextCodec::codecForLocale()->name();

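In addition, since C++11 you can fix the encoding of an individual string literal by prefixing it with u8, as in const char* str = u8"我";. The literal is then stored as UTF-8 regardless of the execution character set. Note that from C++20 onward u8 literals have type const char8_t[], so this exact declaration no longer compiles in that mode.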
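The effect of the u8 prefix can be checked by copying the literal's bytes into a std::string. This is a minimal sketch: utf8_bytes_of_wo is a made-up name, and the literal uses the universal character name for U+6211 so that the example does not depend on the encoding of the source file.

```cpp
#include <string>

// Copies the bytes of a u8 literal into a std::string. The cast keeps the
// code valid both before C++20 (const char[]) and from C++20 on (const char8_t[]).
std::string utf8_bytes_of_wo() {
    const auto& lit = u8"\u6211";  // U+6211, always encoded as UTF-8
    return std::string(reinterpret_cast<const char*>(lit), sizeof(lit) - 1);
}
```

The returned string should hold the three bytes E6 88 91 whatever the execution character set is, since the u8 prefix overrides it for that one literal.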

/source-charset (Set source character set) | Microsoft Docs

/utf-8 (sets source and execution character sets to UTF-8) | Microsoft Docs

SetConsoleCP function - Windows Console | Microsoft Docs

Code pages | Microsoft Docs

Code page identifier - Win32 apps | Microsoft Docs

Research on C++ UTF8 Chinese encoding processing in MSVC

Origin blog.csdn.net/no_say_you_know/article/details/126695461