Unicode encoding standard and UTF

Encoding standards

  • ASCII

    • American Standard Code for Information Interchange

    • Covers modern English and other Western European languages

    • 128 characters

  • GB2312-80

    • The Chinese national standard for Chinese character encoding, formulated in 1980. It includes 7445 characters in total, 6763 of which are Chinese characters. GB2312 is compatible with standard ASCII and encodes its characters in the extended (8-bit) ASCII code space: each Chinese character occupies two bytes, and the highest bit of each byte is 1.

  • GBK

    • The "Chinese Character Internal Code Extension Specification" (GBK) was formulated in 1995. It is compatible with all Chinese characters in the GB2312, GB13000-1, and BIG5 encodings and uses double-byte encoding. Its code space is 0x8140~0xFEFE, with 23940 code points in total, of which the GBK/1 and GBK/2 areas correspond to the encoding range of GB2312. It contains 21003 Chinese characters. GBK is backward compatible with GB2312 and forward compatible with the ISO 10646.1 international standard, serving as a bridge in the transition from the former to the latter. ISO 10646 is a coding standard published by the International Organization for Standardization (ISO), namely the Universal Multiple-Octet Coded Character Set (UCS), which is fully compatible with the Unicode Consortium's Unicode encoding. ISO 10646.1 is the first part of that standard, "Architecture and Basic Multilingual Plane". China adopted it in 1993 as the national standard GB 13000.1 (that is, GB 13000.1 is equivalent to ISO 10646.1).

  • GB18030

    • The national standard GB18030-2000, "Extension to the Basic Set of the Chinese Coded Character Set for Information Interchange", is China's most important Chinese character encoding standard after GB2312-1980 and GB13000-1993, and one of the basic standards that computer systems in China must follow. It was jointly issued by the Ministry of Information Industry and the State Administration of Quality and Technical Supervision on March 17, 2000, and came into force as a mandatory national standard in January 2001. GB18030-2005, "Information Technology — Chinese Coded Character Set", is a very large mandatory Chinese coded character set, centered on Chinese characters, that also covers several Chinese minority scripts (such as Tibetan, Mongolian, Dai, Yi, Korean, and Uyghur) and contains more than 70,000 Chinese characters.

  • Latin1

    • Latin1 is an alias for ISO-8859-1 (written Latin-1 in some contexts). ISO-8859-1 is a single-byte encoding, backward compatible with ASCII. Its encoding range is 0x00-0xFF: 0x00-0x7F is identical to ASCII, 0x80-0x9F are control characters, and 0xA0-0xFF are text symbols.

    • Beyond the ASCII characters, ISO-8859-1 adds the text symbols of Western European languages; sibling parts of the ISO-8859 family cover Greek, Thai, Arabic, and Hebrew. The euro sign appeared relatively late and is not included in ISO-8859-1.

      Because ISO-8859-1 assigns a character to every value a single byte can hold, a byte stream in any other encoding can be transmitted and stored by a system that treats it as ISO-8859-1 without any bytes being rejected or lost. In other words, any byte stream can safely be interpreted as ISO-8859-1. This is a very important property, and the MySQL database's default encoding of Latin1 takes advantage of it. ASCII is a 7-bit container; ISO-8859-1 is an 8-bit container.

  • Unicode

    • Brief description

      • Unicode comprises both a character set and encoding schemes. The encoding standards listed above are like regional dialects: each can express only the characters of a particular region (for example, ASCII cannot describe Chinese characters, and GBK cannot describe French), whereas Unicode is a globally unified language. If those encodings are dialects, Unicode is a common language developed jointly by countries around the world: a character encoding scheme, maintained by an international organization, that can accommodate all of the world's texts and symbols.

      • Fundamentally, computers just deal with numbers: they store letters and other characters by assigning a number to each of them. Before Unicode was created, there were hundreds of encoding systems assigning these numbers, and no single encoding contained enough characters: the European Community alone required several different encodings to cover all its languages, and even for a single language such as English no one encoding covered every letter, punctuation mark, and common technical symbol. These systems also conflict with one another: two encodings may use the same number for two different characters, or different numbers for the same character. Any given computer (especially a server) needs to support many different encodings, and whenever data passes between encodings or platforms it risks corruption.

    • Character set

      • The Universal Multiple-Octet Coded Character Set, abbreviated UCS. In practice UCS = Unicode: the two are named differently in different places, but they are the same thing. UCS and Unicode here are simply code tables that assign integers to characters.

      • UCS-2

        • 2-byte encoding

      • UCS-4

        • 4-byte encoding

    • Encoding

      • Unicode maps characters to the numbers 0 through 0x10FFFF, which allows for 1114112 code points. A code point is a number that can be assigned to a character. UTF-8, UTF-16, and UTF-32 are the encoding schemes that convert these numbers into bytes a program can store.

        There are several ways to represent a string of such numbers as a sequence of bytes. The two most obvious are to store the text as a sequence of 2-byte or 4-byte values; the official names of these two methods are UCS-2 and UCS-4, respectively, and unless otherwise specified the most significant byte comes first (the big-endian convention).

    • UCS

      • Unicode is developed on the basis of the Universal Character Set (UCS) standard

      • It assigns a unified, unique binary encoding to every character of every language, to meet the requirements of cross-language, cross-platform text conversion and processing.

    • Relationship with UTF

      • The relationship between Unicode and UTF: Unicode is the overall encoding standard, UCS is the character set that establishes the correspondence between characters and numbers, and the UTF standards are the encoding implementations of UCS under the Unicode system (for example: how many bytes a given character occupies, the byte order, how it is stored, and so on).

    • Implementations

      • UTF

        • UCS Transformation Format (the transformation format of the UCS character set), also read as

          Unicode Transformation Format.

          UTF-8, UTF-16, and UTF-32 can all store every character; they differ in space usage. UTF-8 needs only one byte per character for the ASCII range, UTF-32 always uses a fixed four bytes, and UTF-16's overall behavior sits between the two.

        • Endianness

          • Byte Order Mark

            • effect

              • The BOM indicates which of the two possible orders the bytes of a character value are stored in.

              • For example, the code of the Chinese character 字 ("character") is 0x5B 0x57:

                • little endian

                  • 0x57 5B

                • big endian

                  • 0x5B 57

              • A BOM in UTF-8 plays a special role: it is not used to determine byte order but to identify the encoding. If the first three bytes of a file are EF BB BF, the file is declaring: I am a UTF-8 encoded file, not UTF-16 or UTF-32.

          • That is, before the byte stream itself is transmitted, the character used as the BOM, "zero-width no-break space" (U+FEFF), is transmitted first. The byte-swapped readings 0xFFFE (UTF-16) and 0xFFFE0000 (UTF-32) are not valid Unicode characters and should never appear in real data, so a receiver that sees them knows the byte order is reversed.

        • UTF-8

          • Variable length (1-4 bytes); makes the best use of space, but storing and reading characters beyond ASCII requires computation

          • ASCII compatible

          • Independent of endianness: its byte sequence is the same on all systems, so it does not really require a BOM.

          • UTF-8 with BOM

            • The first three bytes of the file are: EF BB BF

              • Reading in C++ requires some processing

              • Java reading does not require manual processing

            • Serves to distinguish the file from UTF-16 or UTF-32 text

          • UTF-8 without BOM

            • Compared with UTF-8 with BOM, the file simply does not begin with EF BB BF

        • UTF-16

          • Variable length (2 or 4 bytes); its space utilization and the computation it requires sit between UTF-8 and UTF-32 (UTF-32 requires no computation because it is not variable length; it just takes up more space).

            UTF-16 encodes the commonly used UCS-2 characters in 2 bytes and uses 4 bytes (a surrogate pair) for less common characters.

          • Implementation-dependent endianness

          • UTF-16LE

            • Little-endian byte order

              • The low-order byte of the value is stored first

            • BOM:FF FE

          • UTF-16BE

            • Big-endian byte order

              • The high-order byte of the value is stored first

            • BOM:FE FF

          • Character set: all of UCS-2, plus part of UCS-4 (via surrogate pairs)

          • UTF-16 big-endian and little-endian files do not require special processing in C++

        • UTF-32

          • UTF-32 is a simple encoding that uses 32 bits of memory (4 bytes) per character. Since the largest Unicode code point is 0x10FFFF (per ISO 10646), 4 bytes can accommodate any character. Its advantage is obvious: it is very simple, and computing string length and locating characters is easy. Its disadvantage is equally obvious: it takes up too much memory.

          • Fixed length: 4 bytes

          • Character set: UCS-4

          • UTF-32LE

            • FF FE 00 00

          • UTF-32BE

            • 00 00 FE FF

        • Use in mainstream programming languages

          • Java

            • Java uses UTF-16 for internal text representation and supports a non-standard modified UTF-8 encoding (which can take up to 6 bytes) for string serialization.

          • C++

            • wchar_t

              Header file <cuchar>

              Provides conversion from multibyte characters to wide characters and vice versa

              size_t mbrtoc16(char16_t* pc16, const char* s, size_t n, mbstate_t* ps);

              size_t c16rtomb(char* s, char16_t c16, mbstate_t* ps);

              size_t mbrtoc32(char32_t* pc32, const char* s, size_t n, mbstate_t* ps);

              size_t c32rtomb(char* s, char32_t c32, mbstate_t* ps);

              • Unicode-encoded characters are generally stored in the wchar_t type, but wchar_t is not equivalent to Unicode; it occupies 16 or 32 bits depending on the platform.

              • The main problem is that wchar_t is 16-bit on Windows but 32-bit (4 bytes) on Linux and macOS. The serious consequence is that the same code cannot behave identically on different platforms. The introduction of char16_t and char32_t solves this: they specify exactly how much memory they occupy, so code behaves consistently on any platform.

                • std::codecvt_utf8 is a std::codecvt facet that converts between UTF-8 encoded byte strings and UCS-2 or UTF-32 character strings (depending on the Elem type). This facet can be used to read and write UTF-8 text and binary files (note that it was deprecated in C++17).

                  readTextFileStream.imbue(locale(locale::classic(), new codecvt_utf8<wchar_t>));

              • Commonly called "wide characters"

            • char16_t corresponds to UTF-16

              • Literal encoding: u"text"

            • char32_t corresponds to UTF-32

              • Literal encoding: U"literal"

            • C++20 adds char8_t to represent UTF-8

              • Literal encoding: u8"text"

C++: checking whether a file is UTF-8 with BOM

	{
		// Check whether a text file starts with the UTF-8 BOM (EF BB BF)
		unsigned char bom[3] = {0};
		ifstream inputFileStream("./someword2.txt", ios_base::binary);
		inputFileStream.read(reinterpret_cast<char *>(bom), 3);
		if (inputFileStream.gcount() == 3)  // actually read 3 bytes
		{
			cout << hex << (int)bom[0] << ' ' << (int)bom[1] << ' ' << (int)bom[2] << endl;
			if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
			{
				cout << "this text file is UTF-8 with BOM" << endl;
			}
		}
	}

Origin blog.csdn.net/wangxudongx/article/details/125126524