Wei Dongshan Embedded Introductory Notes-Application Development Fundamentals (5)

Chapter 6 Text Display

6.1 Character encoding

6.1.1 Encoding and font

1. Encoding: using numbers to represent characters, for example 0x41 for the character A.
2. Font: the same character can be drawn in different shapes on the screen.
3. What a TXT file actually stores is each character's code value. When the file is opened in Notepad, the shape drawn for each character is determined by the font.

6.1.2 Coding standards
The same character can have different code values under different encoding standards.

1. ASCII code (American Standard Code for Information Interchange)
(1) ASCII uses one byte per character, of which only 7 bits are used, so it can represent 2^7 = 128 characters in total.
(2) The highest bit, bit7, is always 0.
(3) Its drawback is that it covers too few characters.
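
The three points above can be checked with a short Python sketch. The sample string and the variable names are only illustrative assumptions, not part of the original notes.

```python
# Every ASCII character fits in 7 bits, so bit7 of its byte is always 0.
text = "Hello, A!"              # sample text chosen for illustration
data = text.encode("ascii")     # one byte per character

for ch, byte in zip(text, data):
    print(f"{ch!r}: 0x{byte:02x}  bit7={byte >> 7}")

ascii_count = 2 ** 7                        # how many values 7 bits can hold
bit7_values = {byte >> 7 for byte in data}  # set of bit7 values seen in the data
```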

2. ANSI code
(1) When saving a file in Notepad you can choose the "ANSI" encoding, but there is no "ASCII" option.
(2) ANSI is an extension of ASCII and is backward compatible with it:
ASCII characters: still represented by one byte, with bit7 equal to 0;
non-ASCII characters: represented by two bytes, with bit7 equal to 1. Chinese characters, for example, are non-ASCII characters and occupy two bytes.
(3) With ANSI encoding, the matching character set must be selected for the intended characters to display. For example, the default ANSI encoding is GB2312 in mainland China and BIG5 in Hong Kong, Macao and Taiwan.
These incompatibilities exist even within China alone. Other countries have their own default ANSI encodings, so the same TXT file may appear garbled when opened in a different country.
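
As a rough illustration of the GB2312 versus BIG5 mismatch, the sketch below uses Python's built-in "gb2312" and "big5" codecs as stand-ins for the two ANSI defaults. The variable names are made up for this example.

```python
# "中" saved as ANSI on a system whose default is GB2312
saved = "中".encode("gb2312")
print(" ".join(f"0x{b:02x}" for b in saved))  # the two bytes stored in the TXT file

# The same two bytes opened on a system whose ANSI default is BIG5
misread = saved.decode("big5", errors="replace")
print(misread)  # a different character: this is the garbling described above

# ASCII characters are the same single byte in both encodings
ascii_same = "A".encode("gb2312") == "A".encode("big5") == b"A"
```

The file content never changes; only the interpretation of the same bytes differs between the two systems.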

3. UNICODE encoding
(1) UNICODE solves the garbling caused by differing encodings: every character on earth is assigned a unique value.
(2) UNICODE remains backward compatible with ASCII, and every other character also has its own value. For example, "中" and "笢" are two distinct characters with the values 0x4e2d and 0x7b22 respectively.
(3) UNICODE values range from 0x0000 to 0x10FFFF, which is 0x110000 = 1,114,112 values in total. More than one million characters can be represented, which is enough for everyone on earth.
(4) When saving a TXT file, Notepad offers four UNICODE encoding implementations: "UTF-16 LE", "UTF-16 BE", "UTF-8" and "UTF-8 with BOM".
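
A small Python sketch of the points above: `ord` returns the UNICODE value of a character, and `sys.maxunicode` is the largest value Python accepts. The variable names are assumptions for illustration.

```python
import sys

# UNICODE values of an ASCII character and of the two characters from the text
values = {ch: ord(ch) for ch in "A中笢"}
for ch, value in values.items():
    print(f"{ch}: 0x{value:04x}")

# Size of the range 0x0000..0x10FFFF, counting both ends
total_values = 0x10FFFF - 0x0000 + 1
print(total_values, hex(sys.maxunicode))
```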

6.1.3 Coding implementation

1. The concept of encoding implementation
An encoding implementation is the way a code value is actually represented as bytes.
For example, the UNICODE value of "中" is 0x4e2d; how should 0x4e2d be represented in a TXT file?
The key question is where to split the byte stream: are the two bytes "0x2d 0x4e" in a TXT file one unit, or two separate characters?
Some convention is therefore needed to represent a value, and each convention corresponds to a different encoding implementation.
For example, we already know:
(1) In ASCII encoding, one byte represents one character; only 7 bits are used and the highest bit is always 0.
(2) In ANSI encoding, ASCII characters are still represented by one byte (BIT7 is 0). Non-ASCII characters are generally represented by 2 bytes, and BIT7 of those bytes is 1.
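
The splitting rule in (1) and (2) can be sketched as follows. `split_ansi` is a hypothetical helper written for this note; it assumes the simplified rule stated above (bit7 set means the first byte of a 2-byte character) and uses GB2312 as the ANSI encoding.

```python
def split_ansi(data):
    """Split an ANSI byte stream into per-character chunks using bit7.

    Assumption: bit7 == 0 is a 1-byte ASCII character, bit7 == 1 is the
    first byte of a 2-byte character (the simplified rule from the text).
    """
    chunks = []
    i = 0
    while i < len(data):
        width = 2 if data[i] & 0x80 else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


data = "A中B".encode("gb2312")
chunks = split_ansi(data)
decoded = [chunk.decode("gb2312") for chunk in chunks]
print(chunks, decoded)
```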

2. Implementations of UNICODE encoding

1. Using 3 bytes for every UNICODE value
Drawback: it wastes too much space, because information that fits in one or two bytes is always stored in three.
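
To make the waste concrete, here is a minimal sketch of such a fixed 3-byte scheme. `encode_fixed3` is a hypothetical function, and the big-endian byte order is an assumption, since the notes do not specify one.

```python
def encode_fixed3(text):
    """Store every UNICODE value in exactly 3 bytes (big-endian, assumed)."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


text = "AB中"
fixed3 = encode_fixed3(text)
padding = fixed3.count(0)  # zero bytes that carry no information here
print(len(fixed3), padding)
```

For this sample, more than half of the stored bytes are padding.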

2. UTF-16 LE and UTF-16 BE
(1) Since three bytes are wasteful, use two bytes instead. Two bytes can represent 2^16 = 65536 characters, which covers the characters in common use around the world.
(2) This raises another problem: byte order (endianness).
Little-endian: the lower-weight byte of the value is placed first.
Big-endian: the lower-weight byte of the value is placed last.
For example, the code value 0x41 is stored as 0x00 0x41 in big-endian and as 0x41 0x00 in little-endian.
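
The two byte orders can be reproduced with Python's `int.to_bytes`. This is only a sketch with illustrative variable names.

```python
value = 0x41
big = value.to_bytes(2, "big")        # lower-weight byte last
little = value.to_bytes(2, "little")  # lower-weight byte first

zhong_big = (0x4e2d).to_bytes(2, "big")
zhong_little = (0x4e2d).to_bytes(2, "little")

for name, raw in [("big", big), ("little", little),
                  ("zhong_big", zhong_big), ("zhong_little", zhong_little)]:
    print(name, " ".join(f"0x{b:02x}" for b in raw))
```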

(3) UCS-2 Little endian / UTF-16 LE
LE stands for little-endian. The bytes "0xff 0xfe" at the beginning of the file mark it as "UTF-16 LE".
For example, "A" is represented by the two bytes "0x41 0x00", and "中" by the two bytes "0x2d 0x4e".

(4) UCS-2 Big endian / UTF-16 BE
BE stands for big-endian. The bytes "0xfe 0xff" at the beginning of the file mark it as "UTF-16 BE".
For example, "A" is represented by the two bytes "0x00 0x41", and "中" by the two bytes "0x4e 0x2d".

(5) Both methods above represent each UNICODE value with 2 bytes, which has three drawbacks:
① The number of characters that can be represented is limited.
② Space is wasted on ASCII characters.
③ If a single byte is lost from the file, every following character is misaligned and can no longer be displayed correctly.
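
The file headers and drawback ③ can be sketched in Python with the standard `codecs` BOM constants and the "utf-16-le" / "utf-16-be" codecs. The sample strings and variable names are assumptions for illustration.

```python
import codecs

text = "A中"
le_file = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # starts with ff fe
be_file = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # starts with fe ff
print(" ".join(f"{b:02x}" for b in le_file))
print(" ".join(f"{b:02x}" for b in be_file))

# Drawback 3: lose one byte and every later 2-byte unit is misaligned.
payload = "中A中A".encode("utf-16-le")
damaged = payload[:2] + payload[3:]  # one byte of the second character is lost
shifted = damaged.decode("utf-16-le", errors="replace")
print(shifted)  # only the first character survives
```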

UTF8 was introduced to solve these problems.

3. UTF8
(1) UTF8 is a variable-length encoding. UTF8 files come in two formats: with a header (BOM) and without one.
(2) ASCII characters are stored in a UTF8 file directly as their ASCII codes, for example 0x61 for the character a and 0x62 for the character b. As before, ASCII characters are distinguished from non-ASCII characters by bit7 being 0.
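
A quick way to see both file formats is Python's "utf-8" and "utf-8-sig" codecs, where "utf-8-sig" writes the header. This is a sketch; the variable names are illustrative.

```python
import codecs

plain = "ab".encode("utf-8")         # without header
with_bom = "ab".encode("utf-8-sig")  # with header (BOM)

print(" ".join(f"{b:02x}" for b in plain))
print(" ".join(f"{b:02x}" for b in with_bom))

# After the header, the ASCII characters are stored as their ASCII codes.
body = with_bom[len(codecs.BOM_UTF8):]
```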

(3) Non-ASCII characters use variable-length encoding: the high bits of each byte carry length information. Take the bytes 0xe4 0xb8 0xad as an example.
The binary value of 0xe4 is "11100100". Its high bits contain three 1s, which means that 3 bytes, starting from the current byte, together express one UNICODE value.
The binary value of 0xb8 is "10111000". Its high bits contain a single 1, which means that only this 1 byte, counted from the current byte, takes part in expressing the UNICODE value.
The binary value of 0xad is "10101101". Its high bits also contain a single 1, with the same meaning.
After the leading "1110", "10" and "10" are removed, the remaining bits "0100", "111000" and "101101" are concatenated into "0100111000101101", which is 0x4e2d in hexadecimal, the UNICODE value of "中". This length information is how the encoding of non-ASCII characters avoids the garbling caused by byte loss, as detailed in (4).
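
The bit manipulation described in (3) can be sketched by hand as below. `leading_ones` and `decode_one` are hypothetical helpers for this note; they assume well-formed input and do no error checking.

```python
def leading_ones(byte):
    """Count the consecutive 1 bits at the high end of a byte."""
    count = 0
    mask = 0x80
    while byte & mask:
        count += 1
        mask >>= 1
    return count


def decode_one(data):
    """Decode the first UTF8 character in data.

    Returns (UNICODE value, number of bytes used). Assumes valid input.
    """
    n = leading_ones(data[0])
    if n == 0:                   # bit7 is 0: plain ASCII, one byte
        return data[0], 1
    # First byte: strip the n leading 1s and the 0 that follows them.
    value = data[0] & (0xFF >> (n + 1))
    # Each following byte: strip the leading "10", keep the low 6 bits.
    for byte in data[1:n]:
        value = (value << 6) | (byte & 0x3F)
    return value, n


value, used = decode_one(b"\xe4\xb8\xad")
print(hex(value), used)
```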

(4) Because every byte carries its own length information, losing some data from the TXT file only affects the display of the current character; the following characters are not affected.

For example, suppose the byte 0xb8 is lost from the example above. The application reads 0xe4 and knows that three bytes, including 0xe4 itself, represent this non-ASCII character, so it expects the next two bytes to begin with 10 and take part in the encoding. With 0xb8 missing, it reads 0xad, which begins with 10, but the byte after 0xad does not begin with 10, so that byte does not belong to this character. The character is therefore cut short at 0xad and displays as garbage. Since character boundaries are controlled by the length information in the bytes, the loss does not shift the alignment of later bytes, and all subsequent characters are unaffected.

As another example, suppose the byte 0xe4 is lost. The application reads 0xb8 and 0xad, both of which begin with 10, and cannot form a character from them: a single byte whose bit7 is 1 is neither an ASCII character nor a complete non-ASCII character, since non-ASCII characters take at least two bytes. These bytes display as garbage. When the byte after 0xad does not begin with 10, normal decoding resumes, and because the two stray bytes begin with 10, the following bytes are not absorbed into them.
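
Both loss scenarios can be tried with Python's built-in UTF8 decoder, using `errors="replace"` so that undecodable bytes become replacement characters instead of raising an error. This is a sketch of the behaviour, not of how any particular text editor works; the surrounding "A" and "B" are added here only to show that neighbours survive.

```python
good = "A中B".encode("utf-8")        # 41 e4 b8 ad 42

lost_middle = good[:2] + good[3:]    # 0xb8 is lost: 41 e4 ad 42
lost_lead = good[:1] + good[2:]      # 0xe4 is lost: 41 b8 ad 42

shown_middle = lost_middle.decode("utf-8", errors="replace")
shown_lead = lost_lead.decode("utf-8", errors="replace")

print(repr(shown_middle))
print(repr(shown_lead))
```

In both cases only the damaged character is garbled; the characters before and after it decode normally.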

Question:
What if the lost data is not a whole number of bytes but some number of bits? Would that misalign the data and garble everything after it? Or can data only be lost in units of whole bytes?

Origin blog.csdn.net/San_a_fish_of_dream/article/details/113824702