A brief history of Chinese character encoding

Encoding

A character is a general term for written symbols of all kinds, including the letters and ideographs of national scripts, punctuation marks, graphic symbols, and digits. A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common ones include ASCII, ISO 8859, GB2312, Big5, GB18030, and Unicode. For a computer to process the characters of a character set accurately, it needs a character encoding: a rule that maps each character to a byte sequence the computer can store and recognize.

1. ASCII

ASCII (American Standard Code for Information Interchange) is a single-byte encoding. In the early days of computing, only English had to be supported. ASCII defines 128 code points (0 to 127), which cover the English letters, digits, punctuation, and a set of control characters. A byte can hold 256 different values, so the upper half (128 to 255) is left unused by ASCII.
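As a small illustration in Python (the variable names are only for this sketch), every ASCII character encodes to one byte whose value is below 128, and text outside that range is rejected by the codec:

```python
# Encode plain English text with the ASCII codec: one byte per character.
text = "Hello, world!"
ascii_bytes = text.encode("ascii")
print(len(text), len(ascii_bytes))

# Every ASCII byte value fits in 7 bits (0..127).
all_seven_bit = all(b < 128 for b in ascii_bytes)
print(all_seven_bit)

# A Chinese character has no ASCII code, so encoding it fails.
try:
    "\u4e2d".encode("ascii")  # the character 中
    chinese_fits_ascii = True
except UnicodeEncodeError:
    chinese_fits_ascii = False
print(chinese_fits_ascii)
```

The failure in the last step is the gap that the Chinese encodings described next were created to fill.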

2. GB2312, GBK and GB18030

 

(1)GB2312

When computers reached China, ASCII had no spare byte values left to represent Chinese characters, and more than 6,000 commonly used characters needed codes. GB2312 was therefore designed as a Chinese extension of ASCII: bytes up to 127 keep their ASCII meaning, and each Chinese character is written as two bytes that are both greater than 127. This keeps GB2312 compatible with ASCII.
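The ASCII compatibility can be checked with Python's built-in "gb2312" codec. This is only a sketch; the sample character 中 and the variable names are chosen for illustration:

```python
# The character 中 is a common character covered by GB2312.
gb2312_bytes = "\u4e2d".encode("gb2312")
print(gb2312_bytes.hex())

# Both bytes are above 127, so neither can be mistaken for an ASCII byte.
both_high = all(b > 127 for b in gb2312_bytes)
print(both_high)

# ASCII characters keep their single-byte ASCII values inside GB2312 text.
mixed = "abc\u4e2d".encode("gb2312")
ascii_prefix_unchanged = mixed[:3] == b"abc"
print(len(mixed), ascii_prefix_unchanged)
```

Because ASCII bytes and Chinese bytes never overlap, a decoder can tell from the high bit alone whether a byte belongs to a Chinese character.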

(2)GBK

But Chinese has far more characters than that, and it soon turned out that many personal names could not be typed, so the unused code points of GB2312 were pressed into service. When even that was not enough, the requirement that the low (second) byte also be greater than 127 was dropped. As long as the first byte is greater than 127, it marks the start of a Chinese character, whether or not the byte that follows falls in the extended range. The resulting extended scheme is called the "GBK" standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.
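A short Python sketch of the relaxed second-byte rule follows. The helper `can_encode` and the sample character 丂 (U+4E02) are assumptions made for this example, not taken from the original article:

```python
def can_encode(ch, codec):
    """Return True if the codec has a code for the given character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

rare = "\u4e02"  # the character 丂, absent from GB2312 but present in GBK

in_gb2312 = can_encode(rare, "gb2312")
gbk_bytes = rare.encode("gbk")
lead, trail = gbk_bytes

print(in_gb2312)              # GB2312 has no code for this character
print(hex(lead), hex(trail))  # lead byte > 127, trail byte below 128

# Characters already in GB2312 keep the same bytes under GBK.
same_as_gb2312 = "\u4e2d".encode("gbk") == "\u4e2d".encode("gb2312")
print(same_as_gb2312)
```

One consequence of this design is that the second byte of a GBK character can look like an ASCII byte, so GBK text has to be scanned from the start rather than byte by byte in isolation.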

(3)GB18030

Later, ethnic minority scripts also needed computer support, so several thousand characters for them were added and GBK was extended into GB18030. With that, the written culture of the Chinese nation could be carried into the computer age.
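To see the extension in practice, the sketch below encodes a Tibetan letter with Python's "gbk" and "gb18030" codecs. The choice of a Tibetan letter as the sample is an assumption for illustration, since the text above does not name specific scripts, and `can_encode` is a hypothetical helper:

```python
def can_encode(ch, codec):
    """Return True if the codec has a code for the given character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

tibetan = "\u0f40"  # Tibetan letter KA

in_gbk = can_encode(tibetan, "gbk")
gb18030_bytes = tibetan.encode("gb18030")
print(in_gbk, len(gb18030_bytes))

# GB18030 stays compatible with GBK for characters GBK already covers.
compatible = "\u4e2d".encode("gb18030") == "\u4e2d".encode("gbk")
print(compatible)

# The bytes decode back to the original character.
round_trip = gb18030_bytes.decode("gb18030") == tibetan
print(round_trip)
```

GB18030 makes room for the extra characters by adding four-byte codes alongside the one-byte and two-byte codes inherited from ASCII and GBK.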

 

3. Unicode

Later, people began to feel that having so many incompatible encodings made the world needlessly complicated, so they sat down together to work out a common solution. International organizations developed the Unicode character set, which assigns a single, unique number to every character in every language, to meet the needs of cross-language and cross-platform text conversion and processing.

At present, computers commonly use 2 bytes (16 bits) to store each of these numbers, as in UCS-2 or UTF-16. This should not be confused with DBCS (Double Byte Character Set), a term that refers to legacy encodings such as GBK. Characters stored this way are also called wide characters. For example, under Windows 2000, the string "中文123" is stored in memory as 5 numbers, 10 bytes in total. Characters outside the Basic Multilingual Plane need two 16-bit units in UTF-16.
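The byte count can be reproduced in Python. This sketch assumes the example string is 中文123 (two Chinese characters followed by three digits) and uses the "utf-16-le" codec as a stand-in for the 16-bit in-memory form:

```python
text = "\u4e2d\u6587123"  # the string 中文123

# Each character has one Unicode number (code point).
code_points = [ord(ch) for ch in text]
print([hex(cp) for cp in code_points])

# Stored as 16-bit units, every one of these characters takes 2 bytes.
utf16_bytes = text.encode("utf-16-le")
print(len(text), len(utf16_bytes))
```

Even the digits take two bytes each in this form, which is why a fixed 16-bit representation is wasteful for mostly-ASCII text.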

4. UTF-8

Unicode is a specification that assigns a number (code point) to each character, while UTF (Unicode Transformation Format) defines how those numbers are stored and transmitted as bytes. UTF-8, the most widely used of these encodings, is variable length: it uses 1 to 4 bytes per character, with ASCII characters taking 1 byte and most Chinese characters taking 3.
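The variable length is easy to observe in Python. In this sketch the four sample characters are chosen only for illustration; the last one lies outside the Basic Multilingual Plane and therefore takes 4 bytes:

```python
samples = [
    "A",           # ASCII letter
    "\u00e9",      # Latin small letter e with acute
    "\u4e2d",      # the Chinese character 中
    "\U0001f600",  # an emoji outside the Basic Multilingual Plane
]

# Number of UTF-8 bytes needed for each sample character.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)

# The same code point has different byte forms in different encodings.
utf8_bytes = "\u4e2d".encode("utf-8")
gbk_bytes = "\u4e2d".encode("gbk")
print(utf8_bytes.hex(), gbk_bytes.hex())
```

The last two lines show why bytes written in one encoding and read in another come out garbled: the code point is the same, but its byte form is not.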

 

 

Original: Solve the Chinese encoding problem of reading Oracle database in Python
