The history of computer character encoding

Copyright belongs to the author.
For commercial reprints, please contact the author for authorization, and for non-commercial reprints, please indicate the source.
Author: Yu Yang
Link: https://www.zhihu.com/question/23374078/answer/69732605
Source: Zhihu

A long time ago, a group of people decided to take 8 transistors, each of which could be switched on or off, and combine their states to represent everything in the world. They saw that 8 switch states were good, so they called this a "byte". Later, they built machines that could process these bytes. The machines started up, the bytes combined into many states, and the states began to change. They saw that this was good, so they called the machine a "computer". At first, computers were used only in the United States. An eight-bit byte can take a total of 256 (2 to the 8th power) different states.
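As a quick check of the arithmetic above, the following Python sketch enumerates every on/off combination of 8 switches and confirms that there are 256 of them. The names `states` and `to_number` are illustrative only.

```python
from itertools import product

# Every combination of 8 switches, each either off (0) or on (1).
states = list(product((0, 1), repeat=8))

print(len(states))   # number of distinct byte states
print(2 ** 8)        # the same count, computed directly


def to_number(bits):
    """Read a tuple of bits as an unsigned binary number."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value


# The first combination is all switches off, the last is all switches on.
lowest = to_number(states[0])
highest = to_number(states[-1])
print(lowest, highest)
```

The highest value, with all eight switches on, is 255, which is why byte values run from 0 to 255.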
They set aside the first 32 states, numbered from 0, for special purposes: whenever a terminal or printer received one of these agreed bytes, it had to perform an agreed action. On receiving 0x0A, the terminal starts a new line. On receiving 0x07, the terminal beeps at people. On receiving 0x1B, the printer prints text in reverse video, or the terminal displays letters in color. They saw that this was good, so they called the byte states below 0x20 "control codes". They also assigned consecutive byte states to the space, punctuation marks, digits, and uppercase and lowercase letters, up to number 127, so that the computer could store English text in bytes. Everyone felt good when they saw this, so everyone called the scheme the ANSI "ASCII" encoding (American Standard Code for Information Interchange). All the computers in the world at the time used the same ASCII scheme to store English text.

Later, as with the building of the Tower of Babel, computers came into use all over the world. But many countries did not use English, and many of their letters were not in ASCII. To store their own writing in the computer, they decided to use the unused states after number 127 for these new letters and symbols. They also added the horizontal lines, vertical lines, crosses, and other shapes needed to draw tables, numbering them all the way up to the last state, 255. The characters on this page from 128 to 255 are called the "extended character set". After that, there were no new states left for greedy humans to use. The US imperialists may never have imagined that people in third world countries would also want to use computers!

When the Chinese got computers, no byte states were left to represent Chinese characters, and more than 6,000 commonly used Chinese characters needed to be stored. But this could not stump the wise Chinese people. We unceremoniously threw out the strange symbols after number 127 and stipulated the following: a byte of 127 or less keeps its original meaning, but two bytes larger than 127 placed together represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) runs from 0xA1 to 0xFE, which gives room for more than 7,000 characters, most of them Simplified Chinese characters. In these codes we also included mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation, and letters already in ASCII were all encoded again as two-byte codes. These are the characters usually called "full-width", while the original ones below 127 are called "half-width".
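To make the GB2312 layout concrete, here is a small Python sketch that uses the `gb2312` codec from Python's standard library. The sample characters and the helper name `is_gb2312_pair` are chosen only for illustration; the byte ranges are the ones given in the text.

```python
def is_gb2312_pair(high: int, low: int) -> bool:
    """Check the ranges from the text: high byte 0xA1-0xF7, low byte 0xA1-0xFE."""
    return 0xA1 <= high <= 0xF7 and 0xA1 <= low <= 0xFE


half_width = "A".encode("gb2312")        # plain ASCII letter: one byte
full_width = "\uff21".encode("gb2312")   # full-width letter A: two bytes
hanzi = "中".encode("gb2312")             # a common Chinese character: two bytes

for label, data in [("half-width A", half_width),
                    ("full-width A", full_width),
                    ("hanzi", hanzi)]:
    # Print the byte count and the bytes in hexadecimal.
    print(label, len(data), data.hex())

# Both two-byte codes fall inside the ranges described above.
print(is_gb2312_pair(*full_width), is_gb2312_pair(*hanzi))
```

The half-width letter stays a single ASCII byte, while the full-width letter and the Chinese character each take two bytes greater than 127.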
The Chinese people saw that this was very good, so they called this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII. But China has too many Chinese characters, and we soon found that many people's names could not be typed with it, especially those of certain national leaders who are rather good at giving other people trouble. So we had to keep finding the code points that GB2312 had left unused and dutifully put them to use.
Later that was still not enough, so we simply dropped the requirement that the low byte must also be a code after number 127. As long as the first byte is greater than 127, it always marks the start of a Chinese character, whether or not the byte that follows lies in the extended character set. The resulting expanded scheme is called the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols. Later, ethnic minorities also needed to use computers, so we extended it again, adding several thousand characters for minority scripts, and GBK grew into GB18030. From then on, the culture of the Chinese nation could be passed on in the computer age.

Chinese programmers saw that this series of Chinese character encoding standards was good, so they called them "DBCS" (Double Byte Character Set). The most notable feature of the DBCS family is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme. A program that supports Chinese processing must therefore watch the value of every byte in a string: if the value is greater than 127, a character from the double-byte character set has begun. In those days, every computer monk who had been blessed with the ability to program had to recite the following mantra hundreds of times a day: "One Chinese character counts as two English characters! One Chinese character counts as two English characters..."

At that time every country, like China, came up with its own encoding standard. As a result, nobody understood anybody else's encoding, and nobody supported anybody else's encoding. Even the mainland and Taiwan, brothers only 150 nautical miles apart who use the same language, adopted different DBCS encoding schemes. Back then, Chinese users who wanted their computers to display Chinese characters had to install a "Chinese character system" dedicated to handling the display and input of Chinese characters. But a fortune-telling program written by the ignorant feudal folk in Taiwan would not run until you installed yet another system, the "Yitian Chinese character system", which supports BIG5 encoding. Install the wrong character system and the display turned to garbage! What was to be done? And what about the poor peoples in the forest of nations who could not yet use computers at all: what would become of their scripts?

It was truly the computer's own Tower of Babel! At this moment the Archangel Gabriel arrived just in time: an international organization called ISO (International Organization for Standardization) decided to tackle the problem. Their approach was simple: scrap all the regional encoding schemes and start over with a new code covering every culture, every letter, and every symbol on earth! They planned to call it the "Universal Multiple-Octet Coded Character Set", or UCS for short, commonly known as "Unicode".
By the time Unicode was being drawn up, computer memory capacity had grown enormously, and space was no longer a concern. So ISO stipulated outright that every character must be represented uniformly by two bytes, that is, 16 bits. For the "half-width" ASCII characters, Unicode keeps the original code values unchanged and merely widens them from 8 bits to 16 bits, while the characters of all other cultures and languages were encoded afresh. Because a "half-width" English symbol needs only the lower 8 bits, its upper 8 bits are always 0, so this grand scheme doubles the space needed to store English text.

At this point, programmers from the old society began to notice something strange: their strlen function had become unreliable, and a Chinese character no longer counted as two characters, but as one! Yes: from Unicode onward, a half-width English letter and a full-width Chinese character are both simply "one character", and both are also "two bytes". Note the difference between the terms "character" and "byte": a "byte" is an 8-bit unit of physical storage, while a "character" is a culturally meaningful symbol. In this 16-bit Unicode scheme, one character is two bytes. The era when one Chinese character counted as two English characters was nearly over.

Unicode is not perfect either. There are two problems. The first: how do we tell Unicode apart from ASCII? How does the computer know whether three bytes represent one symbol rather than three separate symbols? The second: we already know that one byte is enough for an English letter. If Unicode required every symbol to be represented by three or four bytes, then every English letter would have to be preceded by two or three zero bytes. That is a huge waste of storage, and text files would become two or three times larger, which is unacceptable.

Unicode could not be widely adopted for a long time, until the Internet emerged. To solve the problem of how to transmit Unicode over networks, a number of UTF (UCS Transformation Format) standards appeared. As the names suggest, UTF-8 transmits data 8 bits at a time, and UTF-16 transmits it 16 bits at a time. UTF-8 is the most widely used implementation of Unicode on the Internet. It is an encoding designed for transmission, and it makes encoding borderless, so that the characters of every culture in the world can be displayed.

One of UTF-8's most important features is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the length varies with the symbol. A character in the ASCII range is represented by a single byte, so the one-byte ASCII encoding is preserved as a subset of UTF-8. (Note that a common Chinese character occupies 2 bytes in the two-byte Unicode form but 3 bytes in UTF-8.) Converting from Unicode to UTF-8 is not a direct copy of the code value; it follows the rules in the table below.

Unicode symbol range (hexadecimal) | UTF-8 encoding (binary)
--------------------------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
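
The table can be turned into code fairly directly. The following Python sketch is one possible hand-written encoder, not the way any particular library implements it. The function name `utf8_encode` is made up for this example, and the built-in `str.encode` is used only to cross-check the result. As a simplifying assumption, the sketch does not reject the surrogate range (D800-DFFF), which real encoders refuse.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the table above."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


for ch in ["A", "é", "中", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # cross-check against the built-in codec
    print(f"U+{ord(ch):04X}", len(encoded), " ".join(f"{b:08b}" for b in encoded))

# Fixed-width two-byte form versus UTF-8 for the same text.
print(len("abc".encode("utf-16-be")), len("abc".encode("utf-8")))
print(len("中".encode("utf-16-be")), len("中".encode("utf-8")))
```

For 中 (U+4E2D) the encoder produces three bytes, matching the third row of the table, while the letter A stays a single byte identical to its ASCII code.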
