What is encoding in computer programming?

Encoding is the process of converting information from one form or format into another. A predetermined method is used to turn characters, digits, or other objects into numbers, or to convert information and data into prescribed electrical pulse signals. Encoding is widely used in computers, televisions, remote controls, and communications. Decoding is the reverse process of encoding. (Source: Baidu Encyclopedia)

At the lowest level, a computer can only store 0s and 1s. An everyday number such as 127 can be converted from decimal to binary and stored as 01111111. But how are Chinese characters, English letters, symbols, and other text stored?
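As a small illustration of the 127 example above, here is a sketch in Python (the language is my choice, not the original article's) that converts the number to its 8-bit binary form and back:

```python
# Convert the decimal number 127 to an 8-bit binary string and back.
value = 127
bits = format(value, "08b")   # zero-padded to 8 binary digits
restored = int(bits, 2)       # parse the binary string as base 2

print(bits)      # 01111111
print(restored)  # 127
```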

The current text encoding standards mainly include ASCII, GB2312, GBK, Unicode, etc. ASCII encoding is the simplest Western encoding scheme. GB2312, GBK, and GB18030 are national standards for Chinese character encoding schemes. ISO/IEC 10646 and Unicode are both international standards for global character encoding.

ASCII

ASCII (American Standard Code for Information Interchange) is a character encoding based on the Latin alphabet, mainly used to represent modern English and other Western European languages. It is the most common information exchange standard and is equivalent to the international standard ISO/IEC 646. ASCII was first published as a standard in 1963, substantially revised in 1967, and last updated in 1986. It defines 128 characters in total.
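To make the character-to-number mapping concrete, here is a minimal Python sketch using the built-in `ord`, `chr`, and the standard `ascii` codec. The sample strings are only illustrations.

```python
# ASCII assigns each of its 128 characters a number from 0 to 127.
code = ord("A")                     # character -> number
char = chr(97)                      # number -> character
encoded = "Hello".encode("ascii")   # one byte per character

# A character outside the 128-entry table cannot be encoded as ASCII.
try:
    "é".encode("ascii")
    non_ascii_ok = True
except UnicodeEncodeError:
    non_ascii_ok = False

print(code, char, list(encoded), non_ascii_ok)
```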


ASCII is sufficient for the United States, but not for other countries. Computer manufacturers in various countries therefore invented their own encodings to represent their national characters. To stay compatible with ASCII, these encodings generally set the highest bit to 1 for the additional characters: a byte whose highest bit is 0 is an ASCII character, and a byte whose highest bit is 1 belongs to a national character. Among these extended encodings, ISO 8859-1 and Windows-1252 are popular in Western European countries, and GB2312, GBK, and Big5 are popular in Chinese-speaking regions.
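The highest-bit rule can be checked with a bit mask. The following is a rough sketch; `is_ascii_byte` is a hypothetical helper name, and GB2312 is used only as one example of an extended encoding:

```python
def is_ascii_byte(b: int) -> bool:
    """Return True when the highest bit of the byte is 0 (values 0-127)."""
    return (b & 0x80) == 0

# "A" is one ASCII byte; the Chinese character is two bytes in GB2312.
data = "A中".encode("gb2312")
flags = [is_ascii_byte(b) for b in data]

print([hex(b) for b in data], flags)
```

Only the first byte is reported as ASCII; both bytes of the Chinese character have the highest bit set.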

ISO-8859-1

ISO 8859-1, also called Latin-1, likewise uses one byte per character, because Western European languages are also written with a small alphabet, although it contains more than the 26 English letters. Values 0 to 127 are the same as ASCII, and values 128 to 255 are given additional meanings: 128 to 159 are control characters, and 160 to 255 are Western European characters and symbols.
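A short Python sketch of the one-byte-per-character property, using the standard `iso-8859-1` codec (the word "café" is just a sample):

```python
text = "café"
encoded = text.encode("iso-8859-1")

# Each character takes exactly one byte; "é" lands in the 160-255 range.
last_byte = encoded[-1]

# Every byte value from 0 to 255 decodes to some character in Latin-1.
all_chars = bytes(range(256)).decode("iso-8859-1")

print(list(encoded), last_byte, len(all_chars))
```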

GB2312

One byte is enough for American and Western European characters, but clearly not for Chinese. The first national standard for Chinese is GB2312. It targets commonly used simplified Chinese characters and contains about 7,000 of them; it excludes some rare characters and all traditional Chinese characters. GB2312 uses two bytes for each Chinese character, and the highest bit of both bytes is 1; a byte whose highest bit is 0 is treated as an ASCII character. The first byte ranges from 1010 0001 (decimal 161) to 1111 0111 (decimal 247), and the second byte ranges from 1010 0001 (decimal 161) to 1111 1110 (decimal 254).
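The byte ranges above can be verified for a sample character. This sketch assumes Python's built-in `gb2312` codec; the character 中 is just an example:

```python
encoded = "中".encode("gb2312")
first, second = encoded

# Check both bytes against the ranges given in the text.
in_range = 0xA1 <= first <= 0xF7 and 0xA1 <= second <= 0xFE

# ASCII characters stay single-byte, so mixed text has mixed widths.
mixed = "A中".encode("gb2312")

print(hex(first), hex(second), in_range, len(mixed))
```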

GBK

GBK is based on GB2312 and is backward compatible with it: a character encoded in GB2312 has exactly the same binary representation in GBK. GBK adds more than 14,000 Chinese characters, for a total of about 21,000, including traditional Chinese characters. GBK also uses a fixed two-byte representation. The first byte ranges from 1000 0001 (decimal 129) to 1111 1110 (decimal 254); the second byte ranges from 0100 0000 (decimal 64) to 0111 1110 (decimal 126) or from 1000 0000 (decimal 128) to 1111 1110 (decimal 254).
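A quick way to see the backward compatibility, sketched with Python's built-in `gbk` and `gb2312` codecs. The traditional character 們 is my own sample of a character that GBK covers and GB2312 does not:

```python
# A GB2312 character has identical bytes in GBK.
same = "中".encode("gbk") == "中".encode("gb2312")

# A traditional character is rejected by GB2312 but accepted by GBK.
try:
    "們".encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

gbk_bytes = "們".encode("gbk")
print(same, in_gb2312, [hex(b) for b in gbk_bytes])
```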

GB18030

GB18030 is backward compatible with GBK. It adds more than 55,000 characters, for a total of more than 76,000, including many ethnic minority scripts and the unified Chinese, Japanese, and Korean characters. Two bytes are no longer enough to represent all of these, so GB18030 uses variable-length encoding: ASCII characters take one byte, and other characters take two or four bytes. In the two-byte encoding, the byte ranges are the same as in GBK. In the four-byte encoding, the first byte ranges from 1000 0001 (decimal 129) to 1111 1110 (decimal 254), the second byte from 0011 0000 (decimal 48) to 0011 1001 (decimal 57), the third byte from 1000 0001 (decimal 129) to 1111 1110 (decimal 254), and the fourth byte from 0011 0000 (decimal 48) to 0011 1001 (decimal 57).
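The variable-length behaviour can be observed with Python's built-in `gb18030` codec. In this sketch the three sample characters are my own picks: an ASCII letter, a common Chinese character, and an emoji that needs the four-byte form.

```python
samples = ["A", "中", "😀"]
lengths = {ch: len(ch.encode("gb18030")) for ch in samples}

# Check the four-byte form against the ranges given in the text.
b1, b2, b3, b4 = "😀".encode("gb18030")
four_byte_ok = (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
                and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39)

# Two-byte characters keep their GBK representation.
gbk_compatible = "中".encode("gb18030") == "中".encode("gbk")

print(lengths, four_byte_ok, gbk_compatible)
```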

Big5

Big5 is for traditional Chinese and is widely used in Taiwan, Hong Kong, and other places. It includes more than 13,000 traditional Chinese characters. Like GB2312, it represents each character with two bytes. The first byte ranges from 1000 0001 (decimal 129) to 1111 1110 (decimal 254); the second byte ranges from 0100 0000 (decimal 64) to 0111 1110 (decimal 126) or from 1010 0001 (decimal 161) to 1111 1110 (decimal 254). Big5 is not compatible with GB18030, GBK, or GB2312.
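The incompatibility shows up as soon as the same character is encoded both ways. A minimal sketch with Python's built-in `big5` and `gbk` codecs, using 中 as the sample:

```python
big5_bytes = "中".encode("big5")
gbk_bytes = "中".encode("gbk")

# Reading Big5 bytes with the wrong codec yields a different character.
misread = big5_bytes.decode("gbk", errors="replace")

print(big5_bytes.hex(), gbk_bytes.hex(), misread)
```

The two byte sequences differ, so text saved in one encoding and opened in the other appears garbled.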

Unicode

Unicode, also known as the Universal Code, is an industry standard in computer science that covers character sets, encoding schemes, and more. Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a unified and unique code to every character in every language, to meet the requirements of cross-language and cross-platform text conversion and processing.

The International Organization for Standardization (ISO) uniformly numbers the letters, symbols, and characters used in all the world's languages. Each character is assigned a unique number (the ASCII numbers remain unchanged), ranging from 0x000000 to 0x10FFFF. This set is called the Universal Multiple-Octet Coded Character Set, UCS for short, and is generally also called Unicode. The Unicode character set only numbers the characters; it does not specify how those numbers are encoded as bytes. Various encoding rules, the Unicode Transformation Formats (UTF), therefore appeared later. Typical ones are UTF-8, UTF-16, and UTF-32.
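In Python, `ord` returns exactly this character number (the code point), independent of any byte encoding. A small sketch with sample characters of my choosing:

```python
cp = ord("中")                 # the character's Unicode number
label = f"U+{cp:04X}"          # conventional U+XXXX notation
ascii_unchanged = ord("A") == 65

# Numbers above 0x10FFFF are outside the Unicode range.
try:
    chr(0x10FFFF + 1)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False

print(label, ascii_unchanged, beyond_range_ok)
```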

UTF-32

Unicode Transformation Format 32 encodes every character of the Unicode character set in 32 bits (4 bytes): the character's Unicode number is converted directly into a 4-byte binary number for storage. Because every character takes 4 bytes, UTF-32 is not compatible with ASCII. Text saved in ASCII appears garbled when opened as UTF-32.
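The direct number-to-bytes mapping is easy to see in a sketch. I use Python's `utf-32-be` codec (big-endian, no byte order mark) so the bytes read left to right; the byte order is my choice for readability.

```python
encoded = "A中".encode("utf-32-be")

# Every character occupies exactly 4 bytes.
width = len(encoded) // 2

# The 4 bytes are simply the character number written in binary.
second_number = int.from_bytes(encoded[4:8], "big")

print(encoded.hex(" ", 4), width, hex(second_number))
```

Even the ASCII letter takes four bytes here, which is why ASCII text and UTF-32 text are not interchangeable.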

UTF-16

Unicode Transformation Format 16 encodes the Unicode character set in 16 bits (2 bytes) or 32 bits (4 bytes). Characters numbered 0~65535 use 2 bytes: the number of each character is converted directly into a 2-byte binary number in the range 0x0000~0xFFFF. The numbers in the range 0xD800~0xDFFF do not represent any characters. UTF-16 uses pairs of them (a value in 0xD800~0xDBFF followed by a value in 0xDC00~0xDFFF) to represent the characters numbered above 0xFFFF, which gives an extended 4-byte encoding. UTF-16 is also not compatible with ASCII.
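The 4-byte mapping can be worked through by hand. The sketch below follows the standard UTF-16 surrogate rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); `to_surrogate_pair` is a hypothetical helper name, and the second half of the pair falls in 0xDC00~0xDFFF.

```python
def to_surrogate_pair(cp):
    """Split a character number above 0xFFFF into two 16-bit units."""
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(ord("😀"))   # character number 0x1F600
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
builtin = "😀".encode("utf-16-be")

# A character at or below 0xFFFF is stored directly in 2 bytes.
two_byte = "中".encode("utf-16-be")

print(hex(high), hex(low), manual == builtin, two_byte.hex())
```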

UTF-8

UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode that uses one to four bytes per character. It can represent any character in the Unicode standard, and ASCII characters are encoded as the same single bytes as in ASCII, so existing software that processes ASCII text can continue to be used with little or no modification. It has therefore gradually become the preferred encoding for email, web pages, and other applications that store or transmit text. (In UTF-8, Chinese characters generally occupy three bytes.)
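A short sketch of the variable length, using Python's built-in `utf-8` codec on four sample characters of my choosing:

```python
samples = ["A", "é", "中", "😀"]
encoded = {ch: ch.encode("utf-8") for ch in samples}
lengths = [len(encoded[ch]) for ch in samples]

# Pure ASCII text produces the same bytes in UTF-8 and in ASCII.
ascii_compatible = "Hello".encode("utf-8") == "Hello".encode("ascii")

for ch in samples:
    print(ch, encoded[ch].hex(" "))
print(lengths, ascii_compatible)
```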

Base64

Base64 is one of the most common encoding methods for transmitting 8-bit byte data on the Internet. It represents binary data using 64 printable characters: every 3 bytes (24 bits) are split into four 6-bit groups, and each group is replaced by the character at that index in the encoding table. Base64 can be used to convey longer identification information in an HTTP environment. Base64-encoded text is not human-readable and must be decoded before it can be read.
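To show how the 64-character table is used, here is a rough hand-written encoder compared against Python's standard `base64` module. `b64_encode` and `ALPHABET` are hypothetical names for this sketch, and the `=` padding for inputs that are not a multiple of 3 bytes follows standard Base64 rather than anything stated above.

```python
import base64

# Index 0-63 -> character, in the standard Base64 order.
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")

def b64_encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)                      # 0, 1, or 2 missing bytes
        n = int.from_bytes(chunk + b"\x00" * pad, "big")   # 24-bit group
        # Split the 24 bits into four 6-bit indexes into the table.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            chars[4 - pad:] = "=" * pad           # mark the missing bytes
        out.extend(chars)
    return "".join(out)

manual = b64_encode(b"Man")
library = base64.b64encode(b"Man").decode("ascii")
print(manual, library)
```

In real code the standard library call is the better choice; the manual version is only there to make the table lookup visible.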

Base64 encoding table

Index  Character  Index  Character  Index  Character  Index  Character
0      A          17     R          34     i          51     z
1      B          18     S          35     j          52     0
2      C          19     T          36     k          53     1
3      D          20     U          37     l          54     2
4      E          21     V          38     m          55     3
5      F          22     W          39     n          56     4
6      G          23     X          40     o          57     5
7      H          24     Y          41     p          58     6
8      I          25     Z          42     q          59     7
9      J          26     a          43     r          60     8
10     K          27     b          44     s          61     9
11     L          28     c          45     t          62     +
12     M          29     d          46     u          63     /
13     N          30     e          47     v
14     O          31     f          48     w
15     P          32     g          49     x
16     Q          33     h          50     y

URL encoding

URL encoding is the format browsers use to package form input. The browser collects all the names and values from the form and sends them to the server, either as part of the URL or separately, using name/value parameter encoding (removing characters that cannot be transmitted, ordering the data, and so on).

URL encoding rules

URL encoding follows these rules: name/value pairs are separated from each other by an & character, and within each pair the name and value are separated by an = character. If the user does not enter a value for a name, the name still appears, but with no value. Any special character (that is, anything that is not simple seven-bit ASCII, such as Chinese characters) is encoded in hexadecimal after a percent sign %, and so are characters with a special meaning such as =, &, and %. In practice, the URL encoding of a character is its byte value written in hexadecimal, with "%" added in front. For example, the ASCII code of "\" is 92, and 92 in hexadecimal is 5c, so the URL encoding of "\" is %5c.
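These rules can be tried out with Python's standard `urllib.parse` module. A minimal sketch follows; the field names `q` and `empty` are made up for the example. Note that `quote` emits uppercase hex digits and encodes non-ASCII text as UTF-8 by default, and that %5C and %5c are equivalent.

```python
from urllib.parse import quote, urlencode, parse_qs

backslash = quote("\\")       # ASCII 92 -> hexadecimal 5C
chinese = quote("中")         # each UTF-8 byte becomes one %XX group

# Pairs are joined with &, name and value with =; the special
# characters = and & inside a value are percent-encoded.
query = urlencode({"q": "a=b&c", "empty": ""})

# Decoding reverses the process; empty values are kept on request.
decoded = parse_qs(query, keep_blank_values=True)

print(backslash, chinese, query, decoded)
```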

URL encoding table

Codes above %7F in this table are ISO-8859-1 (Latin-1) byte values.

Character        Code   Character  Code   Character    Code   Character  Code
backspace        %08    I          %49    v            %76    Ó          %D3
tab              %09    J          %4A    w            %77    Ô          %D4
linefeed         %0A    K          %4B    x            %78    Õ          %D5
carriage return  %0D    L          %4C    y            %79    Ö          %D6
space            %20    M          %4D    z            %7A    Ø          %D8
!                %21    N          %4E    {            %7B    Ù          %D9
"                %22    O          %4F    |            %7C    Ú          %DA
#                %23    P          %50    }            %7D    Û          %DB
$                %24    Q          %51    ~            %7E    Ü          %DC
%                %25    R          %52    ¢            %A2    Ý          %DD
&                %26    S          %53    £            %A3    Þ          %DE
'                %27    T          %54    ¥            %A5    ß          %DF
(                %28    U          %55    ¦            %A6    à          %E0
)                %29    V          %56    §            %A7    á          %E1
*                %2A    W          %57    «            %AB    â          %E2
+                %2B    X          %58    ¬            %AC    ã          %E3
,                %2C    Y          %59    soft hyphen  %AD    ä          %E4
-                %2D    Z          %5A    °            %B0    å          %E5
.                %2E    [          %5B    ±            %B1    æ          %E6
/                %2F    \          %5C    ²            %B2    ç          %E7
0                %30    ]          %5D    ´            %B4    è          %E8
1                %31    ^          %5E    µ            %B5    é          %E9
2                %32    _          %5F    »            %BB    ê          %EA
3                %33    `          %60    ¼            %BC    ë          %EB
4                %34    a          %61    ½            %BD    ì          %EC
5                %35    b          %62    ¿            %BF    í          %ED
6                %36    c          %63    À            %C0    î          %EE
7                %37    d          %64    Á            %C1    ï          %EF
8                %38    e          %65    Â            %C2    ð          %F0
9                %39    f          %66    Ã            %C3    ñ          %F1
:                %3A    g          %67    Ä            %C4    ò          %F2
;                %3B    h          %68    Å            %C5    ó          %F3
<                %3C    i          %69    Æ            %C6    ô          %F4
=                %3D    j          %6A    Ç            %C7    õ          %F5
>                %3E    k          %6B    È            %C8    ö          %F6
?                %3F    l          %6C    É            %C9    ÷          %F7
@                %40    m          %6D    Ê            %CA    ø          %F8
A                %41    n          %6E    Ë            %CB    ù          %F9
B                %42    o          %6F    Ì            %CC    ú          %FA
C                %43    p          %70    Í            %CD    û          %FB
D                %44    q          %71    Î            %CE    ü          %FC
E                %45    r          %72    Ï            %CF    ý          %FD
F                %46    s          %73    Ð            %D0    þ          %FE
G                %47    t          %74    Ñ            %D1    ÿ          %FF
H                %48    u          %75    Ò            %D2
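The codes for accented letters in the table are single-byte ISO-8859-1 values. As a hedged illustration with Python's `urllib.parse`, the same character produces a different percent-encoding when the text is UTF-8, which is the default for `quote` (é is just a sample character):

```python
from urllib.parse import quote, unquote

latin1 = quote("é", encoding="iso-8859-1")   # one byte, as in the table
utf8 = quote("é")                            # two bytes under UTF-8

# Decoding must use the same character encoding that produced the bytes.
decoded = unquote("%E9", encoding="iso-8859-1")

print(latin1, utf8, decoded)
```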

Origin blog.csdn.net/wujakf/article/details/129240227