gbk, utf-8: what are they? The most complete tutorial on Python character encoding is here

Encoding problems have always been a headache for Python learners. What are the gbk and utf-8 that we see so often? Taking advantage of today's holiday, I will talk about the origin and development of character encoding. If you like it, remember to bookmark, follow, and like.

Origin of the problem

While learning Python, we often run into encoding problems like the ones below. Sometimes we need to choose gbk, sometimes utf-8. And that is not the end of it: we also meet other encodings such as gb2312 and gb18030. So where does character encoding actually come from? Today we will use "storytelling" to get to know it.

picture

1) The story of the beacon soldier

Before we officially begin, take a look at the picture below. Let's call it the story of "The Soldiers of the Beacon Fire" for now. How is this story related to encoding? Listen on.

picture

Read this string of numbers from right to left.

Lighting the first beacon means there is one soldier, and lighting the second beacon means there are two soldiers, so 2 beacons can represent up to 3 soldiers. To sort out the logic: if no beacon is lit, there are zero soldiers; if only the first beacon is lit, there is one soldier; if the second beacon is lit and the first is extinguished, there are two soldiers; if both beacons are lit, there are three soldiers.

To sum up: 2 beacons can represent 0, 1, 2, or 3 soldiers, with a maximum of 1+2=3. 3 beacons can represent 0, 1, 2, 3, 4, 5, 6, or 7 soldiers, with a maximum of 1+2+4=7. And so on...

From the description above, you may have noticed: isn't this exactly the binary number system used in computers? There are only 0 and 1; 0 means the beacon is extinguished and 1 means it is lit. In a computer, 0 means off and 1 means on. Next, classmate Huang will talk with you about "0 and 1 in the computer".
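The beacon counting above can be sketched in a few lines of Python. This is only an illustration: the function names `beacons_to_count` and `max_count` are made up for this example, and the list is written with the highest-value beacon first, the same way we write binary numbers.

```python
def beacons_to_count(beacons):
    """Interpret a list of beacon states as a binary number.

    Each entry is 1 (lit) or 0 (extinguished). The last entry is the
    "first" beacon (worth 1), the one before it is worth 2, then 4, ...
    """
    count = 0
    for state in beacons:
        count = count * 2 + state
    return count


def max_count(n_beacons):
    """Largest number n beacons can signal: 1 + 2 + 4 + ... = 2**n - 1."""
    return 2 ** n_beacons - 1


for states in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(states, "->", beacons_to_count(states), "soldiers")

print("3 beacons can signal at most", max_count(3), "soldiers")
# Python's built-in int() does the same conversion for a string of 0s and 1s.
print(int("111", 2))
```

Two beacons give the four values 0 to 3, and three beacons give the eight values 0 to 7, matching the sums 1+2 and 1+2+4 in the story.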

At the bottom layer, a computer is made of circuits, the same "circuit" you met in junior high school physics, and a circuit only knows two states: 0 means off, 1 means on, and nothing else. But think about it: if a circuit has only 0s and 1s, how can it show this colorful world? The answer is that clever engineers encoded the letters and symbols used in daily life as patterns such as 0101010..., so that the computer can represent text. So first remember one key rule: "decode with the same encoding that was used to encode".

Because the computer was invented by Americans, the earliest character encoding, ASCII, was also designed for Americans. It contains only what Americans use every day: the 26 English letters (upper and lower case), digits, punctuation, and some special characters, so the earliest computers could only handle these. Don't wonder why: nobody at the time expected computers to spread so quickly all over the world, and no one can predict the future.

Now let's look at the earliest computer code, ASCII, more closely. ASCII is a 7-bit code, but each character is stored in 8 bits, that is, one byte. The first (highest) bit is always 0 and was left for future expansion; each of the remaining 7 bits is either 0 or 1. The extra bit is there because computers work most naturally with powers of two such as 2, 4, 8, 16, 32, and 7 is not one of them, so the code was padded to 8 bits. By simple counting, ASCII can represent 2^7=128 code points, that is, 128 different symbols. These were enough for Americans at the time, and this is how the earliest computer code was planned.
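A quick way to see this is to look at one ASCII character in Python. The sketch below uses only built-in functions; the variable names are chosen for illustration.

```python
letter = "a"

code_point = ord(letter)              # the number ASCII assigns to "a"
bits = format(code_point, "08b")      # that number as 8 binary digits
ascii_bytes = letter.encode("ascii")  # the single byte that is stored

print(code_point, bits, ascii_bytes)

# Every one of the 2**7 = 128 ASCII code points fits in one byte
# whose highest bit is 0.
all_ascii = [chr(i) for i in range(2 ** 7)]
print(len(all_ascii))
print(all(format(ord(c), "08b")[0] == "0" for c in all_ascii))
```

The leading 0 in the 8-bit pattern is the "extension bit" described above.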

picture

2) The development of computers in China

As computers spread all over the world, we found that the original code points were no longer enough to store the characters and symbols of many countries. To make this clear, we take the development of computers in China as an example.

From the previous description, we already know that the earliest character encoding, ASCII, contains no Chinese. As computers became popular in China, we needed the computer to represent Chinese, so what could be done? For this purpose, China developed its own encodings, of which gbk is the best known. These characters cannot simply be put into ASCII, because a single byte has only 8 bits and at most 2^8=256 slots. Storing more than 90,000 Chinese characters in that space is obviously impossible (even the roughly 3,000 characters in common use would not fit). Therefore, in gbk a Chinese character is represented by 2 bytes, twice the length of an ASCII character. That is 16 bits, for at most 2^16=65536 slots, which is more than enough for the commonly used Chinese characters. However, it is still impossible to store all Chinese characters this way. That is the price of a culture with such a long and profound history.
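The difference between ASCII and gbk is easy to check in Python. This is a minimal sketch using the standard `ascii` and `gbk` codecs; the sample character and variable names are chosen for illustration.

```python
text = "中"

# ASCII has no slot for a Chinese character, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# gbk stores the same character in 2 bytes.
gbk_bytes = text.encode("gbk")

print("ASCII can encode it:", ascii_ok)
print("gbk bytes:", gbk_bytes, "length:", len(gbk_bytes))

# English letters still take a single byte in gbk, the same byte as in ASCII.
print("a".encode("gbk") == "a".encode("ascii"))
```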

When it comes to gbk, we have to mention its siblings (as shown in the figure): GB2312, GBK, and GB18030. They form one family and were derived step by step as needs grew. Each later encoding is backward compatible with the earlier one. GB18030 covers the largest number of characters, which is why, when reading Excel tables with Python, sometimes neither GB2312 nor GBK works and GB18030 must be used.

picture
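The relationship between the three encodings can be probed with Python's built-in `gb2312`, `gbk`, and `gb18030` codecs. The helper `can_encode` and the three sample characters are chosen only for illustration: a common simplified character, a traditional character, and an emoji.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


samples = ["中", "體", "😀"]
codecs_to_try = ["gb2312", "gbk", "gb18030"]

for ch in samples:
    print(ch, {name: can_encode(ch, name) for name in codecs_to_try})

# A character that exists in all three is stored as the same bytes in each,
# which is what makes the later encodings compatible with the earlier ones.
same_bytes = "中".encode("gb2312") == "中".encode("gbk") == "中".encode("gb18030")
print(same_bytes)
```

When a file fails to decode as gbk, trying gb18030 is therefore a reasonable next step, since it covers everything the other two do.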

3) How the computer is compatible with multiple languages

Computers did not develop only in China; they were developing rapidly all over the world. If China has its own GBK code, then South Korea and Japan must have their own codes too. But this is the era of "economic globalization", and no country can develop in isolation. If you do business internationally, text written in China will show up as garbled characters when it is opened abroad. How embarrassing is that? So how was this problem finally solved?

picture

To this end, something called "Unicode", also known as the "Universal Code", was invented. The name says it all: a universal code must contain the characters of the whole world! So what exactly is the universal code? Listen to what classmate Huang tells you.

Computer storage generally grows by doubling: 1 byte, 2 bytes, 4 bytes... ASCII uses 1 byte, so the original Unicode encoding, known as ucs-2, uses 2 bytes per character, giving at most 2^16=65536 slots, which is still not enough for the characters of the whole world. So ucs-4 was produced, using 4 bytes per character, for a total of 2^32=4,294,967,296 slots (about 4.3 billion). However, according to statistics, all the world's characters, digits, and symbols add up to about 230,000. With more than 4 billion slots, ucs-4 simply wastes space, and for file transfer it wastes traffic.
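The fixed widths can be observed in Python. Python has no codec named ucs-2, so this sketch assumes `utf-16-be` as a stand-in (for the characters used here it produces the same 2 bytes per character) and `utf-32-be` for ucs-4; the variable names are illustrative.

```python
ucs2_slots = 2 ** 16
ucs4_slots = 2 ** 32
print(ucs2_slots, ucs4_slots)

# Bytes used per character in each fixed-width form.
widths = {}
for ch in ["a", "中"]:
    two_byte_form = ch.encode("utf-16-be")   # stand-in for ucs-2
    four_byte_form = ch.encode("utf-32-be")  # stand-in for ucs-4
    widths[ch] = (len(two_byte_form), len(four_byte_form))
    print(ch, two_byte_form, four_byte_form)

# Even a plain English letter costs 4 bytes in the 4-byte form,
# three of which are zero.
print("a".encode("utf-32-be"))
```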

picture

To save space, utf-8 was invented on the basis of Unicode: a variable-length encoding of Unicode characters. For English, utf-8 uses the same single byte (8 bits) as ASCII; most European characters take 16 bits, that is, 2 bytes; common Chinese characters take 24 bits, that is, 3 bytes (rarer characters can take 4). Although this costs Chinese text more space than the 2 bytes of gbk, it is already a very good way to unify the characters of the whole world while saving space (after all, no scheme can suit everyone perfectly, and the script with the most characters has to give up a little).
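The variable length is easy to observe. The helper name `utf8_length` and the four sample characters below are chosen for illustration.

```python
def utf8_length(ch):
    """Number of bytes utf-8 uses to store one character."""
    return len(ch.encode("utf-8"))


samples = {
    "a": "English letter",
    "\u00e9": "European letter (e with acute accent)",
    "中": "common Chinese character",
    "😀": "emoji",
}

for ch, description in samples.items():
    print(f"{description}: {ch!r} -> {utf8_length(ch)} byte(s)")

# For English text, utf-8 produces exactly the same bytes as ASCII.
print("hello".encode("utf-8") == "hello".encode("ascii"))
```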

Coding Knowledge Summary

1) The history of character encoding

picture

2) Take the lowercase letter a as an example to illustrate the character encoding

picture

3) Write some code to get to know character encoding

① About the difference between Python2 and Python3

In Python2, the default source encoding is ASCII, so when writing Chinese in a Python2 file, the first line usually carries the declaration # -*- coding: utf-8 -*-. After reading this article, I think you understand why. But Python2 is no longer updated, so it is enough to know about this without paying too much attention.

For Python 3.x, the default source encoding is utf-8, which is one way of encoding Unicode, and every string (str) is a sequence of Unicode characters by default. To put it bluntly, we can write any character in Python3.x, and it is handled as Unicode.
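In Python 3 this can be checked directly. The snippet below is a small sketch; `sys.getdefaultencoding()` is the standard-library call that reports the default encoding used for str and bytes conversions, and the other names are illustrative.

```python
import sys

default_encoding = sys.getdefaultencoding()
print(default_encoding)

name = "黄XX"

# A str is a sequence of Unicode characters, not bytes:
# its length counts characters.
print(type(name), len(name))

# Each character has a Unicode code point.
print([hex(ord(ch)) for ch in name])

# Bytes only appear once an encoding is chosen.
print(len(name.encode("utf-8")), len(name.encode("gbk")))
```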

Compare Python2 and Python3:

# In Python2, a Unicode string should be written like this.
my_name = u"黄XX"
# In Python3, a Unicode string is written like this.
my_name = "黄XX"

Having said that, we can draw a conclusion: conversion between different encodings must go through Unicode. The bytes are first decoded to Unicode and then encoded into the target encoding.
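As a minimal sketch of that conclusion, converting gbk bytes to utf-8 bytes takes two steps, with a Unicode str in the middle. The function name `transcode` is made up for this example.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another via a Unicode str."""
    text = data.decode(source_encoding)   # bytes -> Unicode
    return text.encode(target_encoding)   # Unicode -> bytes


gbk_bytes = "你好".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

print(gbk_bytes, len(gbk_bytes))
print(utf8_bytes, len(utf8_bytes))

# Going back the other way recovers the original gbk bytes.
print(transcode(utf8_bytes, "utf-8", "gbk") == gbk_bytes)
```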

② encode and decode

>>> name1 = "我是你们的teacher老师"
>>> name2 = "你们是我的student学生"
>>> # Encode name1 as "utf-8"
>>> name1_encode = name1.encode("utf-8")
>>> name1_encode
b'\xe6\x88\x91\xe6\x98\xaf\xe4\xbd\xa0\xe4\xbb\xac\xe7\x9a\x84teacher\xe8\x80\x81\xe5\xb8\x88'
>>> # Decode name1_encode to restore the original string
>>> name1_encode.decode("utf-8")
'我是你们的teacher老师'
---------------------------------------------------------
>>> # Encode name2 as "gbk"
>>> name2_encode = name2.encode("gbk")
>>> name2_encode
b'\xc4\xe3\xc3\xc7\xca\xc7\xce\xd2\xb5\xc4student\xd1\xa7\xc9\xfa'
>>> # Decode name2_encode to restore the original string
>>> name2_encode.decode("gbk")
'你们是我的student学生'
-------------------------------------------------
>>> # name1_encode is "utf-8" encoded; what happens if we decode it with "gbk"?
>>> name1_encode.decode("gbk")
'鎴戞槸浣犱滑鐨則eacher鑰佸笀'
# The output above is the garbled text we so often talk about!

Code analysis: as the code shows, with utf-8 encoding each of these Chinese characters is stored in 3 bytes, while with gbk encoding each Chinese character is stored in 2 bytes.
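When the encoding of some bytes is unknown, one common approach is to try a few candidates in order. The sketch below is only one possible way to do it; the function name `decode_with_fallback` and the candidate list are assumptions for illustration, not a complete detection method. It tries utf-8 first because bytes in another encoding usually fail strict utf-8 decoding, whereas gbk may "succeed" and silently produce garbled text, as seen above.

```python
def decode_with_fallback(data, encodings=("utf-8", "gb18030")):
    """Try each encoding in order; return (text, encoding_used).

    If none decodes cleanly, fall back to the first encoding and replace
    undecodable bytes with the U+FFFD replacement character.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    return data.decode(encodings[0], errors="replace"), encodings[0]


utf8_data = "我是你们的teacher老师".encode("utf-8")
gbk_data = "你们是我的student学生".encode("gbk")

print(decode_with_fallback(utf8_data))
print(decode_with_fallback(gbk_data))
```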


Origin blog.csdn.net/weixin_38037405/article/details/123932593