Understand string encoding, and you will go further and further down the programmer's road

1. Types of encoding

 

  • ASCII: 1 byte per character, supports English only
  • GB2312: 2 bytes per Chinese character, supports 6700+ Chinese characters
  • GBK: an upgraded version of GB2312, supports 21000+ Chinese characters
  • Shift-JIS: Japanese encoding
  • ks_c_5601-1987: Korean encoding
  • TIS-620: Thai encoding
  • Unicode: 2-4 bytes per character
  • Unicode Transformation Format (UTF): 1-4 bytes per character
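
The byte counts in the list above can be checked with a short sketch. The sample characters are arbitrary choices for illustration, and the codec names are Python's own; `euc_kr` is assumed here to stand in for ks_c_5601-1987.

```python
# Sample characters and Python codec names (illustrative assumptions).
samples = [
    ("A", "ascii"),           # English letter
    ("\u4e2d", "gb2312"),     # a common Chinese character
    ("\u4e2d", "gbk"),        # the same character in GBK
    ("\u3042", "shift_jis"),  # Japanese hiragana "a"
    ("\ud55c", "euc_kr"),     # a Korean syllable
    ("\u0e01", "tis_620"),    # a Thai letter
]

# Map each codec name to the number of bytes it needs for its sample.
byte_counts = {codec: len(char.encode(codec)) for char, codec in samples}

for codec, count in byte_counts.items():
    print(codec, count)
```

The single-byte encodings (ASCII, TIS-620) give 1, and the double-byte East Asian encodings give 2 for these samples.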

2. Unicode and UTF

  Each country had its own encoding rules that covered only its own characters and had no correspondence with other countries' encodings, so Unicode (Universal Code) came into being. It assigns a unique binary code to every character in the world.

 Unicode serves 2 purposes:

  1. It directly supports all the world's languages, so no country needs its old encoding any more and can simply use unicode (just as English serves as a universal language).
  2. It contains the mapping to and from all the national encodings in the world.

  But representing every character directly in unicode wastes space. For example, at 2 bytes per character, "Python" takes 12 bytes, double its 6-byte ASCII representation. UTF was created to solve this problem for storage and network transmission.
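
The 12-byte figure can be reproduced with a minimal sketch. It assumes the fixed 2-bytes-per-character form is modelled by Python's `utf-16-le` codec, chosen here because it adds no byte-order mark.

```python
text = "Python"

ascii_bytes = text.encode("ascii")      # 1 byte per character
wide_bytes = text.encode("utf-16-le")   # 2 bytes per character, no BOM
utf8_bytes = text.encode("utf-8")       # UTF-8 keeps ASCII text at 1 byte each

print(len(ascii_bytes), len(wide_bytes), len(utf8_bytes))  # 6 12 6
```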

  • UTF-8: uses 1, 2, 3, or 4 bytes per character. 1 byte is tried first; if that is not enough, another byte is added, up to 4 bytes. English takes 1 byte, other European letters typically take 2, East Asian characters typically take 3, and other and special characters take 4.
  • UTF-16: Use 2 or 4 bytes to represent all characters; 2 bytes are preferred, otherwise 4 bytes are used.
  • UTF-32: use 4 bytes to represent all characters
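
As a rough check of the three schemes above, the sketch below encodes four sample characters (arbitrary picks, one per size class) and counts the bytes. The `-le` codec variants are used so that no byte-order mark inflates the counts.

```python
# One sample per UTF-8 size class (illustrative choices).
chars = [
    "A",           # English letter
    "\u00e9",      # e with acute accent (European)
    "\u4e2d",      # a Chinese character (East Asian)
    "\U0001f600",  # an emoji outside the Basic Multilingual Plane
]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# sizes[char] is a list of byte lengths, in the order of `encodings`.
sizes = {ch: [len(ch.encode(enc)) for enc in encodings] for ch in chars}

for ch, row in sizes.items():
    # ascii() keeps the output printable on any console encoding.
    print(ascii(ch), dict(zip(encodings, row)))
```

UTF-8 grows from 1 to 4 bytes across the samples, UTF-16 uses 2 bytes until the emoji forces 4, and UTF-32 is always 4.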

So in general, UTF is a family of encoding schemes for unicode, designed to save space in storage and transmission.

3. How are the characters stored on the hard disk?

  Answer: It is converted into binary and stored on the hard disk according to a certain encoding.

  The key point: whatever encoding was used to save the text to the hard disk must also be used to read it back; otherwise garbled characters appear.
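
A small sketch of this rule, assuming a temporary file stands in for "the hard disk": the text is saved as utf-8, then read back once with the matching encoding and once with gbk. `errors="replace"` is used on the mismatched read so the sketch never raises, whatever bytes it meets.

```python
import os
import tempfile

text = "\u597d\u597d\u5b66\u4e60"  # the four Chinese characters used later in this article

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Save to disk as utf-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the same encoding gives the text back.
    with open(path, "r", encoding="utf-8") as f:
        same = f.read()

    # Reading with a different encoding gives garbled characters.
    with open(path, "r", encoding="gbk", errors="replace") as f:
        garbled = f.read()
finally:
    os.remove(path)

print(same == text)     # True
print(garbled == text)  # False
```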

4. Encoding conversion

  Let's first look at how python3 executes code:

  1. The interpreter finds the code file, decodes it according to the encoding declared in the file header, and loads the code into memory as unicode
  2. It interprets the code according to the grammar rules
  3. All string variables are held as unicode
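
The three steps above can be imitated in miniature. This is only a sketch: the source "file" is a bytes object built in memory, the `# -*- coding: gbk -*-` header is an example declaration, and `tokenize.detect_encoding` plus `compile` are used as stand-ins for what the interpreter does when it loads a real file.

```python
import io
import tokenize

# A tiny source file, saved as GBK bytes, with the encoding declared in its header.
source = '# -*- coding: gbk -*-\ns = "\u597d\u597d\u5b66\u4e60"\n'.encode("gbk")

# Step 1: find the declared encoding in the file header.
declared, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# Steps 2 and 3: compile and run the bytes; the header tells Python how to decode them.
namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)
s = namespace["s"]

print(declared)              # gbk
print(type(s).__name__)      # str
print(len(s))                # 4
```

The variable `s` ends up as a 4-character str even though the file itself was stored as 8 GBK bytes for those characters.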

  When you run a program whose output is utf-8 encoded on a Windows system set up for Chinese, the output can come out garbled, because there are only two cases in which text displays correctly there:

  1. The string is GBK encoded
  2. The string is unicode

When your program is GBK encoded, its output will be garbled on a foreign computer, because GBK is not the encoding that system expects. So what can you do? You have two options:

  1. Have the Americans install GBK support on their computers
  2. Re-encode your software in utf-8

>> But both of these paths look hard to take, so what should I do?

>> Don't worry, there is a clever way out. Why not find a translator who understands both the Americans' encoding and the Chinese one?

>> Who is this translator?

>> Unicode!

  Right. As said before, unicode supports all the world's languages and contains the mapping to and from all national encodings. Practically all modern systems and programming languages support unicode, so it can act as a converter (translator) between different encoding rules.
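
A minimal sketch of unicode acting as the translator: GBK bytes are decoded to a unicode str and then encoded as utf-8. The helper name `transcode` is hypothetical, introduced only for this illustration.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another, going through unicode."""
    as_unicode = data.decode(source)   # source encoding -> unicode str
    return as_unicode.encode(target)   # unicode str -> target encoding


text = "\u597d\u597d\u5b66\u4e60"
gbk_bytes = text.encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

print(len(gbk_bytes), len(utf8_bytes))  # 8 12
```

The same four characters take 8 bytes in GBK and 12 in utf-8; the bytes differ, but both decode to the same unicode string.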

The methods to use are decode and encode: decode turns bytes in a given encoding into unicode, and encode turns unicode into bytes in a given encoding. The specific steps are as shown in the figure:

Code (running under py3):

s = "好好学习"  # a str, held in memory as unicode
print(type(s))  # print the type of the string
s1 = s.encode("gbk")  # encode: unicode str -> GBK bytes
print(s1, type(s1))
s2 = s1.decode("gbk")  # decode: GBK bytes -> unicode str
print(s2, type(s2))

 Output:

<class 'str'>
b'\xba\xc3\xba\xc3\xd1\xa7\xcf\xb0' <class 'bytes'>
好好学习 <class 'str'>

 

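Of course, Python2 does not automatically convert the file's encoding to unicode in memory, so you have to convert by hand. Py3 automatically converts the file's encoding to unicode when loading it into memory.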

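5. Encoding differences between Python3 and Python2

  Python3:

1. The default file encoding is utf-8;

2. Strings are unicode;

3. py3 makes a clear distinction between str and bytes: str is a unicode string, and bytes is plain binary data.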
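
The Python3 points above can be observed directly. This is a sketch under py3 only; the variable names are illustrative.

```python
import sys

text = "\u7f8e\u4e3d"        # a str: 2 unicode characters
data = text.encode("utf-8")  # bytes: the same text as 6 raw bytes

default_encoding = sys.getdefaultencoding()
print(default_encoding)                          # utf-8
print(type(text).__name__, type(data).__name__)  # str bytes
print(len(text), len(data))                      # 2 6

# str and bytes are distinct types, so mixing them is an error in py3.
mixed_error = None
try:
    text + data
except TypeError as exc:
    mixed_error = exc

print(type(mixed_error).__name__)  # TypeError
```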

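  Python2:

1. The default file encoding is ASCII;

2. Strings are ASCII encoded (if the file header declares gbk, utf-8, or another encoding, strings use that encoding instead);

3. In py2, unicode is a separate type that must be declared explicitly, e.g. s = u"美丽";

4. In py2, the boundary between str and bytes is blurry: you could say str is bytes and bytes is str.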

 

In python, bytes is commonly used for binary data such as images and videos, which has no corresponding string in any encoding standard.
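
As an illustration, the first 8 bytes of every PNG image (its file signature) are ordinary bytes data but are not valid utf-8 text, so they are kept as bytes rather than decoded. A minimal sketch:

```python
# The fixed 8-byte signature that starts every PNG file.
png_signature = b"\x89PNG\r\n\x1a\n"

# Trying to treat the raw bytes as utf-8 text fails on the first byte.
try:
    png_signature.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(decodable)                 # False
print(list(png_signature[:4]))   # [137, 80, 78, 71]
```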
