Encodings in Programming Languages, File Systems, and Network Protocols

The previous article introduced URL encoding, UTF-8, Base64, gzip, and other encoding and decoding methods. This article collects some miscellaneous topics on characters and encodings that I believe you will find interesting.

Characters and Encodings in Python

Why emphasize the separation between a character's number (its code point) and its encoding? Because processing characters is one of the most important tasks in any programming language, and a well-designed language should keep these two concepts apart. Python is such a language. The distinction is not very obvious in Python 2, but Python 3 makes it explicit: the str class represents strings, and the bytes class represents byte strings, which can hold the encoded values of strings. In detail:

1. The str class is used in Python 3 to represent and store strings. What a string object stores is the number of each character, as assigned by the character set (Unicode). How many bytes Python uses to store each number is not fixed. In CPython 3.3 and later it depends on the largest number in the string: a string of commonly used Chinese characters takes two bytes per character, while some special characters (such as musical symbols) need four. This is an internal implementation detail of Python, so I will not elaborate on it. Interested readers can explore it themselves using the method shown in this article.
2. The bytes class is used in Python 3 to represent and store the encoded values of strings. Because the same characters can be encoded in more than one way (UTF-8, UTF-16, UTF-32, GB2312, GBK, and so on), the stored value is determined by the encoding. First, different encodings produce different encoded values for the same string; second, the value stored in the bytes object after encoding also differs from the value stored in the str object.
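The storage width mentioned in item 1 can be estimated from Python itself. The following is a minimal sketch that assumes CPython 3.3 or later; `bytes_per_char` is a hypothetical helper written for this illustration, and other Python implementations may behave differently.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate the bytes CPython uses per character in a string made only of `ch`."""
    # Comparing two lengths cancels out the fixed object header.
    return (sys.getsizeof(ch * 20) - sys.getsizeof(ch * 10)) // 10


# An ASCII letter, a common Chinese character, and the musical G clef (U+1D11E).
for ch in ("a", "村", "\U0001D11E"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

On CPython this should report 1, 2, and 4 bytes respectively, which matches the idea that the interpreter picks the narrowest width able to hold the largest number in the string.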

As before, take my CSDN blog name, 村中少年 ("village boy"), as an example. From the previous article we know:

1. The Unicode numbers of these characters are U+6751, U+4E2D, U+5C11, and U+5E74, respectively.
2. The UTF-8 encoded value is E6 9D 91 E4 B8 AD E5 B0 91 E5 B9 B4.
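Both values can be checked from Python before looking at memory. This is a small sketch using only built-ins; the variable names are chosen for illustration, and the list of encodings is just a sample.

```python
name = "村中少年"

# A str holds numbers (code points); ord() returns the number of one character.
code_points = [f"{ord(ch):04X}" for ch in name]
print(code_points)  # ['6751', '4E2D', '5C11', '5E74']

# A bytes object holds the result of one particular encoding of those numbers.
for encoding in ("utf-8", "utf-16-le", "utf-32-le", "gbk"):
    encoded = name.encode(encoding)
    print(f"{encoding:>9}: {len(encoded):2d} bytes  {encoded.hex()}")

# Decoding with the same encoding gives back the original string.
round_trip = name.encode("gbk").decode("gbk")
print(round_trip == name)  # True
```

The same four numbers produce byte strings of different lengths and contents depending on the encoding, which is why the two concepts need to be kept apart.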

Write the following program and view the memory content corresponding to the object:

def printMem(data):
	# Dump the raw memory of an object as hex.
	# CPython-specific: id() returns the object's memory address.
	from ctypes import string_at
	from sys import getsizeof
	from binascii import hexlify

	print(hexlify(string_at(id(data), getsizeof(data))))


if __name__ == "__main__":
	nameStr = '村中少年'
	nameBytes = nameStr.encode()  # encode() uses UTF-8 by default
	print(type(nameStr))
	print(type(nameBytes))
	printMem(nameStr)
	printMem(nameBytes)

The result of the operation is as follows:

<class 'str'>

<class 'bytes'>

b'0500000000000000c0f562b5417f000004000000000000001a53594dc9e6e066a80a64b5417f00000000000000000000000000000000000000000000000000000000000000000000`51672d4e115c745e`0000'

b'0300000000000000e0fe60b5417f00000c00000000000000ffffffffffffffff`e69d91e4b8ade5b091e5b9b4`00'

The marked parts are the Unicode numbers in the string and the UTF-8 encoded value in the byte string, respectively. Each number is stored here as a two-byte little-endian value, so U+6751 appears as 51 67. You can see that nameStr indeed stores the Unicode numbers, while nameBytes stores the encoded value, which verifies what was said at the beginning. This is how Python separates strings from their encodings at the language level. The unmarked parts are other fields of the object, which this article does not cover.

Figure 1 below shows the sequence of operations on an external file, from its creation to being read by Python 3:

Figure 1

As Figure 1 shows, the creation of a file, including a Python source file, follows the separation of display and storage described above. In addition, when Python 3 reads a file, it treats the content as a string by default, that is, it decodes the stored bytes into Unicode numbers, unless the file is explicitly opened in binary mode.
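The difference between the two read modes can be sketched as follows. The temporary file and the variable names are assumptions made for this illustration; the encoding is passed explicitly because the default used by open() depends on the platform.

```python
import os
import tempfile

name = "村中少年"

# Create a small UTF-8 file to read back (a temporary file, for illustration only).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(name.encode("utf-8"))
    path = tmp.name

try:
    # Text mode decodes the stored bytes into a str of Unicode numbers.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Binary mode returns the stored bytes untouched.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(type(text), len(text))  # <class 'str'> 4
print(type(raw), len(raw))    # <class 'bytes'> 12
```

The same file yields four characters in text mode and twelve bytes in binary mode, which is the display/storage separation seen from the reading side.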

Why does Python use numbers to represent the string itself? The reason is the same separation of display and storage discussed earlier. Wherever a string appears, whether written on paper or displayed on a screen, its meaning does not change, so it should be represented by one fixed value: its numbers. When it is stored, however, different encodings exist for reasons such as efficiency, and different encodings produce different values, so an encoded value (which is not unique) is not a suitable representation of the string. Early character sets such as ASCII and GB2312 did not distinguish between number and encoding, which caused no real problem there, but this no longer works for the Unicode character set. After reading this article, you should be able to tell a string apart from its encodings.

File System Encoding

Because the file system's encoding is usually the same as the operating system's, we tend to overlook it. In fact, when files are moved between two different file systems, garbled file names are a common result, and I believe many readers have run into this. A file name is stored as encoded bytes and decoded for display, so if the two systems use different encodings, the name shows up garbled. The cause is the same as for garbled file content: the rule used to encode on writing differs from the rule used to decode on display. Typically one side writes in GBK and the other reads in UTF-8.
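The mismatch can be simulated at the byte level without touching a real file system. This is only a sketch of the mechanism, with the name bytes standing in for what a file system would store; the variable names are chosen for illustration.

```python
name = "村中少年"

utf8_bytes = name.encode("utf-8")  # how one system stored the name
gbk_bytes = name.encode("gbk")     # how another system stored it

# Reading UTF-8 bytes as GBK may "succeed" but yields the wrong characters.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled)

# Reading GBK bytes as UTF-8 fails outright for this name.
try:
    gbk_bytes.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
    print("cannot decode as UTF-8:", exc.reason)

# Using the same encoding on both sides restores the name.
restored = gbk_bytes.decode("gbk")
print(restored)
```

On a real system, sys.getfilesystemencoding() reports the encoding Python uses when converting file names, which is a reasonable first thing to check when names come out garbled.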

Network Protocol Encoding

The HTTP header contains a Content-Type field, which can tell the receiver how to decode the data it receives. The following HTTP request and response headers are an example:

GET / HTTP/1.1
Host: www.guoguo-app.com
Connection: close

HTTP/1.1 200 OK
Content-Type: text/html;charset=utf-8 
Transfer-Encoding: chunked 
Connection: close
Vary: Accept-Encoding 
Vary: Accept-Encoding 
Content-Language: zh-CN 
Server: Tengine/Aserver

You can see that the server tells the client to decode the received data as UTF-8. After a browser or crawler parses the HTTP header, it can display the content correctly by decoding it with the encoding the header specifies, for example when showing a text file online. That said, I do not think the name charset is accurate here, because charset usually refers to a character set; something like charsetEncoding would be more appropriate, although everyone understands what is meant.
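A client might extract the charset parameter and use it to decode the body roughly as follows. This sketch relies on the standard library's email.message.Message for header parsing; `charset_from_content_type` and its `default` value are hypothetical choices made for this illustration, and `body` is a stand-in for the bytes received after the headers.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


body = "村中少年".encode("utf-8")  # stand-in for the received response body
charset = charset_from_content_type("text/html;charset=utf-8")
print(charset)               # utf-8
print(body.decode(charset))  # 村中少年
```

HTTP client libraries generally perform this step internally; the sketch only makes the header-to-decoder link visible.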

Web Page Encoding

In HTML pages, the following tag is usually used to declare the encoding of the page:

<meta charset="UTF-8">

After reading this tag, the browser decodes the file as UTF-8. If the HTML file was actually saved in GBK, the browser will display garbled characters because the wrong decoding is used. The actual encoding of the HTML file must therefore match the one declared in the page. You may then ask: what if the Content-Type in the HTTP response disagrees with the meta charset in the page? The browser has to choose one. The usual practice is that the HTTP response header takes priority over the declaration in the page, but the exact behavior ultimately depends on how each browser implements it.
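The priority rule described above can be sketched in a few lines. This is a simplified illustration and not how any particular browser is implemented: `choose_encoding`, the regular expression, the 1024-byte scan window, and the `fallback` value are all assumptions made for the example.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(rb"<meta[^>]*charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def choose_encoding(header_charset: Optional[str], html: bytes, fallback: str = "utf-8") -> str:
    """Prefer the HTTP header charset, then the <meta> declaration, then a fallback."""
    if header_charset:
        return header_charset
    match = META_CHARSET.search(html[:1024])  # only look near the start of the page
    if match:
        return match.group(1).decode("ascii")
    return fallback


page = '<meta charset="GBK"><p>村中少年</p>'.encode("gbk")
print(choose_encoding("utf-8", page))  # utf-8: the header wins, so this page would be garbled
print(choose_encoding(None, page))     # GBK: no header charset, so the meta tag is used
```

When the header and the page disagree, as in the first call, the page is decoded with the header's value, which is why a wrong Content-Type on the server can garble an otherwise correctly declared page.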

The above summarizes some points about characters and encodings in computer systems. I hope it is helpful to you.

This is an original article by the CSDN blogger 村中少年 and may not be reproduced without permission.

Origin blog.csdn.net/javajiawei/article/details/131198067