04- coding and strings

  SUMMARY: This paper analyzes an encoding format and a character string Python


 

First, the character encoding

Separatist warlords

  Early computer uses ASCII encoding, only 127 characters are encoded in the computer. ASCII code with one byte (8bit) can be represented.

  In order to deal with their own language, countries have developed their own codes. As China's GB2312 encoding, Shift_JIS encoding Japan, South Korea Euc-kr coding. Since the encoding format is not unified, there were bound to clash. Specific phenomenon is garbled.

Unicode: a four returnees

  So Unicode came into being. Unicode language will unify all set encodings, so as to solve the garbage problem.

  Unicode is most commonly expressed as a two-byte characters, only a very remote uses a 4-byte characters.

  ASCII encoding into Unicode encoding, just in front of 0s can be.

  However Unicode is not perfect. If we only used the English, using the ASCII code, a byte can get. Now use Unicode, it requires two bytes and bytes front or all-0, very wasteful. To solve this problem, the Unicode basis, has developed a "variable-length encoding" UTF-8:

  1. UTF-8 to put a 6-byte Unicode characters depending on the size of the digital encoding;

  2. English letters are encoded as a byte (ASCII encoding it can be seen as part of UTF-8 encoding, only supports ASCII encoding of the old software can continue to work in UTF-8 encoding);

  3. The characters are encoded as three bytes;

  4. Very rare character is encoded into bytes 4-6.

The computer system of universal character encoding works

  1. In computer memory, uniform use of Unicode encoding;

  2. When the hard disk needs to be stored or to be transmitted, it is converted to UTF-8 encoding;

  3. editing editor, UTF-8 characters read from the document is converted to Unicode characters stored in memory. After editing, and then converted to Unicode UTF-8, saved to a file.

 

   4. When browsing the web, the server dynamically generated content into Unicode UTF-8 and then transmitted to the browser.

 

 

Two, Python string

  In Python 3, a string in Unicode encoding, so Python string 3 supports multiple languages.

And character encoding conversion

  We mentioned before, the characters inside the computer is stored as an integer encoding for each character set, and encoding integer-one mapping relationship.

  Python, provided the ord () and CHR () function to character conversion and encoding:

>>> ord('')
21220
>>> chr(21220)
''

  If you know the integer encoded characters, you can write to hexadecimal string in the format '\ u hexadecimal code'. For example: "ground" integer code is 21220, hexadecimal 0x52E4. "Fen" integer code is 22859, hexadecimal 0x594B.

>>> ' \ u52E4 \ u594B ' 
' hard '

Transmission and storage of character

  Python string (str type) in Unicode in memory, a character corresponding to a number of bytes. If to be transmitted over a network, or stored to disk, to make changes in the string of bytes bytes.

  Python type of data bytes, b with single quotes or double quotes indicate a prefix, such as:  b = b ' the ABC '  . To note: "ABC" is the string, b "ABC" although the former and the same content, but only one byte per character.

  In str Unicode represented by the encode () method may be coded as specified in bytes, which specifies the parameters passed with the encoding format. For example:  'ABC'.encode ( ' ASCII ' ) . Note, str containing Chinese can not be encoded in ASCII, because out of range. In bytes, the bytes can not be displayed as ASCII characters, use \ x ## representation.

  If we read the byte stream from the network or disk, read data is bytes. A decode () function can be changed bytes str, for example:  B ' the ABC ' .decode ( ' ASCII ' ) . If the bytes contains can not be decoded bytes, it will error. We can pass errors = 'ignore' to ignore the error, such as:  b ' \ XE4 \ XB8 \ XAD \ xFF ' .decode ( ' UTF-8 ' , = errors ' the ignore ' ) 

String length

  You can use len () function to get str contains the number of characters. If the incoming str but not the type of bytes, the number of bytes.

# Incoming STR 
>>> len ( ' the ABC ' )
 . 3 
>>> len ( ' Chinese ' )
 2 # incoming bytes 
>>> len (B ' the ABC ' )
 . 3 
>>> len ( ' Chinese ' .encode ( ' UTF-. 8 ' ))
 . 6

  From the above code Chinese characters after a UTF-8 encoding, usually it occupies 3 bytes.

str and bytes conversion

  We often encounter conversion str and bytes, in order to avoid distortion, it should always be converted using UTF-8.

  When the source code contains Chinese, also designated as UTF-8 encoding. In order to ensure by UTF-8 code is read when reading the Python interpreter, you need to write on the following two lines at the beginning of the file

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

  In the header file declarations use UTF-8 encoding, it does not mean that the file is UTF-8 encoding (it just tells Python interpreter Yaoan UTF-8 to read the file). We have to set the text editor using the UTF-8 without BOM encoding.

Formatted string

  In Python, C and formatting method employed as implemented%. There are a few%? Placeholder behind going with a few variables or values, order to-one correspondence. If only one percent brackets after ?, may be omitted.

  Common placeholders:

Placeholder Replace content
%d Integer
%f Float
%s String
%x Hexadecimal integer

  Integer and floating point format can specify whether to fill 0, and the number of digits and decimal integer.

>>> print('%2d-%02d' % (3, 1))
 3-01
>>> print('%.2f' % 3.1415926)
3.14

  If you are unsure which placeholders, then use the% s it, it will all data types are converted to strings:

>>> 'Age: %s. Gender: %s' % (25, True)
'Age: 25. Gender: True'

  If there is a string of characters% how to do? At this point the application% to escape:

>>> 'growth rate: %d %%7'
'growth rate: %d %%7'

  In addition to format string placeholder%, format string () method may also be used. But in this way too much trouble, rarely used, it is not described here.

Guess you like

Origin www.cnblogs.com/murongmochen/p/11669460.html