Python basics - encoding

Character encodings

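  • ASCII: 1 byte per character, 7-bit values 0-127 (the full 0-255 range belongs to extended single-byte code pages)
  • GB2312: the commonly used Simplified Chinese characters (6,763 of them)
  • GBK: an extension of GB2312 covering more than 20,000 Chinese characters; two bytes represent one Chinese character (minority scripts such as Tibetan are covered by the later GB18030)
  • Big5: Traditional Chinese, used in Taiwan
  • Unicode: assigns every character a code point; it is what strings use in memory, while anything transmitted or stored must first be encoded to bytes
  • UTF: Unicode Transformation Format
  • UTF-8: variable length, 1-4 bytes; compatible with ASCII (ASCII characters take 1 byte, Chinese characters typically 3); the advantage is space saving, at the cost of extra processing for variable-width characters
  • UTF-16: 2 bytes for most characters, 4 bytes for characters outside the Basic Multilingual Plane
  • UTF-32: always 4 bytes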
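The byte counts in the list above can be checked directly. This is a minimal sketch, not part of the original post: the helper name `encoded_length` is made up for illustration, and the `-le` variants of UTF-16 and UTF-32 are used so that Python does not prepend a byte order mark to the result.

```python
# Compare how many bytes one character takes in each encoding.
samples = ["A", "中"]
encodings = ["ascii", "gb2312", "gbk", "big5", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_length(ch, encoding):
    """Return the byte length of ch in encoding, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        # e.g. ASCII has no representation for a Chinese character
        return None


for ch in samples:
    for enc in encodings:
        print(ch, enc, encoded_length(ch, enc))
```

A character outside the Basic Multilingual Plane, such as an emoji, should come out as 4 bytes in UTF-8, UTF-16 and UTF-32 alike.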

In summary:

  • Unicode defines an index value (code point) for every character in the world.
  • Unicode contains all the world's characters, but it is only used in memory. Files and network connections carry bytes, so a Unicode string cannot be stored or transmitted as-is.

    To store text in a file or send it over the network, first convert it to the bytes type (a str encoded with GBK, UTF-8, or similar).

    GBK or UTF-8? Use whatever the rest of the application uses: if everything else is processed as UTF-8, the data you transfer has to be UTF-8 as well.
  • UTF-8 and UTF-16 are implementations of the Unicode standard: they define how characters are stored as bytes on a storage medium.
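To make the difference between a code point and its stored bytes concrete, here is a small sketch using only built-ins. The variable names are illustrative; the character 中 is chosen because it also appears in the examples later in this post.

```python
ch = "中"

code_point = ord(ch)             # the Unicode index (code point) of the character
utf8_bytes = ch.encode("utf-8")  # how UTF-8 stores that code point
gbk_bytes = ch.encode("gbk")     # how GBK stores the same character

print(hex(code_point))  # 0x4e2d
print(utf8_bytes)       # b'\xe4\xb8\xad'
print(gbk_bytes)        # b'\xd6\xd0'

# Different byte sequences, same character once decoded with the matching codec.
assert utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk") == chr(code_point)
```

The code point is the same everywhere; only the bytes on disk or on the wire depend on the encoding chosen.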
 
 
py3
  • In py3 the default encoding is UTF-8. A Chinese character typically takes 3 bytes in UTF-8 and 2 bytes in GBK.
  • In py3 the default string type is str (i.e. the Unicode type); what is saved in a file is the bytes type.
                str (Unicode) ----->>>> cannot be stored or sent over the network as-is; it must be encoded first
                bytes ----->>>> the type used for storage and network transmission
                Web page source ----->>>> charset ----->>>> the character set to decode the page with
encode = encoding / decode = decoding
Unicode (str) ----> encode ----> bytes (UTF-8, ASCII, GBK)
bytes (UTF-8, ASCII, GBK) ----> decode ----> Unicode (str)
 
 
 
py2
  • In py2 the default str type is a byte string (i.e. the bytes type).

bytes (str) ----> decode ----> unicode

unicode ----> encode ----> bytes (str)

 
 
To sum up:
encode: used to obtain a non-Unicode (bytes) string
decode: used to obtain a Unicode string
 
# Strings in py3
 
>>> s = "a"
>>> type(s)   # in py3, str is the Unicode type
<class 'str'>

>>> s1 = b'a'   # in py3, the b prefix makes a bytes literal

>>> type(s1)
<class 'bytes'>

>>> s = "中国"
>>> s.encode("utf-8")   # encode turns a str into bytes
b'\xe4\xb8\xad\xe5\x9b\xbd'   # a bytes value is shown with a leading b
>>> type(s.encode("utf-8"))
<class 'bytes'>
>>> 
>>> s.encode("gbk")   # converting str to bytes takes the target encoding
b'\xd6\xd0\xb9\xfa'
>>> s.encode("gbk").decode("gbk")   # converting bytes back to str takes the encoding they were written in
'中国'
>>> type(s.encode("gbk").decode("gbk"))
<class 'str'>
>>> 

Note: a str must be encoded to bytes before it is written to a file.
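A short sketch of the two usual ways to get text into a file in py3. The file name `demo.txt` and the temporary directory are placeholders for illustration. In binary mode you encode yourself; in text mode `open()` does the encoding for you with the encoding you name.

```python
import os
import tempfile

text = "中国"
path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

# Binary mode: encode to bytes ourselves, then write the bytes.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Text mode: open() encodes for us, using the encoding we pass in.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading back: binary mode returns the raw bytes, text mode decodes them.
with open(path, "rb") as f:
    raw = f.read()

with open(path, "r", encoding="utf-8") as f:
    restored = f.read()

print(raw)       # b'\xe4\xb8\xad\xe5\x9b\xbd'
print(restored)  # 中国
```

Passing `encoding=` explicitly in text mode avoids depending on the platform default, which is not UTF-8 on every system.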



Note: decode with the same encoding that was used to encode; otherwise you get an error or garbled text.
 
  

>>> s = "光荣之路测试开发培训".encode("gbk")   # encode with gbk
>>> s
b'\xb9\xe2\xc8\xd9\xd6\xae\xc2\xb7\xb2\xe2\xca\xd4\xbf\xaa\xb7\xa2\xc5\xe0\xd1\xb5'
>>> s.decode("utf-8")   # decode with utf-8
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 0: invalid start byte
>>>

>>> s.decode("gbk")   # decode with gbk
'光荣之路测试开发培训'
>>>
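When the encoding of incoming bytes is not known for certain, one common approach is to try a few candidates in order and catch `UnicodeDecodeError`. The following is only a sketch under that assumption: `decode_with_fallback` is a made-up helper, and the candidate list is a guess that should be adapted to the data at hand.

```python
data = "中国".encode("gbk")  # bytes whose encoding we pretend not to know


def decode_with_fallback(raw, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace"), None


text, used = decode_with_fallback(data)
print(text, used)
```

This only catches mismatches that raise an error. Some byte sequences are valid in the wrong encoding too and decode silently into garbled text, so knowing the real encoding (for example from a page's charset) is still the reliable fix.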

 

 

 
# Strings in py2
 
>>> s = "abc"
>>> type(s)   # in py2, str is the bytes type
<type 'str'>
>>> 
>>> s.decode("gbk")   # a py2 str is already bytes, so the first step is decode, not encode
u'abc'   # the decoded value is unicode, shown with a leading u
>>> type(s.decode("gbk"))
<type 'unicode'>
>>> print s.decode("gbk")
abc
>>> print s.decode("gbk").encode("gbk")
abc
>>> 
Note: if the py2 behavior is hard to keep straight, just remember how py3 works.

 

 
 
 


Origin www.cnblogs.com/wenm1128/p/11550804.html