Codec
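- ASCII: 1 byte per character; 7-bit codes, values 0-127
- GB2312: commonly used Chinese characters, about 6,700 of them
- GBK: a superset of GB2312 with more than 20,000 Chinese characters; two bytes represent one Chinese character (Tibetan and other minority scripts are only covered by the later GB18030)
- Big5: Traditional Chinese, used mainly in Taiwan
- Unicode: one character set for all scripts; it is the form text takes in memory, but text that is transmitted or stored is first encoded to bytes rather than kept as raw Unicode
- UTF: Unicode Transformation Format
- UTF-8: variable length, 1-4 bytes per character; compatible with ASCII (ASCII characters take 1 byte, Chinese characters usually 3). The advantage is that it saves space; the cost is extra processing time, because characters do not have a fixed width
- UTF-16: 2 bytes for most characters, 4 bytes for characters outside the Basic Multilingual Plane
- UTF-32: fixed 4 bytes per character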
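The sizes in the list above can be checked directly. The following is a minimal sketch, assuming Python 3; the sample characters, the helper name `encoded_lengths`, and the choice of the `-le` codec variants (used so that no byte-order mark is counted) are illustrative and not part of the original notes.

```python
# Byte length of one character under the three UTF encodings (Python 3).
# The "-le" variants are used so that no byte-order mark is prepended.
CODECS = ["utf-8", "utf-16-le", "utf-32-le"]

def encoded_lengths(ch):
    """Return {codec name: number of bytes needed to encode ch}."""
    return {name: len(ch.encode(name)) for name in CODECS}

samples = [
    "a",           # ASCII letter
    "\u4e2d",      # the Chinese character zhong (middle)
    "\U0001F600",  # an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    print(repr(ch), encoded_lengths(ch))
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, UTF-16 switches from 2 to 4 bytes only for the last sample, and UTF-32 stays at 4 bytes throughout.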
In summary:
- Unicode assigns an index value (a code point) to every character in the world.
- Unicode text: covers the characters of every script, but this form is only used in memory. Files and network connections carry bytes, so a Unicode string cannot be stored or sent as it is.
If text must be stored in a file or sent over the network, it must first be converted to the bytes type (a string encoded with GBK or UTF-8).
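As a rough illustration of the rule above, the sketch below encodes a str before writing it to a file and decodes the bytes after reading them back. It assumes Python 3; the scratch file name `demo.txt` and the variable names are hypothetical.

```python
import os
import tempfile

text = "\u4e2d\u56fd"  # a str (Unicode text in memory): the two characters of "China" in Chinese

# Hypothetical scratch file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Encode explicitly and write the raw bytes in binary mode.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Read the bytes back and decode them with the same encoding.
with open(path, "rb") as f:
    raw = f.read()
restored = raw.decode("utf-8")

# Text mode does the same decoding for us when given the encoding.
with open(path, "r", encoding="utf-8") as f:
    restored_text_mode = f.read()

print(raw)
print(restored == text, restored_text_mode == text)
```

Binary mode makes the encode and decode steps visible; in everyday code, opening the file in text mode with an explicit `encoding` argument is usually the more convenient choice.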
Should we use GBK or UTF-8?
- UTF-8 and UTF-16 are encodings defined by the Unicode standard; they determine how characters are written to a storage medium.
- The py3 default encoding is UTF-8. A Chinese character typically takes 3 bytes in UTF-8 and 2 bytes in GBK.
- The py3 default string type is str (i.e. the Unicode type); what is saved in a file is the bytes type.
- The py2 default string type is str (i.e. the bytes type).
bytes ----> decode (decoding) ----> Unicode (str)
Unicode (str) ----> encode (encoding) ----> bytes
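The points above about the py3 default encoding and the per-character byte counts can be verified with a short script. This is a sketch assuming Python 3; the variable names are illustrative.

```python
import sys

s = "\u4e2d"  # one Chinese character (zhong)

# Default codec used by str.encode() and bytes.decode() when none is given.
default_codec = sys.getdefaultencoding()
print(default_codec)

utf8_bytes = s.encode()        # no argument: the default (UTF-8) is used
gbk_bytes = s.encode("gbk")    # explicit GBK

print(len(utf8_bytes), len(gbk_bytes))

# Decoding with the matching codec restores the original str.
print(utf8_bytes.decode("utf-8") == s, gbk_bytes.decode("gbk") == s)
```

The same character occupies a different number of bytes depending on the codec, which is why the decoder must be told which encoding produced the bytes.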
# Strings in py3
>>> s = "a"
>>> type(s)  # in py3, str is the Unicode type
<class 'str'>
>>> s1 = b'a'  # in py3, a leading b makes a bytes literal
>>> type(s1)
<class 'bytes'>
>>> s = "中国"
>>> s.encode("utf-8")  # encode turns a str into bytes
b'\xe4\xb8\xad\xe5\x9b\xbd'
>>> type(s.encode("utf-8"))  # a bytes value is displayed with a leading b
<class 'bytes'>
>>> s.encode("gbk")  # converting str to bytes requires choosing an encoding
b'\xd6\xd0\xb9\xfa'
>>> s.encode("gbk").decode("gbk")  # decode bytes with the same encoding that produced them
'中国'
>>> type(s.encode("gbk").decode("gbk"))
<class 'str'>
Note: when writing to a file, the str must be encoded first; what gets written is bytes.
>>> s = "光荣之路测试开发培训".encode("gbk")  # encode with GBK
>>> s
b'\xb9\xe2\xc8\xd9\xd6\xae\xc2\xb7\xb2\xe2\xca\xd4\xbf\xaa\xb7\xa2\xc5\xe0\xd1\xb5'
>>> s.decode("utf-8")  # decode with UTF-8: the wrong codec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 0: invalid start byte
>>> s.decode("gbk")  # decode with GBK: the codec that produced the bytes
'光荣之路测试开发培训'
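When the encoding of some bytes is not known in advance, the UnicodeDecodeError from a mismatched codec can be caught and another codec tried. The helper below, `decode_with_fallback`, is a hypothetical sketch assuming Python 3, not a standard-library function. Trial decoding is only a heuristic, since some byte sequences are valid in more than one encoding.

```python
def decode_with_fallback(data, codecs=("utf-8", "gbk")):
    """Try each codec in order; return (text, codec) for the first that succeeds."""
    for name in codecs:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes, try the next one
    raise ValueError("none of the candidate codecs could decode the data")

# GBK-encoded bytes: UTF-8 rejects them (0xb9 is not a valid start byte),
# so the helper falls through to GBK.
data = "\u5149\u8363\u4e4b\u8def".encode("gbk")
text, used = decode_with_fallback(data)
print(text, used)
```

Whenever possible it is more reliable to record the encoding alongside the data than to guess it afterwards.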
# Strings in py2
>>> s = "abc"
>>> type(s)  # in py2, str is the bytes type
<type 'str'>
>>> s.decode("gbk")  # a py2 str is already bytes, so decode it first rather than encode it
u'abc'
>>> type(s.decode("gbk"))  # the decoded value is unicode, displayed with a leading u
<type 'unicode'>
>>> print s.decode("gbk")
abc
>>> print s.decode("gbk").encode("gbk")
abc
Note: if you cannot keep the py2 behaviour straight, just remember the py3 rules.