Fluent Python, Chapter 4 notes

4.1 Character issues:

A character's identity is its code point: a number, written in the Unicode standard as 4 to 6 hexadecimal digits with a U+ prefix. (In effect the code point is the character, just as my name is Xiao Ming and Xiao Ming is me. In py3, printing Unicode text outputs the text directly; in py2, print on a unicode object also shows the actual text.)

The actual bytes that represent a character depend on the encoding in use. An encoding is an algorithm that converts between code points and byte sequences.

 

Converting code points into a byte sequence is called encoding; converting a byte sequence into code points is called decoding.

 

Put simply: you encode for the machine, and you decode for people.
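As a small illustration of the terms above, here is a minimal sketch of one character's round trip from code point to bytes and back. The variable names are my own and are not from the book.

```python
# One character, its code point, and its UTF-8 round trip.
ch = '\u00e9'                      # the character 'é'
code_point = ord(ch)               # the code point as an integer
label = f'U+{code_point:04X}'      # the usual U+XXXX notation
encoded = ch.encode('utf-8')       # code point -> bytes (encoding)
decoded = encoded.decode('utf-8')  # bytes -> code point (decoding)

print(label, encoded, decoded)
```

The single code point becomes two bytes in UTF-8, and decoding those two bytes gives the original character back.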

 

4.2 Byte essentials

In [687]: b'\xc3\xa9'.decode('U8')                                                                                      
Out[687]: 'é'

In [688]: cafe = bytes('café', encoding='utf8')                                                                         

In [689]: cafe                                                                                                          
Out[689]: b'caf\xc3\xa9'

In [690]: cafe[0]                                                                                                       
Out[690]: 99

In [691]: cafe[-1]                                                                                                      
Out[691]: 169

In [692]: cafe[-1:]                                                                                                     
Out[692]: b'\xa9'

In [693]: cafe = bytearray(cafe)                                                                                        

In [694]: cafe                                                                                                          
Out[694]: bytearray(b'caf\xc3\xa9')

In [695]: cafe[0]                                                                                                       
Out[695]: 99

In [696]: cafe[-1]                                                                                                      
Out[696]: 169

In [697]: cafe[-1:]                                                                                                     
Out[697]: bytearray(b'\xa9')

In [698]: '12345'[2:3] == '12345'[2]                                                                                    
Out[698]: True

 As the code shows, each element of a bytes or bytearray object is an integer from 0 to 255 inclusive (one byte is two hexadecimal digits; binary 11111111, eight 1 bits, is 255, the maximum value).

A slice of a binary sequence is always a binary sequence of the same type, even a slice of length 1.

 

As seen above, str is the only sequence type where taking one item and taking a length-1 slice give the same value.

For every other sequence, indexing returns the element itself, while a slice returns a sequence of the same type that contains the element.

In other words, s[i] returns the item at index i, and s[i:i+1] returns a sequence holding s[i].
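The difference between s[i] and s[i:i+1] can be checked side by side. This is a minimal sketch with sample values of my own choosing.

```python
text = 'cafe'
raw = b'cafe'
items = [10, 20, 30]

# str: one item is itself a one-character str, so both forms are equal.
str_same = text[0] == text[0:1]

# bytes: one item is an int, but a slice is still a bytes object.
bytes_item = raw[0]
bytes_slice = raw[0:1]

# list: one item is the element, a slice is a list holding that element.
list_item = items[0]
list_slice = items[0:1]

print(str_same, bytes_item, bytes_slice, list_item, list_slice)
```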

 

The output shows that each byte is displayed in one of only three ways:

1, bytes in the printable ASCII range are shown as the ASCII character itself

2, tab, line feed, carriage return, and backslash are shown as the escape sequences \t, \n, \r, and \\

3, every other byte is shown as \x followed by two hexadecimal digits, for example \xa9
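The three display forms can be reproduced by building bytes objects from chosen integer values. This is a sketch; the byte values are picked by me to hit each case.

```python
# Case 1: printable ASCII bytes are shown as the characters themselves.
printable = bytes([65, 98, 49])

# Case 2: tab, line feed, carriage return and backslash use escape sequences.
escapes = bytes([9, 10, 13, 92])

# Case 3: everything else is shown as \x plus two hexadecimal digits.
others = bytes([0, 127, 169, 255])

print(repr(printable))
print(repr(escapes))
print(repr(others))
```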

 

Binary sequences support most of the str methods, apart from those that depend on text-specific behaviour. In many ways they can be treated as strings, just strings made of bytes.

In [753]: bytes.fromhex('314bcea9')                                                                                     
Out[753]: b'1K\xce\xa9'

 fromhex is a class method that str does not have (bytearray has it as well): it builds a byte sequence directly from a string of hexadecimal digit pairs. 31 and 4b are the ASCII codes of 1 and K, so those two bytes are displayed directly as 1K. I was stuck on this for quite a while.
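To make the fromhex result less puzzling, the following sketch converts the same bytes back to hex and also decodes them as UTF-8. Decoding as UTF-8 is my own addition for illustration; the original example does not do it.

```python
data = bytes.fromhex('314bcea9')

# hex() is the inverse of fromhex().
as_hex = data.hex()

# 0x31 and 0x4b are ASCII '1' and 'K'; 0xce 0xa9 is one UTF-8 character.
text = data.decode('utf-8')

# Whitespace between the digit pairs is allowed.
spaced = bytes.fromhex('31 4b ce a9')

print(data, as_hex, text)
```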

 

How to create byte sequence objects:

1, the most direct way is to call encode on a str

In [752]: '!love你,\n'.encode()                                                                                         
Out[752]: b'!love\xe4\xbd\xa0,\n'

 As shown above, the punctuation and letters are output as they are, while the Chinese character is output as \x escapes.

 

A bytes or bytearray instance can also be built by calling its constructor with one of the following arguments:

1, a str object and an encoding keyword argument

2, an iterable that provides values from 0 to 255 (compare the fromhex example I wrote earlier)

3, an object that implements the buffer protocol (e.g., bytes, bytearray, memoryview, array.array); in this case the bytes are copied from the source object into the new binary sequence.

(What exactly is a buffer protocol object? I could not find much information about it, which is frustrating.) The code from the book, after import array:

In [754]: numbers = array.array('h',[-2,-1,0,1,2])                                                                      

In [755]: octets = bytes(numbers)                                                                                       

In [756]: octets                                                                                                        
Out[756]: b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'

 I was not sure what the point of this was at first: the five 16-bit integers are copied as their 10 raw bytes. By comparison, passing a list of numbers within 0-255 to bytes also works, and gives one byte per number.

In [761]: bytes([1,2,3,4,5,233])                                                                                        
Out[761]: b'\x01\x02\x03\x04\x05\xe9'

 After this the book covers structs and memory views.

About struct and memoryview: the struct module is used to work with binary data. I did not really understand this part, so there is no code for it here.
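For readers who want a taste of struct and memoryview anyway, here is a minimal sketch. It is not the book's example; the format string '<hI' and the sample values are assumptions of mine chosen to keep the result easy to check.

```python
import array
import struct

numbers = array.array('h', [-2, -1, 0, 1, 2])

# memoryview exposes the array's memory without copying it.
view = memoryview(numbers)
first = view[0]

# cast('B') shows the same memory as individual unsigned bytes.
as_bytes = view.cast('B')
byte_count = len(as_bytes)

# struct packs Python values into bytes according to a format string:
# '<' little-endian without padding, 'h' 2-byte signed, 'I' 4-byte unsigned.
packed = struct.pack('<hI', -2, 1000)
unpacked = struct.unpack('<hI', packed)

print(first, byte_count, packed, unpacked)
```

The byte count is the number of items times the item size, which is why bytes(numbers) in the session above produced 10 bytes for 5 numbers.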

 

4.3 Basic codecs

Python ships with more than 100 codecs, which is a lot.

The commonly used UTF-8 codec can be written as utf8, utf-8, or U8.

Apart from the UTF encodings such as utf-8 and utf-16, which can handle every Unicode character, the other encodings cover only part of Unicode. When encoding or decoding, a character that is not in the codec's own character set raises an error by default.
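The sketch below encodes one string with three codecs to show how the bytes differ, and then shows a codec rejecting a character outside its character set. The sample strings are my own choice.

```python
text = 'El Ni\u00f1o'   # 'El Niño'

for codec in ('latin_1', 'utf_8', 'utf_16'):
    print(codec, text.encode(codec))

latin = text.encode('latin_1')   # one byte per character
utf8 = text.encode('utf_8')      # two bytes for the ñ
utf16 = text.encode('utf_16')    # byte order mark plus two bytes per character

# latin_1 has no Chinese characters, so encoding one fails by default.
try:
    '\u6211'.encode('latin_1')   # the character 我
    failed = False
except UnicodeEncodeError:
    failed = True

print(failed)
```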

 

4.4 Understanding encode/decode problems

If you encode or decode with something other than utf-8 and hit a character that the codec's character set does not contain, an error is raised.

In [762]: s = 'abs我cd'                                                                                                 

In [763]: s.encode(encoding='adcii')                                                                                    
---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
<ipython-input-763-c481354da0b9> in <module>
----> 1 s.encode(encoding='adcii')

LookupError: unknown encoding: adcii

In [764]: s.encode(encoding='ascii')                                                                                    
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-764-8fa11c3441f2> in <module>
----> 1 s.encode(encoding='ascii')

UnicodeEncodeError: 'ascii' codec can't encode character '\u6211' in position 3: ordinal not in range(128)

In [765]: s.encode(encoding='ascii',errors='ignore')                                                                    
Out[765]: b'abscd'

In [766]: s.encode(encoding='ascii',errors='replace')                                                                   
Out[766]: b'abs?cd'

In [767]: s.encode(encoding='ascii',errors='xmlcharrefreplace')                                                         
Out[767]: b'abs&#25105;cd'

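 Three values of the errors argument are shown above. The last one is interesting: xmlcharrefreplace emits an XML character reference, which a web page renders as the original character 我, as in this copy of the output as displayed in a browser:

In [767]: s.encode(encoding='ascii',errors='xmlcharrefreplace')
Out[767]: b'abs我cd'

 The book also says that error handling is extensible: custom handlers can be registered with codecs.register_error.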
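As a sketch of what extending the errors argument can look like, the following registers a custom handler with codecs.register_error. The handler name 'codepoint_demo' and the <U+XXXX> output format are hypothetical choices of mine, not something from the book.

```python
import codecs


def codepoint_handler(exc):
    """Replace each unencodable character with a <U+XXXX> marker."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = ''.join(f'<U+{ord(c):04X}>' for c in bad)
        # Return the replacement text and the position to resume from.
        return replacement, exc.end
    raise exc


# 'codepoint_demo' is a made-up name used only for this example.
codecs.register_error('codepoint_demo', codepoint_handler)

s = 'abs\u6211cd'   # 'abs我cd'
result = s.encode('ascii', errors='codepoint_demo')
print(result)
```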

 

Decoding works the same way, of course:

In [770]: s                                                                                                             
Out[770]: 'abs我cd'

In [771]: s.encode('U8').decode('ascii',errors='ignore')                                                                
Out[771]: 'abscd'

In [772]: s.encode('U8').decode('ascii',errors='replace')
Out[772]: 'abs���cd'

 I recommend passing errors='replace' when decoding, because each undecodable byte is replaced with a default placeholder character (U+FFFD), which is much better than silently skipping it.

 

4.4.4 How to discover the encoding of a byte sequence

Use chardet.detect (the sample strings in this session were originally Chinese text):

In [774]: string = 'I uncle agua'                                                                                       

In [775]: import chardet                                                                                                

In [776]: gg = string.encode('gbk')                                                                                     

In [777]: uu = string.encode()                                                                                          

In [778]: chardet.detect(gg)                                                                                            
Out[778]: {'encoding': None, 'confidence': 0.0, 'language': None}

In [779]: chardet.detect(uu)                                                                                            
Out[779]: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

In [780]: chardet.detect('master dengue Te Ge age people wind Fane Er'.encode('gbk'))
Out[780]: {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

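 The more data there is, the more accurate the detection: the short string is not recognized when encoded as GBK, while the longer one is detected with 0.99 confidence.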
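When chardet is not installed, a very rough guess is still possible with the standard library by trying candidate encodings in order. To be clear, this is not how chardet works (it uses statistical heuristics); guess_encoding and its candidate list are hypothetical names of mine, and the result is only "the first candidate that decodes without error".

```python
def guess_encoding(data, candidates=('utf-8', 'gbk')):
    """Return the first candidate encoding that decodes data, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


sample = '\u6211\u7231\u4f60'   # a short Chinese sample string

from_gbk = guess_encoding(sample.encode('gbk'))
from_utf8 = guess_encoding(sample.encode('utf-8'))
no_match = guess_encoding(b'\xff', candidates=('ascii',))

print(from_gbk, from_utf8, no_match)
```

The order of the candidates matters: strict encodings such as utf-8 should come first, because permissive ones will accept almost any input.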

 

 

4.5 Handling text files

When opening a text file, do not rely on the system default encoding; pass the encoding argument explicitly, because the default is error-prone.

Also, try to use Linux where possible, because utf-8 is the default for encoding and decoding throughout the system, which avoids errors caused by writing bytes in one encoding and reading them back in another.

The default encoding used when handling text files, after import locale:

In [810]: locale.getpreferredencoding()                                                                                 
Out[810]: 'UTF-8'
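A minimal sketch of passing the encoding explicitly on both the write and the read side, so that the result does not depend on the locale default. The temporary directory and the file name cafe.txt are my own choices for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

# Write with an explicit encoding instead of the system default.
with open(path, 'w', encoding='utf-8') as f:
    f.write('caf\u00e9')   # 'café'

# Read it back with the same explicit encoding.
with open(path, encoding='utf-8') as f:
    text = f.read()

# Four characters, but five bytes on disk, because é takes two bytes in UTF-8.
size = os.path.getsize(path)

# Binary mode returns the raw bytes without any decoding.
with open(path, 'rb') as f:
    raw = f.read()

print(text, size, raw)
```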

 

4.6 Normalizing Unicode strings for correct comparison (I feel there is little chance I will use this)

The rest of the chapter is mostly about Unicode handling for foreign text such as German and other Latin-script languages. I only skimmed it, so I am not writing notes on it.
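For reference, the core idea of normalization fits in a few lines. This is a minimal sketch using unicodedata.normalize from the standard library; the two spellings of café are the usual illustration of the problem.

```python
from unicodedata import normalize

s1 = 'caf\u00e9'     # é as a single code point
s2 = 'cafe\u0301'    # e followed by a combining acute accent

# The two strings look identical on screen but compare as different.
lengths = (len(s1), len(s2))
equal_raw = s1 == s2

# After normalizing both to the same form, they compare equal.
equal_nfc = normalize('NFC', s1) == normalize('NFC', s2)
equal_nfd = normalize('NFD', s1) == normalize('NFD', s2)

print(lengths, equal_raw, equal_nfc, equal_nfd)
```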

Origin www.cnblogs.com/sidianok/p/12057756.html