Source:
'''Crawl a Baidu Tieba forum page; the content differs from forum to forum'''

from urllib import request
from urllib import parse

# Common variables
base_url = "https://tieba.baidu.com/f?kw="
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}

# Build the URL (encode the name first, then concatenate, then request)
tb_name = input("Please enter the Tieba name: ")
key = parse.quote(tb_name)
url = base_url + key

print(url)

# Three steps
# Build the request object, wrapping the request headers
req = request.Request(url, headers=headers)
# Send the request with urlopen
res = request.urlopen(req)
# Read the response
html = res.read().decode('utf-8')

# print(html)

# Save to a file
with open('贴吧.txt', 'w') as f:
    f.write(html)
While crawling the data, the following error occurred:
Please enter the Tieba name: 美女吧
https://tieba.baidu.com/f?kw=%E7%BE%8E%E5%A5%B3%E5%90%A7
Traceback (most recent call last):
  File "D:/AID1812/Spider/day01/05_Baidu Tieba practice.py", line 29, in <module>
    f.write(html)
UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f236' in position 166141: illegal multibyte sequence
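The failure can be reproduced without any network access. The sketch below is only an illustration: `can_encode` is a hypothetical helper, not part of the original script, and it simply checks whether a string survives encoding with a given codec.

```python
# U+1F236 is the character named in the traceback above.
ch = "\U0001f236"

def can_encode(text, encoding):
    """Hypothetical helper: True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode(ch, "gbk"))     # False: GBK has no mapping for this character
print(can_encode(ch, "utf-8"))   # True: UTF-8 can encode any Unicode code point
```

Ordinary Chinese text such as the forum name encodes fine in GBK, which is why the write only fails once the page happens to contain a character like this one.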
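Solution:
On Windows with a Chinese locale, open() defaults to the GBK encoding, which cannot represent the character '\U0001f236' found in the page. Adding encoding='utf-8' to the open() call fixes the problem.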
# Save to a file
with open('贴吧.txt', 'w', encoding='utf-8') as f:
    f.write(html)
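As a quick check of the fix, here is a minimal sketch that writes a string containing the problem character and reads it back. The helper names `save_text` and `load_text`, the sample string, and the temporary file name are assumptions made for illustration; they do not appear in the original script.

```python
import locale
import os
import tempfile

def save_text(path, text):
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

def load_text(path):
    """Read the file back with the same explicit encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# The encoding open() falls back to when none is given; it varies by platform.
print(locale.getpreferredencoding(False))

sample = "贴吧 \U0001f236"   # includes the character from the traceback
path = os.path.join(tempfile.mkdtemp(), "tieba_sample.txt")
save_text(path, sample)
print(load_text(path) == sample)
```

Passing the same encoding when reading matters too: a file written as UTF-8 but read back with the platform default may fail to decode or come out garbled.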