Multi-threaded crawling requires, in addition to the modules used previously, the `threading` and `queue` modules:
- Open a thread for each task: building the url_list, sending requests to fetch pages, extracting data, and saving data
- Add three `Queue` attributes in the class's `__init__` method to store: urls, response content, and processed data
- Rewrite each of the previous methods so that it takes its input directly from a queue; the methods then need no extra parameters
- Whenever an item is taken out of a queue, remember to call `task_done()` to decrement the queue's task counter
- In the `run()` method, open a thread for each task that needs to execute; for slow tasks, such as network requests, you can open several threads
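Before the full spider, the `Queue` counting contract is worth seeing on its own. A minimal single-threaded sketch (variable names are illustrative): every `put()` increments an internal unfinished-task counter, `get()` alone does not decrement it, and `join()` blocks until `task_done()` has been called once for every `put()`.

```python
from queue import Queue

q = Queue()
for i in range(3):
    q.put(i)  # each put() increments the unfinished-task counter

results = []
while not q.empty():
    item = q.get()        # get() removes an item but does NOT decrement the counter
    results.append(item * 2)
    q.task_done()         # task_done() decrements the counter; join() unblocks at zero

q.join()  # returns immediately here because every task was marked done
print(results)
```

Forgetting `task_done()` is the classic mistake: `join()` would then block forever even though the queue is empty.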
Two steps in `run()` are essential:
- Set the worker threads as daemon threads, marking them as non-essential: when the main thread ends, the child threads end with it
- Then block the main thread on the queues, waiting until every task in every queue is finished
If you skip either of these two steps:
- Without daemon threads, the program can never end, because the worker threads loop forever
- Without blocking the main thread on the queues, the program may end before the work is done
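These two steps can be sketched with a toy worker (the `worker` function and `processed` list are illustrative, not part of the spider): the daemon flag lets the program exit despite the worker's infinite loop, and `queue.join()` keeps the main thread alive until every task is marked done.

```python
import threading
from queue import Queue

task_queue = Queue()
for n in range(5):
    task_queue.put(n)

processed = []

def worker():
    while True:  # infinite loop: this thread never exits on its own
        n = task_queue.get()
        processed.append(n * n)
        task_queue.task_done()

t = threading.Thread(target=worker)
t.daemon = True    # step 1: daemon thread dies when the main thread exits
t.start()

task_queue.join()  # step 2: block until every put() has a matching task_done()
print(sorted(processed))
```

Remove step 2 and `processed` may be printed before the worker has touched anything; remove step 1 and the script hangs after printing, because the non-daemon worker is still blocked in `get()`.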
After the above explanation, here is the full code:
```python
import requests
import json
import threading
from queue import Queue
from lxml import etree


class QiubaSpider(object):
    """Crawl data from the hot posts on Qiushibaike"""

    def __init__(self):
        self.url_temp = 'https://www.qiushibaike.com/text/page/{}/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }
        self.url_queue = Queue()      # queue storing urls
        self.html_queue = Queue()     # queue storing responses
        self.content_queue = Queue()  # queue storing content_list

    def get_url_list(self):  # build the url list
        # return [self.url_temp.format(i) for i in range(1, 14)]
        for i in range(1, 14):
            self.url_queue.put(self.url_temp.format(i))  # put each constructed url into the queue

    def pass_url(self):  # send requests
        while True:
            url = self.url_queue.get()  # take a url out of the queue
            print(url)
            response = requests.get(url, headers=self.headers)
            # return response.content.decode()
            self.html_queue.put(response.content.decode())  # put the returned result into the queue
            self.url_queue.task_done()  # decrement the count by one
            print(1)

    def get_content_list(self):  # extract data
        while True:
            html_str = self.html_queue.get()  # take a response out of the queue
            html = etree.HTML(html_str)
            div_list = html.xpath('//div[@id="content-left"]/div')  # group by post
            content_list = []
            for div in div_list:
                item = {}
                # the lines below all use xpath plus a few functions to clean the data
                item['content'] = div.xpath('.//div[@class="content"]/span/text()')
                item['content'] = [i.replace('\n', '') for i in item['content']]
                item['author_gender'] = div.xpath('.//div[contains(@class, "articleGend")]/@class')
                item['author_gender'] = item['author_gender'][0].split(' ')[-1].replace('Icon', '') if len(
                    item['author_gender']) > 0 else None
                item['author_age'] = div.xpath('.//div[contains(@class, "articleGend")]/text()')
                item['author_age'] = item['author_age'][0] if len(item['author_age']) > 0 else None
                item['author_img'] = div.xpath('.//div[@class="author clearfix"]//img/@src')
                # the src attribute is protocol-relative (//...), so prepend the scheme
                item['author_img'] = 'https:' + item['author_img'][0] if len(item['author_img']) > 0 else None
                item['stats_vote'] = div.xpath('.//span[@class="stats-vote"]/i/text()')
                item['stats_vote'] = item['stats_vote'][0] if len(item['stats_vote']) > 0 else None
                content_list.append(item)
            # return content_list
            self.content_queue.put(content_list)
            self.html_queue.task_done()  # decrement the count by one
            print(2)

    def save_content_list(self):  # save
        while True:
            content_list = self.content_queue.get()  # take a content_list out of the queue
            with open('qiuba.txt', 'a', encoding='utf-8') as f:
                f.write(json.dumps(content_list, ensure_ascii=False, indent=4))
                f.write('\n')  # newline
            self.content_queue.task_done()  # decrement the count by one
            print(3)

    def run(self):  # implement the main logic
        """Open a thread for each task; each takes data from a queue, so no parameters need to be passed"""
        thread_list = []  # holds the threads, since starting each kind one by one is too tedious
        # 1. build url_list, 13 pages in total
        t_url = threading.Thread(target=self.get_url_list)
        thread_list.append(t_url)
        # 2. send requests and collect responses
        for i in range(5):  # open 5 threads for sending requests, just loop
            t_pass = threading.Thread(target=self.pass_url)
            thread_list.append(t_pass)
        # 3. extract data
        for i in range(3):  # open 3 threads for extracting data
            t_html = threading.Thread(target=self.get_content_list)
            thread_list.append(t_html)
        # 4. save data
        t_save = threading.Thread(target=self.save_content_list)
        thread_list.append(t_save)
        for t in thread_list:
            t.daemon = True  # set as a daemon thread: when the main thread ends, the child threads end
            t.start()
        for q in [self.url_queue, self.html_queue, self.content_queue]:
            q.join()  # block the main thread, waiting until every task in the queue is finished
        print('Main thread finished!')


if __name__ == '__main__':
    qiubai = QiubaSpider()
    qiubai.run()
```