Multi-threaded crawling of Qiushibaike's popular posts (a rewrite of the blog from the day before yesterday)

Multi-threaded crawling needs the threading module and the queue module on top of the modules used previously:

  • Open a thread for each job: building the url_list, sending the requests, extracting the data, saving the data
  • Instantiate three queue attributes in the __init__ method to store the urls, the response content, and the processed data
  • Rewrite each method from the previous code so that it takes what it needs directly from a queue; the methods then need no extra parameters
  • Whenever data is taken out of a queue, remember to call the task_done() method to decrement the queue's count (see the sketch after this list)
  • Everything the run() method has to execute gets its own thread; slow work, such as the network requests, can be given several threads
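
A minimal sketch of what the fourth point means (the task string is just a placeholder): get() alone does not mark an item as finished; the unfinished-task count that q.join() later waits on only drops when task_done() is called.

from queue import Queue

q = Queue()
q.put('a task')    # put() increments the queue's unfinished-task count
task = q.get()     # get() hands the item over but does NOT decrement the count
print('handling', task)
q.task_done()      # only task_done() decrements it; q.join() waits for it to reach zero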

These two steps are very important:

  • Set the child threads as daemon threads, marking them as non-essential: when the main thread ends, the child threads end with it
  • Then block the main thread and make it wait, so that it finishes only after every task in the queues is done

Skipping either of these two steps causes problems:

  • Without the daemon-thread step, the while True loops in the child threads keep the program from ever ending
  • Without blocking the main thread, the program ends while the other threads still have work left unfinished (both steps are shown in isolation in the sketch below)
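
Here is a minimal runnable sketch of both steps together; the worker and the numbers are placeholders, not part of the spider:

import threading
from queue import Queue

q = Queue()
for n in range(3):
    q.put(n)

def worker():
    while True:  # infinite loop: this thread never finishes on its own
        n = q.get()
        print('processed', n)
        q.task_done()

t = threading.Thread(target=worker)
t.daemon = True  # step 1: without this, the infinite loop keeps the program alive forever
t.start()

q.join()  # step 2: without this, the main thread would exit before the work is done
print('main thread ends, and the daemon worker dies with it')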

With the explanation out of the way, here is the code:

import requests
import json
import threading
from queue import Queue
from lxml import etree


class QiubaSpider(object):
    """Crawl the popular posts on Qiushibaike"""

    def __init__(self):
        self.url_temp = 'https://www.qiushibaike.com/text/page/{}/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }
        self.url_queue = Queue()  # queue that stores the urls
        self.html_queue = Queue()  # queue that stores the responses
        self.content_queue = Queue()  # queue that stores the content_list

    def get_url_list(self):  # build the url_list
        # return [self.url_temp.format(i) for i in range(1, 14)]
        for i in range(1, 14):
            self.url_queue.put(self.url_temp.format(i))  # put each constructed url into the queue

    def pass_url(self):  # send the requests
        while True:
            url = self.url_queue.get()  # take a url out of the queue
            print(url)
            response = requests.get(url, headers=self.headers)
            # return response.content.decode()
            self.html_queue.put(response.content.decode())  # put the returned result into the queue
            self.url_queue.task_done()  # decrement the count by one
            print(1)

    def get_content_list(self):  # extract the data
        while True:
            html_str = self.html_queue.get()  # take a response out of the queue
            html = etree.HTML(html_str)
            div_list = html.xpath('//div[@id="content-left"]/div')  # one div per post
            content_list = []
            for div in div_list:
                item = {}
                # the lines below are all xpath plus a little processing of the data
                item['content'] = div.xpath('.//div[@class="content"]/span/text()')
                item['content'] = [i.replace('\n', '') for i in item['content']]
                item['author_gender'] = div.xpath('.//div[contains(@class, "articleGend")]/@class')
                item['author_gender'] = item['author_gender'][0].split(' ')[-1].replace('Icon', '') if len(
                    item['author_gender']) > 0 else None
                item['author_age'] = div.xpath('.//div[contains(@class, "articleGend")]/text()')
                item['author_age'] = item['author_age'][0] if len(item['author_age']) > 0 else None
                item['author_img'] = div.xpath('.//div[@class="author clearfix"]//img/@src')
                # the src is protocol-relative (//...), so prepend the scheme
                item['author_img'] = 'https:' + item['author_img'][0] if len(item['author_img']) > 0 else None
                item['stats_vote'] = div.xpath('.//span[@class="stats-vote"]/i/text()')
                item['stats_vote'] = item['stats_vote'][0] if len(item['stats_vote']) > 0 else None
                content_list.append(item)
            # return content_list
            self.content_queue.put(content_list)
            self.html_queue.task_done()  # decrement the count by one
            print(2)

    def save_content_list(self):  # save the data
        while True:
            content_list = self.content_queue.get()  # take a content_list out of the queue
            with open('qiuba.txt', 'a', encoding='utf-8') as f:
                f.write(json.dumps(content_list, ensure_ascii=False, indent=4))
                f.write('\n')  # line break between dumps
            self.content_queue.task_done()  # decrement the count by one
            print(3)

    def run(self):  # the main logic
        """Open a thread for everything; the methods take what they need from the queues, so no parameters are passed"""
        thread_list = []  # holds the threads, because starting the four kinds of threads one by one is too much trouble
        # 1. build the url_list, 13 pages in total
        t_url = threading.Thread(target=self.get_url_list)
        thread_list.append(t_url)
        # 2. traverse the urls, send the requests, get the responses
        for i in range(5):  # open 5 threads here for sending the requests, just loop
            t_pass = threading.Thread(target=self.pass_url)
            thread_list.append(t_pass)
        # 3. extract the data
        for i in range(3):  # open 3 threads here for extracting the data
            t_html = threading.Thread(target=self.get_content_list)
            thread_list.append(t_html)
        # 4. save the data
        t_save = threading.Thread(target=self.save_content_list)
        thread_list.append(t_save)
        for t in thread_list:
            t.daemon = True  # set the child threads as daemon threads: when the main thread ends, they end too
            t.start()
        for q in [self.url_queue, self.html_queue, self.content_queue]:
            q.join()  # block the main thread; it continues only once every task in the queue is done
        print('Main thread finished!')


if __name__ == '__main__':
    qiubai = QiubaSpider()
    qiubai.run()
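
One caveat about the output: because save_content_list() appends one pretty-printed JSON array per batch, qiuba.txt is not a single JSON document, so a plain json.load() on it will fail. A sketch of one way to read it back, assuming the file was written by the code above (load_saved is a made-up helper name, not part of the spider):

import json

def load_saved(path='qiuba.txt'):
    """Parse the concatenated JSON arrays that save_content_list() appends."""
    decoder = json.JSONDecoder()
    items = []
    with open(path, encoding='utf-8') as f:
        text = f.read()
    idx = 0
    while idx < len(text):
        if text[idx].isspace():  # skip the newline written between two dumps
            idx += 1
            continue
        content_list, idx = decoder.raw_decode(text, idx)  # parse one array, get its end offset
        items.extend(content_list)
    return items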

 

Source: www.cnblogs.com/springionic/p/11122261.html