Today we extend the earlier single-threaded image crawler into a multi-threaded one. In the end it does crawl and store the pictures, but a few problems remain along the way. The target URL is https://www.quanjing.com/category/1286521/1.html.
The full code is below; the tricky parts are explained in comments.
# Multi-threaded crawler: each thread crawls one page at a time
import requests
import threading
import queue
from bs4 import BeautifulSoup
import re
import time

string = "https://www.quanjing.com/category/1286521/"
pipei = re.compile('.*?<img.*?lowsrc="(.*?)"')  # regex for the image links

class Spiders(threading.Thread):
    name = 1  # counter used to name each picture (note: this shadows threading.Thread's own name attribute)

    def __init__(self, queue, page):
        threading.Thread.__init__(self)
        self.queue = queue
        self.page = page

    def run(self):  # thread entry point
        while not self.queue.empty():
            url = self.queue.get_nowait()
            # print(url)
            self.request_url(url)

    def request_url(self, url):  # crawl and store the images of one page
        html = requests.get(url=url).text
        soup = BeautifulSoup(html, 'lxml')
        # find all image links inside the element with class "gallery_list"
        li = str(soup.find_all(attrs={'class': "gallery_list"}))
        lianjies = re.findall(pipei, li)  # regex-match each image link
        for lianjie in lianjies:
            result = requests.get(url=lianjie).content  # fetch the binary image data
            # store the image as binary data
            with open('E:\\py project\\quanjingwang\\image{0}\\{1}.jpg'.format(self.page, self.name), 'ab+') as f:
                f.write(result)
            print("{0} stored".format(self.name))
            self.name += 1

# create the threads
def main():
    url_queue = queue.Queue()
    for i in range(1, 11):
        url = string + str(i) + ".html"
        url_queue.put(url)
    thread_list = []  # list of threads
    thread_counter = 5  # number of threads; create five folders image0-image4 in the target directory beforehand, one per thread (thread 0 writes to image0, and so on)
    for i in range(thread_counter):
        t = Spiders(url_queue, i)
        thread_list.append(t)
    for t in thread_list:
        t.start()
    # for t in thread_list:
    #     print("thread id: %d" % t.ident)        # get the thread id
    #     print("thread name: %s" % t.getName())  # get the thread name
    for t in thread_list:
        t.join()

if __name__ == '__main__':
    start_time = time.time()
    main()
    print("Time with five threads: %f" % (time.time() - start_time))
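One subtle point in the `run` loop above: checking `queue.empty()` and then calling `get_nowait()` is not atomic, so another thread can drain the queue between the two calls and `get_nowait()` will raise `queue.Empty`. A minimal sketch of the more robust pattern (the names `worker` and `results` are just illustrative, not part of the crawler):

```python
import queue
import threading

def worker(q, results, lock):
    # Instead of testing q.empty() first, just try to take an item and
    # treat queue.Empty as the "no work left" signal. This closes the
    # race window between the emptiness check and the get.
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            break
        with lock:
            results.append(item)

q = queue.Queue()
for i in range(10):
    q.put(i)

results = []
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(q, results, lock))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # every queued item consumed exactly once
```

`queue.Queue` itself is thread-safe, so the lock here only protects the shared `results` list, not the queue.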
At first I wanted to put all the crawled pictures into a single folder, but because of the naming problem the files kept overwriting each other, and each run only the last page's pictures survived; I finally solved it with a class-level counter variable (name). Just now, though, it occurred to me that simply changing the naming scheme would also work: if each image has its own title, naming the file by that title not only avoids the collisions but is also more intuitive and easier to search. (I still have a lot to learn... if there are mistakes in this article, corrections are welcome...)
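The title-based naming idea can be sketched as follows. This is a hypothetical sketch, not code from the crawler above: it assumes the page's <img> tags carry both a lowsrc and a title attribute (an assumption about quanjing's markup, not verified here), and `safe_filename` is a helper introduced just for this example.

```python
import re

# Assumed markup: <img ... lowsrc="..." ... title="...">; capture both
# the image URL and its title in one pass.
pattern = re.compile(r'<img[^>]*?lowsrc="(.*?)"[^>]*?title="(.*?)"')

def safe_filename(title):
    # Strip characters Windows forbids in filenames; fall back to a
    # placeholder if nothing is left after cleaning.
    cleaned = re.sub(r'[\\/:*?"<>|]', '', title).strip()
    return cleaned or 'untitled'

# Toy HTML standing in for one gallery_list entry.
html = '<li><img lowsrc="http://example.com/a.jpg" title="sunset: over/sea"></li>'
for lianjie, title in pattern.findall(html):
    filename = safe_filename(title) + '.jpg'
    print(lianjie, '->', filename)
```

Since titles may repeat on a page, a real version would still want a tiebreaker (e.g. appending the counter) when two images share a title.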