Multi-threaded Python image-crawling example

Today we turn an earlier single-threaded picture crawler into a multi-threaded one. In the end it does crawl and store the pictures, but a few problems remain. The target URL is https://www.quanjing.com/category/1286521/1.html.

The code is given directly below, with the tricky parts annotated.

# Multi-threaded crawling: each thread crawls one page
import requests
import threading
import queue
from bs4 import BeautifulSoup
import re
import time

string = "https://www.quanjing.com/category/1286521/"
pipei = re.compile('.*?<img lowsrc="(.*?)"')


class Spiders(threading.Thread):
    name = 1  # cumulative class variable, used to number each picture

    def __init__(self, queue, page):
        threading.Thread.__init__(self)
        self.queue = queue
        self.page = page

    def run(self):  # thread entry point
        while not self.queue.empty():
            url = self.queue.get_nowait()
            # print(url)
            self.request_url(url)

    def request_url(self, url):  # crawl and store the images of one page
        html = requests.get(url=url).text
        soup = BeautifulSoup(html, 'lxml')
        # find all image links by the class attribute
        li = str(soup.find_all(attrs={'class': "gallery_list"}))
        lianjies = re.findall(pipei, li)  # regex-match each image link
        for lianjie in lianjies:
            result = requests.get(url=lianjie).content  # fetch the binary image data
            # store the image as binary data
            with open(r'E:\py project\quanjingwang\image{0}\{1}.jpg'.format(self.page, self.name), 'ab+') as f:
                f.write(result)
            print("{0} stored".format(self.name))
            self.name += 1

# create the threads


def main():
    url_queue = queue.Queue()
    for i in range(1, 11):
        url = string + str(i) + ".html"
        url_queue.put(url)
    thread_list = []  # list of threads
    # number of threads; beforehand, create five folders image0-image4 in the
    # same directory, one per thread, so thread i stores its results in image{i}
    thread_counter = 5
    for i in range(thread_counter):
        t = Spiders(url_queue, i)
        thread_list.append(t)
    for t in thread_list:
        t.start()
    # for t in thread_list:
    #     print("thread id: %d" % t.ident)        # get the thread id
    #     print("thread name: %s" % t.getName())  # get the thread name
    for t in thread_list:
        t.join()


if __name__ == '__main__':
    start_time = time.time()
    main()
    print("time used with five threads: %f" % (time.time() - start_time))
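One caveat with the run() loop above: checking empty() and then calling get_nowait() is a check-then-act race. Between the two calls another thread may drain the queue, and get_nowait() then raises queue.Empty. A minimal sketch of the safer pattern, catching the exception instead of trusting empty() (the worker here just collects URLs rather than downloading, so it runs standalone):

```python
import queue
import threading

def worker(q, results):
    # Drain the queue; rely on the queue.Empty exception, not on empty(),
    # to decide when to stop -- this closes the race window.
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            break
        results.append(item)  # list.append is thread-safe in CPython

q = queue.Queue()
for i in range(1, 11):
    q.put("https://www.quanjing.com/category/1286521/{0}.html".format(i))

results = []
threads = [threading.Thread(target=worker, args=(q, results)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # all 10 URLs consumed exactly once
```

The same try/except body could be dropped into Spiders.run() without changing anything else in the class.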

At first I wanted to put all the crawled pictures into a single folder, but because of the naming issue they kept overwriting each other, leaving only one page's worth of pictures each time; I finally solved this with a class-level variable (name). Just now it occurred to me that a different naming scheme would also work: each picture has its own title, and naming files by title not only avoids the collisions but is also more intuitive and easier to search. (If there are mistakes in this article, corrections are welcome...)
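A minimal sketch of that title-based naming idea. The sample markup and the title attribute are assumptions for illustration (the real page is only known to carry lowsrc), and the title is sanitized because Windows forbids certain characters in file names:

```python
import re
from bs4 import BeautifulSoup

# Sample markup in the style the regex above targets; the title attribute
# is an assumption about the page, not verified against the real site.
html = ('<li class="gallery_list">'
        '<img lowsrc="http://pic.example.com/1.jpg" title="Sunset over the sea">'
        '</li>')

soup = BeautifulSoup(html, 'html.parser')  # html.parser avoids the lxml dependency
for img in soup.find_all('img'):
    title = img.get('title', 'unnamed')
    # replace characters Windows forbids in file names
    filename = re.sub(r'[\\/:*?"<>|]', '_', title) + '.jpg'
    print(filename)  # Sunset over the sea.jpg
```

With this scheme the shared counter (and the per-thread folders) would no longer be needed, since each file name comes from its own image.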


Origin www.cnblogs.com/liangxiyang/p/11125761.html