Python multi-threaded crawler

Multi-threaded crawler

In some cases, such as downloading images, the work is time-consuming. If the downloads are done synchronously, one after another as before, the crawler will be very slow. In that situation we can consider downloading the images with multiple threads.

Introduction to multithreading:

Multithreading runs several tasks concurrently, improving the efficiency of the system by making better use of its resources. Threads are used when several tasks need to be completed at the same time.
The simplest analogy: a process is like a train, and its threads are like the carriages of that train. A carriage cannot run without the train, and in the same way one train can have several carriages. Multithreading exists to improve efficiency, but it also brings some problems with it. For more detail, see: https://baike.baidu.com/item/多线程/1190404?fr=aladdin

Introduction to the threading module:

The threading module is the module Python provides specifically for multithreaded programming. The most commonly used class in the threading module is Thread. Here is a simple multithreaded program:

import threading
import time

def coding():
    for x in range(3):
        print('%s is writing code' % x)
        time.sleep(1)

def drawing():
    for x in range(3):
        print('%s is drawing' % x)
        time.sleep(1)

# Runs the two tasks one after the other (about 6 seconds in total)
def single_thread():
    coding()
    drawing()

# Runs the two tasks in separate threads (about 3 seconds in total)
def multi_thread():
    t1 = threading.Thread(target=coding)
    t2 = threading.Thread(target=drawing)

    t1.start()
    t2.start()

if __name__ == '__main__':
    multi_thread()
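The program in this section defines single_thread() but never calls it, so the speed-up is not visible. The following is a minimal sketch of how the two approaches could be timed. It assumes shorter sleeps than the original, and the helper names run_single and run_multi are made up for illustration. It also calls join(), which the original does not use, so that the timer stops only after both threads have finished.

```python
import threading
import time

def coding(delay=0.05):
    for x in range(3):
        time.sleep(delay)  # stands in for slow work such as a download

def drawing(delay=0.05):
    for x in range(3):
        time.sleep(delay)

def run_single():
    # Run both tasks one after the other and return the elapsed time.
    start = time.perf_counter()
    coding()
    drawing()
    return time.perf_counter() - start

def run_multi():
    # Run both tasks in separate threads and return the elapsed time.
    start = time.perf_counter()
    t1 = threading.Thread(target=coding)
    t2 = threading.Thread(target=drawing)
    t1.start()
    t2.start()
    t1.join()  # wait for both threads before stopping the timer
    t2.join()
    return time.perf_counter() - start

single_elapsed = run_single()
multi_elapsed = run_multi()
print('single: %.2fs, multi: %.2fs' % (single_elapsed, multi_elapsed))
```

Because both tasks spend their time waiting, the threaded run should take roughly half as long as the sequential one.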

View the number of threads:

The threading.enumerate() function returns a list of the threads that are currently alive; the length of that list is the current number of threads.
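As a small sketch of threading.enumerate() (the worker function and the thread count are assumptions chosen for illustration):

```python
import threading
import time

def worker():
    time.sleep(0.2)  # keep the thread alive long enough to be listed

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

# The list contains every live thread, including the main thread.
alive = threading.enumerate()
print(len(alive), alive)

for t in threads:
    t.join()
```

threading.active_count() returns the same number directly when the thread objects themselves are not needed.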

Check the name of the current thread:

The threading.current_thread() function returns the Thread object for the current thread, from which you can read information such as its name.
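A short sketch of reading the thread name, assuming a hypothetical thread named 'worker-1':

```python
import threading

names = []

def report():
    # Inside the thread, current_thread() is the Thread object running this code.
    names.append(threading.current_thread().name)

t = threading.Thread(target=report, name='worker-1')
t.start()
t.join()

print(names)                            # name recorded inside the worker
print(threading.current_thread().name)  # name of the thread running this line
```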

Inheriting from the threading.Thread class:

To encapsulate thread code better, you can define a class that inherits from the Thread class in the threading module and implement its run method. When the thread is started, it automatically executes the code in run. Sample code is as follows:

import threading
import time

class CodingThread(threading.Thread):
    def run(self):
        for x in range(3):
            print('%s is writing code' % threading.current_thread())
            time.sleep(1)

class DrawingThread(threading.Thread):
    def run(self):
        for x in range(3):
            print('%s is drawing' % threading.current_thread())
            time.sleep(1)

def multi_thread():
    t1 = CodingThread()
    t2 = DrawingThread()

    t1.start()
    t2.start()

if __name__ == '__main__':
    multi_thread()

Sharing global variables between threads:

All threads run inside the same process, so the process's global variables are shared by every thread. This creates a problem: the order in which threads execute is not deterministic, which can lead to incorrect data. For example, consider the following code:

import threading

tickets = 0

def get_ticket():
    global tickets
    for x in range(1000000):
        tickets += 1
    print('tickets:%d'%tickets)

def main():
    for x in range(2):
        t = threading.Thread(target=get_ticket)
        t.start()

if __name__ == '__main__':
    main()

Normally the final result should be 2000000, because two threads each add 1 a million times. But because the way the threads interleave is not deterministic, an update can be lost, so the final result may differ from run to run.

Lock mechanism:

To solve the shared-global-variable problem above, threading provides the Lock class. A thread can acquire the lock before it accesses a variable; other threads then cannot enter until the current thread has finished and released the lock, and only then can they proceed. Sample code is as follows:

import threading

VALUE = 0

gLock = threading.Lock()

def add_value():
    global VALUE
    gLock.acquire()
    for x in range(1000000):
        VALUE += 1
    gLock.release()
    print('value:%d'%VALUE)

def main():
    for x in range(2):
        t = threading.Thread(target=add_value)
        t.start()

if __name__ == '__main__':
    main()
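A Lock can also be used as a context manager, which releases it even if the protected code raises an exception. The following is a sketch of that variant, not the author's code: the names counter, lock and run are assumptions, the loop counts are reduced, and the lock is taken once per increment rather than once around the whole loop as in the example above.

```python
import threading

counter = 0
lock = threading.Lock()

def add_value(n):
    global counter
    for _ in range(n):
        # "with lock" acquires on entry and always releases on exit.
        with lock:
            counter += 1

def run(n_threads=2, n=100000):
    # Reset the counter, run the workers, and return the final value.
    global counter
    counter = 0
    threads = [threading.Thread(target=add_value, args=(n,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run())
```

Locking around each increment lets the threads interleave more finely, at the cost of many more acquire and release calls. Locking around the whole loop is cheaper but serializes the threads completely.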

Lock version of the producer-consumer pattern:

The producer-consumer pattern is common in multithreaded development. Producer threads generate data and store it in an intermediate variable. Consumer threads then take the data out of that intermediate variable and consume it. Because the intermediate variable is usually a global variable, locks are needed to keep the data consistent. Here is an example of the producer-consumer pattern implemented with threading.Lock:

import threading
import random
import time

gMoney = 1000
gLock = threading.Lock()
# Counts how many times the producers have produced; production stops after 10 times
gTimes = 0

class Producer(threading.Thread):
    def run(self):
        global gMoney
        global gLock
        global gTimes
        while True:
            money = random.randint(100, 1000)
            gLock.acquire()
            # If production has already happened 10 times, stop producing
            if gTimes >= 10:
                gLock.release()
                break
            gMoney += money
            print('%s deposited %s yuan, balance is %s yuan' % (threading.current_thread(), money, gMoney))
            gTimes += 1
            time.sleep(0.5)
            gLock.release()

class Consumer(threading.Thread):
    def run(self):
        global gMoney
        global gLock
        global gTimes
        while True:
            money = random.randint(100, 500)
            gLock.acquire()
            if gMoney > money:
                gMoney -= money
                print('%s withdrew %s yuan, balance is %s yuan' % (threading.current_thread(), money, gMoney))
                time.sleep(0.5)
            else:
                # Not enough money: production may already be over, so check the count
                if gTimes >= 10:
                    gLock.release()
                    break
                print("%s wants to withdraw %s yuan, balance is %s yuan, not enough!" % (threading.current_thread(), money, gMoney))
            gLock.release()

def main():
    for x in range(5):
        Consumer(name='consumer thread %d' % x).start()

    for x in range(5):
        Producer(name='producer thread %d' % x).start()

if __name__ == '__main__':
    main()

Condition version of the producer-consumer pattern:

The Lock version of the producer-consumer pattern runs correctly, but it has a shortcoming: each consumer sits in a while True loop, repeatedly acquiring the lock just to check whether there is enough money. Repeated locking and unlocking wastes CPU time, so this is not the best approach. A better way is to use threading.Condition. With threading.Condition, a thread can block in a waiting state while there is no data, and once suitable data is available another thread can call notify or a related function to wake the waiting threads. This avoids useless locking and unlocking and improves the program's performance. threading.Condition is similar to threading.Lock: it can be locked while global data is modified and unlocked once the modification is complete. The commonly used functions are briefly introduced below:

  1. acquire: Acquires the lock.
  2. release: Releases the lock.
  3. wait: Puts the current thread into a waiting state and releases the lock. The thread can be woken by another thread calling notify or notify_all. Once woken, it waits to reacquire the lock and then continues with the code that follows.
  4. notify: Wakes one waiting thread, by default the first thread that started waiting.
  5. notify_all: Wakes all waiting threads. Neither notify nor notify_all releases the lock, and both must be called before release.

The Condition version of the producer-consumer code is as follows:

import threading
import random
import time

gMoney = 1000
gCondition = threading.Condition()
gTimes = 0
gTotalTimes = 5

class Producer(threading.Thread):
    def run(self):
        global gMoney
        global gCondition
        global gTimes
        while True:
            money = random.randint(100, 1000)
            gCondition.acquire()
            if gTimes >= gTotalTimes:
                gCondition.release()
                print('The producers have produced %s times in total' % gTimes)
                break
            gMoney += money
            print('%s deposited %s yuan, balance is %s yuan' % (threading.current_thread(), money, gMoney))
            gTimes += 1
            time.sleep(0.5)
            # Wake the waiting consumers; the lock is still held until release()
            gCondition.notify_all()
            gCondition.release()

class Consumer(threading.Thread):
    def run(self):
        global gMoney
        global gCondition
        while True:
            money = random.randint(100, 500)
            gCondition.acquire()
            # A while loop is needed here: by the time this thread gets its turn,
            # the condition may no longer hold
            while gMoney < money:
                if gTimes >= gTotalTimes:
                    gCondition.release()
                    return
                print('%s wants to withdraw %s yuan, balance is %s yuan, not enough!' % (threading.current_thread(), money, gMoney))
                gCondition.wait()
            gMoney -= money
            print('%s withdrew %s yuan, balance is %s yuan' % (threading.current_thread(), money, gMoney))
            time.sleep(0.5)
            gCondition.release()

def main():
    for x in range(5):
        Consumer(name='consumer thread %d' % x).start()

    for x in range(2):
        Producer(name='producer thread %d' % x).start()

if __name__ == '__main__':
    main()
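The wait-in-a-loop idiom can also be written with Condition.wait_for, which re-checks a predicate each time the thread is woken. The sketch below is a simplified variant rather than the author's program: the names balance, deposit and withdraw are assumptions, and there is a single deposit and a single withdrawal.

```python
import threading

balance = 0
cond = threading.Condition()

def deposit(amount):
    global balance
    with cond:             # acquire the underlying lock
        balance += amount
        cond.notify_all()  # wake waiters; the lock is released on leaving "with"

def withdraw(amount, timeout=2.0):
    global balance
    with cond:
        # wait_for releases the lock while blocked and returns the
        # predicate's last value (False if the timeout expired first).
        if not cond.wait_for(lambda: balance >= amount, timeout=timeout):
            return False
        balance -= amount
        return True

results = []
consumer = threading.Thread(target=lambda: results.append(withdraw(300)))
consumer.start()
deposit(500)
consumer.join()
print(results, balance)
```

The timeout keeps a consumer from waiting forever if no producer ever deposits enough, which plays the same role as the gTimes check in the example above.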

Queue, the thread-safe queue:

When threads access global variables, locking is routine. If you want to keep some data in a queue, Python has a built-in thread-safe module for that, called queue. The queue module provides synchronized, thread-safe queue classes, including the FIFO (first in, first out) queue Queue and the LIFO (last in, first out) queue LifoQueue. These queues implement locking primitives internally (operations that can be thought of as atomic: either not done at all or done completely), so they can be used directly from multiple threads. You can use a queue to synchronize threads. The related functions are as follows:

  1. Queue(maxsize): Creates a FIFO queue.
  2. qsize(): Returns the size of the queue.
  3. empty(): Returns whether the queue is empty.
  4. full(): Returns whether the queue is full.
  5. get(): Removes and returns an item from the queue. For a FIFO queue this is the oldest item.
  6. put(): Puts an item into the queue.
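The functions listed above can be tried out in a few lines. This is a minimal single-threaded sketch; the item values are arbitrary.

```python
from queue import Queue, LifoQueue, Empty

q = Queue(maxsize=3)          # FIFO queue holding at most 3 items
for item in ('a', 'b', 'c'):
    q.put(item)

size, is_full, is_empty = q.qsize(), q.full(), q.empty()
first = q.get()               # FIFO: the oldest item comes out first

lifo = LifoQueue()
lifo.put(1)
lifo.put(2)
last_in = lifo.get()          # LIFO: the newest item comes out first

# get() blocks by default; with block=False it raises Empty instead.
try:
    Queue().get(block=False)
    raised_empty = False
except Empty:
    raised_empty = True

print(size, is_full, is_empty, first, last_in, raised_empty)
```

Note that with several threads running, the values returned by qsize(), empty() and full() can be out of date by the time they are used, so they are best treated as hints.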

Downloading meme images (表情包) with multiple threads using the producer-consumer pattern:

import threading
import requests
from lxml import etree
from urllib import request
import os
import re
from queue import Queue

class Producer(threading.Thread):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.parse_page(url)

    def parse_page(self, url):
        # Fetch one list page and queue every non-GIF image found on it
        response = requests.get(url, headers=self.headers)
        text = response.text
        html = etree.HTML(text)
        imgs = html.xpath("//div[@class='page-content text-center']//a//img")
        for img in imgs:
            if img.get('class') == 'gif':
                continue
            img_url = img.xpath(".//@data-original")[0]
            suffix = os.path.splitext(img_url)[1]
            alt = img.xpath(".//@alt")[0]
            # Strip punctuation that should not appear in a file name
            alt = re.sub(r'[,。??,/\\·]', '', alt)
            img_name = alt + suffix
            self.img_queue.put((img_url, img_name))

class Consumer(threading.Thread):
    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            # Stop once both queues are empty. Note that a producer may still
            # be parsing its last page at this moment.
            if self.img_queue.empty():
                if self.page_queue.empty():
                    return
            img = self.img_queue.get(block=True)
            url, filename = img
            request.urlretrieve(url, 'images/' + filename)
            print(filename + ' downloaded!')

def main():
    page_queue = Queue(100)
    img_queue = Queue(500)
    # urlretrieve does not create the target directory, so make sure it exists
    os.makedirs('images', exist_ok=True)
    for x in range(1, 101):
        url = "http://www.doutula.com/photo/list/?page=%d" % x
        page_queue.put(url)

    for x in range(5):
        t = Producer(page_queue, img_queue)
        t.start()

    for x in range(5):
        t = Consumer(page_queue, img_queue)
        t.start()

if __name__ == '__main__':
    main()
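Deciding when to stop by checking empty() can let a consumer exit while a producer is still parsing its last page, or leave one blocked in get() forever. One common alternative is to push a sentinel value through the queue to signal the end. The sketch below shows that pattern without any network access. It is not the author's code: the STOP object, the fake image names and the run helper are all assumptions for illustration.

```python
import threading
from queue import Queue

STOP = object()  # sentinel: tells a worker that no more items will arrive

def producer(page_queue, item_queue):
    while True:
        page = page_queue.get()
        if page is STOP:
            break
        # Stand-in for parsing a page: pretend each page yields 3 images.
        for i in range(3):
            item_queue.put('%s-img%d' % (page, i))

def consumer(item_queue, results, lock):
    while True:
        item = item_queue.get()
        if item is STOP:
            break
        with lock:
            results.append(item)  # stand-in for downloading the image

def run(pages, n_producers=2, n_consumers=3):
    page_queue, item_queue = Queue(), Queue()
    results, lock = [], threading.Lock()
    producers = [threading.Thread(target=producer, args=(page_queue, item_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(item_queue, results, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for page in pages:
        page_queue.put(page)
    for _ in producers:        # one sentinel per producer
        page_queue.put(STOP)
    for t in producers:
        t.join()               # all items are queued once producers finish
    for _ in consumers:        # one sentinel per consumer
        item_queue.put(STOP)
    for t in consumers:
        t.join()
    return results

print(len(run(['p1', 'p2', 'p3', 'p4'])))
```

Because the consumer sentinels are queued only after every producer has finished, no item can arrive after a consumer has decided to stop.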

GIL, the Global Interpreter Lock:

Python's default interpreter is CPython. Multithreading in the CPython interpreter is in effect fake multithreading: on a multi-core CPU it can use only one core and cannot take advantage of the others. Only one thread executes at any moment, and to guarantee this, CPython has something called the GIL (Global Interpreter Lock). This interpreter lock is necessary because CPython's memory management is not thread safe. Besides CPython there are other interpreters, and some of them have no GIL, as listed below:

  1. Jython: A Python interpreter implemented in Java. It has no GIL. For more details, see: https://zh.wikipedia.org/wiki/Jython
  2. IronPython: A Python interpreter implemented in .NET. It has no GIL. For more details, see: https://zh.wikipedia.org/wiki/IronPython
  3. PyPy: A Python interpreter implemented in Python. It has a GIL. For more details, see: https://zh.wikipedia.org/wiki/PyPy
    Although multithreading under the GIL is fake, it can still improve efficiency considerably for IO operations such as reading and writing files or making network requests. Multithreading is therefore recommended for IO-bound work. For CPU-bound computation, multithreading is not recommended; use multiple processes instead.
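To illustrate why threads still help with IO under the GIL, here is a small sketch in which time.sleep stands in for a blocking network or file operation. The function name fake_io and the timings are assumptions. A blocked thread releases the GIL, so the waits overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(n):
    time.sleep(0.2)  # simulated IO wait; the GIL is released while sleeping
    return n * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Four 0.2s waits run concurrently instead of taking 0.8s in sequence.
    results = list(pool.map(fake_io, range(4)))
elapsed = time.perf_counter() - start

print(results, '%.2fs' % elapsed)
```

For CPU-bound work the same structure could use concurrent.futures.ProcessPoolExecutor or the multiprocessing module, so that each worker runs in its own process with its own interpreter lock.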

Downloading jokes from 百思不得姐 (budejie.com) with multiple threads:

import requests
from lxml import etree
import threading
from queue import Queue, Empty
import csv

class BSSpider(threading.Thread):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }

    def __init__(self, page_queue, joke_queue, *args, **kwargs):
        super(BSSpider, self).__init__(*args, **kwargs)
        self.base_domain = 'http://www.budejie.com'
        self.page_queue = page_queue
        self.joke_queue = joke_queue

    def run(self):
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            response = requests.get(url, headers=self.headers)
            text = response.text
            html = etree.HTML(text)
            descs = html.xpath("//div[@class='j-r-list-c-desc']")
            for desc in descs:
                jokes = desc.xpath(".//text()")
                joke = "\n".join(jokes).strip()
                link = self.base_domain + desc.xpath(".//a/@href")[0]
                self.joke_queue.put((joke, link))
            print('=' * 30 + "Page %s downloaded!" % url.split('/')[-1] + "=" * 30)

class BSWriter(threading.Thread):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }

    def __init__(self, joke_queue, writer, gLock, *args, **kwargs):
        super(BSWriter, self).__init__(*args, **kwargs)
        self.joke_queue = joke_queue
        self.writer = writer
        self.lock = gLock

    def run(self):
        while True:
            try:
                # Give up if no joke arrives within 40 seconds
                joke_info = self.joke_queue.get(timeout=40)
            except Empty:
                break
            joke, link = joke_info
            # The csv writer is shared by all writer threads, so guard it with the lock
            with self.lock:
                self.writer.writerow((joke, link))
            print('Saved one row')

def main():
    page_queue = Queue(10)
    joke_queue = Queue(500)
    gLock = threading.Lock()
    fp = open('bsbdj.csv', 'a', newline='', encoding='utf-8')
    writer = csv.writer(fp)
    writer.writerow(('content', 'link'))

    for x in range(1, 11):
        url = 'http://www.budejie.com/text/%d' % x
        page_queue.put(url)

    for x in range(5):
        t = BSSpider(page_queue, joke_queue)
        t.start()

    for x in range(5):
        t = BSWriter(joke_queue, writer, gLock)
        t.start()

if __name__ == '__main__':
    main()

Origin www.cnblogs.com/csnd/p/11469326.html