Python Crawlers: From Single-Threaded to Multi-Threaded and Multi-Process to Accelerate Data Acquisition and Analysis

Foreword

When crawling data, if the amount to be fetched is large and you need it quickly, consider rewriting a single-threaded crawler as a multi-threaded one. Let's first go over the basics, then look at how to write the code.

1. Processes and threads

A process can be understood as an instance of a running program. A process is an independent unit that owns resources; a thread is not. Because scheduling a process carries a relatively large overhead, threads were introduced. A process can contain multiple threads at the same time; these threads share the process's resources, and switching between threads is very cheap. In short, operating systems introduced processes to enable concurrent execution of multiple programs and to improve resource utilization and system throughput, and introduced threads to reduce the time and space overhead of concurrent execution and improve the system's concurrency.

Here is a simple illustration. Open the Task Manager on your computer, as shown in Figure 1: each running program is a process. If you compare a process to a job and assign 10 people to do that job, those 10 people are 10 threads. This is why, within a certain range, multi-threading is more efficient than single-threading.

Figure 1. The running processes shown in Task Manager
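The resource-sharing difference described above can be demonstrated with a short standard-library sketch (not from the original article): a thread modifies the same memory the main program sees, while a child process works on its own copy.

```python
import threading
import multiprocessing

counter = {'n': 0}

def bump():
    counter['n'] += 1

if __name__ == '__main__':
    # A thread shares the process's memory, so its change is visible here.
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    print(counter['n'])  # 1

    # A child process gets its own copy, so the parent's dict is unchanged.
    p = multiprocessing.Process(target=bump)
    p.start()
    p.join()
    print(counter['n'])  # still 1
```

This is exactly why threads are cheap to coordinate (shared memory) while processes are more isolated.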


2. Multithreading and single threading in Python

In everyday practice we mostly use single-threaded crawlers. Generally speaking, if the resources to be crawled are not especially large, a single thread is sufficient.

Python code runs in a single thread by default, which simply means the code executes in order: the first line runs, then the second, and so on. Everything in the previous chapters was practiced in this single-threaded style.

For example, when batch-downloading images from a website, downloading is a time-consuming operation. Doing it in a single thread is particularly inefficient, because most of the time is spent waiting for each download to finish. To save time, we can download the images with multiple threads instead.

The threading module is Python's module for multi-threaded programming. It wraps the lower-level _thread module and is more convenient to use. Suppose we want to run two activities, writing code and playing games, in separate threads. The example code is as follows.

import threading
import time

# Define the first task
def coding():
    for x in range(3):
        print('%s is writing code\n' % x)
        time.sleep(1)

# Define the second task
def playing():
    for x in range(3):
        print('%s is playing games\n' % x)
        time.sleep(1)

# Run both tasks with multiple threads
def multi_thread():
    start = time.time()
    # Thread creates the first thread; the target parameter is the function name
    t1 = threading.Thread(target=coding)
    t1.start()  # start the thread
    # Create the second thread
    t2 = threading.Thread(target=playing)
    t2.start()
    # join makes the main thread wait until each child thread finishes
    t1.join()
    t2.join()
    end = time.time()
    running_time = end - start
    print('Total running time: %.5f seconds' % running_time)

if __name__ == '__main__':
    multi_thread()  # run the multi-threaded version

The running result is shown in Figure 2.
Figure 2. Multi-threaded running results

How long does a single-threaded version take? The example code is as follows.

import time

# Define the first task
def coding():
    for x in range(3):
        print('%s is writing code\n' % x)
        time.sleep(1)

# Define the second task
def playing():
    for x in range(3):
        print('%s is playing games\n' % x)
        time.sleep(1)

# Run the two tasks one after the other, timing the whole run
def single_thread():
    start = time.time()
    coding()
    playing()
    end = time.time()
    running_time = end - start
    print('Total running time: %.5f seconds' % running_time)

if __name__ == '__main__':
    single_thread()  # run the single-threaded version

The running result is shown in Figure 3.
Figure 3. Single-threaded running results

Comparing the two runs: in the multi-threaded version, writing code and playing games execute together, while in the single-threaded version the code is written first and the games are played afterwards. With such a small workload the time difference is slight, but as the workload grows you will find that multi-threading saves noticeably more time. The case also shows the flip side: when there are only a few tasks to perform, a single thread is all you need.
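The same comparison can be written more compactly with concurrent.futures.ThreadPoolExecutor from the standard library. This is an alternative to managing Thread objects by hand (not used in the original article); the sketch below shortens the sleeps so the overlap is easy to see:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(name):
    time.sleep(0.2)  # stand-in for a slow operation such as a download
    return name

start = time.time()
# Two workers run the two tasks concurrently; map preserves input order
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(task, ['coding', 'playing']))
elapsed = time.time() - start

# The two 0.2 s sleeps overlap, so the total is ~0.2 s, not 0.4 s
print(results, round(elapsed, 1))
```

The executor also handles start/join bookkeeping automatically, which matters once you have more tasks than you want to manage by hand.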


3. From single-threaded to multi-threaded

Take crawling images from a live-streaming site as an example. The example code is as follows.

import requests
from lxml import etree
import time
import os

dirpath = '图片/'  # folder for the downloaded images
if not os.path.exists(dirpath):
    os.mkdir(dirpath)  # create the folder

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
}

def get_photo():
    url = 'https://www.huya.com/g/4079/'  # target site
    response = requests.get(url=url, headers=header)  # send the request
    data = etree.HTML(response.text)  # parse the response into an HTML tree
    return data

def jiexi():
    data = get_photo()
    image_url = data.xpath('//a//img//@data-original')
    image_name = data.xpath('//a//img[@class="pic"]//@alt')
    for ur, name in zip(image_url, image_name):
        url = ur.replace('?imageview/4/0/w/338/h/190/blur/1', '')
        title = name + '.jpg'
        response = requests.get(url=url, headers=header)  # request each full-size image
        with open(dirpath + title, 'wb') as f:
            f.write(response.content)
        print('Downloaded ' + name)
        time.sleep(2)

if __name__ == '__main__':
    jiexi()

To turn this into a multi-threaded crawler, only the main block needs to change; for example, create 4 threads for crawling (remember to add import threading at the top of the script). The example code is as follows.

import threading  # add this to the imports at the top of the script

if __name__ == "__main__":
    threads = []
    start = time.time()
    # Create four threads
    for i in range(4):
        # Pass the function object itself: jiexi, not jiexi(), which would
        # call it immediately; jiexi also takes no arguments, so no args tuple
        thread = threading.Thread(target=jiexi)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    end = time.time()
    running_time = end - start
    print('Total time: %.5f seconds' % running_time)
    print('All done!')  # main program
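One caveat about the conversion above: all four threads run the entire jiexi() function, so every image would be downloaded four times rather than the work being split. A common fix is to have the threads pull distinct items from a shared queue.Queue. The sketch below uses placeholder strings in place of the real (url, title) pairs that jiexi() collects:

```python
import queue
import threading

q = queue.Queue()
for i in range(8):
    q.put('image-%d' % i)  # stand-ins for the image URLs gathered by the parser

downloaded = []
lock = threading.Lock()

def worker():
    while True:
        try:
            item = q.get_nowait()  # each item is taken by exactly one thread
        except queue.Empty:
            return
        # a real worker would call requests.get(item) and write the file here
        with lock:
            downloaded.append(item)
        q.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(downloaded))  # 8 – every image handled exactly once, not four times
```

Queue is thread-safe, so the four workers can share it without extra locking; the Lock here only protects the plain Python list used to collect results.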

4. Book recommendation


This book introduces common techniques for Python 3 web crawlers. It starts with basic knowledge of web pages, then covers the urllib and Requests request libraries and the XPath and Beautiful Soup parsing libraries, moves on to Selenium for crawling dynamic websites and the Scrapy crawler framework, and closes with Linux fundamentals so readers can deploy their own crawler scripts.

This book is mainly aimed at beginners who are interested in web crawlers.

How to participate

1️⃣ How to enter: follow, like, bookmark, and leave a comment (up to three comments per person)
2️⃣ Winner selection: 2 readers will be randomly chosen by a program; each winner receives one copy of the book
3️⃣ Event deadline: 2023-08-13 22:00:00

Note: After the event ends, the winners will be announced on my home page as scheduled, and the books will be shipped to them free of charge.


Origin blog.csdn.net/m0_63947499/article/details/132240115