What problem does parallel programming solve?

A multi-threaded crawler uses multiple threads to request web pages and parse responses concurrently, improving the crawler's throughput. In Python it can be implemented with modules such as threading, queue, and requests.

Parallel programming is a style of programming that utilizes multiple processors/cores/threads to execute code simultaneously. It can solve the following problems:


Improve program performance

In multi-task or multi-process scenarios, parallel programming can improve a program's efficiency and responsiveness by making full use of available computing resources, allowing the program to complete its work sooner.
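
As a minimal sketch of this speedup, the snippet below times five simulated I/O-bound tasks run serially versus in parallel threads; time.sleep stands in for real I/O such as a network request:

```python
import threading
import time

def io_task():
    """Simulate an I/O-bound task (e.g. a network request)."""
    time.sleep(0.2)

# Serial: tasks run one after another (~1.0 s total)
start = time.perf_counter()
for _ in range(5):
    io_task()
serial_time = time.perf_counter() - start

# Parallel: each task runs in its own thread (~0.2 s total)
start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel_time = time.perf_counter() - start

print(f'serial: {serial_time:.2f}s, threaded: {parallel_time:.2f}s')
```

Note that this speedup applies to I/O-bound work; for CPU-bound work in Python, the GIL limits what threads alone can gain.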

Mitigate single points of failure

In a traditional serial program, a bug or crash halts the entire program. With parallel programming, work is divided into multiple subtasks, so a failure in one subtask need not bring down the whole program.
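
A minimal sketch of this isolation: each thread catches its own failure, so one failing subtask does not stop the others. The condition task_id == 2 is an arbitrary stand-in for a subtask that goes wrong:

```python
import threading

results, errors = [], []

def safe_worker(task_id):
    """Each subtask handles its own failure so the others keep running."""
    try:
        if task_id == 2:
            raise ValueError(f'task {task_id} failed')
        results.append(task_id)
    except ValueError as e:
        errors.append(str(e))

threads = [threading.Thread(target=safe_worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results, errors)
```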

Manage data sharing and synchronization

In a multi-process or multi-thread environment, multiple tasks may share the same data resource, so mechanisms such as locks or semaphores are needed to ensure data correctness, reliability, and synchronization, and to avoid problems such as data races, deadlock, and starvation.

Support large-scale distributed computing

In the field of cloud computing and big data, the amount of data is huge, and the processing capacity of a single machine is limited. A large-scale distributed computing framework is needed to support the storage, processing and analysis of massive data. Therefore, parallel programming is an important means to realize these frameworks.

In short, parallel programming improves the performance, reliability, and scalability of programs. It applies to multi-tasking, multi-process, multi-threaded, and distributed computing scenarios, and is an indispensable technique in modern computer programming.

Multithreaded programming

Multithreaded programming refers to running multiple threads in a program at the same time, and each thread can perform different tasks independently. Multithreaded programming can improve the performance and responsiveness of your program, especially if you process large amounts of data or need to perform multiple tasks simultaneously.

In multithreaded programming, you need to pay attention to the following points:

1. Thread safety

When multiple threads access shared resources at the same time, it is necessary to ensure the consistency and correctness of the data and avoid problems such as race conditions.

2. Synchronization mechanism

In order to ensure thread safety, synchronization mechanisms such as locks, semaphores, and condition variables need to be used.
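
As a minimal sketch of one such mechanism, a threading.Lock can protect a shared counter incremented by several threads; the names counter and increment are illustrative:

```python
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        # The read-modify-write below is not atomic; without the lock,
        # concurrent threads could lose updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000
```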

3. Thread scheduling

When multiple threads run at the same time, CPU time slices need to be allocated reasonably, to prevent one thread from monopolizing the CPU and starving the others.

4. Thread pool

In order to avoid frequently creating and destroying threads, you can use a thread pool to manage threads and improve program performance and efficiency.

In practice, many programming languages and frameworks support multithreaded programming, such as Java's Thread class, Python's threading module, and C++'s std::thread library. Various tools and techniques can help debug and optimize multithreaded programs, such as debuggers, profilers, and multithreaded programming models.

Detailed explanation of multithreaded programming

Multithreaded programming is a style of programming that utilizes multiple threads (concurrent execution streams) to simultaneously execute code and complete tasks. It has the following characteristics:

Concurrent execution: Multiple threads can execute concurrently, making use of the CPU and other resources.

Shared memory: Multiple threads share the process's address space and memory resources, including global variables, code segments, and data segments, so care must be taken when accessing and modifying shared data.

Lightweight: Each thread is a lightweight execution flow, so threads are cheap to create, destroy, and switch between.

High complexity: Race conditions, deadlocks, and similar problems make multithreaded code harder to develop and debug.

In Python, multithreaded programming can be achieved using the threading module. Commonly used methods include:

Create a thread: use the threading.Thread class to create a new thread object and specify the function it will run.

import threading

def worker():
    """Thread worker function."""
    print('Hello, world!')

# Create a new thread and start it
t = threading.Thread(target=worker)
t.start()

Thread synchronization: Python provides multiple thread synchronization mechanisms (such as Lock, Event, Semaphore, Condition, etc.), which can coordinate the behavior of different threads.

import threading

# Create a semaphore with an initial value of 1
sem = threading.Semaphore(1)

def worker():
    sem.acquire()
    try:
        """Operate on the shared resource."""
    finally:
        sem.release()

Thread pool: to avoid the overhead of repeatedly creating and destroying threads, thread pool techniques (such as the concurrent.futures module) can be used to reuse threads and improve program efficiency.

from concurrent.futures import ThreadPoolExecutor

def worker():
    """Thread worker function."""
    print('Hello, world!')

# Create a thread pool with 4 worker threads
with ThreadPoolExecutor(max_workers=4) as executor:
    for i in range(10):
        executor.submit(worker)

Note that multithreaded code must handle the atomicity of shared data, thread start-up and shutdown, and synchronization carefully, so as to avoid data races, deadlocks, and related problems.

Here is an example of a simple multi-threaded crawler:

import requests
from queue import Queue, Empty
import threading

# Number of worker threads and the target URL
thread_num = 4
url = 'http://www.example.com'

# Queue holding the URLs waiting to be downloaded
url_queue = Queue()

# Enqueue the URLs
for i in range(100):
    url_queue.put(url)

# Thread worker function
def worker():
    while True:
        try:
            # Fetch the next URL; exit when the queue is empty
            url = url_queue.get(block=False)
        except Empty:
            break
        try:
            # Download and parse the response
            response = requests.get(url, timeout=10)
            content = response.text
            # Process or save the data here
        except requests.RequestException as e:
            # A failed request should not kill the worker thread
            print(e)

# Create and start the worker threads
threads = []
for i in range(thread_num):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

# Wait for all threads to finish
for t in threads:
    t.join()

In this example, we first define the number of threads and the target URL, and create a queue of URLs to be downloaded. We then create several threads, each of which takes a URL from the queue and downloads it with the requests library. Note that in a multi-threaded crawler, shared data must be accessed with proper synchronization to avoid data races. Finally, we wait for all threads to finish executing.


Origin blog.csdn.net/weixin_44617651/article/details/130941877