Python web crawler guide 2: multi-threaded web crawler, dynamic content crawling (to be continued)

This article contains course notes for "Python Web Crawler Advanced Guide" by Ma Soldier Education, parts of which were generated by AI. Courseware: Chapter 1, Chapter 2, Chapter 3, Chapter 4, Chapter 5, Chapter 6, Chapter 7, Chapter 9, Chapter 10.

1. Multi-threaded web crawler

1.1 Thread basics and the GIL

Here are some basic concepts:

  • program:
    • A collection of instructions written in a programming language to implement certain functions;
    • The program itself is just a set of static instructions and data and does not directly occupy computer resources.
  • process:
    • The started program is called a process, and a process has at least one thread;
    • A process is an execution unit in the operating system. Each process holds the resources required to run the program, such as memory space, file handles, and system state, so processes are independent of each other and their data is isolated;
    • Processes are typically scheduled by the operating system to enable concurrent execution among multiple processes.
  • Thread:
    • Threads are the basic unit of CPU scheduling and execution. A process can contain multiple threads;
    • Multiple threads share the resources of their process (except CPU resources), including memory space and system state, so exchanging data between threads is more convenient than between processes;
    • Threads were introduced to achieve multi-tasking concurrency more efficiently, because the creation and switching overhead of threads is much smaller than that of processes.

  Generally speaking, a program is a collection of instructions and data that describes the execution process of a task. A process is an execution unit in the operating system and contains the resources required for program execution. A thread is an execution unit within a process. Multiple threads share the resources of the process to achieve more efficient multi-tasking concurrency.

  When multiple threads of multiple processes are running at the same time, the CPU will allocate CPU time slices through a scheduling algorithm so that each thread of each process can be executed. Commonly used process scheduling algorithms are:

  • First come, first served (FCFS): CPU time slices are allocated according to the arrival time of processes or threads.
  • Priority scheduling: Allocate CPU time slices according to the priority of the process or thread. Processes or threads with higher priority will get more CPU time slices.
  • Round-robin scheduling: CPU time slices are handed out to processes or threads in turn, in a fixed cyclic order. Each process or thread receives an equal share of time slices.
  • Preemptive scheduling: allows the operating system to preempt an executing process or thread at any time and allocate CPU time slices to other processes or threads. This scheduling method ensures high responsiveness and fairness, but requires dealing with the overhead of context switching.

  In a multi-core CPU, multiple threads can run simultaneously, but each core can only execute one thread at a time. The operating system uses thread scheduling algorithms to decide which threads are assigned to which cores. The goal of thread scheduling is to maximize the performance of multi-core processors and ensure balance and fairness among threads.

  Open the Task Manager to see the processes currently active on your computer.

  CPython is the official implementation of the Python programming language and one of the most commonly used Python interpreters. In most cases, "Python" refers to CPython.

  Although CPython is the most commonly used Python implementation, there are others, such as Jython (running on the Java virtual machine), IronPython (running on the .NET platform), PyPy (a high-performance implementation with a JIT compiler), and MicroPython (for IoT devices and embedded systems). Each has advantages in specific scenarios.

  An important feature of CPython is the global interpreter lock (GIL), which allows only one thread to execute Python bytecode at a time. A program can still create multiple threads, but those threads cannot execute Python code in parallel: they take turns holding the GIL, and a thread that is waiting on I/O releases it so that another thread can run. As a result, multi-threaded CPython code cannot take full advantage of multi-core CPUs for CPU-bound work, although it remains effective for I/O-bound work.

  When performing computationally intensive tasks (image processing and video encoding, large-scale matrix operations, data processing, etc.), multi-threading cannot truly achieve multi-core parallel processing because multiple threads cannot execute on different cores at the same time. At this time, it is recommended to use the multiprocessing module or the concurrent.futures module to create multiple processes and make full use of the parallel computing capabilities of multi-core processors.

  Each process has its own interpreter and therefore its own GIL, so computationally intensive tasks can run in parallel on multiple cores.

  Although the GIL limits the effectiveness of multithreading for computationally intensive tasks, multithreading is still an appropriate model for I/O-intensive tasks. Because I/O operations (such as file reading and writing, network requests, database operations, image uploading and downloading, and user interface events) mostly involve waiting, other threads can run during the waiting period, making fuller use of CPU time.

  In addition, in the case of a single thread, the blocking of an I/O operation will cause the entire program to suspend execution until the I/O is completed. Using multithreading avoids this situation because other threads can still continue execution.
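The effect described above can be sketched with a simulated I/O task. This is only an illustration: fake_download, run_sequential and run_threaded are hypothetical helper names, and time.sleep stands in for a real network or disk wait (like real blocking I/O, it releases the GIL while waiting).

```python
import threading
import time

def fake_download(task_id, results):
    """Hypothetical stand-in for an I/O-bound call such as a network request."""
    time.sleep(0.2)                      # the thread waits here without using the CPU
    results[task_id] = f"page-{task_id}"

def run_sequential(n):
    """Run n simulated downloads one after another in the current thread."""
    results = {}
    start = time.perf_counter()
    for i in range(n):
        fake_download(i, results)
    return results, time.perf_counter() - start

def run_threaded(n):
    """Run n simulated downloads in n threads so that their waits overlap."""
    results = {}
    threads = [threading.Thread(target=fake_download, args=(i, results))
               for i in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start

seq_results, seq_time = run_sequential(5)
thr_results, thr_time = run_threaded(5)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Both runs produce the same results, but the threaded run should take roughly as long as one wait rather than five, because the waits overlap.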

  System monitoring tools (such as top and htop) show CPU usage and I/O wait. If CPU usage is high and I/O wait time is low, the program is probably computationally intensive; if CPU usage is low and I/O wait time is long, it is probably I/O intensive. This is only a rough guide; profile the program to be sure.

1.2 Two ways to create threads

See the official threading documentation.

threading is the Python standard library module for multi-threaded programming. There are two main ways to create thread objects:

  1. Pass a target function
import threading

# Target function
def my_task(param):
    print("Thread task with param:", param)

# Create a thread object, passing the target function and its arguments
my_thread = threading.Thread(target=my_task, args=("Hello",))
my_thread.start()

Output:

Thread task with param: Hello

  2. Subclass the Thread class
import threading

# If no custom attributes are needed, __init__ does not have to be overridden
class MyThread(threading.Thread):
    def __init__(self, param):
        super().__init__()
        self.param = param
        self.custom_data = ['Hello']  # custom attribute used to store data

    def run(self):
        print("Thread task with param:", self.param)
        self.custom_method()  # call the custom method

    def custom_method(self):
        print("Custom method called.")
        self.custom_data.append(self.param)

    def get_custom_data(self):
        return self.custom_data

# Create and start the custom thread object
my_thread = MyThread("World")
my_thread.start()
my_thread.join()

# Read the custom attribute through the custom method
custom_data = my_thread.get_custom_data()
print("Custom data:", custom_data)

Output:

Thread task with param: World
Custom method called.
Custom data: ['Hello', 'World']

  In the first method, we create a Thread object directly and pass it the target function and its arguments. In the second, we subclass Thread and override its run method, which first performs the thread's task and then calls the custom method. After the thread finishes, the main thread calls get_custom_data to read the custom attribute.

  Both methods can create thread objects, but there are some differences:

  • Passing the target function: simple and intuitive, with no need to create a new class. If you just need to run a function in a separate thread, passing a target function is more convenient.
  • Subclassing the Thread class: you define the thread's task by overriding the run method. This suits situations that need more control and encapsulation: custom methods and attributes can be added to the subclass, which helps in complex multi-threaded scenarios.

  If you subclass threading.Thread, override run, and also pass a target argument, only your run method is called. When the thread starts, it always executes run; the default implementation of run is what calls the target function, so overriding run replaces that behaviour.
  Setting target therefore has no effect in this case, unless your run method calls super().run().
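The interaction between run and target can be checked with a small sketch. The class names OverridingThread and CooperativeThread and the calls list are made up for this illustration.

```python
import threading

calls = []  # records which code actually ran, in order

def target_function():
    calls.append("target")

class OverridingThread(threading.Thread):
    def run(self):
        # Overriding run() replaces the default behaviour of calling target.
        calls.append("run")

class CooperativeThread(threading.Thread):
    def run(self):
        calls.append("run-before-target")
        super().run()  # the default run() is what invokes target(*args, **kwargs)

t1 = OverridingThread(target=target_function)
t1.start()
t1.join()

t2 = CooperativeThread(target=target_function)
t2.start()
t2.join()

print(calls)  # ['run', 'run-before-target', 'target']
```

The first thread never calls target_function; the second does, but only because its run method explicitly delegates to the parent implementation.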

1.3 The threading.Thread class

  In Python, thread objects are created from the threading.Thread class, either directly or by subclassing it. The following introduces the constructor of the Thread class and the meaning of each parameter.

class threading.Thread(group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None)
  • group: Thread group. Reserved for a future ThreadGroup class; it is currently not supported and should be None.
  • target: The callable to be executed by the thread. When the thread starts it calls its run method, and the default run method calls target.
  • name: The name of the thread (a string identifier).
    • Thread names identify and distinguish threads in a multi-threaded program, which is very useful when debugging.
    • By default, a unique name is constructed in the form "Thread-N", or "Thread-N (target)" if the target parameter is specified.
  • args: Positional arguments (a tuple) passed to the target function.
  • kwargs: Keyword arguments (a dictionary) passed to the target function.
  • daemon: Whether the thread is a daemon thread. The default, None, inherits the setting of the creating thread, so threads created from the main thread are non-daemon (False) unless specified. From the point of view of program exit there are two kinds of threads: non-daemon threads (including the main thread) and daemon threads.
    • The main thread is the entry point of the program. When it reaches the end of the program, the interpreter waits for all non-daemon threads to finish before exiting.
    • A daemon thread is a background thread. Once all non-daemon threads have ended, daemon threads are terminated abruptly, even if their work is unfinished, so they suit background tasks that need not run to completion, such as logging or monitoring.
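As a small sketch of the name and daemon parameters: the heartbeat function and the stop_event used to end it are assumptions made for this example, not part of the course code.

```python
import threading
import time

stop_event = threading.Event()  # hypothetical signal used to end the background loop

def heartbeat():
    # Background loop. Because the thread is a daemon, it would not keep
    # the interpreter alive if the rest of the program finished first.
    while not stop_event.is_set():
        time.sleep(0.05)

# Explicit name and daemon flag
worker = threading.Thread(target=heartbeat, name="heartbeat", daemon=True)

# daemon is left at None here, so the flag is inherited from the creating thread
default_thread = threading.Thread(target=heartbeat)

worker.start()
print(worker.name, worker.daemon)
print(default_thread.daemon == threading.current_thread().daemon)

stop_event.set()   # stop the loop cleanly instead of relying on abrupt termination
worker.join()
```

Stopping the daemon thread with an event, as done here, is optional; it simply avoids leaving the loop to be cut off at interpreter exit.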

The following introduces the main methods and attributes of the Thread class:

  1. start() method: Starts the thread, which then calls the thread object's run method in the new thread.

  2. join() method: Waits for the thread to finish.

    • When a thread object's join() method is called, the calling thread (often the main thread) blocks until the target thread finishes.
    • The optional timeout parameter gives the maximum waiting time in seconds. If the target thread has not finished within that time, join() returns anyway and the caller continues; call is_alive() afterwards to find out whether the thread actually finished.
    • Without join(), the main thread may run ahead of the target thread, so join() is how you make sure a thread's work is done before using its results.
  3. is_alive() method: Checks whether the thread is alive, that is, whether it has been started and has not yet finished.

  4. name attribute: Gets or sets the name of the thread.

  5. ident attribute: The thread's identifier, or None if the thread has not been started.

  6. daemon attribute: Gets or sets whether the thread is a daemon thread; it must be set before start() is called. Daemon threads are terminated when the program exits, which happens once all non-daemon threads have ended.

  7. target: The function to be executed by the thread. It is passed to the constructor; Thread does not expose it as a public attribute.

  8. args and kwargs: The positional and keyword arguments for the target function. Like target, they are constructor parameters rather than public attributes.
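The behaviour of join() with a timeout, together with is_alive() and ident, can be sketched as follows. The release event and the wait_for_release function are hypothetical names used only to keep the thread blocked for as long as the example needs.

```python
import threading

release = threading.Event()

def wait_for_release():
    release.wait()  # block until the main thread sets the event

t = threading.Thread(target=wait_for_release, name="waiter")
ident_before_start = t.ident        # None: the thread has not been started yet
t.start()

t.join(timeout=0.1)                 # gives up after about 0.1 s; join() returns None either way
alive_after_timeout = t.is_alive()  # True: the thread is still blocked on the event

release.set()                       # let the thread finish
t.join()                            # wait without a timeout
alive_after_join = t.is_alive()     # False: the thread has finished

print(ident_before_start, alive_after_timeout, alive_after_join)
```

Because join() gives no indication of whether the timeout expired, checking is_alive() afterwards is the way to tell the two cases apart.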

The following is an example that demonstrates how to use the main methods and properties of the Thread class:

import threading
import time

class MyThread(threading.Thread):
    def __init__(self, name, seconds):
        super().__init__()
        self.name = name
        self.seconds = seconds
        self.custom_data = []  # custom attribute used to store data

    def run(self):
        print(f"Thread {self.name} is running for {self.seconds} seconds.")
        self.custom_method()  # call the custom method
        time.sleep(self.seconds)
        print(f"Thread {self.name} is complete.")

    def custom_method(self):
        print(f"Custom method of Thread {self.name} is called.")
        self.custom_data.append(self.name)

    def get_custom_data(self):
        return self.custom_data

# Create and start the custom thread object
thread = MyThread(name="MyThread", seconds=3)
thread.start()

# Thread name, identifier, liveness and daemon flag
print("Thread name:", thread.name)
print("Thread identifier:", thread.ident)
print("Is thread alive:", thread.is_alive())
print("Is daemon thread:", thread.daemon)

# Wait for the thread to finish
thread.join()

# Read the custom attribute through the custom method
custom_data = thread.get_custom_data()
print("Custom data for Thread:", custom_data)

print("Main thread finished.")

Output of a typical run (the identifier differs from run to run):

Thread MyThread is running for 3 seconds.
Custom method of Thread MyThread is called.
Thread name: MyThread
Thread identifier: 7656
Is thread alive: True
Is daemon thread: False
Thread MyThread is complete.
Custom data for Thread: ['MyThread']
Main thread finished.

1.4 Common thread methods and lock mechanisms

  1. Common functions of the threading module

| threading module function | Effect |
| --- | --- |
| threading.active_count() | Returns the number of currently alive threads. The value equals the length of the list returned by enumerate(). |
| threading.current_thread() | Returns the current thread object. |
| threading.enumerate() | Returns a list of all alive thread objects, including daemon threads and dummy thread objects created by current_thread(). |
| threading.main_thread() | Returns the main thread object. Normally, the main thread is the thread from which the Python interpreter was started. |
| threading.get_ident() | Returns the identifier of the current thread (a non-zero integer). |
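A short sketch of these module-level functions in use. The worker names and the gate event are assumptions for the example; the event simply keeps both workers alive while the snapshot is taken.

```python
import threading

gate = threading.Event()  # keeps the workers alive until the snapshot is taken
seen = {}                 # thread name -> identifier, recorded by each worker

def worker():
    me = threading.current_thread()        # the Thread object running this code
    seen[me.name] = threading.get_ident()  # same value as me.ident
    gate.wait()

workers = [threading.Thread(target=worker, name=f"worker-{i}") for i in range(2)]
for t in workers:
    t.start()

# Snapshot taken while both workers are still blocked on the event
alive_names = {t.name for t in threading.enumerate()}
count_while_running = threading.active_count()

gate.set()
for t in workers:
    t.join()

print(sorted(alive_names))
print(count_while_running, threading.main_thread().name)
```

The exact count and the list of names depend on what else is running in the process (for example, an IDE or notebook may add its own threads), so no fixed output is shown.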
  2. Thread safety and locking mechanisms

  In multi-threaded programming, problems can arise when multiple threads access and operate on shared resources at the same time. For example, one thread is modifying the value of a variable, while another thread is accessing and modifying the same variable at the same time. This may lead to data inconsistency or program crash. The following is an example of ticket sales at a station:

import threading
import time

ticket = 100  # global variable

def sale_ticket():
    # To modify the global variable ticket inside the function,
    # it must be declared with "global ticket"
    global ticket
    for i in range(1000):  # simulate 1000 people buying tickets
        while ticket > 0:  # keep selling until all tickets are sold
            print(threading.current_thread().name + ' --> selling ticket {}'.format(ticket))
            ticket -= 1
            time.sleep(0.1)

def start():
    for i in range(2):
        t = threading.Thread(target=sale_ticket)
        t.start()

if __name__ == '__main__':
    start()  # call the custom start() function to create and start the threads

Output (excerpt):

Thread-1 (sale_ticket) --> selling ticket 62
Thread-2 (sale_ticket) --> selling ticket 61
Thread-1 (sale_ticket) --> selling ticket 60Thread-2 (sale_ticket) --> selling ticket 60

Thread-1 (sale_ticket) --> selling ticket 58

  The output shows that ticket 60 was sold by two threads at the same time, which is a ticketing error. Because thread scheduling happens in real time, the result differs from run to run.

  To avoid this race condition (a problem that occurs when multiple threads access a shared resource at the same time), we can use locks to protect the shared resource and ensure that only one thread accesses it at a time. The threading module provides the Lock and RLock classes to implement the locking mechanism:

  • Lock: A mutex, the most basic kind of lock. Only one thread can hold the lock at a time. When one thread has acquired the lock, other threads that try to acquire it are blocked until it is released.
  • RLock: A reentrant lock, also known as a recursive lock. The same thread can acquire the same lock multiple times without deadlocking. Each acquisition increments the lock's counter, and the lock must be released the same number of times before it is actually free.

| Method | Effect |
| --- | --- |
| threading.Lock() | Creates a lock object. |
| lock.acquire(blocking=True, timeout=None) | Acquires the lock, blocking the current thread until the lock is available or the timeout expires. Returns True if the lock was acquired, otherwise False. |
| lock.release() | Releases the lock, allowing other threads to acquire it. |
| lock.locked() | Returns True if the lock is currently held by a thread, otherwise False. |
| lock.__enter__() | Used as part of the context manager protocol to acquire the lock. |
| lock.__exit__(exc_type, exc_value, traceback) | Used as part of the context manager protocol to release the lock. |

   RLock is created in the same way as Lock and provides the same acquire() and release() methods and context-manager support, so we won't repeat the details.
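The difference between the two lock types shows up when the same thread tries to acquire a lock it already holds. A minimal sketch, where countdown is a hypothetical recursive function used only for illustration:

```python
import threading

rlock = threading.RLock()
plain_lock = threading.Lock()

def countdown(n):
    """Recursive function that re-acquires the same RLock at every level."""
    with rlock:
        if n == 0:
            return 0
        return 1 + countdown(n - 1)  # nested acquisition by the same thread

depth = countdown(3)  # works: an RLock may be re-acquired by its owner

with plain_lock:
    # A plain Lock is not reentrant. A blocking acquire() here would
    # deadlock, so a non-blocking attempt is used to show that it fails.
    second_try = plain_lock.acquire(blocking=False)

print(depth, second_try)  # 3 False
```

If countdown used a plain Lock instead, the first recursive call would block forever, because the thread would be waiting for a lock that it holds itself.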

  We can call the acquire() and release() methods manually to manage the lock, or let a context manager do it. On entering the with block, lock.__enter__() is called and the lock is acquired; on leaving the with block, lock.__exit__() is called and the lock is released. The with statement therefore makes the code clearer and more concise.

import threading
import time

ticket = 100  # global variable
lock = threading.Lock()  # create a lock

def sale_ticket():
    global ticket
    while True:  # keep selling until all tickets are sold
        with lock:  # check and update the counter under the same lock
            if ticket <= 0:
                break
            print(threading.current_thread().name + ' --> selling ticket {}'.format(ticket))
            ticket -= 1
        time.sleep(0.1)  # sleep outside the lock so the other thread can sell

def start():
    for i in range(2):
        t = threading.Thread(target=sale_ticket)
        t.start()


if __name__ == '__main__':
    start()  # call the custom start() function to create and start the threads
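For comparison, the two ways of managing a lock can be put side by side. This is a sketch with a made-up shared counter; the try/finally in the manual version is what the with statement does implicitly.

```python
import threading

counter = 0
lock = threading.Lock()

def add_with_statement(n):
    global counter
    for _ in range(n):
        with lock:            # acquired on entry, released on exit
            counter += 1

def add_with_acquire_release(n):
    global counter
    for _ in range(n):
        lock.acquire()
        try:
            counter += 1
        finally:
            lock.release()    # released even if the protected code raises

# Two threads per style, each adding 10,000 to the shared counter
threads = [threading.Thread(target=func, args=(10_000,))
           for func in (add_with_statement, add_with_acquire_release)
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Both styles give the same protection; the manual form is mainly useful when the acquire and release cannot be placed in the same block, or when a timeout is needed.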
  3. Deadlock

  Improper use of locks can lead to deadlock. Deadlock occurs when multiple threads wait for each other to release resources, so that none of them can continue. The following classic example uses two threads and two locks to create a deadlock:

import threading
import time

lock1 = threading.Lock()
lock2 = threading.Lock()

def worker1():
    with lock1:
        print("Worker 1 acquired lock 1")
        # To provoke the deadlock, sleep after acquiring lock1 so that
        # worker2 has time to acquire lock2. worker1 then waits for lock2
        # while still holding lock1, and worker2 waits for lock1 while
        # still holding lock2, so neither can proceed.
        time.sleep(1)
        print("Worker 1 waiting for lock 2")
        with lock2:
            print("Worker 1 acquired lock 2")

def worker2():
    with lock2:
        print("Worker 2 acquired lock 2")
        print("Worker 2 waiting for lock 1")
        with lock1:
            print("Worker 2 acquired lock 1")

if __name__ == "__main__":
    thread1 = threading.Thread(target=worker1)
    thread2 = threading.Thread(target=worker2)
    
    thread1.start()
    thread2.start()
    
    thread1.join()
    thread2.join()
    
    print("Main thread finished")

  The following code changes the first lock acquisition in worker2 from lock2 to lock1, so that both threads will acquire locks in the same order, avoiding deadlock.

def worker2():
    with lock1:  # acquire the locks in the same order as worker1
        print("Worker 2 acquired lock 1")
        print("Worker 2 waiting for lock 2")
        with lock2:
            print("Worker 2 acquired lock 2")

  Other ways to reduce the risk of deadlock are to hold locks for as short a time as possible and to avoid time-consuming operations while holding one. In addition, acquiring with a timeout (lock1.acquire(timeout=1)) prevents a thread from waiting forever for a lock.
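The timeout approach can be sketched as follows. The try_lock helper and the outcomes list are hypothetical; the point is only that acquire() reports failure through its return value instead of blocking forever.

```python
import threading

lock1 = threading.Lock()
outcomes = []  # (label, result) pairs recorded by the worker threads

def try_lock(label):
    # acquire() returns True on success and False if the timeout expires
    if lock1.acquire(timeout=0.1):
        try:
            outcomes.append((label, "acquired"))
        finally:
            lock1.release()
    else:
        outcomes.append((label, "timed out"))  # give up instead of waiting forever

# Case 1: the main thread holds the lock, so the worker times out
with lock1:
    t = threading.Thread(target=try_lock, args=("while held",))
    t.start()
    t.join()

# Case 2: the lock is free, so the worker acquires it immediately
t = threading.Thread(target=try_lock, args=("after release",))
t.start()
t.join()

print(outcomes)  # [('while held', 'timed out'), ('after release', 'acquired')]
```

What a thread should do after a timeout (retry, release the locks it already holds, or report an error) depends on the application; the sketch only records the outcome.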

1.5 Producer-Consumer Model

1.5.1 Introduction to the producer-consumer model

  The Producer-Consumer Pattern is a common multi-threaded design pattern used to solve the collaboration problem between producers and consumers. In this model, there are two types of threads:

  1. Producer: Responsible for generating (producing) data or tasks and putting them into a shared buffer (queue). Producers continue to produce data until a certain condition is reached. If the buffer is full, the producer may need to wait.

  2. Consumer: Responsible for obtaining data or tasks from the shared buffer and processing them. The consumer keeps getting data from the buffer until a certain condition is reached. If the buffer is empty, the consumer may need to wait.

  The goal of the producer-consumer model is to achieve effective coordination between producers and consumers to avoid resource competition, improve efficiency and reduce thread waiting time. Here is a simple schematic diagram:

+----------------+     +----------------+     +----------------+
|    Producer    |     |     Buffer     |     |    Consumer    |
|                |<--->|                |<--->|                |
| Generates data |     |   Stores and   |     |   Takes data   |
| and puts it in |     |  coordinates   |     |from the buffer |
|   the buffer   |     | exchanged data |     |and processes it|
+----------------+     +----------------+     +----------------+

Here's a simple example:

import threading
import random
import time

g_money = 0
lock = threading.Lock()  # create the lock object

# Producer thread class
class Producer(threading.Thread):
    def run(self):
        global g_money
        for _ in range(10):
            with lock:  # acquire the lock and enter the critical section
                money = random.randint(1, 1000)
                g_money += money
                print(threading.current_thread().name, 'earned {}, current balance: {}'.format(money, g_money))
            time.sleep(0.1)  # sleep outside the lock so other threads can run

# Consumer thread class
class Customer(threading.Thread):
    def run(self):
        global g_money
        for _ in range(10):
            with lock:  # acquire the lock and enter the critical section
                money = random.randint(1000, 10000)
                if money <= g_money:
                    g_money -= money
                    print(threading.current_thread().name, 'spent {}, current balance: {}'.format(money, g_money))
                else:
                    print(threading.current_thread().name, 'wanted to spend {}, but the balance is insufficient; current balance: {}'.format(money, g_money))
            time.sleep(0.1)  # sleep outside the lock so other threads can run

# Start function: create and start the producer and consumer threads
def start():
    for i in range(5):
        th = Producer(name='Producer{}'.format(i))
        th.start()

    for i in range(5):
        cust = Customer(name='--------Consumer{}'.format(i))
        cust.start()

if __name__ == '__main__':
    start()

  This example simulates producers and consumers reading and writing a shared resource (g_money). A lock ensures that only one thread operates on the balance at a time, avoiding race conditions and inconsistent data. Each thread performs its operation ten times: earning or spending money and printing the current balance.

1.5.2 Coordinating threads with the Condition class

  When we run the producer and consumer code above, consumers often try to spend when the balance is insufficient, sometimes even after all producers have finished producing. Condition variables (Condition) can be used to coordinate the interaction between producer and consumer threads.

  A condition variable lets a thread wait until some condition is met, and lets other threads notify it when that happens. In Python's threading module, the Condition class provides this mechanism.

  A Condition object is built on a lock and exposes acquire() and release() methods for that underlying lock, so no additional lock object is needed. The main methods of the Condition class are:

| Method | Description |
| --- | --- |
| __init__(self, lock=None) | Constructor; creates a condition variable object. The optional lock parameter is a Lock or RLock object to use as the underlying lock for the wait and notify operations; if omitted, a new lock object is created. |
| acquire(self) | Acquires the lock on the condition variable. |
| release(self) | Releases the lock on the condition variable. |
| wait(self, timeout=None) | Releases the lock and waits until another thread calls notify() or notify_all(); after waking up, re-acquires the lock before returning. With the optional timeout parameter, the thread stops waiting after the specified number of seconds even if it has not been notified, re-acquires the lock, and continues. |
| notify(self, n=1) | Wakes up at most n threads waiting on the condition (one by default). Must be called while holding the lock. |
| notify_all(self) | Wakes up all threads waiting on the condition. Must be called while holding the lock, that is, before releasing it. |

The general pattern for using the Condition class is as follows:

  1. Acquire the lock on the condition variable.
  2. Check whether the condition is met. If not, call wait() and, after waking up, check again (use a while loop, because the condition may no longer hold by the time the thread re-acquires the lock).
  3. Once the condition is met, perform the operation.
  4. Release the lock on the condition variable.

Here is the improved code:

import threading
import random
import time

g_money = 0
lock = threading.Condition()  # create the condition variable
g_time = 0

# Producer thread class
class Producer(threading.Thread):
    def run(self):
        global g_money
        global g_time
        for _ in range(10):
            lock.acquire()  # acquire the condition's lock
            money = random.randint(1, 1000)
            g_money += money
            g_time += 1
            print(threading.current_thread().name, 'earned {}, current balance: {}'.format(money, g_money))
            time.sleep(0.1)
            lock.notify_all()  # notify the waiting consumers
            lock.release()     # release the lock

# Consumer thread class
class Customer(threading.Thread):
    def run(self):
        global g_money
        for _ in range(10):
            lock.acquire()  # acquire the lock
            money = random.randint(1000, 10000)
            while g_money < money:  # wait while the balance is insufficient
                if g_time >= 50:  # all 50 productions are done: end this consumer thread
                    lock.release()
                    return
                print(threading.current_thread().name, 'wanted to spend {}, but the balance is insufficient; balance: {}'.format(money, g_money))
                lock.wait()

            g_money -= money  # spend the money
            print(threading.current_thread().name, '------------spent {}, current balance: {}'.format(money, g_money))
            lock.release()  # release the lock

# Start function: create and start the producer and consumer threads
def start():
    for i in range(5):
        th = Producer(name='Producer{0}'.format(i))
        th.start()

    for i in range(5):
        cust = Customer(name='--------Consumer{}'.format(i))
        cust.start()

if __name__ == '__main__':
    start()
  • The global variable g_time counts the number of productions. When g_time reaches 50 (5 producers, 10 productions each), all producers have finished. A consumer whose balance is still insufficient at that point ends its thread.
  • A consumer thread that wants to spend checks in a loop whether the current balance is enough; if not, it waits.
  • When a producer thread earns money, it calls lock.notify_all() to wake the waiting consumer threads.
  • The condition variable and its lock synchronize producers and consumers, avoiding race conditions on the balance and consumers that wait forever.
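Stripped of the money simulation, the four-step Condition pattern described earlier reduces to a few lines. This is a minimal sketch with one producer and one consumer; the items and received lists are made up for the illustration, and the with statement stands in for the explicit acquire() and release() calls.

```python
import threading

condition = threading.Condition()
items = []      # shared buffer protected by the condition's lock
received = []   # what the consumer got

def consumer():
    with condition:                      # step 1: acquire the lock
        while not items:                 # step 2: re-check the condition in a loop
            condition.wait()             #         releases the lock while waiting
        received.append(items.pop(0))    # step 3: the condition holds, do the work
    # step 4: the lock is released on leaving the with block

def producer():
    with condition:
        items.append("data")
        condition.notify()               # wake one waiting thread; lock must be held

c = threading.Thread(target=consumer)
p = threading.Thread(target=producer)
c.start()
p.start()
c.join()
p.join()

print(received)  # ['data']
```

Because the consumer tests the condition before waiting, the result is the same whichever thread happens to run first.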

1.6 Thread-safe queues

   Python's built-in queue module implements several queue types: Queue (first in, first out), LifoQueue (last in, first out) and PriorityQueue (priority queue). These are thread-safe data structures: multiple threads can operate on the same queue without causing race conditions. They make multi-threaded patterns such as the producer-consumer model and task scheduling easier to implement. In more detail:

  • Putting data (put): When a producer thread calls put, the queue acquires its internal mutex so that no other thread can modify the queue at the same time. If the queue is full, the producer blocks until space becomes available. Once the item has been added, the queue uses a condition variable to notify a consumer thread waiting for data, and releases the mutex.

  • Getting data (get): When a consumer thread calls get, the queue acquires its internal mutex so that no other thread can modify the queue at the same time. If the queue is empty, the consumer blocks until there is data to consume. Once an item has been removed, the queue uses a condition variable to notify a producer thread waiting for free space, and releases the mutex.

  • Wait and notify: Condition variables drive the waiting and notification. When a consumer calls get on an empty queue, it enters the waiting state and releases the mutex. When a producer puts in new data, it acquires the mutex and notifies the waiting consumer through the condition variable. Likewise, a producer enters the waiting state when the queue is full and waits for a consumer to free up space.

  In short, a thread queue keeps its operations thread-safe in a multi-threaded environment through an internal mutex and condition variables. This avoids race conditions on the queue itself while providing convenient data sharing and communication between threads.

  Mutex: The queue uses a mutex internally to protect access to its data. A mutex is a synchronization mechanism that ensures only one thread can access the protected data at any time. A thread that needs to operate on the queue's data first tries to acquire the mutex; if another thread already holds it, the thread blocks until the mutex is released.
   Condition variable (Condition): The queue uses condition variables internally to implement waiting and notification between threads. A condition variable lets one or more threads wait for a specific condition; when the condition is met, the waiting threads are notified and continue. In thread queues, condition variables tell consumers that new data is available, and tell producers that the queue is no longer full.
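To make the mechanism concrete, here is a deliberately simplified bounded queue built from one mutex and two condition variables. TinyQueue is a hypothetical teaching class, not the actual implementation of queue.Queue, which also handles timeouts, non-blocking calls and task tracking.

```python
import threading
from collections import deque

class TinyQueue:
    """Simplified bounded FIFO queue for illustration only."""

    def __init__(self, maxsize):
        self._items = deque()
        self._maxsize = maxsize
        self._mutex = threading.Lock()
        # Both condition variables share the same mutex
        self._not_empty = threading.Condition(self._mutex)
        self._not_full = threading.Condition(self._mutex)

    def put(self, item):
        with self._not_full:                       # acquires the shared mutex
            while len(self._items) >= self._maxsize:
                self._not_full.wait()              # queue full: wait for space
            self._items.append(item)
            self._not_empty.notify()               # tell one waiting consumer

    def get(self):
        with self._not_empty:                      # acquires the shared mutex
            while not self._items:
                self._not_empty.wait()             # queue empty: wait for data
            item = self._items.popleft()
            self._not_full.notify()                # tell one waiting producer
            return item

q = TinyQueue(maxsize=2)
consumed = []

def produce():
    for i in range(10):
        q.put(i)          # blocks whenever two items are already waiting

def consume():
    for _ in range(10):
        consumed.append(q.get())

threads = [threading.Thread(target=produce), threading.Thread(target=consume)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(consumed)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With a capacity of 2, the producer is forced to wait for the consumer several times, yet the items still arrive in order.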

Main methods of the queue:

| Method | Description |
|---|---|
| q = Queue(maxsize), q = LifoQueue(maxsize), q = PriorityQueue(maxsize) | Create a new queue object (first-in-first-out, last-in-first-out, or priority order). maxsize is optional and sets the maximum capacity; the default 0 means no limit. |
| q.put(item, block=True, timeout=None) | Put item into the queue. With the default block=True, the call waits while the queue is full; with block=False it raises queue.Full immediately. The optional timeout sets the maximum waiting time in seconds, after which queue.Full is raised. |
| q.get(block=True, timeout=None) | Remove and return an item from the queue. block and timeout have the same meaning as for put, except that the call waits on an empty queue and raises queue.Empty. |
| q.put_nowait(item), q.get_nowait() | Like put and get, but never block: they raise queue.Full or queue.Empty right away if the queue is full or empty. |
| q.qsize() | Return the approximate number of items currently in the queue. |
| q.empty() | Return True if the queue is empty. |
| q.full() | Return True if the queue is full. |
| q.task_done() | Mark one task as finished. After a consumer has taken an item and processed it, it should call task_done() to tell the queue. |
| q.join() | Block until every item that was put into the queue has been marked with task_done(). |
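
The pairing of task_done() and join() is easiest to see in a small run. The sketch below is an illustration with made-up names (task_queue, worker, results): one worker squares the numbers it receives, and the main thread uses join() to wait until every item has been processed.

```python
import threading
from queue import Queue

task_queue = Queue()
results = []


def worker():
    while True:
        number = task_queue.get()          # blocks until an item is available
        results.append(number * number)    # "process" the item
        task_queue.task_done()             # exactly one task_done() per get()


# daemon=True lets the program exit even though the worker loops forever
threading.Thread(target=worker, daemon=True).start()

for number in range(5):
    task_queue.put(number)

task_queue.join()    # returns once task_done() has been called for all five items
print(results)
```

Because join() counts unfinished tasks rather than queued items, it returns only after the worker has finished processing, not merely after the queue has become empty.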

When using a thread queue, make sure a blocking call cannot leave a thread waiting forever, because that keeps the program from exiting normally.

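from queue import Queue, Empty

q = Queue(5)  # create a queue with a capacity of 5

# put four items into the queue
for i in range(4):
    q.put(i)

for _ in range(5):
    try:
        print(q.get(block=False))  # try to get an item without blocking
    except Empty:                  # raised when no item is left in the queue
        print('Queue is empty, program finished')
        break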

  In the example above, q.get(block=False) tells get not to block when the queue is empty; it raises queue.Empty instead. With the default block=True, get would wait until data arrives, and since nothing else fills this queue, the program could never exit. q.get_nowait() has the same effect as q.get(block=False).

Here is a simple example using a thread queue:

from queue import Queue  
import random  
import time  
import threading  

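# Producer thread function: adds random integers to the queue
def add_value(q):
    while True:
        q.put(random.randint(100, 1000))  # put a random integer into the queue
        time.sleep(1)                     # sleep for 1 second

# Consumer thread function: takes items from the queue and prints them
def get_value(q):
    while True:
        value = q.get()                   # blocks until an item is available
        print('Got item: {0}'.format(value))

# Start function: creates the queue and the threads, then starts the threads
def start():
    q = Queue(10)                         # queue with a maximum capacity of 10
    t1 = threading.Thread(target=add_value, args=(q,))
    t2 = threading.Thread(target=get_value, args=(q,))
    t1.start()                            # start the producer thread
    t2.start()                            # start the consumer thread

if __name__ == '__main__':
    start()  # both loops run forever, so the program does not stop on its own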

  args=(q,) wraps the queue object q in a one-element tuple and passes that tuple as the positional arguments of the thread function (the args parameter of threading.Thread expects a tuple). Inside add_value and get_value, the queue is then available through the function parameter q.
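
The example above never terminates, because both loops run forever. One common way to shut such a pair down cleanly is to send a sentinel value through the queue. The sketch below is a variation under that assumption; the names STOP, produce, and consume are hypothetical and not part of the course code.

```python
import threading
from queue import Queue

STOP = object()   # hypothetical sentinel: tells the consumer that no more data will come


def produce(q, count):
    for number in range(count):
        q.put(number)          # blocks while the queue is full
    q.put(STOP)                # signal the end of the stream


def consume(q, collected):
    while True:
        value = q.get()
        if value is STOP:      # stop instead of blocking on an empty queue forever
            break
        collected.append(value)


q = Queue(10)
collected = []
producer = threading.Thread(target=produce, args=(q, 20))
consumer = threading.Thread(target=consume, args=(q, collected))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(len(collected))
```

With several consumers, the producer side would need to put one sentinel per consumer so that every consumer thread sees its own stop signal.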

1.6 Multi-threaded crawling of Honor of Kings wallpapers

  The official Honor of Kings high-definition wallpaper page is https://pvp.qq.com/web201605/wallpaper.shtml. Each wallpaper comes in 7 sizes. Below we write crawler code that downloads every wallpaper in all of its sizes.


1.6.1 Web page analysis

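  First, determine whether the wallpaper images are static resources in the page or are loaded through Ajax requests (see section 2.1 of this article for details). There are two ways to check. Open the developer tools (F12) and refresh the page:

  1. Check the address bar: the wallpapers span 34 pages. Jump to the next page and the URL in the address bar does not change, which suggests Ajax requests. Click Fetch/XHR and you will see that Ajax requests are indeed being sent (the images are not yet visible under Preview).
  2. Check the source code.
    • Open the Elements tab, use the arrow on the left to select a wallpaper, and locate it in the element tree. It sits under the tag <div class="p_newhero_item">, and the ul tag below it contains the information for the other sizes.
    • Press Ctrl+U to open the page source and search for the tag you just copied (Ctrl+C). The source does contain <div class="p_newhero_item">, but the related markup is commented out, which also shows that the wallpapers are loaded through Ajax.

  Both checks indicate that the Honor of Kings wallpapers are loaded through Ajax. In the All tab, select the worklist entry and expand it under Preview to see the information for the 20 images on one page. This is the real data source.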
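
The source-code check from method 2 can also be scripted. The following is a minimal sketch using the standard-library html.parser: it reports whether a class name occurs in live markup, inside an HTML comment, or both. The helper names ClassLocator and locate_class and the sample HTML are invented for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class ClassLocator(HTMLParser):
    """Records where a CSS class name shows up in an HTML document."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.in_live_markup = False
        self.in_comment = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if self.class_name in classes:
            self.in_live_markup = True

    def handle_comment(self, data):
        if self.class_name in data:      # the markup exists only as commented-out text
            self.in_comment = True


def locate_class(html, class_name):
    locator = ClassLocator(class_name)
    locator.feed(html)
    locator.close()
    return locator.in_live_markup, locator.in_comment


# Made-up page source that mimics the situation described above
sample_html = '''
<div id="wallpaper">
  <!-- <div class="p_newhero_item"><img src="a.jpg"></div> -->
</div>
'''
print(locate_class(sample_html, 'p_newhero_item'))
```

A result where the class is found only in a comment matches what the manual check showed: the page source does not carry the wallpaper data, so it must arrive through a separate request.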

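  Copy the Request URL from the Headers tab of the worklist entry (the page=0 field means the first page) and paste it into the browser address bar to see the response data.
  Then paste all of that data into json.cn. The right-hand panel reports an error, which means the text is not yet valid JSON: the JSON object is wrapped in an outer jQuery11130793949928178278_1692852974592() call. Remove everything outside the {} and the JSON content is displayed. Each object is one wallpaper, and each sProdImgNo_x field holds the image address for one size.
  Pick one of the image addresses and paste it into the address bar: it does not open. The URL is percent-encoded, so it has to be decoded first.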

# Pick one sProdImgNo_8 address and decode the URL
from urllib import parse

result = parse.unquote('https%3A%2F%2Fshp.qpic.cn%2Fishow%2F2735081516%2F1692089105_829394697_8720_sProdImgNo_8.jpg%2F200')
print(result)
# Output: https://shp.qpic.cn/ishow/2735081516/1692089105_829394697_8720_sProdImgNo_8.jpg/200

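  Open the decoded address and the image turns out to be very small. In the Elements tab, select the largest size of this wallpaper; its address is https://shp.qpic.cn/ishow/2735081516/1692089072_829394697_3690_sProdImgNo_1.jpg/0. The difference from the address we just decoded is that the last number is 0. Change the trailing number of the decoded address to 0 and the large wallpaper is displayed.
  So the procedure is: find the wallpaper data source, decode each URL, and replace the trailing 200 with 0.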
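
The decode-and-replace step can be sketched as a small helper. The function name to_full_size is made up, and the example.com address is a hypothetical case added to show why only the final path segment should be changed: a plain str.replace('200', '0') would also rewrite any "200" that happens to occur earlier in the URL.

```python
from urllib import parse


def to_full_size(encoded_url):
    """Decode a percent-encoded image URL and switch a trailing /200 to /0."""
    decoded = parse.unquote(encoded_url)
    base, sep, size = decoded.rpartition('/')
    if sep and size == '200':
        return base + '/0'
    return decoded


encoded = ('https%3A%2F%2Fshp.qpic.cn%2Fishow%2F2735081516%2F'
           '1692089105_829394697_8720_sProdImgNo_8.jpg%2F200')
print(to_full_size(encoded))

# Hypothetical URL that contains "200" in the middle of the path
tricky = 'https%3A%2F%2Fexample.com%2F2002%2Fa.jpg%2F200'
print(to_full_size(tricky))
```

rpartition splits at the last slash only, so everything before the size suffix is left untouched.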

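1.6.2 Crawling the first page of wallpapers

  1. Get the URL and request headers
    The URL is the Request URL shown under Headers for the first-page worklist entry; scroll down to find the User-Agent as well. We set headers to get past basic anti-crawler checks. It is also advisable to include a Referer header, which states which page the request came from.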
import requests

# Define headers that imitate a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36',
    'referer': 'https://pvp.qq.com/web201605/wallpaper.shtml'
}

url = 'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=0&iOrder=0&iSortNumClose=1&jsoncallback=jQuery111304982467749514099_1692856287807&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1692856287809'
resp = requests.get(url, headers=headers)
print(resp.text)

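  Running this code shows that the response is still not plain JSON. As before, everything outside the {} must be removed. One option is to strip the wrapper with string operations such as replace or slicing and then parse the remainder. eval would turn the remaining text into a dictionary, but json.loads is the safer choice, because eval executes whatever text the server returns.

  Alternatively, delete the &jsoncallback=jQuery111304982467749514099_1692856287807 field from the URL. The response is then plain JSON, and resp.json() parses it directly into a Python dictionary.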
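
Stripping the callback wrapper can be sketched as follows. The helper name parse_jsonp and the sample text are illustrative assumptions; the sample only mimics the shape of the real response (a jQuery callback around an object with a List key) and is not actual server output.

```python
import json
import re


def parse_jsonp(text):
    """Return the JSON payload of a callback(...) wrapper, or parse plain JSON as-is."""
    match = re.match(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up sample in the same shape as the wrapped response
wrapped = 'jQuery111304982467749514099_1692856287807({"List": [{"sProdName": "demo"}]})'
data = parse_jsonp(wrapped)
print(data['List'][0]['sProdName'])

# Plain JSON (as returned when jsoncallback is removed from the URL) passes through unchanged
print(parse_jsonp('{"List": []}'))
```

Parsing with json.loads treats the response purely as data, so nothing in the response body can run as Python code.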

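  2. Parse the URLs

  Next we parse the JSON content. In the developer tools, all wallpaper information sits under the List key, which contains 20 objects. The sProdImgNo_x fields of each object hold the wallpaper URLs we need.

  We can write an exact_url function that extracts these URLs (sProdImgNo_1 to sProdImgNo_8), decodes them, and replaces the trailing 200 with 0.

  3. Get the wallpaper name

  Finally, each set of wallpapers is stored in a folder named after the wallpaper. The name is the text of the sProdName field, which also has to be decoded. For example, one encoded sProdName value decodes to "鹤归松栖-赵怀真".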
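
Decoding the name and turning it into a usable folder name can be tried in isolation. In the sketch below the encoded value is produced locally with parse.quote, since the raw string from the response is not reproduced in the text; the helper name folder_name_from and the choice to replace characters that Windows forbids in file names with an underscore are assumptions for illustration.

```python
import re
from urllib import parse


def folder_name_from(encoded_name):
    """Decode a percent-encoded wallpaper name and make it safe as a folder name."""
    name = parse.unquote(encoded_name).strip()
    return re.sub(r'[\\/:*?"<>|]', '_', name)    # characters not allowed in Windows file names


encoded_name = parse.quote('鹤归松栖-赵怀真')    # stand-in for the value of sProdName
print(folder_name_from(encoded_name))

# A name containing a colon would otherwise break the folder path on Windows
print(folder_name_from(parse.quote(' demo 1:1 ')))
```

Cleaning the name before building the path avoids "path not found" errors when a wallpaper name contains characters the file system rejects.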

  At this point we can print the wallpaper names and their URLs to check that the result is correct.

import json
import os
import requests
from urllib import parse
from urllib import request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36',
    'referer': 'https://pvp.qq.com/web201605/wallpaper.shtml'
}

def send_request():
    url = 'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=0&iOrder=0&iSortNumClose=1&jsoncallback=jQuery111306942951976771379_1692875716815&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1692875716817'
    resp = requests.get(url, headers=headers).text
    # Strip the jQuery...( ) wrapper and keep only the JSON object inside it
    start_index, end_index = resp.find("(") + 1, resp.rfind(")")
    resp = resp[start_index:end_index]
    return json.loads(resp)              # parse as JSON instead of eval-ing remote text

# Extract the URLs stored in the sProdImgNo_{} fields of one object
def exact_url(data):                     # data is one of the 20 objects in the JSON data
    image_url_lst = []
    for i in range(1, 9):                # decode the 8 sProdImgNo_ URLs and fix the size
        image_url = parse.unquote(data['sProdImgNo_{}'.format(i)])
        if image_url.endswith('/200'):   # only change the trailing size marker
            image_url = image_url[:-3] + '0'
        image_url_lst.append(image_url)
    return image_url_lst

def parse_json(json_data):
    d = {}                               # maps each wallpaper name to its 8 URLs
    data_lst = json_data['List']
    for data in data_lst:
        image_url_lst = exact_url(data)               # get the 8 URLs
        sProdName = parse.unquote(data['sProdName'])  # decode the wallpaper name into Chinese
        d[sProdName] = image_url_lst
    for item in d:
        print(item, d[item])
    # save_jpg(d)

def start():
    json_data = send_request()
    parse_json(json_data)

if __name__ == '__main__':
    start()
鹤归松栖-赵怀真 ['https://shp.qpic.cn/ishow/2735082210/1692672112_829394697_11169_sProdImgNo_1.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672112_829394697_11169_sProdImgNo_2.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672112_829394697_11169_sProdImgNo_3.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672112_829394697_11169_sProdImgNo_4.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672113_829394697_11169_sProdImgNo_5.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672113_829394697_11169_sProdImgNo_6.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672113_829394697_11169_sProdImgNo_7.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672114_829394697_11169_sProdImgNo_8.jpg/0']
鹤归松栖-云缨 ['https://shp.qpic.cn/ishow/2735082210/1692672073_829394697_8584_sProdImgNo_1.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672073_829394697_8584_sProdImgNo_2.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672073_829394697_8584_sProdImgNo_3.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672074_829394697_8584_sProdImgNo_4.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672074_829394697_8584_sProdImgNo_5.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672074_829394697_8584_sProdImgNo_6.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672075_829394697_8584_sProdImgNo_7.jpg/0', 'https://shp.qpic.cn/ishow/2735082210/1692672075_829394697_8584_sProdImgNo_8.jpg/0']
...

  Next we write a save_jpg function that downloads the images from the wallpaper URLs. urllib.request.urlretrieve(url, path) can do this.

import os
from urllib import request

folder_name = 'image'
if not os.path.exists(folder_name):
    os.mkdir(folder_name)            # create the image folder in the current directory for the downloads
    print(f"Folder '{folder_name}' created")

def save_jpg(d):                     # d is the {wallpaper name: URL list} dictionary built above
    for key in d:
        # Name the folder after the wallpaper; strip(' ') removes leading and trailing spaces
        dir_path = os.path.join('image', key.strip(' '))
        if not os.path.exists(dir_path):
            os.mkdir(dir_path)
        # Download and save the images
        for index, image_url in enumerate(d[key]):
            img_path = os.path.join(dir_path, '{}.jpg'.format(index + 1))
            if not os.path.exists(img_path):
                request.urlretrieve(image_url, img_path)
                print('{} downloaded'.format(image_url))

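1.6.3 Multi-threaded download with the producer-consumer pattern

  The code in the previous section works, but a single thread downloads far too slowly: there are 34 pages of wallpapers, 20 wallpapers per page, and 8 image addresses per wallpaper (sProdImgNo_1 to sProdImgNo_8), for a total of 34 × 20 × 8 = 5440 images.

  We can use the thread-safe queue from the previous section to download with multiple threads in a producer-consumer arrangement. Producers take page URLs, parse each page, and put the wallpaper URLs into a queue; consumers take wallpaper URLs from that queue, then download and save the images.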

from queue import Queue

page_queue = Queue(34)         # holds the URL of each page, capacity 34
image_url_queue = Queue(200)   # holds the wallpaper URLs of a page; any capacity above 160 (20 × 8) is enough
for i in range(0, 34):
    page_url = f'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page={i}&iOrder=0&iSortNumClose=1&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1595215093279'
    page_queue.put(page_url)
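
The long page URL can also be assembled from a parameter dictionary, which makes the page field easier to spot. This is a sketch using urllib.parse.urlencode; the names BASE_URL and build_page_url are made up, while the parameter names and values are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi'


def build_page_url(page):
    """Build the worklist URL for one page (page numbering starts at 0)."""
    params = {
        'activityId': 2735, 'sVerifyCode': 'ABCD', 'sDataType': 'JSON',
        'iListNum': 20, 'totalpage': 0, 'page': page, 'iOrder': 0,
        'iSortNumClose': 1, 'iAMSActivityId': 51991, '_everyRead': 'true',
        'iTypeId': 2, 'iFlowId': 267733, 'iActId': 2735, 'iModuleId': 2735,
        '_': 1595215093279,
    }
    return BASE_URL + '?' + urlencode(params)


page_urls = [build_page_url(page) for page in range(34)]
print(len(page_urls))
print(page_urls[0])
```

Because dictionaries keep insertion order, the generated query string has the same parameter order as the hand-written URL.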

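  Next we create the producer threads. A producer takes a page_url from page_queue and puts the image_url values it parses into image_url_queue, so the producer thread takes two parameters, page_queue and image_url_queue, both of which must be initialized up front. The complete code follows. (The &jsoncallback=jQuery111306942951976771379_1692875716815 field has been removed from the URL, so the response needs no extra processing.)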

import os
import requests
import threading
from urllib import parse
from queue import Queue, Empty
from urllib import request


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36',
    'referer': 'https://pvp.qq.com/web201605/wallpaper.shtml'
}

# Extract the URLs stored in the sProdImgNo_{} fields of one object
def exact_url(data):                         # data is one of the 20 objects in the JSON data
    image_url_lst = []
    for i in range(1, 9):                    # decode the 8 sProdImgNo_ URLs and fix the size
        image_url = parse.unquote(data['sProdImgNo_{}'.format(i)])
        if image_url.endswith('/200'):       # only change the trailing size marker
            image_url = image_url[:-3] + '0'
        image_url_lst.append(image_url)
    return image_url_lst

# Producer thread: collects wallpaper names and URLs
class Producer(threading.Thread):
    def __init__(self, page_queue, image_url_queue):
        super().__init__()
        self.page_queue = page_queue             # holds the page URLs
        self.image_url_queue = image_url_queue   # holds the wallpaper URLs
    def run(self):
        while True:
            try:
                # get_nowait avoids blocking forever if another producer took the last page
                page_url = self.page_queue.get_nowait()
            except Empty:
                break
            resp = requests.get(page_url, headers=headers)
            json_data = resp.json()
            d = {}                               # key: wallpaper name, value: its URLs
            data_lst = json_data['List']         # 20 objects (wallpapers)
            for data in data_lst:
                image_url_lst = exact_url(data)  # the 8 URLs of one wallpaper
                sProdName = parse.unquote(data['sProdName'])  # wallpaper name
                d[sProdName] = image_url_lst
            for key in d:
                # Build the folder path. The path must not contain special characters,
                # so names that contain them are cleaned first; otherwise the system
                # reports that the path cannot be found.
                dir_path = os.path.join('image', key.strip(' ').replace('·', '').replace('1:1', ''))
                os.makedirs(dir_path, exist_ok=True)   # also creates 'image'; safe when threads race
                for index, image_url in enumerate(d[key]):
                    # put the image path and URL into the queue
                    image_path = os.path.join(dir_path, f'{index+1}.jpg')
                    self.image_url_queue.put({'image_path': image_path, 'image_url': image_url})


# Consumer thread: takes image paths and URLs from the queue and downloads the images
class Customer(threading.Thread):
    def __init__(self, image_url_queue):
        super().__init__()
        self.image_url_queue = image_url_queue
    def run(self):
        while True:
            try:
                image = self.image_url_queue.get(timeout=20)
            except Empty:                        # nothing arrived for 20 seconds: assume the work is done
                break
            try:
                request.urlretrieve(image['image_url'], image['image_path'])
                print(f'{image["image_path"]} downloaded')
            except OSError as e:                 # skip a failed download instead of ending the thread
                print(f'{image["image_path"]} failed: {e}')

# Start function: fills the page queue and starts the threads
def start():
    page_queue = Queue(34)         # holds the URL of each page, capacity 34
    image_url_queue = Queue(200)   # holds the wallpaper URLs parsed from the pages
    for i in range(0, 34):
        page_url = f'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page={i}&iOrder=0&iSortNumClose=1&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1595215093279'
        page_queue.put(page_url)

    # create the producer threads
    for i in range(5):
        t = Producer(page_queue, image_url_queue)
        t.start()

    # create the consumer threads
    for i in range(10):
        t = Customer(image_url_queue)
        t.start()

if __name__ == '__main__':
    start()
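
As an alternative to hand-written Producer and Customer classes, the download stage could be run on a thread pool. The sketch below uses concurrent.futures.ThreadPoolExecutor, which is a different technique from the queue-based code above, and it replaces the real download with a fake function so that it runs offline. The names download_all and fake_download and the example.com URLs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(tasks, download, max_workers=10):
    """Call download(url, path) for every task on a thread pool; return the finished paths."""
    def run(task):
        download(task['image_url'], task['image_path'])
        return task['image_path']

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, tasks))     # results come back in task order


# Offline stand-in for urllib.request.urlretrieve: only records what it was asked to fetch
saved = {}


def fake_download(url, path):
    saved[path] = url


tasks = [{'image_url': f'https://example.com/{n}.jpg',
          'image_path': f'image/demo/{n}.jpg'} for n in range(1, 9)]
finished = download_all(tasks, fake_download)
print(len(finished))
```

In a real run, request.urlretrieve could be passed as the download argument, with the task dictionaries built the same way the producer builds them. The pool takes care of starting, feeding, and joining the worker threads.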

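2. Dynamic web page crawling (to be continued)

2.1 Basics of dynamic web pages

2.1.1 Dynamic and static web pages

Dynamic and static web pages are two different types of pages that differ in how their content is generated and presented.

  1. How content is generated

    • The content of a static page is created on the server in advance and is fixed: every visitor sees the same content.
    • The content of a dynamic page is generated in real time and can change according to the user's request, actions, or other factors.
  2. Loading speed

    • Static pages usually consist of HTML, CSS, and a small amount of JavaScript and need no server-side processing, so they tend to load quickly.
    • Dynamic pages require the server to generate content when the user makes a request, so they tend to load more slowly.
  3. Interactivity

    • Dynamic pages are more interactive: they can generate different content based on user input, actions, or other conditions, which allows a personalized user experience.
    • Static pages usually offer little interactivity; users can only browse content that was generated in advance.
  4. Updates and maintenance

    • Static pages are simple to update and maintain: replacing the files is enough.
    • Dynamic pages may involve more server-side programming and database management, so updates and maintenance can take more work.

  In short, static pages suit content that is relatively fixed and needs neither frequent updates nor personalized interaction, while dynamic pages suit scenarios that need content generated in real time and a personalized, interactive experience (social platforms, e-commerce platforms, news sites and blogs, online games, and so on).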

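2.1.2 Ajax

  Ajax (Asynchronous JavaScript and XML) is a technique for building interactive, dynamic web applications. It communicates with the server asynchronously in the background and updates part of a page without refreshing the whole page, which gives users a smoother experience.

  The core idea of Ajax is to use asynchronous JavaScript requests on the front end to separate data transfer and processing from the rendering of the user interface. Traditionally, a page had to be fully refreshed whenever the user exchanged data with the server. With Ajax, data is requested and processed in the background, and only the relevant part of the page is updated while the rest stays untouched.

Ajax can be used for the following:

  1. Data loading: after the page has loaded, Ajax can fetch data asynchronously, such as news or product information from the server, without waiting for the whole page to load.

  2. Form submission: Ajax can send user input to the server for processing without refreshing the page, and then update the page according to the server's response.

  3. Real-time updates: Ajax can implement real-time features such as live chat and social media feeds.

  4. User feedback: Ajax can implement feedback features such as likes and comments without refreshing the whole page.

  5. Search suggestions: while the user is typing, Ajax can fetch matching search suggestions.

  6. Dynamic tables: Ajax can load table data dynamically, for example when switching pages in a paginated list.

  In section 1.6, clicking the next-page button on the Honor of Kings wallpaper page left the URL in the address bar unchanged while the wallpapers changed dynamically. Likewise, when you search for images on Baidu, more and more images load as you scroll while the address bar stays the same; this also uses Ajax.

  Now open Baidu and run an image search. In the developer tools, select Fetch/XHR to see the Ajax requests.

  The URL on the right is the data address. Opened in the address bar, it shows JSON data, which can be copied and pasted into json.cn for a formatted view.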

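2.1.3 How to crawl dynamic web pages

  The source code of a static page contains the complete page content, so a basic HTTP library (such as requests) can fetch the source and a parsing library (such as Beautiful Soup) can extract the required data.

  The content of a dynamic page is generated at request time, and the source code may not contain all of it. For example, data loaded through Ajax does not appear in the page source; only the HTML loaded from the address-bar URL does.

There are three ways to crawl dynamic pages:

  1. Analyze the Ajax requests: dynamic pages usually fetch additional data through Ajax requests. You can analyze the URLs and parameters of these requests and then reproduce them with a Python HTTP library to obtain the data (in section 1.6, for example, we worked out the real URL of the wallpaper data, opened it in the address bar to get the JSON data, and then parsed it).

    • Advantages: the data is requested directly, parsing is easy, little code is needed, and performance is high.
    • Disadvantages: analyzing the interface can be complicated. Among many requests it may be unclear which one carries the real data, especially when the interface is obfuscated with JavaScript (which requires JS skills), and the crawler is easier to detect.
  2. Simulate browser behavior: use an automated testing tool or library (such as Selenium) to simulate a browser, load the page and run its JavaScript completely, and then read the full page content.

    • Advantages: anything the browser can request, Selenium can request too, and the crawler is more stable.
    • Disadvantages: more code and lower performance.
  3. API calls: if the website provides an API, you can call it directly to obtain data without parsing the whole page.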

2.2 selenium

2.2.1 Introduction to selenium

  Below, selenium drives Google Chrome to open the Bing homepage. In the developer tools, selecting the search box shows that the corresponding source is an input tag whose id attribute is sb_form_q, so the element can be located by id.

  The code below does the following:

  • Automatically saves a screenshot
  • Gets the web page source code
  • Types python in the search bar
# coding:utf-8

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()                  # initialize the driver and launch the browser
driver.get('https://cn.bing.com/')
driver.save_screenshot('bing.png')           # take a screenshot of the page (Selenium writes PNG data)
html = driver.page_source                    # get the web page source code
print(html)

# Locate the search box and type the search term; input_tag is the search box
input_tag = driver.find_element(by=By.ID, value='sb_form_q')
input_tag.send_keys('python')                # type python into the Bing search box
time.sleep(5)
driver.quit()                                # close the whole browser

2.2.2 Commonly used positioning methods in selenium

  driver.find_element() locates the first element that matches the criteria. To locate all matching elements, use driver.find_elements().

In the code from the previous section, the search box can also be located with the other strategies:

# Locate the search box by the value of its name attribute
driver.find_element(by=By.NAME, value='q').send_keys('world')

# Locate the search box with an XPath expression
driver.find_element(by=By.XPATH, value='//input[@class="sb_form_q"]').send_keys('hello')

# Locate the element with a CSS id selector
driver.find_element(by=By.CSS_SELECTOR, value='#sb_form_q').send_keys('world')

# Locate the element with a CSS class selector
driver.find_element(by=By.CSS_SELECTOR, value='.sb_form_q').send_keys('hello')

# Locate the element by class name
driver.find_element(by=By.CLASS_NAME, value='sb_form_q').send_keys('python')

# Count the input tags on the page by tag name
input_list = driver.find_elements(by=By.TAG_NAME, value='input')
print(len(input_list))                       # printed 3 in the author's run
driver.quit()

2.2.3 Form operations

  1. Click a button
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('https://cn.bing.com/')
input_tag = driver.find_element(by=By.ID, value='sb_form_q')
input_tag.send_keys('python')
time.sleep(5)

# Locate the search button
button_tag = driver.find_element(by=By.ID, value='search_icon')
# Form element operation: click the search button
button_tag.click()
# Get the page source and print it
html = driver.page_source
print(html)
# Sleep for 10 seconds, then close the browser
time.sleep(10)
driver.quit()
  2. Manipulating drop-down lists and checkboxes

The following code opens the 12306 website and operates its registration page:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('https://kyfw.12306.cn/otn/regist/init')

select = Select(driver.find_element(by=By.ID, value='passengerType'))  # locate the drop-down list
select.select_by_index(2)                    # select an option by index (index 2 is the student option)
time.sleep(5)

select.select_by_value('2')                  # select an option by its value attribute
time.sleep(5)

select.select_by_visible_text('成人')         # select an option by its visible text ("成人" means adult)
time.sleep(5)

# Locate the checkbox and click it
checkbox_tag = driver.find_element(by=By.ID, value='checkAgree')
checkbox_tag.click()

time.sleep(5)
driver.quit()

Origin blog.csdn.net/qq_56591814/article/details/132409331