[Network Security Takes You to Practice Crawlers - 100 Practices] Practice 7: Multithreading + Task Queue + Mutual Exclusion Lock

Table of contents

1. Multithreading analysis

2. Code implementation

3. Small circle of network security


1. Multithreading analysis

Several popular Python libraries can be combined to build a capable multithreaded crawler.

What the sample implements:
Use the requests library to send HTTP requests and fetch the responses.
Use the beautifulsoup4 library to parse the HTML or XML response content.
Use the threading module to create and manage threads.
Use the queue module to build a task queue holding the URLs to be crawled.
Use a mutex (threading.Lock) for thread synchronization to avoid data races.

Sample code:

import requests
from bs4 import BeautifulSoup
import threading
from queue import Queue

# Create the task queue
url_queue = Queue()

# URLs to crawl
urls = ['http://example.com/page1', 'http://example.com/page2', 'http://example.com/page3']

# Put the URLs into the task queue
for url in urls:
    url_queue.put(url)

# Mutex lock
lock = threading.Lock()

# Crawl function executed by each worker thread
def crawl():
    while True:
        # Take a URL from the queue
        url = url_queue.get()

        # Send the HTTP request
        response = requests.get(url)

        # Parse the HTML response
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract and process whatever data you need here;
        # the page title is used as a placeholder example
        data = soup.title.string if soup.title else ''

        # Print the result under the mutex lock
        with lock:
            print(f"URL: {url}, Data: {data}")

        # Mark the task as done
        url_queue.task_done()

# Create and start multiple threads
num_threads = 4  # number of worker threads

for _ in range(num_threads):
    t = threading.Thread(target=crawl)
    t.daemon = True  # daemon thread, so it will not block program exit
    t.start()

# Block until every task in the queue has been processed
url_queue.join()

The code above creates a multithreaded crawler: each thread pulls a URL from the task queue and crawls it. In real use you will likely need to adapt and tune it, for example by adding exception handling and choosing a sensible number of threads.
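As one example of that hardening, each task can be wrapped in try/finally so that a failing request does not crash the worker or leave the queue counter hanging. A minimal sketch, reusing the url_queue and lock from the sample above:

def crawl():
    while True:
        url = url_queue.get()
        try:
            # Any network or parsing error for this URL is caught here
            response = requests.get(url, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')
            data = soup.title.string if soup.title else ''
            with lock:
                print(f"URL: {url}, Data: {data}")
        except Exception as e:
            with lock:
                print(f"Failed to crawl {url}: {e}")
        finally:
            # Always mark the task done, even on failure,
            # otherwise url_queue.join() would block forever
            url_queue.task_done()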
 



2. Code implementation

Run result

After the crawl finishes, you can see that the pages were not crawled in order; that is expected when multiple threads run concurrently.

Consequences of multithreading with a single account/IP

Goal 1: Create a lock object

Define the lock in a scope where every function that operates on the shared resource can see it:

csv_lock = threading.Lock()

This creates the thread lock object csv_lock. A lock protects a shared resource by ensuring that only one thread accesses it at a time, which avoids race conditions.
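To see the kind of problem the lock prevents, here is a small standalone sketch (an illustration only, not part of the crawler): several threads increment a shared counter; without the lock the read-modify-write can be interrupted midway and updates are lost, with the lock the result is always exact.

import threading

counter = 0
lock = threading.Lock()

def add(n, use_lock):
    global counter
    for _ in range(n):
        if use_lock:
            with lock:        # only one thread at a time runs this block
                counter += 1
        else:
            counter += 1      # unprotected read-modify-write, may lose updates

def run(use_lock, n=100_000, num_threads=4):
    global counter
    counter = 0
    threads = [threading.Thread(target=add, args=(n, use_lock)) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run(use_lock=False))  # may be less than 400000 (lost updates)
print(run(use_lock=True))   # always 400000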


Goal 2: Lock the shared resource while writing to it

Acquire the lock before the write and release it immediately afterwards:

        csv_lock.acquire()
        csv_w.writerow((title.strip(), link, type_texts, money, email, phone))
        csv_lock.release()

(1) csv_lock.acquire(): Acquire the thread lock. The current thread is about to operate on the shared resource, so other threads that want the lock must wait.

(2) csv_w.writerow((title.strip(), link, type_texts, money, email, phone)): Operate on the shared resource, here by writing one row to the CSV file. csv_w is a CSV writer object; its writerow() method writes a single row, taking a tuple whose elements become the columns of that row.

(3) csv_lock.release(): Release the thread lock. The current thread has finished with the shared resource, and other threads can now compete to acquire the lock.
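Equivalently, threading.Lock can be used as a context manager, which guarantees the lock is released even if writerow() raises an exception. The same critical section written with that idiom:

        # Same critical section using a with-statement:
        # the lock is released automatically, even on an exception
        with csv_lock:
            csv_w.writerow((title.strip(), link, type_texts, money, email, phone))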


Goal 3: Implement multithreading

    threads = []
    for page in range(1, 5):
        thread = threading.Thread(target=crawl_page, args=(page,))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

(1) threads = []: Create an empty list threads to store the created thread objects

(2) for page in range(1, 5):: Loop so that page takes the values 1, 2, 3, 4 in turn, i.e. the page numbers to crawl. It is assumed here that 4 pages are to be crawled.

(3) thread = threading.Thread(target=crawl_page, args=(page,)): Create a thread object thread, specify the function to be executed as crawl_page, and pass in the parameter page.

(4) threads.append(thread): Add the created thread object thread to the list threads for subsequent operations.

(5) thread.start(): Start the thread so that it starts to execute the crawl_page function.

(6) for thread in threads:: Loop over the thread objects stored in the list threads.

(7) thread.join(): Call the thread object's join() method so the main thread waits for that thread to finish. This ensures all threads have completed before the subsequent code runs.
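The same fan-out/wait pattern can also be expressed with the standard library's concurrent.futures.ThreadPoolExecutor, which caps the number of worker threads for you. A minimal sketch, assuming the same crawl_page(page) function as above:

from concurrent.futures import ThreadPoolExecutor

# Crawl pages 1-4 with at most 4 worker threads;
# leaving the with-block waits for all submitted tasks to finish
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(crawl_page, range(1, 5))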


Code:

Note: Fill in your own cookie

(If you don't fill in a cookie, there is a fair chance the crawl will fail.)

import time
import requests
import csv
from bs4 import BeautifulSoup
import threading

def get_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36',
            'Cookie':'!!!!!!!!!!!!!'
        }

        response = requests.get(url, headers=headers, timeout=10)
        return response.text
    except Exception:
        return ""

def get_TYC_info(page):
    TYC_url = f"https://www.tianyancha.com/search?key=&sessionNo=1688538554.71584711&base=hub&cacheCode=00420100V2020&city=wuhan&pageNum={page}"
    html = get_page(TYC_url)

    soup = BeautifulSoup(html, 'lxml')
    GS_list = soup.find('div', attrs={'class': 'index_list-wrap___axcs'})
    # If the listing container is missing (e.g. the cookie is invalid), skip this page
    if GS_list is None:
        return
    GS_items = GS_list.find_all('div', attrs={'class': 'index_search-box__7YVh6'})

    for item in GS_items:
        title = item.find('div', attrs={'class': 'index_name__qEdWi'}).a.span.text
        link = item.a['href']
        company_type_div = item.find('div', attrs={'class': 'index_tag-list__wePh_'})

        if company_type_div is not None:
            company_type = company_type_div.find_all('div', attrs={'class': 'index_tag-common__edIee'})
            type_texts = [element.text for element in company_type]
        else:
            type_texts = ''

        money = item.find('div', attrs={'class': 'index_info-col__UVcZb index_narrow__QeZfV'}).span.text

        for u in [link]:
            html2 = get_page(u)
            soup2 = BeautifulSoup(html2, 'lxml')
            email_phone_div = soup2.find('div', attrs={'class': 'index_detail__JSmQM'})

            if email_phone_div is not None:
                phone_div = email_phone_div.find('div', attrs={'class': 'index_first__3b_pm'})
                email_div = email_phone_div.find('div', attrs={'class': 'index_second__rX915'})

                if phone_div is not None:
                    phone_element = phone_div.find('span', attrs={'class': 'link-hover-click'})
                    if phone_element is not None:
                        phone = phone_element.find('span',attrs={'class':'index_detail-tel__fgpsE'}).text
                    else:
                        phone = ''
                else:
                    phone = ''

                if email_div is not None:
                    email_element = email_div.find('span', attrs={'class': 'index_detail-email__B_1Tq'})
                    if email_element is not None:
                        email = email_element.text
                    else:
                        email = ''
                else:
                    email = ''
            else:
                phone = ''
                email = ''

            csv_lock.acquire()
            csv_w.writerow((title.strip(), link, type_texts, money, email, phone))
            csv_lock.release()

def crawl_page(page):
    get_TYC_info(page)
    print(f'Page {page} finished')

if __name__ == '__main__':
    with open('5.csv', 'a', encoding='utf-8', newline='') as f:
        csv_w = csv.writer(f)
        csv_w.writerow(('Company name', 'URL', 'Type', 'Capital', 'Email', 'Phone number'))

        csv_lock = threading.Lock()

        threads = []
        for page in range(1, 5):
            thread = threading.Thread(target=crawl_page, args=(page,))
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join()

        time.sleep(2)
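
Regarding the cookie note above: rather than pasting the cookie into the source, one option is to read it from an environment variable. A small sketch, assuming a variable named TYC_COOKIE (a name chosen only for this example):

import os

# TYC_COOKIE is an assumed environment variable name for this sketch
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36',
    'Cookie': os.environ.get('TYC_COOKIE', ''),
}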



3. Small circle of network security

README.md · shubansheng/Treasure_knowledge (Gitee): https://gitee.com/shubansheng/Treasure_knowledge/blob/master/README.md

GitHub - BLACKxZONE/Treasure_knowledge: https://github.com/BLACKxZONE/Treasure_knowledge
