How to use Python to crawl websites for performance testing


Introduction

Website performance testing evaluates a website's responsiveness, stability, reliability, and resource consumption. It helps developers and operations staff find and fix performance bottlenecks and improve user experience and satisfaction. This article shows how to write a simple Python crawler that simulates users visiting a website and collects and analyzes the site's performance data.

Overview

Python is a widely used high-level programming language: concise, readable, flexible, and cross-platform. It also has many powerful third-party libraries that make a wide range of functionality easy to implement. This article uses the following libraries to build the crawler (a minimal preview of how they fit together appears after the list):

  • requests : A simple and elegant HTTP library that can send various HTTP requests and get the server's response content and status code.
  • BeautifulSoup : A library for parsing and extracting HTML and XML documents, which can easily obtain elements such as links, text, and pictures in web pages.
  • threading : A library for implementing multi-threaded programming, which can create multiple threads to perform tasks concurrently, improving the efficiency and speed of crawlers.
  • time : A library for dealing with time, which can get the current time, calculate the time difference, set the delay, etc.
  • statistics : A library for statistical analysis, which can calculate the average, median, standard deviation and other indicators.
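As a quick preview of how requests and time combine to measure a single page load, here is a minimal sketch; example.com is just a placeholder URL and no proxy is used yet:

# Minimal timing of one request (placeholder URL, no proxy)
import time
import requests

start = time.time()
resp = requests.get("https://example.com")
print(resp.status_code, f"{time.time() - start:.3f} s", len(resp.content), "bytes")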

Main text

1. Import the required libraries

First, we need to import the libraries mentioned above so that we can use them in the code that follows. We can import them with statements such as:

# Import all of the libraries listed above
import requests
from bs4 import BeautifulSoup
import threading, time, statistics

2. Set the crawler proxy

Since we want to simulate users visiting the website, we use a proxy server to hide our real IP address and avoid being identified and blocked by the target website. We can use the proxy server provided by Yiniu Cloud, which takes the following parameters:

  • proxyHost: the domain name or IP address of the proxy server
  • proxyPort: the port number of the proxy server
  • proxyUser: username of the proxy server
  • proxyPass: the password of the proxy server

We can use the following code to set the crawler proxy:

# Set up the crawler proxy
# Yiniu Cloud proxy server
proxyHost = "www.16yun.cn"
proxyPort = "8080"
# Proxy authentication
proxyUser = "16YUN"
proxyPass = "16IP"

# Build the proxy dictionary
proxies = {
    "http": f"http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}",
    "https": f"https://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}",
}
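To confirm the proxy is actually in use before crawling, one quick sanity check is to ask an IP-echo service for the outbound address; this is a hedged sketch, and httpbin.org/ip is only one such service:

# Sanity check: the "origin" field should show the proxy's IP, not your own
# (assumes httpbin.org is reachable through the proxy)
resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.json()["origin"])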

3. Define the crawler function

Next, we need to define a crawler function that accepts one parameter, url, the address of the web page to crawl. The main tasks of this function are:

  • Use the requests library to send a GET request and obtain the page content and response status code
  • Use the BeautifulSoup library to parse the page content, extract the links in it, and store them in a list
  • Use the time library to record when the request is sent and when the response is received, and compute the response time and the request latency
  • Estimate the data bandwidth of the request from the number of bytes downloaded

We can use the following code to define the crawler function:

# Define the crawler function
def spider(url):
    # Record the time just before the request is sent
    send_time = time.time()

    # Send a GET request and get the page content and response status code
    response = requests.get(url, proxies=proxies)

    # Record the time the response was received
    receive_time = time.time()

    content = response.text
    status_code = response.status_code

    # Parse the page content, extract the links in it, and store them in a list
    soup = BeautifulSoup(content, "html.parser")
    links = []
    for link in soup.find_all("a"):
        href = link.get("href")
        if href and href.startswith("http"):
            links.append(href)

    # Compute the response time (send to receive) and the latency reported by requests
    response_time = receive_time - send_time
    latency_time = response.elapsed.total_seconds()

    # Estimate the data bandwidth as the number of bytes downloaded
    bandwidth = len(response.content)

    # Return the results
    return {
        "url": url,
        "status_code": status_code,
        "links": links,
        "response_time": response_time,
        "latency_time": latency_time,
        "bandwidth": bandwidth,
    }
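Before adding threads, it helps to sanity-check the function on a single page; a minimal sketch (the URL is only a placeholder):

# Quick single-URL check (placeholder URL; replace with a page you want to test)
result = spider("https://cn.bing.com")
print(result["status_code"], result["response_time"], len(result["links"]))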

4. Define the multithreading function

Since we want to crawl multiple web pages, we can use multi-threading technology to improve the efficiency and speed of the crawler. We need to define a multithreaded function that accepts two parameters:

  • urls: a list of the web page addresses to be crawled
  • num_threads: an integer indicating the number of threads to create

The main functions of this function are:

  • Use the threading library to create a specified number of threads and distribute the list of urls equally to each thread
  • Use the spider function to crawl the webpage in each thread and store the results in a shared list
  • Use the time library to record the start and end times of multi-threads, and calculate the total time of multi-thread execution

We can use the following code to define a multithreaded function:

# Define the multithreading function
def multi_threading(urls, num_threads):
    # Shared list that collects the result of every crawled page
    results = []

    # Worker: crawl every URL in the given chunk and store the results
    def spider_chunk(chunk):
        for url in chunk:
            result = spider(url)
            results.append(result)

    # Create the requested number of threads and split the urls list evenly among them
    threads = []
    chunk_size = len(urls) // num_threads
    for i in range(num_threads):
        start = i * chunk_size
        end = (i + 1) * chunk_size if i < num_threads - 1 else len(urls)
        chunk = urls[start:end]
        thread = threading.Thread(target=spider_chunk, args=(chunk,))
        threads.append(thread)

    # Record the start and end times and compute the total execution time
    start_time = time.time()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    end_time = time.time()
    total_time = end_time - start_time

    # Return the results
    return {
        "results": results,
        "total_time": total_time,
    }
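As an aside, the same fan-out can be written more compactly with the standard library's concurrent.futures module; this is an alternative sketch, not part of the original code:

# Alternative sketch: a thread pool instead of manual chunking
from concurrent.futures import ThreadPoolExecutor

def multi_threading_pool(urls, num_threads):
    start_time = time.time()
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        results = list(executor.map(spider, urls))
    return {"results": results, "total_time": time.time() - start_time}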

5. Define the data statistics function

Finally, we need to define a data statistics function that accepts one parameter, results, the list of crawler results. The main tasks of this function are:

  • Use the statistics library to calculate the average, median, maximum, minimum and standard deviation of various performance indicators
  • Use the requests library to get the domain name and IP address of the target website

We can use the following code to define the data statistics function:

# Define the data statistics function
def data_analysis(results):
    # Collect each performance metric across all results
    status_codes = [result["status_code"] for result in results]
    response_times = [result["response_time"] for result in results]
    latency_times = [result["latency_time"] for result in results]
    bandwidths = [result["bandwidth"] for result in results]

    # Compute the mean, median, maximum, minimum and standard deviation of each metric
    mean_status_code = statistics.mean(status_codes)
    median_status_code = statistics.median(status_codes)
    max_status_code = max(status_codes)
    min_status_code = min(status_codes)
    stdev_status_code = statistics.stdev(status_codes)

    mean_response_time = statistics.mean(response_times)
    median_response_time = statistics.median(response_times)
    max_response_time = max(response_times)
    min_response_time = min(response_times)
    stdev_response_time = statistics.stdev(response_times)

    mean_latency_time = statistics.mean(latency_times)
    median_latency_time = statistics.median(latency_times)
    max_latency_time = max(latency_times)
    min_latency_time = min(latency_times)
    stdev_latency_time = statistics.stdev(latency_times)

    mean_bandwidth = statistics.mean(bandwidths)
    median_bandwidth = statistics.median(bandwidths)
    max_bandwidth = max(bandwidths)
    min_bandwidth = min(bandwidths)
    stdev_bandwidth = statistics.stdev(bandwidths)

    # Get the domain name and IP address of the target website
    url = results[0]["url"]
    domain = url.split("/")[2]
    ip_address = requests.get(f"http://ip-api.com/json/{domain}").json()["query"]

    # Return the results
    return {
        "mean_status_code": mean_status_code,
        "median_status_code": median_status_code,
        "max_status_code": max_status_code,
        "min_status_code": min_status_code,
        "stdev_status_code": stdev_status_code,

        "mean_response_time": mean_response_time,
        "median_response_time": median_response_time,
        "max_response_time": max_response_time,
        "min_response_time": min_response_time,
        "stdev_response_time": stdev_response_time,

        "mean_latency_time": mean_latency_time,
        "median_latency_time": median_latency_time,
        "max_latency_time": max_latency_time,
        "min_latency_time": min_latency_time,
        "stdev_latency_time": stdev_latency_time,

        "mean_bandwidth": mean_bandwidth,
        "median_bandwidth": median_bandwidth,
        "max_bandwidth": max_bandwidth,
        "min_bandwidth": min_bandwidth,
        "stdev_bandwidth": stdev_bandwidth,

        "domain": domain,
        "ip_address": ip_address,
    }
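The per-metric blocks above are repetitive; a small helper could compute the same five statistics for any list of numbers. This is a hedged sketch, not part of the original code:

# Hypothetical helper: summarize one metric with the same five statistics
def summarize(values):
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

# Example: summarize(response_times) returns {"mean": ..., "median": ..., ...}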

Highlights

The highlights of this article are as follows:

  • A simple and efficient crawler program written in Python that can collect the page content and performance data of any website
  • A proxy server provided by Yiniu Cloud hides the real IP address and helps avoid being identified and blocked by the target website
  • Multithreading improves the crawler's efficiency and speed and simulates multiple users visiting the website at the same time
  • A data statistics function analyzes the crawler results and computes the mean, median, maximum, minimum, and standard deviation of each performance indicator
  • The requests library is used to obtain the domain name and IP address of the target website, as well as to estimate the data bandwidth of each request

Case study

To demonstrate the method described in this article, we chose a target website, https://cn.bing.com, a well-known search engine. The performance test proceeds in the following steps:

  • First, we need to prepare a list of web page addresses to crawl. We can use Bing's search function: enter a few keywords such as "Python", "crawler", and "performance testing", collect the links from the search result pages, and store them in a list. We can do this with the following code:
# Prepare the list of web page addresses to crawl
urls = []
keywords = ["Python", "爬虫", "性能测试"]
for keyword in keywords:
    # Search the keyword on Bing and collect the links from the result page
    search_url = f"https://www.bing.com/search?q={keyword}"
    search_response = requests.get(search_url, proxies=proxies)
    search_content = search_response.text
    search_soup = BeautifulSoup(search_content, "html.parser")
    for result in search_soup.find_all("li", class_="b_algo"):
        link = result.find("a").get("href")
        urls.append(link)
  • Then, we need to set the number of threads to create. We can decide this based on the number of pages to crawl and the computer's capacity. Here we assume we want to create 4 threads. We can do this with the following code:
# Set the number of threads to create
num_threads = 4
  • Next, we need to call the multithreading function, passing in the list of web page addresses and the number of threads, to get the crawler results and the total multithreaded execution time. We can do this with the following code:
# Call the multithreading function to get the crawler results and the total execution time
result = multi_threading(urls, num_threads)
results = result["results"]
total_time = result["total_time"]
  • Finally, we need to call the data statistics function, passing in the crawler results, to get the statistics for each performance indicator along with the domain name and IP address of the target website. We can do this with the following code:
# Call the data statistics function to get the statistics for each metric and the target site's domain and IP
data = data_analysis(results)
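As a final step, a short sketch for printing a few of the collected numbers; the field names follow the dictionaries built above:

# Print a brief summary of the test
print(f"Crawled {len(results)} pages in {total_time:.2f} s")
print(f"Target: {data['domain']} ({data['ip_address']})")
print(f"Mean response time: {data['mean_response_time']:.3f} s")
print(f"Mean bandwidth: {data['mean_bandwidth']:.0f} bytes")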

Conclusion

This article introduced how to use Python to write a simple crawler program that simulates users visiting a website and collects and analyzes its performance data. It used the requests, BeautifulSoup, threading, time, and statistics libraries to implement the crawler proxy, multithreading, and data statistics, and walked through a concrete case of testing the performance of the Bing search engine.
