Python web crawler principles and proxy IP usage

Table of contents

Preface

1. Principle of Python web crawler

2. Python web crawler case

Step 1: Analyze the web page

Step 2: Extract data

Step 3: Store data

3. Use proxy IP

4. Summary


Preface

With the development of the Internet, the amount of information available online has grown enormously, and collecting that data has become an important task for data analysts and researchers. Python is an efficient programming language that is widely used in web development and data analysis, and a Python web crawler can automatically visit websites and extract data from them. This article introduces how Python web crawlers work, explains how to use proxy IPs, and provides a worked example.

1. Principle of Python web crawler

Python is an efficient programming language that is popular in web development and data analysis. Its rich collection of modules makes it well suited to large-scale data processing and web service programming, and web crawlers are among the tools Python developers build most often.

A web crawler is an automated program that simulates the behavior of a human browser to search for and retrieve information on the Internet automatically. A Python web crawler usually involves the following steps:

  1. URL analysis: the crawler starts from the URL of the website to be crawled. By visiting that link, it parses the HTML of the page, identifies the hyperlinks it contains, and follows them to discover further links, building up the list of pages that need to be crawled.
  2. Page download: the crawler first issues an HTTP request; once the server accepts it, it returns the page as the HTML code a browser would normally render. Python crawlers use libraries such as requests or urllib to send the request and download the page data.
  3. Content parsing: crawlers typically rely on a parsing library to extract data. A parser can pull out specific tags, text, or attributes and convert them into Python data types such as lists or dictionaries. Beautiful Soup is one of the most popular parsing libraries for Python.
  4. Data processing: the extracted data usually needs to be cleaned and analyzed. Python's data analysis libraries, such as pandas and NumPy, provide a wide range of tools that crawlers can use for cleaning and processing.

The above is the general workflow of a Python web crawler. Below, we illustrate it further with examples.
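
Before the full case study, here is a minimal sketch that ties the four steps together (a simplified illustration only; the URL, CSS selector, and output file name are placeholders chosen for this sketch):

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Steps 1 and 2: request a page and download its HTML (placeholder URL)
url = 'https://example.com/'
response = requests.get(url)
html = response.text

# Step 3: parse the HTML and extract some data (placeholder selector)
soup = BeautifulSoup(html, 'html.parser')
headings = [tag.text.strip() for tag in soup.select('h1')]

# Step 4: process the extracted data and store it
df = pd.DataFrame({'heading': headings})
df.to_csv('example.csv', index=False)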

2. Python web crawler case

We will take collecting the Douban Movie Top 250 data as an example and walk through how to implement a Python web crawler step by step.

Step 1: Analyze the web page

Before scraping any web page, we need to understand its structure and elements. In Python, we can use the requests library to fetch the page and obtain its HTML source. Here is the sample code (Douban may reject requests that do not carry a browser-like User-Agent header, so one is added here):

import requests

url = 'https://movie.douban.com/top250'
# Douban may block the default requests User-Agent, so send a browser-like one
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
html = response.text

print(html)

After obtaining the HTML source, we can use the Beautiful Soup library to analyze the page. It provides a convenient way to find and extract data from HTML documents. Here is the sample code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())  # print the formatted HTML code

Running the above code, we can see the neatly formatted HTML in the console.
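
Once the soup object is available, individual elements can also be located directly with CSS selectors. The small example below pulls out the first movie title on the page (an illustration only; the selector matches the page structure used in Step 2):

# Find the first element matching the selector, if any
first_title = soup.select_one('div.hd a span')
if first_title is not None:
    print(first_title.text)  # title of the first movie in the list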

Step 2: Extract data

After analyzing the web page, we need to extract the useful data. In this example, we will extract the movie title, rating, director, and actors for each entry in the Douban Movie Top 250. Here is the sample code:

# Extract the movie titles (only the first title span of each entry)
titles = [item.select_one('span.title').text for item in soup.select('div.hd')]
print(titles)

# Extract the ratings
scores = [score.text for score in soup.select('div.star span.rating_num')]
print(scores)

# Extract the info text of each movie (director, actors, year, genre)
infos = [item.select_one('div.bd p').text for item in soup.select('div.info')]
print(infos)

# Process the info text: the first line looks like "导演: ...   主演: ..."
directors = []
actors = []
for info in infos:
    first_line = info.strip().split('\n')[0]
    director_part, _, actor_part = first_line.partition('主演:')
    directors.append(director_part.replace('导演:', '').strip())
    actors.append(actor_part.strip())
print(directors)
print(actors)

Step 3: Store data

Finally, we need to store the data in a file for further processing and analysis. In Python, we can use the pandas library to write the data to a CSV file.

import pandas as pd

data = {'title': titles, 'score': scores, 'director': directors, 'actors': actors}
df = pd.DataFrame(data)
print(df)

# utf-8-sig keeps the Chinese text readable when the CSV is opened in Excel
df.to_csv('douban_movies.csv', index=False, encoding='utf-8-sig')
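
To confirm the file was written correctly, it can be read straight back with pandas (a quick sanity check using the file name from the previous step):

import pandas as pd

check = pd.read_csv('douban_movies.csv')
print(check.head())  # show the first few stored rows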

3. Use proxy IP

Python web crawlers often need to use proxy IPs to avoid a website's anti-crawler mechanisms. A proxy IP is the address of another server that forwards our requests, hiding our real IP address and location and thereby bypassing the website's access restrictions. In Python, we can access websites through a proxy IP to protect our privacy.

Using a proxy IP only requires adding a few parameters to the request. For example, the requests library accepts a proxies parameter that specifies the proxy to use for each protocol:

proxies = {'http': 'http://<user>:<password>@<ip_address>:<port>',
           'https': 'https://<user>:<password>@<ip_address>:<port>'}
response = requests.get(url, proxies=proxies)

In the above code, we specify a proxy for both the HTTP and HTTPS protocols, where <user>:<password> are the username and password for the proxy, and <ip_address> and <port> are the proxy server's IP address and port number.
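
To check that the proxy is actually being used, one simple approach is to request an IP-echo service and compare the reported address with your own (an optional sanity check; httpbin.org is just one such public service, and the proxy address below is a placeholder):

import requests

proxies = {'http': 'http://<ip_address>:<port>',
           'https': 'http://<ip_address>:<port>'}
response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
print(response.json())  # should show the proxy's IP rather than your own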

We can also work with proxy IPs through the Scrapy framework, which provides several ways to set and switch proxies. For example, we can use a downloader middleware in Scrapy to specify the proxy, such as picking one at random for each request:

import random

class RandomProxyMiddleware(object):
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        # Read the list of proxies from the PROXY_LIST setting
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy

In the above code, we implemented a middleware named RandomProxyMiddleware that randomly selects a proxy IP for each request. The proxy IP list is configured in Scrapy's settings file, where the middleware also has to be enabled, as sketched below.
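
A sketch of the corresponding settings is shown below (the proxy addresses are placeholders, and the middleware path and priority depend on your own project layout):

# settings.py
PROXY_LIST = [
    'http://<ip_address_1>:<port>',
    'http://<ip_address_2>:<port>',
]

DOWNLOADER_MIDDLEWARES = {
    # Replace 'myproject' with the actual project package name
    'myproject.middlewares.RandomProxyMiddleware': 350,
}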

4. Summary

A Python web crawler is a powerful tool for scraping and analyzing data, capable of collecting large amounts of information from the Internet for all kinds of data analysis and mining. In this article, we introduced the basic principles and usage of Python web crawlers, provided an example that collects information on the Douban Movie Top 250, and showed how to use proxy IPs to avoid websites' anti-crawler mechanisms. I hope this article is helpful to beginners in Python web crawling.
