[100 days proficient in python] Day42: Python web crawler development_HTTP request library requests common syntax and hands-on practice

Table of contents

1 HTTP protocol

2 HTTP and HTTPS

3 HTTP request process

3.1 HTTP request process

3.2 GET request and POST request

3.3 Common request headers

3.4 HTTP response

4 HTTP request library requests common syntax

4.1 Send GET request

4.2 Send POST request

4.3 Request parameters and headers

4.4 Encoding format

4.5 Requests advanced operation - file upload

4.6 Requests advanced operation - get cookie

4.7 Requests advanced operation - certificate verification

5 Hands-on practice

Use the requests library to grab the titles and links of the 2023 college entrance examination news


1 HTTP protocol

HTTP (Hypertext Transfer Protocol) is a protocol for transferring data between a client and a server. It is based on the request-response model: the client sends an HTTP request and the server returns an HTTP response. HTTP is mainly used for communication between web browsers and servers to obtain, transmit, and display web pages and resources.

In web crawling, HTTP plays a vital role. The following are some of its key functions in a crawler:

  1. Obtaining webpage content: The crawler uses the HTTP protocol to send a request to the server to obtain the content of the webpage. By sending a GET request, the crawler can ask the server to return the HTML code of the web page.

  2. Sending requests: Crawlers can use different HTTP request methods, such as GET, POST, PUT, etc., to send different types of requests to the server. GET requests are used to fetch resources, while POST requests are used to submit data, PUT requests are used to update resources, and so on.

  3. Passing parameters: The crawler can pass various data, such as query parameters, form data, etc., through URL parameters or request body parameters of HTTP requests. This is useful when scraping specific data or doing searches.

  4. Set request headers: Crawlers can set request headers in HTTP requests, including User-Agent, Referer, Cookie, etc., to simulate different types of browser behaviors, or to bypass website anti-crawling measures.

  5. Processing the response: The server returns an HTTP response, which contains the status code, response headers, and response body. The crawler can judge whether the request is successful according to the status code, obtain information from the response header, and extract the web page content from the response body.

  6. Parsing HTML content: The crawler extracts the required information from HTML content by parsing it. This usually involves using a library such as Beautiful Soup to parse the DOM structure of the web page.

  7. Simulated login: For websites that require login to access, the crawler can submit the login form by simulating a POST request and then access data that is only available after logging in.

  8. Anti-crawling processing: The crawler may encounter the anti-crawling mechanism of the website, such as limiting access frequency, verification code, etc. In this case, crawlers need to properly adjust request headers, use proxy IP, etc. to bypass these restrictions.

        In short, the HTTP protocol is the basis of the crawler's work. By sending a request to the server and parsing the server's response, the crawler can obtain the required data from the web page, and then process, analyze and store it. At the same time, understanding the various characteristics and mechanisms of the HTTP protocol can help crawlers operate and interact with servers more effectively.

1.1 HTTP request structure

         An HTTP request consists of the following parts:

  1. Request Line: Contains the request method, target URL, and protocol version.
  2. Request Headers: Contains meta information about the request, such as User-Agent, Accept, Cookie, etc.
  3. Empty line: Separates the request headers from the request body.
  4. Request Body: Appears only with methods such as POST and contains the actual data of the request.
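To make these four parts concrete, here is a minimal sketch (the method, URL, headers, and form data are illustrative placeholders) that prints what such a request looks like on the wire:

# A minimal sketch of a raw HTTP POST request (illustrative values only)
raw_request = (
    "POST /login HTTP/1.1\r\n"                    # 1. request line: method, path, protocol version
    "Host: www.example.com\r\n"                   # 2. request headers
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 27\r\n"
    "\r\n"                                        # 3. empty line separating headers from body
    "username=alice&password=123"                 # 4. request body (present for POST)
)
print(raw_request)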

1.2 HTTP response structure

         An HTTP response consists of the following parts:

  1. Status Line: Contains protocol version, status code and status information.
  2. Response Headers: Contains meta information about the response, such as Content-Type, Content-Length, etc.
  3. Empty line: used to separate the response header and response body.
  4. Response Body: Contains the actual data of the response, such as HTML content, JSON data, etc.
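Correspondingly, a minimal sketch of a raw HTTP response (illustrative values only):

# A minimal sketch of a raw HTTP response (illustrative values only)
raw_response = (
    "HTTP/1.1 200 OK\r\n"                         # 1. status line: protocol version, status code, status text
    "Content-Type: text/html; charset=utf-8\r\n"  # 2. response headers
    "Content-Length: 13\r\n"
    "\r\n"                                        # 3. empty line separating headers from body
    "<p>Hello!</p>"                               # 4. response body (HTML, JSON, etc.)
)
print(raw_response)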

1.3 Common HTTP methods

  1. GET: Used to retrieve data from the server; parameters are appended to the URL.
  2. POST: Used to submit data to the server and include the data in the request body.
  3. PUT: Used to update resources on the server, including data in the request body.
  4. DELETE: Used to delete a resource on the server; any parameters are passed in the URL.
  5. HEAD: Similar to GET, but only returns the response header, which is used to obtain the meta information of the resource.
  6. OPTIONS: Used to query the HTTP methods supported by the server.
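Each of these methods has a matching helper function in the requests library covered later in this article; a minimal sketch (www.example.com is only a placeholder URL):

import requests

url = 'https://www.example.com/items/1'  # placeholder URL

requests.get(url)                        # GET: fetch a resource
requests.post(url, data={'k': 'v'})      # POST: submit data in the request body
requests.put(url, data={'k': 'v'})       # PUT: update a resource
requests.delete(url)                     # DELETE: delete a resource
requests.head(url)                       # HEAD: response headers only, no body
requests.options(url)                    # OPTIONS: ask which methods are supported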

1.4 Common HTTP status codes:

  1. 200 OK: The request was successful.
  2. 201 Created: The resource was created successfully.
  3. 400 Bad Request: The request is malformed or invalid.
  4. 401 Unauthorized: Authentication is required or has failed.
  5. 403 Forbidden: The server rejects the request.
  6. 404 Not Found: The requested resource does not exist.
  7. 500 Internal Server Error: Internal server error.
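In crawler code these status codes are usually checked on the response object; a minimal sketch (the URL is a placeholder):

import requests

response = requests.get('https://www.example.com/some-page')  # placeholder URL
if response.status_code == 200:
    print("Request succeeded")
elif response.status_code == 404:
    print("Resource not found")
elif response.status_code >= 500:
    print("Server error")
else:
    print("Other status:", response.status_code)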

Example: The following is a simple example that demonstrates how to use Python's http.server module to create a simple HTTP server that handles GET and POST requests. You can run this example in a terminal and then visit the corresponding URL in your browser.

# Create a simple HTTP server
# Run in a terminal: python http_server_example.py
import http.server
import socketserver

class MyHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        self.wfile.write(b'Hello, GET request!')

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        response = f'Hello, POST request! Data: {post_data.decode()}'
        self.wfile.write(response.encode())

if __name__ == "__main__":
    PORT = 8000
    with socketserver.TCPServer(("", PORT), MyHandler) as httpd:
        print(f"Serving at port {PORT}")
        httpd.serve_forever()

Visit http://localhost:8000 in a browser to see the server's response. You can also use tools such as curl or the requests library to send HTTP requests and receive responses.
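For example, with the server above still running, a short sketch that exercises it with the requests library (this assumes the server from the previous block is listening on port 8000):

import requests

# Exercise the local test server from the previous example
get_resp = requests.get('http://localhost:8000')
print(get_resp.text)      # Hello, GET request!

post_resp = requests.post('http://localhost:8000', data={'name': 'test'})
print(post_resp.text)     # Hello, POST request! Data: name=test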

2 HTTP and HTTPS

        HTTP (Hypertext Transfer Protocol) and HTTPS (Hypertext Transfer Protocol Secure) are both protocols for transferring data between a client and a server, but there are important security and encryption differences between them.

HTTP (Hypertext Transfer Protocol): HTTP is a protocol for transferring hypertext data between web browsers and web servers. HTTP traffic is transmitted in clear text, which means the data is not encrypted and can easily be eavesdropped on or tampered with. HTTP usually uses port 80.

HTTPS (Hypertext Transfer Protocol Secure): HTTPS is the secure version of HTTP that protects transmitted data through encryption and authentication. With HTTPS, data is transmitted encrypted, making it much harder to eavesdrop on or tamper with. To achieve this, HTTPS uses the SSL (Secure Sockets Layer) or TLS (Transport Layer Security) protocol. HTTPS usually uses port 443.

Main difference:

  1. Security: The most notable difference is security. HTTP does not encrypt data, while HTTPS protects data transmission through encryption to ensure data confidentiality and integrity.

  2. Encryption: HTTPS uses the SSL or TLS protocol to encrypt data so that data cannot be easily eavesdropped or tampered with during transmission. HTTP does not provide encryption and data may be monitored and modified by third parties.

  3. Authentication: HTTPS can also authenticate the server during the encryption process to ensure that you communicate with the correct server. HTTP does not provide this functionality and may be vulnerable to man-in-the-middle attacks.

  4. URL prefix: HTTP URLs start with "http://", while HTTPS URLs start with "https://".

While HTTPS is superior to HTTP in terms of security, it is slightly slower because of the computational overhead of encryption and decryption. However, as computing power has improved, this performance gap has gradually narrowed.

        In the modern web, protecting user privacy and data security is very important, therefore, many websites are switching to using HTTPS to ensure the protection of user data.

3 HTTP request process

 3.1 HTTP request process

         The HTTP request process involves the client sending a request to the server, the server processing the request and returning a response. The following is the basic process of an HTTP request:

  1. The client initiates an HTTP request, including the request method (GET, POST, etc.), target URL, request header, request body, etc.
  2. The server receives and processes the request, and finds the corresponding resource according to the request method and URL.
  3. The server generates an HTTP response, including status code, response header, response body, etc.
  4. The server sends a response back to the client.
  5. The client receives the response and processes the response content.

3.2 GET request and POST request

         GET and POST are HTTP request methods used to send requests to the server.

  • GET request: used to obtain data from the server; parameters are passed in the URL and are therefore visible, which makes GET suitable for fetching data.
  • POST request: used to submit data to the server; parameters are passed in the request body, which makes POST suitable for operations such as adding or modifying data.
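A short sketch of this difference as it appears with the requests library (the URLs are placeholders):

import requests

# GET: parameters are appended to the URL as a query string
r1 = requests.get('https://www.example.com/search', params={'q': 'python'})
print(r1.url)            # https://www.example.com/search?q=python

# POST: data is carried in the request body, not in the URL
r2 = requests.post('https://www.example.com/submit', data={'q': 'python'})
print(r2.request.body)   # q=python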

3.3 Common request headers

         Request Headers in an HTTP request contain additional information about the request, such as user agent, content type, etc. Here are some common request headers:

  • User-Agent: Identifies the type and version of the client (usually a browser).
  • Content-Type: Specifies the media type of the request body (such as application/json, application/x-www-form-urlencoded, etc.).
  • Authorization: Contains authentication credentials for authentication.
  • Referer: Indicates the source URL of the request, used to prevent CSRF attacks.
  • Cookie: Contains the client's cookie information and is used to maintain the session state.
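A minimal sketch of sending these headers with requests (all header values and the URL are placeholders):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0',              # identify the client type
    'Referer': 'https://www.example.com/',    # page the request supposedly came from
    'Cookie': 'sessionid=abc123',             # session state
}
response = requests.get('https://www.example.com/page', headers=headers)
print(response.request.headers)               # the headers that were actually sent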

3.4 HTTP response

         The HTTP response contains the processing result of the request by the server, including status code, response header, response body, etc.

  • Status Code: Indicates how the server handled the request; for example, 200 OK means success and 404 Not Found means the resource was not found.
  • Response Headers: Contains meta information about the response, such as Content-Type, Server, etc.
  • Response Body: Contains the actual response content, such as the HTML content of the web page, JSON data, etc.

The following is an example that demonstrates using the Python requests library to send a GET request, then parse and print the response:

import requests

url = 'https://www.example.com'
response = requests.get(url)

print("Status Code:", response.status_code)
print("Headers:", response.headers)
print("Content:", response.text)

4 HTTP request library requests common syntax

requests is a commonly used Python library for sending HTTP requests and handling HTTP responses. Below are examples of its basic usage.

First, make sure you have the requests library installed. If it is not installed, you can install it with the following command:

pip install requests

You can then import requests in your Python code and use it to send HTTP requests and handle responses.

4.1 Send GET request

GET requests are sent with the requests.get() method. The following example demonstrates how to use the requests library to send a simple GET request and process the response:

import requests

# Send a GET request to fetch the page content
url = 'https://www.baidu.com'  # Replace with the URL you want to visit
response = requests.get(url)
response.encoding = 'utf-8'  # Specify UTF-8 encoding
html_content = response.text

# Print the page content
print(html_content)

Common syntax:

Initiate a GET request:

import requests

response = requests.get('https://www.example.com')
print(response.text)  # Print the response content

Initiate a GET request with parameters:

params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://www.example.com', params=params)

Send a request and set headers:

headers = {'User-Agent': 'My User Agent'}
response = requests.get('https://www.example.com', headers=headers)

Get the response status code:

response = requests.get('https://www.example.com')
status_code = response.status_code

Get the response headers:

response = requests.get('https://www.example.com')
headers = response.headers

Get the response content (bytes):

response = requests.get('https://www.example.com')
content = response.content

Get the response content (text):

response = requests.get('https://www.example.com')
text = response.text

Process JSON data in the response:

response = requests.get('https://api.example.com/data.json')
data = response.json()

Handle timeouts:

try:
    response = requests.get('https://www.example.com', timeout=5)  # 5-second timeout
except requests.Timeout:
    print("Request timed out")

Handle exceptions:

try:
    response = requests.get('https://www.example.com')
    response.raise_for_status()  # Raise an exception for HTTP error status codes
except requests.HTTPError as http_err:
    print(f"HTTP error: {http_err}")
except requests.RequestException as req_err:
    print(f"Request exception: {req_err}")

 4.2 Send POST request

The following example demonstrates how to use the requests library to send a POST request with data:

import requests

# Login URL and the data required to log in
login_url = 'https://mail.163.com/'
login_data = {
    'username': 'your_username',  # Replace with your mailbox username
    'password': 'your_password'   # Replace with your mailbox password
}

# Create a session object
session = requests.Session()

# Send a POST request to simulate logging in
response = session.post(login_url, data=login_data)

# Check whether the login succeeded ('退出' means "log out", so its presence suggests we are logged in)
if '退出' in response.text:
    print("Login successful.")
else:
    print("Login failed.")

In this sample code, we use requests.Session() to create a session object so that session state can be maintained across multiple requests. We then use the session.post() method to send a POST request to simulate a login. This example uses the 163 Mail login page as a demonstration; you need to replace login_url and login_data with the actual login URL and the data required to log in.

Please note that this is just a simple example; real websites may have more complex login logic, such as verification codes or dynamic tokens. Also, when a crawler visits a website, it must abide by the site's rules and policies to ensure that your actions are legal and compliant.
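Because the Session object stores the cookies returned by the server, any later request made through the same session sends them automatically; a minimal sketch (the URLs are placeholders):

import requests

session = requests.Session()

# The first request may set cookies (e.g. a session ID) on the session object
session.get('https://www.example.com/login-page')  # placeholder URL
print(session.cookies.get_dict())                  # cookies collected so far

# Later requests through the same session send those cookies automatically
session.get('https://www.example.com/profile')     # placeholder URL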

Common syntax:

Send a POST request:

data = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://www.example.com', data=data)

Send JSON data with a POST request:

import json

data = {'key1': 'value1', 'key2': 'value2'}
headers = {'Content-Type': 'application/json'}
response = requests.post('https://www.example.com', data=json.dumps(data), headers=headers)
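As an alternative to serializing the dictionary yourself, requests also accepts a json= keyword argument that serializes the data and sets the Content-Type header to application/json for you:

import requests

data = {'key1': 'value1', 'key2': 'value2'}
# json= serializes the dict and sets Content-Type: application/json automatically
response = requests.post('https://www.example.com', json=data)
print(response.request.headers['Content-Type'])  # application/json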

4.3 Request parameters and headers

When using the requests library to send HTTP requests, you can pass additional information through request parameters and headers. Request parameters are typically used for GET requests or requests with query parameters, while request headers pass various information such as the user agent, cookies, etc. The following is sample code for request parameters and headers:

import requests

# Example request parameters
params = {
    'key1': 'value1',
    'key2': 'value2'
}

# Example request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Referer': 'https://www.baidu.com',
    'Cookie': 'your_cookie_data'
}

# Send a GET request with parameters and headers
url = 'https://www.baidu.com'  # Replace with the URL you want to visit
response = requests.get(url, params=params, headers=headers)

# Print the response content
print(response.text)

4.4 Encoding format

When using the requests library to send HTTP requests, the encoding format (also known as the character set or character encoding) determines how the response content is decoded. The requests library tries to identify the response encoding automatically, but sometimes you may need to set the encoding manually to ensure that the response content is parsed correctly.

Here are some explanations and examples of encoding formats:

  1. Automatic encoding detection: By default, the requests library tries to identify the response encoding from the Content-Type field in the response headers. For example, if Content-Type includes charset=utf-8, UTF-8 will be used to decode the response content.

  2. Manually setting the encoding: If the automatically detected encoding is incorrect, you can set the encoding manually to fix garbled text. Setting response.encoding to the appropriate encoding ensures the response content is decoded correctly.

Here is an example that demonstrates how to set the encoding manually so the response content is parsed correctly:

import requests

# Send a GET request to fetch the page content
url = 'https://www.baidu.com'  # Replace with the URL you want to visit
response = requests.get(url)
response.encoding = 'utf-8'  # Manually set the encoding to UTF-8

# Print the response content
print(response.text)
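If you are not sure which encoding a page uses, the response object also exposes apparent_encoding, which is guessed from the content itself and can be assigned back to response.encoding:

import requests

response = requests.get('https://www.baidu.com')  # same example URL as above
print(response.encoding)                          # encoding declared by the response headers
print(response.apparent_encoding)                 # encoding guessed from the response content

# Use the guessed encoding if the declared one produces garbled text
response.encoding = response.apparent_encoding
print(response.text)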

4.5 Requests advanced operation - file upload

The requests library allows you to send file upload requests, i.e. to send a file to the server as part of the request. This is useful when interacting with APIs that support file uploads.

To send a file upload request, use the requests.post() method and pass the file to upload via the files parameter. The files argument should be a dictionary whose keys are field names and whose values are file objects. File objects can be created with the open() function.

The following is a simple file upload example, assuming you want to upload a local file to the server:

import requests

# Target URL and file path
url = 'https://www.example.com/upload'  # Replace with the actual upload URL
file_path = 'path/to/your/file.txt'  # Replace with the actual file path

# Open the file in binary mode
with open(file_path, 'rb') as file:
    files = {'file': file}  # 'file' is the field name; change it to match what the server expects

    # Send the file upload request
    response = requests.post(url, files=files)

# Print the response content
print(response.text)

In this example, we use the open() function to open the file in binary mode and pass the file object via the files argument. In the files dictionary, the keys are the field names the server expects to receive and the values are file objects. Replace 'file' with the actual field name.

Note that the actual server may require additional fields or parameters, such as authentication or a token. Adjust the code to the actual situation.
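If the server expects a specific filename or content type, the value in the files dictionary can also be a tuple of (filename, file object, content type); a minimal sketch with placeholder URL and path:

import requests

url = 'https://www.example.com/upload'  # placeholder upload URL
with open('path/to/your/file.txt', 'rb') as f:
    # (filename, file object, content type) controls what the server sees
    files = {'file': ('file.txt', f, 'text/plain')}
    response = requests.post(url, files=files)

print(response.status_code)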

4.6 Requests advanced operation - get cookie

In the requests library, you can access the cookies received from the server through the response.cookies attribute. Cookies are key-value pairs set by the server in the HTTP response headers to store state between the client and the server. The following are instructions and examples for obtaining cookies:

import requests

# Send a GET request to fetch the page content
url = 'https://www.example.com'  # Replace with the URL you want to visit
response = requests.get(url)

# Get the cookies from the response
cookies = response.cookies

# Print the cookies
for cookie in cookies:
    print("Name:", cookie.name)
    print("Value:", cookie.value)

In this example, we use the requests.get() method to send a GET request and the response.cookies attribute to get the cookies from the response. response.cookies returns a RequestsCookieJar object that you can iterate over to get the name and value of each cookie.

Note that a response may include multiple cookies, each of which is a key-value pair. You can process them as needed, for example storing them in a session or sending them with the next request.

Also, if you want to set cookies manually and use them in subsequent requests, you can pass them with the cookies parameter. For example:

import requests

# Set the cookies
cookies = {'cookie_name': 'cookie_value'}

# Send a GET request with the cookies attached
url = 'https://www.example.com'  # Replace with the URL you want to visit
response = requests.get(url, cookies=cookies)

# Process the response...

In this example, we use the cookies parameter to attach the cookie information to the request. This is useful when cookies need to be handled manually.

4.7 Requests advanced operation - certificate verification

In the requests library, you can control whether the SSL certificate is verified through the verify parameter. SSL certificate verification ensures a secure, encrypted connection with the server. By default, the requests library verifies SSL certificates, but you can disable verification or provide a custom certificate by setting the verify parameter.

The following are detailed instructions and examples for certificate verification:

  1. Default verification: By default, the requests library verifies SSL certificates. This is the safe practice, since it ensures that communication with the server is encrypted. For example:

import requests

# Send a GET request
url = 'https://www.example.com'  # Replace with the URL you want to visit
response = requests.get(url)

# Process the response...

  2. Disabling verification: In some cases you may want to disable certificate verification, for example when accessing a server with a self-signed certificate. You can disable verification by setting the verify parameter to False:

import requests

# Send a GET request with certificate verification disabled
url = 'https://www.example.com'  # Replace with the URL you want to visit
response = requests.get(url, verify=False)

# Process the response...

 Note that disabling certificate validation reduces security and should only be used with an understanding of the risks.
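Also note that when verify=False is used, requests (through urllib3) emits an InsecureRequestWarning for every call; if you have consciously accepted the risk, the warning can be silenced like this:

import requests
import urllib3

# Silence the InsecureRequestWarning emitted when certificate verification is disabled.
# Only do this if you understand and accept the security risk.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get('https://www.example.com', verify=False)  # placeholder URL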

  3. Custom certificate: If you need to connect to a server using a custom CA certificate, you can provide the path to the certificate file as the value of the verify parameter:

import requests

# Send a GET request, verifying with a custom certificate
url = 'https://www.example.com'  # Replace with the URL you want to visit
response = requests.get(url, verify='/path/to/custom/certificate.pem')

# Process the response...

In this example, /path/to/custom/certificate.pem is the path to your custom certificate file.

 Please note that in order to protect your data security, it is recommended to keep certificate verification turned on in practical applications. If you need to disable or customize certificate validation in a specific situation, make sure you understand the possible security risks and take appropriate action.

5 Hands-on practice

Use the requests library to grab the titles and links of the 2023 college entrance examination news

import requests
from bs4 import BeautifulSoup
import time

def fetch_news_by_page(page_number):
    keyword = "2023年高考录取"
    results_per_page = 10
    pn = (page_number - 1) * results_per_page

    # Build the search URL, including the search keyword and pagination parameter
    url = f"https://www.baidu.com/s?wd={keyword}&pn={pn}"

    # Add headers to simulate a browser request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
        "Referer": "https://www.baidu.com/"
    }

    # Send the request
    response = requests.get(url, headers=headers)

    # If the request succeeded
    if response.status_code == 200:
        # Parse the page content
        soup = BeautifulSoup(response.text, 'html.parser')
        news_list = []

        # Find all news titles and links
        for news in soup.find_all('div', class_='result'):
            title_elem = news.find('h3', class_='t')
            title = title_elem.get_text() if title_elem else None

            link_elem = news.find('a')
            link = link_elem['href'] if link_elem and 'href' in link_elem.attrs else None

            if title and link:
                news_list.append({"title": title, "link": link})

        return news_list
    else:
        print("请求失败,状态码:", response.status_code)
        return None

if __name__ == "__main__":
    for page in range(1, 4):  # Iterate over the first three pages of results
        print(f"Search results for page {page}:")
        news = fetch_news_by_page(page)
        if news:
            for idx, item in enumerate(news, start=1):
                print(f"{idx}. {item['title']}")
                print(f"   Link: {item['link']}")
                print("=" * 50)
        else:
            print("没有搜索结果。")
        time.sleep(2)  # 添加延时,模拟人类浏览行为


         This code is a Python web crawler, which is used to grab news headlines and links about "2023 college entrance examination admission" from the Baidu search engine.

  1. First, it imports the requests library (for sending HTTP requests), the BeautifulSoup library (for parsing HTML documents), and the time library (for pausing program execution).

  2. It defines a function fetch_news_by_page(), which accepts a parameter page_number indicating the page number to fetch.

  3. Inside the function, first define the search keyword "2023 college entrance examination admission" and the number of results displayed per page results_per_page.

  4. Then, a Baidu search URL is constructed, including the search keyword and the pagination parameter. An f-string is used to insert the keyword and the computed pn offset into the URL.

  5. Next, a headers dictionary is defined, which contains two fields User-Agent and Referer, which are used to simulate a browser sending a request.

  6. Use the requests.get() function to send a GET request, passing the headers dictionary as a parameter.

  7. If the request is successful (that is, the HTTP status code is 200), the returned HTML document is parsed using BeautifulSoup.

  8. In the parsed HTML document, find all news headlines and links. The find_all() function is used here to find all div elements with class 'result', and then find the h3 tag (class is 't') and a tag in each div element.

  9. If the title and link are found, they are added to the news_list list.

  10. Finally, if the request fails, print out the failed status code and return None.

  11. In the main program, call the fetch_news_by_page() function to traverse the search results of the first three pages and print them out. In order to avoid frequent network requests, there is a 2-second pause after each print result.

Previous:

[100 days proficient in python] Day41: Python web crawler development_Introduction to crawler basics: https://blog.csdn.net/qq_35831906/article/details/132377113
