Table of contents
1 HTTP protocol
1.1 HTTP request structure
1.2 HTTP response structure
1.3 Common HTTP methods
1.4 Common HTTP status codes
2 HTTP and HTTPS
3 HTTP request process
3.1 HTTP request process
3.2 GET request and POST request
3.3 Common request headers
3.4 HTTP response
4 HTTP request library requests common syntax
4.1 Send GET request
4.2 Send POST request
4.3 Request parameters and headers
4.4 Encoding format
4.5 Requests advanced operation - file upload
4.6 Requests advanced operation - get cookie
4.7 Requests advanced operation - certificate verification
5 Hands-on example: use the requests library to grab the titles and links of the 2023 college entrance examination news
1 HTTP protocol
HTTP (Hypertext Transfer Protocol) is a protocol for transferring data between a client and a server. It is based on the request-response model: the client sends an HTTP request and the server returns an HTTP response. HTTP is mainly used for communication between web browsers and servers to obtain, transmit, and display web pages and resources.
In web crawling, HTTP plays a vital role. The following are some of its key roles in crawlers:
Obtaining webpage content: The crawler uses the HTTP protocol to send a request to the server to obtain the content of the webpage. By sending a GET request, the crawler can ask the server to return the HTML code of the web page.
Sending requests: Crawlers can use different HTTP request methods, such as GET, POST, PUT, etc., to send different types of requests to the server. GET requests are used to fetch resources, while POST requests are used to submit data, PUT requests are used to update resources, and so on.
Passing parameters: The crawler can pass various data, such as query parameters, form data, etc., through URL parameters or request body parameters of HTTP requests. This is useful when scraping specific data or doing searches.
Set request headers: Crawlers can set request headers in HTTP requests, including User-Agent, Referer, Cookie, etc., to simulate different types of browser behaviors, or to bypass website anti-crawling measures.
Processing the response: The server returns an HTTP response, which contains the status code, response headers, and response body. The crawler can judge whether the request is successful according to the status code, obtain information from the response header, and extract the web page content from the response body.
Parsing HTML content: The crawler extracts the required information from HTML content by parsing it. This usually involves using a library such as Beautiful Soup to parse the DOM structure of the web page.
Simulated login: For websites that require login to access, the crawler can submit the login form by simulating a POST request, gaining access to content that requires authentication.
Anti-crawling processing: The crawler may encounter the anti-crawling mechanism of the website, such as limiting access frequency, verification code, etc. In this case, crawlers need to properly adjust request headers, use proxy IP, etc. to bypass these restrictions.
In short, the HTTP protocol is the basis of the crawler's work. By sending a request to the server and parsing the server's response, the crawler can obtain the required data from the web page, and then process, analyze and store it. At the same time, understanding the various characteristics and mechanisms of the HTTP protocol can help crawlers operate and interact with servers more effectively.
1.1 HTTP request structure
An HTTP request consists of the following parts:
- Request Line: Contains the request method, target URL, and protocol version.
- Request Headers: Contains meta information about the request, such as User-Agent, Accept, Cookie, etc.
- Empty line: used to separate request header and request body.
- Request Body: appears only with methods such as POST, and contains the actual data of the request.
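Putting these parts together, a minimal POST request looks like this on the wire (the host and form fields below are made up for illustration):

```python
# A raw HTTP POST request, assembled line by line for illustration.
# The blank line separates the headers from the request body.
raw_request = (
    "POST /login HTTP/1.1\r\n"        # request line: method, URL path, version
    "Host: www.example.com\r\n"       # request headers
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 27\r\n"
    "\r\n"                            # empty line
    "username=alice&password=123"     # request body
)
print(raw_request)
```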
1.2 HTTP response structure
An HTTP response consists of the following parts:
- Status Line: Contains protocol version, status code and status information.
- Response Headers: Contains meta information about the response, such as Content-Type, Content-Length, etc.
- Empty line: used to separate the response header and response body.
- Response Body: Contains the actual data of the response, such as HTML content, JSON data, etc.
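Correspondingly, a minimal response looks like this (the body is made up for illustration):

```python
# A raw HTTP response, assembled line by line for illustration.
raw_response = (
    "HTTP/1.1 200 OK\r\n"             # status line: version, status code, status text
    "Content-Type: text/html\r\n"     # response headers
    "Content-Length: 14\r\n"
    "\r\n"                            # empty line
    "<h1>Hello</h1>"                  # response body
)
print(raw_response)
```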
1.3 Common HTTP methods
- GET: Used to fetch data from the server; parameters are appended to the URL.
- POST: Used to submit data to the server; the data is carried in the request body.
- PUT: Used to update a resource on the server; the data is carried in the request body.
- DELETE: Used to delete a resource from the server; the resource is identified by the URL.
- HEAD: Similar to GET, but returns only the response headers; used to obtain a resource's metadata.
- OPTIONS: Used to query the HTTP methods supported by the server.
1.4 Common HTTP status codes:
- 200 OK: The request was successful.
- 201 Created: The resource was created successfully.
- 400 Bad Request: The request is malformed or invalid.
- 401 Unauthorized: The request is unauthorized.
- 403 Forbidden: The server rejects the request.
- 404 Not Found: The requested resource does not exist.
- 500 Internal Server Error: Internal server error.
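These status codes are standardized; Python's standard library exposes them through http.HTTPStatus, which is convenient for looking up a code's reason phrase:

```python
from http import HTTPStatus

# Look up the standard reason phrase for each of the codes above
for code in (200, 201, 400, 401, 403, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)  # e.g. 200 OK ... 404 Not Found ...

# Status codes compare directly against integers
print(HTTPStatus.OK == 200)  # True
```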
Example: The following is a simple example that demonstrates how to use Python's http.server module to create a simple HTTP server that handles GET and POST requests. You can run this example in a terminal, then visit the corresponding URL in your browser.
# Create a simple HTTP server
# Run in a terminal: python http_server_example.py
import http.server
import socketserver

class MyHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        self.wfile.write(b'Hello, GET request!')

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        response = f'Hello, POST request! Data: {post_data.decode()}'
        self.wfile.write(response.encode())

if __name__ == "__main__":
    PORT = 8000
    with socketserver.TCPServer(("", PORT), MyHandler) as httpd:
        print(f"Serving at port {PORT}")
        httpd.serve_forever()
Visit http://localhost:8000 in the browser to see the server's response. You can also use tools such as curl or the requests library to send HTTP requests and receive responses.
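As a sketch of that round trip without a browser (assuming only the standard library; the handler mirrors the do_GET above, and port 0 lets the OS pick a free port):

```python
import http.server
import socketserver
import threading
import urllib.request

class EchoHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/plain')
        self.end_headers()
        self.wfile.write(b'Hello, GET request!')

    def log_message(self, format, *args):
        pass  # keep the output quiet

# Port 0 asks the OS for any free port, so the sketch never collides
httpd = socketserver.TCPServer(("", 0), EchoHandler)
port = httpd.server_address[1]
threading.Thread(target=httpd.serve_forever, daemon=True).start()

# Query our own server with the standard library's HTTP client
with urllib.request.urlopen(f'http://localhost:{port}') as resp:
    status, body = resp.status, resp.read().decode()
print(status, body)  # 200 Hello, GET request!

httpd.shutdown()
```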
2 HTTP and HTTPS
HTTP (Hypertext Transfer Protocol) and HTTPS (Hypertext Transfer Protocol Secure) are both protocols for transferring data between a client and a server, but there are important security and encryption differences between them.
HTTP (Hypertext Transfer Protocol): HTTP is a protocol for transferring hypertext data, which communicates between web browsers and web servers. The HTTP protocol is transmitted in clear text, which means that the transmitted data is not encrypted and may be easily eavesdropped and tampered with. It usually uses port 80 for communication.
HTTPS (Hypertext Transfer Protocol Secure): HTTPS is a secure version of HTTP that protects transmitted data by using encryption and authentication mechanisms. In HTTPS, data is transmitted encrypted, making it more difficult to eavesdrop and tamper with. To achieve encryption, HTTPS uses the SSL (Secure Sockets Layer) or TLS (Transport Layer Security) protocol. HTTPS usually uses port 443 for communication.
Main difference:
Security: The most notable difference is security. HTTP does not encrypt data, while HTTPS protects data transmission through encryption to ensure data confidentiality and integrity.
Encryption: HTTPS uses the SSL or TLS protocol to encrypt data so that data cannot be easily eavesdropped or tampered with during transmission. HTTP does not provide encryption and data may be monitored and modified by third parties.
Authentication: HTTPS can also authenticate the server during the encryption process to ensure that you communicate with the correct server. HTTP does not provide this functionality and may be vulnerable to man-in-the-middle attacks.
URL prefix: HTTP URLs start with "http://", while HTTPS URLs start with "https://".
While HTTPS is superior to HTTP in terms of security, HTTPS is slightly slower than HTTP due to some computational overhead involved in the encryption and decryption process. However, with the improvement of computing power, the performance gap of HTTPS gradually narrows.
In the modern web, protecting user privacy and data security is very important, therefore, many websites are switching to using HTTPS to ensure the protection of user data.
3 HTTP request process
3.1 HTTP request process
The HTTP request process involves the client sending a request to the server, the server processing the request and returning a response. The following is the basic process of an HTTP request:
- The client initiates an HTTP request, including the request method (GET, POST, etc.), target URL, request header, request body, etc.
- The server receives and processes the request, and finds the corresponding resource according to the request method and URL.
- The server generates an HTTP response, including status code, response header, response body, etc.
- The server sends a response back to the client.
- The client receives the response and processes the response content.
3.2 GET request and POST request
GET and POST are HTTP request methods used to send requests to the server.
- GET request: used to obtain data from the server, passing parameters through the URL, the request parameters are visible in the URL, suitable for obtaining data.
- POST request: used to submit data to the server, request parameters are passed in the request body, and operations such as adding and modifying data are performed.
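The difference shows up in where the parameters travel. The following sketch uses requests' prepared-request API, so no network access is needed (example.com is a placeholder):

```python
import requests

# GET: parameters are encoded into the URL; the body stays empty
get_req = requests.Request('GET', 'https://www.example.com/search',
                           params={'q': 'python', 'page': 2}).prepare()
print(get_req.url)   # https://www.example.com/search?q=python&page=2
print(get_req.body)  # None

# POST: parameters travel in the request body, not the URL
post_req = requests.Request('POST', 'https://www.example.com/login',
                            data={'user': 'alice'}).prepare()
print(post_req.url)   # https://www.example.com/login
print(post_req.body)  # user=alice
```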
3.3 Common request headers
Request Headers in an HTTP request contain additional information about the request, such as user agent, content type, etc. Here are some common request headers:
- User-Agent: Identifies the type and version of the client (usually a browser).
- Content-Type: Specifies the media type of the request body (such as application/json, application/x-www-form-urlencoded, etc.).
- Authorization: Contains authentication credentials for authentication.
- Referer: Indicates the URL of the page that originated the request; servers use it to check where requests come from (e.g. for CSRF protection).
- Cookie: Contains the client's cookie information and is used to maintain the session state.
3.4 HTTP response
The HTTP response contains the processing result of the request by the server, including status code, response header, response body, etc.
- Status Code (Status Code): Indicates the processing status of the server to the request, such as 200 OK means success, 404 Not Found means resource not found.
- Response Headers: Contains meta information about the response, such as Content-Type, Server, etc.
- Response Body: Contains the actual response content, such as the HTML content of the web page, JSON data, etc.
The following is an example that demonstrates using the Python requests library to send a GET request, then parse and print the response:
import requests
url = 'https://www.example.com'
response = requests.get(url)
print("Status Code:", response.status_code)
print("Headers:", response.headers)
print("Content:", response.text)
4 HTTP request library requests common syntax
requests is a commonly used Python library for sending HTTP requests and handling HTTP responses. Here is an example of basic usage of the library.
First, make sure you have the requests library installed. If it is not installed, you can install it with the following command:
pip install requests
You can then import requests in your Python code and use it to send HTTP requests and handle responses.
4.1 Send GET request
Use the requests.get() method. The following example demonstrates how to use the requests library to send a simple GET request and process the response:
import requests

# Send a GET request to fetch the page content
url = 'https://www.baidu.com'  # replace with the URL you want to visit
response = requests.get(url)
response.encoding = 'utf-8'  # specify UTF-8 encoding
html_content = response.text

# Print the page content
print(html_content)
Common syntax:
Initiate a GET request:
import requests
response = requests.get('https://www.example.com')
print(response.text)  # print the response content
Initiate a GET request with parameters:
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://www.example.com', params=params)
Send a request with custom headers:
headers = {'User-Agent': 'My User Agent'}
response = requests.get('https://www.example.com', headers=headers)
Get the response status code:
response = requests.get('https://www.example.com')
status_code = response.status_code
Get the response headers:
response = requests.get('https://www.example.com')
headers = response.headers
Get the response content (bytes):
response = requests.get('https://www.example.com')
content = response.content
Get the response content (text):
response = requests.get('https://www.example.com')
text = response.text
Parse JSON data in the response:
response = requests.get('https://api.example.com/data.json')
data = response.json()
Handle timeouts:
try:
    response = requests.get('https://www.example.com', timeout=5)  # 5-second timeout
except requests.Timeout:
    print("Request timed out")
Handle exceptions:
try:
    response = requests.get('https://www.example.com')
    response.raise_for_status()  # raise an exception for HTTP error status codes
except requests.HTTPError as http_err:
    print(f"HTTP error: {http_err}")
except requests.RequestException as req_err:
    print(f"Request exception: {req_err}")
4.2 Send POST request
The following example demonstrates how to use the requests library to send a POST request with data:
import requests

# Login URL and the data required to log in
login_url = 'https://mail.163.com/'
login_data = {
    'username': 'your_username',  # replace with your mailbox username
    'password': 'your_password'   # replace with your mailbox password
}

# Create a session object
session = requests.Session()

# Send a POST request to simulate logging in
response = session.post(login_url, data=login_data)

# Check whether the login succeeded ('退出' means "log out")
if '退出' in response.text:
    print("Login successful.")
else:
    print("Login failed.")
In this sample code, we use requests.Session() to create a session object so that session state can be maintained across multiple requests. We then use the session.post() method to send a POST request that simulates a login. This example uses the 163 mailbox login page as a demonstration; you need to replace login_url and login_data with the actual login URL and the data required for login. Please note that this is just a simple example: real websites may have more complex login logic, such as verification codes or dynamic tokens. Also, when a crawler visits a website, it must abide by the site's rules and policies to ensure that your actions are legal and compliant.
Common syntax:
Send a POST request:
data = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://www.example.com', data=data)
Send JSON data with a POST request:
import json
data = {'key1': 'value1', 'key2': 'value2'}
headers = {'Content-Type': 'application/json'}
response = requests.post('https://www.example.com', data=json.dumps(data), headers=headers)
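Alternatively, requests can serialize the JSON itself: pass the dictionary via the json parameter and the Content-Type header is set automatically. A prepared-request sketch (no network access; example.com is a placeholder):

```python
import requests

data = {'key1': 'value1', 'key2': 'value2'}

# json= serializes the dict and sets Content-Type: application/json for you
req = requests.Request('POST', 'https://www.example.com', json=data).prepare()
print(req.headers['Content-Type'])  # application/json
print(req.body)
```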
4.3 Request parameters and headers
When using the requests library to send HTTP requests, you can pass additional information through request parameters and headers. Request parameters are usually used for GET requests or requests with query parameters, while request headers pass various information, such as the user agent, cookies, etc. The following is sample code for request parameters and headers:
import requests

# Example request parameters
params = {
    'key1': 'value1',
    'key2': 'value2'
}

# Example request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Referer': 'https://www.baidu.com',
    'Cookie': 'your_cookie_data'
}

# Send a GET request with parameters and headers
url = 'https://www.baidu.com'  # replace with the URL you want to visit
response = requests.get(url, params=params, headers=headers)

# Print the response content
print(response.text)
4.4 Encoding format
When using the requests library to send HTTP requests, the encoding format (also known as character set or character encoding) refers to the rules used to decode the response content. The requests library tries to identify the response encoding automatically, but sometimes you may need to set the encoding manually to ensure the response content is parsed correctly.
Here are some explanations and examples of encoding formats:
- Automatic detection: By default, the requests library tries to identify the response encoding from the Content-Type field in the response headers. For example, if Content-Type includes charset=utf-8, requests will decode the response content as UTF-8.
- Manual setting: If the automatically detected encoding is incorrect, you can set it manually to fix garbled output. Assigning the appropriate encoding to response.encoding ensures the response content is decoded correctly.
Here's an example that demonstrates how to manually set the encoding to correctly parse the response content:
import requests

# Send a GET request to fetch the page content
url = 'https://www.baidu.com'  # replace with the URL you want to visit
response = requests.get(url)
response.encoding = 'utf-8'  # manually set the encoding to UTF-8

# Print the response content
print(response.text)
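The effect of choosing the wrong encoding can be reproduced with plain bytes, independent of any website:

```python
# The same UTF-8 bytes decoded with the right and the wrong codec
data = '百度'.encode('utf-8')   # b'\xe7\x99\xbe\xe5\xba\xa6'

print(data.decode('utf-8'))    # 百度 (correct)
print(data.decode('latin-1'))  # garbled output (mojibake)
```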
4.5 Requests advanced operation - file upload
The requests library allows you to upload files, i.e. send a file to the server as part of the request. This is useful when interacting with APIs that support file uploads. To send a file upload request, use the requests.post() method and pass the files to upload via the files parameter. The files argument should be a dictionary whose keys are field names and whose values are file objects; file objects can be created with the open() function.
The following is a simple file upload example, assuming you want to upload a local file to the server:
import requests

# Target URL and file path
url = 'https://www.example.com/upload'  # replace with the actual upload URL
file_path = 'path/to/your/file.txt'    # replace with the actual file path

# Open the file and send the upload request
with open(file_path, 'rb') as file:
    files = {'file': file}  # 'file' is the field name; change it as needed
    # Send the file upload request
    response = requests.post(url, files=files)

# Print the response content
print(response.text)
In this example, we use the open() function to open the file in binary mode and pass the file object via the files argument. In the files dictionary, the keys are the field names the server expects and the values are file objects; replace 'file' with the actual field name. Note that the server may require additional fields or parameters, such as authentication credentials or tokens; adjust the code to the actual situation.
4.6 Requests advanced operation - get cookie
In the requests library, you can get the cookie information returned by the server through the response.cookies attribute. Cookies are key-value pairs set by the server in the HTTP response headers to store state information between the client and the server. The following are detailed instructions and examples for obtaining cookies:
import requests

# Send a GET request to fetch the page content
url = 'https://www.example.com'  # replace with the URL you want to visit
response = requests.get(url)

# Get the cookies from the response
cookies = response.cookies

# Print the cookie information
for cookie in cookies:
    print("Name:", cookie.name)
    print("Value:", cookie.value)
In this example, we send a GET request with requests.get() and read the cookies from the response via the response.cookies attribute, which returns a RequestsCookieJar object. You can iterate over it to get each cookie's name and value. Note that a response may contain multiple cookies, each a key-value pair. You can process this cookie information as needed, for example storing it in a session or sending it with the next request.
Also, if you want to set cookies manually and use them in subsequent requests, you can pass the cookies parameter (or add a Cookie field to the request headers). For example:
import requests

# Set the cookies
cookies = {'cookie_name': 'cookie_value'}

# Send a GET request with the cookies attached
url = 'https://www.example.com'  # replace with the URL you want to visit
response = requests.get(url, cookies=cookies)

# Process the response...
In this example, we use the cookies parameter to attach the cookie information to the request. This is useful when cookies need to be handled manually.
4.7 Requests advanced operation - certificate verification
In the requests library, the verify parameter controls whether SSL certificates are verified. SSL certificate verification ensures a secure encrypted connection with the server. By default, the requests library verifies SSL certificates, but you can use the verify parameter to disable verification or provide a custom certificate.
The following are detailed instructions and examples for certificate verification:
Default verification: By default, the requests library validates SSL certificates. This is the safe practice and ensures that communication with the server is encrypted. For example:
import requests

# Send a GET request (certificate verification is on by default)
url = 'https://www.example.com'  # replace with the URL you want to visit
response = requests.get(url)

# Process the response...
Disable verification: In some cases you may want to disable certificate verification, such as when accessing a server with a self-signed certificate. You can disable it by setting the verify parameter to False:
import requests

# Send a GET request with certificate verification disabled
url = 'https://www.example.com'  # replace with the URL you want to visit
response = requests.get(url, verify=False)

# Process the response...
Note that disabling certificate validation reduces security and should only be used with an understanding of the risks.
Custom certificate: If you need to connect to a server using a custom certificate, provide the path to the certificate file as the value of the verify parameter:
import requests

# Send a GET request verified against a custom certificate
url = 'https://www.example.com'  # replace with the URL you want to visit
response = requests.get(url, verify='/path/to/custom/certificate.pem')

# Process the response...
In this example, /path/to/custom/certificate.pem is the path to your custom certificate file. To protect your data, it is recommended to keep certificate verification enabled in practice. If you need to disable or customize certificate verification in a specific situation, make sure you understand the security risks and take appropriate precautions.
5 Hands-on example
Use the requests library to grab the titles and links of the 2023 college entrance examination news
import requests
from bs4 import BeautifulSoup
import time

def fetch_news_by_page(page_number):
    keyword = "2023年高考录取"  # "2023 college entrance examination admission"
    results_per_page = 10
    pn = (page_number - 1) * results_per_page
    # Build the search URL, including the keyword and pagination parameter
    url = f"https://www.baidu.com/s?wd={keyword}&pn={pn}"
    # Add headers to simulate a browser request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
        "Referer": "https://www.baidu.com/"
    }
    # Send the request
    response = requests.get(url, headers=headers)
    # If the request succeeded
    if response.status_code == 200:
        # Parse the page content
        soup = BeautifulSoup(response.text, 'html.parser')
        news_list = []
        # Find all news titles and links
        for news in soup.find_all('div', class_='result'):
            title_elem = news.find('h3', class_='t')
            title = title_elem.get_text() if title_elem else None
            link_elem = news.find('a')
            link = link_elem['href'] if link_elem and 'href' in link_elem.attrs else None
            if title and link:
                news_list.append({"title": title, "link": link})
        return news_list
    else:
        print("Request failed, status code:", response.status_code)
        return None

if __name__ == "__main__":
    for page in range(1, 4):  # print the first three pages
        print(f"Search results for page {page}:")
        news = fetch_news_by_page(page)
        if news:
            for idx, item in enumerate(news, start=1):
                print(f"{idx}. {item['title']}")
                print(f"   Link: {item['link']}")
                print("=" * 50)
        else:
            print("No search results.")
        time.sleep(2)  # pause to simulate human browsing
This code is a Python web crawler that grabs news titles and links about "2023 college entrance examination admission" from the Baidu search engine.
It first imports the requests library (for sending HTTP requests), the BeautifulSoup library (for parsing HTML documents), and the time library (for pausing program execution).
It then defines a function fetch_news_by_page(), which accepts a parameter page_number indicating which page of results to fetch.
Inside the function, it defines the search keyword and the number of results per page, results_per_page, then computes the pagination offset pn.
Next, a Baidu search URL is constructed from the keyword and pagination parameter; an f-string inserts the keyword and pn into the URL.
A headers dictionary is then defined containing User-Agent and Referer fields, which simulate a browser sending the request.
The requests.get() function sends a GET request, with the headers dictionary passed as a parameter.
If the request succeeds (the HTTP status code is 200), the returned HTML document is parsed with BeautifulSoup.
In the parsed document, the code finds all news titles and links: find_all() locates all div elements with class 'result', and within each it looks for an h3 tag (class 't') and an a tag.
If both a title and a link are found, they are appended to the news_list list.
If the request fails, the function prints the status code and returns None.
The main program calls fetch_news_by_page() for the first three pages and prints the results. To avoid overly frequent network requests, the program pauses for two seconds after printing each page.