1. Install the requests module
pip install requests
2. Send a request and get a JSON string response
An example of requesting an API. A GET request is used here; the endpoint requested returns a JSON string.
import requests
import json
url = 'https://blog.csdn.net/community/home-api/v1/get-business-list'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
data = {
"page": "1",
"size": "20",
"businessType": "lately",
"noMore": "false",
"username": "qq_33697094"
}
# Send a GET request (for a POST request, use requests.post)
result = requests.get(url, headers=headers, params=data)
# Use result.content.decode to get the JSON string or HTML page returned by the endpoint
responseStr = result.content.decode('utf-8')
# Convert the JSON string returned by the endpoint into a dictionary
dic = json.loads(responseStr)
titles = []
for item in dic["data"]["list"]:
titles.append(item["title"])
print(titles)
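The JSON-parsing step can be tried without touching the network. The sample string below is a hypothetical payload that only mimics the shape of the real response (the actual API returns many more fields):

```python
import json

# Hypothetical payload mimicking the shape of the get-business-list response
responseStr = """
{
  "data": {
    "list": [
      {"title": "First post", "url": "https://example.com/1"},
      {"title": "Second post", "url": "https://example.com/2"}
    ]
  }
}
"""

# json.loads turns the JSON string into nested dicts and lists
dic = json.loads(responseStr)

# Collect the title of each item, exactly as in the loop above
titles = [item["title"] for item in dic["data"]["list"]]
print(titles)  # -> ['First post', 'Second post']
```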
If the endpoint returns a JSON string, you can also use result.json() directly to get the returned data as a dictionary, as follows.
# Send the request
result = requests.get(url, headers=headers, params=data)
# Get the result as a dictionary (a JSON object)
dic = result.json()
3. Send a request, get the HTML page, and parse out the text
The example above sends a request that returns a JSON string. Sometimes we want the data on a page behind a URL that does not return a JSON string, but rather a web page assembled from multiple requests and resources. In that case you can use the BeautifulSoup or lxml library to parse the HTML and extract the desired data.
Both BeautifulSoup and lxml are libraries for parsing HTML. The lxml parser is more powerful and faster, and parses both HTML and XML with ease, so it is the recommended choice.
Install the lxml module
pip install lxml
Example of parsing data from the returned HTML
import requests
from lxml import html
url = 'https://blog.csdn.net/qq_33697094?type=lately'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
# Send the request
result = requests.get(url, headers=headers)
# Parse the HTML page returned by the endpoint into an element tree
tree = html.fromstring(result.text)
# Get the text of the h4 tags under div tags with class 'blog-list-box-top'
titles = tree.xpath("//div[@class='blog-list-box-top']/h4/text()")
# Get the href attribute of the a tags under article tags with class 'blog-list-box'
urls = tree.xpath("//article[@class='blog-list-box']/a/@href")
print(titles)
print(urls)
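To see how these XPath expressions behave, here is a self-contained sketch that runs the same two expressions against a hand-written HTML fragment. The markup is an assumption that only mirrors the class names used above, not the real page source:

```python
from lxml import html

# Hypothetical markup mirroring the class names used in the example above
sample = """
<html><body>
<article class="blog-list-box">
  <div class="blog-list-box-top"><h4>Post one</h4></div>
  <a href="https://blog.csdn.net/example/article/1">read</a>
</article>
<article class="blog-list-box">
  <div class="blog-list-box-top"><h4>Post two</h4></div>
  <a href="https://blog.csdn.net/example/article/2">read</a>
</article>
</body></html>
"""

tree = html.fromstring(sample)
# Same expressions as above: h4 text under the titled div, href of each a tag
titles = tree.xpath("//div[@class='blog-list-box-top']/h4/text()")
urls = tree.xpath("//article[@class='blog-list-box']/a/@href")
print(titles)  # -> ['Post one', 'Post two']
print(urls)
```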
The above uses XPath to locate HTML elements. For XPath syntax and usage, see the following articles:
lxml library and XPath to extract web page data
The basic use of lxml library
XPath in Selenium
Selenium locates elements
XPath in Selenium: How to Find & Write
How to use XPath in Selenium