An example of getting started with Python crawlers

1. Install the requests module

pip install requests

2. Send a request and get a JSON string response

Here is an example of crawling an API endpoint, using a GET request; the endpoint requested below returns a JSON string.

import requests
import json

url = 'https://blog.csdn.net/community/home-api/v1/get-business-list'
headers = {
    # Identify as a normal browser so the server does not reject the request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
data = {
    # Query parameters for the endpoint
    "page": "1",
    "size": "20",
    "businessType": "lately",
    "noMore": "false",
    "username": "qq_33697094"
}
# Send a GET request (for a POST request, use requests.post)
result = requests.get(url, headers=headers, params=data)
# Use result.content.decode to get the JSON string (or HTML page) returned by the endpoint
responseStr = result.content.decode('utf-8')
# Convert the JSON string returned by the endpoint into a dictionary
dic = json.loads(responseStr)

titles = []
for item in dic["data"]["list"]:
    titles.append(item["title"])
print(titles)
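In practice it is worth verifying that the request succeeded before parsing the body, since a failed request may return an error page instead of JSON. A minimal sketch using the status handling built into requests (raise_for_status and the timeout parameter are standard parts of the requests API):

# Give up on a hanging connection after 10 seconds
result = requests.get(url, headers=headers, params=data, timeout=10)
# Raise an HTTPError for 4xx/5xx responses instead of silently parsing an error page
result.raise_for_status()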

If the endpoint returns a JSON string, you can also use result.json() directly to receive the returned data as a dictionary, as follows.

# Send the request
result = requests.get(url, headers=headers, params=data)
# Get the result as a dictionary (parsed from the JSON response)
dic = result.json()
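The comment in the first example mentions requests.post. For an endpoint that expects a POST body rather than query parameters, the call might look like the sketch below (whether a particular endpoint accepts POST, and in what format, depends on the server):

# Send the fields as a form-encoded body; use json=data instead for a JSON body
result = requests.post(url, headers=headers, data=data)
dic = result.json()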

3. Send a request, get the HTML page, and parse out the text

The example above sends a request that returns a JSON string. Sometimes, though, the data we want lives on an ordinary web page: the URL returns not JSON but an HTML page assembled from multiple requests and pieces of data. In that case, you can use the BeautifulSoup or lxml library to parse the HTML and extract the desired data.

BeautifulSoup and lxml are both libraries for parsing HTML. The lxml parser is faster and more powerful, and it can easily parse both HTML and XML, so lxml is recommended.

Install the lxml module

pip install lxml

Example of parsing data from the returned HTML:

import requests
from lxml import html

url = 'https://blog.csdn.net/qq_33697094?type=lately'
headers = {
    # Identify as a normal browser so the server does not reject the request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
# Send the request
result = requests.get(url, headers=headers)
# Parse the returned HTML page into an element tree
tree = html.fromstring(result.text)
# Get the text of the h4 tags under div tags whose class is 'blog-list-box-top'
titles = tree.xpath("//div[@class='blog-list-box-top']/h4/text()")
# Get the href attribute of the a tags under article tags whose class is 'blog-list-box'
urls = tree.xpath("//article[@class='blog-list-box']/a/@href")
print(titles)
print(urls)
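For comparison, here is a minimal BeautifulSoup version of the same extraction. This is only a sketch, assuming pip install beautifulsoup4; BeautifulSoup can use lxml as its underlying parser:

import requests
from bs4 import BeautifulSoup

result = requests.get(url, headers=headers)
# Build a soup object, using lxml as the underlying parser
soup = BeautifulSoup(result.text, "lxml")
# CSS selectors equivalent to the XPath expressions above
titles = [h4.get_text(strip=True) for h4 in soup.select("div.blog-list-box-top h4")]
urls = [a["href"] for a in soup.select("article.blog-list-box > a[href]")]
print(titles)
print(urls)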

The above uses XPath to locate HTML elements. For XPath syntax and usage, you can refer to the following articles (a short self-contained demo follows the list):

lxml library and XPath to extract web page data
The basic use of lxml library
XPath in Selenium
Selenium locates elements
XPath in Selenium: How to Find & Write
How to use XPath in Selenium
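As a quick self-contained illustration of the two XPath patterns used above, here is a small demo against a made-up HTML snippet (the markup below is invented purely to show the syntax):

from lxml import html

# Hypothetical HTML, invented only to demonstrate the XPath expressions
snippet = """
<article class="blog-list-box">
  <a href="https://example.com/post/1">
    <div class="blog-list-box-top"><h4>First post</h4></div>
  </a>
</article>
"""
tree = html.fromstring(snippet)
print(tree.xpath("//div[@class='blog-list-box-top']/h4/text()"))   # ['First post']
print(tree.xpath("//article[@class='blog-list-box']/a/@href"))     # ['https://example.com/post/1']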

Origin blog.csdn.net/qq_33697094/article/details/131380900