Foreword
This article walks through sample Python code that crawls NBA player data. It sends an HTTP request, parses the returned HTML page, extracts the ranking, name, team and score columns, and saves the results to a file.
Import required libraries and modules
import requests
from lxml import etree
- Use the `requests` library to send HTTP requests.
- Use the `lxml` library to parse HTML.
Set request header and request address
url = 'https://nba.hupu.com/stats/players'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}
- Set request header information, including User-Agent.
- Set the request address to 'https://nba.hupu.com/stats/players'.
Send HTTP request and get response
resp = requests.get(url, headers=headers)
- Use the `requests` library to send an HTTP GET request, passing in the request URL and headers.
- Save the returned response in the variable `resp`.
Handle the response result
e = etree.HTML(resp.text)
- Use the `etree.HTML` function to parse the response text into a navigable HTML element tree.
- Save the parsed result in the variable `e`.
Parse the data
nos = e.xpath('//table[@class="players_table"]//tr/td[1]/text()')
names = e.xpath('//table[@class="players_table"]//tr/td[2]/a/text()')
teams = e.xpath('//table[@class="players_table"]//tr/td[3]/a/text()')
scores = e.xpath('//table[@class="players_table"]//tr/td[4]/text()')
- Use XPath expressions to extract the required data from the HTML element tree.
- Save the ranking (nos), names (names), teams (teams) and scores (scores) in the corresponding variables.
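As an aside, XPath `text()` results often come back with surrounding whitespace. A minimal sketch of cleaning such strings (the raw values below are made-up samples, not live page data):

```python
# Hypothetical raw values imitating what text() nodes can look like
raw_scores = ['\n36.1 ', ' 33.9\n', '30.4']

# Strip surrounding whitespace from each extracted string
cleaned = [s.strip() for s in raw_scores]
print(cleaned)  # ['36.1', '33.9', '30.4']
```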
Save the result to a file
with open('nba.txt', 'w', encoding='utf-8') as f:
    for no, name, team, score in zip(nos, names, teams, scores):
        f.write(f'Rank:{no} Name:{name} Team:{team} Score:{score}\n')
- Open the file `nba.txt` in write mode ('w') with UTF-8 encoding.
- Use the `zip` function to iterate over the rankings, names, teams and scores in parallel, combining them into tuples.
- Write each row to the file in the specified format.
Full code
# Import the requests library, used to send HTTP requests
import requests
# Import the lxml library, used to parse HTML
from lxml import etree

# The URL to request
url = 'https://nba.hupu.com/stats/players'
# Request headers, including the user agent (User-Agent)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}

# Send an HTTP GET request with the URL and headers, and save the response in resp
resp = requests.get(url, headers=headers)
# Parse the response text into a navigable HTML element tree
e = etree.HTML(resp.text)

# Extract the required data from the element tree with XPath expressions
nos = e.xpath('//table[@class="players_table"]//tr/td[1]/text()')
names = e.xpath('//table[@class="players_table"]//tr/td[2]/a/text()')
teams = e.xpath('//table[@class="players_table"]//tr/td[3]/a/text()')
scores = e.xpath('//table[@class="players_table"]//tr/td[4]/text()')

# Open the file nba.txt in write mode ('w') with UTF-8 encoding
with open('nba.txt', 'w', encoding='utf-8') as f:
    # Iterate over the rankings, names, teams and scores in parallel with zip
    for no, name, team, score in zip(nos, names, teams, scores):
        # Write each row to the file in the specified format
        f.write(f'Rank:{no} Name:{name} Team:{team} Score:{score}\n')
Detailed analysis
# pip install requests
import requests
Import the `requests` library, which is used to send HTTP requests.
# pip install lxml
from lxml import etree
Import the `lxml` library, which is used to parse HTML.
# The URL to request
url = 'https://nba.hupu.com/stats/players'
Set the URL that the request will be sent to.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}
Set the request headers, including the User-Agent. This tells the server that the request comes from a browser rather than a script, which helps avoid being blocked by anti-crawler mechanisms.
# Send the request
resp = requests.get(url, headers=headers)
Use the `requests.get` method to send an HTTP GET request, passing in the URL and headers. Save the returned response in the variable `resp`.
e = etree.HTML(resp.text)
Use the `etree.HTML` function to parse the response text into a navigable HTML element tree. `etree.HTML` accepts a string argument, so `resp.text` is used here to get the text content of the response.
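`etree.HTML` can be tried on any HTML string, not only a live response. A minimal sketch with an inline fragment:

```python
from lxml import etree

# Parse a small HTML fragment into an element tree
tree = etree.HTML('<html><body><p>hello</p></body></html>')

# The root of the parsed tree is the <html> element
print(tree.tag)                  # html
print(tree.xpath('//p/text()'))  # ['hello']
```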
nos = e.xpath('//table[@class="players_table"]//tr/td[1]/text()')
names = e.xpath('//table[@class="players_table"]//tr/td[2]/a/text()')
teams = e.xpath('//table[@class="players_table"]//tr/td[3]/a/text()')
scores = e.xpath('//table[@class="players_table"]//tr/td[4]/text()')
Use XPath expressions to extract the required data from the HTML element tree. Four expressions extract the ranking, name, team and score columns, saving them in the variables `nos`, `names`, `teams` and `scores` respectively.
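The column-indexing idea behind `td[1]`, `td[2]` and so on can be seen on a tiny table. The markup below only mimics the page's structure, and the rows are made up:

```python
from lxml import etree

html = '''
<table class="players_table">
  <tr><td>1</td><td><a>Player A</a></td><td><a>Team X</a></td><td>36.1</td></tr>
  <tr><td>2</td><td><a>Player B</a></td><td><a>Team Y</a></td><td>33.9</td></tr>
</table>
'''
e = etree.HTML(html)

# td[1] is the first cell of each row: the ranking column
nos = e.xpath('//table[@class="players_table"]//tr/td[1]/text()')
# td[2]/a is the link text in the second cell: the player name
names = e.xpath('//table[@class="players_table"]//tr/td[2]/a/text()')
print(nos)    # ['1', '2']
print(names)  # ['Player A', 'Player B']
```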
with open('nba.txt', 'w', encoding='utf-8') as f:
    for no, name, team, score in zip(nos, names, teams, scores):
        f.write(f'Rank:{no} Name:{name} Team:{team} Score:{score}\n')
Open the file `nba.txt` in write mode ('w') with UTF-8 encoding. Then use the `zip` function to iterate over the rankings, names, teams and scores in parallel, combining them into tuples. Loop over the tuples and write each row to the file in the specified format.
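`zip` itself is easy to verify in isolation. Note that it stops at the shortest input, so if one column has fewer entries than the others, the extra rows are silently dropped:

```python
nos = ['1', '2', '3']
names = ['Player A', 'Player B']  # deliberately one entry short

# zip pairs items by position and stops at the shortest sequence
rows = list(zip(nos, names))
print(rows)  # [('1', 'Player A'), ('2', 'Player B')]
```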
With that, the code has crawled the NBA player data and saved the results to the file `nba.txt`.
Running result
Conclusion
Through the sample code in this article, you have seen how to crawl NBA player data with Python: the `requests` library sends the HTTP request, the `lxml` library parses the HTML, and XPath expressions extract the required fields, after which the results are saved to a file. This example illustrates the basic principles and steps of a web crawler while producing real NBA player data. I hope it helps you understand and get started with Python crawling.