1 Introduction
My crawler series has now reached regular expressions, so we can combine the requests library with regular expressions to crawl some simple pages. Because I like playing basketball, I chose Hupu (nba.hupu.com) as the crawl target. This is only an entry-level example; later I will crawl Hupu's data again using a parsing library.
2 Page analysis
To crawl data from a site, you first have to analyze the source code of its pages. In this case, open Hupu in a browser and go to the NBA scoring ranking; the page in question is Hupu's NBA scoring ranking.
At the time of writing, a guard is first in scoring, and the ranking spans five pages with 237 players in total.
Clicking through to the second and third pages shows that their URLs are https://nba.hupu.com/stats/players/pts/2 and https://nba.hupu.com/stats/players/pts/3. Only the page number at the end changes, so each URL can be built by string concatenation. Pressing F12 to view the source then shows that the player data sits in a table whose class is players_table, with each player's data in its own tr. The first row is the table header and does not need to be crawled, so take care to exclude it.
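The two observations above can be sketched in a few lines: building the page URLs by concatenation, and dropping the header row once the rows have been split out. The HTML fragment here is made up for illustration and is much simpler than the real page; the names page_url, SAMPLE and data_rows are hypothetical and not part of the article's code.

```python
import re

BASE_URL = "https://nba.hupu.com/stats/players/pts/"


def page_url(page_num):
    # Only the trailing page number changes between pages.
    return BASE_URL + str(page_num)


# The ranking has five pages, numbered 1 to 5.
urls = [page_url(n) for n in range(1, 6)]

# Hand-written stand-in for the players_table markup (assumed structure).
SAMPLE = '''<table class="players_table">
<tr><td>Rank</td><td>Player</td></tr>
<tr><td>1</td><td>Player A</td></tr>
<tr><td>2</td><td>Player B</td></tr>
</table>'''

# Grab the contents of every tr, then skip the first one (the header row).
rows = re.findall(r'<tr.*?>(.*?)</tr>', SAMPLE, re.S)
data_rows = rows[1:]

for url in urls:
    print(url)
for row in data_rows:
    print(row)
```

Slicing off the first row is only one way to exclude the header; a pattern that never matches the header row achieves the same thing.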
3 Coding
First, define a function that fetches the HTML of a page:
import requests

def getHtml(pageNum):
    '''
    Fetch the HTML of the scoring page with the given page number.
    :param pageNum: page number
    :return: HTML content, or None if the status code is not 200
    '''
    url = "https://nba.hupu.com/stats/players/pts/" + str(pageNum)
    response = requests.get(url=url)
    if response.status_code == 200:
        return response.text
    else:
        return None
Second, define the regular expression and use findall() from the re library to get everything that matches it:
import re

def getData():
    pointList = []  # list that stores the extracted data
    # Regular expression that captures the player name, team and points
    regExp = '<tr>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?"bg_b">(.*?)<.*?</tr>'
    for i in range(1, 6):
        html = getHtml(i)
        if html is None:  # skip pages that could not be fetched
            continue
        results = re.findall(regExp, html, re.S)
        for result in results:
            pointList.append(result)
    return pointList
The output is a list in which each item is a tuple of three strings: player name, team and points.
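To see what the pattern captures without touching the network, the sketch below runs the same regular expression over a small hand-written HTML fragment. The fragment, the player and team names, and the figures are made up for illustration and only approximate the real page's markup. In particular, it assumes the header row carries a class attribute, so the literal <tr> at the start of the pattern never matches it.

```python
import re

# Same pattern as in getData(): player name, team, points.
REG_EXP = r'<tr>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?"bg_b">(.*?)<.*?</tr>'

# Hypothetical stand-in for one page of the players_table markup.
SAMPLE_HTML = '''
<table class="players_table">
<tr class="color_font1 bg_a">
<td>Rank</td><td>Player</td><td>Team</td><td>Points</td>
</tr>
<tr>
<td>1</td>
<td><a href="/players/a">Player A</a></td>
<td><a href="/teams/x">Team X</a></td>
<td class="bg_b">30.1</td>
</tr>
<tr>
<td>2</td>
<td><a href="/players/b">Player B</a></td>
<td><a href="/teams/y">Team Y</a></td>
<td class="bg_b">28.4</td>
</tr>
</table>
'''

# re.S lets "." match newlines, so one match can span a multi-line row.
rows = re.findall(REG_EXP, SAMPLE_HTML, re.S)

for name, team, points in rows:
    print(name, team, points)
```

Because the pattern has three capture groups, findall() returns one three-item tuple per matched row, and the captured values are strings, so the points would still need converting with float() before any arithmetic.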
4 Complete code
import re
import requests

def getHtml(pageNum):
    '''
    Fetch the HTML of the scoring page with the given page number.
    :param pageNum: page number
    :return: HTML content, or None if the status code is not 200
    '''
    url = "https://nba.hupu.com/stats/players/pts/" + str(pageNum)
    response = requests.get(url=url)
    if response.status_code == 200:
        return response.text
    else:
        return None

def getData():
    pointList = []  # list that stores the extracted data
    # Regular expression that captures the player name, team and points
    regExp = '<tr>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?"bg_b">(.*?)<.*?</tr>'
    for i in range(1, 6):
        html = getHtml(i)
        if html is None:  # skip pages that could not be fetched
            continue
        results = re.findall(regExp, html, re.S)
        for result in results:
            pointList.append(result)
    return pointList

if __name__ == '__main__':
    pointList = getData()
    for point in pointList:  # avoid shadowing the built-in name "list"
        print(point)
Please point out any mistakes! If you found this helpful, a like would be appreciated. Feel free to discuss in the comments or by private message.