Python crawler: getting NBA player scoring data from Hupu

1, Introduction

  My crawler series has now reached regular expressions, so we can combine the requests library with regular expressions to scrape some simple pages. Because I like playing basketball, I chose Hupu (虎扑) as the scraping target. This is only an entry-level example; later posts will use a parsing library to scrape Hupu data.

2, Page analysis

  To scrape data from a site, you first have to analyze the source of its pages. In this case, open Hupu in a browser and navigate to the NBA scoring leaderboard.
  At the time of writing, the scoring leaderboard spans five pages and lists 237 players in total.
  Clicking through to the second and third pages shows that their URLs are https://nba.hupu.com/stats/players/pts/2 and https://nba.hupu.com/stats/players/pts/3. Only the page number at the end changes, so each URL can be built by string concatenation. Pressing F12 to view the source shows that the player data sits in a table with the class players_table, with one player per tr. The first tr is the header row and does not need to be scraped, so take care to exclude it.
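
The string concatenation described above can be sketched as follows. The names BASE_URL and build_page_urls are hypothetical helpers introduced only for illustration; the page range 1 to 5 follows the five leaderboard pages mentioned earlier.

```python
# Hypothetical helper: builds the leaderboard URL for each page number.
BASE_URL = "https://nba.hupu.com/stats/players/pts/"


def build_page_urls(first_page=1, last_page=5):
    """Return one URL per page number in the inclusive range."""
    return [BASE_URL + str(page) for page in range(first_page, last_page + 1)]


if __name__ == "__main__":
    for url in build_page_urls():
        print(url)
```

Keeping the page range as parameters means the same helper still works if the leaderboard grows or shrinks.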

3, Coding

First, define a function that fetches the HTML:

def getHtml(pageNum):
    '''
    Fetch the HTML of the given page of the scoring leaderboard.
    :param pageNum: page number
    :return: HTML text, or None if the status code is not 200
    '''
    url = "https://nba.hupu.com/stats/players/pts/" + str(pageNum)
    response = requests.get(url=url)
    if response.status_code == 200:
        return response.text
    else:
        return None

Second, define the regular expression and use findall() from the re library to collect everything that matches it:

def getData():
    pointList = []  # list that collects the extracted rows
    # Regular expression capturing the player name, team and scoring data
    regExp = '<tr>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?"bg_b">(.*?)<.*?</tr>'
    for i in range(1,6):
        html = getHtml(i)
        if html is None:  # getHtml returns None on failure; skip that page
            continue
        results = re.findall(regExp,html,re.S)
        for result in results:
            pointList.append(result)
    return pointList

The result is a list of tuples, one per player, each holding the player name, team and scoring figure.
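
To see what the regular expression captures without hitting the network, it can be run against a small hand-written fragment. The sketch below uses the same pattern as getData(), but SAMPLE_HTML is made-up markup that only imitates the structure described earlier (a players_table with one tr per player and a "bg_b" cell); the real page may differ, and extract_rows is a hypothetical helper name.

```python
import re

# Same pattern as in getData(): player name, team, and the "bg_b" cell.
REG_EXP = '<tr>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?"bg_b">(.*?)<.*?</tr>'

# Assumed, hand-written markup; not copied from the real site.
SAMPLE_HTML = """
<table class="players_table">
  <tr class="color_font1 bg_a">
    <td>Rank</td><td>Player</td><td>Team</td><td>Points</td>
  </tr>
  <tr>
    <td>1</td>
    <td><a href="/players/a">Player A</a></td>
    <td><a href="/teams/x">Team X</a></td>
    <td class="bg_b">30.1</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href="/players/b">Player B</a></td>
    <td><a href="/teams/y">Team Y</a></td>
    <td class="bg_b">28.4</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return a list of (player, team, points) string tuples."""
    # re.S lets '.' match newlines, so one match can span a multi-line row.
    return re.findall(REG_EXP, html, re.S)


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

In this sample the header row carries a class attribute, so the literal `<tr>` in the pattern never matches it and only the two player rows are returned. All captured values are strings, so the points would still need converting with float() before any arithmetic.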

4, Complete code

import re
import requests

def getHtml(pageNum):
    '''
    Fetch the HTML of the given page of the scoring leaderboard.
    :param pageNum: page number
    :return: HTML text, or None if the status code is not 200
    '''
    url = "https://nba.hupu.com/stats/players/pts/" + str(pageNum)
    response = requests.get(url=url)
    if response.status_code == 200:
        return response.text
    else:
        return None

def getData():
    pointList = []  # list that collects the extracted rows
    # Regular expression capturing the player name, team and scoring data
    regExp = '<tr>.*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>.*?"bg_b">(.*?)<.*?</tr>'
    for i in range(1,6):
        html = getHtml(i)
        if html is None:  # getHtml returns None on failure; skip that page
            continue
        results = re.findall(regExp,html,re.S)
        for result in results:
            pointList.append(result)
    return pointList

if __name__ == '__main__':
    pointList = getData()
    for row in pointList:  # 'row' avoids shadowing the built-in list
        print(row)

Please point out any mistakes! If you found this post helpful, a like would be appreciated. Feel free to discuss in the comments or by private message!

Origin blog.csdn.net/Orange_minger/article/details/104793890