Visualizing a football player's shot data with Python and matplotlib (drawing a scatter plot)

A shot-data visualization is essentially a scatter plot in which the size of each point varies with the expected goals value (xG, the predicted probability of scoring), which makes the plot more intuitive and easier to read.

1. The https://understat.com league data site

The shot data comes from https://understat.com. Open the home page and search for "Mbappe" (see Figure 1).

Figure 1 Searching from the https://understat.com home page

Open Kylian Mbappé's page. His player_id is 3423, so the page URL is https://understat.com/player/3423. The site provides league data from the 2014/2015 season to the present. The pages to crawl have the form https://understat.com/player/{player_id}, where Ronaldo's player_id is 2371, Messi's is 2097, Neymar's is 2099, and Mbappé's is 3423. Each shot record includes the shot position (X, Y), expected goals (xG, the goal probability), shot result (result), shot type (shotType), and season (season).
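The URL pattern above can be sketched as a small helper. The player ids are the ones quoted in the text; the names `PLAYER_IDS` and `player_url` are illustrative and not part of the original code.

```python
# Player ids quoted in the text; the dictionary and helper names are illustrative.
PLAYER_IDS = {
    "Cristiano Ronaldo": 2371,
    "Lionel Messi": 2097,
    "Neymar": 2099,
    "Kylian Mbappé": 3423,
}


def player_url(player_id: int) -> str:
    """Return the understat.com page URL for a given player_id."""
    return f"https://understat.com/player/{player_id}"


for name, pid in PLAYER_IDS.items():
    print(f"{name}: {player_url(pid)}")
```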

The shot result (result) takes one of five values: 1) Goal; 2) ShotOnPost (the shot hit the post); 3) SavedShot (saved by the goalkeeper); 4) BlockedShot (blocked by a player); 5) MissedShots (the shot went wide).

The shot type (shotType) is one of: header, left foot, right foot, or another part of the body.

Mbappé's data starts from the 2015/2016 season, and the seasons listed run up to 2022/2023 (see Figure 2).

Figure 2 Kylian Mbappé page

2. Web page analysis

Right-click the page and view its source. The <script>...</script> tags contain several very long string variables.

The fourth <script> tag holds the shot data (see Figure 3).

Figure 3 Page code (partial)

The part to fetch is the content inside the quotes in the following structure:

<script>
    var shotsData = JSON.parse('...')
</script>

That content is JSON data. Note that the JSON here is a string: although it looks like a dictionary, to Python it is a str, not a dict. It can be converted with the json module.
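A minimal sketch of pulling out the quoted argument, assuming the script text has the shape shown above. The `script_text` value is a shortened, hypothetical stand-in for the real tag content, and the helper name is illustrative.

```python
import re

# Hypothetical, shortened stand-in for the text of the shot-data <script> tag.
script_text = r"""
    var shotsData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x2232535\x22\x7D\x5D')
"""


def extract_json_parse_argument(text: str) -> str:
    """Return the text between JSON.parse(' and the closing ')."""
    match = re.search(r"JSON\.parse\('(.*?)'\)", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON.parse('...') call found")
    return match.group(1)


payload = extract_json_parse_argument(script_text)
print(payload)
```

Searching for the `JSON.parse('...')` pattern instead of fixed string offsets keeps the extraction independent of the variable name and of surrounding whitespace.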

json.loads() ==> converts a JSON string to a dictionary or a list of dictionaries

json.dumps() ==> converts a dictionary or a list of dictionaries to a JSON string
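The two conversions can be tried on a small JSON string. The field values below are taken from the sample record shown later in this article; only the trimmed selection of fields is an assumption.

```python
import json

# A JSON object written as a Python string (fields trimmed from the sample data).
json_text = '{"player": "Cristiano Ronaldo", "result": "SavedShot", "player_assisted": null}'

record = json.loads(json_text)      # JSON string -> Python dict
print(type(json_text).__name__)     # the input is a str
print(type(record).__name__)        # the result is a dict
print(record["player_assisted"])    # JSON null becomes Python None

round_trip = json.dumps(record)     # Python dict -> JSON string
print(round_trip)
```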

JSON has two structures: object and array.

An object starts with "{" and ends with "}". In between are key/value pairs separated by ",", as follows:

{
    key1: value1,
    key2: value2,
    ...
}

Here each key must be a string, and each value can be a string, a number, a Boolean, an object, an array, or null.

An array starts with "[" and ends with "]". Its elements are separated by ","; here each element is an object:

[
    {
        key1: value1,
        key2: value2
    },
    {
        key3: value3,
        key4: value4
    }
]

In Python this corresponds to a list of dictionaries (two-dimensional data).
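As a small illustration of the array structure, the sketch below parses a two-element JSON array and filters it like any Python list. The values come from the sample records in the next section, trimmed to three fields.

```python
import json

# A JSON array of two objects (fields trimmed from the sample data).
json_array = (
    '[{"result": "SavedShot", "situation": "SetPiece", "season": "2014"},'
    ' {"result": "SavedShot", "situation": "Penalty", "season": "2020"}]'
)

shots = json.loads(json_array)          # JSON array -> Python list of dicts
print(type(shots).__name__, len(shots))

# Each element is a dict, so ordinary list and dict operations apply.
penalties = [shot for shot in shots if shot["situation"] == "Penalty"]
print(penalties)
```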

3. Data extraction and decoding

The page crawled here uses the JSON array structure, which converts to a Python list whose elements are dictionaries.

Take the first and last records of the variable (Ronaldo's data) and list them below for a preliminary analysis. The data is a string in which most characters are written as \xNN hexadecimal escapes, each standing for one ASCII character (decimal value from 32 to 127), mixed with plain characters. It must first be converted into Python bytes, then decoded into a JSON string, and finally converted into a list of Python dictionaries with json.loads().

>>> a = r'\x5B\x7B\x22id\x22\x3A\x2232535\x22,\x22minute\x22\x3A\x2218\x22,\x22result\x22\x3A\x22SavedShot\x22,\x22X\x22\x3A\x220.845\x22,\x22Y\x22\x3A\x220.49900001525878906\x22,\x22xG\x22\x3A\x220.06659495085477829\x22,\x22player\x22\x3A\x22Cristiano\x20Ronaldo\x22,\x22h_a\x22\x3A\x22h\x22,\x22player_id\x22\x3A\x222371\x22,\x22situation\x22\x3A\x22SetPiece\x22,\x22season\x22\x3A\x222014\x22,\x22shotType\x22\x3A\x22RightFoot\x22,\x22match_id\x22\x3A\x225834\x22,\x22h_team\x22\x3A\x22Real\x20Madrid\x22,\x22a_team\x22\x3A\x22Cordoba\x22,\x22h_goals\x22\x3A\x222\x22,\x22a_goals\x22\x3A\x220\x22,\x22date\x22\x3A\x222014\x2D08\x2D25\x2019\x3A00\x3A00\x22,\x22player_assisted\x22\x3A\x22Luka\x20Modric\x22,\x22lastAction\x22\x3A\x22Pass\x22\x7D,\x7B\x22id\x22\x3A\x22422004\x22,\x22minute\x22\x3A\x2223\x22,\x22result\x22\x3A\x22SavedShot\x22,\x22X\x22\x3A\x220.885\x22,\x22Y\x22\x3A\x220.5\x22,\x22xG\x22\x3A\x220.7612988352775574\x22,\x22player\x22\x3A\x22Cristiano\x20Ronaldo\x22,\x22h_a\x22\x3A\x22h\x22,\x22player_id\x22\x3A\x222371\x22,\x22situation\x22\x3A\x22Penalty\x22,\x22season\x22\x3A\x222020\x22,\x22shotType\x22\x3A\x22RightFoot\x22,\x22match_id\x22\x3A\x2215790\x22,\x22h_team\x22\x3A\x22Juventus\x22,\x22a_team\x22\x3A\x22Inter\x22,\x22h_goals\x22\x3A\x223\x22,\x22a_goals\x22\x3A\x222\x22,\x22date\x22\x3A\x222021\x2D05\x2D15\x2016\x3A00\x3A00\x22,\x22player_assisted\x22\x3Anull,\x22lastAction\x22\x3A\x22Standard\x22\x7D\x5D'

>>> b = eval("b'" + a + "'") # Put the string into b'...' and convert it to a byte stream with eval()

>>> b

b'[{"id":"32535","minute":"18","result":"SavedShot","X":"0.845","Y":"0.49900001525878906","xG":"0.06659495085477829","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"SetPiece","season":"2014","shotType":"RightFoot","match_id":"5834","h_team":"Real Madrid","a_team":"Cordoba","h_goals":"2","a_goals":"0","date":"2014-08-25 19:00:00","player_assisted":"Luka Modric","lastAction":"Pass"},{"id":"422004","minute":"23","result":"SavedShot","X":"0.885","Y":"0.5","xG":"0.7612988352775574","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"Penalty","season":"2020","shotType":"RightFoot","match_id":"15790","h_team":"Juventus","a_team":"Inter","h_goals":"3","a_goals":"2","date":"2021-05-15 16:00:00","player_assisted":null,"lastAction":"Standard"}]'

>>> type(b) # The test result is a byte stream

<class 'bytes'>

>>> b.decode() # decode() decodes to a string, because it is ASCII code and all encodings are compatible

'[{"id":"32535","minute":"18","result":"SavedShot","X":"0.845","Y":"0.49900001525878906","xG":"0.06659495085477829","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"SetPiece","season":"2014","shotType":"RightFoot","match_id":"5834","h_team":"Real Madrid","a_team":"Cordoba","h_goals":"2","a_goals":"0","date":"2014-08-25 19:00:00","player_assisted":"Luka Modric","lastAction":"Pass"},{"id":"422004","minute":"23","result":"SavedShot","X":"0.885","Y":"0.5","xG":"0.7612988352775574","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"Penalty","season":"2020","shotType":"RightFoot","match_id":"15790","h_team":"Juventus","a_team":"Inter","h_goals":"3","a_goals":"2","date":"2021-05-15 16:00:00","player_assisted":null,"lastAction":"Standard"}]'

The important fields are the shot position (X, Y), expected goals (xG), shot result (result), and season (season). xG is the predicted probability of scoring, so xG=1 means a certain goal. X and Y are relative values between 0 and 1, while the pitch drawn later uses coordinates from 0 to 100, so they must be multiplied by 100. result=Goal means the shot was scored, and season=2014 means the 2014/2015 season.
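The field handling described above can be sketched with plain Python. The two records are the sample records from this section, trimmed to the relevant fields; the helper name `to_plot_record` and the output keys are illustrative.

```python
# Two records trimmed to the fields discussed above (values from the sample data).
records = [
    {"X": "0.845", "Y": "0.49900001525878906", "xG": "0.06659495085477829",
     "result": "SavedShot", "season": "2014"},
    {"X": "0.885", "Y": "0.5", "xG": "0.7612988352775574",
     "result": "SavedShot", "season": "2020"},
]


def to_plot_record(rec: dict) -> dict:
    """Convert string fields to numbers and scale X, Y from 0~1 to 0~100."""
    season = int(rec["season"])
    return {
        "X": float(rec["X"]) * 100,
        "Y": float(rec["Y"]) * 100,
        "xG": float(rec["xG"]),
        "is_goal": rec["result"] == "Goal",
        "season_label": f"{season}/{season + 1}",
    }


plot_records = [to_plot_record(rec) for rec in records]
for rec in plot_records:
    print(rec)
```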

>>> import json # import json module

>>> json.loads(b.decode()) # Convert JSON data to a list of dictionaries

[{'id': '32535', 'minute': '18', 'result': 'SavedShot', 'X': '0.845', 'Y': '0.49900001525878906', 'xG': '0.06659495085477829', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'SetPiece', 'season': '2014', 'shotType': 'RightFoot', 'match_id': '5834', 'h_team': 'Real Madrid', 'a_team': 'Cordoba', 'h_goals': '2', 'a_goals': '0', 'date': '2014-08-25 19:00:00', 'player_assisted': 'Luka Modric', 'lastAction': 'Pass'}, {'id': '422004', 'minute': '23', 'result': 'SavedShot', 'X': '0.885', 'Y': '0.5', 'xG': '0.7612988352775574', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'Penalty', 'season': '2020', 'shotType': 'RightFoot', 'match_id': '15790', 'h_team': 'Juventus', 'a_team': 'Inter', 'h_goals': '3', 'a_goals': '2', 'date': '2021-05-15 16:00:00', 'player_assisted': None, 'lastAction': 'Standard'}]

>>> json.loads(b) # In fact, it can be converted into a list of dictionaries without decoding

[{'id': '32535', 'minute': '18', 'result': 'SavedShot', 'X': '0.845', 'Y': '0.49900001525878906', 'xG': '0.06659495085477829', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'SetPiece', 'season': '2014', 'shotType': 'RightFoot', 'match_id': '5834', 'h_team': 'Real Madrid', 'a_team': 'Cordoba', 'h_goals': '2', 'a_goals': '0', 'date': '2014-08-25 19:00:00', 'player_assisted': 'Luka Modric', 'lastAction': 'Pass'}, {'id': '422004', 'minute': '23', 'result': 'SavedShot', 'X': '0.885', 'Y': '0.5', 'xG': '0.7612988352775574', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'Penalty', 'season': '2020', 'shotType': 'RightFoot', 'match_id': '15790', 'h_team': 'Juventus', 'a_team': 'Inter', 'h_goals': '3', 'a_goals': '2', 'date': '2021-05-15 16:00:00', 'player_assisted': None, 'lastAction': 'Standard'}]

>>> type(json.loads(b)) # The result is a list

<class 'list'>

With the analysis and background above, we can start crawling: fetch the page with the get() method of the requests module, then extract the contents of the <script>...</script> tags with the find_all() method of the BeautifulSoup class from the BeautifulSoup4 module.
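Because the string passed to eval() comes from a web page, a more cautious variant of the byte-conversion step is worth considering. The sketch below swaps eval() for ast.literal_eval(), which only accepts Python literals and does not execute code. It assumes, as in the sample above, that the text contains no unescaped single quote; the helper name is illustrative.

```python
import ast
import json

# Shortened sample in the same \xNN-escaped form as the data above.
escaped = r'\x5B\x7B\x22id\x22\x3A\x2232535\x22,\x22result\x22\x3A\x22SavedShot\x22\x7D\x5D'


def decode_hex_escaped_json(text: str):
    """Turn \\xNN-escaped text into bytes, then parse the bytes as JSON."""
    raw = ast.literal_eval("b'" + text + "'")   # bytes literal only, no code execution
    return json.loads(raw)                       # json.loads() accepts bytes directly


shots = decode_hex_escaped_json(escaped)
print(shots)   # [{'id': '32535', 'result': 'SavedShot'}]
```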

4. Drawing scatter plots in matplotlib - scatter() method

The scatter() function in the pyplot module is used to draw a scatter plot, and its syntax is as follows:

matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None,
       norm=None, vmin=None, vmax=None, alpha=None, linewidths=None,
       edgecolors=None, data=None, **kwargs)

The commonly used parameters have the following meanings:

x, y: The data for the x-axis and y-axis.

s: The size of the points. If a one-dimensional array is passed in, it gives the size of each point.

c: The color of the points. If a one-dimensional array is passed in, it gives the color of each point.

marker: The marker shape of the points, see Table 1.

alpha: The transparency of the points, a decimal between 0 and 1. With a large amount of data, a small alpha combined with a suitable s makes overlapping points appear darker, so clusters in the data stand out.

cmap: The colormap used to map numeric c values to colors.

Table 1 marker settings and corresponding symbols and descriptions
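A minimal sketch of scatter() with per-point sizes, using the same sqrt(xG) * 100 scaling as the complete code below. The five shots are made-up sample values, not real understat data, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample shots, purely illustrative (not real understat data).
x = np.array([88.5, 92.0, 79.3, 95.1, 84.0])
y = np.array([50.0, 44.2, 61.8, 52.5, 38.7])
xg = np.array([0.76, 0.31, 0.04, 0.55, 0.09])
is_goal = np.array([True, False, False, True, False])

sizes = np.sqrt(xg) * 100  # marker area grows with the expected goals value

fig, ax = plt.subplots(figsize=(5, 4))
# Shots that did not score: cyan, more transparent.
misses = ax.scatter(x[~is_goal], y[~is_goal], s=sizes[~is_goal], c="cyan",
                    marker="o", alpha=0.5, edgecolors="black", label="No goal")
# Goals: crimson, less transparent.
goals = ax.scatter(x[is_goal], y[is_goal], s=sizes[is_goal], c="crimson",
                   marker="o", alpha=0.7, edgecolors="black", label="Goal")
ax.legend(loc="lower right")
fig.canvas.draw()  # render once; call plt.show() instead in an interactive session
```

Taking the square root before scaling compresses the range of marker sizes, so high-xG shots stay readable without hiding the low-xG ones.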

5. Complete code

The complete code is as follows:

##############################################################
# Design: Zhang Ruilin        Created 2021-01-10 18:35       #
#                             Revised 2022-12-28 10:13       #
# Plot the distribution of a football player's shots         #
# with Matplotlib                                            #
##############################################################
import ast                          # safe evaluation of Python literals
import json                         # JSON <-> Python conversion
import requests                     # fetch the web page
from bs4 import BeautifulSoup       # parse the page and extract information
import pandas as pd                 # tabular data handling
import matplotlib as mpl
import matplotlib.pyplot as plt     # MATLAB-like plotting interface
import numpy as np                  # numerical functions
import mplsoccer                    # draws the football pitch

# Kylian Mbappé's player_id is 3423
url = 'https://understat.com/player/3423'
html = requests.get(url)                            # fetch the page
# Parse the page
soup_parse = BeautifulSoup(html.content, 'lxml')
scripts = soup_parse.find_all('script')             # list of all <script> tags
strings = scripts[3].string                         # text of the script holding shotsData
_start = strings.index("('") + 2                    # first character after JSON.parse('
_end = strings.index("')")                          # closing quote before ), excluded
json_data = strings[_start:_end]                    # JSON data between the quotes
# Turn the \xYY escapes into bytes; literal_eval is a safe stand-in for eval()
json_data = ast.literal_eval("b'" + json_data + "'")
data = json.loads(json_data)                        # list of dictionaries
# Collect shot position (X, Y), expected goals (xG), result and season
x, y, xg, result, season = [], [], [], [], []
for _dic in data:
    x.append(_dic['X'])
    y.append(_dic['Y'])
    xg.append(_dic['xG'])
    result.append(_dic['result'])
    season.append(_dic['season'])
df_data = pd.DataFrame({'X': x, 'Y': y, 'xG': xg, 'Result': result, 'Season': season})
numeric_cols = ['X', 'Y', 'xG', 'Season']
df_data[numeric_cols] = df_data[numeric_cols].apply(pd.to_numeric)  # numeric strings -> numbers
df_data['X'] = df_data['X'] * 100                   # source values are relative (0~1);
df_data['Y'] = df_data['Y'] * 100                   # the 'opta' pitch uses 0~100
# df_data.to_csv(r'd:/Mbappé_shooting.csv')         # optionally save to a file
background, text_color = 'lightgray', 'black'       # background and text colors
mpl.rcParams['text.color'] = text_color
# mpl.rcParams['font.sans-serif'] = ['simsun']      # SimSun, only needed for Chinese labels
mpl.rcParams['legend.fontsize'] = 15                # legend font size: 15 pt
fig, ax = plt.subplots(figsize=(7, 5.6))            # new 7 x 5.6 inch figure
ax.axis('off')                                      # hide the default axes
fig.set_facecolor(background)
pitch = mplsoccer.VerticalPitch(half=True, pitch_type='opta', line_zorder=3,
                                pitch_color='grass')   # vertical half pitch
axes = fig.add_axes((0.05, 0.06, 0.9, 0.9))         # lower-left corner (0.05, 0.06),
axes.patch.set_facecolor(background)                # width and height 90% each
pitch.draw(ax=axes)
season = 2021                                       # season to plot: 2014 to (current year - 1)
df = df_data.loc[df_data['Season'] == season]       # shots of the chosen season
goals = df[df['Result'] == 'Goal']
misses = df[df['Result'] != 'Goal']
# Shots that did not score: cyan, alpha 0.5
pitch.scatter(misses['X'], misses['Y'], s=np.sqrt(misses['xG']) * 100,
              marker='o', alpha=0.5, edgecolor='black', facecolor='cyan',
              ax=axes, label='No goal')
# Goals: crimson, alpha 0.7
pitch.scatter(goals['X'], goals['Y'], s=np.sqrt(goals['xG']) * 100,
              marker='o', alpha=0.7, edgecolor='black', facecolor='crimson',
              ax=axes, label='Goal')
axes.legend(loc='lower right')
# Text annotations
axes.text(25, 64, f"xG total: {df['xG'].sum():.2f}", weight='bold', size=14)  # sum of xG
axes.text(25, 61, f"Goals: {len(goals)}", weight='bold', size=14)             # rows with Result == 'Goal'
axes.text(25, 58, f"Shots: {len(df)}", weight='bold', size=14)                # rows in this season
axes.text(95, 60, f'{season}-{season+1} season', weight='bold', size=18)

plt.show()

The execution result is shown in Figure 4.

Figure 4 Kylian Mbappé shot position distribution map
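The three figures printed on the plot (total xG, goals, shots) can also be computed without pandas. The sketch below uses made-up records, not real understat data, and the helper name `season_summary` is illustrative.

```python
# Made-up shot records in the same string-valued form as the crawled data.
shots = [
    {"season": "2021", "result": "Goal", "xG": "0.5"},
    {"season": "2021", "result": "SavedShot", "xG": "0.25"},
    {"season": "2020", "result": "Goal", "xG": "0.75"},
]


def season_summary(records, season: int) -> dict:
    """Count shots and goals and sum xG for one season."""
    selected = [rec for rec in records if int(rec["season"]) == season]
    return {
        "shots": len(selected),
        "goals": sum(1 for rec in selected if rec["result"] == "Goal"),
        "xG": round(sum(float(rec["xG"]) for rec in selected), 2),
    }


summary = season_summary(shots, 2021)
print(summary)
```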

Origin blog.csdn.net/hz_zhangrl/article/details/128490494