Hands-On Practice (9): Analyzing Kobe Bryant's Career Data with Python

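I am a learner who is passionate about mining the value hidden in data. In my everyday work and study I try to dig out that value and uncover what the data has to tell. I believe the value of data is not limited to companies: individuals can experience its appeal too. Let us use technology to decode behavior and let big data give everyone a running start. Fellow learners are welcome to follow my public account so we can discuss the interesting things found in data together.

My public account is: livandata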

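This article uses an analysis of Kobe Bryant's shot data to walk through the machine learning modeling workflow, with some feature analysis added along the way. Feature engineering is the most important part of machine learning, and it largely determines the quality of the model. When building a model, first work through feature engineering, then choose the model and identify the parameters that need tuning, and finally tune those parameters one at a time. This article tunes each parameter with a single loop (first the number of trees, then the maximum depth) and builds the model from the results. The main steps of the workflow are: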

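1) Data preprocessing: this is mainly feature engineering. The methods used in this case are:

# notnull(): returns a boolean mask. Indexing with it, df[mask], returns the rows where raw['shot_made_flag'] is not null.
# unique(): returns the distinct values.
# value_counts(): counts the occurrences of each value.
# pd.get_dummies(): applies one-hot encoding to a column.
# np.logspace(): creates a geometric sequence, e.g. 5 numbers from 10^0 to 10^2.
# np.linspace(): creates an arithmetic sequence.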

2) Splitting the data into K folds with KFold:

kf = KFold(n_splits=10, shuffle=True)
for train_k, test_k in kf.split(train_kobe):

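3) Parameter tuning: this mainly relies on loops that try each candidate value and check which one gives the best result.

4) Model building: combine the pieces into a complete model, use the data to tune the corresponding parameters, and settle on the final model.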
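
The helper methods listed in step 1 can be tried out on a tiny table before touching the real data. The sketch below is only an illustration: the `toy` DataFrame and its values are made up, and only the column names `shot_type` and `shot_made_flag` are borrowed from the case.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shot log; the values are made up for illustration.
toy = pd.DataFrame({
    "shot_type": ["2PT Field Goal", "3PT Field Goal",
                  "2PT Field Goal", "2PT Field Goal"],
    "shot_made_flag": [1.0, np.nan, 0.0, 1.0],
})

# notnull() gives a boolean mask; indexing with it keeps the labelled rows.
labelled = toy[pd.notnull(toy["shot_made_flag"])]

# unique() lists the distinct values, value_counts() counts each of them.
distinct_types = toy["shot_type"].unique()
type_counts = toy["shot_type"].value_counts()

# get_dummies() one-hot encodes a categorical column.
one_hot = pd.get_dummies(toy["shot_type"], prefix="shot_type")

# logspace: geometric sequence from 10**0 to 10**2; linspace: arithmetic sequence.
geometric = np.logspace(0, 2, num=5).astype(int)
arithmetic = np.linspace(1, 100, num=10).astype(int)

print(labelled.shape, list(distinct_types))
print(type_counts.to_dict())
print(list(one_hot.columns))
print(geometric, arithmetic)
```

Casting to `int` truncates the fractional values, which is why the candidate lists used later for tuning contain whole numbers only.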

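This article analyzes Kobe's shot data to predict the probability that his next shot goes in.

The data also lends itself to some exploratory analysis. The Kobe data obtained for this case is: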

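The foundation of any analysis is finding suitable features: each analysis goal calls for its own features, so the meaning of every feature matters a great deal. For a data analyst, familiarity with the features determines whether a problem can be solved accurately and completely, and that familiarity builds up little by little. I hope to accumulate a bit of experience through one case after another.

This case mainly uses the data in the table above, provided for reference. The goal is to predict the probability that Kobe's next shot goes in, so the features can be analyzed with that in mind. Overall, the features can be viewed from three angles (individual factors, game factors, and team factors), and adding the notions of space and time gives the following four categories: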

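1) Action: the shot action (jump shot, dunk, and so on) and the shot type (2-pointer, 3-pointer).

2) Location: longitude, latitude, shot distance, shot zone, and so on.

3) Time: time elapsed in the game, the season, progress through the season, Kobe's time on the court, and so on.

4) Team: who the opponent is, who the teammates are, and so on.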
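
To make the location and time categories concrete, here is a minimal sketch of deriving such features. It assumes the dataset's `loc_x`, `loc_y`, `minutes_remaining` and `seconds_remaining` columns; the function `add_derived_features`, the `time_remaining` column and the toy values are my own illustrative choices. It also uses `np.arctan2` for the angle, which is a swapped-in variant and not the `arctan(y / x)` used in the full script further down.

```python
import numpy as np
import pandas as pd


def add_derived_features(shots: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `shots` with polar-coordinate and time features added."""
    out = shots.copy()
    # Location: distance and angle of the shot relative to the origin.
    out["dist"] = np.hypot(out["loc_x"], out["loc_y"])
    out["angle"] = np.arctan2(out["loc_y"], out["loc_x"])
    # Time: seconds left in the current period (hypothetical feature name).
    out["time_remaining"] = out["minutes_remaining"] * 60 + out["seconds_remaining"]
    return out


# Made-up rows, just to exercise the function.
toy = pd.DataFrame({
    "loc_x": [0, 30, -40],
    "loc_y": [50, 40, 30],
    "minutes_remaining": [10, 0, 5],
    "seconds_remaining": [30, 12, 0],
})
features = add_derived_features(toy)
print(features[["dist", "angle", "time_remaining"]])
```

`np.arctan2` handles `loc_x == 0` without a special case, so no separate mask is needed in this variant.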

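When analyzing the features, I feel that Kobe's dribbling time, shooting percentage, and turnover rate could also be taken into account, since all of these affect a player's mindset during a game. These are just my own thoughts, of course. If you are interested, feel free to add yours in the comments so we can discuss.

The complete case is: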

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, log_loss
import time
from sklearn.model_selection import KFold
# Predictive analysis of Kobe's shots: https://blog.csdn.net/weixin_42108215/article/details/80854125
filename = '/Users/livan/PycharmProjects/test/data/kobe_data.csv'
raw = pd.read_csv(filename)
print(raw.shape)
print(raw.head())
# Getting to know the data:
# notnull() returns a boolean mask; indexing with it keeps the rows where raw['shot_made_flag'] is not null.
kobe = raw[pd.notnull(raw['shot_made_flag'])]
print(kobe.shape)
alpha = 0.02
plt.figure(figsize=(10, 10))
plt.subplot(121)
# Scatter plots (single-letter color codes must be lowercase in recent matplotlib):
plt.scatter(kobe.loc_x, kobe.loc_y, color='r', alpha=alpha)
plt.title('loc_x and loc_y')
plt.subplot(122)
plt.scatter(kobe.lon, kobe.lat, color='b', alpha=alpha)
plt.title('lat and lon')
plt.show()
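# Add new columns to the original dataset:
# convert the x, y values to polar coordinates.
raw['dist'] = np.sqrt(raw['loc_x']**2 + raw['loc_y']**2)
# Boolean mask that is True where raw['loc_x'] == 0:
loc_x_zero = raw['loc_x'] == 0
# Start with pi/2 everywhere, which is the value kept for rows with loc_x == 0:
raw['angle'] = np.pi / 2
# ~loc_x_zero selects the rows where loc_x != 0; take the arctangent of y/x there.
# .loc avoids chained assignment, which may fail to write back to raw.
raw.loc[~loc_x_zero, 'angle'] = np.arctan(
    raw.loc[~loc_x_zero, 'loc_y'] / raw.loc[~loc_x_zero, 'loc_x'])
# Commonly used methods: unique() for distinct values; value_counts() for counts.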
print(kobe.action_type.unique())
print(kobe.combined_shot_type.unique())
print(kobe.shot_type.unique())
print(kobe.shot_type.value_counts())
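# For strongly correlated features we can keep just one. One way to check is to plot
# any two columns against each other; another is to compute their correlation coefficient: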
plt.figure(figsize=(5, 5))
plt.scatter(raw.dist, raw.shot_distance, color='blue')
plt.title('dist and shot_distance')
# Drop the columns that are not needed:
drops = ['shot_id', 'team_id', 'team_name', 'shot_zone_area', 'shot_zone_range',
         'shot_zone_basic', 'matchup', 'lon', 'lat', 'seconds_remaining',
         'minutes_remaining', 'shot_distance', 'loc_x', 'loc_y',
         'game_event_id', 'game_id', 'game_date']
raw = raw.drop(columns=drops)
# One-hot encoding: pd.get_dummies() applies a one-hot transform to a column.
print(pd.get_dummies(raw['combined_shot_type'], prefix='combined_shot_type').head())
# Variables that need one-hot encoding:
categorical_vars = ['action_type', 'combined_shot_type',
                    'shot_type', 'opponent', 'period',
                    'season']
for var in categorical_vars:
    # axis=1 concatenates column-wise; columns=var drops the original column:
    raw = pd.concat([raw, pd.get_dummies(raw[var], prefix=var)], axis=1)
    raw = raw.drop(columns=var)

train_kobe = raw[pd.notnull(raw['shot_made_flag'])]
train_label = train_kobe['shot_made_flag']
train_kobe = train_kobe.drop(columns='shot_made_flag')
test_kobe = raw[pd.isnull(raw['shot_made_flag'])]
test_kobe = test_kobe.drop(columns='shot_made_flag')
# np.logspace creates a geometric sequence: 5 numbers from 10^0 to 10^2
range_m = np.logspace(0, 2, num=5).astype(int)
print(range_m)
print('Finding best n_estimators for RandomForestClassifier...')
min_score = 100000
best_n = 0
scores_n = []
# np.linspace(): arithmetic sequence
range_n = np.linspace(1, 100, num=10).astype(int)
print(range_n)
# Ten candidate values for the number of trees.
# Only one parameter (the number of trees) is tuned here, so a single loop is enough.
for n in range_n:
    print("the number of trees : {0}".format(n))
    t1 = time.time()
    rfc_score = 0.
    rfc = RandomForestClassifier(n_estimators=n)
    kf = KFold(n_splits=10, shuffle=True)
    for train_k, test_k in kf.split(train_kobe):
        rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
        # log_loss expects probabilities, so use predict_proba (probability of a made shot)
        pred = rfc.predict_proba(train_kobe.iloc[test_k])[:, 1]
        rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
    scores_n.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_n = n
    t2 = time.time()
    print('Done processing {0} trees ({1:.3f}sec)'.format(n, t2 - t1))
print(best_n, min_score)
# find best max_depth for RandomForestClassifier
print('Finding best max_depth for RandomForestClassifier...')
min_score = 100000
best_m = 0
scores_m = []
range_m = np.logspace(0, 2, num=3).astype(int)
for m in range_m:
    print("the max depth : {0}".format(m))
    t1 = time.time()
    rfc_score = 0.
    rfc = RandomForestClassifier(max_depth=m, n_estimators=best_n)
    kf = KFold(n_splits=10, shuffle=True)
    for train_k, test_k in kf.split(train_kobe):
        rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
        # log_loss expects probabilities, so use predict_proba (probability of a made shot)
        pred = rfc.predict_proba(train_kobe.iloc[test_k])[:, 1]
        rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
    scores_m.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_m = m
    t2 = time.time()
    print('Done processing max depth {0} ({1:.3f}sec)'.format(m, t2 - t1))
print(best_m, min_score)

# Visualize how the score varies with each parameter:
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.plot(range_n, scores_n)
plt.ylabel('score')
plt.xlabel('number of trees')
plt.subplot(122)
plt.plot(range_m, scores_m)
plt.ylabel('score')
plt.xlabel('max depth')
plt.show()
# Train the final model:
model = RandomForestClassifier(n_estimators=best_n, max_depth=best_m)
model.fit(train_kobe, train_label)
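
The two hand-written tuning loops could also be expressed with scikit-learn's `GridSearchCV`, which searches both parameters jointly. This is only a sketch of that alternative, not the method used above: the data comes from `make_classification` as a synthetic stand-in for `train_kobe` and `train_label`, and the small grid values and `random_state` settings are my own assumptions chosen to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for train_kobe / train_label (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values for the two parameters tuned in this article.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="neg_log_loss",  # scored on predicted probabilities
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
best_log_loss = -search.best_score_  # flip the sign back to a plain log loss
print(best_params, best_log_loss)
```

A joint search tries every combination, so it costs more than tuning one parameter at a time but does not depend on the order in which the parameters are tuned.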

Once the model is built, the tuned parameters need to be combined with the model, which is then deployed into the system for use.
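
One simple way to hand a trained model to another system is to serialize it. The sketch below uses Python's standard `pickle` module purely as an illustration; the article does not say how the model is deployed, and the synthetic data and parameter values here are assumptions standing in for `train_kobe`, `best_n` and `best_m`.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (assumption, not the real data).
X, y = make_classification(n_samples=100, n_features=6, random_state=0)

# Train with fixed example parameters in place of best_n / best_m.
model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
model.fit(X, y)

# Serialize the fitted model to bytes and load it back, as a consumer would.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model predicts the probability of the positive class.
probabilities = restored.predict_proba(X[:3])[:, 1]
print(probabilities)
```

A pickled model should be loaded with the same scikit-learn version that produced it, and only from a trusted source, since unpickling can execute arbitrary code.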

livan1234
