Project description:
The PlayerUnknown’s Battlegrounds (PUBG) dataset on Kaggle contains 4,446,966 records covering 47,965 matches. Player IDs are not clearly marked, so the number of distinct players is unknown.
Analysis visualization ideas:
Data Dictionary:
Load the data and inspect it
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv(r'.\PUBG_Mobile\data\train_V2.csv')
data.describe()
data.info()
There are 29 fields in total, and only one record has a missing value.
Next, remove records that look like cheating, along with outliers.
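# Remove likely cheaters; only one row has a missing value, so drop it directly
data.dropna(inplace=True)
# Drop rows with more than 20 knockdowns (DBNOs)
df1 = data[data.DBNOs<=20]
# Drop rows with more than 3 road kills (kills made from a vehicle)
df2 = df1[df1.roadKills<=3]
# Drop rows with knockdowns but no walking at all
df3 = df2[~((df2.walkDistance==0)&(df2.DBNOs>0))]
# Drop rows with more than 3 kills and a 100% headshot rate
data_ed = df3[~((df3.kills>3)&(df3.kills==df3.headshotKills))]
# Player IDs are not clearly marked: compare row count, unique Ids, and unique matches
print(len(data_ed),data_ed['Id'].nunique(),data_ed.matchId.nunique())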
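The four filters above can also be folded into a single boolean mask, which makes it easier to see which rule removes which row. The following is only a minimal sketch on a hypothetical toy table (the rows and the helper name `drop_suspicious_rows` are made up for illustration and are not part of the original notebook):

```python
import pandas as pd

def drop_suspicious_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four filters described above and return the remaining rows."""
    keep = (
        (df["DBNOs"] <= 20)                                   # at most 20 knockdowns
        & (df["roadKills"] <= 3)                              # at most 3 road kills
        & ~((df["walkDistance"] == 0) & (df["DBNOs"] > 0))    # no knockdowns without walking
        & ~((df["kills"] > 3) & (df["kills"] == df["headshotKills"]))  # not all headshots with >3 kills
    )
    return df[keep]

# Hypothetical toy rows, not taken from the Kaggle file.
toy = pd.DataFrame({
    "DBNOs":         [0, 25, 1, 2, 1],
    "roadKills":     [0, 0, 5, 0, 0],
    "walkDistance":  [100.0, 500.0, 300.0, 0.0, 800.0],
    "kills":         [1, 10, 2, 1, 5],
    "headshotKills": [0, 3, 1, 0, 5],
})
clean = drop_suspicious_rows(toy)
print(len(toy), "->", len(clean))
```

Each of the last four toy rows violates exactly one rule, so only the first row survives.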
Analysis approach: from distributions, to rankings, to winning the match (the "chicken dinner")
1. Damage dealt by a player in a match
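# Histogram with a density curve (left) and box plot (right) of damage dealt
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_figwidth(15)
# histplot replaces the deprecated distplot
sns.histplot(data_ed['damageDealt'], kde=True, ax=ax1)
sns.boxplot(x=data_ed['damageDealt'], ax=ax2)
plt.show()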
As the plots above show, most players deal between 0 and 500 damage in a match.
2. Distribution of knockdowns per player
plt.figure(dpi=300,figsize=(24,8))
plt.hist(data_ed.DBNOs)
plt.show()
Haha, most players are very kind: they never knocked down a single opponent.
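To put a number on "most players", the share of each knockdown count can be read off with `value_counts(normalize=True)`. This is a small sketch on hypothetical counts standing in for `data_ed["DBNOs"]`; the values are invented for illustration:

```python
import pandas as pd

# Hypothetical knockdown counts standing in for data_ed["DBNOs"].
dbnos = pd.Series([0, 0, 0, 1, 0, 2, 0, 1, 0, 0], name="DBNOs")

# Relative frequency of each knockdown count, ordered by count.
share = dbnos.value_counts(normalize=True).sort_index()
print(share)
```

On the real data, the first entry of `share` would be the fraction of players with zero knockdowns.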
3. Knockdowns versus placement in the match
# Knockdowns versus placement in the match
df4 = data_ed[['DBNOs', 'winPlacePerc']]
sns.set(style="darkgrid")
# relplot creates its own figure, so no separate plt.figure() call is needed
g = sns.relplot(data=df4,x="DBNOs", y="winPlacePerc",height=8,linewidth=2,aspect=1.3, kind="line")
plt.title('DBNOs / winPlacePerc', fontsize=15)
g.fig.autofmt_xdate()
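A seaborn line plot aggregates all rows that share an x value into their mean (and by default draws a confidence band around it), so the curve above is essentially the average winPlacePerc per knockdown count. A minimal sketch of the same aggregation in pandas, on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy rows standing in for data_ed[['DBNOs', 'winPlacePerc']].
toy = pd.DataFrame({
    "DBNOs":        [0, 0, 1, 1, 2, 2],
    "winPlacePerc": [0.1, 0.3, 0.4, 0.6, 0.7, 0.9],
})

# Mean placement and number of rows for each knockdown count.
summary = toy.groupby("DBNOs")["winPlacePerc"].agg(["mean", "count"])
print(summary)
```

The `count` column is worth checking on the real data: at high knockdown counts there are few rows, so the mean there is noisy.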
4. Kills versus rank points
# Univariate analysis: kills versus the player's rank points
df4 = data_ed[['kills', 'rankPoints']]
sns.set(style="darkgrid")
# relplot creates its own figure, so no separate plt.figure() call is needed
g = sns.relplot(data=df4,x="kills", y="rankPoints",height=8,linewidth=2,aspect=1.3, kind="line")
g.fig.autofmt_xdate()
rankPoints is an Elo-style rating with 1000 as its midpoint. In the plot, the rating only rises above 1000 once the number of kills exceeds 30.
5. Win rate by team mode (solo/duo/squad)
# Win rate for each team mode (solo/duo/squad)
# Number of first-place rows (winPlacePerc == 1) per match type
df_matchType_no1 = data_ed[data_ed.winPlacePerc==1].groupby('matchType').size().to_frame('wins')
# Total number of rows per match type
df_matchType = data_ed.groupby('matchType').size().to_frame('total')
df_matchType_win = pd.merge(df_matchType,df_matchType_no1,left_index=True, right_index=True)
df_matchType_win['win_rate'] = df_matchType_win['wins']/df_matchType_win['total']
plt.figure(dpi=300,figsize=(24,8))
plt.bar(df_matchType_win.index,df_matchType_win['win_rate'])
plt.xticks(rotation=30)
plt.show()
Judging from the results, squad mode has the highest win rate, at 1.4%.
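The per-mode win rate can also be computed without a merge, by averaging a boolean "won" flag within each match type. This is a sketch on hypothetical toy rows, not the author's code:

```python
import pandas as pd

# Hypothetical toy rows standing in for data_ed[['matchType', 'winPlacePerc']].
toy = pd.DataFrame({
    "matchType":    ["solo", "solo", "solo", "solo", "duo", "duo", "squad", "squad"],
    "winPlacePerc": [1.0, 0.5, 0.2, 0.0, 1.0, 0.3, 0.8, 0.1],
})

# The mean of a boolean column is the fraction of True values,
# i.e. the share of rows that finished first in each mode.
win_rate = (toy["winPlacePerc"] == 1).groupby(toy["matchType"]).mean()
print(win_rate)
```

One difference in behavior: a mode with no winners shows up here with a rate of 0, whereas an inner merge of totals and win counts would drop that mode entirely.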
6. Walking distance versus winning
# Walking distance versus winning: walkDistance / winPlacePerc
df_ride = data_ed[['walkDistance', 'winPlacePerc']].copy()
labels=["0k-1k", "1k-2k", "2k-3k", "3k-4k","4k-5k", "5k-6k", "6k-7k", "7k-8k"]
# Use explicit edges every 1,000 so the labels match the bins; larger distances fall outside them
df_ride['walkDistance_cut'] = pd.cut(df_ride['walkDistance'], bins=list(range(0, 9000, 1000)), labels=labels, include_lowest=True)
df_ride.groupby('walkDistance_cut').winPlacePerc.mean().plot.bar(rot=30, figsize=(24, 8))
plt.xlabel("walkDistance_cut")
plt.ylabel("winPlacePerc")
7. Vehicle distance versus winning
# Vehicle distance versus winning: rideDistance / winPlacePerc
df_ride = data_ed.loc[data_ed['rideDistance']<10000, ['rideDistance', 'winPlacePerc']].copy()
labels=["0k-1k", "1k-2k", "2k-3k", "3k-4k","4k-5k", "5k-6k", "6k-7k", "7k-8k"]
# Use explicit edges every 1,000 so the labels match the bins; larger distances fall outside them
df_ride['drive'] = pd.cut(df_ride['rideDistance'], bins=list(range(0, 9000, 1000)), labels=labels, include_lowest=True)
df_ride.groupby('drive').winPlacePerc.mean().plot.bar(rot=30, figsize=(24, 8))
plt.xlabel("rideDistance")
plt.ylabel("winPlacePerc")
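A note on the binning used in sections 6 and 7: when `pd.cut` is given an integer, it creates that many equal-width bins over the observed range of the data, so labels such as "0k-1k" only describe the bins if the edges are passed explicitly. The sketch below contrasts the two on hypothetical distances:

```python
import pandas as pd

# Hypothetical distances; the last value stretches the observed range.
dist = pd.Series([0.0, 500.0, 1500.0, 2500.0, 7999.0, 20000.0])
labels = ["0k-1k", "1k-2k", "2k-3k", "3k-4k", "4k-5k", "5k-6k", "6k-7k", "7k-8k"]

# Integer argument: 8 equal-width bins spanning min..max of the data.
equal_width = pd.cut(dist, 8)

# Explicit edges 0, 1000, ..., 8000: bins really are 1,000 wide.
# include_lowest=True keeps exact zeros in the first bin.
explicit = pd.cut(dist, bins=list(range(0, 9000, 1000)),
                  labels=labels, include_lowest=True)

print(pd.DataFrame({"dist": dist, "equal_width": equal_width, "explicit": explicit}))
```

With equal-width bins, 0 and 1500 land in the same bin here because the outlier widens every bin; with explicit edges they are separated, and the value beyond 8000 becomes NaN and is left out of any later `groupby`.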
8. Boost items versus winning
# Boost items versus winning: boosts / winPlacePerc
df4 = data_ed[['boosts', 'winPlacePerc']]
sns.set(style="darkgrid")
# relplot creates its own figure, so no separate plt.figure() call is needed
g = sns.relplot(data=df4,x="boosts", y="winPlacePerc",height=8,linewidth=2,aspect=1.3, kind="line")
g.fig.autofmt_xdate()
Multivariate correlation
# Drop the fields that are not used for modeling: Id, groupId, matchId, matchType
data_m = data.drop(['Id', 'groupId', 'matchId', 'matchType'],axis=1)
matrix = data_m.corr()
cmap = sns.diverging_palette(250, 15, s=70, l=75, n=40, center="light", as_cmap=True)
plt.figure(figsize=(24, 12))
sns.heatmap(matrix, center=0, annot=True,fmt='.2f', square=True, cmap=cmap)
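A 25-by-25 heatmap is hard to read at a glance, so it can help to pull out just the target's column of the correlation matrix and sort it by absolute value. The sketch below does this on synthetic data; the column names mirror the dataset, but the values and the formula generating `winPlacePerc` are invented purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
walk = rng.uniform(0, 4000, n)
kill_place = rng.integers(1, 101, n)

# Synthetic target: rises with walking distance, falls with kill placement.
toy = pd.DataFrame({
    "walkDistance": walk,
    "killPlace": kill_place,
    "winPlacePerc": 0.5 + walk / 8000 - kill_place / 200 + rng.normal(0, 0.1, n),
})

# Correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["winPlacePerc"]
    .drop("winPlacePerc")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Applied to `matrix` from the code above, the same three chained calls would list the features most strongly related to winPlacePerc, with the sign showing the direction.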
Looking at winPlacePerc, the relatively strong correlations are with the player's walking distance and the number of boost items used, while the kill ranking (killPlace) is negatively correlated.
Split the data set
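from sklearn.model_selection import train_test_split
# Target is winPlacePerc; all remaining columns are features
y = data_m['winPlacePerc'].values
x = data_m.drop(columns=['winPlacePerc']).values
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.3)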
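Linear regression
# Linear regression (LR is an alias for scikit-learn's LinearRegression)
from sklearn.linear_model import LinearRegression as LR
reg = LR().fit(xtrain,ytrain)
y_hat = reg.predict(xtest)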
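Random forest
# Random forest; winPlacePerc is continuous, so use a regressor rather than a classifier
from sklearn.ensemble import RandomForestRegressor
rfc = RandomForestRegressor(random_state=0)
rfc = rfc.fit(xtrain,ytrain)
rfc_y_hat = rfc.predict(xtest)
# score_r = rfc.score(xtest,ytest)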
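Since winPlacePerc is a continuous value between 0 and 1, the choice between a forest classifier and a forest regressor matters: a classifier needs discrete labels, and casting such a target to integers truncates everything below 1 to 0. The sketch below shows the truncation and then fits a `RandomForestRegressor` on synthetic data; the features, target formula, and `n_estimators=10` are assumptions chosen only to keep the example small:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Casting a continuous target to int64 truncates toward zero.
targets = np.array([0.0, 0.4, 0.99, 1.0])
truncated = targets.astype("int64")
print(truncated.tolist())

# Synthetic features and a continuous target clipped to [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = np.clip(0.6 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 0.05, 200), 0, 1)

# A regressor fits the continuous target directly, with no cast.
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X[:150], y[:150])
preds = model.predict(X[150:])
print(preds[:5])
```

Forest regression predictions are averages of training targets, so they stay inside the range of the training values.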
RMSE, MSE, R-squared and MAE are used to evaluate the accuracy of the regression model.
# Linear regression
from sklearn import metrics
MSE = metrics.mean_squared_error(ytest, y_hat)
RMSE = metrics.mean_squared_error(ytest, y_hat)**0.5
MAE = metrics.mean_absolute_error(ytest, y_hat)
MSE,RMSE,MAE
mse=0.016028860503889776, rmse=0.126605136167099378, mae=0.09272709032057316
# Random forest
MSE = metrics.mean_squared_error(ytest, rfc_y_hat)
RMSE = metrics.mean_squared_error(ytest, rfc_y_hat)**0.5
MAE = metrics.mean_absolute_error(ytest, rfc_y_hat)
MSE,RMSE,MAE
mse=0.014725708056613685, rmse=0.12134952845649498, mae=0.08928706404803585
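The evaluation text mentions R-squared alongside MSE, RMSE and MAE, but the code above only computes the latter three. A small helper can report all four at once; this is a sketch in which the name `regression_report` and the toy arrays are hypothetical, while `metrics.r2_score` is scikit-learn's standard R-squared function:

```python
import numpy as np
from sklearn import metrics

def regression_report(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for one set of predictions."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": metrics.mean_absolute_error(y_true, y_pred),
        "R2": metrics.r2_score(y_true, y_pred),
    }

# Hypothetical true values and predictions, for illustration only.
y_true_toy = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y_pred_toy = np.array([0.1, 0.25, 0.4, 0.75, 0.9])

report = regression_report(y_true_toy, y_pred_toy)
print(report)
```

In the notebook it would be called as `regression_report(ytest, y_hat)` and `regression_report(ytest, rfc_y_hat)` to compare the two models on the same footing.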
References
https://codeantenna.com/a/Rn2nLom4jT
https://www.jianshu.com/p/57c0f0266c10
https://www.heywhale.com/mw/project/63f19d69030c7011ddd54ab7