Game Player Behavior Data Analysis and Prediction Based on Python

1. Project background and analysis objectives

1. Requirements and application scenarios

With the continuous development of the game industry, more and more game companies need to analyze their operating data in order to optimize operation strategies and improve user retention and revenue. Operations analysis helps companies understand user behavior, revenue sources, market trends, and more, and guides them in tailoring marketing and user-management strategies. Application scenarios for game operations analysis include:

1. Game companies optimize games based on user data

By analyzing data such as game behaviors, retention rates, and payment habits of different users, game companies can accurately locate the needs and habits of different user groups, provide personalized game services for different users, optimize user experience, and increase retention rates and revenue.

2. Tailor-made marketing strategy

Game companies can formulate more precise marketing strategies by analyzing market trends and competition. Based on user characteristics such as region, age, gender, and game preferences, they can develop corresponding game products and marketing plans and optimize them for user needs.

3. Monitor the operation status of the game and adjust the strategy in time

Game companies can conduct real-time monitoring based on user data and game data, grasp the operating status of the game, adjust strategies in a timely manner, and improve the user experience and profitability of the game.

2. Analysis objectives

Taking a game operation situation analysis project as an example, the analysis objectives include:

1. User Behavior Analysis

By analyzing user data, including user active time, user retention time, user level distribution, proportion of paying users, etc., it provides game companies with data references on user preferences and behavior habits. Based on user behavior analysis, game companies can more accurately understand user needs, optimize game services, provide better user experience, and increase retention and payment rates.

2. Analysis of game revenue sources

By analyzing data on game revenue sources, game companies can understand how different revenue sources relate to one another, identify the most important source of income and why it matters most, and analyze and improve payment behavior in each channel. Such insight helps companies prioritize revenue sources and adjust and optimize their distribution strategy.

In short, the analysis of game operation can help game companies understand the operation of the game, formulate reasonable countermeasures for different problems, optimize operation strategies, and improve profitability.

2. Dataset source and description

All of the data for this course report comes from a public dataset on Data Castle (数据城堡).

The dataset contains more than 20,000 records with 110 features. For ease of analysis, 11 of these features were selected.

'user_id': unique player ID
'avg_online_minutes': average online time (minutes)
'pvp_battle_count': number of battles against other players (PvP)
'pvp_lanch_count': number of PvP battles the player initiated
'pvp_win_count': number of PvP battles the player won
'pve_battle_count': number of battles against the computer (PvE)
'pve_lanch_count': number of PvE battles the player initiated
'pve_win_count': number of PvE battles the player won

'user_id': unique player ID
'pay_price': recharge amount
'pay_count': number of recharges
'prediction_pay_price': predicted recharge amount

3. Application of big data analysis technology

1. Data preprocessing code, annotations and running results

1. Import datasets and libraries

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('./data/game_player.csv', encoding='gbk')


2. Slice out the required features

# Slice out the required features and name the result data
data = df[
    [
        'user_id',              # unique player ID
        'avg_online_minutes',   # average online time
        'pvp_battle_count',     # number of PvP battles
        'pvp_lanch_count',      # number of PvP battles initiated by the player
        'pvp_win_count',        # number of PvP battles won
        'pve_battle_count',     # number of PvE battles
        'pve_lanch_count',      # number of PvE battles initiated by the player
        'pve_win_count'         # number of PvE battles won
    ]
]

data


3. Drop missing values and remove duplicates

# Drop missing values
print('Shape of the dataset before dropping rows with missing values:', data.shape)
data_1 = data.dropna(axis=0, how='any')
print('Shape of the dataset after dropping rows with missing values:', data_1.shape)

# Deduplicating only the user_id column would change the structure of the result,
# so deduplicate the whole frame instead
# data1 = data_1['user_id'].drop_duplicates()
data1 = data_1.drop_duplicates()
print('Total number of player IDs after deduplication with drop_duplicates:', len(data1))

The processed dataset is named data1.

4. Pearson correlation matrix of three features

# Compute the Pearson correlation matrix of three features: PvP battle count,
# PvP battles initiated, and PvP battles won
corr_data1 = data[['pvp_battle_count', 'pvp_lanch_count', 'pvp_win_count']].corr(method='pearson')
print('Correlation of PvP battle count, PvP battles initiated, and PvP battles won:\n', corr_data1)


5. Slice out the required new features and name the result data2 (a sketch is given below)

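The slicing code for data2 appears only as a screenshot in the original. A minimal sketch, assuming data2 holds the four payment-related features listed in Section 2:

# Hedged sketch: slice the payment-related features into data2 (column set assumed)
data2 = df[
    [
        'user_id',               # unique player ID
        'pay_price',             # recharge amount
        'pay_count',             # number of recharges
        'prediction_pay_price'   # predicted recharge amount
    ]
]
data2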

6. Min-max (deviation) normalization

# Min-max (deviation) normalization
# Custom min-max scaling function
def min_max_scale(data1):
    data1 = (data1 - data1.min()) / (data1.max() - data1.min())
    return data1

# Apply min-max scaling to the average online time
time_min_max = min_max_scale(data1['avg_online_minutes'])
print('Online time before min-max scaling:\n', data1['avg_online_minutes'])
print('Online time after min-max scaling:\n', time_min_max)


7. Inner join, outer join, and saving the preprocessed dataset

# Merge data1 and data2 with an outer join
print('Outer join merged data frame size:',
      pd.concat([data1, data2], axis=1, join='outer').shape)

# Merge data1 and data2 with an inner join
print('Inner join merged data frame size:',
      pd.concat([data1, data2], axis=1, join='inner').shape)

data3 = pd.merge(data1, data2, how='inner', on='user_id')
data3.to_csv('./data/吴硕秋202006180058.csv', sep=';', index=False)

2. Data exploration and feature construction

1. Analysis of player activity

(1) Calculate the average online duration of all players
avg_time = data3.avg_online_minutes.mean()
avg_time


(2) Calculate the average online duration of paying players
pay_avg_time = data3[data3.pay_price > 0].avg_online_minutes.mean()
pay_avg_time


# Use equal-width discretization to examine the distribution of recharge counts
pay_cut = pd.cut(data2['pay_count'], 40)
print('Distribution of discretized recharge counts:\n', pay_cut.value_counts())


(3) Draw the player average online time boxplot

Draw a boxplot of the average online time of all players

plt.figure(figsize=(10,10))
plt.boxplot(data3.avg_online_minutes)
plt.rcParams['font.sans-serif']=['Microsoft YaHei']
plt.rcParams['axes.unicode_minus']= False
plt.title('Box plot of average online time of all players')
plt.show()


Draw a boxplot of the average online time of paying players

plt.figure(figsize=(10,10))
plt.boxplot(data3[data3.pay_price > 0].avg_online_minutes)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
plt.title('Paid player average online time box plot')
plt.show()

Average online time of players who have taken part in PvP battles

pvp_avg_time = data3[data3.pvp_battle_count > 0].avg_online_minutes.mean()
pvp_avg_time


Evaluation
The average online time of all players is 9.6 minutes, while that of paying players is 135.8 minutes, roughly 14 times the overall average. Paying players are clearly more active.

2. Analysis of player payment rate

(1) Obtain the number of players whose payment count exceeds 0 (a sketch is given below)

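The counting code appears only as a screenshot. A minimal sketch, assuming the condition is pay_count > 0 and the result is stored in pay_num, the name that the pie-chart code in Section 6.2 relies on:

# Hedged sketch: number of players with at least one payment (condition assumed)
pay_num = len(data3[data3.pay_count > 0])
print('Number of paying players:', pay_num)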

(2) Draw a pie chart (the plotting code is shown in Section 6.2 below)

3. Player payment analysis and correlation exploration

(1) Define HY, total_pay, HY_AVG, HY_PAY_COUNT, PAY_AVG, and PAY_PRO as the number of active players, total revenue, average revenue per active player, number of active paying players, average revenue per paying player, and the payment rate, respectively. A sketch is given below.
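These quantities appear only as a screenshot in the original. A minimal sketch under the stated definitions, assuming "active" means an average online time above 10 minutes, the threshold used in the gaming-habits section below:

# Hedged sketch: activity and payment aggregates (activity threshold assumed)
HY = len(data3[data3.avg_online_minutes > 10])                                       # active players
total_pay = data3.pay_price.sum()                                                    # total revenue
HY_AVG = total_pay / HY                                                              # average revenue per active player
HY_PAY_COUNT = len(data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)])   # active paying players
PAY_AVG = total_pay / len(data3[data3.pay_price > 0])                                # average revenue per paying player
PAY_PRO = len(data3[data3.pay_price > 0]) / len(data3)                               # payment rate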

(2) The relationship between active players and recharge amount


The payment rate of this game is low, and there is still room for improvement; related activities can be run to raise the payment rate.
The per-capita spending of paying players is 32, which shows that paying users have strong spending power overall. Follow-up analysis of paying users can help ensure they keep paying.

4. Analysis of players' gaming habits

(1) Average PVP count of active users

HY_pvp_battle_coun = data3[data3.avg_online_minutes > 10].pvp_battle_count.mean()

Total number of PVP battles of active users

HY_count_pvp = data3[data3.avg_online_minutes > 10].pvp_battle_count.sum()

Number of PVP battles initiated by active users

HY_count_lanch_pvp = data3[data3.avg_online_minutes > 10].pvp_lanch_count.sum()

Probability that active users initiate PVP

HY_rate_lanch_pvp = HY_count_lanch_pvp/HY_count_pvp

Total number of PVP wins of active users

HY_num_win_pvp = data3[data3.avg_online_minutes > 10].pvp_win_count.sum()

PVP win probability of active users

HY_rate_win_pvp = HY_num_win_pvp/HY_count_pvp

print(f'Average PVP count of active users: {HY_pvp_battle_coun}')
print(f'Probability that active users initiate PVP: {HY_rate_lanch_pvp}')
print(f'PVP win probability of active users: {HY_rate_win_pvp}')


(2) PVE statistics of active users

Average PVE count of active users

HY_pve_battle_coun = data3[data3.avg_online_minutes > 10].pve_battle_count.mean()

Total number of PVE battles of active users

HY_count_pve = data3[data3.avg_online_minutes > 10].pve_battle_count.sum()

Number of PVE battles initiated by active users

HY_count_lanch_pve = data3[data3.avg_online_minutes > 10].pve_lanch_count.sum()

Probability that active users initiate PVE

HY_rate_lanch_pve = HY_count_lanch_pve/HY_count_pve

Total number of PVE wins of active users

HY_num_win_pve = data3[data3.avg_online_minutes > 10].pve_win_count.sum()

PVE win probability of active users

HY_rate_win_pve = HY_num_win_pve/HY_count_pve

print(f'Average PVE count of active users: {HY_pve_battle_coun}')
print(f'Probability that active users initiate PVE: {HY_rate_lanch_pve}')
print(f'PVE win probability of active users: {HY_rate_win_pve}')

(3) PVP statistics of active paying users

Average PVP count of active paying users

HY_PAY_COUNT_pvp_battle_coun = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_battle_count.mean()

Total number of PVP battles of active paying users

HY_PAY_COUNT_count_pvp = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_battle_count.sum()

Number of PVP battles initiated by active paying users

HY_PAY_COUNT_count_lanch_pvp = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_lanch_count.sum()

Probability that active paying users initiate PVP

HY_PAY_COUNT_rate_lanc_pvp = HY_PAY_COUNT_count_lanch_pvp/HY_PAY_COUNT_count_pvp

Total number of PVP wins of active paying users

HY_PAY_COUNT_num_win_pvp = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_win_count.sum()

PVP win probability of active paying users

HY_PAY_COUNT_rate_win_pvp = HY_PAY_COUNT_num_win_pvp/HY_PAY_COUNT_count_pvp

print(f'Average PVP count of active paying users: {HY_PAY_COUNT_pvp_battle_coun}')
print(f'Probability that active paying users initiate PVP: {HY_PAY_COUNT_rate_lanc_pvp}')
print(f'PVP win probability of active paying users: {HY_PAY_COUNT_rate_win_pvp}')


(4) PVE statistics of active paying users

Average PVE count of active paying users

HY_PAY_COUNT_pve_battle_coun = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_battle_count.mean()

Total number of PVE battles of active paying users

HY_PAY_COUNT_count_pve = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_battle_count.sum()

Number of PVE battles initiated by active paying users

HY_PAY_COUNT_count_lanch_pve = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_lanch_count.sum()

Probability that active paying users initiate PVE

HY_PAY_COUNT_rate_lanc_pve = HY_PAY_COUNT_count_lanch_pve/HY_PAY_COUNT_count_pve

Total number of PVE wins of active paying users

HY_PAY_COUNT_num_win_pve = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_win_count.sum()

PVE win probability of active paying users

HY_PAY_COUNT_rate_win_pve = HY_PAY_COUNT_num_win_pve/HY_PAY_COUNT_count_pve

print(f'Average PVE count of active paying users: {HY_PAY_COUNT_pve_battle_coun}')
print(f'Probability that active paying users initiate PVE: {HY_PAY_COUNT_rate_lanc_pve}')
print(f'PVE win probability of active paying users: {HY_PAY_COUNT_rate_win_pve}')

Visualization (the grouped bar chart comparing these rates is drawn in Section 6.3 below)

Comment

1) The average PVE and PVP counts of active paying players are higher than those of active players in general, so active paying players are more willing to spend time in this game.
2) In PVP battles, the win rate of active paying players is much higher than that of active players, indicating that the game's paid items let active paying accounts (APA) enjoy the fun of winning battles.

3. Classification model construction and evaluation: source code, annotations, and results

1. First construct a feature correlation heatmap to understand the relationships between the features.

This part needs to build both a regression model and a classification model and compare them, so understanding how the features relate to one another is particularly important. Before that, create a new binary feature: players whose online time is less than half of the average online time of all players form one class and the rest form the other, and append it to data3 as the last column. A sketch of this step is given below.
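The feature-construction and heatmap code appears only as a screenshot. A minimal sketch, assuming the new column is named label, players at or above half the overall mean online time are labeled 1, and seaborn is used for the heatmap:

# Hedged sketch: build the binary label and draw a feature correlation heatmap
import seaborn as sns

half_avg = data3.avg_online_minutes.mean() / 2
data3['label'] = (data3.avg_online_minutes >= half_avg).astype(int)   # appended as the last column

plt.figure(figsize=(10, 8))
sns.heatmap(data3.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature correlation heatmap')
plt.show()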


2. Model construction: dataset division

Select the new feature created in the previous section for analysis.

This part divides the data and labels, splits the training and test sets, and standardizes the dataset so that the algorithms can use it directly. A confusion-matrix helper is also defined along the way.

Divide the data and labels

data3_data = data3.iloc[:, :-1]
data3_target = data3.iloc[:, -1]

# Split the training and test sets
from sklearn.model_selection import train_test_split
data3_data_train, data3_data_test, data3_target_train, data3_target_test = train_test_split(
    data3_data, data3_target, test_size=0.2, random_state=66)

Standardize the dataset

from sklearn.preprocessing import StandardScaler
stdScale = StandardScaler().fit(data3_data_train)
data3_trainScaler = stdScale.transform(data3_data_train)
data3_testScaler = stdScale.transform(data3_data_test)

# Confusion matrix and derived metrics
from sklearn.metrics import confusion_matrix

def test_pre(pred):
    hx = confusion_matrix(data3_target_test, pred)
    print('Confusion matrix:\n', hx)

    # Precision
    P = hx[1, 1] / (hx[0, 1] + hx[1, 1])
    print('Precision:\n', round(P, 3))

    # Recall
    R = hx[1, 1] / (hx[1, 0] + hx[1, 1])
    print('Recall:\n', round(R, 3))

    # F1 score
    F1 = 2 * P * R / (P + R)
    print('F1 score:', round(F1, 3))

Undersampling (the code only appears as a screenshot; a sketch is given below)
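A minimal sketch of one common approach, randomly downsampling the majority class to the minority class size with pandas; the column name label and the balancing strategy are assumptions:

# Hedged sketch: random undersampling of the majority class
majority = data3[data3['label'] == 0]    # assumed majority class
minority = data3[data3['label'] == 1]

majority_down = majority.sample(n=len(minority), random_state=66)
data3_balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=66)
print('Class counts after undersampling:\n', data3_balanced['label'].value_counts())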

3. Build a classification model with the SVM algorithm, evaluate it, and plot the ROC curve

Use the SVM algorithm to predict on the test set and display the first 20 predictions. The prediction yielded 2,125 correct results and 69 wrong results, an accuracy of about 96%. A sketch of the code is given below.
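The SVM code itself is only shown as a screenshot in the original. A minimal sketch consistent with the surrounding pipeline (standardized features, the test_pre helper for evaluation); the kernel and parameters are assumptions:

# Hedged sketch: SVM classifier on the standardized data
from sklearn.svm import SVC

svm = SVC(kernel='rbf', random_state=66)   # kernel choice is an assumption
svm.fit(data3_trainScaler, data3_target_train)

svm_pred = svm.predict(data3_testScaler)
print('First 20 predictions:', svm_pred[:20])
print('Accuracy:', svm.score(data3_testScaler, data3_target_test))

test_pre(svm_pred)   # confusion matrix, precision, recall, F1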


In the evaluation, the F1 scores of the two classes are 0.98 and 0.94.

Draw the ROC curve
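The ROC code appears only as a screenshot. A minimal sketch, assuming the SVM model above and its decision_function scores:

# Hedged sketch: ROC curve for the SVM model
from sklearn.metrics import roc_curve, auc

scores = svm.decision_function(data3_testScaler)
fpr, tpr, _ = roc_curve(data3_target_test, scores)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()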

4. Construct and evaluate a model using Gaussian Naive Bayes

The approach is the same as for the SVM; the only difference is the algorithm, as sketched below.
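A minimal sketch, swapping Gaussian Naive Bayes into the same pipeline:

# Hedged sketch: Gaussian Naive Bayes on the standardized data
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(data3_trainScaler, data3_target_train)

gnb_pred = gnb.predict(data3_testScaler)
print('Accuracy:', gnb.score(data3_testScaler, data3_target_test))
test_pre(gnb_pred)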


Model evaluation: compared with the SVM, this model's accuracy is slightly lower at 75%. According to the classification report, the precision of the two classes is 0.98 and 0.54 and the F1 scores are 0.80 and 0.70, all lower than the SVM's figures by a considerable margin.

Draw the ROC curve


4. Regression model construction and evaluation: source code, annotations, and results

1. Divide the dataset

The target of this experiment is the 'recharge times' (pay_count). This part divides the data and labels, splits the training and test sets, and standardizes the dataset so that the algorithms can use it directly. A sketch is given below.
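The splitting code appears only as a screenshot. A minimal sketch, assuming pay_count ('recharge times') is the regression target, the remaining numeric columns serve as features, and the variable names from the classification part are reused (the plotting code below relies on data3_target_test):

# Hedged sketch: feature/target split and standardization for the regression task
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data3_data = data3.drop(columns=['user_id', 'pay_count', 'label'])   # feature columns (assumption)
data3_target = data3['pay_count']                                    # 'recharge times' as the target

data3_data_train, data3_data_test, data3_target_train, data3_target_test = train_test_split(
    data3_data, data3_target, test_size=0.2, random_state=66)

stdScale = StandardScaler().fit(data3_data_train)
data3_trainScaler = stdScale.transform(data3_data_train)
data3_testScaler = stdScale.transform(data3_data_test)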

2. Random forest regression model construction

Use the random forest regression algorithm; a sketch is given below.
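The model code appears only as a screenshot. A minimal sketch; the number of trees is an assumption:

# Hedged sketch: random forest regressor
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=66)
rf.fit(data3_trainScaler, data3_target_train)
y_pred = rf.predict(data3_testScaler)   # used by the plotting code below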

Draw a visualization of the regression results

from matplotlib import rcParams
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
fig = plt.figure(figsize=(12, 6))
plt.plot(range(data3_target_test.shape[0]), list(data3_target_test), color='blue')
plt.plot(range(data3_target_test.shape[0]), y_pred, color='red', linewidth=2.5, linestyle='-.')
plt.xlabel('sample index')
plt.ylabel('recharge times')
plt.legend(['true result', 'predicted result'])
plt.show()

Print and view the regression report
Import the metrics and inspect the random forest model's scores, as sketched below.
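A minimal sketch of such a report, using standard sklearn regression metrics:

# Hedged sketch: regression report for the random forest model
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, explained_variance_score)

print('MAE:', mean_absolute_error(data3_target_test, y_pred))
print('MSE:', mean_squared_error(data3_target_test, y_pred))
print('R^2:', r2_score(data3_target_test, y_pred))
print('Explained variance:', explained_variance_score(data3_target_test, y_pred))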

3. Support vector regression model construction

Model building follows the same pipeline as the random forest; a sketch is given below.
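A minimal sketch; the kernel is an assumption:

# Hedged sketch: support vector regression on the standardized data
from sklearn.svm import SVR

svr = SVR(kernel='rbf')
svr.fit(data3_trainScaler, data3_target_train)
svr_pred = svr.predict(data3_testScaler)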


Results visualization

Print and view regression report

Evaluation and comparison of the two models
Compared with the support vector regression model, the random forest model has a higher R-squared of 0.84, and its explained variance is also 0.84.

5. Comparison and explanation of analysis results of various models

Evaluate and compare two regression models

Compared with the support vector regression model, the random forest model has a higher R-squared of 0.84, and its explained variance is also 0.84.

Comparative evaluation of classification models

Compared with the SVM, the Gaussian Naive Bayes model's accuracy is slightly lower at 75%. According to the classification report, the precision of the two classes is 0.98 and 0.54 and the F1 scores are 0.80 and 0.70, all lower than the SVM's figures by a considerable margin.

6. Application of data visualization technology

1. The first data visualization: source code, results, and a brief description

Box plots clearly display five summary statistics of the data: the minimum, lower quartile, median, upper quartile, and maximum, and they are concise and easy to read. Since this visualization analyzes average online time and calls for comparing several summary statistics at once, a box plot is well suited.

Draw a boxplot of the average online time of paying players

plt.figure(figsize=(10,10))
plt.boxplot(data3[data3.pay_price > 0].avg_online_minutes)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
plt.title('Paid player average online time box plot')
plt.show()


2. The second data visualization: source code, results, and a brief description

The advantage of a pie chart is that it emphasizes proportions and is easy to understand. Here we compare the numbers of paying and non-paying players; with only two categories, the pie chart makes the proportions and the gap between them immediately visible.

Make a percentage pie chart

plt.figure(figsize=(8,8))

drawing

patches, l_text, p_text = plt.pie([22877 - pay_num, pay_num],
                                  labels=['unpaid', 'paid'],
                                  labeldistance=0.3,
                                  colors=['#87CEFA', '#FFC0CB'],
                                  explode=[0.01, 0.05],
                                  autopct='%1.1f%%',
                                  pctdistance=1.15)

set label size

for t in l_text:
    t.set_size(20)

set percent font size

for t in p_text:
    t.set_size(20)

set title

plt.title('The ratio of paid users to all users', size=25)
plt.show()


3. The third data visualization: source code, results, and a brief description

The advantage of a grouped bar chart is that numeric differences between categories are shown by bar height, and a single plot is simple and efficient, which makes comparisons easy. It is therefore convenient to place several pairwise comparisons in one figure to show the gaps.

plt.figure(figsize=(15,8))

# Active players
plt.bar([0.75, 2.75, 4.75, 6.75],
        [HY_rate_lanch_pve, HY_rate_win_pve, HY_rate_lanch_pvp, HY_rate_win_pvp],
        width=0.5, alpha=0.5, label='active players')

# Active paying players
plt.bar([1.25, 3.25, 5.25, 7.25],
        [HY_PAY_COUNT_rate_lanc_pve, HY_PAY_COUNT_rate_win_pve,
         HY_PAY_COUNT_rate_lanc_pvp, HY_PAY_COUNT_rate_win_pvp],
        width=0.5, color='r', alpha=0.5, label='active paying players')

plt.xticks([1, 3, 5, 7],
           ['PVE initiation rate', 'PVE win rate', 'PVP initiation rate', 'PVP win rate'])
plt.legend()
plt.show()


7. Course conclusion and experience

This course report pulls together everything learned in class over the term. Above all, it gave me a new understanding of data preprocessing, exploratory analysis, and the construction of regression and classification models.

Data preprocessing is an important step in any data analysis, machine learning, or deep learning project. Its purpose is to ensure a complete, standardized, clear, and consistent dataset so that subsequent analysis, model fitting, and prediction can be carried out well. Data cleaning is the first step of preprocessing and includes removing duplicates, handling missing values, and treating outliers. Feature processing is another important part; it covers selecting, extracting, and transforming data features, so after the initial cleaning the data must be organized and refined features extracted. Successful preprocessing matters before large-scale analysis and machine learning because it improves accuracy and separates the signal in the samples from the noise that would otherwise drive the final results.
For exploratory analysis, before the preprocessing and modeling stages one should drill down into the details of the dataset and gather as much information and as many potential insights as possible, in order to plan the subsequent algorithm and modeling tasks efficiently and reduce risk.
Regression models are a common supervised learning task for predicting numerical variables, and they play an important role in many practical problems. Understanding the different types of regression models and their evaluation criteria, and mastering optimization strategies, leads to better model selection, construction, validation, and prediction.
Classification models are likewise a common supervised learning task, important in many practical problems; one should understand the types of classification models and their evaluation metrics, and master optimization techniques and methods.
The biggest takeaway from this project is attention to detail. For example, when deduplicating during preprocessing, drop_duplicates must be used rather than converting to a list or similar, because only drop_duplicates keeps the structure of the data frame intact. Another lesson concerns the classification part: before making predictions, a heatmap must be drawn to check the correlation of the features and build an intuitive understanding.

