Game Player Behavior Data Analysis and Prediction
1. Introduction to the project background and analysis objectives
1. Requirements and application scenarios
With the continuous development of the game industry, more and more game companies need to analyze game operation data in order to optimize operation strategies and improve user retention and revenue. Such analysis helps companies understand user behavior, revenue sources, and market trends, and guides them in tailoring marketing and user management strategies. Typical application scenarios of game operation analysis include:
1. Game companies optimize games based on user data
By analyzing data such as in-game behavior, retention rates, and payment habits, game companies can precisely identify the needs and habits of different user groups, offer personalized game services, optimize the user experience, and increase retention and revenue.
2. Tailor-made marketing strategy
Game companies can formulate more precise marketing strategies by analyzing market trends and competition. Based on user characteristics such as region, age, gender, and game preferences, they can develop targeted game products and marketing plans and keep optimizing them around user needs.
3. Monitor the operation status of the game and adjust the strategy in time
Game companies can conduct real-time monitoring based on user data and game data, grasp the operating status of the game, adjust strategies in a timely manner, and improve the user experience and profitability of the game.
2. Analysis objectives
Taking a game operation situation analysis project as an example, the analysis objectives include:
1. User Behavior Analysis
Analyzing user data such as active time, retention time, level distribution, and the proportion of paying users gives game companies a data-based picture of user preferences and behavior habits. Based on this analysis, companies can understand user needs more precisely, optimize game services, deliver a better user experience, and increase retention and payment rates.
2. Analysis of game revenue sources
By analyzing revenue-source data, game companies can understand the correlation between different revenue sources, identify the most important source of income and why it dominates, and analyze and improve payment behavior on each channel. This data reference helps companies adjust and optimize the distribution of their revenue sources.
In short, the analysis of game operation can help game companies understand the operation of the game, formulate reasonable countermeasures for different problems, optimize operation strategies, and improve profitability.
2. Dataset source and description
All data for this course report come from a public dataset on DataCastle (数据城堡); the link is
The dataset contains more than 20,000 records and 110 features. For ease of analysis, 11 of these features were selected:
user_id: the player's unique ID
avg_online_minutes: average online time (minutes)
pvp_battle_count: number of battles against other players (PvP)
pvp_lanch_count: number of PvP battles initiated by the player
pvp_win_count: number of PvP battles won
pve_battle_count: number of battles against the computer (PvE)
pve_lanch_count: number of PvE battles initiated by the player
pve_win_count: number of PvE battles won
pay_price: recharge amount
pay_count: number of recharges
prediction_pay_price: predicted recharge amount
3. Application of big data analysis technology
1. Data preprocessing code, annotations and running results
1. Import datasets and libraries
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('./data/game_player.csv', encoding='gbk')
2. Slice out the required features
# Slice out the required features and name the result data
data = df[
    [
        'user_id',             # unique player ID
        'avg_online_minutes',  # average online time
        'pvp_battle_count',    # number of PvP battles
        'pvp_lanch_count',     # number of PvP battles initiated by the player
        'pvp_win_count',       # number of PvP battles won
        'pve_battle_count',    # number of PvE battles
        'pve_lanch_count',     # number of PvE battles initiated by the player
        'pve_win_count'        # number of PvE battles won
    ]
]
data
3. Remove missing values and duplicates
# Remove rows with missing values
print('Shape of the dataset before removing rows with missing values:', data.shape)
data_1 = data.dropna(axis=0, how='any')
print('Shape of the dataset after removing rows with missing values:', data_1.shape)
# Deduplicating only the 'user_id' column would return a Series, so deduplicate the whole DataFrame instead
# data1 = data_1['user_id'].drop_duplicates()
data1 = data_1.drop_duplicates()
print('Total number of player IDs after removing duplicates with drop_duplicates:', len(data1))
The processed dataset is named data1.
4. Similarity matrix of three features
# Pearson correlation matrix of three features: PvP battle count, PvP battles initiated, and PvP battles won
corr_data1 = data[['pvp_battle_count', 'pvp_lanch_count', 'pvp_win_count']].corr(method='pearson')
print('Correlation of PvP battle count, PvP battles initiated, and PvP battles won:\n', corr_data1)
5. Slice out the required new features and name the result data2 (a sketch follows below)
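The slicing code for this step is not shown in the report; below is a minimal sketch, assuming data2 holds user_id plus the three payment-related columns (the later pd.cut and pd.merge calls reference data2['pay_count'] and join on user_id):
# Slice out the payment-related features and name the result data2
# (assumption: data2 contains user_id plus the three payment columns)
data2 = df[
    [
        'user_id',               # unique player ID, used as the merge key later
        'pay_price',             # recharge amount
        'pay_count',             # number of recharges
        'prediction_pay_price'   # predicted recharge amount
    ]
]
data2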
6. Dispersion Normalization
#Dispersion standardization
#Custom deviation standardization function
def min_max_scale(data1):
data1 = (data1 - data1.min())/ (data1.max()- data1.min())
return data1
#On average online time Standardized deviation
time_min_max = min_max_scale(data1['avg_online_minutes'])
print('Online time data before normalized deviation is:\n',data1['avg_online_minutes'])
print('Online time data after normalized deviation is:\ n', time_min_max)
7. Inner join, outer join, and saving the preprocessed dataset
# Merge data1 and data2 with an outer join
print('Shape of the outer-joined data frame:',
      pd.concat([data1, data2], axis=1, join='outer').shape)
# Merge data1 and data2 with an inner join
print('Shape of the inner-joined data frame:',
      pd.concat([data1, data2], axis=1, join='inner').shape)
data3 = pd.merge(data1, data2, how='inner', on='user_id')
data3.to_csv('./data/吴硕秋202006180058.csv', sep=';', index=False)
2. Data exploration and feature construction
1. Analysis of player activity
(1) Calculate the average online duration of all players
avg_time = data3.avg_online_minutes.mean()
avg_time
(2) Calculate the average online duration of paying players
pay_avg_time = data3[data3.pay_price > 0].avg_online_minutes.mean()
pay_avg_time
# Use equal-width discretization to examine the distribution of recharge counts
pay_cut = pd.cut(data2['pay_count'], 40)
print('Distribution of recharge counts after discretization:\n', pay_cut.value_counts())
(3) Draw boxplots of players' average online time
# Boxplot of the average online time of all players
plt.figure(figsize=(10, 10))
plt.boxplot(data3.avg_online_minutes)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
plt.title('Box plot of average online time of all players')
plt.show()
# Boxplot of the average online time of paying players
plt.figure(figsize=(10, 10))
plt.boxplot(data3[data3.pay_price > 0].avg_online_minutes)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
plt.title('Box plot of average online time of paying players')
plt.show()
# Average online time of players who have fought at least one PvP battle
pvp_avg_time = data3[data3.pvp_battle_count > 0].avg_online_minutes.mean()
Evaluation
The average online time of all players is 9.6 minutes, while that of paying players is 135.8 minutes, roughly 14 times the overall average. Paying players are clearly more active.
2. Analysis of player payment rate
(1) Count the number of players whose payment count exceeds 0 (a sketch follows below)
(2) Draw a pie chart of the payment rate (the full pie-chart code appears in the data visualization section)
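The counting code is not reproduced in the report; a minimal sketch, assuming pay_num denotes the number of players in data3 with pay_count greater than 0:
# Number of players with at least one payment (pay_count > 0)
pay_num = (data3.pay_count > 0).sum()
# Payment rate = paying players / all players
pay_rate = pay_num / len(data3)
print('Number of paying players:', pay_num)
print('Payment rate:', round(pay_rate, 4))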
3. Player payment analysis and correlation exploration
(1) Define HY, total_pay, HY_AVG, HY_PAY_COUNT, PAY_AVG, and PAY_PRO as the number of active players, total income, average income per active player, number of active paying players, average income per paying player, and payment rate, respectively (a sketch of these definitions follows below).
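The report does not show the code for these metrics; the sketch below is one possible reading, assuming "active" means avg_online_minutes > 10 (the same threshold used in the gaming-habits analysis below):
# Number of active players
HY = data3[data3.avg_online_minutes > 10].user_id.count()
# Total income
total_pay = data3.pay_price.sum()
# Average income per active player (assumption: total income divided by the number of active players)
HY_AVG = total_pay / HY
# Number of active paying players
HY_PAY_COUNT = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].user_id.count()
# Average income per paying player
PAY_AVG = total_pay / data3[data3.pay_price > 0].user_id.count()
# Payment rate = paying players / all players
PAY_PRO = data3[data3.pay_price > 0].user_id.count() / len(data3)
print(HY, total_pay, HY_AVG, HY_PAY_COUNT, PAY_AVG, PAY_PRO)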
(2) The relationship between active players and recharge amount
The payment rate of this game is low and there is room for improvement; targeted activities could be run to increase it.
The average spend per paying player in this game is 32, which shows that paying users as a whole have strong spending power. Follow-up analysis of paying users can help ensure that they keep paying.
4. Analysis of players' gaming habits
(1)
# Average number of PvP battles per active user
HY_pvp_battle_coun = data3[data3.avg_online_minutes > 10].pvp_battle_count.mean()
# Total number of PvP battles of active users
HY_count_pvp = data3[data3.avg_online_minutes > 10].pvp_battle_count.sum()
# Number of PvP battles initiated by active users
HY_count_lanch_pvp = data3[data3.avg_online_minutes > 10].pvp_lanch_count.sum()
# Probability that an active user initiates a PvP battle
HY_rate_lanch_pvp = HY_count_lanch_pvp / HY_count_pvp
# Total number of PvP victories of active users
HY_num_win_pvp = data3[data3.avg_online_minutes > 10].pvp_win_count.sum()
# PvP win rate of active users
HY_rate_win_pvp = HY_num_win_pvp / HY_count_pvp
print(f'Average PvP battles of active users: {HY_pvp_battle_coun}')
print(f'Probability that active users initiate PvP: {HY_rate_lanch_pvp}')
print(f'PvP win rate of active users: {HY_rate_win_pvp}')
(2)
# Average number of PvE battles per active user
HY_pve_battle_coun = data3[data3.avg_online_minutes > 10].pve_battle_count.mean()
# Total number of PvE battles of active users
HY_count_pve = data3[data3.avg_online_minutes > 10].pve_battle_count.sum()
# Number of PvE battles initiated by active users
HY_count_lanch_pve = data3[data3.avg_online_minutes > 10].pve_lanch_count.sum()
# Probability that an active user initiates a PvE battle
HY_rate_lanch_pve = HY_count_lanch_pve / HY_count_pve
# Total number of PvE victories of active users (same >10-minute activity threshold as above)
HY_num_win_pve = data3[data3.avg_online_minutes > 10].pve_win_count.sum()
# PvE win rate of active users
HY_rate_win_pve = HY_num_win_pve / HY_count_pve
print(f'Average PvE battles of active users: {HY_pve_battle_coun}')
print(f'Probability that active users initiate PvE: {HY_rate_lanch_pve}')
print(f'PvE win rate of active users: {HY_rate_win_pve}')
(3)
# Average number of PvP battles per active paying user
HY_PAY_COUNT_pvp_battle_coun = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_battle_count.mean()
# Total number of PvP battles of active paying users
HY_PAY_COUNT_count_pvp = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_battle_count.sum()
# Number of PvP battles initiated by active paying users
HY_PAY_COUNT_count_lanch_pvp = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_lanch_count.sum()
# Probability that an active paying user initiates a PvP battle
HY_PAY_COUNT_rate_lanc_pvp = HY_PAY_COUNT_count_lanch_pvp / HY_PAY_COUNT_count_pvp
# Total number of PvP victories of active paying users
HY_PAY_COUNT_num_win_pvp = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_win_count.sum()
# PvP win rate of active paying users
HY_PAY_COUNT_rate_win_pvp = HY_PAY_COUNT_num_win_pvp / HY_PAY_COUNT_count_pvp
print(f'Average PvP battles of active paying users: {HY_PAY_COUNT_pvp_battle_coun}')
print(f'Probability that active paying users initiate PvP: {HY_PAY_COUNT_rate_lanc_pvp}')
print(f'PvP win rate of active paying users: {HY_PAY_COUNT_rate_win_pvp}')
(4)
# Average number of PvE battles per active paying user
HY_PAY_COUNT_pve_battle_coun = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_battle_count.mean()
# Total number of PvE battles of active paying users
HY_PAY_COUNT_count_pve = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_battle_count.sum()
# Number of PvE battles initiated by active paying users
HY_PAY_COUNT_count_lanch_pve = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_lanch_count.sum()
# Probability that an active paying user initiates a PvE battle
HY_PAY_COUNT_rate_lanc_pve = HY_PAY_COUNT_count_lanch_pve / HY_PAY_COUNT_count_pve
# Total number of PvE victories of active paying users
HY_PAY_COUNT_num_win_pve = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_win_count.sum()
# PvE win rate of active paying users
HY_PAY_COUNT_rate_win_pve = HY_PAY_COUNT_num_win_pve / HY_PAY_COUNT_count_pve
print(f'Average PvE battles of active paying users: {HY_PAY_COUNT_pve_battle_coun}')
print(f'Probability that active paying users initiate PvE: {HY_PAY_COUNT_rate_lanc_pve}')
print(f'PvE win rate of active paying users: {HY_PAY_COUNT_rate_win_pve}')
Visualization
Comment
1) The average PvE and PvP battle counts of active paying players are higher than those of active players in general, so active paying players are more willing to spend time in this game.
2) In PvP battles, the win rate of active paying players is much higher than that of active players, indicating that the game's paid items let active paying accounts (APA) enjoy the fun of winning battles.
3. The source code, annotations and operation results of the construction and evaluation of the classification model
1. First construct a feature heat matrix to understand the relationship between each feature.
This part requires building a regression model and a classification model and comparing them, so understanding the relationships between the features is particularly important. Before that, a new binary feature is created that labels players whose online time is less than half of the average online time of all players, and it is appended to data3 as its last column (a sketch follows below).
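The heat-map and feature-construction code is not shown in the report; a minimal sketch follows, assuming the new label is named feature and that seaborn is available for the heat map:
import seaborn as sns

# Correlation heat map of all features to inspect their relationships
plt.figure(figsize=(10, 8))
sns.heatmap(data3.corr(), annot=True, cmap='Blues')
plt.title('Feature correlation heat map')
plt.show()

# New binary label: 1 if the player's online time is below half of the overall average, 0 otherwise
# (appended to data3 as its last column)
data3['feature'] = (data3['avg_online_minutes'] < data3['avg_online_minutes'].mean() / 2).astype(int)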
2. Model construction: dataset division
The new feature created in the previous section serves as the label for the analysis.
This part splits the data into features and labels, divides the training and test sets, and standardizes the dataset so that the algorithms can use it directly.
A confusion-matrix evaluation helper is also defined here.
# Split features and labels
data3_data = data3.iloc[:, :-1]
data3_target = data3.iloc[:, -1]
# Split training and test sets
from sklearn.model_selection import train_test_split
data3_data_train, data3_data_test, data3_target_train, data3_target_test = train_test_split(
    data3_data, data3_target, test_size=0.2, random_state=66)
# Standardize the dataset
from sklearn.preprocessing import StandardScaler
stdScale = StandardScaler().fit(data3_data_train)
data3_trainScaler = stdScale.transform(data3_data_train)
data3_testScaler = stdScale.transform(data3_data_test)
# Confusion matrix and evaluation helper
from sklearn.metrics import confusion_matrix
def test_pre(pred):
    hx = confusion_matrix(data3_target_test, pred)
    print('Confusion matrix:\n', hx)
    # Precision
    P = hx[1, 1] / (hx[0, 1] + hx[1, 1])
    print('Precision:\n', round(P, 3))
    # Recall
    R = hx[1, 1] / (hx[1, 0] + hx[1, 1])
    print('Recall:\n', round(R, 3))
    # F1 score
    F1 = 2 * P * R / (P + R)
    print('F1 score:', round(F1, 3))
Undersampling is applied to balance the classes before training (a sketch follows below).
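No undersampling code appears in the report; the pandas-only sketch below is one possible implementation, randomly downsampling the majority class of the training labels to the minority-class size (random_state=66 matches the split above):
# Combine the standardized training features with their labels
train_df = pd.DataFrame(data3_trainScaler, columns=data3_data.columns)
train_df['label'] = data3_target_train.values

# Randomly downsample the majority class to the size of the minority class
counts = train_df['label'].value_counts()
minority, majority = counts.idxmin(), counts.idxmax()
majority_down = train_df[train_df['label'] == majority].sample(n=counts.min(), random_state=66)
balanced = pd.concat([train_df[train_df['label'] == minority], majority_down])

X_train_bal = balanced.drop(columns='label').values
y_train_bal = balanced['label'].values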
2. Building and evaluating a classification model with the SVM algorithm, plus the ROC curve
The SVM algorithm is used to predict on the test set and the first 20 predictions are displayed. The prediction yielded 2,125 correct results and 69 wrong results, an accuracy of about 96%.
In the evaluation, the F1 scores for the two classes are 0.98 and 0.94.
Draw the ROC curve
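The SVM training, evaluation, and ROC code is not reproduced in the report; a sketch under the setup above (standardized features, the binary label, the test_pre helper) could look like this:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, roc_curve, auc

# Train an SVM classifier on the standardized training set
svm_model = SVC(probability=True).fit(data3_trainScaler, data3_target_train)
svm_pred = svm_model.predict(data3_testScaler)
print('First 20 predictions:', svm_pred[:20])

# Confusion matrix, precision, recall, and F1 via the helper defined above
test_pre(svm_pred)
print(classification_report(data3_target_test, svm_pred))

# ROC curve based on the predicted probability of the positive class
svm_score = svm_model.predict_proba(data3_testScaler)[:, 1]
fpr, tpr, _ = roc_curve(data3_target_test, svm_score)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'SVM (AUC = {auc(fpr, tpr):.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve of the SVM model')
plt.legend()
plt.show()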
3. Building and evaluating a classification model with Gaussian Naive Bayes
The procedure is the same as for the SVM; only the algorithm differs.
Model evaluation: compared with the SVM, this model's accuracy is noticeably lower at 75%. According to the classification report, the per-class precision values are 0.98 and 0.54 and the F1 scores are 0.80 and 0.70, a clear gap from the SVM model; all figures are lower.
Draw the ROC curve
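As above, the Gaussian Naive Bayes code is not shown; a sketch mirroring the SVM steps might be:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, roc_curve, auc

# Train a Gaussian Naive Bayes classifier on the same standardized data
gnb_model = GaussianNB().fit(data3_trainScaler, data3_target_train)
gnb_pred = gnb_model.predict(data3_testScaler)

# Evaluate with the confusion-matrix helper and the classification report
test_pre(gnb_pred)
print(classification_report(data3_target_test, gnb_pred))

# ROC curve from the predicted positive-class probabilities
gnb_score = gnb_model.predict_proba(data3_testScaler)[:, 1]
fpr, tpr, _ = roc_curve(data3_target_test, gnb_score)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'GaussianNB (AUC = {auc(fpr, tpr):.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve of the Gaussian Naive Bayes model')
plt.legend()
plt.show()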
4. The source code, annotations and operation results of the construction and evaluation of the regression model
1. Divide the dataset
The label of this experiment is the number of recharges (pay_count).
This part splits the data into features and labels, divides the training and test sets, and standardizes the dataset so that the algorithms can use it directly (a sketch follows below).
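The splitting code for the regression task is not shown; the sketch below assumes pay_count is the regression target and the remaining columns (minus the earlier classification label, if present) are the features:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features: everything except the target (and the earlier classification label, if present)
reg_data = data3.drop(columns=['pay_count', 'feature'], errors='ignore')
reg_target = data3['pay_count']

data3_data_train, data3_data_test, data3_target_train, data3_target_test = train_test_split(
    reg_data, reg_target, test_size=0.2, random_state=66)

# Standardize features using statistics from the training set only
stdScale = StandardScaler().fit(data3_data_train)
data3_trainScaler = stdScale.transform(data3_data_train)
data3_testScaler = stdScale.transform(data3_data_test)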
2. Building the random forest regression model
The random forest regression algorithm is used (a sketch follows below).
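A minimal sketch of the random forest regressor (the report does not list its hyperparameters, so defaults are assumed):
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest regressor on the standardized training data
rf_model = RandomForestRegressor(random_state=66)
rf_model.fit(data3_trainScaler, data3_target_train)

# Predict the recharge counts of the test set (used by the plot below)
y_pred = rf_model.predict(data3_testScaler)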
Draw a visualization of the regression results
from matplotlib import rcParams
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
fig = plt.figure(figsize=(12, 6))
plt.plot(range(data3_target_test.shape[0]), list(data3_target_test), color='blue')
plt.plot(range(data3_target_test.shape[0]), y_pred, color='red', linewidth=2.5, linestyle='-.')
plt.xlabel('sample index')
plt.ylabel('target value')
plt.legend(['true result', 'predicted result'])
plt.show()
Print and view the regression report
Import the relevant metric functions and inspect the random forest model's scores (a sketch follows below).
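A sketch of the regression report, assuming the standard sklearn.metrics functions are used:
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             explained_variance_score, r2_score)

print('Mean absolute error of the random forest model:', mean_absolute_error(data3_target_test, y_pred))
print('Mean squared error of the random forest model:', mean_squared_error(data3_target_test, y_pred))
print('Explained variance of the random forest model:', explained_variance_score(data3_target_test, y_pred))
print('R-squared of the random forest model:', r2_score(data3_target_test, y_pred))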
3. Building the support vector regression model
Model building
Results visualization
Print and view the regression report (a sketch covering these three steps follows below)
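No SVR code is shown in the report; a sketch covering model building, result visualization, and the regression report, mirroring the random forest steps:
from sklearn.svm import SVR
from sklearn.metrics import explained_variance_score, r2_score

# Fit a support vector regressor on the standardized training data
svr_model = SVR()
svr_model.fit(data3_trainScaler, data3_target_train)
svr_pred = svr_model.predict(data3_testScaler)

# Visualize true vs. predicted values
plt.figure(figsize=(12, 6))
plt.plot(range(data3_target_test.shape[0]), list(data3_target_test), color='blue')
plt.plot(range(data3_target_test.shape[0]), svr_pred, color='red', linewidth=2.5, linestyle='-.')
plt.xlabel('sample index')
plt.ylabel('target value')
plt.legend(['true result', 'predicted result'])
plt.show()

# Regression report
print('Explained variance of the SVR model:', explained_variance_score(data3_target_test, svr_pred))
print('R-squared of the SVR model:', r2_score(data3_target_test, svr_pred))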
Evaluation and comparison of the two models
Compared with the support vector regression model, the random forest model has a higher R-squared of 0.84, and its explained variance is also 0.84.
5. Comparison and explanation of analysis results of various models
Evaluate and compare two regression models
Compared with the support vector regression model, the random forest model has a higher R-squared of 0.84, and its explained variance is also 0.84.
Comparative evaluation of classification models
Compared with the SVM, the accuracy of the Gaussian Naive Bayes model is noticeably lower at 75%. According to the classification report, the per-class precision values are 0.98 and 0.54 and the F1 scores are 0.80 and 0.70, a clear gap from the SVM model; all figures are lower.
6. Application of data visualization technology
1. The source code and operation results of the first data visualization technology, and a brief description
Box plots are drawn here. A box plot clearly displays five statistics of the data: the minimum, the lower quartile, the median, the upper quartile, and the maximum, in a concise and easy-to-read form. Since this visualization analyzes average online time and requires comparing several statistics at once, a box plot is well suited.
# Boxplot of the average online time of paying players
plt.figure(figsize=(10, 10))
plt.boxplot(data3[data3.pay_price > 0].avg_online_minutes)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
plt.title('Box plot of average online time of paying players')
plt.show()
2. The source code and operation results of the second data visualization technology, and a brief description
The advantage of the pie chart is that it emphasizes proportions and is easy to understand. Here it is used to compare the numbers of paying and non-paying players; for such a simple breakdown, the pie chart shows the proportion and the gap between the two at a glance.
# Draw a percentage pie chart
plt.figure(figsize=(8, 8))
# Draw the pie
patches, l_text, p_text = plt.pie([22877 - pay_num, pay_num],
                                  labels=['unpaid', 'paid'],
                                  labeldistance=0.3,
                                  colors=['#87CEFA', '#FFC0CB'],
                                  explode=[0.01, 0.05],
                                  autopct='%1.1f%%',
                                  pctdistance=1.15)
# Set label font size
for t in l_text:
    t.set_size(20)
# Set percentage font size
for t in p_text:
    t.set_size(20)
# Set title
plt.title('The ratio of paid users to all users', size=25)
plt.show()
3. The source code and operation results of the third data visualization technology, and a brief description
The advantage of the bar chart is that the height of each bar conveys the numerical difference between categories, and a single simple chart makes comparison easy. It is therefore convenient to place several pairwise comparisons in one chart to show the gaps.
plt.figure(figsize=(15, 8))
# Active players
plt.bar([0.75, 2.75, 4.75, 6.75],
        [HY_rate_lanch_pve, HY_rate_win_pve, HY_rate_lanch_pvp, HY_rate_win_pvp],
        width=0.5, alpha=0.5, label='Active players')
# Active paying players
plt.bar([1.25, 3.25, 5.25, 7.25],
        [HY_PAY_COUNT_rate_lanc_pve, HY_PAY_COUNT_rate_win_pve, HY_PAY_COUNT_rate_lanc_pvp, HY_PAY_COUNT_rate_win_pvp],
        width=0.5, color='r', alpha=0.5, label='Active paying players')
plt.xticks([1, 3, 5, 7],
           ['Probability of initiating PVE', 'PVE win rate', 'Probability of initiating PVP', 'PVP win rate'])
plt.legend()
plt.show()
7. Course conclusion and experience
This course report brought together and consolidated everything learned in class. Above all, it gave me a new understanding of data preprocessing, exploratory analysis, and the construction of regression and classification models.
Data preprocessing is an important step in any data analysis, machine learning, or deep learning project; its purpose is to produce a complete, standardized, clear, and consistent dataset so that subsequent analysis, model fitting, or prediction can be carried out effectively. Data cleaning is the first step of preprocessing and includes removing duplicate values, handling missing values, and dealing with outliers. Feature processing is another important part: selecting, extracting, and transforming data features. After the initial cleaning, the data must be organized and refined features extracted. Successful preprocessing matters before the large-scale analysis and machine learning phases, because it improves accuracy and separates the useful signal in the samples from the noise that would otherwise distort the final results.
Exploratory analysis, carried out before the data preprocessing and modeling stages, means drilling into the details of the dataset and extracting as much information and as many potential insights as possible, so that the subsequent algorithm and modeling work can be planned efficiently and with less risk.
Regression models address a common supervised learning task, predicting numerical variables, and play an important role in many practical problems. Understanding the different types of regression models and their evaluation criteria, and mastering optimization strategies, leads to better model selection, construction, and validation, and to better predictions.
Classification models address another common supervised learning task and likewise play an important role in many practical problems. One should understand the different types of classification models and their evaluation metrics, and master the corresponding optimization techniques and methods.
The biggest lesson from this project is that the details matter. For example, when removing duplicates during preprocessing, drop_duplicates should be used rather than converting to a list or other workarounds, because only drop_duplicates preserves the structure of the DataFrame. Another lesson concerns the classification part: before making predictions, a heat map of feature correlations should be drawn to build an intuitive understanding of the relationships between features.