2023 Second DingTalk Cup College Student Big Data Challenge, Preliminary Round Problem A: Smartphone User Monitoring Data Analysis (Problem 2: Binary Classification and Regression, Python Code Walkthrough)
1 Problem statement
2023 Second DingTalk Cup College Student Big Data Challenge, Preliminary Round, Problem A: Smartphone User Monitoring Data Analysis
1. Problem background
In recent years smartphones have spread explosively, which has expanded China's smartphone market and driven rapid growth in mobile software. Brand competition in that market has intensified, and China has surpassed the United States to become the world's largest smartphone market. Mobile software changes daily: it makes phones more comfortable to use and brings a lot of fun to people's lives, but it has also created a new group of "head-down" users who are constantly looking at their phones. Apps for games, shopping, social networking, news, and financial management have become part of daily life, making the phone a must-have item whenever people go out.
The data comes from a company's monitoring of more than 40,000 smartphone users over 30 consecutive days in a certain year, and has been desensitized and transformed. Each day's data is a txt file with 10 columns that records, for every user (identified by uid) and every app used that day (identified by appid), the start time, duration of use, and upstream and downstream traffic. See Table 1 for details. There is also an auxiliary table, app_class.csv, with two columns: the first is appid and the second is the app category (app_class). It covers more than 4,000 commonly used apps in 20 common categories (social, film and television, education, and so on), each represented by an English letter from a to t. The remaining apps are not commonly used and their category is unknown.
Table 1
No. | Variable | Description |
---|---|---|
1 | uid | User id |
2 | appid | App id (matches the first column of app_class.csv) |
3 | app_type | App type: built into the system or installed by the user |
4 | start_day | Start day of use, 1-30 (note: the first two rows of the first day's data have start day 0, meaning the use began on the previous day) |
5 | start_time | Start time of use |
6 | end_day | End day of use |
7 | end_time | End time of use |
8 | duration | Duration of use (seconds) |
9 | up_flow | Upstream traffic |
10 | down_flow | Downstream traffic |
2. Problems to solve
- Cluster analysis
(1) Cluster users according to their usage of the 20 common app categories. Compare at least three different clustering algorithms, choose a reasonable number of clusters K, and analyze the clustering results.
(2) Build user portraits for the different groups according to the clustering results, and analyze the characteristics of each group. (Definition of a user portrait: labeling users according to their attributes, preferences, behavior habits, and other information in order to describe the behavior of different user groups, so that different categories of app products can be recommended to different groups.)
- Predictive analysis of app usage: based on a user's app usage records, predict whether the user will use an app in the future (classification problem) and for how long (regression problem).
(1) Usage prediction: based on each user's usage of category a apps on days 1 to 11, predict whether the user will use category a apps on days 12 to 21. Report the accuracy of the predictions against the true results. (Note: the test set must not take part in training or validation; otherwise it is treated as a violation.)
(2) Duration prediction: based on each user's usage of category a apps on days 1 to 11, predict the user's effective daily average usage time of category a apps on days 12 to 21. The evaluation metric is MMSE:
$$\mathrm{MMSE} = \sqrt{\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\overline{y})^2}}$$
where $y_i$ is the actual usage duration, $\hat{y}_i$ is the predicted usage duration, and $\overline{y}$ is the mean of the actual usage durations over all users. Report this metric (the problem statement also calls it NMSE) between the predicted and true results. (Note: the test set must not take part in training or validation; otherwise it is treated as a violation.)
The data.csv data comes from a company's monitoring of more than 40,000 smartphone users over 30 consecutive days in a certain year. The table format is as follows. Use an XGBoost model to predict each user's effective daily average usage time of category a apps on days 12 to 21, based on the user's usage of category a apps on days 1 to 11. The evaluation metric is MMSE.
No. | Variable | Description |
---|---|---|
1 | uid | User id |
2 | category | App category (a to z, 26 categories in total) |
3 | app_type | App type: built into the system or installed by the user |
4 | start_day | Start day of use, 1-30 (note: the first two rows of the first day's data have start day 0, meaning the use began on the previous day) |
5 | start_time | Start time of use |
6 | end_day | End day of use |
7 | end_time | End time of use |
8 | duration | Duration of use (seconds) |
9 | up_flow | Upstream traffic |
10 | down_flow | Downstream traffic |
2 Modeling ideas
Problem 1:
- Data preprocessing: clean the usage data for the 20 common app categories and extract features from it. Dimensionality reduction such as PCA (or LDA, if class labels are available) can reduce the computational cost.
- Clustering algorithms:
a. K-means: run several experiments with different values of K and keep the best clustering. Metrics such as the silhouette coefficient and the Calinski-Harabasz index can be used to compare the results.
b. DBSCAN: clusters data points by density, so the number of clusters does not have to be specified in advance. Adjusting the radius parameter and the density (minimum points) parameter gives different clusterings.
c. Hierarchical clustering: comes in top-down and bottom-up variants. The bottom-up variant repeatedly computes the similarity between clusters and merges the closest ones until the final clustering is obtained.
d. Improved clustering algorithms
e. Deep clustering algorithms
- Clustering result analysis: after choosing the best clustering, build a portrait of each user group. Analyze each group's behavior (time of use, frequency of use, duration of use, preferences, and so on), label users according to these portraits, and recommend different categories of app products based on the labels.
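The comparison described above can be sketched as follows. This is a minimal illustration on synthetic records, not the author's Problem 1 code: the column names, the log transform, the range of K, and the DBSCAN parameters `eps` and `min_samples` are all assumptions chosen for the example. On random data like this the scores will be low, and DBSCAN may well find no usable clusters (reported as `None`).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for usage records: (uid, category, duration in seconds).
rng = np.random.default_rng(0)
categories = list("abcdefghijklmnopqrst")  # the 20 common categories
records = pd.DataFrame({
    "uid": rng.integers(0, 200, size=5000),
    "category": rng.choice(categories, size=5000),
    "duration": rng.exponential(300, size=5000),
})

# One row per user, one column per category: total seconds of use.
user_matrix = records.pivot_table(index="uid", columns="category",
                                  values="duration", aggfunc="sum", fill_value=0)
# Log transform tames heavy-tailed durations before standardizing (an assumption).
X = StandardScaler().fit_transform(np.log1p(user_matrix.values))

def score(labels, X):
    """Return (silhouette, Calinski-Harabasz), ignoring DBSCAN noise points (-1)."""
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return None  # both scores are undefined for fewer than 2 clusters
    return (silhouette_score(X[mask], labels[mask]),
            calinski_harabasz_score(X[mask], labels[mask]))

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    results[f"kmeans_k{k}"] = score(km.fit_predict(X), X)
    results[f"agglo_k{k}"] = score(AgglomerativeClustering(n_clusters=k).fit_predict(X), X)
results["dbscan"] = score(DBSCAN(eps=4.0, min_samples=5).fit_predict(X), X)

for name, s in results.items():
    print(name, s)
```

A higher silhouette coefficient and a higher Calinski-Harabasz index both indicate tighter, better separated clusters, so K can be chosen where these scores peak for K-means and hierarchical clustering.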
Problem 2:
- Data preprocessing: clean the users' app usage records and extract features from them, for example the number of times and the total duration each user used each app.
- Classification: build a classification model from the users' app usage records for days 1 to 11. Process the data with feature engineering, then train and test a suitable classification algorithm such as a decision tree, a random forest, a support vector machine, or an improved machine learning classifier. Finally, evaluate the model's accuracy on the test set.
- Regression: build a regression model from the users' app usage records for days 1 to 11. Process the data with feature engineering, then train and test a suitable regression algorithm such as linear regression, decision tree regression, or neural network regression. Evaluate the model on the test set, for example with the MMSE (NMSE) metric from the problem statement.
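For the classification task, the key step is turning session-level records into one row per user with a binary target. The sketch below shows one way to do that on a toy log. The column names follow the data.csv table, but the feature names (`a_count`, `a_total_duration`, `a_active_days`) and the label name `uses_a` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy usage log: one row per session. Column names follow the data.csv table.
log = pd.DataFrame({
    "uid":       [1, 1, 1, 2, 2, 3, 3, 4],
    "category":  ["a", "a", "b", "a", "c", "b", "a", "c"],
    "start_day": [2, 9, 15, 5, 13, 3, 18, 7],
    "duration":  [120, 300, 60, 45, 500, 30, 240, 90],
})

history = log[log["start_day"].between(1, 11)]   # observation window
future = log[log["start_day"].between(12, 21)]   # prediction window

# One feature row per user, built from category a sessions in days 1-11 only.
hist_a = history[history["category"] == "a"]
features = hist_a.groupby("uid").agg(
    a_count=("duration", "size"),
    a_total_duration=("duration", "sum"),
    a_active_days=("start_day", "nunique"),
)
# Users with no category a usage in the window get zeros, not missing rows.
all_users = pd.Index(sorted(history["uid"].unique()), name="uid")
features = features.reindex(all_users, fill_value=0)

# Binary label: 1 if the user has any category a session in days 12-21.
future_a_users = set(future.loc[future["category"] == "a", "uid"])
labels = pd.Series(features.index.isin(future_a_users).astype(int),
                   index=features.index, name="uses_a")

print(features.join(labels))
```

Because the features come only from days 1 to 11 and the label only from days 12 to 21, nothing from the prediction window leaks into training, which is what the competition rule about the test set requires.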
3 Problem 1 implementation code
4 Problem 2 implementation code
4.1 Classification problem: predict whether the user will use category a apps
(1) For the feature engineering part, see the Problem 1 blog post
(2) Data reading
The data for days 1-11 is used as the training set, and the data for days 12-21 is used as the test set.
Build a classifier. Note that a user has multiple usage records, so the test set needs to be deduplicated by user id before predicting.
import os
import warnings

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

warnings.filterwarnings("ignore")

# Read the full data set (assumed to be the feature files prepared in Problem 1)
# Read the training set
train_folder_path = '初赛数据集/训练集'
train_dfs = []
for filename in os.listdir(train_folder_path):
    if filename.endswith('.txt'):
        csv_path = os.path.join(train_folder_path, filename)
        tempdf = pd.read_csv(csv_path)
        train_dfs.append(tempdf)
train_df = pd.concat(train_dfs, axis=0)

# Read the test set
test_folder_path = '初赛数据集/测试集'
test_dfs = []
for filename in os.listdir(test_folder_path):
    if filename.endswith('.txt'):
        csv_path = os.path.join(test_folder_path, filename)
        tempdf = pd.read_csv(csv_path)
        test_dfs.append(tempdf)
test_df = pd.concat(test_dfs, axis=0)

# Extract features and labels
X_train = train_df.drop(['category', 'uid', 'appid'], axis=1)
y_train = train_df['category']
X_test = test_df.drop(['category', 'uid', 'appid'], axis=1)
y_test = test_df['category']
(3) Model training
# Train the decision tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_y_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_y_pred)
print('Accuracy of decision tree model:', dt_accuracy)
Accuracy of decision tree model: 0.8853211009174312
# Train the random forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_y_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_y_pred)
print('Accuracy of random forest model:', rf_accuracy)
Accuracy of random forest model: 0.9724770642201835
# Train the support vector machine model
svc_model = SVC(kernel='linear')
svc_model.fit(X_train, y_train)
svc_y_pred = svc_model.predict(X_test)
svc_accuracy = accuracy_score(y_test, svc_y_pred)
print('Accuracy of the SVM model:', svc_accuracy)
Accuracy of the SVM model: 0.9513251783893986
4.2 Regression problem
(1) Feature extraction
- Time features for each of the past 11 days: day, hour, and minute. Calendar features such as holidays, weekends, and working days can also be added
- Each user's total usage time of category a apps over the past 11 days
- The number of times each user used category a apps over the past 11 days
- The trend of each user's category a usage over the past 11 days
- Other features related to category a apps
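The total-duration, count, and trend features listed above could be computed along the following lines. This is an illustrative sketch on toy data, not the author's feature code: `a_active_days` and `a_trend` are hypothetical feature names, and the trend is taken here to be the least-squares slope of daily usage over days 1 to 11, which is only one possible definition.

```python
import numpy as np
import pandas as pd

# Toy category a sessions for days 1-11 (duration in seconds).
sessions = pd.DataFrame({
    "uid":       [1, 1, 1, 1, 2, 2, 3],
    "start_day": [1, 4, 8, 11, 2, 2, 6],
    "duration":  [600, 1200, 1800, 2400, 900, 300, 3600],
})
days = np.arange(1, 12)  # days 1..11

# Seconds of use per user per day; days without a session count as 0.
daily = (sessions.groupby(["uid", "start_day"])["duration"].sum()
         .unstack(fill_value=0)
         .reindex(columns=days, fill_value=0))

def slope(row):
    """Least-squares slope of daily usage against day number (seconds per day)."""
    return np.polyfit(days, row.to_numpy(dtype=float), deg=1)[0]

features = pd.DataFrame({
    "a_total_duration": daily.sum(axis=1),
    "a_count": sessions.groupby("uid").size(),
    "a_active_days": (daily > 0).sum(axis=1),
    "a_trend": daily.apply(slope, axis=1),
})
print(features)
```

A positive `a_trend` means usage grew over the window and a negative one means it faded, which gives the regressor a signal that the plain total cannot express.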
Read data
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import holidays
import os

# Load the app category file
app_class = pd.read_csv('初赛数据集/app_class.csv', names=['appid', 'app_type'])
app_type_dict = dict(zip(app_class['appid'], app_class['app_type']))

# Read the data; the folder contains day01.txt to day21.txt, 21 files in total
train_folder_path = '初赛数据集/问题2数据集'
train_dfs = []
cols = ['uid', 'appid', 'app_type', 'start_day', 'start_time', 'end_day', 'end_time', 'duration', 'up_flow', 'down_flow']
for filename in os.listdir(train_folder_path):
    if filename.endswith('.txt'):
        csv_path = os.path.join(train_folder_path, filename)
        tempdf = pd.read_csv(csv_path, names=cols)
        train_dfs.append(tempdf)
data = pd.concat(train_dfs, axis=0)
data.shape
Data preprocessing
# Map each appid to its category
data['category'] = data['appid'].map(app_type_dict)
# Parse the time columns
data['start_time'] = pd.to_datetime(data['start_time'])
data['end_time'] = pd.to_datetime(data['end_time'])
# Build the "usage duration (hours)" feature
# Note: .dt.seconds is only the seconds component of the difference (0 to 86399)
data['duration_hour'] = (data['end_time'] - data['start_time']).dt.seconds / 3600
# Drop rows with missing values (this includes apps whose category is unknown)
data = data.dropna()
# Extract time features
data['start_time_day'] = data.start_time.dt.day
data['start_time_hour'] = data.start_time.dt.hour
data['start_time_minute'] = data.start_time.dt.minute
# Remove outliers (for example durations below 0 or above 24 hours)
data = data[(data['duration_hour'] >= 0) & (data['duration_hour'] <= 24)]
# Build the training set and the test set
train = data[data['start_day'] <= 11]
test = data[(data['start_day'] >= 12) & (data['start_day'] <= 21)]
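Feature engineering
# Total usage time of category a apps per user over the past 11 days
# ... (omitted)
# Number of times each user used category a apps over the past 11 days
# ... (omitted)
# Merge the features into the training set and the test set
train = pd.merge(train, total_duration, on='uid', how='left')
train = pd.merge(train, count, on='uid', how='left')
test = pd.merge(test, total_duration, on='uid', how='left')
test = pd.merge(test, count, on='uid', how='left')
# Fill missing values
train = train.fillna(0)
test = test.fillna(0)
# Select the features to use
features = ['a_total_duration', 'a_count', 'start_time_day', 'start_time_hour', 'start_time_minute']
# Build the feature matrices and target variables for the training and test sets
# ... (omitted)
X_test = test[features].values
# Target: each user's mean usage duration (hours) in the test window
mean_test_duration = test.groupby('uid')['duration_hour'].mean()
y_test = test['uid'].map(dict(mean_test_duration))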
(2) Model training
The XGBoost regression model is used here; LightGBM, linear regression, decision tree regression, neural network regression, and other models can also be used. The hyperparameters also need tuning, for which grid search is one option.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Standardize the features: fit the scaler on the training set only,
# then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# XGBoost regression model; linear regression, decision tree regression,
# or neural network regression could be used instead
xgbmodel = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_jobs=-1,
    n_estimators=1000,
    max_depth=7,
    subsample=0.8,
    learning_rate=0.05,
    gamma=0,
    colsample_bytree=0.9,
    random_state=2021,
    reg_alpha=0.3,  # L1 regularization on the leaf weights
)

# Train the model
xgbmodel.fit(X_train, y_train)
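The grid search mentioned for parameter tuning can be sketched as follows, scoring each candidate with the competition metric rather than the default. To keep the example self-contained it uses scikit-learn's `GradientBoostingRegressor` as a stand-in and synthetic data. `xgb.XGBRegressor` follows the same scikit-learn estimator interface, so it can be passed to `GridSearchCV` in the same way. The grid values and the names `X_demo` and `y_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def mmse(y_true, y_pred):
    """sqrt( sum (y - y_hat)^2 / sum (y - mean(y))^2 ); lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.sum((y_true - y_pred) ** 2)
                   / np.sum((y_true - y_true.mean()) ** 2))

# Synthetic stand-in for the training features and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# greater_is_better=False tells GridSearchCV to minimise the metric;
# internally the scores are stored negated.
scorer = make_scorer(mmse, greater_is_better=False)

param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, subsample=0.8, random_state=0),
    param_grid, scoring=scorer, cv=3,
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("cross-validated MMSE:", -search.best_score_)
```

Cross-validation here runs only on the training data, so the test set stays untouched until the final evaluation.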
(3) Model evaluation
The metric is implemented according to the formula given in the problem statement:
$$\mathrm{MMSE} = \sqrt{\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\overline{y})^2}}$$
from sklearn.metrics import mean_squared_error, mean_absolute_error

def MMSE(y_test, y_pred):
    # Residuals between the actual and predicted values
    error = y_test - y_pred
    # Numerator: sum of squared residuals; denominator: sum of squared deviations from the mean
    numerator = np.sum(np.square(error))
    denominator = np.sum(np.square(y_test - np.mean(y_test)))
    # Compute MMSE
    mmse = np.sqrt(numerator / denominator)
    return mmse

# Predict on the test set
y_pred = xgbmodel.predict(X_test)
# Compute the evaluation metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mmse = MMSE(y_test, y_pred)
print("MMSE: {:.4f},MSE: {:.4f}, MAE: {:.4f}".format(mmse, mse, mae))
MMSE: 1.0709,MSE: 0.0432, MAE: 0.1181
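To read an MMSE value, it helps to compare it with the trivial model that always predicts the mean of the true values. That baseline scores exactly 1 by construction, so a value above 1, such as the 1.0709 printed above, means the squared error is larger than that of the mean baseline on the same test set. The metric is also tied to the coefficient of determination by MMSE² = 1 − R². The short check below uses made-up numbers purely to illustrate both facts.

```python
import numpy as np
from sklearn.metrics import r2_score

def mmse(y_true, y_pred):
    """sqrt( sum (y - y_hat)^2 / sum (y - mean(y))^2 )."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.sum((y_true - y_pred) ** 2)
                   / np.sum((y_true - y_true.mean()) ** 2))

# Made-up durations in hours, for illustration only.
y_true = np.array([0.10, 0.25, 0.05, 0.40, 0.20])
y_pred = np.array([0.12, 0.20, 0.10, 0.35, 0.22])

# Baseline: always predict the mean of the true values.
baseline = np.full_like(y_true, y_true.mean())

print("baseline MMSE:", mmse(y_true, baseline))        # 1 up to rounding
print("model MMSE:   ", mmse(y_true, y_pred))
print("sqrt(1 - R^2):", np.sqrt(1 - r2_score(y_true, y_pred)))  # matches the model MMSE
```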
5 Downloads
The complete code for all problems can be downloaded from the link at the bottom of the Zhihu article:
zhuanlan.zhihu.com/p/643785015