2023 Second DingTalk Cup College Student Big Data Challenge, Preliminary Round Problem A: Smartphone User Monitoring Data Analysis. Binary Classification and Regression, Python Code Analysis

Related Links

[2023 2nd DingTalk Cup College Student Big Data Challenge, Preliminary Round] Problem A: Smartphone user monitoring data analysis, Python code analysis

[2023 2nd DingTalk Cup College Student Big Data Challenge, Preliminary Round] Problem A: Smartphone user monitoring data analysis, binary classification and regression, Python code analysis

1 Problem statement

2023 Second DingTalk Cup College Student Big Data Challenge, Preliminary Round Problem A: Smartphone User Monitoring Data Analysis

1. Problem background

In recent years smartphones have spread explosively, driving both the growth of China's smartphone market and the rapid development of mobile software. Brand competition in that market has intensified, and China has overtaken the United States as the world's largest smartphone market. Mobile software changes constantly, making phones more comfortable to use and more entertaining, and it has also produced a new group of "phubbers" (people who keep their heads bowed over their phones). Apps for games, shopping, social networking, news, and financial management have become part of daily life and have made the phone a must-carry item.

The data are monitoring records of more than 40,000 smartphone users of one company over 30 consecutive days in a certain year, and have been desensitized and transformed. Each day's data is a txt file with 10 columns, recording for each user (identified by uid) and each APP (identified by appid) the start time, duration of use, and upstream and downstream traffic. See Table 1 for details. An auxiliary table, app_class.csv, has two columns: the first is appid and the second is the category (app_class) of more than 4,000 commonly used APPs, such as social, film and television, or education. Categories are denoted by the English letters a to t, 20 common categories in total; the remaining APPs are not commonly used and their category is unknown.

Table 1

No.  Variable name  Description
1    uid            User ID
2    appid          APP ID (corresponds to the first column of the app_class file)
3    app_type       APP type: built into the system or installed by the user
4    start_day      Start day of use, value 1-30 (note: the first two rows of the day-1 data have start day 0, meaning the use began on the previous day)
5    start_time     Start time of use
6    end_day        End day of use
7    end_time       End time of use
8    duration       Duration of use (seconds)
9    up_flow        Upstream traffic
10   down_flow      Downstream traffic

2. Problems to solve

  1. Cluster analysis

(1) Cluster users according to their usage data for the 20 common APP categories. At least three different clustering algorithms must be compared, a reasonable number of clusters K must be chosen, and the clustering results must be analyzed.

(2) Profile the different user groups according to the clustering results and analyze the characteristics of each group. (Definition of a user portrait: label users according to their attributes, preferences, behavior habits, and other information to describe the behavior of each group, so that different categories of APP products can be recommended to different groups.)

  2. Predictive analysis of APP usage: based on users' APP usage records, predict whether a user will use an APP category in the future (a classification problem) and for how long (a regression problem).

(1) Predict whether the user will use category a APPs on days 12 to 21, based on the user's usage of category a APPs on days 1 to 11. Report the accuracy of the predictions against the true results. (Note: the test set must not take part in training or validation; doing so is treated as a violation.)

(2) Predict the effective daily average usage time of category a APPs on days 12 to 21, based on the user's usage of category a APPs on days 1 to 11. The evaluation index is MMSE.
$$
\mathrm{MMSE} = \sqrt{\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\overline{y})^2}}
$$

where $y_i$ is the actual usage duration, $\hat{y}_i$ is the predicted usage duration, and $\overline{y}$ is the mean of the actual usage durations over all users. The differences are squared, matching the implementation in Section 4.2. Report this metric (the statement also calls it NMSE) between the predicted and true results. (Note: the test set must not take part in training or validation; doing so is treated as a violation.)

The data.csv data are monitoring records of more than 40,000 smartphone users of one company over 30 consecutive days in a certain year, in the table format below. Use the XGBoost model to predict the effective daily average usage time of category a APPs on days 12 to 21 from the user's category a usage on days 1 to 11. The evaluation index is MMSE.

No.  Variable name  Description
1    uid            User ID
2    category       APP category (category a to z, 26 categories in total)
3    app_type       APP type: built into the system or installed by the user
4    start_day      Start day of use, value 1-30 (note: the first two rows of the day-1 data have start day 0, meaning the use began on the previous day)
5    start_time     Start time of use
6    end_day        End day of use
7    end_time       End time of use
8    duration       Duration of use (seconds)
9    up_flow        Upstream traffic
10   down_flow      Downstream traffic

2 Modeling ideas

Problem 1:

  1. Data preprocessing: clean the data and extract features for the 20 common APP categories. PCA and LDA can be used for dimensionality reduction to lower the computational cost.

  2. Clustering algorithms:
    a. K-means: run multiple experiments with different K values and select the best clustering result. Metrics such as the silhouette coefficient and the Calinski-Harabasz index can be used for comparison and selection.
    b. DBSCAN: clusters data points by density, without specifying the number of clusters in advance. Different clustering results are obtained by adjusting the radius parameter and the density parameter.
    c. Hierarchical clustering: either top-down or bottom-up. The bottom-up variant iteratively computes the similarity between data points and merges them step by step until the final clustering is obtained.

    d. Improved clustering algorithms

    e. Deep clustering algorithms

  3. Clustering result analysis: after selecting the best clustering result, profile each user group. Analyze the behavior of each group (time of use, frequency of use, duration of use, preferences, and so on), label users based on these portraits, and recommend different categories of APP products according to the labels.
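The comparison described in the steps above can be sketched roughly as follows. This is only an illustration on synthetic data: the `make_blobs` matrix stands in for a (users x 20 categories) usage table, and the DBSCAN settings (`eps=3.0`, `min_samples=5`) are placeholder values that would need tuning on the real data.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a (users x 20 categories) usage matrix (assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

# Choose K for K-means by the silhouette coefficient.
k_scores = {}
for k in range(2, 7):
    k_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    k_scores[k] = silhouette_score(X, k_labels)
best_k = max(k_scores, key=k_scores.get)

# Compare three algorithms with the same two metrics.
models = {
    "kmeans": KMeans(n_clusters=best_k, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=best_k),
    "dbscan": DBSCAN(eps=3.0, min_samples=5),  # placeholder parameters
}
results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    mask = labels != -1  # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2:
        results[name] = None  # the metrics need at least two clusters
        continue
    results[name] = {
        "n_clusters": n_clusters,
        "silhouette": silhouette_score(X[mask], labels[mask]),
        "calinski_harabasz": calinski_harabasz_score(X[mask], labels[mask]),
    }

for name, scores in results.items():
    print(name, scores)
```

Higher values are better for both metrics. Noise points are excluded before scoring DBSCAN, so its scores are not strictly comparable with the other two when many points are labeled as noise.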

Problem 2:

  1. Data preprocessing: clean the users' APP usage records and extract features, such as how many times and for how long each user used each APP.
  2. Classification: build a classification model from the users' APP usage records on days 1 to 11. Process the data with feature engineering and train and test a suitable classifier, such as a decision tree, a random forest, a support vector machine, or an improved machine-learning classifier. Finally, evaluate the model's accuracy on the test set.
  3. Regression: build a regression model from the users' APP usage records on days 1 to 11. Process the data with feature engineering and train and test a suitable regressor, such as linear regression, decision tree regression, or neural network regression. Evaluate the model on the test set with the MMSE (NMSE) metric from the problem statement.

3 Problem 1 implementation code

[2023 2nd DingTalk Cup College Student Big Data Challenge, Preliminary Round] Problem A: Smartphone user monitoring data analysis, Python code analysis

4 Problem 2 implementation code

4.1 Classification problem: predict whether category a APPs are used

(1) For the feature engineering part, see the Problem 1 post:

[2023 2nd DingTalk Cup College Student Big Data Challenge, Preliminary Round] Problem A: Smartphone user monitoring data analysis, Python code analysis

(2) Data reading

Days 1-11 are used as the training set and days 12-21 as the test set.

Set up the classification task. Note that a user has multiple usage records, so the test set must be deduplicated by user id before prediction.

import os
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

# Read all data sets
# Read the training set
train_folder_path = '初赛数据集/训练集'
train_dfs = []
for filename in os.listdir(train_folder_path):
    if filename.endswith('.txt'):
        csv_path = os.path.join(train_folder_path, filename)
        tempdf = pd.read_csv(csv_path)
        train_dfs.append(tempdf)
train_df = pd.concat(train_dfs,axis=0)
# Read the test set
test_folder_path = '初赛数据集/测试集'
test_dfs = []
for filename in os.listdir(test_folder_path):
    if filename.endswith('.txt'):
        csv_path = os.path.join(test_folder_path, filename)
        tempdf = pd.read_csv(csv_path)
        test_dfs.append(tempdf)
test_df = pd.concat(test_dfs,axis=0)
# Split features and labels
X_train = train_df.drop(['category','uid','appid'], axis=1)
y_train = train_df['category']

X_test = test_df.drop(['category','uid','appid'], axis=1)
y_test = test_df['category']
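Since a user has many usage records, one way to get a single row and a single label per user is sketched below. It is a toy illustration, not the post's feature engineering: the `records` table, the feature names `a_active_days` and `a_last_day`, and the label name `used_a` are all made up for the example.

```python
import pandas as pd

# Toy usage records; the column names follow the second table, the values are invented.
records = pd.DataFrame({
    "uid":       [1, 1, 1, 2, 2, 3, 3, 3],
    "category":  ["a", "a", "b", "a", "b", "b", "a", "a"],
    "start_day": [2, 5, 13, 3, 15, 4, 12, 20],
})

history = records[records["start_day"].between(1, 11)]   # days 1-11
future = records[records["start_day"].between(12, 21)]   # days 12-21

# One row per user: deduplicate the uids seen in the history window.
users = pd.Index(sorted(history["uid"].unique()), name="uid")

# Features built from category-a history only (hypothetical feature names).
a_history = history[history["category"] == "a"]
features = pd.DataFrame({
    "a_active_days": a_history.groupby("uid")["start_day"].nunique(),
    "a_last_day": a_history.groupby("uid")["start_day"].max(),
}).reindex(users, fill_value=0)  # users without category-a history get zeros

# Label: 1 if the user has at least one category-a record on days 12-21.
future_a_users = set(future.loc[future["category"] == "a", "uid"])
labels = pd.Series(users.isin(list(future_a_users)).astype(int),
                   index=users, name="used_a")

print(features.join(labels))
```

With this layout the classifier sees each user once, and the label comes only from the later window, so no day 12-21 information leaks into the features.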

(3) Model training

# Train the decision tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_y_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_y_pred)
print('Accuracy of decision tree model:', dt_accuracy)

Accuracy of decision tree model: 0.8853211009174312


# Train the random forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_y_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_y_pred)
print('Accuracy of random forest model:', rf_accuracy)

Accuracy of random forest model: 0.9724770642201835

# Train the support vector machine model
svc_model = SVC(kernel='linear')
svc_model.fit(X_train, y_train)
svc_y_pred = svc_model.predict(X_test)
svc_accuracy = accuracy_score(y_test, svc_y_pred)

print('Accuracy of the SVM model:', svc_accuracy)

Accuracy of the SVM model: 0.9513251783893986

4.2 Regression problem

(1) Feature extraction part

  • Daily time features for the past 11 days (day, hour, minute); features such as holidays, weekends, and working days can also be added
  • Total time the user spent in category a apps over the past 11 days
  • Number of times the user used category a apps over the past 11 days
  • Trend of the user's category a usage curve over the past 11 days
  • Other features related to category a apps
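Two of the features in the list, the usage trend and a weekend flag, could be computed along these lines. This is a rough sketch on invented daily totals. The data only give a day index from 1 to 30, so the weekday of day 1 (`FIRST_DAY_WEEKDAY`) is a pure assumption here, and `a_trend_slope` is a made-up feature name.

```python
import numpy as np
import pandas as pd

# Toy daily totals (seconds) of category-a usage for two users over days 1-11.
daily = pd.DataFrame({
    "uid": [1] * 11 + [2] * 11,
    "start_day": list(range(1, 12)) * 2,
    "a_duration": [100 + 10 * d for d in range(11)] + [500 - 20 * d for d in range(11)],
})

# Weekend flag. ASSUMPTION: day 1 is a Monday (0 = Monday ... 6 = Sunday).
FIRST_DAY_WEEKDAY = 0
daily["weekday"] = (daily["start_day"] - 1 + FIRST_DAY_WEEKDAY) % 7
daily["is_weekend"] = (daily["weekday"] >= 5).astype(int)

# Usage trend: least-squares slope of daily duration against the day index.
trend = pd.Series(
    {
        uid: np.polyfit(g["start_day"].to_numpy(), g["a_duration"].to_numpy(), deg=1)[0]
        for uid, g in daily.groupby("uid")
    },
    name="a_trend_slope",
)
print(trend)
```

A positive slope means the user's category a usage grew over the 11 days and a negative slope means it shrank, which gives the regressor a compact summary of the usage curve.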

Read data


import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import holidays
import os

# Load the APP category file
app_class = pd.read_csv('初赛数据集/app_class.csv',names=['appid','app_type'])
app_type_dict = dict(zip(app_class['appid'], app_class['app_type']))

# Read the training set; this folder contains day01.txt to day21.txt, 21 files in total
train_folder_path = '初赛数据集/问题2数据集'
train_dfs = []
cols = ['uid','appid','app_type','start_day','start_time','end_day','end_time','duration','up_flow','down_flow']
for filename in os.listdir(train_folder_path):
    if filename.endswith('.txt'):
        csv_path = os.path.join(train_folder_path, filename)
        tempdf = pd.read_csv(csv_path,names=cols)
        train_dfs.append(tempdf)
data = pd.concat(train_dfs,axis=0)
data.shape

Data preprocessing

# Map each appid to its category
data['category'] = data['appid'].map(app_type_dict)
# Parse the time columns
data['start_time'] = pd.to_datetime(data['start_time'])
data['end_time'] = pd.to_datetime(data['end_time'])

# Build the "usage duration (hours)" feature
# Note: .dt.seconds is the seconds component (0-86399) of the difference, so it is never negative
data['duration_hour'] = (data['end_time'] - data['start_time']).dt.seconds / 3600
# Handle missing values
data = data.dropna()

# Extract time features
data['start_time_day'] = data.start_time.dt.day
data['start_time_hour'] = data.start_time.dt.hour
data['start_time_minute'] = data.start_time.dt.minute


# Handle outliers (e.g. usage durations below 0 or above 24 hours)
data = data[(data['duration_hour'] >= 0) & (data['duration_hour'] <= 24)]

# Build the training and test sets
train = data[data['start_day'] <= 11]
test = data[(data['start_day'] >= 12) & (data['start_day'] <= 21)]
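The duration feature above relies on `.dt.seconds`, which behaves differently from `.dt.total_seconds()` when a session crosses midnight. The small check below illustrates the difference. It assumes that `start_time` and `end_time` are time-of-day strings in `HH:MM:SS` form, which the post does not state.

```python
import pandas as pd

# Two toy sessions: the first crosses midnight, the second does not.
times = pd.DataFrame({
    "start_time": ["23:50:00", "08:00:00"],
    "end_time":   ["00:10:00", "08:30:00"],
})

# Parsing time-only strings puts both columns on the same dummy date.
start = pd.to_datetime(times["start_time"], format="%H:%M:%S")
end = pd.to_datetime(times["end_time"], format="%H:%M:%S")
delta = end - start

wrapped = delta.dt.seconds          # seconds component, always in 0..86399
signed = delta.dt.total_seconds()   # signed length of the difference

print(pd.DataFrame({"wrapped": wrapped, "signed": signed}))
```

For the midnight-crossing session `wrapped` gives 1200 seconds (20 minutes), while `signed` gives -85200. So with `.dt.seconds` the lower bound of the outlier filter never triggers, and with `.dt.total_seconds()` such sessions would need `end_day` taken into account.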

Feature engineering

# Total usage time of category a APPs per user over the past 11 days
# ... (omitted)
# Number of uses of category a APPs per user over the past 11 days
# ... (omitted)


# Merge the features into the training and test sets
train = pd.merge(train, total_duration, on='uid', how='left')
train = pd.merge(train, count, on='uid', how='left')

test = pd.merge(test, total_duration, on='uid', how='left')
test = pd.merge(test, count, on='uid', how='left')


# Handle missing values
train = train.fillna(0)
test = test.fillna(0)

# Select the features to use
features = ['a_total_duration', 'a_count','start_time_day','start_time_hour','start_time_minute']
# Build the feature matrices and target variables for the training and test sets
# ... (omitted)


X_test = test[features].values
mean_test_duration = test.groupby('uid')['duration_hour'].mean()
y_test = test['uid'].map(dict(mean_test_duration))
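The feature-building steps are omitted in the post. The sketch below shows one possible way to produce per-user frames with the column names used in `features`, and one possible per-user target. It is not the author's code: the toy tables are invented, and reading "effective daily average usage time" as total category a time divided by the number of days with any category a use is an assumption.

```python
import pandas as pd

# Toy training-window records (days 1-11); values are invented.
train = pd.DataFrame({
    "uid": [1, 1, 2, 3],
    "category": ["a", "a", "b", "a"],
    "start_day": [1, 1, 2, 3],
    "duration_hour": [0.5, 1.5, 2.0, 0.25],
})

a_train = train[train["category"] == "a"]
# Per-user total time and number of uses for category a.
total_duration = (a_train.groupby("uid")["duration_hour"].sum()
                  .rename("a_total_duration").reset_index())
count = (a_train.groupby("uid")["duration_hour"].count()
         .rename("a_count").reset_index())

# One row per user; users with no category-a use get 0 after the left merge.
users = train[["uid"]].drop_duplicates()
user_features = (users.merge(total_duration, on="uid", how="left")
                      .merge(count, on="uid", how="left")
                      .fillna(0))

# Toy test-window records (days 12-21) for category a.
test = pd.DataFrame({
    "uid": [1, 1, 1, 3],
    "start_day": [12, 12, 14, 20],
    "duration_hour": [1.0, 1.0, 0.5, 0.75],
})
# ASSUMED target: mean of the daily totals over days with any use.
per_day = test.groupby(["uid", "start_day"])["duration_hour"].sum()
target = per_day.groupby(level="uid").mean()

print(user_features)
print(target)
```

Note that this target differs from the `mean_test_duration` line above, which averages over individual records rather than over days.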

(2) Model training part

The XGBoost regression model is used here; LightGBM, linear regression, decision tree regression, neural network regression, and other models are also options. The hyperparameters also need tuning, for which grid search is one option.


import xgboost as xgb
from sklearn.model_selection import GridSearchCV


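# Feature standardization: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# XGBoost regression model; linear regression, decision tree regression, or neural network regression can also be used
xgbmodel = xgb.XGBRegressor(
            objective='reg:squarederror',
            n_jobs=-1,
            n_estimators=1000,
            max_depth=7,
            subsample=0.8,
            learning_rate=0.05,
            gamma=0,
            colsample_bytree=0.9,
            random_state=2021,
            reg_alpha=0.3)  # L1 regularization (alpha in the native XGBoost API)

# Train the model
xgbmodel.fit(X_train, y_train)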
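The grid search mentioned above is imported but not shown. A minimal sketch of how it could look is given below, on synthetic data and with scikit-learn's `GradientBoostingRegressor` swapped in for XGBoost so that the example only depends on scikit-learn. The parameter grid is illustrative, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the engineered features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Putting the scaler in the pipeline refits it on each training fold only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state=0)),
])

# Small illustrative grid; "model__" routes each parameter to the regressor step.
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [2, 3],
    "model__learning_rate": [0.05, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MSE of the best setting
```

Because `xgb.XGBRegressor` follows the scikit-learn estimator interface, it should be usable as the `"model"` step in the same way, with the grid keys renamed to its own parameters.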

(3) Model evaluation
Implemented according to the formula given in the problem statement:
$$
\mathrm{MMSE} = \sqrt{\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\overline{y})^2}}
$$


from sklearn.metrics import mean_squared_error, mean_absolute_error

def MMSE(y_test, y_pred):
    # Error between the actual and predicted values
    error = y_test - y_pred
    # Numerator and denominator of the metric
    numerator = np.sum(np.square(error))
    denominator = np.sum(np.square(y_test - np.mean(y_test)))
    # Compute MMSE
    mmse = np.sqrt(numerator / denominator)
    return mmse
# Predict on the test set
y_pred = xgbmodel.predict(X_test)

# Compute the evaluation metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mmse = MMSE(y_test, y_pred)
print("MMSE: {:.4f},MSE: {:.4f}, MAE: {:.4f}".format(mmse,mse, mae))

MMSE: 1.0709,MSE: 0.0432, MAE: 0.1181
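To help read the MMSE value, the small check below evaluates the same formula on a few hand-made predictions. The numbers are invented; the function mirrors the `MMSE` definition above under the lowercase name `mmse`.

```python
import numpy as np

def mmse(y_true, y_pred):
    """sqrt( sum (y - y_hat)^2 / sum (y - mean(y))^2 ), as in the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator = np.sum((y_true - y_pred) ** 2)
    denominator = np.sum((y_true - y_true.mean()) ** 2)
    return np.sqrt(numerator / denominator)

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean of the true values.
mmse_baseline = mmse(y_true, np.full_like(y_true, y_true.mean()))
# Perfect predictions.
mmse_perfect = mmse(y_true, y_true)
# Predictions that are off by a constant 0.5.
mmse_close = mmse(y_true, y_true + 0.5)

print(mmse_baseline, mmse_perfect, mmse_close)
```

A constant prediction equal to the mean of the true values scores exactly 1 and a perfect prediction scores 0. A score above 1, such as the 1.0709 reported above, therefore means the squared error is larger than that of the constant-mean baseline, which suggests the features or the target construction deserve another look.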

5 Download

The complete code for all problems is available for download at the bottom of the Zhihu article:

zhuanlan.zhihu.com/p/643785015

Origin blog.csdn.net/weixin_43935696/article/details/131895788