
2023 Second DingTalk Cup College Student Big Data Challenge, Preliminary Round A: Smartphone User Monitoring Data Analysis (Python Code Analysis)


1 Problem statement

2023 Second DingTalk Cup College Student Big Data Challenge, Preliminary Round A: Smartphone User Monitoring Data Analysis

1. Problem background

In recent years the popularity of smartphones has grown explosively, which has not only expanded China's smartphone market but also rapidly driven the development of mobile software. Brand competition in China's smartphone market has intensified further, and China has overtaken the United States as the world's largest smartphone market. Mobile software evolves quickly, making phones more pleasant to use, adding fun to daily life, and creating a new generation of "heads-down" phone users. Apps for games, shopping, social networking, news, and personal finance have attracted and served people in modern society, making the phone an essential item whenever people go out.

The data come from the monitoring records of more than 40,000 smartphone users of a company over 30 consecutive days in a certain year, and have been anonymized and transformed. Each day's data is a txt file with 10 columns, recording, for every use of every APP (identified by appid) by each user (identified by uid), the start time, usage duration, and upstream and downstream traffic; see Table 1 for details. There is also an auxiliary table, app_class.csv, with two columns: the first is appid, and the second gives the category (app_class) of more than 4,000 commonly used APPs, such as social, video, and education. The categories are encoded by the English letters a through t, 20 common categories in total; the remaining APPs are uncommon and their category is unknown.

Table 1

| No. | Variable | Description |
| --- | -------- | ----------- |
| 1 | uid | User ID |
| 2 | appid | APP ID (matches the first column of the app_class file) |
| 3 | app_type | APP type: system built-in or user-installed |
| 4 | start_day | Day usage started, valued 1-30 (note: a start_day of 0 in the first two rows of day 1 means usage began on the day before) |
| 5 | start_time | Usage start time |
| 6 | end_day | Day usage ended |
| 7 | end_time | Usage end time |
| 8 | duration | Usage duration (seconds) |
| 9 | up_flow | Upstream traffic |
| 10 | down_flow | Downstream traffic |

2. Problems to solve

  1. Cluster analysis

(1) Cluster users based on their usage of the 20 common APP categories. At least three different clustering algorithms must be compared; choose a reasonable number of clusters K and analyze the clustering results.

(2) Profile the user groups obtained from the clustering and analyze the characteristics of each group. (Definition of user portrait: tag users according to their attributes, preferences, behavior habits, and other information, so as to describe the behavior of different user groups and recommend different categories of APPs to each group.)

  2. Predictive analysis of APP usage: based on a user's historical APP usage records, predict whether the user will use an APP in the future (a classification problem) and for how long (a regression problem).

(1) Based on each user's usage of category-a APPs on days 1 to 11, predict whether the user will use this category of APP on days 12 to 21. Report the accuracy of the predictions against the ground truth. (Note: the test set must not participate in training or validation, otherwise it is treated as a violation.)

(2) Based on each user's usage of category-a APPs on days 1 to 11, predict the user's effective average daily usage time of category-a APPs on days 12 to 21. The evaluation metric is NMSE:
$$\mathrm{NMSE} = \sqrt{\frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \overline{y}\right)^2}}$$

where $y_i$ is the actual usage duration, $\hat{y}_i$ is the predicted usage duration, and $\overline{y}$ is the mean actual usage duration over all users. Report the NMSE between the predicted and true results. (Note: the test set must not participate in training or validation, otherwise it is treated as a violation.)
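For reference, this metric can be computed directly with NumPy. A minimal sketch (the function name nmse and the toy arrays are illustrative, not part of the competition materials):

import numpy as np

def nmse(y_true, y_pred):
    """NMSE as defined above: root of the squared-error sum divided by the
    squared-deviation sum around the mean of the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2))

# Toy example: a small value means the predictions are close to the truth
print(nmse([10, 20, 30], [12, 18, 33]))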

2 Modeling ideas

First question:

  1. Data preprocessing: clean the data on the 20 common APP categories and extract features. PCA or LDA can be used for dimensionality reduction to lower the computational cost.

  2. Clustering algorithms:
    a. K-means: run several experiments with different values of K and keep the best clustering. Metrics such as the silhouette coefficient and the Calinski-Harabasz index can be used for comparison and selection.
    b. DBSCAN: clusters points by density without specifying the number of clusters in advance; different clusterings are obtained by tuning the radius (eps) and density (min_samples) parameters.
    c. Hierarchical clustering: proceeds either top-down or bottom-up, iteratively computing similarities between points (or clusters) and merging them step by step into the final clustering.
    d. Improved variants of the above algorithms.
    e. Deep clustering algorithms.

  3. Clustering result analysis: after selecting the best clustering result, profile the users in each cluster. Analyze each group's behavioral characteristics (e.g., time of use, frequency of use, duration, preferences), tag users according to these portraits, and recommend APP categories based on the tags (see the sketch below).
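As an illustration of step 3, once any of the models in Section 3.3 has produced cluster labels, per-cluster feature means give a quick profile of each group. A minimal sketch, assuming the feature table all_df built in Section 3.2 and a label array labels (e.g., clf.labels_) aligned with its rows:

import pandas as pd

# Attach cluster labels to the user feature table (all_df and labels are
# assumptions of this sketch; labels comes from a fitted clustering model)
profile = all_df.drop(columns=['uid', 'appid']).assign(cluster=labels)

# Per-cluster feature means: e.g. a cluster with high total_usage_time reads
# as "heavy users", one with high num_apps as "explorers"; such tags drive
# which APP categories to recommend to each group
cluster_means = profile.groupby('cluster').mean()
print(cluster_means)
print(profile['cluster'].value_counts())  # cluster sizes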

Second question:

  1. Data preprocessing: clean the APP usage records and extract features, such as each user's usage count and total duration per APP.
  2. Classification: build a classification model on the usage records of days 1 to 11. Process the data with feature engineering, then train and test a suitable classifier such as a decision tree, random forest, support vector machine, or an improved machine-learning classifier. Finally, validate the model on the held-out test set and report its accuracy.
  3. Regression: build a regression model on the usage records of days 1 to 11. Process the data with feature engineering, then train and test a suitable regressor such as linear regression, decision tree regression, or neural network regression. Validate the model on the test set and evaluate it with NMSE (a sketch of both steps follows).
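The implementation in Section 3 covers the first question only, so below is a minimal end-to-end sketch of the second question. It assumes a cleaned log cat_a restricted to category-a records with columns uid, start_day, and usage_time; the three features and the random forests are illustrative choices, not the competition's reference solution:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Features from days 1-11: record count, total and mean usage time per user
feats = (cat_a[cat_a['start_day'] <= 11]
         .groupby('uid')['usage_time']
         .agg(n_records='count', total_time='sum', mean_time='mean'))

# Targets from days 12-21
future = cat_a[cat_a['start_day'].between(12, 21)].groupby('uid')['usage_time'].sum()
feats['will_use'] = feats.index.isin(future.index).astype(int)      # classification target
feats['future_daily'] = future.reindex(feats.index).fillna(0) / 10  # regression target (per day)

X = feats[['n_records', 'total_time', 'mean_time']]
X_tr, X_te, y_tr, y_te = train_test_split(X, feats['will_use'], test_size=0.2, random_state=0)

# (1) Classification: will the user use a category-a APP on days 12-21?
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print('accuracy:', accuracy_score(y_te, clf.predict(X_te)))

# (2) Regression: average daily usage time on days 12-21, scored with NMSE
r_tr = feats.loc[X_tr.index, 'future_daily']
r_te = feats.loc[X_te.index, 'future_daily']
reg = RandomForestRegressor(random_state=0).fit(X_tr, r_tr)
pred = reg.predict(X_te)
print('NMSE:', np.sqrt(((r_te - pred) ** 2).sum() / ((r_te - r_te.mean()) ** 2).sum()))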

3 Problem 1 Implementation code

3.1 Data cleaning

Import packages

import pandas as pd
from sklearn.cluster import Birch
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
import time
from sklearn import metrics
import os
from sklearn.cluster import MeanShift
from tqdm import tqdm
import numpy as np
import warnings
warnings.filterwarnings("ignore")
tqdm.pandas()

Merge data

# Merge the 30 daily data files into one DataFrame
folder_path = '初赛数据集/'
dfs = []
for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
        csv_path = os.path.join(folder_path, filename)
        tempdf = pd.read_csv(csv_path)
        dfs.append(tempdf)
df = pd.concat(dfs,axis=0)
df.shape
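One caveat on the read: if the raw txt files carry no header row, pd.read_csv will consume the first record as column names. In that case the ten columns from Table 1 can be passed explicitly; a sketch, assuming comma-separated files (adjust sep to the actual delimiter):

cols = ['uid', 'appid', 'app_type', 'start_day', 'start_time',
        'end_day', 'end_time', 'duration', 'up_flow', 'down_flow']
tempdf = pd.read_csv(csv_path, header=None, names=cols, sep=',')  # drop-in for the read above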

Data cleaning:

  1. For rows whose start_day is 0, set start_day to 1, i.e., treat them as used on the first day.
  2. Convert the time-related fields (start_time, end_day, end_time) to datetime and derive each record's usage duration (minutes), upstream traffic (MB), and downstream traffic (MB).
  3. Drop rows where duration, up_flow, or down_flow is 0, since they mean the user only briefly opened the APP without really using it.
  4. Based on the distribution plots of usage time, drop rows with clearly abnormal duration or traffic, such as usage shorter than 10 seconds or traffic that is extremely large or small.
import pandas as pd
import matplotlib.pyplot as plt

# Data cleaning
df.loc[df['start_day'] == 0, 'start_day'] = 1  # treat a start day of 0 as day 1
df['start_time'] = pd.to_datetime(df['start_time'])  # convert to datetime
df['end_time'] = pd.to_datetime(df['end_time'])  # convert to datetime
df['usage_time'] = (df['end_time'] - df['start_time']) / pd.Timedelta(minutes=1)  # usage duration (minutes)
df['up_flow_mb'] = df['up_flow'] / 1024 / 1024  # upstream traffic (MB)
df['down_flow_mb'] = df['down_flow'] / 1024 / 1024  # downstream traffic (MB)
df = df[df['duration'] != 0]  # drop rows with zero duration
df = df[df['up_flow'] != 0]  # drop rows with zero upstream traffic
df = df[df['down_flow'] != 0]  # drop rows with zero downstream traffic


# Drop rows with clearly abnormal duration or traffic
# Drop rows used for less than 10 seconds (usage_time is in minutes, hence 10/60)
df = df[df['usage_time'] >= 10 / 60]
fig, axs = plt.subplots(1, 3, figsize=(10, 5))
axs[0].hist(df['usage_time'])
axs[0].set_title('Usage Time')
axs[0].set_xlabel('Time (minutes)')
axs[1].hist(df['up_flow_mb'])
axs[1].set_title('Up Flow')
axs[1].set_xlabel('Up Flow (MB)')
axs[2].hist(df['down_flow_mb'])
axs[2].set_title('Down Flow')
axs[2].set_xlabel('Down Flow (MB)')
plt.show()
df


3.2 Feature Engineering

  1. Join the APP category information (games, social, lifestyle, etc.) onto the records, as done below.
  2. Compute per-user features: number of APPs used, total usage time, total traffic, average usage time per session, average traffic, etc.
  3. Compute per-APP statistics…
  4. … (omitted)
  5. … (omitted)
  6. … (omitted)
# APP category information (joined via appid using the app_class file)
cate_df = pd.read_csv('初赛数据集/app_class.csv', header=None)
cate_df.columns = ['appid', 'letter']
# Map the category letters a-z to the integers 1-26
char_map = {chr(i + 96): i for i in range(1, 27)}
# Encode the 'letter' column
cate_df['letter'] = cate_df['letter'].map(char_map)
cate_dict = dict(zip(cate_df['appid'], cate_df['letter']))
df['category'] = df['appid'].map(cate_dict)
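Question (1) clusters users by their behavior over the 20 common categories, so a per-user category matrix is a natural clustering input; the omitted feature blocks presumably build something along these lines. A minimal sketch:

# Per-user total usage time in each of the 20 common categories
# (rows: uid; columns: category codes 1-20; values: minutes)
user_cat = (df.dropna(subset=['category'])
              .pivot_table(index='uid', columns='category',
                           values='usage_time', aggfunc='sum', fill_value=0))
user_cat.head()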

# Per-user features: number of APPs, total/average usage time, total/average traffic
user_agg = df.groupby('uid').agg({'appid': 'nunique',
                                  'usage_time': ['sum', 'mean'],
                                  'up_flow_mb': ['sum', 'mean'],
                                  'down_flow_mb': ['sum', 'mean']})
user_agg.columns = ['num_apps', 'total_usage_time', 'avg_usage_time',
                    'total_up_flow', 'avg_up_flow', 'total_down_flow', 'avg_down_flow']


# Per-APP features: number of users, total/average usage time, total/average traffic
app_agg = df.groupby('appid').agg({'uid': 'nunique',
                                   'usage_time': ['sum', 'mean'],
                                   'up_flow_mb': ['sum', 'mean'],
                                   'down_flow_mb': ['sum', 'mean']})
app_agg.columns = ['num_users', 'total_usage_time', 'avg_usage_time',
                   'total_up_flow', 'avg_up_flow', 'total_down_flow', 'avg_down_flow']
app_agg['category'] = app_agg.index.map(cate_dict)
app_agg



… (code omitted)
user_daily_agg.columns = ['avg_num_apps', 'avg_daily_usage_time', 'avg_daily_up_flow', 'avg_daily_down_flow']
user_daily_agg['total_days'] = user_dates
user_daily_agg


… (code omitted)
user_app_dates_agg.columns = ['min_app_dates', 'avg_app_dates', 'max_app_dates']
user_app_dates_agg


… (code omitted)
app_daily_agg['total_days'] = app_dates.groupby('appid').size()
app_daily_agg


# Merge the feature tables
merged_df_uid = pd.concat([user_agg, user_daily_agg, user_app_dates_agg], axis=1, join='inner')
merged_df_appid = pd.concat([app_agg, app_daily_agg], axis=1, join='inner')
raw_df = df[['uid', 'appid']]
all_df = pd.merge(raw_df, merged_df_uid, on='uid')
all_df = pd.merge(all_df, merged_df_appid, on='appid')
all_df = all_df.drop_duplicates(subset='uid')  # keep one row per user
all_df = all_df.dropna()
# Write the result
all_df.to_excel('初赛数据集/all_df.xlsx', index=False)

3.3 Problem 1: Cluster Analysis

3.3.1 KMeans

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Min-max normalize the feature table

df = pd.read_excel('初赛数据集/all_df.xlsx')
df = df.drop(columns=['uid', 'appid'])
scaler = MinMaxScaler()
weight = scaler.fit_transform(df)

# Elbow method: fix the PCA dimensionality and vary only the number of clusters,
# so the SSE values are computed in the same feature space and remain comparable
start = time.time()
pca = PCA(n_components=10)
trainingData = pca.fit_transform(weight)
SSE = []  # sum of squared errors (inertia) for each k
k1 = 2
k2 = 10
for k in range(k1, k2):
    estimator = KMeans(n_clusters=k, max_iter=10000, init="k-means++", tol=1e-6)
    estimator.fit(trainingData)
    SSE.append(estimator.inertia_)  # total within-cluster sum of squares
end = time.time()
print(f'Elapsed: {end - start}s')
X = range(k1, k2)
plt.figure(figsize=(8, 6))
plt.xlabel('k', fontsize=20)
plt.ylabel('SSE', fontsize=20)
plt.plot(X, SSE, 'o-')
plt.savefig('img/pca降维-手肘法.png', dpi=300)
plt.show()


from sklearn.cluster import KMeans
start = time.time()
pca = PCA(n_components=10)
trainingData = pca.fit_transform(weight)
# trainingData = weight  # alternatively, cluster on the raw scaled features
clf = KMeans(n_clusters=4, max_iter=10000, init="k-means++", tol=1e-6)
result = clf.fit(trainingData)
source = list(clf.predict(trainingData))
end = time.time()
label = clf.labels_
print(f'Elapsed: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
CHI = metrics.calinski_harabasz_score(trainingData, label)
print("CHI: ", CHI)


3.3.2 Agglomerative clustering

start = time.time()
pca = PCA(n_components=10)
trainingData = pca.fit_transform(weight)
# Agglomerative (hierarchical) clustering with Ward linkage
clf = AgglomerativeClustering(n_clusters=4, linkage='ward', affinity='euclidean')
result = clf.fit(trainingData)
source = list(clf.labels_)
end = time.time()
label = clf.labels_
print(f'Elapsed: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
CHI = metrics.calinski_harabasz_score(trainingData, label)
print("CHI: ", CHI)

3.3.3 MeanShift clustering

start = time.time()
# PCA dimensionality reduction
pca = PCA(n_components=10)
trainingData = pca.fit_transform(weight)

# Mean-shift clustering
clf = MeanShift(bandwidth=0.9)
result = clf.fit(trainingData)
source = list(clf.labels_)

end = time.time()
label = clf.labels_
print(f'Elapsed: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
CHI = metrics.calinski_harabasz_score(trainingData, label)
print("CHI: ", CHI)

3.3.4 DBSCAN clustering

from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
import time
from sklearn import metrics
start = time.time()
pca = PCA(n_components=10)
trainingData = pca.fit_transform(weight)
# trainingData = weight  # alternatively, cluster on the raw scaled features
clf = DBSCAN(eps=0.08, min_samples=7)
result = clf.fit(trainingData)
source = list(clf.labels_)
end = time.time()
label = clf.labels_

print(f'Elapsed: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
CHI = metrics.calinski_harabasz_score(trainingData, label)
print("CHI: ", CHI)


3.3.5 Birch clustering

pca = PCA(n_components=10)
trainingData = pca.fit_transform(weight)
# trainingData = weight  # alternatively, cluster on the raw scaled features
clf = Birch(n_clusters=5, branching_factor=10, threshold=0.01)
start = time.time()
result = clf.fit(trainingData)
source = list(clf.predict(trainingData))
end = time.time()
label = clf.labels_
print(f'Elapsed: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
CHI = metrics.calinski_harabasz_score(trainingData, label)
print("CHI: ", CHI)


4 Downloads

See the bottom of the Zhihu article

zhuanlan.zhihu.com/p/643785015
