Study notes for the New User Prediction Challenge (AI Summer Camp, Phase 3)

1. Data visualization

1. Data exploration and understanding: data visualization helps us better understand the characteristics, distributions, and relationships in a data set. By visualizing the data we can spot patterns, outliers, missing values, and other information, and so get a clearer picture of the data's properties and structure.

2. Feature engineering: data visualization helps us select and create suitable features. By visualizing the relationship between features and the target variable, we can see correlations, linear or non-linear relationships, and relative importance, which guides feature selection, transformation, and creation.

3. Model evaluation and tuning: data visualization helps us evaluate and compare the performance of different models. By visualizing predictions, error distributions, learning curves, and so on, we can judge a model's accuracy, stability, and over- or under-fitting, and tune and improve the model accordingly.

4. Result interpretation and communication: data visualization helps us explain and communicate the results of a machine learning model. By visualizing predictions, feature importance, decision boundaries, and so on, we can explain how the model works in a more intuitive way, so that non-technical audiences can also understand and accept its output.

5. Insight discovery and storytelling: data visualization helps us find insights and stories in the data and convey them to an audience. By visualizing trends, associations, and distributions, we can uncover interesting patterns and relationships and communicate them through a visual narrative.
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the training and test set files
train_data = pd.read_csv('D:/D/Download/360安全浏览器下载/用户新增预测挑战赛公开数据/train.csv')
test_data = pd.read_csv('D:/D/Download/360安全浏览器下载/用户新增预测挑战赛公开数据/test.csv')

print(train_data.info())

Inspecting the DataFrame with df.info() shows that only the udmap field is non-numeric (an object/string column); all the other fields are numeric.
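The key1–key9 columns analysed further below come from unpacking udmap; that step follows the course's learning document and is not shown in the original notes. A minimal sketch of one possible unpacking, assuming udmap is either the string 'unknown' or a dict-like string of key/value pairs with numeric values (udmap_to_keys is an illustrative helper, not the official baseline):

# Sketch: unpack udmap into key1–key9 columns (illustrative only)
import ast

def udmap_to_keys(udmap_str):
    keys = np.zeros(9)
    if udmap_str == 'unknown':
        return keys
    d = ast.literal_eval(udmap_str)  # parse the dict-like string
    for k in range(1, 10):
        keys[k - 1] = float(d.get('key' + str(k), 0))
    return keys

key_df = pd.DataFrame(
    train_data['udmap'].apply(udmap_to_keys).tolist(),
    columns=['key' + str(k) for k in range(1, 10)],
    index=train_data.index,
)
train_data = pd.concat([train_data, key_df], axis=1)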

# Mean target within each x7 group
sns.barplot(x='x7', y='target', data=train_data)


# Correlation heatmap
sns.heatmap(train_data.corr().abs(), cmap='YlOrRd')

In the correlation heatmap, darker cells indicate stronger correlation. Here x7 and x8 are closely related, as are common_ts and x6, i.e. there is strong multicollinearity. During feature engineering we can consider dropping one variable from each such pair to avoid overfitting caused by multicollinearity.
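As a quick check on which pairs drive the multicollinearity, the correlation matrix can be scanned for values above a threshold (a minimal sketch; the 0.8 cutoff is an arbitrary choice, not from the original notes):

# List feature pairs whose absolute correlation exceeds a threshold
corr = train_data.corr().abs()
pairs = (
    corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep the upper triangle only
        .stack()
        .sort_values(ascending=False)
)
print(pairs[pairs > 0.8])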
Next, draw a histogram and a boxplot for each field:

# List of the columns to analyse
cols = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']
# Draw a histogram for each field
plt.figure(figsize=(15, 10))
for i, col in enumerate(cols):
    plt.subplot(2, 4, i+1)
    sns.histplot(train_data[col], bins=30, kde=True)
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Draw a boxplot for each field
plt.figure(figsize=(15, 10))
for i, col in enumerate(cols):
    plt.subplot(2, 4, i+1)
    sns.boxplot(train_data[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)
plt.tight_layout()
plt.show()

The result is shown in the figure:

# Convert the millisecond timestamp to a datetime
train_data['common_ts'] = pd.to_datetime(train_data['common_ts'], unit='ms')
# Extract the hour from common_ts
train_data['common_ts_hour'] = train_data['common_ts'].dt.hour
# Plot how the label distribution changes by hour
sns.barplot(x='common_ts_hour', y='target', data=train_data)
plt.show()

The probability of a new user is relatively high between hours 1 and 15, and especially between hours 8 and 15; this is worth revisiting later for feature extraction.
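One simple way to turn that observation into a feature is an indicator for the high-probability hours (illustrative only; is_peak_hour is an invented name, and the 8–15 window just comes from eyeballing the plot above):

# Flag rows whose hour falls in the window observed above (hypothetical feature)
train_data['is_peak_hour'] = train_data['common_ts_hour'].between(8, 15).astype(int)
print(train_data.groupby('is_peak_hour')['target'].mean())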

# Define a function that computes, for each key column, the mean target over the rows where that key is non-zero.
def plot_keytarget_mean(df):
    target_mean = np.zeros(9)
    for i in range(1, 10):
        df_temp = df.copy()
        number = 'key' + str(i)
        if number in df_temp.columns:
            data = {
                number: df_temp[number],
                'target': df_temp['target']
            }
            df1 = pd.DataFrame(data)
            # Keep only the rows where the key column is non-zero
            df_nonzero_key = df1[df1[number] != 0]
            # Mean target over those rows
            mean_target_nonzero_key = df_nonzero_key['target'].mean()
            target_mean[i - 1] = mean_target_nonzero_key  # index starts at 0
    return target_mean

target_mean = plot_keytarget_mean(train_data)
print(target_mean)
keys = ['key1', 'key2', 'key3', 'key4', 'key5', 'key6', 'key7', 'key8', 'key9']
plt.bar(keys, target_mean)
plt.ylabel('Mean Target Value')

From the figure above, the probability of a new user is relatively high for rows where key7, key8 and key9 are present. Some feature combinations based on these keys could be tried later to improve the model's prediction accuracy (one possible combination is sketched below).
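As an illustration of that idea, one hypothetical combination is a count of how many of key7–key9 are non-zero (key789_nonzero is an invented name; it assumes the key1–key9 columns have already been unpacked from udmap):

# Hypothetical combined feature: how many of key7–key9 are non-zero in each row
high_keys = ['key7', 'key8', 'key9']
train_data['key789_nonzero'] = (train_data[high_keys] != 0).sum(axis=1)
sns.barplot(x='key789_nonzero', y='target', data=train_data)
plt.show()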

Summary

Through data visualization we can examine the relationships between individual features and the target in more detail, which helps us filter out useful features and build feature combinations that further improve the model's prediction accuracy. It also helps us understand the data better, discover patterns and trends in it, and optimize the modelling process based on these findings.

2. Feature engineering

Feature engineering is the process of converting raw data into training data for a model, with the goal of obtaining better features. Good feature engineering can improve model performance, and sometimes even simple models achieve good results with it.

From the data visualization and inspection of the data above, we can see that time is an important feature, so we add time-derived features such as the minute of day, day of week, week of year, and so on.

# The test set timestamps also need converting before the .dt accessors can be used
test_data['common_ts'] = pd.to_datetime(test_data['common_ts'], unit='ms')

train_data['common_ts_hour'] = train_data['common_ts'].dt.hour
test_data['common_ts_hour'] = test_data['common_ts'].dt.hour

train_data['common_ts_minute'] = train_data['common_ts'].dt.minute + train_data['common_ts_hour'] * 60
test_data['common_ts_minute'] = test_data['common_ts'].dt.minute + test_data['common_ts_hour'] * 60
train_data['dayofweek'] = train_data['common_ts'].dt.dayofweek
test_data['dayofweek'] = test_data['common_ts'].dt.dayofweek

train_data["weekofyear"] = train_data["common_ts"].dt.isocalendar().week.astype(int)
test_data["weekofyear"] = test_data["common_ts"].dt.isocalendar().week.astype(int)

train_data["dayofyear"] = train_data["common_ts"].dt.dayofyear
test_data["dayofyear"] = test_data["common_ts"].dt.dayofyear

train_data["day"] = train_data["common_ts"].dt.day
test_data["day"] = test_data["common_ts"].dt.day

train_data['is_weekend'] = train_data['dayofweek'] // 6  # dayofweek // 6 equals 1 only for Sunday (dayofweek == 6)
test_data['is_weekend'] = test_data['dayofweek'] // 6

It turned out that the week-of-year value is strongly related to user growth; after submitting, the score increased to 0.73+.
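That relationship can be eyeballed with the same kind of bar plot used earlier (a quick sketch):

# Mean target per ISO week of year
sns.barplot(x='weekofyear', y='target', data=train_data)
plt.show()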
I then continued adding the features given in the learning document. Some of them contain missing values that need to be filled, and these are filled with 0 via fillna.

# Frequency and target-mean (label) features for x1–x8
for i in range(1, 9):
    train_data['x' + str(i) + '_freq'] = train_data['x' + str(i)].map(train_data['x' + str(i)].value_counts())
    test_data['x' + str(i) + '_freq'] = test_data['x' + str(i)].map(train_data['x' + str(i)].value_counts())
    test_data['x' + str(i) + '_freq'].fillna(test_data['x' + str(i) + '_freq'].mode()[0], inplace=True)
    train_data['x' + str(i) + '_mean'] = train_data['x' + str(i)].map(train_data.groupby('x' + str(i))['target'].mean())
    test_data['x' + str(i) + '_mean'] = test_data['x' + str(i)].map(train_data.groupby('x' + str(i))['target'].mean())
    test_data['x' + str(i) + '_mean'].fillna(test_data['x' + str(i) + '_mean'].mode()[0], inplace=True)
# Frequency and target-mean (label) features for key1–key9
for i in range(1, 10):
    train_data['key'+str(i)+'_freq'] = train_data['key'+str(i)].map(train_data['key'+str(i)].value_counts())
    test_data['key'+str(i)+'_freq'] = test_data['key'+str(i)].map(train_data['key'+str(i)].value_counts())
    train_data['key'+str(i)+'_mean'] = train_data['key'+str(i)].map(train_data.groupby('key'+str(i))['target'].mean())
    test_data['key'+str(i)+'_mean'] = test_data['key'+str(i)].map(train_data.groupby('key'+str(i))['target'].mean())
 
train_data = train_data.fillna(0)
test_data = test_data.fillna(0)

Then, following another teaching assistant's excellent notes, which pointed out that filling missing values with the mode works better than filling with 0, I tried that approach; it worked very well and the score rose to 0.75+.
The specific code is as follows:

train_data['x1_freq'] = train_data['x1'].map(train_data['x1'].value_counts())
test_data['x1_freq'] = test_data['x1'].map(train_data['x1'].value_counts())
test_data['x1_freq'].fillna(test_data['x1_freq'].mode()[0], inplace=True)
train_data['x1_mean'] = train_data['x1'].map(train_data.groupby('x1')['target'].mean())
test_data['x1_mean'] = test_data['x1'].map(train_data.groupby('x1')['target'].mean())
test_data['x1_mean'].fillna(test_data['x1_mean'].mode()[0], inplace=True)

train_data['x2_freq'] = train_data['x2'].map(train_data['x2'].value_counts())
test_data['x2_freq'] = test_data['x2'].map(train_data['x2'].value_counts())
test_data['x2_freq'].fillna(test_data['x2_freq'].mode()[0], inplace=True)
train_data['x2_mean'] = train_data['x2'].map(train_data.groupby('x2')['target'].mean())
test_data['x2_mean'] = test_data['x2'].map(train_data.groupby('x2')['target'].mean())
test_data['x2_mean'].fillna(test_data['x2_mean'].mode()[0], inplace=True)

train_data['x3_freq'] = train_data['x3'].map(train_data['x3'].value_counts())
test_data['x3_freq'] = test_data['x3'].map(train_data['x3'].value_counts())
test_data['x3_freq'].fillna(test_data['x3_freq'].mode()[0], inplace=True)

train_data['x4_freq'] = train_data['x4'].map(train_data['x4'].value_counts())
test_data['x4_freq'] = test_data['x4'].map(train_data['x4'].value_counts())
test_data['x4_freq'].fillna(test_data['x4_freq'].mode()[0], inplace=True)

train_data['x6_freq'] = train_data['x6'].map(train_data['x6'].value_counts())
test_data['x6_freq'] = test_data['x6'].map(train_data['x6'].value_counts())
test_data['x6_freq'].fillna(test_data['x6_freq'].mode()[0], inplace=True)
train_data['x6_mean'] = train_data['x6'].map(train_data.groupby('x6')['target'].mean())
test_data['x6_mean'] = test_data['x6'].map(train_data.groupby('x6')['target'].mean())
test_data['x6_mean'].fillna(test_data['x6_mean'].mode()[0], inplace=True)

train_data['x7_freq'] = train_data['x7'].map(train_data['x7'].value_counts())
test_data['x7_freq'] = test_data['x7'].map(train_data['x7'].value_counts())
test_data['x7_freq'].fillna(test_data['x7_freq'].mode()[0], inplace=True)
train_data['x7_mean'] = train_data['x7'].map(train_data.groupby('x7')['target'].mean())
test_data['x7_mean'] = test_data['x7'].map(train_data.groupby('x7')['target'].mean())
test_data['x7_mean'].fillna(test_data['x7_mean'].mode()[0], inplace=True)

train_data['x8_freq'] = train_data['x8'].map(train_data['x8'].value_counts())
test_data['x8_freq'] = test_data['x8'].map(train_data['x8'].value_counts())
test_data['x8_freq'].fillna(test_data['x8_freq'].mode()[0], inplace=True)
train_data['x8_mean'] = train_data['x8'].map(train_data.groupby('x8')['target'].mean())
test_data['x8_mean'] = test_data['x8'].map(train_data.groupby('x8')['target'].mean())
test_data['x8_mean'].fillna(test_data['x8_mean'].mode()[0], inplace=True)

3. Model cross-validation

Cross-validation is a model evaluation method commonly used in machine learning to assess a model's performance and generalization ability.
Its main purpose is to make the fullest possible use of a limited data set when evaluating a model, to avoid over- or under-fitting, and to obtain a more robust estimate of model performance.
The basic idea is to split the original training data into several subsets (also called folds) and then repeat model training and validation several times.
In each round, one subset is used as the validation set and the remaining subsets as the training set, so the performance metrics can be computed several times and averaged to give the final evaluation of the model.

1. Why use cross-validation?

  • Cross-validation is used to evaluate the prediction performance of the model, especially the performance of the trained model on new data, which can reduce overfitting to a certain extent.
  • It is possible to obtain as much effective information as possible from limited data.
  • It can help us choose the best model parameters. By conducting multiple evaluations on different training and test sets, the performance of the model under different parameter settings can be compared and the best parameter combination can be selected. This helps us optimize model performance and improve prediction accuracy.

2. Common cross-validation methods:

  • Simple cross-validation
    divides the data set into two parts (or three), for example 70% as the training set and 30% as the validation set. The 70% is used to train models with different parameter settings; afterwards the held-out 30%, which was never used for training, is used for validation and the best model is chosen.
  • S-fold cross-validation
    divides the data set into S disjoint subsets of similar size, trains the model on S-1 of them, and uses the remaining one for validation. After repeating this several times, the best model is selected (a minimal sketch of this loop appears after this list).
    [Note] The validation set may be different in each round.
  • Leave-one-out cross-validation
    is a special case of S-fold cross-validation, used when the data set is extremely small (fewer than 100 samples, or even less): set S = N, where N is the number of samples, so each round leaves exactly one sample out for validation.
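A minimal sketch of S-fold cross-validation written out by hand with StratifiedKFold (S = 5 here; this is not from the original notes, and the scikit-learn helpers used below wrap essentially the same loop):

# Manual 5-fold (S-fold) cross-validation sketch with a decision tree
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

X = train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1)
y = train_data['target']

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in skf.split(X, y):
    fold_clf = DecisionTreeClassifier()
    fold_clf.fit(X.iloc[train_idx], y.iloc[train_idx])
    scores.append(f1_score(y.iloc[val_idx], fold_clf.predict(X.iloc[val_idx])))
print('mean F1 over folds:', np.mean(scores))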
# Import the models
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Import cross-validation and the evaluation metric
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report
# Train and validate SGDClassifier (a classifier based on the stochastic gradient descent optimization algorithm)
pred = cross_val_predict(
    SGDClassifier(max_iter=10),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))

accuracy: the fraction of all samples that are predicted correctly. Precision measures how many of the samples predicted as positive really are positive, while recall measures how many of the actual positives are recovered.
macro avg: one of the summary rows used to evaluate classification models. Each metric (precision, recall, F1, etc.) is computed per class and then averaged over all classes without weighting.
"micro" option: micro-averaging pools all labels together first and then computes a single precision, recall and F value.
weighted avg: the "weighted" option averages the per-class F values, weighting each class by its support.
For more detail, see any guide to machine learning evaluation metrics; the averaging options are made concrete in the sketch below.
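To make these averaging options concrete, the same cross-validated predictions can be scored with each option of sklearn's f1_score (a small sketch using the pred array from the cell above):

# Compare the averaging options on the same predictions
from sklearn.metrics import f1_score

for avg in ['macro', 'micro', 'weighted']:
    print(avg, f1_score(train_data['target'], pred, average=avg))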

# Train and validate DecisionTreeClassifier (decision tree)
pred = cross_val_predict(
    DecisionTreeClassifier(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))


# Train and validate MultinomialNB
pred = cross_val_predict(
    MultinomialNB(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))


# Train and validate RandomForestClassifier
pred = cross_val_predict(
    RandomForestClassifier(n_estimators=5),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))

Judging from the four models above, the decision tree and random forest perform better, and of the two the decision tree is the best. I think the decision tree's strong performance on this data set may come from its natural ability to handle the engineered features, the data distribution, class imbalance, and feature interactions. Of course, the parameters of all the models should be tuned further to improve performance.
I also ran cross-validation with two more models, XGBoost and LightGBM. The results are as follows:

import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
# Define the XGBoost model
xgb_model = xgb.XGBClassifier()
# Train and validate with cross-validation
pred_xgb = cross_val_predict(
    xgb_model,
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred_xgb, digits=3))


# Define the LightGBM model
lgb_model = lgb.LGBMClassifier()
# Train and validate with cross-validation
pred_lgb = cross_val_predict(
    lgb_model,
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred_lgb, digits=3))

Judging by macro avg and weighted avg, the decision tree performs best.

In addition, the optimization of the model itself cannot be ignored:
  • Hyperparameter tuning. In machine learning many model parameters are set by hand; parameters that are not learned during training are called hyperparameters. Adjusting them for the specific problem can improve model accuracy. Commonly used hyperparameter tuning algorithms include Bayesian optimization, grid search and random search.

1. Bayesian optimization: a technique based on Bayes' theorem, which describes the probability of an event given current knowledge. When used for hyperparameter optimization, the algorithm builds a probabilistic model over the hyperparameters to optimize a chosen metric, and uses that fitted (regression) model to iteratively select the most promising set of hyperparameters to try next.

2. Grid Search: With grid search, you specify a set of hyperparameters and performance metrics, and the algorithm then iterates through all possible combinations to determine the best match. Grid search works well, but it is relatively tedious and computationally expensive, especially when using a large number of hyperparameters.
3. Random search: based on similar principles to grid search, but it samples a random combination of hyperparameters at each iteration. This works well when a relatively small number of hyperparameters largely determine the model's results. (A minimal sketch of grid and random search follows this list.)
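A minimal sketch of grid search and random search for the random forest, not from the original notes; the parameter grid is an arbitrary illustrative choice and the feature frame matches the cross-validation cells above:

# Grid search vs. random search over a small illustrative parameter grid
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X = train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1)
y = train_data['target']
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [10, 20, None]}

grid = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid, scoring='f1', cv=3)
grid.fit(X, y)
print('grid search:', grid.best_params_, grid.best_score_)

rand = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1), param_grid,
                          n_iter=5, scoring='f1', cv=3, random_state=42)
rand.fit(X, y)
print('random search:', rand.best_params_, rand.best_score_)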

After reading the random-forest tuning notes shared by a contestant with a 0.86+ score, I tried tuning the random forest model myself, and sped up the TPE parameter search by seeding it with a parameter combination already known to work well.

# Seed the search with a known-good parameter combination (the default parameters are already quite good)
good_params = {
  'n_estimators': 100,
  'max_depth': None,
  'min_samples_split': 2
......
}
# The loss here is the negative of the 5-fold cross-validation score above (-score);
# wrap it in a result object and add it to the trials:
good_result = {'loss': 0.95, 'status': STATUS_OK}
trials.insert_trial_docs([{
  'tid': len(trials) + 1,
  'spec': good_params,
  'result': good_result,
  'misc': {}
}])
# Run the TPE search
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
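The snippet above assumes that space, objective and trials have already been defined (they come from the referenced tuning notes). A minimal sketch of what such definitions might look like, placed before the snippet above; the search ranges here are illustrative guesses, not the referenced notes:

# Illustrative hyperopt setup assumed by the snippet above
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1)
label = train_data['target']

space = {
    'n_estimators': hp.choice('n_estimators', list(range(50, 201, 10))),
    'max_depth': hp.choice('max_depth', list(range(5, 40))),
    'min_samples_split': hp.choice('min_samples_split', list(range(2, 11))),
}

def objective(params):
    clf = RandomForestClassifier(n_jobs=-1, random_state=90, **params)
    # hyperopt minimizes, so return the negative of the 5-fold cross-validated score
    score = cross_val_score(clf, data, label, cv=5, scoring='f1').mean()
    return {'loss': -score, 'status': STATUS_OK}

trials = Trials()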

Parameters tuned in this optimization:
n_estimators: the number of decision trees in the random forest
max_depth: the maximum depth of each decision tree
max_features: the maximum number of features considered when looking for the best split
min_samples_leaf: the minimum number of samples required in a leaf node
min_samples_split: the minimum number of samples a node must have before it is allowed to split
criterion: the criterion used for node splitting

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import time

First, tune n_estimators, the number of decision trees in the random forest:

# train_data here is the data read in above, with the engineered features already added
data = train_data.iloc[:, :-1]
label = train_data.iloc[:, -1]
start = time.time()
scorel = []
for i in range(0, 200, 10):  # compare RF models with 1, 11, ..., 191 trees
    rfc = RandomForestClassifier(n_estimators=i+1, n_jobs=-1, random_state=90)
    score = cross_val_score(rfc, data, label, cv=10).mean()
    scorel.append(score)
print(max(scorel), (scorel.index(max(scorel)) * 10) + 1)
end = time.time()
print('Running time: %s Seconds' % (end - start))
plt.figure(figsize=[20, 5])
plt.plot(range(1, 201, 10), scorel)
plt.show()

result:

0.9613823613698237 131 Running time: 5530.6381804943085 Seconds

The final optimized model:

from sklearn.metrics import accuracy_score, f1_score

clf = RandomForestClassifier(n_estimators=131,
                             max_depth=33,
                             n_jobs=-1,
                             max_features=9,
                             min_samples_leaf=1,
                             min_samples_split=2,
                             criterion='entropy'
                             )
clf.fit(
    train_data.drop(['target'], axis=1),
    train_data['target']
)
# X_val / y_val are assumed to be a held-out validation split prepared earlier (e.g. via train_test_split)
y_pred = clf.predict(X_val)
# Accuracy
accuracy = accuracy_score(y_val, y_pred)
print("Accuracy:", accuracy)
# F1 score
f1 = f1_score(y_val, y_pred)
print("F1 score:", f1)

After submitting, the score reached 0.79+
and after that the score would not go up any more; everything I tried next only tuned it backwards, haha.
I then computed feature importance scores to evaluate the impact of each feature on the target variable.

# List of all candidate field names
l0 = ['x1_freq', 'x2_freq', 'x3_freq', 'x4_freq', 'x5_freq', 'x6_freq', 'x7_freq', 'x8_freq',
      'x1_mean', 'x2_mean', 'x3_mean', 'x4_mean', 'x5_mean', 'x6_mean', 'x7_mean', 'x8_mean',
      'x1_std', 'x2_std', 'x3_std', 'x4_std', 'x5_std', 'x6_std', 'x7_std', 'x8_std',
      'key1_freq', 'key2_freq', 'key3_freq', 'key4_freq', 'key5_freq', 'key6_freq', 'key7_freq', 'key8_freq', 'key9_freq',
      'key1_mean', 'key2_mean', 'key3_mean', 'key4_mean','key5_mean', 'key6_mean', 'key7_mean', 'key8_mean', 'key9_mean',
      'key1_std', 'key2_std', 'key3_std', 'key4_std', 'key5_std', 'key6_std', 'key7_std', 'key8_std', 'key9_std',
      'unmap_isunknown', 'udmap', 'common_ts', 'uuid', 'target', 'common_ts_hour', 'day', 'common_ts_minute','dayofweek',
      'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8',
      'eid', 'eid_std', 'eid_mean', 'eid_freq',
      'key1', 'key2', 'key3', 'key4', 'key5', 'key6', 'key7', 'key8', 'key9'
      ]
 
# Train the model: select features by dropping groups of columns as needed
x = train_data.drop(['x1_freq', 'x2_freq', 'x3_freq', 'x4_freq', 'x5_freq', 'x6_freq', 'x7_freq', 'x8_freq',
                     'x1_mean', 'x2_mean', 'x3_mean', 'x4_mean', 'x5_mean', 'x6_mean', 'x7_mean', 'x8_mean',
                     'x1_std', 'x2_std', 'x3_std', 'x4_std', 'x5_std', 'x6_std', 'x7_std', 'x8_std',
                     'key1_freq', 'key2_freq', 'key3_freq', 'key4_freq', 'key5_freq', 'key6_freq', 'key7_freq', 'key8_freq','key9_freq',
                     'key1_mean', 'key2_mean', 'key3_mean', 'key4_mean','key5_mean', 'key6_mean', 'key7_mean', 'key8_mean','key9_mean',
                     'key1_std', 'key2_std', 'key3_std', 'key4_std', 'key5_std', 'key6_std', 'key7_std', 'key8_std', 'key9_std',
                     'udmap', 'common_ts', 'uuid', 'target', 'common_ts_hour', 'day', 'common_ts_minute','dayofweek',
                     'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8',
                     'eid', 'eid_std', 'eid_mean', 'eid_freq',
                     'key1', 'key2', 'key3', 'key4', 'key5', 'key6', 'key7', 'key8', 'key9'
                      ], axis=1)
y = train_data['target']
clf = DecisionTreeClassifier()
clf.fit(x, y)
 
# Get the feature importance scores
feature_importances = clf.feature_importances_
 
# Build the list of feature names
feature_names = list(x.columns)
 
# Build a DataFrame of feature names and their importance scores
feature_importances_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
 
# Sort by importance
feature_importances_df = feature_importances_df.sort_values('importance', ascending=False)
 
# Colour map
colors = plt.cm.viridis(np.linspace(0, 1, len(feature_names)))
 
# Visualize the feature importances
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(feature_importances_df['feature'], feature_importances_df['importance'], color=colors)
ax.invert_yaxis()  # flip the y-axis so the most important feature is at the top
ax.set_xlabel('Feature importance', fontsize=12)
ax.set_title('Decision tree feature importance', fontsize=16)
for i, v in enumerate(feature_importances_df['importance']):
    ax.text(v + 0.01, i, str(round(v, 3)), va='center', fontname='Times New Roman', fontsize=10)
 
# Save the figure
plt.savefig('./feature_importance.jpg', dpi=400, bbox_inches='tight')
plt.show()


The results show that the year-related time feature is the most important.

# Train the model with DecisionTreeClassifier
clf = DecisionTreeClassifier()
X = train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1)
y = train_data['target']
clf.fit(X, y)
# Plot the feature importance bar chart
import matplotlib.pyplot as plt
# Get the feature importance scores
feature_importances = clf.feature_importances_

# Build the feature importance DataFrame
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sort by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(80, 6))
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.show()

Then I drew a decision tree feature importance histogram.
Insert image description here
I tried combining the important common_ts time features (week, day, minute) with the target to build new features, but the effect did not seem very good: the scores were not as high as with the features extracted earlier.
Finally, I brought in the secret weapon for high scores: AutoGluon, which builds machine learning solutions on raw data with just a few lines of code.
Haha, let's start with the code:

# pip install autogluon
import numpy as np
import pandas as pd
from autogluon.tabular import TabularDataset
from autogluon.tabular import TabularPredictor

train_data = TabularDataset('D:/D/Download/360安全浏览器下载/用户新增预测挑战赛公开数据/train.csv')
test_data = TabularDataset('D:/D/Download/360安全浏览器下载/用户新增预测挑战赛公开数据/test.csv')
submit = pd.DataFrame()
submit["uuid"] = test_data["uuid"]
label = "target"
predictor = TabularPredictor(
    label=label,
    problem_type="binary",
    eval_metric="f1"
).fit(
    train_data.drop(columns=["uuid"]),
    excluded_model_types=[
        "CAT",
        "NN_TORCH",
        "FASTAI",
    ],
)
submit[label] = predictor.predict(test_data.drop(columns=["uuid"]))
submit.to_csv("D:/D/Download/360安全浏览器下载/用户新增预测挑战赛公开数据/submit.csv", index=False)

With just these few lines of code it beat the score I could not reach after working on it for several days, haha, a bit embarrassing.
About AutoGluon
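To see which models AutoGluon actually trained and how they rank, the fitted predictor's leaderboard can be printed (a small sketch using the predictor object from the code above):

# Inspect the models AutoGluon trained and their validation scores (F1, since eval_metric="f1" was set)
lb = predictor.leaderboard()
print(lb[['model', 'score_val']])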
