Datawhale AI Summer Camp Phase 3 (Machine Learning): User Addition Prediction Challenge Baseline Novice Tutorial

This tutorial takes a project-based approach and advances step by step, from shallow to deep: from the overall competition workflow and running the simplest baseline, to a close reading of each stage of the competition, an in-depth walkthrough of the baseline, and more advanced practical skills.
A journey of a thousand miles begins with a single step. Start your AI learning journey from here!
——Datawhale Contributor Team

User Addition Prediction Challenge:
https://challenge.xfyun.cn/topic/info?type=subscriber-addition-prediction&ch=ymfk4uU
Organizer: iFlytek

How to run the baseline in the provided environment:
1. Click to start the environment.
2. Click to enter the environment.
3. Run all the code with one click.
4. Get the result file.
5. Right-click the result file to download it, then submit it on the iFlytek platform.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # decision tree model

# Read the training and test set files
train_data = pd.read_csv('用户新增预测挑战赛公开数据/train.csv')
test_data = pd.read_csv('用户新增预测挑战赛公开数据/test.csv')

# Extract the udmap feature and one-hot encode it by hand
# udmap_onethot(): create a zero vector v of length 9; if the input equals 'unknown', return the zero
# vector directly; otherwise use eval() to turn the string into a dict d and, for i = 1..9, if d contains
# the key 'key'+str(i), write its value into position i-1 of v; finally return v.
def udmap_onethot(d):
    v = np.zeros(9)
    if d == 'unknown':
        return v
    d = eval(d)
    for i in range(1, 10):
        if 'key' + str(i) in d:
            v[i-1] = d['key' + str(i)]
            
    return v
# One-hot encode udmap: apply udmap_onethot() to train_data['udmap'] and test_data['udmap'], stack the
# returned vectors into the DataFrames train_udmap_df and test_udmap_df, then name their columns key1..key9.
train_udmap_df = pd.DataFrame(np.vstack(train_data['udmap'].apply(udmap_onethot)))
test_udmap_df = pd.DataFrame(np.vstack(test_data['udmap'].apply(udmap_onethot)))
train_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]
test_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]

# Encode whether udmap is 'unknown' (empty)
train_data['udmap_isunknown'] = (train_data['udmap'] == 'unknown').astype(int)
test_data['udmap_isunknown'] = (test_data['udmap'] == 'unknown').astype(int)

# Concatenate the udmap features with the original data
# pd.concat() joins train_udmap_df / test_udmap_df column-wise onto train_data / test_data.
train_data = pd.concat([train_data, train_udmap_df], axis=1)
test_data = pd.concat([test_data, test_udmap_df], axis=1)

# Extract eid frequency features
# value_counts() counts how often each eid appears in the training set; map() writes those counts
# into eid_freq for both the training and test sets.
train_data['eid_freq'] = train_data['eid'].map(train_data['eid'].value_counts())
test_data['eid_freq'] = test_data['eid'].map(train_data['eid'].value_counts())

# Extract eid target (label) features
# groupby('eid') computes the mean of the target column per eid on the training set; map() writes
# those means into eid_mean for both the training and test sets.
train_data['eid_mean'] = train_data['eid'].map(train_data.groupby('eid')['target'].mean())
test_data['eid_mean'] = test_data['eid'].map(train_data.groupby('eid')['target'].mean())

# Extract timestamp features
# Convert common_ts from a millisecond timestamp to datetime, then take the hour with .dt.hour
# and store it in common_ts_hour.
train_data['common_ts'] = pd.to_datetime(train_data['common_ts'], unit='ms')
test_data['common_ts'] = pd.to_datetime(test_data['common_ts'], unit='ms')
train_data['common_ts_hour'] = train_data['common_ts'].dt.hour
test_data['common_ts_hour'] = test_data['common_ts'].dt.hour

# Train a decision tree model
# Create a DecisionTreeClassifier clf and call fit() with the training features (unneeded columns dropped)
# and the target column.
clf = DecisionTreeClassifier()
clf.fit(
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)

# Predict on the test set and submit submit.csv on the competition page
# Use the trained clf to predict on the test features (unneeded columns dropped), build a DataFrame
# with the uuid and target columns, and save it as submit.csv.
pd.DataFrame({
    'uuid': test_data['uuid'],
    'target': clf.predict(test_data.drop(['udmap', 'common_ts', 'uuid'], axis=1))
}).to_csv('submit.csv', index=None)

The fit() method used here is scikit-learn's standard way of training a model. It first requires a model object; here that is a decision tree. fit() takes the feature matrix and the target column and learns the model's parameters from them. Note that for a decision tree this is not linear-regression fitting that minimizes a sum of squared errors: the tree is grown by repeatedly choosing the split that best separates the classes, for example by information gain or Gini impurity.
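To see how well the model works before submitting, a common practice is to hold out part of the training data as a local validation set. Below is a minimal sketch, not part of the official baseline, assuming the feature engineering above has already been run; using F1 as the metric is only an assumption about how the leaderboard scores submissions.

# Hedged sketch: offline validation of the baseline decision tree.
# Assumes train_data has already been processed by the feature code above;
# F1 as the metric is an assumption, not the confirmed competition metric.
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

X = train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1)
y = train_data['target']
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

val_clf = DecisionTreeClassifier()
val_clf.fit(X_tr, y_tr)  # fit() learns the tree's split rules from the data
print('validation F1:', f1_score(y_val, val_clf.predict(X_val)))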

Before analysing the one-hot encoding itself, first answer why one-hot encoding is used at all: the discrete features in this data take values that carry no notion of order or size, so one-hot encoding is appropriate; if the values of a discrete feature did carry an order, a direct mapping to continuous values could be used instead.
One-hot encoding, also called one-bit-effective encoding, uses an N-bit status register to encode N states: each state gets its own independent register bit, and at any time only one bit is valid.
Put another way, if a feature has m possible values, after one-hot encoding it becomes m binary features (for example, a 'grade' feature with the values good, medium and poor becomes 100, 010, 001). These features are mutually exclusive and only one is active at a time, so the data becomes sparse.
The main benefits of this are:
it solves the problem of classifiers that cannot handle categorical attribute data directly,
and it also expands the feature space to a certain extent.
A small pandas example is sketched below.
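As a concrete illustration, here is a tiny pandas sketch of one-hot encoding. The 'grade' column is a made-up example, not a field from this competition's data.

# One-hot encoding a hypothetical 'grade' feature with pandas
import pandas as pd

df = pd.DataFrame({'grade': ['good', 'medium', 'poor', 'good']})
print(pd.get_dummies(df['grade']).astype(int))
#    good  medium  poor
# 0     1       0     0
# 1     0       1     0
# 2     0       0     1
# 3     1       0     0

Each value becomes its own 0/1 column and exactly one column is 1 per row, which is the 100 / 010 / 001 pattern described above.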

Q&A

  • What score does submit.csv get when it is submitted on the iFlytek competition page?
  • The score I submitted was 0.62686
  • How is udmap manually one-hot encoded in the code?
  • In the code, udmap is manually one-hot encoded through the customized udmap_onethot() function. The following are the specific implementation steps of the udmap_onethot() function:

1. Create an all-zero vector v of length 9 to store the encoded result.
2. Determine whether the value of the input d is 'unknown'. If so, directly return the all-zero vector v.
3. If the value of d is not 'unknown', convert the dictionary object in string form into an actual dictionary object. You can use the eval() function to achieve this conversion.
4. Traverse the numbers 1 to 9 (representing the 9 categories of one-hot encoding), and check whether the dictionary object d contains elements with key names 'key1', 'key2', ..., 'key9'.
5. For each number i: if the dictionary object d contains an element with the key name 'key'+str(i), assign that element's value to position i-1 of the vector v.
6. Finally, return the vector v obtained after encoding.
By calling udmap_onethot() and applying it to the udmap columns of the training and test sets, the manually one-hot encoded feature matrix is obtained. A quick sanity check of the function is shown below.
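A quick check of the function on two hand-made udmap strings (illustrative values only, not taken from the competition data):

# Illustrative calls to the udmap_onethot() function defined in the baseline
print(udmap_onethot('unknown'))
# -> [0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(udmap_onethot('{"key1": 3, "key5": 7}'))
# -> [3. 0. 0. 0. 7. 0. 0. 0. 0.]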

You can also check out the baseline video explanation by a Datawhale contributor for a more detailed walkthrough.

Introduction to decision trees

What is a decision tree model?
A decision tree is a prediction model that represents a mapping between object attributes and object values. Each internal node in the tree represents a test on an attribute, each branching path represents a possible value of that attribute, and each leaf node corresponds to the prediction for the objects described by the path from the root to that leaf.

Decision tree diagram
The composition of a decision tree:
Root node: the first split point
Non-leaf nodes and branches: intermediate decision steps
Leaf nodes: the final decision results

Decision tree algorithms include the following three classic types: ID3 (splits by information gain), C4.5 (splits by gain ratio), and CART (splits by the Gini index).
Information entropy and information gain:
Information entropy H(D) measures the uncertainty of data set D: H(D) = -Σ_k p_k log2(p_k), where p_k is the proportion of samples in D belonging to class k.
Conditional entropy H(D|A) is the entropy of D that remains once feature A is known: H(D|A) = Σ_i (|D_i|/|D|) H(D_i), where D_i is the subset of D taking the i-th value of A.
The so-called information gain g(D, A) of feature A with respect to training data set D is defined as the difference between the information entropy H(D) of set D and the conditional entropy H(D|A) of D given A, that is: g(D, A) = H(D) - H(D|A).

The generation of a decision tree is mainly divided into the following two steps. These two steps are usually carried out by learning from samples whose classification results are already known.

  1. Splitting of nodes: in general, when the attribute at a node cannot yet determine the class, the node is split into 2 child nodes (or into n child nodes if the tree is not binary).
  2. Determination of thresholds: choose an appropriate threshold so that the classification error rate (training error) is minimized.

Example data: a table with four features (age, job, own house, credit situation) and a class label for each sample (shown in the original figure).

Assuming a binary tree, the decision tree built from this data looks roughly as follows (shown in the original figure).
Why is 'own house' placed at the top of the tree? We can tell from the size of the information gains. Applying the information gain formula above, let A1, A2, A3 and A4 denote the four features age, job, own house and credit situation respectively, and compute the information gain of each feature, starting with age (the calculation of g(D,A1) is worked out in the original figure).
In the same way we can calculate g(D,A2) = 0.324, g(D,A3) = 0.420 and g(D,A4) = 0.363. By comparison, feature A3 (own house) has the largest information gain, so it is placed at the front (the root).
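To make the formula concrete, here is a minimal Python sketch of computing g(D, A) = H(D) - H(D|A). The tiny table below is made up purely for illustration; it is not the data from the figure.

# Hedged sketch: information gain on a hand-made toy table
import numpy as np
import pandas as pd

def entropy(labels):
    # H(D) = -sum(p_k * log2(p_k)) over the class proportions p_k
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def info_gain(df, feature, target):
    # g(D, A) = H(D) - H(D|A), where H(D|A) is a weighted sum of subset entropies
    h_d = entropy(df[target])
    h_d_a = sum(len(sub) / len(df) * entropy(sub[target])
                for _, sub in df.groupby(feature))
    return h_d - h_d_a

toy = pd.DataFrame({
    'own_house': ['yes', 'yes', 'no', 'no', 'no', 'no'],
    'approve':   ['yes', 'yes', 'yes', 'no', 'no', 'no'],
})
print(info_gain(toy, 'own_house', 'approve'))  # about 0.459 for this toy table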

Advantages and Disadvantages of Decision Trees:

  • Advantages:
    Decision trees are easy to understand and interpret, can be analysed visually, and rules can easily be extracted from them;
    they can handle both nominal and numerical data;
    they are comparatively good at handling samples with missing attribute values;
    they can handle irrelevant features;
    they run relatively fast when making predictions on a test set;
    they can produce feasible, good results on large data sources in a relatively short time.
  • Disadvantages:
    They are prone to overfitting (random forests can reduce overfitting to a large extent);
    they tend to ignore correlations between attributes in the data set;
    for data with imbalanced class sizes, different splitting criteria lead to different attribute-selection tendencies: the information gain criterion prefers attributes with many possible values (typified by the ID3 algorithm), while the gain-ratio criterion (used by C4.5) prefers attributes with fewer possible values; when splitting, C4.5 therefore does not rely on the gain ratio alone but applies a heuristic rule (any method based purely on information gain shares this weakness, e.g. RF built on such trees).
    When the ID3 algorithm computes information gain, the result is biased towards features with more distinct values.

Why prune?
The risk of overfitting in a decision tree model is very high: in theory the tree can separate the training data completely, because with enough nodes every sample can end up in its own leaf. That gives excellent results on the training set but poor results on the test set. Therefore, after building a decision tree, a pruning strategy should be applied to make the splitting criteria more robust, make the tree simpler, speed up prediction, and improve the model's generalization.

Pruning ideas:
Pre-pruning: stop the growth of certain branches early, while the tree is being built.
Post-pruning: first grow the complete tree, then go back from the bottom up and prune it.

Example of pre-pruning: if the validation accuracy becomes lower after a split, prune (do not split); if the accuracy does not change after the split, follow Occam's razor and do not split either.
Example of post-pruning: examine each node from the bottom up and decide whether to prune it; if pruning a node would not change the accuracy, it can be left unpruned.
Pre-pruning vs. post-pruning
(1) Time cost
Pre-pruning: test time cost is reduced; training time cost is reduced.
Post-pruning: test time cost is reduced; training time cost is increased.
(2) Over-/under-fitting risk
Pre-pruning: the risk of overfitting is reduced, but the risk of underfitting is increased.
Post-pruning: the risk of overfitting is reduced, and the risk of underfitting is basically unchanged.
(3) Generalization performance: post-pruning is usually better than pre-pruning.
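For reference, both pruning styles can be tried with scikit-learn's DecisionTreeClassifier. The sketch below is illustrative and assumes a train/validation split such as X_tr, y_tr, X_val, y_val from the earlier validation sketch; the specific parameter values are arbitrary, not tuned for this competition.

# Pre-pruning: stop growth early by limiting depth and leaf size
from sklearn.tree import DecisionTreeClassifier

pre_pruned = DecisionTreeClassifier(max_depth=6, min_samples_leaf=20)
pre_pruned.fit(X_tr, y_tr)
print('pre-pruned val accuracy:', pre_pruned.score(X_val, y_val))

# Post-pruning: grow the full tree, then cut it back with cost-complexity pruning.
# cost_complexity_pruning_path() returns candidate ccp_alpha values; larger alpha means a smaller tree.
path = DecisionTreeClassifier().cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(ccp_alpha=max(float(a), 0.0)).fit(X_tr, y_tr)
     for a in path.ccp_alphas[::10]),          # try a subsample of the candidate alphas
    key=lambda m: m.score(X_val, y_val),       # keep the tree that scores best on validation
)
print('post-pruned val accuracy:', best.score(X_val, y_val))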


Origin blog.csdn.net/m0_68165821/article/details/132248848