Kaggle in Practice (4): Cat-in-the-Dat-ii

The fourth project is relatively simple and interesting, because its data set consists entirely of categorical features. What should we do in that case? Here I share an easy-to-use model, CatBoost, and an encoding method, TargetEncoder, for handling categorical features. With these two, the data in this project can be processed and modeled conveniently and quickly.

Part1. Data Import

import numpy as np
import pandas as pd
import os
from sklearn.exceptions import ConvergenceWarning
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=ConvergenceWarning)


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# separate target, remove id and target
test_ids = test['id']
target = train['target']
train.drop(columns=['id', 'target'], inplace=True)
test.drop(columns=['id'], inplace=True)

train.head()

The output of train.head() shows that all 23 features are categorical, so they cannot be fed to a model directly. Handling them one by one would be tedious; instead we can use an efficient encoding method, TargetEncoder.

Part2. Data Processing

import category_encoders as ce

# fit the encoder on the training data only; the same fitted encoder
# is reused to transform the test set in Part 3
te = ce.TargetEncoder(cols=train.columns.values, smoothing=0.3).fit(train, target)

train = te.transform(train)
train.head()


TargetEncoder turns all the categorical features into numbers that can be modeled directly.

Multiple encoding methods in sklearn-category_encoders
TargetEncoder
Target encoding is a categorical-variable encoding method that uses not only the feature value itself but also the corresponding dependent variable. For classification problems, each category value is replaced by a blend of the posterior probability of the target given that category value and the prior probability of the target over all training data. For continuous targets, each category value is replaced by a blend of the expected target value given that category value and the expected target value over all training data. The method depends heavily on the distribution of the target, but it greatly reduces the number of features generated by the encoding: one numeric column per categorical column, unlike one-hot encoding.

Formula: for a category value i that occurs n_i times in the training set,

    S(i) = λ(n_i) · posterior(i) + (1 − λ(n_i)) · prior,   λ(n) = 1 / (1 + e^(−(n − k) / f))

where posterior(i) is the mean target among the rows with category value i, prior is the mean target over all training rows, and λ(n) is a weight that approaches 1 as the category count grows (k and f correspond to TargetEncoder's min_samples_leaf and smoothing parameters).
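
A toy, hand-computed illustration of the blend (a hypothetical 'color' column, not the competition data; simple additive smoothing stands in for the sigmoid weight above, but the idea is the same):

import pandas as pd

# toy data: one categorical column and a binary target
df = pd.DataFrame({'color': ['red', 'red', 'red', 'blue', 'blue', 'green'],
                   'y':     [1,     1,     0,     0,      1,      0]})

prior = df['y'].mean()                               # global mean target = 0.5
stats = df.groupby('color')['y'].agg(['mean', 'count'])

m = 2.0                                              # smoothing strength
# blend the per-category mean (posterior) with the global mean (prior),
# weighted by how often the category occurs
encoding = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
print(encoding)  # the rare 'green' (count 1) is pulled strongly toward the prior
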
This method can also easily cause overfitting, since the encoding leaks information about the target. The following techniques are used to prevent it:

① Increase the strength of the regularization (smoothing) term
② Add noise to the encoded column in the training set
③ Use cross-validation (see the sketch below)
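For ③, a minimal sketch of out-of-fold target encoding, where each fold is encoded by an encoder fit only on the other folds, so no row ever contributes to its own encoding (my own illustration, not from the original post):

import pandas as pd
from sklearn.model_selection import KFold
import category_encoders as ce

def oof_target_encode(X, y, n_splits=5, seed=289):
    # each fold is transformed by an encoder fit on the remaining folds,
    # which removes the direct target leakage of in-sample encoding
    X_enc = pd.DataFrame(index=X.index, columns=X.columns, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(X):
        te_fold = ce.TargetEncoder(cols=X.columns.values, smoothing=0.3)
        te_fold.fit(X.iloc[fit_idx], y.iloc[fit_idx])
        X_enc.iloc[enc_idx] = te_fold.transform(X.iloc[enc_idx]).values
    return X_enc

# usage with the train/target frames from Part 1:
# train = oof_target_encode(train, target)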

If you want to learn about other encoding methods, the link above covers them in more detail.

Part3. Data Modeling

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# split into training and hold-out validation sets
x_train, x_test, y_train, y_test = train_test_split(
    train, target,
    test_size=0.2,
    random_state=100
)

from catboost import CatBoostClassifier

best_params_cat = {
    'max_depth': 2,
    'n_estimators': 600,
    'random_state': 289,
    'verbose': 0
}

# predict the probability of the positive class
def predict(estimator, features):
    return estimator.predict_proba(features)[:, 1]

# compute the AUC score on the hold-out set
def auc(estimator):
    y_pred = predict(estimator, x_test)
    return roc_auc_score(y_test, y_pred)

SEARCH_NOW = False

# search for the best parameters (flip SEARCH_NOW to True to run it)
if SEARCH_NOW:
    params = {
        'max_depth': [2, 3, 4, 5],
        'n_estimators': [50, 100, 200, 400, 600],
        'random_state': [289],
        'verbose': [0]
    }
    # make_search is not defined in the post; a possible sketch is given below
    best_params_cat = make_search(CatBoostClassifier(), params)

# build and fit the model with the chosen parameters
cat = CatBoostClassifier()
cat.set_params(**best_params_cat)
cat.fit(x_train, y_train)
print('roc auc = %.4f' % auc(cat))
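
Note that make_search is never defined in the original post. A minimal sketch of what it might look like, assuming it simply wraps sklearn's GridSearchCV and returns the best parameter dict:

from sklearn.model_selection import GridSearchCV

def make_search(estimator, param_grid, cv=3, scoring='roc_auc'):
    # hypothetical helper: exhaustive grid search over param_grid,
    # scored by AUC, returning the best parameter combination
    search = GridSearchCV(estimator, param_grid, cv=cv, scoring=scoring, n_jobs=-1)
    search.fit(x_train, y_train)
    print('best cv score = %.4f' % search.best_score_)
    return search.best_params_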

The model used here is CatBoost; if you want to learn more, you can follow the link below.
CatBoost principle and practice
Simply put, CatBoost is a gradient boosting library that handles categorical features particularly well.
It also helps counter the overfitting that TargetEncoder tends to introduce, reducing its impact on our results.
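As an illustration of that native handling (a sketch, not part of the original pipeline), the raw string columns can be passed straight to CatBoost via cat_features, with no manual encoding at all:

from catboost import CatBoostClassifier

# re-read the raw frame, since `train` was already target-encoded above;
# .astype(str) is used because CatBoost expects string/int categoricals
# (it also turns NaN into the literal string 'nan')
raw_train = pd.read_csv('train.csv').drop(columns=['id', 'target']).astype(str)
native = CatBoostClassifier(max_depth=2, n_estimators=600, verbose=0)
native.fit(raw_train, target, cat_features=list(raw_train.columns))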

To use all samples for training, CatBoost provides a solution: first randomly permute all samples; then, when converting a sample's categorical value to a number, use only the average label of the samples that precede it in the permutation and share the same category value, blended with a prior and a prior weight coefficient. The formula looks like this:

    encoded(p) = ( Σ_{j<p, x_j = x_p} y_j + a · P ) / ( #{j < p : x_j = x_p} + a )

where the sum and the count run over the preceding samples that share sample p's category value, P is a prior (for example, the mean target), and a > 0 is the prior weight coefficient.
This approach reduces the noise introduced by low-frequency category values.
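
A toy sketch of the idea (an illustration of the formula above, not CatBoost's actual implementation):

import numpy as np

def ordered_target_stats(values, y, prior, a=1.0, seed=0):
    # encode each sample using only the samples that precede it in a
    # random permutation, blended with a prior P of weight a
    rng = np.random.default_rng(seed)
    sums, counts = {}, {}
    encoded = np.empty(len(values))
    for pos in rng.permutation(len(values)):
        v = values[pos]
        encoded[pos] = (sums.get(v, 0.0) + a * prior) / (counts.get(v, 0) + a)
        sums[v] = sums.get(v, 0.0) + y[pos]
        counts[v] = counts.get(v, 0) + 1
    return encoded

colors = np.array(['red', 'blue', 'red', 'red', 'blue'])
labels = np.array([1, 0, 0, 1, 1])
print(ordered_target_stats(colors, labels, prior=labels.mean()))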

# predict on the test set (transform it with the same fitted encoder first)
test = te.transform(test)
pre = cat.predict_proba(test)[:, 1]

# save the submission file
res = pd.DataFrame()
res['id'] = test_ids
res['target'] = pre
res.to_csv('submission.csv', index=False)

Part4. Summary

This project is not very difficult; I mainly wanted to share two little helpers for data sets that consist only of categorical features. I hope this is useful to everyone. The reprinted parts of the article link to the original text, where you can find more detail. Thank you all for reading!

Origin: blog.csdn.net/kiligso/article/details/108700696