4. Categorical Encoding with CatBoost Encoder

These are notes on the CatBoost Encoder for feature encoding, based on "Categorical Encoding with CatBoost Encoder".

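Most machine learning algorithms require numeric input, so non-numeric data such as categories must first be converted to numbers. There are many ways to do this; the CatBoost encoder uses target-based categorical encoding.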

The encoded value is computed as:

$$\frac{\text{TargetCount} + \text{prior}}{\text{FeatureCount} + 1} \tag{1}$$

where:

  1. $\text{TargetCount}$: the sum of the target values over all rows that have the given category.
  2. $\text{prior}$: the sum of the target values over the whole dataset divided by the total number of observations, i.e. the global target mean.
  3. $\text{FeatureCount}$: the number of times the given category appears in the whole dataset.
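
Written with explicit sums, formula (1) can be restated as below. The symbols $x_i$ (category of row $i$), $y_i$ (target of row $i$), $n$ (number of rows) and $c$ (the category being encoded) are notation introduced here for illustration; they do not appear in the source.

```latex
\text{enc}(c)
  = \frac{\displaystyle\sum_{i\,:\,x_i = c} y_i \;+\; \frac{1}{n}\sum_{i=1}^{n} y_i}
         {\bigl|\{\, i : x_i = c \,\}\bigr| + 1}
```

The first sum in the numerator is TargetCount, the second term is the prior, and the set size in the denominator is FeatureCount.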

For example, take color = ["red", "blue", "blue", "green", "red", "red", "black", "black", "blue", "green"] and a target column with values target = [1, 2, 3, 2, 3, 1, 4, 4, 2, 3].

Here the target values sum to 25 over 10 observations, so prior = 25/10 = 2.5.

For "red", $\text{TargetCount} = 1 + 3 + 1 = 5$, and "red" appears 3 times in the feature, so $\text{FeatureCount} = 3$. The encoded value for this category is therefore (5 + 2.5)/(3 + 1) = 1.875.
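
The arithmetic for the color example can be checked with a few lines of plain Python. This is only a minimal sketch of formula (1), not the library's implementation; the function name `target_encode` is hypothetical and chosen here for illustration.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Return {category: (TargetCount + prior) / (FeatureCount + 1)}.

    Hypothetical helper that applies formula (1) directly.
    """
    prior = sum(targets) / len(targets)   # global target mean
    target_count = defaultdict(float)     # sum of targets per category
    feature_count = defaultdict(int)      # occurrences per category
    for cat, y in zip(categories, targets):
        target_count[cat] += y
        feature_count[cat] += 1
    return {
        cat: (target_count[cat] + prior) / (feature_count[cat] + 1)
        for cat in feature_count
    }


color = ["red", "blue", "blue", "green", "red",
         "red", "black", "black", "blue", "green"]
target = [1, 2, 3, 2, 3, 1, 4, 4, 2, 3]

encoded = target_encode(color, target)
for cat, value in encoded.items():
    print(cat, value)
```

For "red" this gives 1.875, matching the hand calculation; the other categories follow the same pattern.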

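The encoder is available in the category_encoders package, which can be installed with pip install category_encoders. The package's official documentation covers many other encoding schemes as well.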

import category_encoders as ce
import pandas as pd
  
# Make dataset
train = pd.DataFrame({
    'color': ["red", "blue", "blue", "green", "red",
              "red", "black", "black", "blue", "green"],

    'interests': ["sketching", "painting", "instruments",
                  "sketching", "painting", "video games",
                  "painting", "instruments", "sketching",
                  "sketching"],

    'height': [68, 64, 87, 45, 54, 64, 67, 98, 90, 87],

    'grade': [1, 2, 3, 2, 3, 1, 4, 4, 2, 3],
})
  
# Define train and target
target = train[['grade']]
train = train.drop('grade', axis = 1)
  
# Define catboost encoder
cbe_encoder = ce.cat_boost.CatBoostEncoder()
  
# Fit encoder and transform the features
cbe_encoder.fit(train, target)
train_cbe = cbe_encoder.transform(train)
print(train_cbe)
  
# fit_transform() combines fit() and transform() in one call.
# Note that it also passes the target to transform(), so the
# encoded values may differ from the output printed below:
# train_cbe = cbe_encoder.fit_transform(train, target)
==========================================
  color  interests  height
0  1.875   2.100000      68
1  2.375   2.875000      64
2  2.375   3.166667      87
3  2.500   2.100000      45
4  1.875   2.875000      54
5  1.875   2.500000      64
6  3.500   2.875000      67
7  3.500   3.166667      98
8  2.375   2.100000      90
9  2.500   2.100000      87
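
As a cross-check, the printed interests column can be compared with formula (1) in plain Python. The sketch below copies the printed values into a dictionary by hand (the names `printed` and `report` are hypothetical). Three of the four categories agree with the formula. "video games", which appears only once, is printed as 2.5 (the prior), whereas formula (1) would give 1.75. This suggests the library falls back to the global mean for categories seen only once; that reading is an assumption drawn from the output, not something the source states.

```python
interests = ["sketching", "painting", "instruments",
             "sketching", "painting", "video games",
             "painting", "instruments", "sketching",
             "sketching"]
grade = [1, 2, 3, 2, 3, 1, 4, 4, 2, 3]

# Values copied by hand from the printed output above.
printed = {"sketching": 2.1, "painting": 2.875,
           "instruments": 3.166667, "video games": 2.5}

prior = sum(grade) / len(grade)  # global target mean, 2.5

report = {}
for level in sorted(set(interests)):
    ys = [y for x, y in zip(interests, grade) if x == level]
    by_formula = (sum(ys) + prior) / (len(ys) + 1)  # formula (1)
    # Tolerance covers the 6-decimal rounding of the printed values.
    matches = abs(by_formula - printed[level]) < 1e-6
    report[level] = (len(ys), by_formula, matches)

for level, (count, by_formula, matches) in report.items():
    print(f"{level}: count={count}, formula={by_formula:.6f}, "
          f"printed={printed[level]}, match={matches}")
```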

[mark] 11-categorical-encoders-and-benchmark

Origin blog.csdn.net/weixin_39754630/article/details/120227770