4. Categorical Encoding with CatBoost Encoder
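These are notes on feature encoding with the CatBoost Encoder, based on the article Categorical Encoding with CatBoost Encoder.
Most machine learning algorithms require numeric input, so non-numeric categorical features have to be converted to numbers. There are many ways to do this; CatBoost uses target-based categorical encoding.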
The encoded value is computed as:
$$
\frac{\text{TargetCount} + \text{prior}}{\text{FeatureCount} + 1} \tag{1}
$$
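where:
- $\text{TargetCount}$: the sum of the target values over all rows that have the given category value.
- $\text{prior}$: the sum of the target values over the whole dataset divided by the number of observations, i.e. the mean target.
- $\text{FeatureCount}$: the number of times the given category value occurs in the whole dataset.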
For example, take the feature color = ["red", "blue", "blue", "green", "red", "red", "black", "black", "blue", "green"] and the target column target = [1, 2, 3, 2, 3, 1, 4, 4, 2, 3].
Here the targets sum to 25 over 10 observations, so prior = 25/10 = 2.5.
For "red", $\text{TargetCount} = 1 + 3 + 1 = 5$, and "red" occurs 3 times in the feature, so $\text{FeatureCount} = 3$. The encoded value for this category is therefore (5 + 2.5)/(3 + 1) = 1.875.
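The hand calculation above can be checked for every colour with a few lines of plain Python. This is only a minimal sketch of formula (1): the helper name `target_encode` is hypothetical (it is not part of CatBoost or category_encoders), and it does not try to reproduce any extra behaviour a library implementation may have.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Hypothetical helper: apply formula (1) to each distinct category."""
    prior = sum(targets) / len(targets)   # mean target over the whole dataset
    target_count = defaultdict(float)     # TargetCount: sum of targets per category
    feature_count = defaultdict(int)      # FeatureCount: occurrences per category
    for cat, y in zip(categories, targets):
        target_count[cat] += y
        feature_count[cat] += 1
    return {
        cat: (target_count[cat] + prior) / (feature_count[cat] + 1)
        for cat in feature_count
    }


color = ["red", "blue", "blue", "green", "red",
         "red", "black", "black", "blue", "green"]
target = [1, 2, 3, 2, 3, 1, 4, 4, 2, 3]

encoding = target_encode(color, target)
print(encoding)
# {'red': 1.875, 'blue': 2.375, 'green': 2.5, 'black': 3.5}
```

The value for "red" matches the 1.875 derived by hand; the other colours follow from the same formula with their own sums and counts.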
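To use this encoder, install the library with pip install category_encoders. The library also provides many other encoding methods (see its official documentation):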
import category_encoders as ce
import pandas as pd
# Make dataset
train = pd.DataFrame({
    'color': ["red", "blue", "blue", "green", "red",
              "red", "black", "black", "blue", "green"],
    'interests': ["sketching", "painting", "instruments",
                  "sketching", "painting", "video games",
                  "painting", "instruments", "sketching",
                  "sketching"],
    'height': [68, 64, 87, 45, 54, 64, 67, 98, 90, 87],
    'grade': [1, 2, 3, 2, 3, 1, 4, 4, 2, 3],
})
# Define train and target
target = train[['grade']]
train = train.drop('grade', axis = 1)
# Define catboost encoder
cbe_encoder = ce.cat_boost.CatBoostEncoder()
# Fit encoder and transform the features
cbe_encoder.fit(train, target)
train_cbe = cbe_encoder.transform(train)
print(train_cbe)
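# fit_transform() combines fit() and transform() in one call:
# train_cbe = cbe_encoder.fit_transform(train, target)
# Note: fit_transform() also passes the target to transform(), so on the
# training data it may return ordered (row-by-row) statistics that differ
# from the output printed below.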
==========================================
   color  interests  height
0  1.875   2.100000      68
1  2.375   2.875000      64
2  2.375   3.166667      87
3  2.500   2.100000      45
4  1.875   2.875000      54
5  1.875   2.500000      64
6  3.500   2.875000      67
7  3.500   3.166667      98
8  2.375   2.100000      90
9  2.500   2.100000      87