7 ways to handle categorical variables in tree models (Python code)

In data mining projects, features can be divided into two types: ordered continuous numerical features and unordered categorical features.

For boosting tree models such as XGBoost and GBDT, the base learner is usually a CART regression tree, whose input only supports continuous numerical types. Continuous variables such as age and income are handled well by CART, but unordered categorical variables (such as occupation or region) are more troublesome: naively enumerating every possible grouping of the category values is combinatorially expensive (a feature with k values has 2^(k-1) - 1 possible binary partitions), so finding a good split point on a categorical feature is not easy.

In this article, I will walk through the common ways tree models handle categorical features. Discussion is welcome.


1. One-hot encoding

We can directly one-hot encode categorical features (this is also the most common practice): each category value is represented by a single 0/1 indicator column. For example, a "Gender" feature can be converted into the indicator columns "Male", "Female", and "Other":

display(df.loc[:, ['Gender_Code']].head())

# one-hot encode the Gender_Code column
pd.get_dummies(df['Gender_Code']).head()

However, the major disadvantage of one-hot encoding is that, for categorical features with many distinct values, it produces high-dimensional sparse features, which easily lead tree models to overfit: as mentioned earlier, once a split condition on such a sparse indicator is met, the tree tends to grow deeper, and that split may just be fitting noise (i.e. overfitting).

Recommendations for use: one-hot encoding is a natural fit for neural networks, which can easily learn dense low-dimensional representations from high-dimensional sparse inputs. With tree models, one-hot encoding can still surface the important interaction features when the number of category values is small, but when it is large (say, greater than 100) it easily causes overfitting, so one-hot + tree model is not recommended. (Note: one-hot encoding also increases memory and training-time overhead.)

2. Ordinal encoding (OrdinalEncoder)

OrdinalEncoder is also called ordinal encoding (label encoding is essentially the same thing); it converts each feature/label value to an ordinal integer (0 to n_categories - 1).

Usage suggestion: it is suitable for ordinal features, i.e. categorical features that have an inherent order. For example, clothing sizes "S", "M", "L" are naturally encoded as increasing integers.

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df[col] = encoder.fit_transform(df[col])  # fit on the column, then transform it
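For a truly ordinal feature, sklearn's OrdinalEncoder lets you supply the order explicitly rather than relying on alphabetical order. A minimal sketch, assuming a hypothetical 'Size' column:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df_size = pd.DataFrame({'Size': ['M', 'S', 'L', 'M']})  # hypothetical column
# pass the category order explicitly so S < M < L maps to 0 < 1 < 2
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
df_size['Size_enc'] = encoder.fit_transform(df_size[['Size']]).ravel()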

3. Target encoding

Target encoding, also known as mean encoding, uses the label information associated with each category value: for binary classification, each value of the categorical feature is simply replaced by the mean of the "0/1" labels of the samples taking that value. It is a commonly used supervised encoding method (alongside the classic WoE encoding) and is very suitable for weak models such as logistic regression.

Recommendations for use: when a tree model uses target encoding, some regularization is needed to reduce conditional shift (the encoding becomes biased when the data structure and distribution of the training set differ from those of the test set). The mainstream remedies are CatBoost encoding, or computing the target mean (or a Bayesian-smoothed mean) with cross-validation.


# A simple target-mean encoding example; alternatively: from category_encoders import TargetEncoder

target_encode_columns = ['Gender_Code']
target = ['y']

target_encode_df = score_df[target_encode_columns + target].reset_index(drop=True)
target_name = target[0]
target_df = pd.DataFrame()
for embed_col in target_encode_columns:
    # map each category value to the mean target of the samples taking that value
    val_map = target_encode_df.groupby(embed_col)[target].mean().to_dict()[target_name]
    target_df[embed_col] = target_encode_df[embed_col].map(val_map).values

# replace the raw categorical columns with their target-encoded versions
score_target_drop = score_df.drop(target_encode_columns, axis=1).reset_index(drop=True)
score_target = pd.concat([score_target_drop, target_df], axis=1)
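The snippet above uses the raw per-category mean, which leaks the target and overfits rare categories. A minimal smoothing sketch, shrinking each category toward the global mean (the column names and the smoothing strength m are illustrative):

prior = score_df['y'].mean()  # global target mean
stats = score_df.groupby('Gender_Code')['y'].agg(['mean', 'count'])
m = 10  # smoothing strength: higher m pulls rare categories toward the prior
smoothed = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
score_df['Gender_Code_te'] = score_df['Gender_Code'].map(smoothed)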

4. CatBoost encoding (CatBoostEncoder)

CatBoostEncoder implements the way the CatBoost model handles categorical variables (Ordered Target Statistics encoding), which builds on target encoding while reducing conditional shift. Its calculation formula is:

encoding = (TargetCount + prior) / (FeatureCount + 1)

  • TargetCount: the sum of the target values for the given value of the categorical feature

  • prior: the sum of the target values over the entire dataset, divided by the total number of observations

  • FeatureCount: the number of times the given feature value has been observed in the dataset

from category_encoders import CatBoostEncoder

CBE_encoder = CatBoostEncoder()
train_cbe = CBE_encoder.fit_transform(train[feature_list], target)
test_cbe = CBE_encoder.transform(test[feature_list])
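To make the "ordered" part concrete, here is a small hand-rolled sketch of the formula above: each row is encoded using only the rows that come before it. This illustrates the idea, not CatBoost's exact implementation (which also averages over random permutations):

import pandas as pd

s = pd.DataFrame({'cat': ['a', 'a', 'b', 'a'], 'y': [1, 0, 1, 1]})
prior = s['y'].mean()
# TargetCount / FeatureCount computed over the preceding rows only
target_count = s.groupby('cat')['y'].cumsum() - s['y']
feature_count = s.groupby('cat').cumcount()
s['cat_cbe'] = (target_count + prior) / (feature_count + 1)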

5. Count encoding (CountEncoder)

Also known as frequency encoding: each value of the categorical feature is replaced by its frequency of occurrence in the training set. The intuition is that the frequency of a category value gives the model a basis for separating high-frequency from low-frequency categories; whether this actually helps still depends on the business and the scenario.

## Alternatively: from category_encoders import CountEncoder

bm = []
tmp_df = train_df
for k in catefeas:
    t = pd.DataFrame(tmp_df[k].value_counts(dropna=True, normalize=True))  # frequency of each value
    t.columns = [k + 'vcount']
    bm.append(t)
for k, j in zip(catefeas, range(len(catefeas))):  # join the frequency encodings back onto df
    df = df.merge(bm[j], left_on=k, right_index=True, how='left')
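For reference, a more compact equivalent using value_counts plus map (same assumed catefeas / train_df / df as above):

# map each value to its training-set frequency; values unseen in training become NaN
for k in catefeas:
    freq = train_df[k].value_counts(dropna=True, normalize=True)
    df[k + 'vcount'] = df[k].map(freq)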

6. Neural network embedding

When a categorical feature has a large number of values (so one-hot encoding is very high-dimensional), using one-hot directly performs poorly in both efficiency and effect. A good alternative is a neural network embedding: feed the one-hot encoded categorical variable into a neural network and learn a low-dimensional dense vector, e.g. with the classic unsupervised word-vector method word2vec, or with a supervised neural network encoder.

Recommendations for use: this is especially suitable for categorical variables with very many values, which are high-dimensional and sparse after one-hot encoding; the low-dimensional NN representation is then fed into the tree model.

# word2vec
from gensim.models import word2vec

# load data
raw_sentences = ["the quick brown fox jumps over the lazy dogs",
                 "yoyoyo you go home now to sleep"]
# tokenize
sentences = [s.split() for s in raw_sentences]

# build the model: 10-dimensional word vectors
# (gensim >= 4 uses vector_size; older versions use size=10)
model = word2vec.Word2Vec(sentences, vector_size=10, min_count=1)

# the learned vector for each word
model.wv['dogs']
# array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
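For the supervised variant, an Embedding layer trained against the label plays the same role. A minimal Keras sketch, assuming TensorFlow is installed (n_categories and the 8-dimensional embedding size are illustrative):

import tensorflow as tf

n_categories, emb_dim = 100, 8                       # illustrative sizes
inp = tf.keras.Input(shape=(1,))                     # integer-encoded category id
emb = tf.keras.layers.Embedding(n_categories, emb_dim)(inp)
out = tf.keras.layers.Dense(1, activation='sigmoid')(tf.keras.layers.Flatten()(emb))
model = tf.keras.Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy')
# after model.fit(x_int, y), the learned category vectors live in:
# model.layers[1].get_weights()[0]  # shape (n_categories, emb_dim)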

7. LightGBM native categorical feature handling

To address the shortcomings of one-hot encoding (one-vs-many splits) for categorical features, LightGBM adopts a many-vs-many splitting scheme. In short, it numerically encodes each category value (similarly to target encoding) and searches for split points on the encoded values, achieving better partitioning of the set of category values.

  • The algorithm in more detail:

1. If the number of distinct feature values is at most 4 (parameter max_cat_to_onehot): one-hot them directly, scan each bin one by one, and find the best split point;

2. If the number of distinct values is greater than 4: max bin defaults to 256, and when the number of values exceeds the number of bins, the less frequent values are filtered out. Then, for each remaining feature value, accumulate the sum of first-order gradients and the sum of second-order gradients of its samples, and use (sum of first-order gradients) / (sum of second-order gradients + regularization coefficient) as the encoding of that value. After the categories are converted to numerical encodings, sort them in descending order and traverse the histogram to find the optimal split point.

Briefly, LightGBM uses gradient statistics to encode categorical features. My personal understanding is that this lets category values be grouped by learning difficulty. For example, suppose a feature takes the five values [wolf, dog, cat, pig, rabbit], and the samples under [wolf, dog] are hard to classify (the gradients under those values are large); after gradient encoding, a better split point might be found at [wolf, dog] | vs | [cat, pig, rabbit].
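A toy sketch of the per-category statistic LightGBM sorts by (the numbers are made up; this illustrates the ordering idea, not LightGBM's internals):

# hypothetical per-category sums of 1st- and 2nd-order gradients
grad_sum = {'wolf': 5.2, 'dog': 4.8, 'cat': 1.1, 'pig': 0.9, 'rabbit': 1.0}
hess_sum = {'wolf': 3.0, 'dog': 2.8, 'cat': 2.5, 'pig': 2.0, 'rabbit': 2.2}
lam = 1.0  # regularization coefficient

# encode each category as grad_sum / (hess_sum + lambda), then sort descending
score = {k: grad_sum[k] / (hess_sum[k] + lam) for k in grad_sum}
order = sorted(score, key=score.get, reverse=True)
print(order)  # ['wolf', 'dog', ...]; LightGBM scans this ordering for the best split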


Recommendations for use: LightGBM's native categorical handling usually works better than one-hot encoding, and it is also convenient to use.

# LightGBM native handling: simply cast the columns to 'category' dtype and train.
from lightgbm import LGBMClassifier

for ft in category_list:
    train_x[ft] = train_x[ft].astype('category')
clf = LGBMClassifier(**best_params)
clf.fit(train_x, train_y)

Experience summary

  • For categorical features with few distinct values (< 10), where each value also has a reasonably large number of samples, one-hot encoding can be used directly.

  • For medium cardinality (10 to a few hundred values), one-hot is less efficient and less effective than LightGBM's gradient encoding or CatBoost's target encoding, both of which are also very convenient to use directly. (Note: in my experience, these two methods still overfit fairly easily when there are many category values; in that case the model tends to work better if you first merge category values based on domain knowledge, or try dropping certain categorical features.)

  • When there are hundreds or thousands of category values, you can first one-hot encode (yielding high-dimensional sparse vectors) and then use a neural network model to learn a low-dimensional dense representation.

These are the main encoding methods for categorical features used with tree models.


Origin: blog.csdn.net/weixin_38037405/article/details/124298560