Detailed explanation of CatBoost principles

Table of contents

1. Main features:

1 Categorical variable encoding-Ordered Target Statistics

2 Text variable encoding

3 Categorical feature crossing-FM

4 Unbiased Boosting-Ordered Boosting

5 Symmetric trees as the base model to speed up computation

6 Missing value handling

2. Detailed explanation of principles

1 Categorical variable processing-Ordered Target Statistics

2 Text variable processing

3 Feature crossing-FM

4 Unbiased Boosting-Ordered Boosting

5 Symmetric trees

6 Missing value processing

3. Official website examples

1 CatBoostRegressor:

2 CatBoostClassifier with categorical variables

3 Using the Pool data type

4. Detailed explanation of CatBoost parameters with hands-on practice


        For categorical variables, xgb requires the user to encode them before they can be fed into the model; lgb simplifies this step considerably, only requiring the corresponding columns to be converted to the category dtype, or the categorical column names to be specified, before training; catboost goes one step further: it not only has categorical-variable encoding built in, but also comes with categorical feature crossing and adds handling for text data. This article explains catboost in detail and in plain language, so that the whole article is easy to follow and helps the reader master the principles.

Official website documentation: https://catboost.ai/en/docs/

1. Main features:

1 Categorical variable encoding-Ordered Target Statistics

2 Text variable encoding

3 Categorical feature crossing-FM

4 Unbiased Boosting-Ordered Boosting

5 Symmetric trees as the base model to speed up computation

6 Missing value handling

2. Detailed explanation of principles

1 Categorical variable processing-Ordered Target Statistics

        Target Statistics is a very commonly used monotonic encoding method for categorical features. Taking a music-genre feature (with values such as rock, indie, pop) as an example, the encoding of rock is the average bad_rate of the rock samples (the mean of the y label, i.e. the proportion of y = 1 samples), so that the larger the bad_rate of a feature value, the larger its encoding.

         Target Statistics encoding has a small problem: when some feature value of a categorical feature has very few samples, the encoding can be badly distorted. In the extreme case, suppose there is only one pop sample and its label is 1, so the encoding of pop is 1. When we then split the data into training and test sets, the training-set encoding of pop is 1, while the test set contains several pop samples whose actual bad_rate is far below 1, so the encoding overfits a handful of training samples. To address this, the bad_rate of the overall data set can be added to the formula as a prior, which smooths the encoding and makes it more stable.
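A tiny numerical sketch of the smoothing effect just described (the prior 0.05 and the single-sample pop case are the toy numbers used later in this article; the exact formula catboost uses is given in the next subsection):

```python
# Toy numbers: "pop" appears once in the training data with y = 1,
# while the overall positive rate (prior) of the data set is 0.05.
prior = 0.05

# Plain Target Statistics: mean of y over the single pop sample
plain_encoding = 1 / 1                       # 1.0 -- far above the true pop rate

# Smoothed with the prior (same shape as the formula in the next subsection)
smoothed_encoding = (1 + prior) / (1 + 1)    # 0.525 -- pulled back toward the prior

print(plain_encoding, smoothed_encoding)
```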

The Ordered Target Statistics method is easiest to explain clearly and intuitively with an example.

(1) The feature fn is music genre, with values rock, indie, pop, etc. The goal is to convert this categorical column into a numerical one.

(2) Shuffle the data set randomly, generating several randomly permuted copies of the data set (4 by default).

(3) Suppose that at this point the mean label (proportion of y = 1) of the rock, indie and pop samples over all samples is 0.05. The categorical variable is encoded into a numerical one with the following formula:

avg_target = (countInClass + prior) / (totalCount + 1)

        countInClass: in the current permutation, the number of samples ranked before the current sample that share its feature value and its y label. In the example, among the samples before the 4th sample there are 2 rock samples, and 1 of them also has the same y label, so countInClass = 1.

        prior: the prior value, which for a binary target corresponds to the overall label mean (the bad_rate in risk control); here it is 0.05.

        totalCount: in the current permutation, the number of samples ranked before the current sample that share its feature value; for the 4th sample there are 2 earlier rock samples, so totalCount = 2.

        Putting this together, for the 4th sample (Object = 4) in the example, avg_target = (1 + 0.05) / (2 + 1) = 0.35.

        As the example shows, samples with the same feature value and the same y label can receive very different encodings depending on their position, e.g. the 1st and the 7th samples. The encodings of the samples near the front of a permutation are the least reliable, because only a few samples precede them to serve as the basis for the encoding.

        To address this problem, catboost shuffles the data several times (mentioned in step 2) and generates several randomly permuted data sets; each sub-model randomly picks one permutation for its encoding, which alleviates the unreliable encoding of the early samples.
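To make the procedure concrete, here is a minimal pure-Python sketch of Ordered Target Statistics on a toy genre column (the data, prior and single permutation are made up for illustration; catboost performs this internally and over several permutations):

```python
import random

# Toy data: one categorical column (music genre) and a binary label
genres = ['rock', 'indie', 'rock', 'rock', 'pop', 'indie', 'rock']
labels = [1, 0, 0, 1, 1, 0, 1]
prior = 0.05  # overall positive rate, used as the prior

def ordered_target_statistics(values, y, prior, seed=0):
    """Encode one categorical column with Ordered Target Statistics:
    each sample only 'sees' the samples placed before it in a random permutation."""
    rng = random.Random(seed)
    order = list(range(len(values)))
    rng.shuffle(order)                # one permutation; catboost draws several

    count_in_class = {}               # value -> earlier samples with this value and y = 1
    total_count = {}                  # value -> earlier samples with this value
    encoded = [0.0] * len(values)

    for idx in order:
        v = values[idx]
        encoded[idx] = (count_in_class.get(v, 0) + prior) / (total_count.get(v, 0) + 1)
        # update the statistics only after encoding, so a sample never sees its own label
        total_count[v] = total_count.get(v, 0) + 1
        count_in_class[v] = count_in_class.get(v, 0) + y[idx]
    return encoded

print(ordered_target_statistics(genres, labels, prior))
```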

2 Text variable processing

(1) Tokenization - treat the raw text as a string and split it on whitespace (taking English space-separated text as an example)

(2) Generate dictionary:

The token type can be letter-level (Letter) or word-level (Word).

        For the text "abra cadabra", letter-level tokenization gives {a, b, c, d, r}; word-level tokenization gives {'abra', 'cadabra'}.

        For the whole text column, all unique tokens obtained after tokenization form a dictionary, and each token in the dictionary is encoded by its index.

(3) String to numeric encoding

For a text data column:

Tokenize on spaces

Aggregate the tokens to build the dictionary

Encode each text as the indices of its tokens in the dictionary
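A minimal pure-Python sketch of these three steps on the "abra cadabra" toy example (just to make the dictionary and index encoding concrete; catboost does this internally):

```python
text_column = ["abra cadabra", "abra abra"]

# (1) word-level tokenization on spaces
tokenized = [text.split(" ") for text in text_column]   # [['abra', 'cadabra'], ['abra', 'abra']]

# letter-level tokenization would instead use the set of characters
letters = sorted(set("".join(text_column).replace(" ", "")))   # ['a', 'b', 'c', 'd', 'r']

# (2) collect all unique words into a dictionary: word -> index
dictionary = {w: i for i, w in enumerate(sorted({w for doc in tokenized for w in doc}))}
# {'abra': 0, 'cadabra': 1}

# (3) encode each text as the indices of its words
encoded = [[dictionary[w] for w in doc] for doc in tokenized]
print(encoded)   # [[0, 1], [0, 0]]
```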

(4) Numerical features derived from text

The derived numerical features build on the tokenization results and support the following forms:

a. Bag of words: Boolean features indicating whether each token (token id) occurs in the text; the number of generated features equals the dictionary size.

b. top_tokens_count: only the n most frequent tokens are turned into the corresponding Boolean features; the number of generated features equals n.

c. Naive Bayes: a multinomial Naive Bayes model; the number of generated features equals the number of classes of the y label. To avoid leaking y-label information (target leakage), this model is computed online over several data permutations (similar to how the CTR encodings are estimated).

d. BM25: a ranking function used by search engines to estimate the relevance of documents. To avoid target leakage, this model is also computed online over several data permutations (similar to CTR estimation).
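As a hedged usage sketch (the toy data here is invented): the text column is declared via text_features, and the calcers above are requested through the feature_calcers parameter, whose names follow the official docs (BoW, NaiveBayes, BM25):

```python
from catboost import CatBoostClassifier, Pool

# Toy data: column 0 is free text, column 1 is numeric
train_data = [["great melodic guitars", 23],
              ["noisy and repetitive", 31],
              ["catchy pop chorus", 27],
              ["dull boring filler", 40]]
train_labels = [1, 0, 1, 0]

train_pool = Pool(train_data, label=train_labels, text_features=[0])

model = CatBoostClassifier(
    iterations=10,
    # which numeric features to derive from the text column
    feature_calcers=['BoW:top_tokens_count=100', 'NaiveBayes', 'BM25'],
    verbose=False,
)
model.fit(train_pool)
```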

3 Feature crossing-FM

        As the name suggests, catboost's main contribution lies in categorical variables: it not only has categorical-variable encoding built in, but also supports crossing categorical features to generate higher-order cross features (crosses of arbitrary order are supported, and the maximum order can be limited via a parameter).

        When a catboost subtree is split, the split feature of the root node is chosen from the original numerical variables and the encoded categorical variables; in the subsequent splits, the split feature is chosen from all variables, including the cross features.

Example: a second-order cross feature of music style and music genre
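A minimal sketch of capping the cross order, assuming max_ctr_complexity (the parameter documented for limiting how many categorical features are combined) is the knob you want:

```python
from catboost import CatBoostClassifier

# Limit automatically generated categorical feature combinations to
# second-order crosses (e.g. music style x music genre).
model = CatBoostClassifier(iterations=100,
                           max_ctr_complexity=2,
                           verbose=False)
```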

4 Unbiased Boosting-Ordered Boosting

        As an introduction, the idea of blending (model fusion) goes roughly like this: in the first layer, n models are built on the train1 data set; these models are then used to score another batch of samples, train2, yielding n groups of model-output features. The second layer trains a model on the train2 data set (whose feature set now includes the n groups of model-output features). The point of switching data sets for the second-layer model is to prevent both layers from over-learning the same training data, which would cause overfitting.

        Under serial training, when xgb and lgb train the later trees, they compute the first-order and second-order gradients of the model on all samples and use them to fit the next tree. The problem is that the results of subtrees trained on a batch of samples are then used again for the residual (gradient) training on that same batch, so there is a risk of over-learning the same training data and overfitting.

        Each subtree trained by catboost uses one of the randomly permuted data sets. For a single sample, only the samples ranked before it are used to train the subtree, and that model is then used to compute the sample's first-order and second-order gradients for building the later trees. Under this scheme the gradient estimation bias is reduced. The corresponding parameter is boosting_type, with values Ordered (ordered gradient boosting) and Plain (classic gradient boosting).
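A minimal usage sketch of the boosting_type parameter just mentioned (all other settings left at their defaults):

```python
from catboost import CatBoostClassifier

# Ordered: ordered boosting, less prone to the gradient bias described above (slower)
# Plain: the classic gradient boosting scheme (faster)
ordered_model = CatBoostClassifier(iterations=100, boosting_type='Ordered', verbose=False)
plain_model = CatBoostClassifier(iterations=100, boosting_type='Plain', verbose=False)
```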

5 Symmetric trees

        catboost uses symmetric (oblivious) decision trees as subtrees: at every level of the tree, all nodes use the same split feature and split threshold, so the left and right branches are mirror images of each other. On the one hand this helps avoid overfitting to some extent, and on the other hand it speeds up prediction.
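The tree-growing scheme is exposed through the grow_policy parameter (not named in the text above; values per the official docs), so the symmetric behaviour can be made explicit or switched off. A minimal sketch:

```python
from catboost import CatBoostClassifier

# 'SymmetricTree' (the default) builds the oblivious/symmetric trees described above;
# 'Depthwise' and 'Lossguide' grow ordinary non-symmetric trees instead.
model = CatBoostClassifier(iterations=100,
                           grow_policy='SymmetricTree',
                           depth=6,
                           verbose=False)
```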

6 Missing value processing

        During xgb training, a missing value is tried in both the left child and the right child, the information gain is computed for each direction, and the direction with the larger gain is kept (see the earlier article for details). Unlike xgb, catboost has three modes for handling missing values:

(1) Forbidden: missing values are not allowed; training a catboost model on a data set that contains missing values raises an error.

(2) Min: missing values are treated as the minimum value of the column (smaller than all other values), which lets the tree separate missing values from the rest.

(3) Max: missing values are treated as the maximum value of the column (larger than all other values), which lets the tree separate missing values from the rest.
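These three modes map to the nan_mode parameter (name taken from the official docs); a minimal sketch, assuming numeric features with NaN values:

```python
import numpy as np
from catboost import CatBoostRegressor

X = [[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]]
y = [10.0, 20.0, 30.0, 40.0]

# nan_mode: 'Forbidden' raises an error on NaN, 'Min' treats NaN as below all
# observed values of the column, 'Max' treats it as above all of them.
model = CatBoostRegressor(iterations=10, nan_mode='Min', verbose=False)
model.fit(X, y)
```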

3. Official website examples

1 CatBoostRegressor:

from catboost import CatBoostRegressor
# Initialize data

train_data = [[1, 4, 5, 6],
              [4, 5, 6, 7],
              [30, 40, 50, 60]]

eval_data = [[2, 4, 6, 8],
             [1, 4, 50, 60]]

train_labels = [10, 20, 30]
# Initialize CatBoostRegressor
model = CatBoostRegressor(iterations=2,
                          learning_rate=1,
                          depth=2)
# Fit model
model.fit(train_data, train_labels)
# Get predictions
preds = model.predict(eval_data)

2 CatBoostClassifier with categorical variables

from catboost import CatBoostClassifier
# Initialize data
cat_features = [0, 1]
train_data = [["a", "b", 1, 4, 5, 6],
              ["a", "b", 4, 5, 6, 7],
              ["c", "d", 30, 40, 50, 60]]
train_labels = [1, 1, -1]
eval_data = [["a", "b", 2, 4, 6, 8],
             ["a", "d", 1, 4, 50, 60]]

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2,
                           learning_rate=1,
                           depth=2)
# Fit model
model.fit(train_data, train_labels, cat_features)
# Get predicted classes
preds_class = model.predict(eval_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(eval_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(eval_data, prediction_type='RawFormulaVal')

3 Using the Pool data type

from catboost import CatBoostClassifier, Pool

train_data = Pool(
    [
        [[0.1, 0.12, 0.33], [1.0, 0.7], 2, "male"],
        [[0.0, 0.8, 0.2], [1.1, 0.2], 1, "female"],
        [[0.2, 0.31, 0.1], [0.3, 0.11], 2, "female"],
        [[0.01, 0.2, 0.9], [0.62, 0.12], 1, "male"]
    ],
    label = [1, 0, 0, 1],
    cat_features=[3],
    embedding_features=[0, 1]    
)

eval_data = Pool(
    [
        [[0.2, 0.1, 0.3], [1.2, 0.3], 1, "female"],
        [[0.33, 0.22, 0.4], [0.98, 0.5], 2, "female"],
        [[0.78, 0.29, 0.67], [0.76, 0.34], 2, "male"],
    ],
    label = [0, 1, 1],
    cat_features=[3],
    embedding_features=[0, 1]    
)

model = CatBoostClassifier(iterations=10)

model.fit(train_data, eval_set=eval_data)
preds_class = model.predict(eval_data)

4. Detailed explanation of CatBoost parameters with hands-on practice

Detailed explanation of CatBoost parameters with hands-on practice (strongly recommended)

 

For more theory and code sharing, you are welcome to follow the public account: Python risk control model and data analysis


Origin blog.csdn.net/a7303349/article/details/125980851