[Pitfall] Analysis and solution: xgboost predict_proba returns the same probability for every sample

Keywords: model prediction probabilities are always the same, xgboost predict_proba

1. Problem description

A colleague built a classification model with xgboost, storing the data features in a pandas.DataFrame. At the inference stage, no matter which sample was passed in, the predicted proba was always the same. A condensed version of the code:

# Load the saved model
import xgboost as xgb

model = xgb.XGBClassifier()
model.load_model("models/xgb.model")

# datas is a DataFrame holding the samples to predict
res_01 = model.predict_proba(datas.iloc[[0]])
print(res_01)
res_02 = model.predict_proba(datas.iloc[[1]])
print(res_02)

# The two outputs are exactly the same
[[0.30088782 0.6991122 ]]
[[0.30088782 0.6991122 ]]
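
As a first diagnostic (a sketch added here, not from the original post), one can confirm that the two inputs really differ and compare the inference-time column order against the feature names recorded in the booster, if the model was saved in a format that preserves them:

# Sketch: confirm the inputs differ and inspect the expected feature order.
print(datas.iloc[[0]].equals(datas.iloc[[1]]))   # False: the inputs are not identical
booster = model.get_booster()
print(booster.feature_names)     # training-time order (None if not saved with the model)
print(list(datas.columns))       # order actually passed at inference time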

2. Analysis process

  1. Check whether the input data is actually identical. After verification, datas.iloc[[0]] and datas.iloc[[1]] are not the same; going further, some fields of datas.iloc[[0]] were manually set to 0 to produce two variants, input_01 and input_02, and the proba results were still identical;
  2. Check whether the environment is broken: after rebuilding a fresh conda environment, the problem still occurred;
  3. Check whether the feature order at prediction time is inconsistent with the order at training time. The verification process is as follows:
# Take one row of the test set as a prediction input
import copy

old_row = copy.deepcopy(X_test.iloc[[0]])
# Shuffle the feature columns to simulate a mismatch between the
# prediction-time and training-time feature order!
new_row = X_test.iloc[[0]].sample(frac=1, axis=1)
y_pred = loaded_model.predict_proba(new_row)
print(y_pred)
# Manually modify some feature values; proba is expected to change
new_row.iat[0, 0] = 0
new_row.iat[0, 1] = 0
y_pred = loaded_model.predict_proba(new_row)
print(y_pred)    ## In fact the result does NOT change!!

# old_row keeps the same feature order as at training time
y_pred = loaded_model.predict_proba(old_row)
print(y_pred)

# Manually modify some feature values; proba is expected to change
old_row.iat[0, 0] = 0
old_row.iat[0, 1] = 0
y_pred = loaded_model.predict_proba(old_row)
print(y_pred)  ## The result DOES change!!

With the shuffled feature order, modifying the feature values leaves the proba output unchanged, while with the training-time order (old_row) the output changes as expected: the problem is reproduced, and the root cause is that the prediction-time feature order does not match the training-time order. (Recent xgboost versions raise an error on a column-name mismatch when feature names were stored with the model; with the older binary format the names may be lost, and the columns are then consumed positionally without any warning.)
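
A sketch of the corresponding fix (assuming the booster kept its feature names; saving in the JSON format is one way to preserve them): reorder the prediction columns to the recorded training order before predicting:

# Sketch: re-align the prediction columns to the training-time order.
# feature_names is None if the saved format did not keep the names.
expected = loaded_model.get_booster().feature_names
if expected is not None:
    new_row = new_row[expected]    # reorder columns to the training order
print(loaded_model.predict_proba(new_row))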

3. Solutions

  • Keep the prediction-time features consistent with the training-time features, in both content and order, by reusing the same feature-engineering code for both stages

  • Or store the features in dictionary format and always emit them sorted by dictionary key, so the feature order is deterministic (see the sketch after this list)

  • A tree model can also map different inputs to the same leaves and therefore output the same prediction probability. Adjusting the model parameters to increase its fitting capacity makes it more likely that different inputs yield different predicted probabilities. The main parameters to adjust in xgboost are:

    • max_depth: the maximum depth of a tree, a positive integer; default 6;
    • learning_rate: the learning rate, usually set to a value between 0.01 and 0.2; default 0.3;
    • n_estimators: the number of boosted trees (weak learners), a positive integer; default 100;
    • subsample: the fraction of training rows sampled for each tree, a float in (0, 1], usually between 0.5 and 1; default 1;
    • colsample_bytree: the fraction of features sampled for each tree, a float in (0, 1], usually between 0.5 and 1; default 1;

    Example:

    xgb_model = xgb.XGBClassifier(
        max_depth=6,
        learning_rate=0.07,
        n_estimators=100,
        subsample=0.8,
        colsample_bytree=0.8)
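
As a sketch of the dictionary-based option above (the feature names here are hypothetical, for illustration only): build each sample from a dict and emit the columns in sorted-key order, so training and inference always agree:

import pandas as pd

# Hypothetical features stored as a dict; sorting the keys fixes the order.
sample = {"f_age": 35, "f_clicks": 12, "f_income": 5200.0}
ordered_cols = sorted(sample)    # deterministic column order
row = pd.DataFrame([[sample[c] for c in ordered_cols]], columns=ordered_cols)
# Build the training matrix with the same ordered_cols.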
    

---------------- END ----------------

Origin blog.csdn.net/iling5/article/details/130421902