Feature engineering workflow

Feature engineering

  • Data collection

  • Choose an algorithm based on the problem

Logistic regression is often used for classification problems. The model is simple, easy to train, and highly interpretable.

Logistic regression usually performs worse than tree models, and it is less interpretable than a decision tree.

With logistic regression, feature crosses and feature combinations have to be built manually.

A deep model or a tree model can learn feature crosses automatically. Tree models mostly capture two-feature (second-order) crosses, though, and are not very effective at higher-order ones.

If accuracy is the priority, use GBDT or XGBoost.

If interpretability is the priority, decision trees are particularly strong.

Linear regression and, in many implementations, regression trees cannot handle null values; fitting on data that contains nulls raises an error, so nulls must be handled first.
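One common way to satisfy the no-nulls requirement is to impute before fitting. The sketch below is a minimal, standard-library-only illustration that fills missing entries with the column median; the function name `impute_median` and the sample `ages` column are hypothetical, and median imputation is just one possible strategy, not one the original notes prescribe.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical numeric column with two missing entries.
ages = [22, None, 35, 41, None, 29]
filled = impute_median(ages)
print(filled)
```

Dropping rows with nulls, or adding a "was missing" indicator column next to the imputed one, are alternatives worth comparing on a validation set.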

  • Process the data according to the characteristics of the algorithm

    • Null values, outliers, and duplicates

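For outliers, cap at percentile bounds: replace values above the 99th percentile with the 99th-percentile value, and values below the 1st percentile with the 1st-percentile value.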
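One reading of the outlier rule above is percentile capping (sometimes called winsorizing). The following is a small sketch under that assumption, using only the standard library; `clip_to_percentiles` and the sample `data` are hypothetical names, and the 1/99 bounds are parameters rather than fixed values.

```python
from statistics import quantiles


def clip_to_percentiles(values, lower_pct=1, upper_pct=99):
    """Cap values below/above the given percentiles at the percentile values."""
    # quantiles(..., n=100) returns the 99 cut points P1..P99.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical column: the integers 1..100 plus one extreme value.
data = list(range(1, 101)) + [10_000]
clipped = clip_to_percentiles(data)
print(min(clipped), max(clipped))
```

Values inside the bounds pass through unchanged; only the tails are pulled in, so the number of rows stays the same.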

  • Further processing according to the type of feature

    • Numerical/Continuous

      • Normalization/standardization

      • Binning/Bucketing/Discretization

    • Categorical

      • one-hot encoding

      • Ordinal (integer) encoding
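The per-type transformations listed above can be sketched in a few lines of plain Python. All function names and sample columns here are hypothetical, the binning shown is equal-width (other schemes such as equal-frequency exist), and constant columns are assumed absent so the divisions are safe.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalization: rescale to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: zero mean, unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def equal_width_bin(values, n_bins):
    """Discretization: map each value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def ordinal_encode(labels):
    """Integer encoding: one integer per category, in sorted order."""
    mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def one_hot_encode(labels):
    """One-hot encoding: one 0/1 column per category, in sorted order."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return rows, categories


# Hypothetical sample columns.
incomes = [20, 40, 60, 80, 100]
cities = ["paris", "tokyo", "paris", "lima"]

scaled = min_max_scale(incomes)
z_scores = standardize(incomes)
bins = equal_width_bin(incomes, 4)
codes, mapping = ordinal_encode(cities)
vectors, categories = one_hot_encode(cities)
print(scaled, bins, codes, vectors)
```

Integer encoding imposes an order on the categories, which tree models tolerate but linear models may misread as magnitude; one-hot encoding avoids that at the cost of more columns.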

  • Feature derivation

    • If many records share the same id, group them by id and compute statistics for each group (mean, variance, range, ...)

    • Feature cross

    • Derive new features based on business understanding
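As a rough illustration of the first two derivation ideas, the sketch below aggregates per-id statistics and builds a crossed categorical feature. The `orders` records, their field names, and both helper functions are hypothetical; population variance is used so that ids with a single record still get a value.

```python
from collections import defaultdict
from statistics import mean, pvariance


def aggregate_by_id(rows, id_key, value_key):
    """Compute mean, variance, and range of value_key for each id."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_key]].append(row[value_key])
    return {
        gid: {
            "mean": mean(vals),
            "variance": pvariance(vals),
            "range": max(vals) - min(vals),
        }
        for gid, vals in groups.items()
    }


def cross(row, key_a, key_b):
    """Feature cross: combine two categorical values into one new category."""
    return f"{row[key_a]}_x_{row[key_b]}"


# Hypothetical transaction records.
orders = [
    {"user_id": "u1", "amount": 10.0, "channel": "web", "device": "ios"},
    {"user_id": "u1", "amount": 30.0, "channel": "app", "device": "ios"},
    {"user_id": "u2", "amount": 5.0, "channel": "web", "device": "android"},
]

stats = aggregate_by_id(orders, "user_id", "amount")
crossed = [cross(order, "channel", "device") for order in orders]
print(stats)
print(crossed)
```

The per-id statistics would then be joined back onto each record as new columns, and the crossed category can be one-hot encoded like any other categorical feature.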

  • Feature selection

    • Filter methods

    • Wrapper methods (e.g., recursive feature elimination)

    • Embedded methods
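Of the three families above, the filter approach is the simplest to show without a modeling library: score each feature independently of any model and keep the top k. The sketch below uses absolute Pearson correlation with the target as the score, which is only one possible choice; the feature names and values are hypothetical. Recursive and embedded selection both need a fitted model and are not shown here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def filter_select(features, target, k):
    """Keep the k features with the highest |correlation| to the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical feature columns and regression target.
features = {
    "rooms": [1, 2, 3, 4, 5],
    "noise": [3, 1, 4, 1, 5],
    "age": [50, 40, 30, 20, 10],
}
price = [100, 150, 210, 240, 300]

selected, scores = filter_select(features, price, k=2)
print(selected, scores)
```

Note that `rooms` and `age` are perfectly collinear in this toy data, so a filter that scores features one at a time keeps both; catching that kind of redundancy is one reason to follow up with a wrapper or embedded method.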

  • For classification problems, are the classes balanced?

  • Modeling and tuning

  • Model fusion (ensembling)

    • Regression metrics for comparing models: RMSE, MSE, MAE
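For reference, the three regression metrics named above are defined as MSE = mean of squared errors, RMSE = square root of MSE, and MAE = mean of absolute errors. A minimal sketch follows; the sample `y_true` and `y_pred` values are made up for illustration.

```python
from math import sqrt


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mse_value = mse(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print(mse_value, rmse_value, mae_value)
```

Because MSE and RMSE square the errors, they penalize large misses more heavily than MAE, which matters when choosing which candidate or fused model to keep.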


Origin blog.csdn.net/weixin_48135624/article/details/114947730