Collection | 3 major strategies for machine learning model tuning


Reprinted from: Heart of the Machine

Whether for Kaggle competitions or industrial deployments, machine learning models face endless tuning needs after they are built. What kind of thinking should we follow in this process?

If its accuracy is not good enough, a machine learning model has little practical value in the real world. For developers, improving performance is essential work. This article introduces some common strategies, including choosing the best algorithm, tuning model settings, and feature engineering.

If you follow the right tutorial, you can train your first machine learning model in no time. However, it is rare to get good results from that first model. After a model is trained, we need to spend considerable time tuning it to improve performance. Different types of models call for different tuning strategies; in this article, we introduce the common ones.

Is the model good?

Before model tuning, we first need to know whether the current model performance is good or bad. If you don't know how to measure the performance of the model, you can refer to:

  • https://www.mage.ai/blog/definitive-guide-to-accuracy-precision-recall-for-product-developers

  • https://www.mage.ai/blog/product-developers-guide-to-ml-regression-model-metrics

Each model has a baseline metric. For a classification model, we can use the "mode class" (always predicting the most frequent class) as the baseline. If your model beats the baseline, congratulations, it's a good start. If the model has not reached the baseline, it has not yet extracted valuable insight from the data, and there is still a lot to do to improve performance.
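As a minimal sketch of this check (assuming scikit-learn is available; the dataset here is purely illustrative), we can compare a candidate model against a most-frequent-class baseline:

```python
# Compare a candidate model against a mode-class baseline with
# scikit-learn's DummyClassifier; the dataset is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Candidate model: should beat the baseline to be worth keeping.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
model_acc = model.score(X_test, y_test)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

If the candidate cannot beat `baseline_acc`, it has learned nothing useful from the features.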

Of course, there is also the situation where the model performs "too well", such as 99% accuracy and 99% recall. This is not necessarily good news and may indicate a problem with your model. One possible cause is "data leakage", and we discuss how to fix it in the "Remove data leakage features" section.

Strategies for improving the model

Generally speaking, there are three directions for model tuning: choosing better algorithms, tuning model parameters, and improving data.

Compare Different Algorithms

Comparing multiple algorithms is a simple way to improve model performance. Different algorithms suit different types of datasets, so we can train several and keep the one that performs best. For example, for a classification problem we can try logistic regression, support vector machines, XGBoost, neural networks, etc.
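A minimal sketch of this comparison, assuming scikit-learn (the candidate models and dataset are illustrative, not a recommendation):

```python
# Score several candidate classifiers with cross-validation
# and keep the best one; models and dataset are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validation accuracy for each candidate.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```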


Source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Hyperparameter Tuning

Hyperparameter tuning is a commonly used model tuning method. In a machine learning model, parameters that must be chosen before the learning process starts are called hyperparameters, such as the maximum depth allowed for a decision tree or the number of trees in a random forest. Hyperparameters significantly affect the outcome of the learning process, and tuning them lets us reach good results faster.


We highly recommend using publicly available libraries to help with hyperparameter tuning, such as optuna.
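As a minimal sketch of the idea (shown here with scikit-learn's built-in `GridSearchCV` rather than optuna; the parameter grid is an illustrative assumption, and libraries like optuna search the space more cleverly):

```python
# Exhaustive grid search over two random-forest hyperparameters;
# the grid values are illustrative, not a recommendation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters chosen before training: tree count and max depth.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```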

Trade Recall for Precision

For classification models, we usually use two metrics to measure performance: precision and recall. Depending on the problem, you may need to optimize for one or the other. There is a quick way to trade off between them: classification models predict probabilities for the label classes, so we can simply adjust the probability threshold to shift the balance between recall and precision.

For example, if we build a model to predict whether a passenger will survive the sinking of the Titanic, the model can predict the probability that each passenger survives. If the probability is higher than 50%, the model predicts that the passenger survives; otherwise, that the passenger dies. If we want higher precision, we can raise the probability threshold. The model will then predict that fewer passengers survive, but those predictions will be more precise.
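A minimal sketch of moving the threshold, assuming scikit-learn (the dataset stands in for the Titanic example, and the threshold values are illustrative):

```python
# Raise the decision threshold to trade recall for precision;
# dataset and thresholds are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # P(positive class)

results = {}
for threshold in (0.5, 0.9):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_test, pred),
                          recall_score(y_test, pred))
    p, r = results[threshold]
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")
```

Raising the threshold from 0.5 to 0.9 predicts fewer positives, so recall can only fall while precision typically rises.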


Feature Engineering

In addition to choosing the best algorithm and tuning parameters, we can also generate more features from existing data, which is called feature engineering.

Create new features

Building new features requires a certain amount of domain knowledge and creativity. Here is an example of building a new feature:

  • Create a feature that counts the number of letters in a text.

  • Create a feature that counts the number of words in a text.

  • Create a feature (e.g. word embedding) that understands the meaning of the text.

  • Aggregated user event counts for the past 7, 30, or 90 days.

  • Extract features like "day", "month", "year", and "days after holiday" from date or timestamp features.
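A few of the ideas above can be sketched as follows (assuming pandas; the column names and data are illustrative):

```python
# Derive text-count and date-part features from raw columns;
# data and column names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "text": ["hello world", "machine learning is fun"],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-07-20"]),
})

# Simple text features: character and word counts.
df["n_chars"] = df["text"].str.len()
df["n_words"] = df["text"].str.split().str.len()

# Date features extracted from a timestamp column.
df["day"] = df["signup_date"].dt.day
df["month"] = df["signup_date"].dt.month
df["year"] = df["signup_date"].dt.year

print(df[["n_chars", "n_words", "day", "month", "year"]])
```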

Use public datasets to augment training data

When you run out of ideas for generating new features from existing datasets, another idea is to get features from public datasets. Suppose you are building a model to predict whether a user will convert to a member, and there is not much user information in the available dataset, only the "email" and "company" attributes. Then you can obtain user and company data from third parties, such as user address, user age, company size, etc., which can be used to enrich your training data.


Feature selection

Adding more features is not always good. Removing irrelevant and noisy features helps reduce model training time and improve model performance. There are various feature selection methods in scikit-learn that can be used to remove irrelevant features.
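One such scikit-learn method is `SelectKBest`; a minimal sketch (the choice of `k=10` and the scoring function are illustrative assumptions):

```python
# Keep only the k features most associated with the target,
# using an ANOVA F-test; k=10 is an illustrative choice.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)
```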

Remove data leakage features

As mentioned above, one scenario is that the model performs "too well" in training, but when deployed to production its performance becomes poor. The cause may be "data leakage", a common pitfall in model training. Data leakage means using features that only become available after the target variable is determined and therefore contain information about it. At prediction time in real life, those leaked features would not be available.

For example, to predict whether a user will open an email, the features might include whether the user clicked a link in the email. Once the model sees that the user clicked, it predicts that the user opened the email 100% of the time. In real life, however, clicking happens after opening, so at prediction time we cannot know whether the user will click.

We can use SHAP values to debug data leakage issues. Plotting with the SHAP library shows which features have the most impact on the model's output and in which direction. If a feature is highly correlated with the target variable and carries a very high weight, it may be a leaking feature, and we can remove it from the training data.
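The SHAP library gives richer attributions, but a plain correlation-with-target check can already surface obvious leaks. A minimal sketch with synthetic, illustrative data (the column names `clicked_email` and the 0.95 cutoff are assumptions for the example):

```python
# Flag features that are suspiciously correlated with the target;
# the data, column names, and 0.95 cutoff are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
target = rng.integers(0, 2, n)

df = pd.DataFrame({
    "age": rng.normal(40, 10, n),   # legitimate feature
    "clicked_email": target,        # leaked: identical to the target
    "target": target,
})

# Features almost perfectly correlated with the target are leak suspects.
corr = df.drop(columns="target").corrwith(df["target"]).abs()
suspects = corr[corr > 0.95].index.tolist()
print(suspects)
```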


More Data

Getting more training data is an obvious and effective way to improve model performance. More training data allows the model to find more insights and achieve higher accuracy.


So, when is it time to stop tuning?

You need to know how to start and when to stop, and much of the time "good enough" is a difficult question to answer. Model improvement can seem endless: there are always new ideas that bring in new data, create new features, or tweak the algorithm. The minimum criterion is that model performance should at least beat the baseline metric. Once that is met, use the following process to improve the model and decide when to stop:

  • Try all strategies to improve your model.

  • Compare the model's performance against the other metrics you use for validation, to check that its behavior makes sense.

  • After a few rounds of tuning, weigh the cost of further modifications against each percentage point of performance improvement.

  • If the model is performing well and there is little improvement after trying out a few ideas, deploy the model into production and measure the actual performance.

  • If performance under real conditions is similar to that in the test environment, your model is ready to use. If production performance is worse than training performance, something went wrong in training, possibly overfitting or data leakage, and the model needs to be retuned.

Conclusion

Model tuning is a long and complex process, including model retraining, experimentation with new ideas, performance evaluation, and metrics comparison. With the ideas presented in this article, I hope you can take your machine learning skills to a higher level.

Original link:

https://m.mage.ai/how-to-improve-the-performance-of-a-machine-learning-ml-model-409b05b2a5f
