Eight ways to improve the accuracy of machine learning models

1. Add more data

It's almost always a good idea to have more data. More data lets the data "speak for itself" rather than relying on assumptions and weak correlations. Generally, the more data you have, the better the model and the higher the accuracy.

I understand that sometimes more data simply isn't available. In a data science competition, for example, the size of the training set cannot be increased. But for enterprise projects, I recommend asking for more data whenever possible. This reduces the pain that comes with a limited dataset size.

2. Handling Missing Values and Outliers

Unwanted missing values and outliers in the training set often lead to low or biased model accuracy, and in turn to wrong predictions, because they prevent us from properly analyzing the behavior of the target variable and its relationship with other variables. So it is very important to deal with missing values and outliers.

Take a closer look at the example below. With the missing values left in place, males and females appear equally likely to play cricket. But in the second table (after the missing values have been imputed based on the salutation "Miss"), women appear more likely to play cricket than men.

Left: before missing value processing; right: after missing value processing

From the example above, we can see the negative impact missing values can have on model accuracy. Fortunately, there are various ways to handle missing values and outliers:

Missing values: For continuous variables, you can replace missing values with the mean, median, or mode. For categorical variables, you can treat missing values as a separate category. You can also build models to predict missing values; KNN imputation is a good option here. To learn more, I recommend reading "Methods to deal and treat missing values".
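Below is a minimal sketch of these options in Python with pandas and scikit-learn; the column names and values are invented for illustration.

    # Impute a continuous column and a categorical column (illustrative data).
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({
        "Age": [22.0, 38.0, None, 35.0, None],              # continuous
        "Sex": ["male", "female", "female", None, "male"],  # categorical
    })

    # Continuous variable: replace missing values with the median (mean or mode work too).
    df[["Age"]] = SimpleImputer(strategy="median").fit_transform(df[["Age"]])

    # Categorical variable: treat "missing" as its own category.
    df["Sex"] = df["Sex"].fillna("Missing")

    # KNN-based imputation (numeric columns only) is another option:
    # from sklearn.impute import KNNImputer
    # df[["Age"]] = KNNImputer(n_neighbors=2).fit_transform(df[["Age"]])
    print(df)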

Outliers: You can delete the affected observations, transform them, or bin them. As with missing values, you can also impute outliers or treat them as a separate group. To learn more, I recommend reading "How to detect Outliers in your dataset and treat them?".
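A minimal sketch of one common approach, IQR-based detection followed by capping, is shown below; the numbers are invented for illustration.

    # Flag values outside 1.5 * IQR from the quartiles, then cap them.
    import pandas as pd

    s = pd.Series([12, 15, 14, 13, 102, 16, 11, 14])   # 102 looks like an outlier

    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    print(s[(s < lower) | (s > upper)])           # inspect the flagged entries first
    s_capped = s.clip(lower=lower, upper=upper)   # cap instead of remove; or drop / transform / bin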

3. Feature Engineering

This step helps extract more information from the existing data. The new information is extracted in the form of new features, which may better explain the variation in the training set and therefore improve model accuracy.

Hypothesis generation has a big impact on feature engineering: good hypotheses lead to better feature sets. This is why I always recommend spending time on hypothesis generation. Feature engineering can be divided into two steps:

Feature transformation: Many scenarios require feature transformation (a short sketch follows point C below):

A) Rescale a variable from its original range to a 0-1 range. This is often called data normalization. For example, in a dataset where the first variable is in meters, the second in centimeters, and the third in kilometers, the variables must be brought to the same scale before many algorithms can be applied.

B) Some algorithms perform better on normally distributed data, so we need to reduce the skew of a variable. Logarithmic, square root, and reciprocal transforms can be used to correct skew.

C) Sometimes numeric data performs better after binning, since binning also handles outliers. Numeric data can be made discrete by grouping values into bins. This is also known as data discretization.
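The sketch below illustrates points A-C with pandas and NumPy; the column names and values are invented for the example.

    # A) normalization, B) log transform for skew, C) binning / discretization.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"distance_m": [120.0, 4500.0, 800.0, 30.0],
                       "income": [2000, 3500, 250000, 4200]})

    # A) Normalization: rescale a variable to the 0-1 range.
    d = df["distance_m"]
    df["distance_norm"] = (d - d.min()) / (d.max() - d.min())

    # B) Skew correction: a log transform pulls in the long right tail.
    df["income_log"] = np.log1p(df["income"])

    # C) Discretization: bin a numeric variable into categories.
    df["income_bin"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])
    print(df)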

Creating new features: Deriving new variables from existing ones is called feature creation, and it helps surface hidden relationships in the dataset. For example, suppose we want to predict a store's transaction volume from the transaction date. The date itself may not correlate strongly with volume, but the day of the week it falls on might. That day-of-week information is latent in the data; we can extract it as a new feature and improve the model.
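A minimal sketch of that day-of-week example with pandas follows; the column names are invented for illustration.

    # Derive day-of-week and weekend features from a transaction date.
    import pandas as pd

    sales = pd.DataFrame({"date": ["2024-01-05", "2024-01-06", "2024-01-08"],
                          "volume": [120, 340, 95]})

    sales["date"] = pd.to_datetime(sales["date"])
    sales["day_of_week"] = sales["date"].dt.dayofweek   # 0 = Monday ... 6 = Sunday
    sales["is_weekend"] = sales["day_of_week"] >= 5
    print(sales)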

4. Feature selection

Feature selection is the process of finding the subset of features that best explains the relationship between the independent variables and the target variable.

You can select useful features based on a variety of criteria, such as:

Domain knowledge: Based on experience in the domain, variables that have a greater impact on the target variable can be selected.

Visualization: As the name suggests, visualization allows relationships between variables to be seen, making the process of feature selection easier.

Statistical parameters: We can consider p-values, information value, and other statistical measures to choose the right features.

PCA: This method represents the training data in a lower-dimensional space; it is a dimensionality reduction technique. There are many other ways to reduce the dimensionality of a dataset: factor analysis, dropping low-variance features, dropping highly correlated features, forward/backward variable selection, and others.
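Below is a minimal sketch of two of the approaches above, a univariate statistical test (SelectKBest) and PCA, using scikit-learn's built-in iris data as a stand-in for a real training set.

    # Keep the features most associated with the target, or project onto components.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # Keep the 2 features most associated with the target (ANOVA F-score).
    X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

    # Or project all features onto 2 principal components (PCA).
    X_pca = PCA(n_components=2).fit_transform(X)

    print(X_selected.shape, X_pca.shape)   # (150, 2) (150, 2)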

5. Use multiple algorithms

Using the right machine learning algorithm is the ideal way to achieve higher accuracy. But that's easier said than done.

This intuition comes from experience and constant experimentation. Some algorithms are better suited to certain types of data than others. Therefore, we should try all relevant models and compare their performance.

Source: Scikit-Learn Algorithm Selection Diagram
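A minimal sketch of this "try several and compare" workflow is shown below; the three models and the iris dataset are chosen purely for illustration.

    # Compare the cross-validated accuracy of several candidate algorithms.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    models = {"logistic_regression": LogisticRegression(max_iter=1000),
              "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
              "svm": SVC()}

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f}")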

6. Algorithm Tuning

We all know that machine learning algorithms are driven by parameters, and these parameters have a significant impact on the outcome of learning. The goal of parameter tuning is to find the optimal value of each parameter in order to improve model accuracy. To tune these parameters, you need some understanding of what they mean and how they individually affect the model. You can repeat this process on a few of your best-performing models.

For example, in random forest we have parameters such as max_features, n_estimators, random_state, and oob_score. Optimizing these parameter values leads to better, more accurate models.

To learn more about the impact of tuning parameters, see "Tuning the parameters of your Random Forest model". Below is the full parameter list of the random forest classifier in scikit-learn:

RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
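A minimal sketch of tuning two of these hyperparameters with a grid search follows; the grid values are illustrative, not recommendations.

    # Search over n_estimators and max_features with 5-fold cross-validation.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)
    param_grid = {"n_estimators": [50, 100, 200],
                  "max_features": ["sqrt", "log2"]}

    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))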

7. Ensemble Methods

This is the most common approach in the winning solutions of data science competitions. The technique is to combine the results of multiple weak models to obtain a better result. It can be achieved in many ways, such as:

  • Bagging (Bootstrap Aggregating)
  • Boosting

To learn more about this, check out Introduction to ensemble learning.
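A minimal sketch of the two ensembling styles listed above, using scikit-learn's off-the-shelf implementations on its built-in iris data, is shown below.

    # Bagging vs. boosting, compared by cross-validated accuracy.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Bagging: train many trees on bootstrap samples and average their votes.
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

    # Boosting: fit trees sequentially, each one correcting its predecessors' errors.
    boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

    for name, model in [("bagging", bagging), ("boosting", boosting)]:
        print(name, cross_val_score(model, X, y, cv=5).mean().round(3))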

Using ensemble methods to improve model accuracy is always a good idea, for two main reasons:

  1. Ensemble methods are usually more complex than traditional methods;
  2. Traditional methods provide a good baseline on which ensembles can be built.

Note!

So far, we have looked at ways to improve the accuracy of a model. However, a model with high accuracy does not necessarily perform better on unseen data. Sometimes an improvement in model accuracy is simply due to overfitting.

8. Cross Validation

To address this problem, we use cross-validation, one of the most important concepts in data modeling. It means holding out some data samples that are not used to train the model but are used to validate it before the model is finalized.

This approach helps produce relationships that generalize better. To learn more about cross-validation, I recommend reading "Improve model performance using cross validation".
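A minimal sketch of k-fold cross-validation with scikit-learn follows: the model is trained on k-1 folds and validated on the held-out fold, repeated k times.

    # 5-fold cross-validation of a random forest on the built-in iris data.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
    print(scores, scores.mean())   # per-fold accuracy and its average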
