Approaching (Almost) Any Machine Learning Problem (Chinese Translation)

Foreword

  • Abhishek Thakur is well known to many Kagglers. In 2017 he published an article on LinkedIn called Approaching (Almost) Any Machine Learning Problem, introducing an automatic machine learning framework he built that can solve almost any machine learning problem. The article became very popular on Kaggle.
  • Abhishek's achievements on Kaggle:
    • Competitions Grandmaster (17 gold medals, No. 3 in the world)
    • Kernels Expert (Top 1% of Kagglers)
    • Discussion Grandmaster (65 gold medals, No. 2 in the world)
  • Abhishek is currently Chief Data Scientist at boost.ai, a Norwegian software company specializing in conversational artificial intelligence.
  • This is a Chinese translation of Approaching (Almost) Any Machine Learning Problem. Because of my limited level and because no machine translation was used, some of the language may not read fluently or may not be sufficiently localized. Please share any suggestions you have while reading.
  • A few chapters are very basic and have not been translated. See the book's table of contents for details:
    • Prepare the environment (untranslated)
    • Unsupervised and supervised learning (untranslated)
    • Cross-validation (translated)
    • Evaluation metrics (translated)
    • Organizing machine learning projects (translated)
    • Handling categorical variables (translated)
    • Feature engineering (translated)
    • Feature selection (translated)
    • Hyperparameter optimization (translated)
    • Image classification and segmentation methods (untranslated)
    • Text classification or regression methods (untranslated)
    • Ensembling and stacking methods (untranslated)
    • Reproducible code and model approaches (untranslated)
  • I will put the translated Markdown files on GitHub for free download: AAAML-CN. If you find any mistakes while reading, you are welcome to submit an issue or PR to help me fix them.
  • If there is demand, I may continue to translate the untranslated chapters in the future. If this is helpful to you, please give the repository a star or follow it.
  • Below is the translation of the cross-validation chapter.

Cross-validation

In the previous chapter, we did not build any models. The reason is simple: before creating any kind of machine learning model, we must know what cross-validation is and how to choose the best cross-validation scheme for the dataset at hand.

So, what is cross-validation, and why should we care about it?

There are many definitions of what cross-validation is. Mine is just one sentence: cross-validation is a step in the process of building a machine learning model that helps us ensure our model fits the data accurately while also ensuring that we do not overfit. But that brings up another word: overfitting.

To explain overfitting, I think it's best to look at a dataset first. There is a fairly famous red wine quality dataset. It has 11 different attributes that determine the quality of red wine.

These properties include:

  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol

Based on these different characteristics, we need to predict the quality of red wine, with a quality value between 0 and 10.

Let's see what the data look like.

import pandas as pd
df = pd.read_csv("winequality-red.csv")


Figure 1: Simple display of the red wine quality dataset

We can think of this problem as a classification problem or as a regression problem. For simplicity, we choose classification. However, this dataset contains only six distinct quality values, so we map all quality values to the range 0 to 5.

# a mapping dictionary that maps the quality values from 3-8 to 0-5
quality_mapping = {
    3: 0,
    4: 1,
    5: 2,
    6: 3,
    7: 4,
    8: 5
}
df.loc[:, "quality"] = df.quality.map(quality_mapping)

When we look at this data and think of it as a classification problem, many algorithms come to mind that we could apply; perhaps we could use neural networks. But it would be a bit of a stretch to dive into neural networks from the very beginning. So let's start with something simple that we can also visualize: decision trees.

Before we start to understand what overfitting is, let's divide the data into two parts. This dataset has 1599 samples. We keep 1000 samples for training and 599 samples as a separate set.

The division can be easily done with the following code:

df = df.sample(frac=1).reset_index(drop=True)

df_train = df.head(1000)
df_test = df.tail(599)

Now, we will train a decision tree model using scikit-learn on the training set.

from sklearn import tree 
from sklearn import metrics

clf = tree.DecisionTreeClassifier(max_depth=3) 

cols = ['fixed acidity',
        'volatile acidity',
        'citric acid',
        'residual sugar',
        'chlorides',
        'free sulfur dioxide',
        'total sulfur dioxide',
        'density',
        'pH',
        'sulphates',
        'alcohol']

clf.fit(df_train[cols], df_train.quality)

Note that I set the maximum depth (max_depth) of the decision tree classifier to 3. All other parameters of the model were kept at default values. Now, we test the accuracy of the model on the training and test sets:

train_predictions = clf.predict(df_train[cols])

test_predictions = clf.predict(df_test[cols])

train_accuracy = metrics.accuracy_score(
    df_train.quality, train_predictions
)

test_accuracy = metrics.accuracy_score(
    df_test.quality, test_predictions
)

The training and test accuracies are 58.9% and 54.25%, respectively. Now we increase the maximum depth (max_depth) to 7 and repeat the process. This gives a training accuracy of 76.6% and a test accuracy of 57.3%. Here we use accuracy mainly because it is the most straightforward metric; it is probably not the best metric for this problem. We can calculate and plot these accuracies for different values of max_depth.

from sklearn import tree
from sklearn import metrics 
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)
%matplotlib inline
train_accuracies = [0.5]
test_accuracies = [0.5]
for depth in range(1, 25):
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    cols = [
        'fixed acidity',
        'volatile acidity',
        'citric acid',
        'residual sugar',
        'chlorides',
        'free sulfur dioxide',
        'total sulfur dioxide',
        'density',
        'pH',
        'sulphates',
        'alcohol'
    ]
    clf.fit(df_train[cols], df_train.quality)
    train_predictions = clf.predict(df_train[cols]) 
    test_predictions = clf.predict(df_test[cols])
    
    train_accuracy = metrics.accuracy_score(
        df_train.quality, train_predictions
    )
    test_accuracy = metrics.accuracy_score(
        df_test.quality, test_predictions
    )
    train_accuracies.append(train_accuracy)
    test_accuracies.append(test_accuracy)
    
plt.figure(figsize=(10, 5)) 
sns.set_style("whitegrid")
plt.plot(train_accuracies, label="train accuracy")
plt.plot(test_accuracies, label="test accuracy")
plt.legend(loc="upper left", prop={'size': 15})
plt.xticks(range(0, 26, 5))
plt.xlabel("max_depth", size=20)
plt.ylabel("accuracy", size=20)
plt.show()

This will generate the graph shown in Figure 2.

Figure 2: Training and testing accuracy for different max_depth.

We can see that the highest test accuracy is obtained when max_depth has a value of 14. As we keep increasing the value of this parameter, the test accuracy stays the same or gets worse, while the training accuracy keeps improving. This shows that as max_depth increases, the decision tree model learns the training data better and better, but the performance on the test data does not improve at all.

This is called overfitting.

The model fits perfectly on the training set but performs poorly on the test set. This means the model learns the training data well but cannot generalize to unseen examples. In the dataset above, we could build a model with a very high max_depth that gives excellent results on the training data, but such a model is not useful, because it will not give similar results on real-world samples or live data.

One could argue that this approach is not overfitting, since the accuracy on the test set stays essentially the same. Another definition of overfitting would be: the test loss increases as we keep improving the training loss. This situation is very common in neural networks.

Whenever we train a neural network, it is necessary to monitor the loss on the training and test sets during training. If we have a very large network working on a very small dataset (i.e. very few samples), we will observe that as we continue to train, the loss on both the training and test sets decreases. However, at some point, the test loss reaches a minimum, after which the test loss starts increasing even if the training loss decreases further. We have to stop training when the validation loss reaches its minimum.

This is the most common explanation for overfitting.
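As a concrete illustration, here is a minimal sketch (not from the book) of this kind of monitoring with early stopping. It uses scikit-learn's SGDClassifier trained incrementally with partial_fit on a synthetic dataset; the dataset, the patience value, and the epoch count are arbitrary choices made purely for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# synthetic data, split into a training part and a validation part
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=42)

# note: the loss is named "log" in older scikit-learn versions
clf = SGDClassifier(loss="log_loss", random_state=42)
best_val_loss = np.inf
patience, bad_epochs = 5, 0

for epoch in range(200):
    # one pass over the training data
    clf.partial_fit(X_tr, y_tr, classes=np.unique(y))
    # the training loss keeps going down; we watch the validation loss
    val_loss = log_loss(y_va, clf.predict_proba(X_va))
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    # stop once the validation loss has not improved for `patience` epochs
    if bad_epochs >= patience:
        print(f"stopping at epoch {epoch}, best validation log loss = {best_val_loss:.4f}")
        break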

In simple terms, Occam's razor says not to complicate things that can be solved in a much simpler way. In other words, the simplest solutions are the most generalizable solutions. In general, whenever your model does not obey Occam's razor, it is probably overfitting.


Figure 3: The most general definition of overfitting

Now we can go back to cross-validation.

When explaining overfitting, I decided to split the data into two parts. I trained the model on one part and checked its performance on the other. This is also a kind of cross-validation, commonly known as a hold-out set. We use this kind of validation when we have a large amount of data and model inference is time-consuming.

There are many different ways to do cross-validation, and it is the most critical step when building a good machine learning model. Choosing the right cross-validation depends on the dataset you are dealing with, and what works for one dataset may not work for another. However, a few types of cross-validation techniques are the most popular and widely used.

These include:

  • k-fold cross-validation
  • stratified k-fold cross-validation
  • hold-out based validation
  • leave-one-out cross-validation
  • group k-fold cross-validation

Cross-validation means dividing the training data into several parts. We train the model on some of these parts and test it on the remaining part. See Figure 4.

Figure 4: Splitting the dataset into training and validation sets

Figures 4 and 5 show that when you get a dataset to build machine learning models on, you separate it into two different sets: a training set and a validation set. Many people also split it into a third set, called the test set, but here we will use only two. As you can see, we split the samples together with the targets associated with them. We can divide the data into k different sets that are mutually exclusive. This is known as k-fold cross-validation.

Figure 5: K-fold cross-validation

We can split any data into k equal parts using KFold in scikit-learn. Each sample is assigned a value from 0 to k-1.

import pandas as pd
from sklearn import model_selection

if __name__ == "__main__":
    # read the training data
    df = pd.read_csv("train.csv")
    # create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # initiate the KFold class from the model_selection module
    kf = model_selection.KFold(n_splits=5)
    # fill the new kfold column
    for fold, (trn_, val_) in enumerate(kf.split(X=df)):
        df.loc[val_, 'kfold'] = fold
    # save the new csv with the kfold column
    df.to_csv("train_folds.csv", index=False)

This process can be used with almost any kind of dataset. For example, when the data consists of images, you can create a CSV containing the image IDs, image locations, and image labels, and then use the process described above.
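For instance, a hypothetical sketch of that could look like the following; the directory name, file names, and labels are made up for illustration, and in practice you would read them from wherever your image metadata lives.

import os
import pandas as pd
from sklearn import model_selection

# made-up image ids and labels, purely for illustration
image_dir = "train_images"
labels = {f"img_{i:03d}.png": i % 2 for i in range(20)}

# build a dataframe with image id, image location and image label
df = pd.DataFrame(
    [
        {"image_id": name,
         "image_path": os.path.join(image_dir, name),
         "label": label}
        for name, label in labels.items()
    ]
)

# the same k-fold procedure as before, applied to the image metadata
df["kfold"] = -1
df = df.sample(frac=1).reset_index(drop=True)
kf = model_selection.KFold(n_splits=5)
for fold, (trn_, val_) in enumerate(kf.split(X=df)):
    df.loc[val_, "kfold"] = fold
df.to_csv("train_image_folds.csv", index=False)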

Another important type of cross-validation is stratified k-fold cross-validation. If you have a skewed binary classification dataset with 90% positive samples and only 10% negative samples, you should not use random k-fold cross-validation. Using simple k-fold cross-validation on such a dataset may result in folds made up entirely of negative samples. In such cases, we prefer stratified k-fold cross-validation, which keeps the ratio of labels constant in each fold. So in every fold you will have the same 90% positive and 10% negative samples, and whatever metric you choose to evaluate, it will give similar results across all folds.

It is also easy to modify the code that creates k-fold splits to create stratified k-fold splits. We only need to change model_selection.KFold to model_selection.StratifiedKFold and specify the target column to stratify on in the kf.split(...) call. We assume the CSV dataset has a column called "target" and that this is a classification problem.

import pandas as pd
from sklearn import model_selection 
if __name__ == "__main__":
    df = pd.read_csv("train.csv")
    df["kfold"] = -1
    df = df.sample(frac=1).reset_index(drop=True)
    # fetch the targets for stratification
    y = df.target.values
    kf = model_selection.StratifiedKFold(n_splits=5)
    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, 'kfold'] = f
    # save the csv once all folds are assigned
    df.to_csv("train_folds.csv", index=False)

For the wine dataset, let's look at the distribution of labels.

b = sns.countplot(x='quality', data=df)
b.set_xlabel("quality", fontsize=20) 
b.set_ylabel("count", fontsize=20)

Note that we continue from the code above, so we have already transformed the target values. From Figure 6 we can see that the quality distribution is quite skewed. Some classes have many samples, and some have far fewer. If we do simple k-fold cross-validation, the distribution of target values will not be the same in every fold. Therefore, in this case we choose stratified k-fold cross-validation.

Figure 6: Distribution of "quality" in the wine dataset

The rule is simple: if it is a standard classification problem, blindly choose stratified k-fold cross-validation.

But what should we do if we have a large amount of data? Suppose we have one million samples. A 5-fold cross-validation would mean training on 800k samples and validating on 200k. Depending on which algorithm we choose, training and even validation can be prohibitively expensive for a dataset of this size. In this case, we can opt for hold-out based validation.

The procedure for creating the hold-out is the same as for stratified k-fold cross-validation. For a dataset with one million samples, we can create ten folds instead of five and keep one of them as the hold-out set. This means we will have 100k samples held out; we will always calculate loss, accuracy, and other metrics on this set and train on the remaining 900k samples.
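A sketch of this idea, assuming the same kind of CSV with a "target" column as before: create ten stratified folds and set one of them aside as the hold-out.

import pandas as pd
from sklearn import model_selection

df = pd.read_csv("train.csv")
df["kfold"] = -1
df = df.sample(frac=1).reset_index(drop=True)

# ten folds instead of five
kf = model_selection.StratifiedKFold(n_splits=10)
for fold, (_, val_) in enumerate(kf.split(X=df, y=df.target.values)):
    df.loc[val_, "kfold"] = fold

# fold 0 (roughly 10% of the data) becomes the hold-out, the rest is for training
df_holdout = df[df.kfold == 0].reset_index(drop=True)
df_train = df[df.kfold != 0].reset_index(drop=True)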

Hold-out validation is also very commonly used when working with time-series data. Suppose the problem at hand is predicting the sales of a store for 2020, and we are given all the data from 2015 to 2019. In this case, we can select all the data for 2019 as the hold-out and train the model on all the data from 2015 to 2018.
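A hypothetical sketch of such a time-based split, assuming a CSV with a "date" column (the file name and column names are assumptions):

import pandas as pd

# assumed sales data with one row per observation and a "date" column
df = pd.read_csv("sales.csv", parse_dates=["date"])

# train on 2015-2018 and hold out all of 2019 for validation
df_train = df[df.date.dt.year <= 2018].reset_index(drop=True)
df_valid = df[df.date.dt.year == 2019].reset_index(drop=True)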

Figure 7: Example of time series data

In the example shown in Figure 7, suppose our task is to predict sales for time steps 31 to 40. We can then keep the data from steps 21 to 30 as the hold-out and train the model on steps 0 to 20. Note that when predicting steps 31 to 40, the data from steps 21 to 30 should be included in the model; otherwise, its performance will suffer considerably.

In many cases we have to deal with small datasets, and creating large validation sets means losing a lot of data that the model could learn from. In such cases we can opt for leave-one-out cross-validation, which is equivalent to the special case of k-fold cross-validation where k = N, N being the number of samples in the dataset. This means that in each training fold we train on all data samples except one. The number of folds for this type of cross-validation equals the number of samples in the dataset.

Note that this type of cross-validation can be time-consuming if the model is not fast enough, but since it should only be used on small datasets, this does not matter much.
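A minimal sketch of leave-one-out cross-validation using scikit-learn's LeaveOneOut on a small synthetic dataset (the dataset and the model are arbitrary choices for illustration):

import numpy as np
from sklearn import datasets, metrics, model_selection, tree

# a tiny synthetic dataset, since leave-one-out only makes sense for small data
X, y = datasets.make_classification(n_samples=50, n_features=5, random_state=42)

loo = model_selection.LeaveOneOut()
predictions = np.zeros_like(y)
for trn_idx, val_idx in loo.split(X):
    clf = tree.DecisionTreeClassifier(max_depth=3)
    clf.fit(X[trn_idx], y[trn_idx])
    # each fold predicts exactly one held-out sample
    predictions[val_idx] = clf.predict(X[val_idx])

print("leave-one-out accuracy:", metrics.accuracy_score(y, predictions))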

Now we can move on to regression problems. The nice thing about regression problems is that we can use all of the above cross-validation techniques except stratified k-fold cross-validation. That is, we cannot use stratified k-fold directly, but there is a way to change the problem slightly so that stratified k-fold can be used for regression. In most cases, simple k-fold cross-validation works for any regression problem. However, if you see that the distribution of targets is not consistent, you can use stratified k-fold cross-validation.

To use stratified k-fold cross-validation in a regression problem, we must first divide the target into bins, and then use stratified k-fold exactly as for classification problems. There are several ways to choose an appropriate number of bins. If you have a lot of samples (>10k, >100k), then you do not need to worry about the number of bins; just divide the data into 10 or 20 bins. If the number of samples is small, you can use a simple rule like Sturge's Rule to calculate an appropriate number of bins.

Sturge’s Rule:
$$\text{Number of Bins} = 1 + \log_2(N)$$

where $N$ is the number of samples in the dataset. This function is plotted in Figure 8.

Figure 8: Number of samples vs. number of bins according to Sturge's Rule

Let's create a sample regression dataset and try to apply stratified k-fold cross-validation, as shown in the following Python code snippet.

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection

def create_folds(data):
    # create a kfold column and randomize the rows
    data["kfold"] = -1
    data = data.sample(frac=1).reset_index(drop=True)

    # calculate the number of bins by Sturge's rule and bin the targets
    num_bins = int(np.floor(1 + np.log2(len(data))))
    data.loc[:, "bins"] = pd.cut(
        data["target"], bins=num_bins, labels=False
    )

    # stratify on the bins instead of the raw targets
    kf = model_selection.StratifiedKFold(n_splits=5)
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, 'kfold'] = f

    # drop the bins column once all folds are assigned
    data = data.drop("bins", axis=1)
    return data

if __name__ == "__main__":
    # create a sample regression dataset with 15000 samples
    X, y = datasets.make_regression(
        n_samples=15000, n_features=100, n_targets=1
    )
    df = pd.DataFrame(
        X, columns=[f"f_{i}" for i in range(X.shape[1])]
    )
    df.loc[:, "target"] = y
    df = create_folds(df)

Cross-validation is the first and most fundamental step in building a machine learning model. If you want to do feature engineering, you must first split the data. If you want to build a model, you first need to split the data. If you have a good cross-validation scheme where the validation data is representative of the training data and the real world data, then you can build a good machine learning model with high generalizability.

The types of cross-validation described in this chapter are applicable to almost all machine learning problems. However, you have to keep in mind that cross-validation also depends heavily on the data, and you may need to adopt new forms of cross-validation depending on your problem and data.

For example, suppose we have a problem where we want to build a model to detect skin cancer from images of a patient's skin. Our task is to build a binary classifier that takes an input image and predicts the probability of it being benign or malignant.

In such datasets, the training data may contain multiple images of the same patient. So, to build a good cross-validation system here, you must have stratified k-folds, but you must also make sure that patients in the training data do not appear in the validation data. Fortunately, scikit-learn offers a type of cross-validation known as GroupKFold, where patients can be treated as groups. Unfortunately, scikit-learn has no way to combine GroupKFold with StratifiedKFold, so you need to do that yourself; I leave it as an exercise for the reader.
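A hypothetical sketch of the GroupKFold part, assuming a CSV with one row per image and columns named "patient_id" and "target" (the file name and column names are assumptions):

import pandas as pd
from sklearn import model_selection

# assumed: one row per image, with the patient it belongs to and its label
df = pd.read_csv("train.csv")
df["kfold"] = -1
df = df.sample(frac=1).reset_index(drop=True)

# every patient (group) ends up entirely in a single fold
gkf = model_selection.GroupKFold(n_splits=5)
for fold, (_, val_) in enumerate(
    gkf.split(X=df, y=df.target.values, groups=df.patient_id.values)
):
    df.loc[val_, "kfold"] = fold
df.to_csv("train_folds.csv", index=False)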
