A Kaggle Gold Medalist's Python Data Mining Framework: The Basic Machine Learning Workflow, Clearly Explained

Author | Liu Zaoqi

Source | Early Python

Introduction: When learning machine learning, many students often fall into the pit of reading books and video, but lack the pit of actual project training. Sometimes they want to practice but can't find a complete tutorial. This project is translated from kaggle entry Kernel, the winner of the project Titanic gold medal, this article introduces in detail how to analyze problems, data preprocessing, model building, feature selection, model evaluation and improvement through the Titanic data set that everyone is familiar with. It is a rare gift Excellent tutorial.

Some introductory text was removed during translation, and the structure was adjusted to make it easier to read. For reasons of space, this article does not include the longer code sections; only the process and results are retained. To reproduce it completely, it is recommended to get the notebook and data set at the end of the article. If you are just getting started with machine learning, I believe you will gain something from it.

Project background and analysis

The sinking of the Titanic is one of the most famous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew. This sensational tragedy shocked the international community.

One reason for the loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, certain groups were more likely to survive than others, such as women, children, and the upper class.

In this project, we are asked to analyze which kinds of passengers were likely to survive, and to use machine learning tools to predict which passengers survived the tragedy.


Data reading and inspection

First, import the data-processing libraries and check their versions and the data folder.

# import the relevant libraries
import sys 
print("Python version: {}".format(sys.version))
import pandas as pd 
print("pandas version: {}".format(pd.__version__))
import matplotlib 
print("matplotlib version: {}".format(matplotlib.__version__))
import numpy as np 
print("NumPy version: {}".format(np.__version__))
import scipy as sp 
print("SciPy version: {}".format(sp.__version__)) 
import IPython
from IPython import display 
print("IPython version: {}".format(IPython.__version__)) 
import sklearn 
print("scikit-learn version: {}".format(sklearn.__version__))


import random
import time
# ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)
# the three data files should be placed in the working directory
from subprocess import check_output
print(check_output(["ls"]).decode("utf8"))

Next, import libraries related to modeling and prediction

# import modeling libraries
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier


from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics


import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas


# visualization settings
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

Now read the data and take a preliminary look. The info() and sample() functions give a quick overview of the variable data types.

data_raw = pd.read_csv('train.csv')
data_val  = pd.read_csv('test.csv')


# work on a deep copy so the raw data stays intact
data1 = data_raw.copy(deep = True)
# keep train and test in one list so we can clean both together
data_cleaner = [data1, data_val]


print (data_raw.info()) 
data_raw.sample(10)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
  • The Survived variable is our outcome, or dependent variable. It is a binary nominal data type: 1 means the passenger survived, 0 means they did not. All other variables are potential predictors, or independent variables. Note that more predictors do not make a better model; the right variables do.

  • The PassengerId and Ticket variables are assumed to be random unique identifiers with no effect on the outcome variable, so they will be excluded from the analysis.

  • The Pclass variable is ordinal data for the ticket class, a proxy for socioeconomic status (SES): 1 = upper class, 2 = middle class, 3 = lower class.

  • The Name variable is a nominal data type. It can be used in feature engineering to derive gender from the title, family size from the surname, and SES from titles like doctor or master. Since those variables already exist, we will use it to see whether the title (such as master) has an impact.

  • The Sex and Embarked variables are nominal data types. They will be converted into dummy variables for mathematical calculations.

  • The Age and Fare variables are continuous quantitative data types.

  • SibSp represents the number of siblings/spouses aboard, and Parch the number of parents/children aboard. Both are discrete quantitative data types, which lets feature engineering create a variable for family size.

  • The Cabin variable is a nominal data type that could be used in feature engineering to approximate a passenger's position on the ship and deck level at the time of the accident. However, since it has many null values, it adds little value and is excluded from the analysis.


Data preprocessing

In this section we will apply the 4 C's of data cleaning:

  • Correcting

  • Completing

  • Creating

  • Converting


Data correction

Reviewing the data, there do not appear to be any aberrant or unacceptable inputs. We do see potential outliers in Age and Fare, but since they are reasonable values we will wait until exploratory analysis is complete before deciding whether to include or exclude them from the data set. Note that if they were unreasonable values, such as age = 800 instead of 80, fixing them now would be a safe decision. However, we should be cautious about altering original values, since an accurate model needs accurate data.


Missing value fill

There are null or missing values in the Age, Cabin, and Embarked fields. Missing values can be a problem because some algorithms do not know how to handle them and will fail, while others, such as decision trees, can handle them. It is therefore important to deal with them before modeling, because we will compare and contrast several models.

There are two common approaches: delete the records, or fill the missing values with a reasonable input. Deleting records is not recommended, especially for a large share of the data, unless a record truly is beyond repair; it is better to impute the missing values. A basic method for qualitative data is to impute with the mode. Basic methods for quantitative data are to impute with the mean, the median, or the mean plus a randomized standard deviation. An intermediate method is to apply the basic method conditioned on a specific criterion, such as the median age per class, or the embarkation port by fare and SES. There are more sophisticated methods as well, but before deploying them they should be compared against the base model to determine whether the complexity really adds value.

For this data set, Age will be imputed with the median, the Cabin attribute will be dropped, and Embarked will be imputed with the mode. Later model iterations may revisit these decisions to see whether they improve model accuracy.


Data creation and conversion


Data creation

Feature engineering means using existing features to create new features, to determine whether they provide new signal for predicting the outcome. For this data set, we will create a Title feature to determine whether it plays a role in survival (a sketch of this omitted code follows).
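The notebook's full feature-engineering code is omitted in this article. Below is a minimal sketch following the original kernel's approach (FamilySize and IsAlone from SibSp and Parch, Title parsed from Name, and quantile/fixed-width bins for Fare and Age); the later encoding cells assume these columns exist. Note that the binning must run after the missing-value fill below, since Fare and Age must be free of nulls.

# sketch: create the engineered features used later (based on the original kernel)
for dataset in data_cleaner:
    # family size = siblings/spouses + parents/children + the passenger
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['IsAlone'] = 1
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0
    # Title is the token between the comma and the period in Name,
    # e.g. "Braund, Mr. Owen Harris" -> "Mr"
    dataset['Title'] = dataset['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
    # quartile-based fare bins and fixed-width age bins
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)

# fold rare titles (fewer than 10 passengers) into a single 'Misc' bucket
stat_min = 10
title_names = (data1['Title'].value_counts() < stat_min)
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] else x)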

Data conversion

Last but not least, we will convert data formats. For this data set, we will convert the object data types into categorical dummy variables.

print('Train columns with null values:\n', data1.isnull().sum())
print("-"*10)


print('Test/Validation columns with null values:\n', data_val.isnull().sum())
print("-"*10)


data_raw.describe(include = 'all')

Start cleaning

Completing this step requires some familiarity with the following pandas functions:

  • pandas.DataFrame

  • pandas.DataFrame.info

  • pandas.DataFrame.describe

  • Indexing and Selecting Data

  • pandas.isnull

  • pandas.DataFrame.sum

  • pandas.DataFrame.mode

  • pandas.DataFrame.copy

  • pandas.DataFrame.fillna

  • pandas.DataFrame.drop

  • pandas.Series.value_counts

  • pandas.DataFrame.loc

### handle missing values
for dataset in data_cleaner:    
    # fill Age and Fare with the median, Embarked with the mode
    dataset['Age'].fillna(dataset['Age'].median(), inplace = True)
    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)
    dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)
    
# drop the columns we will not use
drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)


print(data1.isnull().sum())
print("-"*10)
print(data_val.isnull().sum())

Next we convert the categorical data into dummy variables for mathematical analysis; there are multiple ways to encode categorical variables. In this step we also define our X (independent/feature/explanatory/predictor) and Y (dependent/target/outcome/response) variables for modeling.

label = LabelEncoder()
for dataset in data_cleaner:    
    # integer-encode each categorical column (note: fitting a fresh encoder
    # per dataset, as the original kernel does, can assign inconsistent codes
    # between train and test)
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])


Target = ['Survived']


data1_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts
data1_x_calc = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code','SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation
data1_xy =  Target + data1_x
print('Original X Y: ', data1_xy, '\n')
# define x variables for the binned features, dropping the continuous variables
data1_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y: ', data1_xy_bin, '\n')


data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy, '\n')


data1_dummy.head()

Now that data cleaning is essentially complete, let's check again.

print('Train columns with null values: \n', data1.isnull().sum())
print("-"*10)
print (data1.info())
print("-"*10)


print('Test/Validation columns with null values: \n', data_val.isnull().sum())
print("-"*10)
print (data_val.info())
print("-"*10)


data_raw.describe(include = 'all')

Splitting the training and test sets

As mentioned earlier, the provided test file is really the competition's validation data for submission. So we will use sklearn's train_test_split function to split the training data into two data sets, so that we do not overfit our model.

train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc], data1[Target], random_state = 0)
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin], data1[Target] , random_state = 0)
train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)


print("Data1 Shape: {}".format(data1.shape))
print("Train1 Shape: {}".format(train1_x.shape))
print("Test1 Shape: {}".format(test1_x.shape))


train1_x_bin.head()

Exploratory analysis

Now that our data is cleaned, we will explore it with descriptive and graphical statistics to describe and summarize the variables. At this stage you will find yourself classifying features and determining their correlation with the target variable.

for x in data1_x:
    if data1[x].dtype != 'float64' :
        print('Survival Correlation by:', x)
        print(data1[[x, Target[0]]].groupby(x, as_index=False).mean())
        print('-'*10, '\n')
        
print(pd.crosstab(data1['Title'],data1[Target[0]]))

Next comes some visual analysis. First, compare survival outcomes across the different values of each feature; a sketch of this kind of plot follows.
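The plotting code is omitted here for space; a minimal sketch of this kind of comparison with seaborn, assuming the data1 frame and the imports above, might look like:

# sketch: survival rate by feature value, one subplot per discrete feature
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
for ax, col in zip(axes.flat, ['Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch', 'IsAlone']):
    sns.barplot(x=col, y='Survived', data=data1, ax=ax)
    ax.set_title('Survival rate by ' + col)
plt.tight_layout()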

We know that class was important for survival; now let us compare class against other features.

From the figure we can see: 1) the higher the class, the more expensive the fare; 2) passengers in higher classes tend to be older; 3) the higher the class, the smaller the family size traveling together; and survival varies noticeably with class. We also know that gender was important for survival, so next let's compare sex against a second feature.

We can see that the survival rate of women is higher than that of men, and that women who embarked at port C or who were traveling alone had higher survival rates. We then look at more comparisons.

Next, plot the age distributions of surviving and non-surviving passengers.

Then draw histograms of survivors by sex and age.

Finally, visualize the entire data set; a sketch of the pair plot follows.
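In the notebook this final view is a pair plot; a minimal sketch, assuming the imports above (the height parameter requires seaborn 0.9+; older versions use size):

# sketch: pairwise relationships across the cleaned data, colored by outcome
sns.pairplot(data1, hue='Survived', diag_kind='kde', height=2)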


Modeling analysis

First, we must understand that the purpose of machine learning is to solve human problems. Machine learning can be divided into supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, you train the model with a training data set that contains the correct answers. In unsupervised learning, you train the model with a training data set that does not contain the correct answers. Reinforcement learning is a hybrid of the two, where the model is not given the correct answer immediately, but only after a sequence of events. We are doing supervised machine learning, because we train our algorithms by showing them a set of features and their corresponding targets. We then hope to give the model a new subset from the same data set and obtain similar prediction accuracy.

There are many machine learning algorithms, but depending on the target variable and the modeling goal they fall into four categories: classification, regression, clustering, and dimensionality reduction. We will focus on classification. In a nutshell, a continuous target variable calls for a regression algorithm and a discrete target variable for a classification algorithm (as an aside, although logistic regression has "regression" in its name, it is really a classification algorithm). Since our problem is predicting whether a passenger survived, a discrete target variable, we will use the classification algorithms in the sklearn library for our analysis, and use cross-validation and scoring metrics (discussed in later sections) to rank and compare the algorithms' performance.

Common machine learning classification algorithms include:

  • Ensemble methods

  • Generalized Linear Models (GLM)

  • Naive Bayes

  • K-nearest neighbors

  • Support Vector Machines (SVM)

  • Decision trees

Below, we compare these different methods (the code is long, so reply "kaggle" to the official account to get the full source code); a condensed sketch of the comparison loop follows.
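For reference, here is a condensed sketch of that comparison, following the original kernel's pattern of cross-validating a list of candidate classifiers and tabulating their scores (the list here is abbreviated; the notebook covers many more):

# sketch: cross-validate a list of candidate classifiers and tabulate results
MLA = [
    ensemble.RandomForestClassifier(),
    ensemble.GradientBoostingClassifier(),
    linear_model.LogisticRegressionCV(),
    naive_bayes.GaussianNB(),
    neighbors.KNeighborsClassifier(),
    svm.SVC(probability=True),
    tree.DecisionTreeClassifier(),
    XGBClassifier(),
]

# reusable splitter: 10 random 60/30 train/test splits
cv_split = model_selection.ShuffleSplit(n_splits=10, test_size=0.3,
                                        train_size=0.6, random_state=0)

MLA_compare = pd.DataFrame(columns=['Name', 'Train Accuracy', 'Test Accuracy'])
for row, alg in enumerate(MLA):
    cv_results = model_selection.cross_validate(
        alg, data1[data1_x_bin], data1[Target].values.ravel(),
        cv=cv_split, return_train_score=True)
    MLA_compare.loc[row] = [alg.__class__.__name__,
                            cv_results['train_score'].mean(),
                            cv_results['test_score'].mean()]

print(MLA_compare.sort_values(by='Test Accuracy', ascending=False))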


Model evaluation

Let's recap: with some basic data cleaning, analysis, and machine learning algorithms (MLA), we can predict passenger survival with about 82% accuracy. Not bad for a few lines of code. However, the question to always ask is: can we do better, and more importantly, is the return worth the time invested? For example, if we only increase accuracy by 1%, is it really worth 3 months of development? Keep this in mind when improving the model.

Before deciding how to improve the model, let us determine whether our model is worth keeping. To do that, we return to the basics of Data Science 101. We know this is a binary problem, because there are only two possible outcomes: the passenger survived or died. Think of it as a coin-flip problem: if you have a fair coin and guess heads or tails, you have a 50/50 chance of guessing correctly. So we set 50% as the worst model performance.

Without any information about the data set, we can always get 50% on a binary problem. But we do have information about the data set: we know that 1,502 of 2,224 people, or 67.5%, died. Therefore, if we simply predict the most frequent outcome, that 100% of people died, we will be right 67.5% of the time. So let's set 68% as bad model performance: any model below this is worthless, since I could just predict that everyone dies.

Next, we will use cross-validation to evaluate the model.

First, a purely handmade model, for learning purposes only; you can build your own prediction model without fancy algorithms. Here 1 means survived and 0 means died. A sketch of the coin-flip baseline is shown below.
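A sketch of the coin-flip baseline (random 1/0 guesses scored against the true labels; the seed here is illustrative, so exact numbers will vary):

# sketch: a coin-flip baseline -- random survival guesses
random.seed(2)  # illustrative seed; the notebook's seed may differ
data1['Random_Predict'] = [random.randint(0, 1) for _ in range(len(data1))]
# accuracy = fraction of random guesses matching the true outcome
acc = (data1['Survived'] == data1['Random_Predict']).mean()
print('Coin Flip Model Accuracy: {:.2f}%'.format(acc * 100))
print('Coin Flip Model Accuracy w/SciKit: {:.2f}%'.format(
    metrics.accuracy_score(data1['Survived'], data1['Random_Predict']) * 100))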

Coin Flip Model Accuracy: 49.49%
Coin Flip Model Accuracy w/SciKit: 49.49%

Summarizing the handmade model's results:

Survival Decision Tree w/Female Node: 
 Sex     Pclass  Embarked  FareBin        
female  1       C         (14.454, 31.0]     0.666667
                          (31.0, 512.329]    1.000000
                Q         (31.0, 512.329]    1.000000
                S         (14.454, 31.0]     1.000000
                          (31.0, 512.329]    0.955556
        2       C         (7.91, 14.454]     1.000000
                          (14.454, 31.0]     1.000000
                          (31.0, 512.329]    1.000000
                Q         (7.91, 14.454]     1.000000
                S         (7.91, 14.454]     0.875000
                          (14.454, 31.0]     0.916667
                          (31.0, 512.329]    1.000000
        3       C         (-0.001, 7.91]     1.000000
                          (7.91, 14.454]     0.428571
                          (14.454, 31.0]     0.666667
                Q         (-0.001, 7.91]     0.750000
                          (7.91, 14.454]     0.500000
                          (14.454, 31.0]     0.714286
                S         (-0.001, 7.91]     0.533333
                          (7.91, 14.454]     0.448276
                          (14.454, 31.0]     0.357143
                          (31.0, 512.329]    0.125000

Now model with a decision tree and look at the relevant metrics; a sketch of this step follows.
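A minimal sketch of that step (max_depth=4 is illustrative; the exact numbers below come from the notebook and depend on its settings):

# sketch: fit a depth-limited decision tree on the binned features
# and report in-sample metrics
dtree = tree.DecisionTreeClassifier(max_depth=4, random_state=0)
y_true = data1[Target].values.ravel()
dtree.fit(data1[data1_x_bin], y_true)
predictions = dtree.predict(data1[data1_x_bin])
print('Decision Tree Model Accuracy/Precision Score: {:.2f}%'.format(
    metrics.accuracy_score(y_true, predictions) * 100))
print(metrics.classification_report(y_true, predictions))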

Decision Tree Model Accuracy/Precision Score: 82.04%


              precision    recall  f1-score   support


           0       0.82      0.91      0.86       549
           1       0.82      0.68      0.75       342


    accuracy                           0.82       891
   macro avg       0.82      0.79      0.80       891
weighted avg       0.82      0.82      0.82       891

You can see the accuracy is 82%; the results are visualized below.


Cross-validation

Next comes cross-validation. The important point is that we use one subset of the data to build the model and a different, held-out subset to evaluate it. Otherwise our model will overfit: it will look great at "predicting" data it has already seen, and terrible at predicting data it has not, which is no prediction at all. It's like cheating on a practice exam to score 100%, then failing the real test.

CV is essentially a shortcut for splitting and scoring the model multiple times, so we can get an idea of how it performs on unseen data. It costs more computation, but it matters, because otherwise we gain false confidence. This is useful in a Kaggle competition, or in any use case where consistency matters and surprises should be avoided. A sketch is shown below.
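A minimal sketch of CV with sklearn, reusing the cv_split splitter defined in the comparison sketch above:

# sketch: score the decision tree over repeated train/test splits
cv_results = model_selection.cross_validate(
    tree.DecisionTreeClassifier(random_state=0),
    data1[data1_x_bin], data1[Target].values.ravel(),
    cv=cv_split, return_train_score=True)
print('Train w/bin score mean: {:.2f}'.format(cv_results['train_score'].mean() * 100))
print('Test w/bin score mean:  {:.2f}'.format(cv_results['test_score'].mean() * 100))
print('Test w/bin score 3*std: +/- {:.2f}'.format(cv_results['test_score'].std() * 3 * 100))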


Hyperparameter adjustment

When we used the sklearn decision tree (DT) classifier earlier, we accepted all the default hyperparameters. This gives us the opportunity to see how various hyperparameter settings change the accuracy of the model.

However, to tune a model we need to actually understand it. That is why I spent time in the earlier sections showing how these predictions work; make sure you understand the advantages and disadvantages of the decision tree algorithm in detail!

Below we use ParameterGrid, GridSearchCV, and a customized sklearn scoring metric to tune our model; a condensed sketch and the results follow.
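A condensed sketch of the tuning step (the parameter grid here is abbreviated; the notebook's grid is larger, and it also demonstrates ParameterGrid and a custom scorer):

# sketch: grid-search the decision tree hyperparameters with CV
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [2, 4, 6, 8, 10, None],
              'random_state': [0]}

tune_model = model_selection.GridSearchCV(
    tree.DecisionTreeClassifier(), param_grid=param_grid,
    scoring='roc_auc', cv=cv_split)
tune_model.fit(data1[data1_x_bin], data1[Target].values.ravel())

print('AFTER DT Parameters: ', tune_model.best_params_)
print('AFTER DT Test w/bin score mean: {:.2f}'.format(
    tune_model.cv_results_['mean_test_score'][tune_model.best_index_] * 100))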

BEFORE DT Parameters:  {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': 0, 'splitter': 'best'}
BEFORE DT Training w/bin score mean: 82.09
BEFORE DT Test w/bin score mean: 82.09
BEFORE DT Test w/bin score 3*std: +/- 5.57
----------
AFTER DT Parameters:  {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
AFTER DT Training w/bin score mean: 87.40
AFTER DT Test w/bin score mean: 87.40
AFTER DT Test w/bin score 3*std: +/- 5.00

Feature selection

As I said at the beginning, more predictors do not make a better model; the right predictors improve a model's accuracy. So another step in data modeling is feature selection. Here we will use sklearn's recursive feature elimination (RFE) with cross-validation (CV); a sketch is shown below.
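A minimal sketch of RFE with CV applied to the decision tree:

# sketch: recursive feature elimination with cross-validation
dtree_rfe = feature_selection.RFECV(
    tree.DecisionTreeClassifier(random_state=0),
    step=1, scoring='accuracy', cv=cv_split)
dtree_rfe.fit(data1[data1_x_bin], data1[Target].values.ravel())

# keep only the columns RFE selected
X_rfe = data1[data1_x_bin].columns.values[dtree_rfe.get_support()]
print('Selected columns:', X_rfe)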

Here are some results

BEFORE DT RFE Training Shape Old:  (891, 7)
BEFORE DT RFE Training Columns Old:  ['Sex_Code' 'Pclass' 'Embarked_Code' 'Title_Code' 'FamilySize'
 'AgeBin_Code' 'FareBin_Code']
BEFORE DT RFE Training w/bin score mean: 82.09
BEFORE DT RFE Test w/bin score mean: 82.09
BEFORE DT RFE Test w/bin score 3*std: +/- 5.57
----------
AFTER DT RFE Training Shape New:  (891, 6)
AFTER DT RFE Training Columns New:  ['Sex_Code' 'Pclass' 'Title_Code' 'FamilySize' 'AgeBin_Code'
 'FareBin_Code']
AFTER DT RFE Training w/bin score mean: 83.06
AFTER DT RFE Test w/bin score mean: 83.06
AFTER DT RFE Test w/bin score 3*std: +/- 6.22
----------
AFTER DT RFE Tuned Parameters:  {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
AFTER DT RFE Tuned Training w/bin score mean: 87.34
AFTER DT RFE Tuned Test w/bin score mean: 87.34
AFTER DT RFE Tuned Test w/bin score 3*std: +/- 6.21
----------

Model validation

First, compare the different algorithms.

Next, we combine the models by voting; a condensed sketch and the results follow.
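A condensed sketch of the voting ensemble (the estimator list here is abbreviated; the notebook votes over many more classifiers):

# sketch: hard and soft voting over a few of the stronger classifiers
vote_est = [
    ('rfc', ensemble.RandomForestClassifier(random_state=0)),
    ('gbc', ensemble.GradientBoostingClassifier(random_state=0)),
    ('svc', svm.SVC(probability=True, random_state=0)),
    ('xgb', XGBClassifier(random_state=0)),
]

for voting in ['hard', 'soft']:
    vote = ensemble.VotingClassifier(estimators=vote_est, voting=voting)
    vote_cv = model_selection.cross_validate(
        vote, data1[data1_x_bin], data1[Target].values.ravel(), cv=cv_split)
    print('{} Voting Test w/bin score mean: {:.2f}'.format(
        voting.capitalize(), vote_cv['test_score'].mean() * 100))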

Hard Voting Training w/bin score mean: 86.59
Hard Voting Test w/bin score mean: 82.39
Hard Voting Test w/bin score 3*std: +/- 4.95
----------
Soft Voting Training w/bin score mean: 87.15
Soft Voting Test w/bin score mean: 82.35
Soft Voting Test w/bin score 3*std: +/- 4.85
----------

After running the same preprocessing pipeline on the test/validation data, the final retained variables and results are as follows.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
PassengerId      418 non-null int64
Pclass           418 non-null int64
Name             418 non-null object
Sex              418 non-null object
Age              418 non-null float64
SibSp            418 non-null int64
Parch            418 non-null int64
Ticket           418 non-null object
Fare             418 non-null float64
Cabin            91 non-null object
Embarked         418 non-null object
FamilySize       418 non-null int64
IsAlone          418 non-null int64
Title            418 non-null object
FareBin          418 non-null category
AgeBin           418 non-null category
Sex_Code         418 non-null int64
Embarked_Code    418 non-null int64
Title_Code       418 non-null int64
AgeBin_Code      418 non-null int64
FareBin_Code     418 non-null int64
dtypes: category(2), float64(2), int64(11), object(6)
memory usage: 63.1+ KB
None
----------
Validation Data Distribution: 
 0    0.633971
1    0.366029
Name: Survived, dtype: float64

Summary

In the end we got a model with a submission accuracy of 0.77990. Using the same data set with different decision tree implementations (AdaBoost, random forest, gradient boosting, XGBoost, etc.) and tuning did not exceed the 0.77990 submission accuracy. Interestingly, for this data set the simple decision tree algorithm had the best default submission score, and tuning achieved the same best accuracy score.

Although testing a few algorithms on a single data set does not support general conclusions, there are several observations worth noting for this data set. The distribution of the training data differs from that of the test/validation data, which creates a wide gap between the cross-validation (CV) accuracy score and the Kaggle submission accuracy score. And for the same data set, decision-tree-based algorithms seem to converge on the same accuracy score after proper tuning.

To better align the CV score with the Kaggle score and to improve overall accuracy, more can be done in preprocessing and feature engineering; that is left for interested readers to complete.

Origin blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/108764667