Kaggle entry project: Titanic survival prediction (2) data processing

Original kaggle competition address: https://www.kaggle.com/c/titanic

Original kernel: A Data Science Framework: To Achieve 99% Accuracy

Things to know before tackling the problem:

A Data Science Framework

1. Define the Problem:

Problem → Requirement → Method → Design → Technology: this is the order in which problem solving should proceed. Before reaching for fancy techniques and algorithms, we must first be clear about what problem we actually need to solve.

2. Get data:

How to go from dirty data to clean data.

3. Prepare Data for Consumption:

It is hard to translate "Prepare Data for Consumption" precisely, but according to this kernel's explanation it is data wrangling: simply put, turning "wild" data into "tamed" data by organizing it into data structures suitable for storage and processing, and by extracting and cleaning the data and handling abnormal, missing, and outlier values.

4: Exploratory Data Analysis (Perform Exploratory Analysis):

GIGO: garbage in, garbage out. Reducing meaningless input improves the quality of the output. Visualization methods such as charts and data summaries can therefore be used to discover potential problems and the correlations between features. Likewise, categorizing the data makes it easier to understand and to choose an appropriate model.

5: Model Data:

Know how to choose the right tool for the problem; a poor model naturally leads to a poor conclusion.

6: Validate and Implement Data Model:

This step ensures the model does not overfit: great performance on the training set but poor performance on the test set.

7: Optimize and Strategize:

Improve performance through iteration.

 

Next, we will analyze the Titanic problem one by one through these seven steps.

Step 1: Define the Problem

The tragedy of the Titanic occurred on April 15, 1912, when 1,502 of the 2,224 passengers and crew on board died, shocking the world. Our task this time is to predict whether a passenger survived from the passenger information in the Titanic dataset.

Knowledge points: a binary classification problem; Python or R (the kernel uses Python).

Step 2: Gather the Data

https://www.kaggle.com/c/titanic/data (Titanic dataset download)

Step 3: Prepare Data for Consumption

3.11 Load Data Modelling Libraries

Data cleaning: the Python packages required are already imported in the kernel.

Note that the author uses Python 3, while I had been using Python 2.7 in my earlier studies, so I uninstalled the old Python environment and all its packages and installed the Python 3.6 version of Anaconda in one step. I still recommend using Python 3 for data analysis; it is more standardized and up to date than Python 2.
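For reference, a minimal sketch of the kind of imports this stage needs (the kernel loads these plus further visualization and scikit-learn modules used in later steps):

import numpy as np                 # numerical computing
import pandas as pd                # dataframes and data wrangling
import matplotlib.pyplot as plt    # plotting, used in the EDA step
import seaborn as sns              # statistical visualization

print('pandas version:', pd.__version__)  # confirm the Python 3 environment is active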

3.2 Meet and Greet Data

Get to know each attribute through the feature names and descriptions of the dataset, and inspect the feature information with the info() and sample() functions (a sketch follows after the list below). The following points are worth noting:

1. The Survived column is the target to predict: a binary variable where 1 represents survival and 0 represents death. The other features are provided for training and prediction. Note: more features are not necessarily better; what matters is choosing the right ones.

2. The PassengerId and Ticket columns are random unique identifiers that contribute nothing to prediction, so they are dropped directly.

3. Pclass is the ticket class, an ordinal proxy for socioeconomic status: 1 = upper class, 2 = middle class, 3 = lower class.

4. The Name column seems useless, but in fact we can infer gender, family size, and social status from the title in the name.

5. Sex and Embarked can be converted to dummy variables, which makes model construction easier.

6. Age and Fare are continuous variables.

7. SibSp is the number of siblings/spouses aboard, and Parch is the number of parents/children aboard. Combined, these two attributes give the family size, which is very helpful for prediction.

8. The Cabin column has too many missing values, so we drop it directly.

Apart from that, the amount of missing data is not large, but it still has to be handled before model training, which is discussed later.
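A rough sketch of that first look (file paths and variable names here are my own assumptions; keeping the train and test frames in a list lets both be cleaned in one pass, as the kernel does):

import pandas as pd

data_raw = pd.read_csv('train.csv')   # training set, contains Survived
data_val = pd.read_csv('test.csv')    # test set, Survived must be predicted

data1 = data_raw.copy(deep=True)      # working copy, keep the raw data intact
data_cleaner = [data1, data_val]      # clean both frames in one loop later

data_raw.info()                       # column dtypes and non-null counts
print(data_raw.sample(10))            # 10 random rows to get a feel for the data
print(data1.isnull().sum())           # missing values per column, training set
print(data_val.isnull().sum())        # missing values per column, test set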

3.21 The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting

The 4Cs of data cleaning:

1. Correcting (accuracy): a simple example is an Age that should be 80 but was wrongly recorded as 800; such values need to be corrected.

2. Completing (completeness): some algorithms cannot handle missing values, so missing data must of course be dealt with.

There are two common approaches: delete or fill. Deletion is not recommended, because the affected records still carry a lot of information and simply dropping them will introduce bias. When filling, use the mean, the median, or the mean plus a randomized standard deviation.

3. Creating (creativity): use the existing data to generate new features; in the words of a senior classmate of mine, it relies on wild imagination! That is a bit of an exaggeration, of course; there are many feature-creation methods to draw on. As a beginner I have not studied them in depth, and I will share my experience later.

4. Converting (data conversion): in this kernel's words, it is data formatting. Categorical data is not directly suitable for the algorithms to compute on, so we introduce dummy variables, which is closely related to one-hot encoding.

3.22 Clean Data

The author uses pandas' fillna() method to fill in the missing values in the dataset:

For continuous variables such as Age and Fare, fill with the median via median(). Why not fill with the mean? Because there are many outliers that would skew it; the median better represents the central tendency of most of the data, which will become clear in the later data visualization analysis.

For categorical variables such as Embarked, fill with the mode via mode()[0]. After all, when we don't know which category a row belongs to, it is reasonable to assign it to the most frequent one.
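Putting the two filling rules together, roughly like this (reusing the data_cleaner list from the sketch above; the drop list follows the discussion in section 3.2):

for dataset in data_cleaner:
    # continuous: the median is robust to the outliers in Age and Fare
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
    # categorical: fall back to the most frequent port of embarkation
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])

# Cabin, PassengerId and Ticket carry little usable signal, so drop them
# from the working copy of the training set.
data1.drop(['PassengerId', 'Cabin', 'Ticket'], axis=1, inplace=True)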

Then use

dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

to construct the family-size attribute.

Add an IsAlone attribute to flag whether a passenger is travelling alone.
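A minimal way to derive it from FamilySize (again looping over data_cleaner):

for dataset in data_cleaner:
    dataset['IsAlone'] = 1                                  # default: alone
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0   # has family aboard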

Extract Title from the Name attribute: we found that almost every name contains a title such as Mr or Miss.

Therefore, we extract the text between the comma and the period to form the Title attribute, which encodes information such as gender and social status.
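For example, splitting on the comma and then on the period (a sketch of the same string-splitting idea):

for dataset in data_cleaner:
    # "Braund, Mr. Owen Harris" -> "Mr"
    dataset['Title'] = (dataset['Name']
                        .str.split(', ', expand=True)[1]
                        .str.split('.', expand=True)[0])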

We construct FareBin and AgeBin to divide ticket price and age into several grades, so that the continuous variables fit the algorithms better. The qcut() and cut() functions of pandas are used here.
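Roughly as follows (four fare quantiles and five equal-width age bins; the exact bin counts are a tunable choice):

for dataset in data_cleaner:
    # qcut: quantile-based bins, each holding roughly the same number of rows
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
    # cut: equal-width bins over the observed age range
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)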

3.23 Convert Formats

Convert the categorical variables into dummy variables. A dummy variable uses 0 or 1 to indicate whether a category appears. We can see that the processed data1_dummy contains dummy variables such as Embarked_C, Embarked_Q, and Embarked_S, which should be easy to understand.
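A sketch with pandas' get_dummies (the column list here is illustrative; the kernel encodes several more features):

data1_dummy = pd.get_dummies(data1[['Sex', 'Embarked', 'Title']])
print(data1_dummy.columns.tolist())
# e.g. ['Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
#       'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', ...]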

3.24 Da-Double Check Cleaned Data

Confirm the result of the data processing once more. Cleaning done!
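One last look, assuming the frames from the earlier sketches:

print(data1.isnull().sum())     # cleaned training frame: should be all zeros
print(data_val.isnull().sum())  # test frame still has gaps in Cabin, which we ignore
data1.info()                    # final check of dtypes and the new features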

To be continued~
