[Machine Learning with Python] My First Data Preprocessing Pipeline with the Titanic Dataset

The dataset was acquired from https://www.kaggle.com/c/titanic.

For data preprocessing, I first defined three transformers:

  • DataFrameSelector: selects the features to handle.
  • CombinedAttributesAdder: adds a categorical feature Age_cat that divides the passengers into three categories according to their ages.
  • ImputeMostFrequent: since SimpleImputer with the median strategy only works on numerical variables, I wrote a transformer that imputes missing string values with the most frequent value (mode). Here I was inspired by https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn.

Then I wrote separate pipelines for the different groups of features:

  • For numerical features, I applied DataFrameSelector, SimpleImputer and StandardScaler.
  • For categorical features, I applied DataFrameSelector, ImputeMostFrequent and OneHotEncoder.
  • For the newly created feature Age_cat, which is itself categorical but derived from a numerical feature, I wrote a separate pipeline to impute the missing values and encode the categories.

Finally, we can build the full pipeline with FeatureUnion. Here is the code:

# Read data
import pandas as pd
import numpy as np

titanic_train = pd.read_csv('Dataset/Titanic/train.csv')
titanic_test = pd.read_csv('Dataset/Titanic/test.csv')
submission = pd.read_csv('Dataset/Titanic/gender_submission.csv')

# Divide attributes and labels
titanic_labels = titanic_train['Survived'].copy()
titanic = titanic_train.drop(['Survived'], axis=1)

# Feature selection
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns from a DataFrame."""
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        X = X.copy()  # avoid modifying the caller's DataFrame
        if 'Pclass' in self.attribute_name:
            # Pclass is stored as an integer but is really a category
            X['Pclass'] = X['Pclass'].astype(str)
        return X[self.attribute_name]

# Feature creation
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Derive the categorical feature Age_cat from the numerical Age."""
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        Age_cat = pd.cut(X['Age'], [0, 18, 60, 100], labels=['child', 'adult', 'old'])
        Age_cat = np.array(Age_cat)
        return pd.DataFrame(Age_cat, columns=['Age_cat'])

# Impute categorical variables with the most frequent value (mode)
class ImputeMostFrequent(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0] for c in X], index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

# Pipelines
from sklearn.impute import SimpleImputer  # Scikit-Learn 0.20+
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age', 'SibSp', 'Parch', 'Fare'])),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Pclass', 'Sex', 'Embarked'])),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

new_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age'])),
    ('attr_adder', CombinedAttributesAdder()),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

full_pipeline = FeatureUnion([
    ("num", num_pipeline),
    ("cat", cat_pipeline),
    ("new", new_pipeline),
])

titanic_prepared = full_pipeline.fit_transform(titanic)
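
Once the full pipeline has been fitted on the training set, the same fitted transformers can be reused on the test set by calling transform() only, so the test data is scaled and encoded with statistics learned from the training data. The snippet below is a minimal usage sketch; the LogisticRegression model is just a placeholder I chose for illustration and is not part of the original pipeline:

# Reuse the transformers fitted on the training data: transform(), not fit_transform()
titanic_test_prepared = full_pipeline.transform(titanic_test)

# Placeholder model, only to show the prepared arrays in use
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver='liblinear')
clf.fit(titanic_prepared, titanic_labels)
test_predictions = clf.predict(titanic_test_prepared)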

Another thing I want to mention is that the output of a pipeline step should be a 2D array rather than a 1D array. So if you select only one feature, don't forget to convert the resulting 1D array with the reshape() method. Otherwise, you will receive an error like

ValueError: Expected 2D array, got 1D array instead

Specifically, apply reshape(-1,1) if the data has a single feature (a column) and reshape(1,-1) if it contains a single sample (a row). More about the issue can be found at https://stackoverflow.com/questions/51150153/valueerror-expected-2d-array-got-1d-array-instead.
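
For illustration, here is a small standalone sketch of the reshape fix; the array below is made-up example data, not taken from the Titanic set:

import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([22.0, 38.0, 26.0, 35.0])    # 1D array, shape (4,)

ages_column = ages.reshape(-1, 1)            # shape (4, 1): one feature, several samples
ages_row = ages.reshape(1, -1)               # shape (1, 4): one sample, several features

scaled = StandardScaler().fit_transform(ages_column)   # works: the input is 2D
# StandardScaler().fit_transform(ages)  # raises "Expected 2D array, got 1D array instead"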




Reposted from www.cnblogs.com/sherrydatascience/p/10217817.html