Task2：数据清洗及特征处理

通过导入kaggle上泰坦尼克的数据集，实战数据分析全流程

2 第二章：数据清洗及特征处理

我们拿到的数据通常是不干净的，所谓的不干净，就是数据中有缺失值，有一些异常点等，需要经过一定的处理才能继续做后面的分析或建模，所以拿到数据的第一步是进行数据清洗，本章我们将学习缺失值、重复值、字符串和数据转换等操作，将数据清洗成可以分析或建模的亚子。

2.1 缺失值观察与处理

我们拿到的数据经常会有很多缺失值，比如我们可以看到Cabin列存在NaN，那其他列还有没有缺失值，这些缺失值要怎么处理呢

2.1.1 任务一：缺失值观察

(1) 请查看每个特征缺失值个数
(2) 请查看Age， Cabin， Embarked列的数据
以上方式都有多种方式，所以大家多多益善

#写入代码
# 通过info() 查看缺失值和数据类型
# 这里可以看到Age Cabin Embarked 列存在缺失值
df.info()
# 通过[]查看列标签
df[['Age', 'Cabin', 'Embarked']]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

	Age	Cabin	Embarked
0	22.0	NaN	S
1	38.0	C85	C
2	26.0	NaN	S
3	35.0	C123	S
4	35.0	NaN	S
...	...	...	...
886	27.0	NaN	S
887	19.0	B42	S
888	NaN	NaN	S
889	26.0	C148	C
890	32.0	NaN	Q

891 rows × 3 columns

#写入代码
# 查看某列是否有缺失值
# df.notna().all()
# 非缺失值总数
df.notna().sum()
df.loc[:, ['Age', 'Cabin', 'Embarked']]

	Age	Cabin	Embarked
0	22.0	NaN	S
1	38.0	C85	C
2	26.0	NaN	S
3	35.0	C123	S
4	35.0	NaN	S
...	...	...	...
886	27.0	NaN	S
887	19.0	B42	S
888	NaN	NaN	S
889	26.0	C148	C
890	32.0	NaN	Q

891 rows × 3 columns

#写入代码
# 查看缺失值
# df.isna().sum()
df.isna().any()

  PassengerId    False
    Survived       False
    Pclass         False
    Name           False
    Sex            False
    Age             True
    SibSp          False
    Parch          False
    Ticket         False
    Fare           False
    Cabin           True
    Embarked        True
    dtype: bool

2.1.2 任务二：对缺失值进行处理

(1)处理缺失值一般有几种思路

(2) 请尝试对Age列的数据的缺失值进行处理

(3) 请尝试使用不同的方法直接对整张表的缺失值进行处理

#处理缺失值的一般思路：
#提醒：可使用的函数有--->dropna函数与fillna函数
df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

#写入代码
df_Age = df.copy()
df_Age['Age'] = df_Age['Age'].fillna(0)
df_Age['Age'].isna()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool

df[df['Age'] == np.nan] = 0
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

#写入代码
df_inter = df.copy()
# 等差插值
df_inter['Age'].interpolate().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2d1e0636630>

在这里插入图片描述

#写入代码
df_drop = df.copy()

df_drop.dropna(inplace=True)

【思考1】dropna和fillna有哪些参数，分别如何使用呢?

axis: 轴向 0是行 1是列
how : 对缺失值的处理方式
inplace: 就地操作
thresh: 保留NA值的个数

【思考2】检索空缺值用np.nan要比用None好，这是为什么？

#思考回答
None 是Python自带的None 属于NoneType类型不能参与运算
np.nan可以参与运算结果为nan

2.2 重复值观察与处理

由于这样那样的原因，数据中会不会存在重复值呢，如果存在要怎样处理呢

2.2.1 任务一：请查看数据中的重复值

#写入代码
df[df.duplicated()]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked

2.2.2 任务二：对重复值进行处理

(1)重复值有哪些处理方式呢？

(2)处理我们数据的重复值

方法多多益善

重复值有哪些处理方式：

可以填充为均值
可以删除重复值

#写入代码
df.drop_duplicates().head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

2.2.3 任务三：将前面清洗的数据保存为csv格式

#写入代码
df.to_csv('test_clear.csv')

2.3 特征观察与处理

我们对特征进行一下观察，可以把特征大概分为两大类：
数值型特征：Survived ，Pclass， Age ，SibSp， Parch， Fare，其中Survived， Pclass为离散型数值特征，Age，SibSp， Parch， Fare为连续型数值特征
文本型特征：Name， Sex， Cabin，Embarked， Ticket，其中Sex， Cabin， Embarked， Ticket为类别型文本特征，数值型特征一般可以直接用于模型的训练，但有时候为了模型的稳定性及鲁棒性会对连续变量进行离散化。文本型特征往往需要转换成数值型特征才能用于建模分析。

2.3.1 任务一：对年龄进行分箱（离散化）处理

(1) 分箱操作是什么？

(2) 将连续变量Age平均分箱成5个年龄段，并分别用类别变量12345表示

(3) 将连续变量Age划分为[0,5) [5,15) [15,30) [30,50) [50,80)五个年龄段，并分别用类别变量12345表示

(4) 将连续变量Age按10% 30% 50 70% 90%五个年龄段，并用分类变量12345表示

(5) 将上面的获得的数据分别进行保存，保存为csv格式

#分箱操作是什么：

数据分箱（也称为离散分箱或分段）是一种数据预处理技术，用于减少次要观察误差的影响，是一种将多个连续值分组为较少数量的“分箱”的方法。

对异常数据有很强的鲁棒性：比如一个特征是会话时长=702341sec，换算成天是8.1天，这属于明显的异常值。如果特征没有离散化，一个异常数据“会话时长=8.1天”会给模型造成很大的干扰；
在很多网页分析系统中，0点之后会话将被强行切分，所以会话时长不可能超过1天。

在逻辑回归模型中，单变量离散化为N个哑变量后，每个哑变量有单独的权重，相当于为模型引入了非线性，能够提升模型表达能力，加大拟合；

缺失值也可以作为一类特殊的变量进入模型

分箱后降低模型运算复杂度，提升模型运算速度，对后后期生产上线较为友好

df_cut1 = df.copy()
df_cut1['Age_range'] = pd.cut(df_cut1['Age'], 5, labels=['1','2','3','4','5'])
df_cut1

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age_range
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	3
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	2
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	3
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	3
...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S	2
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S	2
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S	NaN
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C	2
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q	2

891 rows × 13 columns

df_cut1.to_csv("test_ave.csv")

df_cut2 = df.copy()
df_cut2['Age_range'] = pd.cut(df_cut2['Age'], [1e-10,5,15,30,50,80], labels=['1','2','3','4','5'])
df_cut2

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age_range
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	3
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	4
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	3
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	4
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	4
...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S	3
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S	3
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S	NaN
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C	3
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q	4

891 rows × 13 columns

df_cut2.to_csv("test_cut.csv")

df_cut3 = df.copy()
df_cut3['Age_range'] = pd.qcut(df_cut3['Age'], [0,0.1,0.3,0.5,0.7,0.9], labels=['1','2','3','4','5'])
df_cut3

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age_range
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	5
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	3
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	4
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	4
...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S	3
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S	2
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S	NaN
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C	3
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q	4

891 rows × 13 columns

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

【参考】https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

2.3.2 任务二：对文本变量进行转换

(1) 查看文本变量名及种类
(2) 将文本变量Sex， Cabin ，Embarked用数值变量12345表示
(3) 将文本变量Sex， Cabin， Embarked用one-hot编码表示

#写入代码
df_test = df.copy()
df_test['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

#写入代码
df_test['Cabin'].value_counts()

G6             4
C23 C25 C27    4
B96 B98        4
E101           3
C22 C26        3
              ..
B80            1
B39            1
A16            1
C91            1
B37            1
Name: Cabin, Length: 147, dtype: int64

#写入代码
df_test['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

df_test['Sex'].unique()

array(['male', 'female'], dtype=object)

df_test['Cabin'].unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148'], dtype=object)

df_test['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

from sklearn.preprocessing import LabelEncoder

for feature in ['Sex', 'Cabin', 'Embarked']:
    label = LabelEncoder()
    label_dict = dict(zip(df_test[feature].unique(), range(df_test[feature].nunique())))
    df_test[feature + "_labelEncode"] = df_test[feature].map(label_dict)
    df_test[feature + "_labelEncode"] = label.fit_transform(df_test[feature].astype(str))
    
df_test

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Sex_labelEncode	Cabin_labelEncode	Embarked_labelEncode
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	1	147	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	0	81	0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	0	147	2
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	0	55	2
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	1	147	2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S	1	147	2
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S	0	30	2
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S	0	147	2
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C	1	60	0
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q	1	147	1

891 rows × 15 columns

for feature in ['Sex', 'Cabin', 'Embarked']:
    x = pd.get_dummies(df_test[feature], prefix=feature)
    df_test = pd.concat([df_test, x], axis=1)
    
df_test

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	...	Cabin_F G73	Cabin_F2	Cabin_F33	Cabin_F38	Cabin_F4	Cabin_G6	Cabin_T	Embarked_C	Embarked_Q	Embarked_S
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	...	0	0	0	0	0	0	0	0	0	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	...	0	0	0	0	0	0	0	1	0	0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	...	0	0	0	0	0	0	0	0	0	1
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	...	0	0	0	0	0	0	0	0	0	1
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	...	0	0	0	0	0	0	0	0	0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	...	0	0	0	0	0	0	0	0	0	1
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	...	0	0	0	0	0	0	0	0	0	1
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	...	0	0	0	0	0	0	0	0	0	1
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	...	0	0	0	0	0	0	0	1	0	0
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	...	0	0	0	0	0	0	0	0	1	0

891 rows × 167 columns

2.3.3 任务三：从纯文本Name特征里提取出Titles的特征(所谓的Titles就是Mr,Miss,Mrs等)

#写入代码

df_test['Title'] = df_test['Name'].str.extract('([A-Za-z]+)\.', expand=False)
df_test.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	...	Embarked_C	Embarked_S	Title
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	...	0	1	Mr
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	...	1	0	Mrs
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	...	0	1	Miss
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	...	0	1	Mrs
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	...	0	1	Mr

5 rows × 168 columns

#保存最终你完成的已经清理好的数据
df_test.to_csv('test_fin.csv')

【数据分析入门】动手学数据分析 Task2

Task2：数据清洗及特征处理

2 第二章：数据清洗及特征处理

2.1 缺失值观察与处理

2.1.1 任务一：缺失值观察

2.1.2 任务二：对缺失值进行处理

2.2 重复值观察与处理

2.2.1 任务一：请查看数据中的重复值

2.2.2 任务二：对重复值进行处理

重复值有哪些处理方式：

2.2.3 任务三：将前面清洗的数据保存为csv格式

2.3 特征观察与处理

2.3.1 任务一：对年龄进行分箱（离散化）处理

2.3.2 任务二：对文本变量进行转换

2.3.3 任务三：从纯文本Name特征里提取出Titles的特征(所谓的Titles就是Mr,Miss,Mrs等)

猜你喜欢