1. Import the required libraries

For example numpy, pandas, and sklearn.

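2. Import the dataset

The dataset used here is in .csv format, which stores tabular data as plain text; each line of the file is one data record.

The file is read with the read_csv() function from the pandas library. Its official documentation is at: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv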
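As a minimal sketch of what read_csv() returns, the snippet below parses a small inline CSV instead of a file on disk. The column names (Country, Age, Salary, Purchased) and the three rows are assumptions made for illustration; they are not taken from the actual Data.csv.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file such as Data.csv.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,40,,Yes
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)         # (rows, columns)
print(dataset.dtypes)        # one inferred dtype per column
print(dataset.isna().sum())  # empty fields are parsed as NaN
```

The empty Salary field in the last row is read as NaN, which is why a missing-value step is needed later.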

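After reading the file, the feature values and the labels need to be extracted, using either pandas.DataFrame.loc[] or pandas.DataFrame.iloc[]. loc selects rows by row label, while iloc selects rows by integer position. (An older indexer, ix, accepted either labels or positions, but it was deprecated and then removed in newer pandas releases because loc and iloc cover its uses.) Both are described in the official pandas documentation.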
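The difference between loc and iloc is easiest to see on a frame whose row labels are not integers. This is only a sketch: the frame, its row labels "a", "b", "c", and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"Age": [44, 27, 30], "Purchased": ["No", "Yes", "No"]},
    index=["a", "b", "c"],
)

by_label = df.loc["b"]     # the row whose label is "b"
by_position = df.iloc[1]   # the second row, whatever its label is

# Typical feature/label split by position, as in the script further down.
features = df.iloc[:, :-1].values  # every column except the last
labels = df.iloc[:, -1].values     # only the last column

# Label slices include both endpoints; position slices exclude the stop.
label_slice = df.loc["a":"b"]
position_slice = df.iloc[0:1]
print(len(label_slice), len(position_slice))
```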

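3. Handle missing data

Whether numeric data contains missing values can be checked with np.isnan(). Missing values are usually replaced with the mean or the median of the whole column. This post uses the sklearn.preprocessing.Imputer class; official documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer . Note that Imputer was removed in scikit-learn 0.22, where sklearn.impute.SimpleImputer takes its place. See also: https://blog.csdn.net/kancy110/article/details/75041923?TPSecNotice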
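For readers on a recent scikit-learn, here is a small sketch of the same idea using SimpleImputer rather than the Imputer class named above. The 4x2 array is invented for illustration and only loosely mirrors the age and salary columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with two gaps.
X_num = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [40.0, np.nan],
    [np.nan, 52000.0],
])

# np.isnan works on float arrays; count the missing entries per column.
print(np.isnan(X_num).sum(axis=0))

# Replace every NaN with the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_num)
print(X_filled)
```

Passing strategy="median" instead would fill the gaps with the column median.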

4. Encode categorical data

Categorical data refers to variables that hold label values rather than numbers, usually drawn from a fixed set of values. Values such as "male" and "female" cannot be used in calculations, so they need to be encoded as numbers. This uses the sklearn.preprocessing.LabelEncoder and sklearn.preprocessing.OneHotEncoder classes. Official documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder and http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder . See also: https://www.cnblogs.com/king-lps/p/7846414.html and https://blog.csdn.net/accumulate_zhang/article/details/78510571
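The sketch below shows one way to do this encoding on a recent scikit-learn, where OneHotEncoder is applied to a single column through ColumnTransformer. That wrapper is a swapped-in technique, not what the script in this post uses, and the tiny arrays are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical data: one categorical column and one numeric column.
X = np.array(
    [["France", 44.0], ["Spain", 27.0], ["Germany", 30.0], ["Spain", 38.0]],
    dtype=object,
)
y = np.array(["No", "Yes", "No", "Yes"])

# One-hot encode column 0 and pass the other column through unchanged.
# sparse_threshold=0 asks for a dense array as the result.
ct = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X).astype(float)

# LabelEncoder maps the sorted class names to 0, 1, ...
y_encoded = LabelEncoder().fit_transform(y)

print(X_encoded)
print(y_encoded)
```

The one-hot columns come first, in sorted category order (France, Germany, Spain), followed by the untouched numeric column.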

5. Split the dataset

The original dataset is split into a training set and a test set by a fixed ratio, usually 8:2 or 7:3. This uses sklearn.model_selection.train_test_split(). Official documentation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split
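A quick sketch of an 8:2 split on ten made-up samples. The arrays here are placeholders, and random_state is fixed only so the shuffle is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, plus binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 keeps 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Using test_size=0.3 would give the 7:3 ratio instead.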

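6. Feature scaling

Feature values vary widely in magnitude, units, and range. In distance calculations, features with large magnitudes carry more weight than features with small magnitudes, so the feature values need to be standardized or normalized. The classes commonly used for this are sklearn.preprocessing.StandardScaler and sklearn.preprocessing.MinMaxScaler. Official documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler and http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler . See also: https://blog.csdn.net/ybdesire/article/details/56027408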
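The following sketch contrasts the two scalers on a few invented rows. In both cases the scaler is fitted on the training data only and then applied to the test data, which is the usual way to keep test-set statistics out of the preprocessing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical age and salary values.
X_train = np.array([
    [27.0, 48000.0],
    [38.0, 61000.0],
    [44.0, 72000.0],
    [48.0, 79000.0],
])
X_test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

# Standardization: subtract the training mean, divide by the training std.
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)  # reuse the training statistics

# Normalization: map the training minimum and maximum to 0 and 1.
mm = MinMaxScaler()
X_train_mm = mm.fit_transform(X_train)
X_test_mm = mm.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_mm)
```

Test values outside the training range can fall outside [0, 1] after MinMaxScaler, as the second test row does here.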

Code:

# -*- coding: utf-8 -*-
"""
Created on Sun Sep  2 19:33:18 2018

@author: zhengyuv
"""
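# 1. Import libraries
# Imputer and OneHotEncoder(categorical_features=...) need scikit-learn < 0.22;
# later releases provide sklearn.impute.SimpleImputer and
# sklearn.compose.ColumnTransformer instead.
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler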

# 2. Import the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[ : , :-1].values  # features: every column except the last
Y = dataset.iloc[ : , 3].values    # labels: the column at position 3
print("X")
print(X)
print("Y")
print(Y)

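# 3. Handle missing data
# Replace NaN in columns 1 and 2 with the mean of each column (axis=0)
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis=0)
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])
print("After handling missing values")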
print("X")
print(X)

# 4. Encode categorical data
labelencoder_X = LabelEncoder()
X[ : ,0] = labelencoder_X.fit_transform(X[ : , 0])
# Create dummy variables (one-hot columns) from column 0
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
print("After re-encoding the data")
print("X")
print(X)
print("Y")
print(Y)

# 5. Split the dataset (80% training, 20% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("After splitting the dataset")
print("X_train")
print(X_train)
print("X_test")
print(X_test)
print("Y_train")
print(Y_train)
print("Y_test")
print(Y_test)

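# 6. Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# Caution: fit_transform refits the scaler on the test set itself;
# sc.transform(X_test) would reuse the training-set mean and variance.
X_test = sc.fit_transform(X_test)
print("After feature scaling")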
print("X_train")
print(X_train)
print("X_test")
print(X_test)

Output:

X
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 nan]
['France' 35.0 58000.0]
['Spain' nan 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
Y
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
After handling missing values
X
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 63777.77777777778]
['France' 35.0 58000.0]
['Spain' 38.77777777777778 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
After re-encoding the data
X
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
5.40000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
8.30000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]]
Y
[0 1 0 0 1 1 0 1 0 1]
After splitting the dataset
X_train
[[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]]
X_test
[[0.0e+00 1.0e+00 0.0e+00 3.0e+01 5.4e+04]
[0.0e+00 1.0e+00 0.0e+00 5.0e+01 8.3e+04]]
Y_train
[1 1 1 0 1 0 0 1]
Y_test
[0 0]
After feature scaling
X_train
[[-1. 2.64575131 -0.77459667 0.26306757 0.12381479]
[ 1. -0.37796447 -0.77459667 -0.25350148 0.46175632]
[-1. -0.37796447 1.29099445 -1.97539832 -1.53093341]
[-1. -0.37796447 1.29099445 0.05261351 -1.11141978]
[ 1. -0.37796447 -0.77459667 1.64058505 1.7202972 ]
[-1. -0.37796447 1.29099445 -0.0813118 -0.16751412]
[ 1. -0.37796447 -0.77459667 0.95182631 0.98614835]
[ 1. -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
X_test
[[ 0. 0. 0. -1. -1.]
[ 0. 0. 0. 1. 1.]]

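[My Python Machine Learning Journey · 1] Data Preprocessing

Data preprocessing in Python mainly goes through six steps: importing the required libraries, importing the dataset, handling missing data, encoding categorical data, splitting the dataset, and feature scaling.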
