100-Days-Of-ML-Code Day 1: Data Preprocessing

https://github.com/MLEveryday (Chinese version)
https://github.com/Avik-Jain/100-Days-Of-ML-Code/ (English version)
Day 1: Data Preprocessing
1 Import the required libraries

import numpy as np
import pandas as pd

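2 Read the data into X and Y
Note that a slice is half-open (start included, end excluded), so :-1 stops before the last column. X therefore holds every column of the dataset except the last one.
iloc[...].values returns the selected values as a NumPy array.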

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 3].values
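The slicing above can be checked on a small frame. This is a minimal sketch: the column names and the "Yes"/"No" labels are assumptions standing in for Data.csv, whose contents are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns and one label column.
# The column names and label values are assumptions for illustration only.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# ":-1" stops before the last column, so X keeps the first three columns.
X = dataset.iloc[:, :-1].values
# With four columns, index 3 is the last one, so it selects the labels.
Y = dataset.iloc[:, 3].values

print(type(X).__name__, X.shape)
print(Y)
```

Because the columns have mixed types, X comes back as an object array, which matches the dtype=object shown in the outputs of this post.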
3 Handle missing data
Missing values are replaced with the mean. axis=0 computes the mean along each column; axis=1 computes it along each row.
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])
X
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
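Imputer was removed from sklearn.preprocessing in scikit-learn 0.22, so the code above fails on newer installations. The following is a sketch of the same mean imputation with SimpleImputer from sklearn.impute, assuming a version that provides it (0.20 or later). The numbers are the Age and Salary columns from the output above, with the two imputed entries set back to NaN.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary from the output above; the two filled-in entries are
# restored to NaN so the imputer has something to do.
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

# SimpleImputer always works column by column, so it has no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# statistics_ holds the per-column means of the observed entries.
print(imputer.statistics_)
print(filled[6, 0], filled[4, 1])
```

The two filled values should agree with the 38.777... and 63777.777... entries in the output above.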

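4 Encode the categorical data with LabelEncoder and OneHotEncoder
LabelEncoder maps each category to an integer from 0 to n_classes - 1, where n_classes is the number of distinct categories.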

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0])
X
array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
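The integer codes follow the sorted order of the category names. A minimal NumPy sketch of the same mapping, using the country column from the output above (this illustrates the idea and is not a replacement for LabelEncoder):

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain", "Germany",
                      "France", "Spain", "France", "Germany", "France"])

# np.unique returns the sorted distinct values; with return_inverse=True it
# also returns, for each entry, its index in that sorted list.
classes, codes = np.unique(countries, return_inverse=True)

print(classes)
print(codes)
```

France, Germany and Spain sort alphabetically, so they receive 0, 1 and 2, which is why the first column above reads 0, 2, 1, 2, 1, ...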

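OneHotEncoder(categorical_features=[i]) one-hot encodes the feature in column i.
onehotencoder.fit_transform(X).toarray() encodes column i of X and expands it into indicator columns, returned as a dense array. Here column 0 is one-hot encoded, and the resulting indicator columns are placed at the front of the array.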

onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
X
array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])
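The categorical_features argument was removed from OneHotEncoder in scikit-learn 0.22. On newer versions, one way to encode a single column is ColumnTransformer from sklearn.compose. The sketch below assumes such a version and uses the first three rows from the output above, with the country names left as strings, since OneHotEncoder accepts string categories directly.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of X after imputation, country names kept as strings.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged.
# sparse_threshold=0 asks for a dense array instead of a sparse matrix.
ct = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X).astype(float)

print(X_encoded)
```

As in the output above, the three indicator columns come first, followed by the age and salary columns.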

Likewise, LabelEncoder maps the labels in Y to integers from 0 to n_classes - 1.

labelencoder_Y = LabelEncoder()
Y =  labelencoder_Y.fit_transform(Y)
Y
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
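5 Split the data into training and test sets
The scikit-learn version used here is 0.20, so train_test_split is imported from sklearn.model_selection rather than from sklearn.cross_validation as in the original blog post.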
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)
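With test_size = 0.2 and 10 samples, 8 rows go to training and 2 to testing, and a fixed random_state makes the shuffle repeatable. The sketch below shows the idea in plain NumPy. simple_split is a hypothetical helper, not scikit-learn's implementation, so the rows it picks will differ from those chosen by train_test_split.

```python
import numpy as np

def simple_split(X, Y, test_size=0.2, seed=0):
    """Shuffle the row indices once, then cut them into test and train parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # X and Y are indexed with the same rows, so each sample keeps its label.
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# Toy data: 10 samples with 2 features each, plus the labels from above.
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, Y_train, Y_test = simple_split(X, Y)
print(X_train.shape, X_test.shape)
```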
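6 Standardize the data
Each feature is transformed as (x - u) / s, where u is its mean and s is its standard deviation, so the scaled training data has mean 0 and variance 1.
Import StandardScaler from sklearn.preprocessing.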
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  # reuse the mean and std learned from X_train
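A minimal NumPy sketch of the standardization formula, using a few age and salary rows from the outputs above as assumed training and test sets. The mean and standard deviation are computed from the training rows only and then applied to the test rows, so that both sets are on the same scale.

```python
import numpy as np

# Assumed split for illustration: four training rows and two test rows.
X_train = np.array([[44.0, 72000.0],
                    [27.0, 48000.0],
                    [30.0, 54000.0],
                    [38.0, 61000.0]])
X_test = np.array([[35.0, 58000.0],
                   [48.0, 79000.0]])

# Per-column statistics of the training data. std() defaults to ddof=0,
# the population standard deviation, which is what StandardScaler uses.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # same mean and std as the training data

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1. The test columns generally do not, because they are measured against the training statistics.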

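Summary: the main things learned on day 1 are LabelEncoder and OneHotEncoder.
When reading the data, note that iloc[...].values returns an array.
Functions may also move to a different module between versions, so remember to check the official documentation.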


Reposted from blog.csdn.net/wehung/article/details/82962645