机器学习一百天学习笔记
原github:
https://github.com/Avik-Jain/100-Days-Of-ML-Code
中文版地址:
https://github.com/MLEveryday/100-Days-Of-ML-Code
参考:
https://blog.csdn.net/ssswill/article/details/86022051
day1数据预处理
一,导入库
import numpy as np import pandas as pd
二,导入数据集
#导入数据集 dataset = pd.read_csv('D:\\100Days\datasets\Data.csv') #print(dataset) X = dataset.iloc[ : , :-1].values print(X) #提取列数据,前面的:表示提取所有行,后面的为切片,提取1到倒数第二列 Y = dataset.iloc[ : , 3].values print(Y)
loc是显式索引,iloc是隐式索引(左闭右开),
.values把dataframe转为np数组,X,Y为
[['France' 44.0 72000.0] ['Spain' 27.0 48000.0] ['Germany' 30.0 54000.0] ['Spain' 38.0 61000.0] ['Germany' 40.0 nan] ['France' 35.0 58000.0] ['Spain' nan 52000.0] ['France' 48.0 79000.0] ['Germany' 50.0 83000.0] ['France' 37.0 67000.0]] ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
把df转为np二维数组
三,处理丢失数据
方法一:按照原网站方法
#处理丢失数据 from sklearn.preprocessing import Imputer imputer = Imputer(missing_values="NaN",strategy="mean",axis=0) imputer = imputer.fit(X[ : ,1:3]) X[ : , 1:3] = imputer.transform(X[ : , 1:3])