Data processing based on sklearn

The sklearn library incorporates a wide variety of machine learning algorithms, so models can be built quickly during data analysis. Although the pandas library provides data merging, cleaning, and standardization (min-max normalization, z-score standardization, decimal scaling), building features for a machine learning model requires further preprocessing. sklearn therefore packages the relevant preprocessing functions behind a uniform interface: the Transformer. With sklearn transformers, an incoming NumPy array can be normalized, binarized, reduced with PCA, and so on.

       When it comes to data transformation, pandas itself also provides dummy-variable encoding for categorical data and discretization of continuous features. This is one reason why learning only SQL cannot completely replace pandas. But the transformers introduced by sklearn make it more convenient to apply the same operations uniformly to both the training set and the test set.
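As a small sketch of the pandas side mentioned above (the column names here are made up for illustration), `pd.get_dummies` encodes a categorical column as dummy variables and `pd.cut` discretizes a continuous one:

```python
import pandas as pd

# hypothetical toy frame with one categorical and one continuous column
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                   'size': [1.2, 3.4, 2.2, 5.1]})

# dummy-variable (one-hot) encoding of the categorical column
dummies = pd.get_dummies(df['color'], prefix='color')
print(sorted(dummies.columns))  # ['color_blue', 'color_green', 'color_red']

# discretize the continuous column into 3 equal-width bins
df['size_bin'] = pd.cut(df['size'], bins=3,
                        labels=['small', 'medium', 'large'])
print(df['size_bin'].tolist())  # ['small', 'medium', 'small', 'large']
```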

       sklearn also provides classic datasets that are convenient for learning; each dataset is stored in a dictionary-like structure. Using the variable explorer in Anaconda's Spyder, the variables and their values can be inspected intuitively. Before doing any analysis, these datasets show us what a cleaned-up data format ultimately looks like, with three basic elements: data (data), labels (target), and features (feature_names). Splitting the data and the subsequent training and testing all depend on these elements.

 

1. Loading a dataset with sklearn.datasets

To load a dataset, assign the return value of the corresponding loader function to a variable. Again, a dataset has three elements: data (data), labels (target), and features (feature_names). As shown in the following code:


from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()  ## assign the dataset to the variable cancer

print('length of the breast_cancer dataset is:', len(cancer))

print('type of the breast_cancer dataset is:', type(cancer))

 

 


cancer_data = cancer['data']  ## take out the data of the dataset

print('data of the breast_cancer dataset:', '\n', cancer_data)

cancer_target = cancer['target']  ## take out the labels of the dataset

print('labels of the breast_cancer dataset:\n', cancer_target)

cancer_names = cancer['feature_names']  ## take out the feature names of the dataset

print('feature names of the breast_cancer dataset:\n', cancer_names)

cancer_desc = cancer['DESCR']  ## take out the description of the dataset

print('description of the breast_cancer dataset:\n', cancer_desc)
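Since the loaded dataset behaves like a dictionary, one common follow-up (a sketch, not part of the original code) is to combine data and feature_names into a pandas DataFrame for inspection:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
# build a DataFrame from the data array, using feature_names as columns
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
df['target'] = cancer['target']  # append the label column
print(df.shape)  # (569, 31): 30 features plus the target column
```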

 

2. Splitting the data into a training set and a test set

Why split the data? Because this is where machine learning methods differ from traditional experimental and theoretical thinking: they let computational thinking uncover the relationships hidden inside the data. The machine learning approach is to find the internal patterns and relationships of the data from a labeled training set, and then evaluate them on data the model has not seen.

The model_selection module of sklearn provides the train_test_split function, which splits a dataset.


print('shape of the original dataset data:', cancer_data.shape)

print('shape of the original dataset labels:', cancer_target.shape)

 

from sklearn.model_selection import train_test_split

cancer_data_train, cancer_data_test, \
cancer_target_train, cancer_target_test = \
    train_test_split(cancer_data, cancer_target,
                     test_size=0.2, random_state=42)

print('shape of the training set data:', cancer_data_train.shape)

print('shape of the training set labels:', cancer_target_train.shape)

print('shape of the test set data:', cancer_data_test.shape)

print('shape of the test set labels:', cancer_target_test.shape)
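One optional refinement worth knowing (not used in the code above) is the `stratify` parameter of `train_test_split`, which keeps the class proportions of the labels roughly equal across the two splits:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer['data'], cancer['target'],
    test_size=0.2, random_state=42,
    stratify=cancer['target'])  # sample within each class separately

# the fraction of positive labels is nearly the same in both splits
print(round(y_train.mean(), 2), round(y_test.mean(), 2))
```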

 

3. Data preprocessing and dimensionality reduction with sklearn transformers

To eliminate the effects that differences in scale and value range between features may cause, the data needs to be standardized, also called normalized. Standardization puts all features on a comparable scale, while PCA dimensionality reduction cuts the number of features and with it the computational cost of later modeling.

sklearn transformers mainly expose three methods: fit, transform, and fit_transform.
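The division of labor is: `fit` learns the parameters of the transformation from the data, `transform` applies them, and `fit_transform` does both in one call. A minimal sketch showing the equivalence:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])

# two-step: learn the min/max, then apply the scaling
a = MinMaxScaler().fit(X).transform(X)
# one-step shortcut
b = MinMaxScaler().fit_transform(X)

print(a.ravel())          # [0.  0.5 1. ]
print(np.allclose(a, b))  # True
```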


import numpy as np
from sklearn.preprocessing import MinMaxScaler

Scaler = MinMaxScaler().fit(cancer_data_train)  ## learn the scaling rule from the training set

## apply the rule to the training set
cancer_trainScaler = Scaler.transform(cancer_data_train)

## apply the same rule to the test set (never refit on the test set,
## otherwise the two sets are scaled inconsistently)
cancer_testScaler = Scaler.transform(cancer_data_test)

print('minimum of the training set data before min-max scaling:', np.min(cancer_data_train))

print('minimum of the training set data after min-max scaling:', np.min(cancer_trainScaler))

print('maximum of the training set data before min-max scaling:', np.max(cancer_data_train))

print('maximum of the training set data after min-max scaling:', np.max(cancer_trainScaler))

print('minimum of the test set data before min-max scaling:', np.min(cancer_data_test))

print('minimum of the test set data after min-max scaling:', np.min(cancer_testScaler))

print('maximum of the test set data before min-max scaling:', np.max(cancer_data_test))

print('maximum of the test set data after min-max scaling:', np.max(cancer_testScaler))

 

 

from sklearn.decomposition import PCA

pca_model = PCA(n_components=10).fit(cancer_trainScaler)  ## learn the rule from the training set

cancer_trainPca = pca_model.transform(cancer_trainScaler)  ## apply the rule to the training set

cancer_testPca = pca_model.transform(cancer_testScaler)  ## apply the rule to the test set

print('shape of the training set data before PCA:', cancer_trainScaler.shape)

print('shape of the training set data after PCA:', cancer_trainPca.shape)

print('shape of the test set data before PCA:', cancer_testScaler.shape)

print('shape of the test set data after PCA:', cancer_testPca.shape)
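How many components to keep (`n_components=10` above is just one choice) can be judged from `explained_variance_ratio_`, which reports the share of the total variance each retained component preserves. A sketch under the same scaling as above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

X = MinMaxScaler().fit_transform(load_breast_cancer()['data'])
pca = PCA(n_components=10).fit(X)

# one ratio per retained component; their sum is the variance kept
print(len(pca.explained_variance_ratio_))  # 10
print(pca.explained_variance_ratio_.sum())
```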



Origin blog.51cto.com/14510351/2435843