A Short Guide to Understanding PCA in Python 3

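Principal Component Analysis (PCA) is a multivariate statistical method that is also widely used in machine learning and other fields. It applies an orthogonal transformation to turn a set of possibly correlated variables into a set of linearly uncorrelated variables, called principal components. Its main use is reducing the dimensionality of high-dimensional data. PCA replaces the original n features with a smaller number k of new features; each new feature is a linear combination of the old ones, chosen to maximize the sample variance while keeping the k new features mutually uncorrelated. For more background on PCA, see: https://en.wikipedia.org/wiki/Principal_component_analysis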

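The main steps of the PCA algorithm are:
(1) Arrange the data as a matrix so the model can use it: each row is a sample and each column is an attribute (feature);
(2) Compute the mean of each feature;
(3) Subtract the feature's mean from every sample (mean-centering, sometimes loosely called normalization);
(4) Compute the covariance matrix;
(5) Find the eigenvalues and eigenvectors of the covariance matrix;
(6) Reorder the eigenvalues and eigenvectors so that the eigenvalues run from largest to smallest;
(7) Compute the cumulative contribution rate of the eigenvalues;
(8) Select the subset of eigenvectors whose cumulative contribution rate reaches a chosen threshold;
(9) Transform the centered data from step (3) by projecting it onto the selected eigenvectors.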
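
The nine steps above can be sketched in a few lines of NumPy. This is only a minimal illustration on a small made-up array, not the implementation used later in this post; the function name `pca_by_steps` and the sample values are assumptions chosen for the example.

```python
import numpy as np

def pca_by_steps(X, rate=0.95):
    """Follow steps (1)-(9); return the projected data and the number of components kept."""
    X = np.asarray(X, dtype=float)            # (1) rows = samples, columns = features
    mean = X.mean(axis=0)                     # (2) per-feature mean
    Xc = X - mean                             # (3) mean-centering
    cov = np.cov(Xc, rowvar=False)            # (4) covariance matrix (divides by n-1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # (5) eigenvalues come back in ascending order
    order = np.argsort(eigvals)[::-1]         # (6) reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()  # (7) cumulative contribution rate
    # (8) smallest k whose cumulative rate reaches the target (capped at the feature count)
    k = min(int(np.searchsorted(cum, rate)) + 1, len(cum))
    return Xc @ eigvecs[:, :k], k             # (9) project the centered data

# Illustrative 10x2 sample (assumed values, two strongly correlated features)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
T, k = pca_by_steps(X, rate=0.95)
print(k, T.shape)
```

Because the two features are strongly correlated, a single component is enough to reach the 0.95 threshold here.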

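The covariance matrix can be decomposed either through the eigenvectors of a symmetric matrix or through a singular value decomposition (SVD); Scikit-learn also implements PCA with SVD. For an introduction to SVD and how it works, see the article "Singular Value Decomposition (SVD) of a Matrix" (矩阵的奇异值分解（SVD）).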
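
As a rough check of the relationship between the two routes, the sketch below compares the eigenvalues of the covariance matrix with the squared singular values of the centered data divided by n-1. The random data and variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with correlated columns (assumed, for illustration)
X = rng.normal(size=(50, 3)) @ np.array([[2.0, 0.5, 0.0],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigenvalues of the symmetric covariance matrix, flipped to descending order
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]

# Route 2: singular values of the centered data (already descending)
s = np.linalg.svd(Xc, compute_uv=False)
var_from_svd = s ** 2 / (n - 1)

print(np.allclose(eigvals, var_from_svd))
```

Both routes give the same component variances, which is why either decomposition can be used for PCA.
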
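This article implements PCA in three ways. The first is the original algorithm, i.e. the procedure described above; for the detailed calculations, see: A tutorial on Principal Components Analysis, Lindsay I Smith. The second is the original algorithm with SVD: NumPy already implements SVD and returns the singular values sorted from largest to smallest, so the step of reordering eigenvalues and eigenvectors can be skipped. The last uses the PCA class from Scikit-learn directly, to check that the first two methods are correct.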
From a code perspective, this article is a short guide to PCA (principal component analysis), written mainly to answer the following questions:

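Does the data change after PCA dimensionality reduction? That depends on what is finally fed to the machine learning algorithm: the output of PCA, or the original data columns that PCA identifies as the principal attributes.
When a machine learning algorithm is then used for classification, should the PCA-reduced features be used directly, or should the data be mapped back to the original data? This article focuses on this question.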
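
To make the first question concrete, here is a small sketch, assuming Scikit-learn's bundled iris data, that compares the PCA output with the original matrix and maps it back with `inverse_transform`. The variable names (`pca2`, `Z`, `X_back`, `full`) are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # shape (150, 4)
pca2 = PCA(n_components=2)
Z = pca2.fit_transform(X)             # shape (150, 2): new coordinates, not original columns
X_back = pca2.inverse_transform(Z)    # shape (150, 4): approximate reconstruction

print(Z.shape, X_back.shape)
print(np.allclose(X_back, X))         # reconstruction from 2 of 4 components is lossy

full = PCA(n_components=4).fit(X)
print(np.allclose(full.inverse_transform(full.transform(X)), X))  # all 4 components: lossless
```

The PCA output lives in a new coordinate system, so it is different data from any subset of the original columns.
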
Three ways to implement PCA

Code

# -*- coding: utf-8 -*-
import numpy as np
from sklearn.decomposition import PCA
import sys
# Return how many principal components to keep
def index_lst(lst, component=0, rate=0):
    # lst: eigenvalues sorted in descending order
    # component: number of main factors to keep
    # rate: target ratio sum(main factors)/sum(all factors)
    # suggested rate range: (0.8, 1)
    # with the rate parameter, returns an index in 1..len(lst), or 0 if the rate cannot be reached
    if component and rate:
        print('Component and rate must choose only one!')
        sys.exit(1)
    if not component and not rate:
        print('Invalid parameter for numbers of components!')
        sys.exit(1)
    elif component:
        print('Choosing by component, components are %s......'%component)
        return component
    else:
        print('Choosing by rate, rate is %s ......'%rate)
        for i in range(1, len(lst) + 1):
            # sum(lst[:i])/sum(lst) is the cumulative contribution rate:
            # the sum of the i largest eigenvalues divided by the sum of all eigenvalues
            if sum(lst[:i])/sum(lst) >= rate:
                return i
        return 0
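
For comparison, the same selection rule can be written with a cumulative sum, raising exceptions instead of exiting the interpreter. This is a hypothetical variant, not part of the original script: the name `choose_k` and the small tolerance are assumptions, and the ratios are the iris values quoted later in this post.

```python
import numpy as np

def choose_k(eigenvalues, component=0, rate=0.0):
    """Hypothetical variant of index_lst: exactly one of component / rate must be set."""
    if bool(component) == bool(rate):
        raise ValueError('set exactly one of component or rate')
    if component:
        return int(component)
    cum = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    # positions whose cumulative contribution rate reaches the target
    hits = np.nonzero(cum >= rate - 1e-12)[0]
    if hits.size == 0:
        raise ValueError('rate cannot be reached')
    return int(hits[0]) + 1

ratios = [0.92461621, 0.05301557, 0.01718514, 0.00518309]  # iris ratios quoted in this post
for r in (0.90, 0.95, 0.99):
    print(r, choose_k(ratios, rate=r))
```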


# test data
from sklearn import datasets
# import the decomposition module
from sklearn import decomposition
pca = decomposition.PCA()  # create a PCA object
iris = datasets.load_iris()
mat = iris.data  # feature matrix of iris
y_pred = iris.target  # class labels of iris
# simple transform of test data
Mat = np.array(mat, dtype='float64')
#print('Before PCA transformation, data is:\n', Mat)
print('\nMethod 1: PCA by original algorithm:')
p,n = np.shape(Mat) # p samples (rows), n features (columns)
t = np.mean(Mat, 0) # mean of each column
"""
Means t of the 4 attributes:
sepal length       5.84333
sepal width        3.054
petal length       3.75867
petal width        1.19867
"""

# 1. Mean-centering: subtract each column's mean
for i in range(p):
    for j in range(n):
        Mat[i,j] = float(Mat[i,j]-t[j])
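
The centering step can also be written without explicit loops by relying on NumPy broadcasting. The sketch below uses a few illustrative rows (assumed values) rather than the full dataset.

```python
import numpy as np

# A few illustrative samples with four features each (assumed values)
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [7.0, 3.2, 4.7, 1.4],
              [6.3, 3.3, 6.0, 2.5]])
col_mean = X.mean(axis=0)   # shape (4,): one mean per column
Xc = X - col_mean           # the mean vector is broadcast across all rows
print(np.allclose(Xc.mean(axis=0), 0.0))
```

After centering, every column has mean zero; the column variances are left unchanged, which is why this step is centering rather than scaling.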


PCA by original algorithm

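# 2. Covariance matrix
cov_Mat = np.dot(Mat.T, Mat)/(p-1)  # covariance matrix of the centered data
# eigenvalues and eigenvectors of the covariance matrix; eigh returns the eigenvalues in ascending order
# U: eigenvalues, V: eigenvectors in columns. The sizes in U show how many components k are worth keeping.
# With component=2, this post treats columns 2 and 3 as the dominant attributes
# (strictly, each component is a linear combination of all four columns).
U,V = np.linalg.eigh(cov_Mat)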


# Rearrange the eigenvectors and eigenvalues into descending order
U = U[::-1]
for i in range(n):
    V[i,:] = V[i,:][::-1]  # reverse the column order of the eigenvectors to match
# choose eigenvalues by component or rate; the two must not both be 0
Index = index_lst(U, component=2)  # keep only the two most important components as classification features
if Index:
    v = V[:,:Index]  # subset of the unitary matrix: eigenvectors of the k largest eigenvalues
else:  # improper rate choice may return Index=0
    print('Invalid rate choice.\nPlease adjust the rate.')
    print('Rate distribution follows:')
    print([sum(U[:i])/sum(U) for i in range(1, len(U)+1)]) 
    # For iris, the cumulative contribution rates of the top 1, 1-2, 1-3 and 1-4 components are:
    #[0.9246162071742683, 0.9776317750248034, 0.99481691454981, 1.0]
    sys.exit(0)
# data transformation
T1 = np.dot(Mat, v)  # (150x4)·(4x2) = 150x2
# print the transformed data
print('We choose %d main factors.'%Index)
print('After PCA transformation, data becomes:\n',T1)
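
One way to sanity-check the result of this method is to confirm that the projected columns are uncorrelated and that their variances equal the selected eigenvalues. The sketch below recomputes everything from Scikit-learn's bundled iris data, so its variable names are independent of the script above.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

T = Xc @ eigvecs[:, :2]              # projection onto the top two components
cov_T = np.cov(T, rowvar=False)      # covariance of the projected data
print(np.round(cov_T, 6))
print(np.allclose(cov_T, np.diag(eigvals[:2])))
```

The off-diagonal entries vanish (up to rounding), which is the "linearly uncorrelated" property described in the introduction.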

PCA by original algorithm using SVD

print('\nMethod 2: PCA by original algorithm using SVD:')
# u: Unitary matrix,  eigenvectors in columns 
# d: list of the singular values, sorted in descending order
u,d,v = np.linalg.svd(cov_Mat)
Index = index_lst(d, rate=0.95)  # choose how many main factors
T2 = np.dot(Mat, u[:,:Index])  # transformed data
print('We choose %d main factors.'%Index)
print('After PCA transformation, data becomes:\n',T2)

PCA by Scikit-learn

pca = PCA(n_components=2) # n_components can be integer or float in (0,1)
pca.fit(mat)  # fit the model
print('\nMethod 3: PCA by Scikit-learn:')
print('After PCA transformation, data becomes:')
print(pca.fit_transform(mat))  # transformed data
pca.explained_variance_ratio_  # explained variance ratio, already sorted in descending order; the component with the largest eigenvalue explains 0.92461621
T3=pca.fit_transform(mat)
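
The manual and Scikit-learn results should agree up to the sign of each component, since the direction of an eigenvector is only defined up to a factor of -1. A minimal check, written independently of the script above (variable names are assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
Xc = X - X.mean(axis=0)

# Manual route: SVD of the centered data; rows of Vt are the principal directions
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
T_manual = Xc @ Vt[:2].T

# Library route
T_sklearn = PCA(n_components=2).fit_transform(X)

# Compare absolute values so a sign flip of a whole column does not matter
print(np.allclose(np.abs(T_manual), np.abs(T_sklearn), atol=1e-8))
```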

4. Plotting the iris data

# Show the PCA-transformed iris data
import matplotlib.pyplot as plt

def plot_pca_scatter():
    colors=['black','blue','purple','yellow','white','red','lime','cyan','orange','gray']
    for i in range(len(np.unique(y_pred))):  # one scatter per class (3 classes in iris)
        px=T3[:,0][y_pred==i]
        py=T3[:,1][y_pred==i]  # y_pred is already a numpy array, so it can be used for boolean indexing directly
        plt.scatter(px,py,c=colors[i])
    plt.legend(np.arange(3).astype(str))
    plt.xlabel('First principal component')
    plt.ylabel('Second principal component')
    plt.show()
plot_pca_scatter()

5. PCA in classification
5.1. Case 1: classifying with the PCA-reduced output as features

from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
# Case 1: classify with the PCA-reduced output as features
X_train,X_test,y_train,y_test=train_test_split(T3,iris.target)
ss=StandardScaler()
X_train=ss.fit_transform(X_train)
X_test=ss.transform(X_test)
pca_svc=LinearSVC()
pca_svc.fit(X_train,y_train)  # train the model
svc_pred=pca_svc.predict(X_test)
print('Score: ',pca_svc.score(X_test,y_test))
# classification report
print('Classification report: ',classification_report(y_test,svc_pred,target_names=np.arange(3).astype(str)))
print('PCA explained variance ratio: ',pca.explained_variance_ratio_)  # weight of each component, descending

Score:  0.8947368421052632
Classification report:               precision    recall  f1-score   support

          0       1.00      1.00      1.00        17
          1       0.86      0.67      0.75         9
          2       0.79      0.92      0.85        12

avg / total       0.90      0.89      0.89        38

PCA explained variance ratio:  [0.92461621 0.05301557]
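
The run above fits PCA on all 150 samples before splitting. One possible alternative, sketched here under assumptions that are not in the original (a fixed `random_state`, a stratified split, and a larger `max_iter`), is to put PCA, scaling and the classifier into one pipeline so that every step is fitted on the training split only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
# random_state and stratify are assumptions made for a repeatable illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Same order as in the post: PCA first, then scaling, then the linear SVM
model = make_pipeline(PCA(n_components=2), StandardScaler(), LinearSVC(max_iter=10000))
model.fit(X_train, y_train)          # PCA and scaler see only the training data
score = model.score(X_test, y_test)  # the fitted transforms are reused on the test data
print('Score:', score)
```

The score will differ from the figures shown above because the split and the fitting procedure differ.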

5.2. Case 2: classifying with the most important original columns identified by PCA as features

# Case 2: classify with the most important original columns identified by PCA as features
print('PCA explained variance ratio: ',pca.explained_variance_ratio_)  # weight of each component, descending; this alone does not reveal which original columns have the largest variance
# ratios for all four components:
#[0.92461621 0.05301557 0.01718514 0.00518309]
train11=iris.data[:,:2]  # first attempt: original columns 0 and 1
X_train,X_test,y_train,y_test=train_test_split(train11,iris.target)
ss=StandardScaler()
X_train=ss.fit_transform(X_train)
X_test=ss.transform(X_test)
pca_svc=LinearSVC()
pca_svc.fit(X_train,y_train)  # train the model
svc_pred=pca_svc.predict(X_test)
print('Score: ',pca_svc.score(X_test,y_test))
# classification report
print('Classification report: ',classification_report(y_test,svc_pred,target_names=np.arange(3).astype(str)))

# Correction: use original columns 2 and 3 (petal length and petal width) instead of columns 0 and 1
train11=iris.data[:,2:] 
#train11=iris.data
X_train,X_test,y_train,y_test=train_test_split(train11,iris.target)
ss=StandardScaler()
X_train=ss.fit_transform(X_train)
X_test=ss.transform(X_test)
pca_svc=LinearSVC()
pca_svc.fit(X_train,y_train)  # train the model
svc_pred=pca_svc.predict(X_test)
print('Score: ',pca_svc.score(X_test,y_test))
# classification report
print('Classification report: ',classification_report(y_test,svc_pred,target_names=np.arange(3).astype(str)))

Score:  0.9736842105263158
Classification report:               precision    recall  f1-score   support

          0       1.00      1.00      1.00        15
          1       0.92      1.00      0.96        12
          2       1.00      0.91      0.95        11

avg / total       0.98      0.97      0.97        38
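
Since `explained_variance_ratio_` ranks components rather than original columns, one way to see how the original columns enter each component is to look at the loadings in `components_`. The sketch below is an illustration of that idea, not the procedure used in this post; a large loading shows a column's weight in a component and is not a general feature-selection rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2).fit(iris.data)

# components_ has one row per component and one column per original feature
for i, comp in enumerate(pca.components_):
    top = int(np.argmax(np.abs(comp)))   # column with the largest absolute loading
    print('PC%d loadings:' % (i + 1), np.round(comp, 3),
          '-> largest:', iris.feature_names[top])
```

For the first component, the largest loading belongs to petal length (column 2).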

Reference links

1. 超好理解的PCA 特征选择 (An easy-to-understand introduction to PCA for feature selection)
https://blog.csdn.net/qq_36336522/article/details/79765558

2. PCA的数学原理 (The mathematical principles of PCA)
https://blog.csdn.net/shulixu/article/details/52894413
