Mathematical modeling algorithms and applications [Principal Component Analysis]

The purpose of principal component analysis: data compression; data interpretation; data dimensionality reduction

What is principal component analysis?

Principal component analysis studies how to explain the internal structure among multiple variables through a few principal components. That is, a few principal components are derived from the original variables so that they retain as much of the original variables' information as possible while being uncorrelated with each other. It is often used to build composite indicators for evaluating things or phenomena, and to give an appropriate interpretation of the information those indicators contain.

The basic idea of principal component analysis


  • Principal component analysis recombines many original variables (say p variables) that are correlated to some degree into a new set of mutually uncorrelated composite variables that replace the original ones. How is this done? The usual mathematical treatment is to take linear combinations of the original p variables as the new composite variables.
  • If the first linear combination selected, that is, the first composite variable, is denoted F1, it is natural to want F1 to reflect as much of the information in the original variables as possible.
  • The classic way to measure that information is variance: the larger Var(F1), the more information F1 contains. Therefore F1 should be the linear combination with the largest variance among all linear combinations, and it is called the first principal component (principal component I).
  • If the first principal component is not enough to represent the information in the original p variables, a second linear combination F2 is selected. F2 is called the second principal component (principal component II). What is the relationship between F1 and F2?
  • To reflect the original information efficiently, information already captured by F1 should not appear again in F2, that is, cov(F1, F2) = 0. Among all linear combinations uncorrelated with F1, F2 is the one with the largest variance. Continuing in the same way yields p principal components that are mutually uncorrelated and whose variances decrease in sequence. In practice, only the first few principal components, those with the largest variances, are kept. What is the selection criterion?
  • Keep enough principal components that their cumulative variance contribution rate exceeds 85%, or keep the components whose eigenvalue (characteristic root) is greater than 1.
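
The construction described in the bullets above can be written compactly. This is a sketch in standard notation rather than a reproduction of any particular textbook's formulas: the coefficient vectors a_i, the covariance (or correlation) matrix Σ of the original variables X = (X_1, ..., X_p), and the unit-length constraint on a_i are notation assumed here. The constraint is the usual normalization, without which the variance could be made arbitrarily large.

```latex
\begin{aligned}
F_i &= a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{ip}X_p = a_i^{\top}X,
      \qquad i = 1,\dots,p, \\
a_1 &= \arg\max_{\|a\|=1} \operatorname{Var}(a^{\top}X)
     = \arg\max_{\|a\|=1} a^{\top}\Sigma a, \\
a_k &= \arg\max_{\substack{\|a\|=1 \\ \operatorname{cov}(a^{\top}X,\,F_j)=0,\ j<k}}
       a^{\top}\Sigma a, \qquad k = 2,\dots,p, \\
\operatorname{Var}(F_i) &= \lambda_i, \qquad
      \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0, \\
\text{cumulative contribution of } F_1,\dots,F_m
    &= \frac{\sum_{i=1}^{m}\lambda_i}{\sum_{i=1}^{p}\lambda_i}.
\end{aligned}
```

Here the λ_i are the eigenvalues of Σ and the a_i are the corresponding unit eigenvectors, which is why the selection criteria are stated in terms of eigenvalues.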

Mathematical model of principal component analysis

Before running the analysis, check whether the data are suitable for it. If the KMO value exceeds 0.5, principal component analysis can be done; if it is between 0.3 and 0.5, it can be done but is not recommended; if it is below 0.3, it should not be done.
If the significance level (Sig.) of the accompanying test is less than 0.05, principal component analysis can be done; if it is greater than 0.05, it is not recommended.
It is enough for the data to satisfy one of these two criteria to proceed with principal component analysis.
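
As a rough illustration of the two checks, the sketch below computes a KMO value and a sphericity-test p-value by hand in MATLAB. Several things here are assumptions rather than statements from the text above: the data are synthetic, the Sig. value is taken to be that of Bartlett's test of sphericity (the test usually reported next to KMO in statistical packages), and the variable names are hypothetical. `chi2cdf` requires the Statistics and Machine Learning Toolbox.

```matlab
% Sketch only: synthetic data with one common factor, so variables are correlated
rng(0);
n = 100; p = 5;
X = randn(n, 1) * ones(1, p) + randn(n, p);

R = corrcoef(X);                  % correlation coefficient matrix
invR = inv(R);
d = sqrt(diag(invR));
P = -invR ./ (d * d');            % partial correlations between variable pairs

offdiag = ~eye(p);                % mask selecting the off-diagonal entries
r2 = sum(R(offdiag).^2);          % sum of squared simple correlations
p2 = sum(P(offdiag).^2);          % sum of squared partial correlations
kmo = r2 / (r2 + p2);             % KMO measure, between 0 and 1

% Assumption: "Sig." is the p-value of Bartlett's test of sphericity
chi2 = -(n - 1 - (2*p + 5)/6) * log(det(R));
dof = p * (p - 1) / 2;
pval = 1 - chi2cdf(chi2, dof);

% Either criterion being satisfied is enough, as stated above
suitable = (kmo > 0.5) || (pval < 0.05);
```

A KMO close to 1 means the partial correlations are small relative to the simple correlations, which is the situation in which a few components can summarize the variables well.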

Principal component analysis steps

  1. Standardize the original p indicators to eliminate the influence of the variables' differing magnitudes and units of measurement
  2. Calculate the correlation coefficient matrix from the standardized data matrix
  3. Find the eigenvalues and eigenvectors of the correlation coefficient matrix (for standardized data this is the same as the covariance matrix)
  4. Determine the principal components and give an appropriate interpretation of the information contained in each one.
    This part is easier to follow alongside the textbook.
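
The four steps above can be sketched with basic MATLAB functions. This is a minimal illustration under stated assumptions, not the textbook's procedure: the data are synthetic, the variable names are hypothetical, and the 85% threshold is the cumulative contribution criterion mentioned earlier. `zscore` requires the Statistics and Machine Learning Toolbox.

```matlab
% Sketch only: synthetic data, n observations of p correlated indicators
rng(0);
n = 100; p = 5;
X = randn(n, 1) * ones(1, p) + randn(n, p);

% Step 1: standardize each indicator (zero mean, unit standard deviation)
Xs = zscore(X);

% Step 2: correlation coefficient matrix of the standardized data
R = corrcoef(Xs);
R = (R + R') / 2;                 % enforce exact symmetry before eig

% Step 3: eigenvalues and eigenvectors, sorted by decreasing eigenvalue
[V, D] = eig(R);
[lambda, order] = sort(diag(D), 'descend');
V = V(:, order);

% Step 4: contribution rates and the number of components to keep
contrib = 100 * lambda / sum(lambda);   % contribution rate of each component (%)
cumContrib = cumsum(contrib);           % cumulative contribution rate (%)
k = find(cumContrib > 85, 1);           % smallest k with cumulative rate > 85%
scores = Xs * V(:, 1:k);                % principal component scores
```

The columns of `V(:, 1:k)` are the coefficients of the retained principal components; interpreting each component means looking at which original indicators carry large coefficients in its column.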

Example

Use principal component analysis to analyze and rank investment benefits.
Code

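clc, clear
gj = load('data14_7.txt');   % load the data
gj = zscore(gj);             % standardize the data
r = corrcoef(gj);            % compute the correlation coefficient matrix
% Run PCA on the correlation matrix; the columns of x are the eigenvectors of r, i.e. the principal component coefficients
[x, y, z] = pcacov(r)        % y: eigenvalues of r; z: contribution rate (%) of each principal component
f = repmat(sign(sum(x)), size(x,1), 1);  % matrix of +1/-1 entries with the same size as x
x = x .* f                   % fix eigenvector signs: multiply each eigenvector by the sign of the sum of its components
num = 3;                     % number of principal components retained
df = gj * x(:, 1:num);       % scores on each retained principal component
tf = df * z(1:num) / 100;    % composite score
[stf, ind] = sort(tf, 'descend');  % sort the composite scores from high to low
stf = stf', ind = ind'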
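
The composite score computed in the line `tf=df*z(1:num)/100` can be read as a weighted sum of the principal component scores, with each component weighted by its contribution rate. The notation below is introduced here for illustration only: F_ij is the score of observation i on component j (the entries of `df`), z_j is the contribution rate in percent (as returned in `z`), and m is the number of retained components (`num`).

```latex
Z_i = \sum_{j=1}^{m} \frac{z_j}{100}\, F_{ij},
\qquad
z_j = 100 \cdot \frac{\lambda_j}{\sum_{k=1}^{p} \lambda_k}
```

Observations are then ranked by Z_i in descending order, which is what the final `sort` call does. The sign adjustment applied to `x` beforehand matters for this ranking, because flipping the sign of an eigenvector flips the sign of the corresponding scores.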



Origin blog.csdn.net/Luohuasheng_/article/details/128687758