1. PCA principle analysis
PCA reduces the dimensionality of the original data by projecting it onto a small number of directions that retain most of its variance. For the mathematical details, see: CodingLabs - Mathematical Principles of PCA
2. Data preprocessing
The training data set (positive samples only, that is, data from normal operation) is an $n \times m$ matrix $X$: there are n samples, and each sample has m features.
2.1 Data normalization
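Normalize each feature (column) of X to a mean of 0 and a standard deviation of 1:
$$\tilde{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$
where:
$$\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad \sigma_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij}-\mu_j\right)^2}$$
The n-1 normalization matches MATLAB's std, which is used in section 5. From here on, X denotes the normalized data.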
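As a minimal sketch of this normalization, the snippet below standardizes a small made-up matrix. The values and the names `mu`, `sigma` and `Xn` are illustrative assumptions, not part of the original data.

```matlab
% Toy training matrix: n = 4 samples (rows), m = 2 features (columns)
X = [1 10; 2 20; 3 30; 4 40];
n = size(X,1);

mu = mean(X);       % 1 x m row vector of column means
sigma = std(X);     % 1 x m row vector of column standard deviations (n-1 normalization)

% Subtract the mean and divide by the standard deviation, column by column
Xn = (X - repmat(mu,n,1)) ./ repmat(sigma,n,1);

disp(mean(Xn));     % each column mean is 0 up to rounding
disp(std(Xn));      % each column standard deviation is 1 up to rounding
```

The same `mu` and `sigma` would later be reused to normalize test samples, so they need to be kept alongside the model.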
3. PCA dimensionality reduction
3.1 First find the covariance matrix
For the normalized data X (n×m, with zero-mean columns), the covariance matrix is:
$$R = \frac{1}{n-1} X^{T} X$$
R is a symmetric m×m matrix.
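The following sketch checks, on made-up data, that the covariance matrix of standardized data can be computed as the matrix product of the data with its transpose divided by n-1, and that this agrees with MATLAB's `cov`. The matrix `X` and the names `Xn` and `R` are assumptions chosen for illustration.

```matlab
% Illustrative data: n = 5 samples, m = 3 features (values are made up)
X = [2 1 0; 4 3 1; 6 2 5; 8 6 2; 10 3 7];
n = size(X,1);

% Standardize each column (zero mean, unit standard deviation)
Xn = (X - repmat(mean(X),n,1)) ./ repmat(std(X),n,1);

% Covariance matrix of the standardized data, m x m
R = (Xn' * Xn) / (n - 1);

disp(size(R));                        % m x m
disp(max(max(abs(R - cov(Xn)))));     % difference to cov() is at rounding level
disp(diag(R)');                       % unit variances on the diagonal
```

Because every column has unit variance, the diagonal of `R` is all ones, so `R` is also the correlation matrix of the original data.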
3.2 Find eigenvalues and eigenvectors
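Find the eigenvalues and eigenvectors of the covariance matrix R, and sort the eigenvalues from largest to smallest:
$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$$
Reordering the eigenvectors to match their eigenvalues gives:
$$P = [p_1, p_2, \dots, p_m]$$
where $p_i$ is the eigenvector that belongs to $\lambda_i$.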
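A small sketch of this step, assuming a hand-picked symmetric matrix `R` in place of a real covariance matrix: `eig` returns the eigenpairs, and `sort` with `'descend'` puts them in the order described above. The names `lambda`, `idx` and `P` are illustrative.

```matlab
% Illustrative symmetric matrix standing in for a covariance matrix
R = [1.0 0.8 0.2;
     0.8 1.0 0.1;
     0.2 0.1 1.0];

[V,L] = eig(R);                          % columns of V are eigenvectors, diag(L) the eigenvalues
[lambda,idx] = sort(diag(L),'descend');  % eigenvalues from largest to smallest
P = V(:,idx);                            % reorder the eigenvectors in the same way

disp(lambda');                           % non-increasing sequence
disp(max(max(abs(R*P - P*diag(lambda))))); % R*p_i = lambda_i*p_i holds up to rounding
```

Sorting explicitly avoids relying on the order in which `eig` happens to return the eigenvalues.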
3.3 Select k principal components for dimensionality reduction
Choose the smallest k for which the first k eigenvalues account for at least 85% of the sum of all eigenvalues:
$$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{m}\lambda_i} \ge 0.85$$
The k largest eigenvalues form a diagonal matrix, and the k corresponding eigenvectors form a matrix. That is:
$$\Lambda_k = \mathrm{diag}(\lambda_1, \dots, \lambda_k), \qquad P_k = [p_1, \dots, p_k]$$
After PCA dimensionality reduction there are still n samples, but the number of features becomes k. The dimensionality reduction formula is:
$$T = X P_k$$
where T is the n×k score matrix.
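The matrix reconstructed from the k retained components is $\hat{X} = X P_k P_k^{T}$, where $P_k$ is the m×k matrix of retained eigenvectors. The residual $X - \hat{X}$ is the part of the data that the SPE statistic in section 4.2 measures.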
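The selection and projection steps of this section can be sketched as follows. The data matrix is made up, and the names `share`, `k`, `Pk` and `T` are assumptions for illustration; the 85% threshold is the one used in the text.

```matlab
% Illustrative data: n = 6 samples, m = 3 features (values are made up)
X = [2 1 0; 4 3 1; 6 2 5; 8 6 2; 10 3 7; 12 8 4];
n = size(X,1);
Xn = (X - repmat(mean(X),n,1)) ./ repmat(std(X),n,1);   % standardize each column

% Eigen-decomposition of the covariance matrix, sorted largest first
R = cov(Xn);
R = (R + R')/2;                          % enforce exact symmetry before calling eig
[V,L] = eig(R);
[lambda,idx] = sort(diag(L),'descend');
P = V(:,idx);

% Smallest k whose cumulative eigenvalue share reaches 85%
share = cumsum(lambda)/sum(lambda);
k = find(share >= 0.85, 1);

Pk = P(:,1:k);                           % m x k matrix of retained eigenvectors
T = Xn*Pk;                               % n x k score matrix

disp(size(T));                           % still n rows, now k columns
disp(max(max(abs(cov(T) - diag(lambda(1:k)))))); % score covariance is diag(lambda(1:k)) up to rounding
```

The last line illustrates why the retained eigenvalues are used to scale the scores later on: each score column has variance equal to its eigenvalue, and different score columns are uncorrelated.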
4. Find the control limits of the statistics
4.1 T2 statistic
4.1.1 Control limit of the T2 statistic:
$$T^2_{\alpha} = \frac{k\,(n^2-1)}{n\,(n-k)}\,F_{\alpha}(k,\, n-k)$$
where $1-\alpha$ is the confidence level and $F_{\alpha}(k, n-k)$ is the upper-$\alpha$ critical value of the F distribution with first degree of freedom k and second degree of freedom n-k. $\alpha$ usually takes 0.01, which corresponds to 99% confidence.
Note that n is the number of samples in the training data set, and k is the number of principal components retained after PCA.
4.1.2 Calculate the T2 statistic of the test data
Calculate the T2 value of each sample of the test data. Let x be one test sample, a 1×m row vector. It is first normalized with the mean and standard deviation of the training data (not with the mean and standard deviation of the test data). Its T2 value is:
$$T^2 = x\,P_k\,\Lambda_k^{-1}\,P_k^{T}\,x^{T}$$
where $\Lambda_k$ is the diagonal matrix of the k largest eigenvalues and $P_k$ is the matrix of the corresponding eigenvectors from section 3.3.
The calculation can also be simplified as:
$$T^2 = \left\| \Lambda_k^{-1/2}\,P_k^{T}\,x^{T} \right\|_2^2$$
where $\Lambda_k^{-1/2}$ means that each diagonal element of $\Lambda_k$ is raised to the power -1/2, and $\|\cdot\|_2^2$ is the square of the 2-norm.
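To make the equivalence of the two forms concrete, here is a minimal sketch with hand-picked values. The loading matrix `Pk`, the eigenvalues `lambdak` and the sample `x` are hypothetical and only chosen so that the arithmetic is easy to follow.

```matlab
% Hypothetical model with m = 3 features and k = 2 retained components
Pk = [1 1; 1 -1; 0 0] / sqrt(2);    % m x k matrix with orthonormal columns
lambdak = [2; 0.5];                 % the k retained eigenvalues
x = [1 -2 0.5];                     % one normalized test sample, 1 x m

% Form 1: quadratic form with the inverse eigenvalue matrix
T2a = x*Pk*(diag(lambdak)\(Pk'*x'));

% Form 2: squared 2-norm of the scores scaled by lambda^(-1/2)
T2b = norm(diag(lambdak.^(-1/2))*Pk'*x')^2;

disp([T2a T2b]);                    % both forms give 9.25
```

The scores are (1-2)/sqrt(2) and (1+2)/sqrt(2), so the statistic is 0.5/2 + 4.5/0.5 = 9.25 in both forms.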
4.1.3 Fault determination
If the system is running normally, the T2 value of the sample should satisfy $T^2 \le T^2_{\alpha}$; otherwise a fault is considered to have occurred.
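4.2 SPE statistic (also called the Q statistic)
4.2.1 Control limit of the SPE statistic
$$Q_{\alpha} = \theta_1\left[\frac{c_{\alpha}\sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0-1)}{\theta_1^2}\right]^{1/h_0}$$
where:
$$\theta_i = \sum_{j=k+1}^{m}\lambda_j^{\,i}\quad (i = 1, 2, 3), \qquad h_0 = 1 - \frac{2\theta_1\theta_3}{3\theta_2^2}$$
and $c_{\alpha}$ is the quantile of the standard normal distribution at confidence level $1-\alpha$.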
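4.2.2 Calculate the SPE value of the test data
Use the same test sample x as in the T2 calculation, normalized in the same way. Its SPE value is the squared norm of the residual that remains after projecting x onto the k retained components:
$$Q = x\,(I - P_k P_k^{T})\,x^{T} = \left\| x\,(I - P_k P_k^{T}) \right\|_2^2$$
where I is the m×m identity matrix and $P_k$ is the matrix of the k retained eigenvectors.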
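The SPE calculation can be sketched on hand-picked values. `Pk` and `x` below are hypothetical; the sketch also checks the identity that the SPE equals the squared norm of the sample minus the squared norm of its scores, which holds when the columns of `Pk` are orthonormal.

```matlab
% Hypothetical model with m = 3 features and k = 2 retained components
Pk = [1 1; 1 -1; 0 0] / sqrt(2);    % m x k matrix with orthonormal columns
x = [1 -2 0.5];                     % one normalized test sample, 1 x m

% Residual: the part of x that the retained components cannot represent
e = x*(eye(3) - Pk*Pk');
Q = e*e';                           % SPE as the squared norm of the residual

% Equivalent form: total squared norm minus squared norm of the scores
t = x*Pk;
Qb = x*x' - t*t';

disp(e);                            % [0 0 0.5] up to rounding
disp([Q Qb]);                       % both are 0.25
```

Here the retained components span the first two coordinates, so only the third coordinate of `x` ends up in the residual.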
4.2.3 Determine whether a fault occurs
If the system is running normally, the SPE value of the sample should satisfy $Q \le Q_{\alpha}$; otherwise a fault can be declared.
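A minimal sketch of the two fault checks applied to a batch of samples follows. All numbers, including the limits `T2lim` and `Qlim`, are hypothetical. Flagging a sample when either limit is exceeded is an assumption made here for illustration; the text above treats the two checks separately.

```matlab
% Hypothetical statistics for five test samples
T2 = [1.2; 3.4; 30.5; 4.0; 2.2];
Q  = [0.3; 0.2; 0.4; 9.1; 0.1];

% Hypothetical control limits
T2lim = 12;
Qlim = 5;

faultT2 = T2 > T2lim;               % samples that violate the T2 limit
faultQ = Q > Qlim;                  % samples that violate the SPE limit
fault = faultT2 | faultQ;           % assumption: flag a sample if either limit is exceeded

firstFault = find(fault, 1);        % index of the first flagged sample
disp(fault');                       % 0 0 1 1 0
disp(firstFault);                   % 3
```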
5. MATLAB implementation
clc;clear;
%% 1. Import data
% Generate training data
num_sample=100;
a=10*randn(num_sample,1);
x1=a+randn(num_sample,1);
x2=1*sin(a)+randn(num_sample,1);
x3=5*cos(5*a)+randn(num_sample,1);
x4=0.8*x2+0.1*x3+randn(num_sample,1);
xx_train=[x1,x2,x3,x4];
% Generate test data
a=10*randn(num_sample,1);
x1=a+randn(num_sample,1);
x2=1*sin(a)+randn(num_sample,1);
x3=5*cos(5*a)+randn(num_sample,1);
x4=0.8*x2+0.1*x3+randn(num_sample,1);
xx_test=[x1,x2,x3,x4];
% Inject a fault: add a constant offset of 15 to x2 for samples 51 to 100
xx_test(51:100,2)=xx_test(51:100,2)+15*ones(50,1);
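%% 2. Data preprocessing
Xtrain=xx_train;
Xtest=xx_test;
X_mean=mean(Xtrain);            % per-feature mean of the training data
X_std=std(Xtrain);              % per-feature standard deviation of the training data
[X_row, X_col]=size(Xtrain);    % X_row = n samples, X_col = m features
Xtrain=(Xtrain-repmat(X_mean,X_row,1))./repmat(X_std,X_row,1); % standardize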
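%% 3. PCA dimensionality reduction
SXtrain = cov(Xtrain);               % covariance matrix (m x m)
[T,lm] = eig(SXtrain);               % eigenvectors (columns of T) and eigenvalues (diagonal of lm)
[~,idx] = sort(diag(lm),'ascend');   % eig does not guarantee an ordering, so sort explicitly
T = T(:,idx);                        % eigenvectors in ascending eigenvalue order
lm = lm(idx,idx);                    % eigenvalue matrix in ascending order
D = flipud(diag(lm));                % eigenvalues from largest to smallest
% Choose the number of components: the smallest num whose
% cumulative eigenvalue share reaches 85%
num = 1;
while sum(D(1:num))/sum(D) < 0.85
    num = num+1;
end
P = T(:,X_col-num+1:X_col);          % eigenvectors of the num largest eigenvalues (ascending order)
P_ = fliplr(P);                      % same eigenvectors, ordered from largest to smallest eigenvalue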
%% 4. Control limits for T2 and Q
% T2 control limit at 99% confidence:
%   T2UCL = k*(n^2-1)/(n*(n-k)) * F(k,n-k)
% where k corresponds to num and n to X_row
T2UCL1 = num*(X_row-1)*(X_row+1)*finv(0.99,num,X_row-num)/(X_row*(X_row-num));
% Q (SPE) control limit at 99% confidence
th = zeros(1,3);
for i = 1:3
    th(i) = sum((D(num+1:X_col)).^i);   % sum of the i-th powers of the discarded eigenvalues
end
h0 = 1 - 2*th(1)*th(3)/(3*th(2)^2);
ca = norminv(0.99,0,1);                 % 99% quantile of the standard normal distribution
QU = th(1)*(h0*ca*sqrt(2*th(2))/th(1) + 1 + th(2)*h0*(h0 - 1)/th(1)^2)^(1/h0);
%% 5. Model testing
n = size(Xtest,1);
Xtest = (Xtest-repmat(X_mean,n,1))./repmat(X_std,n,1); % standardize with the training mean and std
% Compute the T2 and Q statistics
I = eye(X_col);
T2 = zeros(n,1);
Q = zeros(n,1);
lm_ = fliplr(flipud(lm));    % eigenvalue matrix reordered from largest to smallest
% T2 = x*P_*inv(Lambda_k)*P_'*x',  Q = x*(I - P*P')*x'
for i = 1:n
    T2(i) = Xtest(i,:)*P_*(lm_(1:num,1:num)\(P_'*Xtest(i,:)'));
    Q(i) = Xtest(i,:)*(I - P*P')*Xtest(i,:)';
end
%% 6. Plot T2 and SPE
figure('Name','PCA');
subplot(2,1,1);
plot(1:n,T2,'k');
hold on;
line([0,n],[T2UCL1,T2UCL1],'LineStyle','--','Color','r'); % T2 control limit
title('T2 statistic over samples');
xlabel('Sample number');
ylabel('T2');
subplot(2,1,2);
plot(1:n,Q,'k');
hold on;
line([0,n],[QU,QU],'LineStyle','--','Color','r');         % SPE control limit
title('SPE statistic over samples');
xlabel('Sample number');
ylabel('SPE');
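%% 7. Contribution plots
% 7.1 Find the scores responsible for the out-of-control state
S = Xtest(51,:)*P_(:,1:num);   % scores of sample 51 (the first faulty sample), largest eigenvalue first
r = [];
for i = 1:num
    if S(i)^2/D(i) > T2UCL1/num
        r = cat(2,r,i);
    end
end
% 7.2 Contribution of each variable to each out-of-control score
cont = zeros(length(r),X_col);
for i = 1:length(r)
    for j = 1:X_col
        cont(i,j) = abs(S(r(i))/D(r(i))*P_(j,r(i))*Xtest(51,j));
    end
end
% 7.3 Total contribution of each variable
CONTJ = zeros(X_col,1);
for j = 1:X_col
    CONTJ(j) = sum(cont(:,j));
end
% 7.4 Contribution of each variable to Q
e = Xtest(51,:)*(I - P*P');    % residual of sample 51, used to locate the faulty variable
contq = e.^2;
% 7.5 Draw the contribution plots
figure
subplot(2,1,1);
bar(contq,'g');
xlabel('Variable index');
ylabel('SPE contribution');
subplot(2,1,2);
bar(CONTJ,'r');
xlabel('Variable index');
ylabel('T^2 contribution');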
The training data consists of four synthetic variables x1, x2, x3 and x4 generated in the script, where x4 depends on x2 and x3. The test data is generated in the same way as the training data, except that a fault (a constant offset of 15) is added to x2 from the 51st sample onward.
The result obtained is as follows:
From the above figure, it can be clearly seen that from the 51st sample onward both the T2 and SPE values of the test data exceed their limits, which shows that a fault has occurred.
The contribution plot analysis shows that variable 2 is where the fault occurs, which is consistent with the actual situation. There are more fine-grained methods for locating faults, which are not covered here.