Data Analysis - Cluster Analysis Factor Model &

  • Cluster analysis

Cluster analysis means the analysis set will be physical or abstract objects grouped by a plurality of similar objects classes: Baidu Encyclopedia. There are a lot of similarities with a cluster of objects, and objects between different clusters have a great dissimilarity.

Method - (also directly with the SPSS)

  1. The hierarchical clustering method (for the case of small amount of data)

  2. K- means method: first initial sample roughly divided into K classes, individually assigned to the sample mean of its nearest class mean value (typically a standardized Euclidean distance data), class recalculated, until no new elements out Happening.

matlab code -

The Y = pdist (X-); 
SF = squareform (the Y); 
the Z = Linkage (the Y, 'Average'); 
dendrogram (the Z); 
T = Cluster (the Z, 'maxclust', n-)% n-is the maximum number of classes 

% Code reference: https: //blog.csdn.net/henu111/article/details/81512314
  • Principal Component Analysis model & Factor

Factor Model proposed mainly to solve the problem of too large dimension of data, assuming the original X variables of P, by now to be measured with less than X P F of m variables, wherein A is a transform coefficient matrix, the elements of which can be called factor loading, β analogy normalized parameters, the absolute value the better.

Wherein the load factor aij is statistically significant variable i-th and the j-th common factor dependent correlation coefficients Fj i.e. Xi represents the weight (gravity))

Construction of a total factor model (load factor three methods) three ways -

  1. Principal component analysis

. A raw data X is normalized as Z, while calculating simple correlation coefficient matrix R / covariance matrix from the normalized data [Sigma; The correlation coefficient matrix R / covariance matrix [Sigma Solutions eigenvalues ​​and principal component coefficients, and the eigenvalues ​​are arranged in descending.

[Coeff, latent, explained] = pcacov (X);% coeff principal component coefficients; latent eigenvalues; Explained variance is the percentage of total variance of each principal component

  Here matlab principal component coefficients outputted from the line represents the original variables X, column represents the main component Z, Z each column in the data table is a combination out * X.

. B The characteristic values ​​and principal component coefficients calculated factor loading matrix B:

C. According to a feature value is greater than 1 or greater than the integrated value of the ratio of a particular characteristic value to determine how to choose common factors Factor Model enters in the column.

d. If you have rotation factor based on the factor loading matrix C. That is where the factor rotation orthogonal transformation, not only to identify common factors and variables grouped more important to find out the meaning of each common factor. And the factor has a total rotation of three rotation method: maximum variance / quartic maximum rotation / equivalent maximum method; The following code uses the first method:

[lambda2,t] = rotatefactors(lambda(:,1:num),'method','varimax');

e calcd factor scores: Score F = X * R '* C' (is equivalent to X represents F, before the Z is represented by F); Finally, according to F1, F2, F3, ...... other common factors each * score weighting factor to calculate the contribution rate of heavy final score F

f. regression analysis based on the final score and the variables F real economic significance of the regression equation Factor Analysis

  2. The principal factor method

  3. The maximum likelihood estimate

[lambda, psi, T, stats , F] = factoran (X, m, param, val);% X matrix for the analysis of the data, m is the number of common factors, param represent attributes and val orthogonality factor and rotation matrix its value (the default is the maximum variance orthogonal transform) 
% is the lambda factor loading matrix, psi factor for the individual variance vector, T is the rotation matrix factor, stats test statistic is a series (in common hypothesis H0 is the factor m), F is the common factor score

  

Guess you like

Origin www.cnblogs.com/caiweijun/p/11567858.html