Machine learning dimensionality reduction: principal component analysis (PCA)

1. The basic idea of principal component analysis

The basic idea of principal component analysis: first normalize the given data so that each variable has mean 0 and variance 1. Then apply an orthogonal transformation that converts the original data, represented by linearly correlated variables, into data represented by several linearly independent new variables. The orthogonal transformation is chosen so that each new variable has the largest possible variance, and the variance measures how much information the new variable carries. The new variables are called the first principal component, the second principal component, and so on.

In principal component analysis, the principal components approximately represent the original data, which can be understood as finding the 'basic structure' of the data; the data can also be represented by a small number of principal components, which can be understood as dimensionality reduction.

2. Definition of the population principal components

Assume \(X = {(x_1, x_2, ..., x_m)}^T\) is an m-dimensional random variable with mean vector \(\mu\): \[\mu = E(X) = {(\mu_1, \mu_2, ..., \mu_m)}^T\]

The covariance matrix of X is \(\Sigma\): \[\Sigma = E[(X - \mu){(X - \mu)}^T], \qquad \Sigma_{ij} = \mathrm{cov}(x_i, x_j)\]

Consider a linear transformation from the m-dimensional random variable \(X\) to an m-dimensional random variable \(y = {(y_1, y_2, ..., y_m)}^T\):
\[y_i = a_i^T X = a_{1i} x_1 + a_{2i} x_2 + ... + a_{mi} x_m\]

where \(a_i^T = (a_{1i}, a_{2i}, ..., a_{mi})\)

From the properties of random variables we know:
\[E(y_i) = a_i^T \mu\]
\[\mathrm{var}(y_i) = a_i^T \Sigma a_i\]
\[\mathrm{cov}(y_i, y_j) = a_i^T \Sigma a_j\]
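
For example, the variance formula can be checked directly from the definition \(y_i = a_i^T X\) (a one-line verification added here for clarity): \[\mathrm{var}(y_i) = E[(a_i^T X - a_i^T \mu)^2] = E[a_i^T (X - \mu)(X - \mu)^T a_i] = a_i^T \Sigma a_i\]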

The population principal components are defined as follows.

Definition (population principal components): Given the linear transformation \(y_i = a_i^T X = a_{1i} x_1 + a_{2i} x_2 + ... + a_{mi} x_m\), the variables \(y_i\) are called principal components if the following conditions are satisfied (a small numerical sketch follows this definition):

  • (1) the coefficient vector \(a_i\) is a unit vector, i.e. \(a_i^T a_i = 1\);
  • (2) the variables \(y_i\) and \(y_j\) (\(i \neq j\)) are uncorrelated, i.e. their covariance is 0;
  • (3) \(y_1\) has the largest variance among all linear transformations of \(X\); \(y_2\) has the largest variance among all linear transformations of \(X\) uncorrelated with \(y_1\); in general, \(y_i\) has the largest variance among all linear transformations of \(X\) uncorrelated with \(y_1, y_2, ..., y_{i-1}\). Then \(y_1, y_2, ..., y_m\) are called the first principal component, the second principal component, ..., the m-th principal component of \(X\).
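
As an illustration of this definition, here is a minimal sketch (my own example with a hand-picked covariance matrix, not from the original post) that uses the standard result stated in section 4 below: the coefficient vectors \(a_i\) are the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue. It checks conditions (1) and (2) numerically:

import numpy as np

# Hand-picked covariance matrix of a 2-dimensional random variable X (illustrative values)
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

# Eigendecomposition of the covariance matrix; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]               # reorder to descending eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

a1, a2 = eigvecs[:, 0], eigvecs[:, 1]           # coefficient vectors a_1, a_2
print("var(y_1) =", eigvals[0])                 # largest variance a^T Sigma a over unit vectors a
print("a_1^T a_1 =", a1 @ a1)                   # condition (1): unit vector, equals 1
print("cov(y_1, y_2) =", a1 @ Sigma @ a2)       # condition (2): approximately 0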

3. The sample mean and sample covariance

Suppose we make n independent observations of the m-dimensional random variable \(X = {(x_1, x_2, ..., x_m)}^T\), and let \(x_1, x_2, ..., x_n\) denote the observed samples, where \(x_j = {(x_{1j}, x_{2j}, ..., x_{mj})}^T\) denotes the j-th observation and \(x_{ij}\) denotes the i-th variable of the j-th observation.

Given the sample matrix X, the sample mean and the sample covariance can be estimated. The sample mean vector is \[\bar x = \frac{1}{n}\sum_{j=1}^n x_j\]

The sample covariance matrix is \[S = [s_{ij}]_{m \times m}, \qquad s_{ij} = \frac{1}{n-1}\sum_{k=1}^n (x_{ik} - \bar x_i)(x_{jk} - \bar x_j)\]

3.1 Derivation of the sample covariance matrix

Writing the sample matrix X with one observation per row (so X is n × m), the sample covariance formula \[S = \frac{1}{n-1}\sum_{j=1}^n (x_j - \bar x)(x_j - \bar x)^T\] can be expanded as \[S = \frac{1}{n-1}\left(X - \frac{1}{n} 1_n 1_n^T X\right)^T \left(X - \frac{1}{n} 1_n 1_n^T X\right)\]
\[S = \frac{1}{n-1} X^T \left(I_n - \frac{1}{n} 1_n 1_n^T\right)\left(I_n - \frac{1}{n} 1_n 1_n^T\right) X\] Let \(H = I_n - \frac{1}{n} 1_n 1_n^T\) (where \(1_n\) is the all-ones vector), which gives \[S = \frac{1}{n-1} X^T H X\] where H is an idempotent matrix, HH = H, and is the centering matrix, i.e. \(H 1_n = 0\).
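
A quick numerical check of this identity (a sketch assuming X stores one observation per row, with randomly generated data of my own choosing):

import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3                                    # n observations of an m-dimensional variable
X = rng.normal(size=(n, m))                     # sample matrix, one observation per row

H = np.eye(n) - np.ones((n, n)) / n             # centering matrix H = I_n - (1/n) 1_n 1_n^T
S = X.T @ H @ X / (n - 1)                       # S = X^T H X / (n - 1)

print(np.allclose(H @ H, H))                    # True: H is idempotent
print(np.allclose(H @ np.ones(n), 0))           # True: H annihilates the all-ones vector
print(np.allclose(S, np.cov(X, rowvar=False)))  # True: matches the usual sample covariance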

4. PCA solving process

  • (1) Normalize the data to zero mean and unit variance
  • (2) Compute the covariance matrix
  • (3) Compute the eigenvalues and eigenvectors of the covariance matrix
  • (4) Sort the eigenvalues in descending order
  • (5) Retain the top N eigenvectors
  • (6) Transform the data into the new space constructed from the N eigenvectors

4.1 Python implementation of PCA

import numpy as np

def pca(dataMat, topNfeat=9999999):
    meanVals = np.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals                      # remove the mean
    covMat = np.cov(meanRemoved, rowvar=False)            # covariance matrix
    eigVals, eigVects = np.linalg.eig(covMat)             # eigenvalues and eigenvectors
    eigValInd = np.argsort(eigVals)                       # sort, smallest to largest
    eigValInd = eigValInd[:-(topNfeat + 1):-1]            # indices of the topNfeat largest eigenvalues
    redEigVects = eigVects[:, eigValInd]                  # eigenvectors, largest to smallest
    lowDDataMat = meanRemoved @ redEigVects               # project data into the new space
    reconMat = lowDDataMat @ redEigVects.T + meanVals     # reconstruct data for comparison
    return lowDDataMat, reconMat
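
A minimal usage sketch (with made-up random data, just to show the call):

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(100, 5))                # 100 samples, 5 features (illustrative)

lowD, recon = pca(data, topNfeat=2)             # keep the top 2 principal components
print(lowD.shape)                               # (100, 2): data in the reduced space
print(np.mean((data - recon) ** 2))             # reconstruction error using 2 components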

5. Theoretical derivation of PCA from the least squared error

PCA actually solves for the best projection directions, i.e. several orthonormal basis directions that together form a hyperplane.

Theoretical idea: in the high-dimensional space, we want to find a d-dimensional hyperplane such that the sum of squared distances from the data points to this hyperplane is minimized.

Suppose \(x_k\) denotes a point in the p-dimensional space, \(z_k\) denotes the projection of \(x_k\) onto the d-dimensional hyperplane D, and \(W = (w_1, w_2, ..., w_d)\) is an orthonormal basis of that d-dimensional space. The least-squared-error view of PCA then turns into the following optimization problem: \[z_k = \sum_{i=1}^d (w_i^T x_k) w_i \qquad (1)\]
\[\arg\min_W \sum_{k=1}^n \|x_k - z_k\|_2^2\]
\[\text{s.t.}\quad w_i^T w_j = \begin{cases}1, & i = j\\ 0, & i \neq j\end{cases}\]

Note: \(w_i^T x_k\) is the length of the projection of \(x_k\) onto the basis vector \(w_i\), and \((w_i^T x_k) w_i\) is the projection vector of \(x_k\) along the basis vector \(w_i\).

Solving:

\(L = (x_k - z_k)^T(x_k-z_k)\)

\(L = x_k^T x_k - x_k^T z_k - z_k^T x_k + z_k^T z_k\)

Because the inner product is symmetric, \(x_k^T z_k = z_k^T x_k\), so

\(L = x_k^T x_k - 2 x_k^T z_k + z_k^T z_k\)

Substituting (1) gives \[x_k^T z_k = \sum_{i=1}^d w_i^T x_k x_k^T w_i\]

\[z_k^Tz_k = \sum_{i=1}^d\sum_{j=1}^d(w_i^Tx_kw_i)^T(w_j^Tx_kw_j)\]

Applying the constraint \(w_i^T w_j = \delta_{ij}\) gives \[z_k^T z_k = \sum_{i=1}^d w_i^T x_k x_k^T w_i\]

\[L =x_k^Tx_k - \sum_{i=1}^dw_i^Tx_kx_k^Tw_i\]

Using the trace identity, with \(W = (w_1, ..., w_d)\): \[\sum_{i=1}^d w_i^T x_k x_k^T w_i = \mathrm{tr}(W^T x_k x_k^T W)\]

Summing over all sample points and minimizing over W: \[\arg\min_W \sum_{k=1}^n \left(x_k^T x_k - \mathrm{tr}(W^T x_k x_k^T W)\right) = \arg\min_W \left(-\sum_{k=1}^n \mathrm{tr}(W^T x_k x_k^T W) + C\right)\] where \(C = \sum_{k=1}^n x_k^T x_k\) does not depend on W.

This is equivalent to the constrained optimization problem \[\arg\max_W \mathrm{tr}(W^T X X^T W)\]
\[\text{s.t.}\quad W^T W = I\] where \(X = (x_1, ..., x_n)\) is the matrix whose columns are the (centered) sample points.

Solving this shows that the optimal projection hyperplane W is spanned by the directions of maximum variance, i.e. the eigenvectors corresponding to the largest eigenvalues of \(X X^T\); this matrix differs from the covariance matrix \(\Sigma\) only by a constant multiple.

5.1 Theorem

\[\arg\min_{W, Z} \phi(W, Z \mid X) = \mathrm{tr}\left((X - Z W^T)^T (X - Z W^T)\right) = \|X - Z W^T\|_F^2\]
\[\text{s.t.}\quad W^T W = I_q\]

Note: X is (n, p), Z is (n, q), W is (p, q), with q < p.

The meaning of the theorem is: project the reduced-dimension representation Z back into the original space with \(W^T\), then compute the squared difference with X; the smaller this value, the less information is lost by the dimensionality reduction.

\(\phi\) is the objective function, W is the matrix of the first q eigenvectors of \(X^T X\), and \(Z = X W\).

The above optimization problem can be solved via the Lagrangian dual problem, which finally gives \[\arg\max_W \mathrm{tr}(W^T X^T X W)\]
\[\text{s.t.}\quad W^T W = I\]
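
A small numerical sketch of this theorem (my own check, not from the post): take W as the top q eigenvectors of \(X^T X\) and compare both the reconstruction error \(\|X - XWW^T\|_F^2\) and the trace objective against a random orthonormal W.

import numpy as np

rng = np.random.default_rng(1)
n, p, q = 200, 6, 2
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                          # center the data

# W_opt: first q eigenvectors of X^T X (columns), per the theorem
eigvals, eigvecs = np.linalg.eigh(X.T @ X)      # ascending eigenvalues
W_opt = eigvecs[:, ::-1][:, :q]

# W_rand: a random orthonormal p x q matrix for comparison
W_rand, _ = np.linalg.qr(rng.normal(size=(p, q)))

def recon_error(W):
    Z = X @ W                                   # Z = XW, the reduced representation
    return np.linalg.norm(X - Z @ W.T) ** 2     # squared Frobenius reconstruction error

print(recon_error(W_opt) <= recon_error(W_rand))        # True: smallest reconstruction error
print(np.trace(W_opt.T @ X.T @ X @ W_opt) >=
      np.trace(W_rand.T @ X.T @ X @ W_rand))            # True: trace objective is maximized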

6. Derivation of kernel PCA

Kernel function: let X be the input space (a subset or discrete subset of \(R^n\)) and let F be the feature space (a Hilbert space). If there exists a mapping from X to F, \[\phi(x): X \rightarrow F\] such that for all \(x, z \in X\) the function K(x, z) satisfies \[K(x, z) = \phi(x) \cdot \phi(z)\] then K is called a kernel function.

The derivation below projects onto the principal-component plane in F, which is obtained from the eigendecomposition of the sample covariance of F (for convenience of the derivation, the factor \(\frac{1}{n-1}\) is dropped): \[F^T H F V_i = \lambda_i V_i\] Since H is idempotent, HH = H, this can be written as \[F^T H H F V_i = \lambda_i V_i\]

Since F is hard to obtain explicitly, we change our approach and work with K instead of F. Using the fact that \(A A^T\) and \(A^T A\) have the same nonzero eigenvalues, we get \[H F F^T H U_i = \lambda_i U_i\]

Multiplying both sides on the left by \(F^T H\) gives \[F^T H H F F^T H U_i = \lambda_i F^T H U_i\]

From this equation, \(F^T H U_i\) is an eigenvector of \(F^T H H F\).

Normalizing \(F^T H U_i\): \[U_{normal} = \frac{F^T H U_i}{\sqrt{U_i^T H F F^T H U_i}}\]

Since \(H F F^T H = H K H\) and \(H K H U_i = \lambda_i U_i\) with \(U_i\) a unit vector, the denominator equals \(\sqrt{\lambda_i}\), so \[U_{normal} = \lambda_i^{-\frac{1}{2}} F^T H U_i\]

Projecting the centered F onto the plane defined by \(U_{normal}\): \[P = F_{center} U_{normal} = H F \cdot U_{normal}\]

\[P = \left(F - \frac{1}{n} 1_n 1_n^T F\right)\left(\lambda_i^{-\frac{1}{2}} F^T H U_i\right)\]

\[P = \lambda_i^{-\frac{1}{2}} \left(F F^T - \frac{1}{n} 1_n 1_n^T F F^T\right) H U_i\]

\[P = \lambda_i^{-\frac{1}{2}} \left(K - \frac{1}{n} 1_n 1_n^T K\right) H U_i = \lambda_i^{-\frac{1}{2}} H K H U_i\]
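
A minimal NumPy sketch of this procedure (my own illustration; the RBF kernel and the toy data are assumptions, not taken from the post): build the kernel matrix K, center it as HKH, eigendecompose, and project with the \(\lambda^{-\frac{1}{2}}\) scaling derived above.

import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), an assumed example kernel
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

def kernel_pca(X, n_components=2, gamma=1.0):
    n = X.shape[0]
    K = rbf_kernel(X, gamma)                    # kernel matrix K = F F^T
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    HKH = H @ K @ H                             # centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(HKH)      # ascending eigenvalues
    eigvals = eigvals[::-1][:n_components]      # top eigenvalues lambda_i
    U = eigvecs[:, ::-1][:, :n_components]      # corresponding eigenvectors U_i
    # projection P = lambda^{-1/2} H K H U, as derived above
    return (HKH @ U) / np.sqrt(eigvals)

# usage on assumed toy data
X = np.random.default_rng(0).normal(size=(30, 3))
P = kernel_pca(X, n_components=2, gamma=0.5)
print(P.shape)                                  # (30, 2)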

Appendix: Singular Value Decomposition

Singular value decomposition is a matrix factorization that can be applied to an arbitrary matrix: \[A = U \Sigma V^T\]

Suppose A is an M × N matrix. Then U is an M × M square matrix (its columns are orthonormal vectors, called the left singular vectors), \(\Sigma\) is an M × N rectangular diagonal matrix (all off-diagonal elements are zero, and the diagonal elements are the singular values), and \(V^T\) is an N × N matrix (the columns of V are orthonormal vectors, called the right singular vectors).

Combining with the eigenvalue decomposition: \[(A^T A) v_i = \lambda_i v_i\]

The \(v_i\) obtained above are the right singular vectors of the singular value decomposition, and the \(\lambda_i\) are the eigenvalues.

In addition, we also obtain: \[\sigma_i = \sqrt{\lambda_i}, \qquad u_i = \frac{1}{\sigma_i} A v_i\]

Here the \(\sigma_i\) are the singular values and the \(u_i\) are the left singular vectors.

The common practice is to sort the singular values in descending order. In many cases, the sum of the largest 10% or even 1% of the singular values accounts for more than 99% of the sum of all singular values, which means the matrix can be approximately described using only the r largest singular values:
\[A_{m \times n} \approx U_{m \times r} \Sigma_{r \times r} V_{r \times n}^T\]

where r is much smaller than m and n.
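
A short NumPy sketch of this low-rank approximation (my own illustrative matrix):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))                     # an arbitrary m x n matrix (illustrative)

U, s, Vt = np.linalg.svd(A)                     # s holds the singular values, descending
print(np.allclose(s**2, np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]))  # sigma_i^2 = lambda_i

r = 2                                           # keep the r largest singular values
A_r = (U[:, :r] * s[:r]) @ Vt[:r, :]            # rank-r approximation U_r Sigma_r V_r^T
print(np.linalg.norm(A - A_r))                  # approximation error (Frobenius norm)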

