MATLAB distance functions: pdist and pdist2

Reposted from: https://blog.csdn.net/liuci3234/article/details/9255119

1. pdist

Pairwise distance between pairs of objects

Syntax

D = pdist(X)

D = pdist(X,distance)

Description

D = pdist(X)

Computes the pairwise distances between the rows of X (an m-by-n matrix). Pay special attention to D: it is a row vector of length m(m-1)/2. To see how D is generated, first imagine the m-by-m square distance matrix of X. Because that matrix is symmetric with zeros on its diagonal, only its lower triangle is needed; following MATLAB's column-major storage convention, those elements are taken column by column, so the entries of D correspond to the index pairs (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m-1). The command squareform(D) converts this row vector back into the square matrix (the squareform function is designed for exactly this job and also performs the inverse transform).
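The same behavior can be reproduced outside MATLAB: SciPy's scipy.spatial.distance module provides pdist and squareform with matching semantics, so the description above can be sketched in Python as follows.

```python
# Python analog of MATLAB's pdist/squareform (scipy.spatial.distance
# uses the same condensed-distance-vector conventions).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [3.0, 0.0],
              [0.0, 4.0]])   # m = 3 observations, n = 2 variables

D = pdist(X)                 # row vector of length m*(m-1)/2 = 3
# Entries follow the pair order (2,1), (3,1), (3,2):
# d(x1,x2) = 3, d(x1,x3) = 4, d(x2,x3) = 5
S = squareform(D)            # back to the symmetric m-by-m matrix
```

Applying squareform to S returns the condensed vector D again, illustrating that it is its own inverse transform.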

D = pdist(X,distance). The distance argument specifies the metric and can take any of the values listed below.

Metrics

Given an m-by-n data matrix X, treated as m (1-by-n) row vectors x1, x2, ..., xm, the various distances between vectors xs and xt are defined as follows:

Euclidean distance ('euclidean')

$$ d_{s,t}^2 = (x_s - x_t)(x_s - x_t)' $$

 

Notice that the Euclidean distance is a special case of the Minkowski metric, where p = 2.
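The p = 2 special case is easy to verify numerically; a small sketch using SciPy's pdist, whose metric names largely match MATLAB's:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

d_euc  = pdist(X, 'euclidean')
d_mink = pdist(X, 'minkowski', p=2)   # Minkowski metric with p = 2
# The two vectors coincide: e.g. d((0,0),(3,4)) = 5, d((0,0),(6,8)) = 10.
```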

Although the Euclidean distance is useful, it has obvious shortcomings.

One: it treats the differences between the attributes of samples (i.e., the individual variables or indicators) as equally important, which sometimes does not meet practical requirements.

Two: it ignores the magnitudes (units) of the variables, so variables with large values easily swamp those with small values. It is therefore advisable to normalize the raw data before computing distances.

 

Standardized Euclidean distance ('seuclidean')

$$ d_{s,t}^2 = (x_s - x_t)V^{-1}(x_s - x_t)' $$

 

where V is the n-by-n diagonal matrix whose jth diagonal element is S(j)^2, where S is the vector of standard deviations.

Compared with the plain Euclidean distance, the standardized Euclidean distance effectively addresses the drawbacks mentioned above. Note that in many MATLAB functions V can be set by the user; it need not contain the standard deviations, and different values can be chosen according to the importance of each variable (see, for example, the Scale property of knnsearch).
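In SciPy the same idea is exposed through the V keyword of the 'seuclidean' metric; by default V holds the per-column sample variances, and passing all ones recovers the plain Euclidean distance (a sketch of the custom-V idea, not MATLAB's knnsearch itself):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3)) * [1.0, 10.0, 100.0]   # columns on very different scales

d_std = pdist(X, 'seuclidean')                 # V defaults to the column variances
d_one = pdist(X, 'seuclidean', V=np.ones(3))   # custom V of all ones ...
d_euc = pdist(X, 'euclidean')                  # ... reproduces plain Euclidean
```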

 

Mahalanobis distance ('mahalanobis')

$$ d_{s,t}^2 = (x_s - x_t)C^{-1}(x_s - x_t)' $$

 

where C is the covariance matrix.

The Mahalanobis distance was proposed by the Indian statistician P. C. Mahalanobis and expresses the covariance-weighted distance between data points. It is an effective way to measure the similarity of two samples from an unknown distribution. Unlike the Euclidean distance, it accounts for correlations between features (for example, a person's height carries information about their weight, because the two are related) and is scale-invariant, i.e., independent of the measurement units.

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance; if the covariance matrix is diagonal, it becomes the standardized Euclidean distance.
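The identity-covariance case can be checked directly with SciPy; note that SciPy's 'mahalanobis' metric takes VI, the inverse of the covariance matrix C:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.random.default_rng(2).normal(size=(8, 3))

# SciPy takes VI, the INVERSE of the covariance matrix; C = I gives VI = I.
d_maha = pdist(X, 'mahalanobis', VI=np.eye(3))
d_euc  = pdist(X, 'euclidean')
# With an identity covariance the two results coincide, as stated above.
```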

Advantages and disadvantages of the Mahalanobis distance:

  1) It is based on the sample as a whole: because C is estimated from all the data, the Mahalanobis distance can be unstable;

  2) its computation requires the number of samples to exceed the dimensionality of the data;

  3) the inverse of the covariance matrix may not exist.

 

Manhattan distance (city block metric) ('cityblock')

$$ d_{s,t} = \sum_{j=1}^{n} |x_{sj} - x_{tj}| $$

 

Notice that the city block distance is a special case of the Minkowski metric, where p=1.
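As with the Euclidean case, the p = 1 special case can be confirmed numerically with SciPy's pdist:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [3.0, 4.0]])

d_city = pdist(X, 'cityblock')        # |0-3| + |0-4| = 7
d_mink = pdist(X, 'minkowski', p=1)   # Minkowski with p = 1 gives the same value
```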

 

Minkowski distance ('minkowski')

$$ d_{s,t} = \left( \sum_{j=1}^{n} |x_{sj} - x_{tj}|^p \right)^{1/p} $$

 

Notice that for the special case of p = 1, the Minkowski metric gives the city block metric, for the special case of p = 2, the Minkowski metric gives the Euclidean distance, and for the special case of p = ∞, the Minkowski metric gives the Chebychev distance.

Since the Minkowski distance is a generalization of the Euclidean distance, it shares essentially the same drawbacks.

 

Chebyshev distance ('chebychev')

$$ d_{s,t} = \max_j |x_{sj} - x_{tj}| $$

 

Notice that the Chebychev distance is a special case of the Minkowski metric, where p = ∞.
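A one-pair check of the max-of-absolute-differences definition, again in SciPy (which spells the metric 'chebyshev', while MATLAB uses 'chebychev'):

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 5.0], [4.0, 1.0]])
d = pdist(X, 'chebyshev')   # max(|1-4|, |5-1|) = 4
```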

 

Cosine distance ('cosine')

$$ d_{s,t} = 1 - \frac{x_s x_t'}{\|x_s\|_2 \, \|x_t\|_2} $$

 

Compared with the Jaccard distance, the cosine distance not only ignores 0-0 matches but can also handle non-binary vectors, i.e., it takes the magnitudes of the variable values into account.
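A small sketch showing that the cosine distance depends only on direction: parallel vectors are at distance 0 and anti-parallel vectors at distance 2.

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 2.0],
              [2.0, 4.0],      # same direction as row 1, twice the length
              [-1.0, -2.0]])   # opposite direction to row 1

d = pdist(X, 'cosine')
# Pairs (1,2): 0 (parallel), (1,3): 2 (anti-parallel), (2,3): 2
```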

 

Correlation distance ('correlation')

$$ d_{s,t} = 1 - \frac{(x_s - \bar{x}_s)(x_t - \bar{x}_t)'}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)'} \, \sqrt{(x_t - \bar{x}_t)(x_t - \bar{x}_t)'}} $$

 

The correlation distance mainly measures the degree of linear correlation between two vectors.
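Unlike the cosine distance, the correlation distance first centers each vector, so it is invariant to additive shifts; a quick illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 2.0, 3.0],
              [11.0, 12.0, 13.0],   # row 1 shifted by 10: perfectly correlated
              [3.0, 2.0, 1.0]])     # row 1 reversed: perfectly anti-correlated

d = pdist(X, 'correlation')
# Pairs (1,2): 0, (1,3): 2, (2,3): 2
```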

Hamming distance ('hamming')

$$ d_{s,t} = \frac{\#(x_{sj} \neq x_{tj})}{n} $$

 

The Hamming distance between two vectors is defined as the fraction of coordinates in which the two vectors differ, i.e., the number of differing coordinates divided by the total number of coordinates.
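A minimal example of the fraction-of-differing-coordinates definition:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1, 0, 1, 1],
              [1, 1, 0, 1]], dtype=bool)

d = pdist(X, 'hamming')   # 2 of 4 coordinates differ -> 2/4 = 0.5
```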

 

Jaccard distance ('jaccard')

$$ d_{s,t} = \frac{\#\left[ (x_{sj} \neq x_{tj}) \cap \left( (x_{sj} \neq 0) \cup (x_{tj} \neq 0) \right) \right]}{\#\left[ (x_{sj} \neq 0) \cup (x_{tj} \neq 0) \right]} $$

 

The Jaccard distance is used for data containing only asymmetric binary (0-1) attributes. Clearly, the Jaccard distance ignores 0-0 matches, whereas the Hamming distance counts them.
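The difference over 0-0 matches shows up directly when the two metrics are applied to the same binary data:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[0, 0, 1, 1],
              [0, 0, 1, 0]], dtype=bool)

d_ham = pdist(X, 'hamming')   # 1 mismatch / 4 coordinates          = 0.25
d_jac = pdist(X, 'jaccard')   # 1 mismatch / 2 not-both-zero coords = 0.5
# The two 0-0 matches lower the Hamming distance but are ignored by Jaccard.
```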

 

Spearman distance ('spearman')

$$ d_{s,t} = 1 - \frac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)'}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)'} \, \sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)'}} $$

 

where

rsj is the rank of xsj taken over x1j, x2j, ...xmj, as computed by tiedrank

rs and rt are the coordinate-wise rank vectors of xs and xt, i.e., rs = (rs1, rs2, ... rsn)

$$ \bar{r}_s = \frac{1}{n} \sum_j r_{sj} = \frac{n+1}{2} $$

$$ \bar{r}_t = \frac{1}{n} \sum_j r_{tj} = \frac{n+1}{2} $$
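SciPy's pdist has no built-in 'spearman' metric, but the definition above can be sketched by hand: rank each column with scipy.stats.rankdata (which averages ties, playing the role of MATLAB's tiedrank) and then apply the 'correlation' metric to the rank vectors. This is a sketch of the rank-then-correlate recipe, not guaranteed to match MATLAB bit-for-bit.

```python
import numpy as np
from scipy.stats import rankdata
from scipy.spatial.distance import pdist

X = np.array([[1.0, 5.0, 3.0, 4.0],
              [2.0, 9.0, 4.0, 8.0],
              [9.0, 1.0, 7.0, 2.0]])

# Rank each column across observations (ties averaged, like tiedrank) ...
R = np.apply_along_axis(rankdata, 0, X)
# ... then take the correlation distance of the resulting rank vectors.
d = pdist(R, 'correlation')
# Rows 1 and 2 have identical column-rank patterns -> distance 0.
```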

2. pdist2

Pairwise distance between two sets of observations

Syntax

D = pdist2(X,Y)

D = pdist2(X,Y,distance)

D = pdist2(X,Y,'minkowski',P)

D = pdist2(X,Y,'mahalanobis',C)

D = pdist2(X,Y,distance,'Smallest',K)

D = pdist2(X,Y,distance,'Largest',K)

[D,I] = pdist2(X,Y,distance,'Smallest',K)

[D,I] = pdist2(X,Y,distance,'Largest',K)

Description

Here X is an mx-by-n matrix and Y is an my-by-n matrix; pdist2 generates an mx-by-my distance matrix D.

[D,I] = pdist2(X,Y,distance,'Smallest',K)

This generates a K-by-my matrix D and a matrix I of the same size. Each column of D contains the K smallest elements of the corresponding column of the original distance matrix, sorted in ascending order, and the corresponding column of I holds their row indices. Note that the K smallest values are taken independently for each column.

For example, if A is the original mx-by-my distance matrix, then the K-by-my matrix D satisfies D(:,j) = A(I(:,j), j).
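SciPy's analog of pdist2 is cdist; there is no direct counterpart to the 'Smallest' option, but a per-column argsort reproduces the relation D(:,j) = A(I(:,j), j) described above (a sketch; the indices here are 0-based, unlike MATLAB's):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])   # mx = 3
Y = np.array([[0.0, 1.0], [4.0, 5.0]])               # my = 2

A = cdist(X, Y)   # mx-by-my distance matrix, like pdist2(X, Y)

K = 2
I = np.argsort(A, axis=0)[:K, :]       # K smallest per COLUMN: row indices (0-based)
D = np.take_along_axis(A, I, axis=0)   # K-by-my distances, ascending in each column
# D[:, j] == A[I[:, j], j] for every column j.
```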


Origin: blog.csdn.net/hhsh49/article/details/82686182