Various distances: Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, Mahalanobis distance, cosine distance, Hamming distance, Jaccard distance, correlation distance, information entropy

1. Euclidean Distance

Euclidean distance is the most intuitive distance measure to understand. The distance between two points in space that we learn about in elementary school, junior high school and high school generally refers to the Euclidean distance.


  • Euclidean distance between points a(x1,y1) and b(x2,y2) on a two-dimensional plane:

d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

  • Euclidean distance between points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space:

d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}

  • Euclidean distance between n-dimensional space points a(x11,x12,…,x1n) and b(x21,x22,…,x2n) (two n-dimensional vectors):

d_{12} = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}

  • Matlab calculates the Euclidean distance:

Matlab computes distances with the pdist function. If X is an m×n matrix, pdist(X) treats each row of X as an n-dimensional row vector and computes the pairwise distances between the m vectors.
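
A minimal sketch (assuming the Statistics Toolbox is available; the sample matrix X is made up for illustration):

    % three 2-D points, one per row
    X = [0 0; 1 0; 0 2];
    % pairwise Euclidean distances d(1,2), d(1,3), d(2,3)
    D = pdist(X, 'euclidean')   % expected: 1.0000  2.0000  2.2361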

2. Manhattan Distance

As the name suggests, the distance you drive from one intersection to another on the Manhattan street grid is obviously not the straight-line distance between the two points. This actual driving distance is the "Manhattan distance". The Manhattan distance is also known as the City Block distance.


  • Manhattan distance between two points a(x1,y1) and b(x2,y2) in a two-dimensional plane:

d_{12} = |x_1 - x_2| + |y_1 - y_2|

  • Manhattan distance between n-dimensional space points a(x11,x12,...,x1n) and b(x21,x22,...,x2n):

d_{12} = \sum_{k=1}^{n} |x_{1k} - x_{2k}|

  • Matlab calculates the Manhattan distance:
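
A minimal sketch; pdist uses the option name 'cityblock' for the Manhattan distance (the sample points are made up):

    X = [0 0; 1 0; 0 2];
    % pairwise Manhattan (city block) distances
    D = pdist(X, 'cityblock')   % expected: 1  2  3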

3. Chebyshev Distance

In chess, the king can move one square straight ahead, sideways, or diagonally, so with one step the king can reach any of the 8 adjacent squares. What is the minimum number of steps required for the king to walk from square (x1, y1) to square (x2, y2)? You will find that the answer is max(|x2 - x1|, |y2 - y1|) steps. This distance is called the Chebyshev distance.


  • Chebyshev distance between two points a(x1,y1) and b(x2,y2) in a two-dimensional plane:

d_{12} = \max(|x_1 - x_2|, |y_1 - y_2|)

  • Chebyshev distance between n-dimensional space points a(x11,x12,...,x1n) and b(x21,x22,...,x2n):

d_{12} = \max_{k} |x_{1k} - x_{2k}| = \lim_{p \to \infty} \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p}

  • Matlab calculates the Chebyshev distance:
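
A minimal sketch; note that pdist spells this option 'chebychev' (the sample points are made up):

    X = [0 0; 1 0; 0 2];
    % pairwise Chebyshev distances: the maximum coordinate difference
    D = pdist(X, 'chebychev')   % expected: 1  2  2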

4. Minkowski Distance

The Minkowski distance is not a single distance but rather the definition of a family of distances: a generalized formulation that covers several distance measures.

  • Minkowski distance definition:
  • The Minkowski distance between two n-dimensional variables a(x11,x12,…,x1n) and b(x21,x22,…,x2n) is defined as:

d_{12} = \sqrt[p]{\sum_{k=1}^{n} |x_{1k} - x_{2k}|^p}

where p is a variable parameter:

When p=1, it is the Manhattan distance;

When p=2, it is the Euclidean distance;

When p→∞, it is the Chebyshev distance.

Therefore, depending on the value of the parameter p, the Minkowski distance can represent a whole class of distances.

  • The Minkowski distance, including the Manhattan distance, Euclidean distance and Chebyshev distance, has obvious shortcomings.
  • E.g. for two-dimensional samples (height [unit: cm], weight [unit: kg]), take three samples: a(180,50), b(190,50), c(180,60). Then the Minkowski distance between a and b (whether Manhattan, Euclidean or Chebyshev) is equal to the Minkowski distance between a and c. But in fact a difference of 10 cm in height is not equivalent to a difference of 10 kg in weight.
  • Disadvantages of the Minkowski distance:
  • (1) It treats the scales, i.e. the "units", of all components in the same way;
  • (2) It does not take into account that the distributions (expectations, variances, etc.) of the individual components may differ.

  • Matlab calculates the Minkowski distance (using p = 2, i.e. the Euclidean distance, as an example):
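
A minimal sketch (the sample points are made up); the third argument of pdist is the exponent p:

    X = [0 0; 1 0; 0 2];
    % Minkowski distance with p = 2, i.e. the Euclidean distance
    D = pdist(X, 'minkowski', 2)   % expected: 1.0000  2.0000  2.2361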

5. Standardized Euclidean Distance

Definition: the standardized Euclidean distance is an improvement that addresses the shortcomings of the plain Euclidean distance. The idea of the standardized Euclidean distance: since the distributions of the data components differ across dimensions, first "standardize" each component so that all components have equal means and variances. Assuming that the mean of the sample set X is m and the standard deviation is s, the "standardized variable" of X is expressed as:

X^* = \frac{X - m}{s}

  • Standardized Euclidean distance formula:

d_{12} = \sqrt{\sum_{k=1}^{n} \left( \frac{x_{1k} - x_{2k}}{s_k} \right)^2}

where s_k is the standard deviation of the k-th component.

If the inverse of the variance is regarded as a weight, it can also be called the Weighted Euclidean distance.

  • Matlab calculates the standardized Euclidean distance (assuming the standard deviations of the two components are 0.5 and 1):
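
A minimal sketch (the sample points are made up); for 'seuclidean' the third argument of pdist is the vector of per-component standard deviations:

    X = [0 0; 1 0; 0 2];
    % each component difference is divided by the assumed standard
    % deviations 0.5 and 1 before computing the Euclidean distance
    D = pdist(X, 'seuclidean', [0.5, 1])   % expected: 2.0000  2.0000  2.8284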

6. Mahalanobis Distance

The intuition behind the Mahalanobis distance:

[Figure: two normally distributed populations with means a and b and different variances]

The figure above shows two normally distributed populations with means a and b respectively, but with different variances. Which population is the point A in the figure closer to? Or rather, to which population does A more probably belong? Clearly, A is closer to the left population: A belongs to the left population with greater probability, even though the Euclidean distance between A and a is larger. This is the intuitive interpretation of the Mahalanobis distance.

  • Concept: the Mahalanobis distance is a distance based on the distribution of the samples. Its physical meaning is the Euclidean distance in the normalized principal component space. The so-called normalized principal component space is obtained by applying principal component analysis to decompose the data into principal components and then normalizing all principal component axes to unit variance; the space spanned by these new coordinate axes is the normalized principal component space.


  • Definition: given M sample vectors X1~Xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector X to μ is expressed as:

D(X) = \sqrt{(X - \mu)^T S^{-1} (X - \mu)}

The Mahalanobis distance between vectors Xi and Xj is defined as:

D(X_i, X_j) = \sqrt{(X_i - X_j)^T S^{-1} (X_i - X_j)}

If the covariance matrix is the identity matrix (the components of the sample vectors are independent and identically distributed), then the Mahalanobis distance between Xi and Xj reduces to their Euclidean distance:

D(X_i, X_j) = \sqrt{(X_i - X_j)^T (X_i - X_j)}

If the covariance matrix is a diagonal matrix, the Mahalanobis distance becomes the standardized Euclidean distance.

  • Euclidean distance & Mahalanobis distance:

[Figures: Euclidean distance vs. Mahalanobis distance]

  • The characteristics of the Mahalanobis distance:
  • It is independent of the measurement units of the components and excludes interference from correlations between variables;
  • The calculation of the Mahalanobis distance is based on the population of samples. If the same two samples are placed in two different populations, the Mahalanobis distance between them will usually differ, unless the covariance matrices of the two populations happen to be the same;
  • Computing the Mahalanobis distance requires the total number of samples to be larger than the dimension of the samples; otherwise the inverse of the sample covariance matrix does not exist. In that case, the Euclidean distance can be used instead.
  • Matlab calculates the Mahalanobis distance:
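
A minimal sketch (the sample matrix is made up); pdist(X, 'mahalanobis') uses the covariance matrix estimated from X itself, so the number of rows must exceed the number of columns:

    % four 2-D samples, one per row
    X = [1 2; 1 3; 2 2; 3 1];
    % pairwise Mahalanobis distances based on the sample covariance of X
    D = pdist(X, 'mahalanobis')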

7. Cosine Distance

In geometry, the cosine of the included angle can be used to measure the difference between the directions of two vectors; in machine learning, this concept is borrowed to measure the difference between sample vectors.

  • The cosine formula of the angle between vector A(x1,y1) and vector B(x2,y2) in two-dimensional space:

\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2} \sqrt{x_2^2 + y_2^2}}

  • The cosine of the angle between two n-dimensional sample points a(x11,x12,...,x1n) and b(x21,x22,...,x2n) is:

\cos\theta = \frac{a \cdot b}{|a| \, |b|}

which is:

\cos\theta = \frac{\sum_{k=1}^{n} x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^2} \sqrt{\sum_{k=1}^{n} x_{2k}^2}}

The value range of the cosine of the included angle is [-1,1]. The larger the cosine, the smaller the angle between the two vectors, and the smaller the cosine, the larger the angle between the two vectors. When the directions of the two vectors coincide, the cosine takes the maximum value of 1, and when the directions of the two vectors are completely opposite, the cosine takes the minimum value of -1.

  • Matlab calculates the cosine of the included angle (pdist(X, 'cosine') in Matlab returns 1 minus the cosine of the included angle):
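
A minimal sketch (the sample vectors are made up); since pdist returns 1 - cos(theta), subtract from 1 to recover the cosine itself:

    X = [1 1; 1 2; 2.5 5];
    % cosine of the included angle between each pair of row vectors
    C = 1 - pdist(X, 'cosine')
    % rows 2 and 3 are parallel, so their cosine is 1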

8. Hamming Distance


  • Definition: the Hamming distance between two equal-length strings s1 and s2 is the number of positions at which the corresponding characters differ, i.e. the minimum number of character substitutions required to change one string into the other. E.g. the Hamming distance between 1011101 and 1001001 is 2.

  • Hamming weight: the Hamming distance of a string relative to the all-zero string of the same length, i.e. the number of non-zero elements in the string; for a binary string, this is the number of 1s, so the Hamming weight of 11101 is 4. Thus, the Hamming distance between elements a and b of a vector space equals the Hamming weight of their difference a-b (for binary strings, a XOR b).

  • Applications: Hamming weight analysis is used in fields including information theory, coding theory, and cryptography. For example, when encoding information, to enhance fault tolerance the minimum Hamming distance between codewords should be made as large as possible. If two strings of different lengths are to be compared, however, not only substitutions but also insertions and deletions are required; in that case more complex algorithms such as the edit distance are usually used.

  • Matlab calculates the Hamming distance (Matlab defines the Hamming distance between two vectors as the percentage of components in which they differ):
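
A minimal sketch (the binary vectors are made up); multiply the fraction by the vector length to get the classical count of differing positions:

    X = [0 1 1 0 1; 0 0 1 1 1];
    % fraction of positions in which the two rows differ
    f = pdist(X, 'hamming')    % expected: 0.4 (2 of 5 positions differ)
    % classical Hamming distance as a count
    d = f * size(X, 2)         % expected: 2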

9. Jaccard Distance

Jaccard similarity coefficient: the ratio of the size of the intersection of two sets A and B to the size of their union is called the Jaccard similarity coefficient of the two sets, denoted J(A, B):

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

  • Jaccard distance: the counterpart of the Jaccard similarity coefficient. It measures the dissimilarity of two sets by the proportion, among all elements of the union, of those elements that belong to exactly one of the two sets:

J_{\delta}(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}

  • Matlab calculates the Jaccard distance (Matlab defines the Jaccard distance as the ratio of the number of differing dimensions to the number of "non-zero dimensions"):
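
A minimal sketch (the sample vectors are made up): coordinates where both vectors are zero are ignored, and the distance is the fraction of the remaining coordinates that differ:

    X = [1 1 0; 1 -1 0; -1 1 0];
    D = pdist(X, 'jaccard')   % expected: 0.5  0.5  1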

10. Correlation distance


  • Correlation coefficient: It is a method to measure the degree of correlation between random variables X and Y. The value range of the correlation coefficient is [-1, 1]. The larger the absolute value of the correlation coefficient, the higher the correlation between X and Y. When X and Y are linearly related, the correlation coefficient takes a value of 1 (positive linear correlation) or -1 (negative linear correlation):

\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{D(X)} \sqrt{D(Y)}} = \frac{E[(X - EX)(Y - EY)]}{\sqrt{D(X)} \sqrt{D(Y)}}

  • Correlation distance:

D_{XY} = 1 - \rho_{XY}

  • Matlab calculates the correlation coefficient and the correlation distance:
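
A minimal sketch (the two sample vectors are made up); corrcoef returns the correlation coefficient matrix, and pdist(X, 'correlation') returns 1 minus the correlation coefficient:

    X = [1 2 3 4; 3 8 7 6];
    % correlation coefficient between the two rows
    R = corrcoef(X(1,:), X(2,:));
    r = R(1, 2)
    % correlation distance, equal to 1 - r
    D = pdist(X, 'correlation')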

11. Information Entropy

The distance measures above all quantify the distance between two samples (vectors), whereas information entropy describes the degree of concentration (consistency) or dispersion (confusion, inconsistency) of the distribution of samples within an entire system. The more dispersed (more uniform) the sample distribution in the system, the greater the information entropy; the more ordered (more concentrated) the distribution, the smaller the information entropy.


  • The origin of information entropy: Please refer to the blog: XXXXXXXX.

  • The formula to calculate the information entropy of a given sample set X:

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

The meaning of the parameters:

n: the number of classes in the sample set X

p_i: the probability that an element of X belongs to the i-th class

The larger the information entropy, the more dispersed (balanced) the distribution of the sample set X; the smaller the information entropy, the more concentrated (unbalanced) the distribution of X. When the n classes in X occur with equal probability (all 1/n), the information entropy attains its maximum value log2(n). When X contains only one class, the information entropy attains its minimum value 0.
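
A minimal sketch of the formula (the class labels are made up; this is not a built-in Matlab routine):

    % hypothetical class labels of ten samples drawn from three classes
    labels = [1 1 1 2 2 3 3 3 3 3];
    classes = unique(labels);
    % empirical probability of each class
    p = arrayfun(@(c) mean(labels == c), classes);
    % information entropy; the maximum possible for 3 classes is log2(3)
    H = -sum(p .* log2(p))   % expected: 1.4855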
