Euclidean distance
Euclidean distance is the most commonly used notion of distance: the straight-line distance between two points in n-dimensional space.
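As a minimal sketch, the straight-line distance can be computed as the square root of the sum of squared coordinate differences (the function name here is chosen to match the naming style of the other implementations in this article):

```python
import math

def EuclideanDistance(x, y):
    # Square root of the sum of squared differences over paired coordinates
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For example, `EuclideanDistance((0, 0), (3, 4))` gives `5.0`, the familiar 3-4-5 right triangle.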
Manhattan distance
Manhattan distance is a metric on a standard coordinate system: the sum of the absolute differences of the coordinates of two points along each axis.
In the figure below, the red line represents the Manhattan distance, the green line represents the Euclidean distance (the straight-line distance), and the blue and yellow lines represent paths of equivalent Manhattan distance.
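A minimal sketch of the definition above, summing absolute per-axis differences (function name chosen to match the article's naming style):

```python
def ManhattanDistance(x, y):
    # Sum of absolute coordinate differences along each axis
    return sum(abs(a - b) for a, b in zip(x, y))
```

For example, `ManhattanDistance((1, 1), (4, 5))` gives `7` (3 blocks one way, 4 the other), while the Euclidean distance between the same points is 5.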
Minkowski distance
Minkowski distance is not a single distance but a definition of a family of distances: a general expression covering several distance formulas, with Euclidean distance (p = 2) and Manhattan distance (p = 1) as special cases:

d(x, y) = (Σ_i |x_i − y_i|^p)^(1/p)
Python implementation
import math
import numpy as np

def MinkowskiDistance(x, y, p):
    # Sum |x_i - y_i|^p over paired coordinates, then take the p-th root
    return math.pow(np.sum([math.pow(np.abs(a - b), p) for a, b in zip(x, y)]), 1 / p)
Mahalanobis distance
Before introducing the Mahalanobis distance, let's look at the following concepts:
Variance : Variance is the square of the standard deviation, and the standard deviation measures, on average, how far the points in a data set lie from the mean. It reflects the degree of dispersion of the data.
Covariance : Standard deviation and variance describe one-dimensional data. With multidimensional data, we usually want to know whether the variables in different dimensions are related. Covariance is a statistic that measures the correlation between two random variables; for example, the relationship between a person's height and weight can be measured by covariance. If the covariance of two variables is positive, they are positively correlated; if it is negative, they are negatively correlated.
Covariance matrix : When there are more than two variables, the covariance matrix is used to measure the pairwise correlations among all of them. Suppose X is a column vector composed of n random variables:

X = (X_1, X_2, ..., X_n)^T

where μ_i = E(X_i) is the expected value of the i-th element. The (i, j) entry of the covariance matrix Σ (each entry is itself a covariance) is defined as:

Σ_ij = cov(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)]

That is, the (i, j)-th element of the matrix is the covariance of X_i and X_j.
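In practice the covariance matrix is estimated from samples. A minimal sketch with NumPy, using hypothetical height/weight data (the numbers are illustrative, not from the article):

```python
import numpy as np

# Hypothetical sample data: each row is an observation (height, weight),
# each column is one variable
data = np.array([[170.0, 60.0],
                 [180.0, 75.0],
                 [160.0, 50.0],
                 [175.0, 68.0]])

# np.cov treats rows as variables by default, so pass rowvar=False
# to treat columns as variables instead
cov_matrix = np.cov(data, rowvar=False)
```

The result is an n × n symmetric matrix: the diagonal holds each variable's variance, and the off-diagonal entries hold the covariances between pairs of variables.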
Mahalanobis distance (Mahalanobis Distance) is a distance measure that can be regarded as a correction of Euclidean distance: it corrects for the fact that in Euclidean distance the dimensions may have inconsistent scales and may be correlated, i.e., the dimensions are not independent and identically distributed.
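A minimal sketch of the standard formula d(x, y) = sqrt((x − y)^T Σ⁻¹ (x − y)), where Σ is the covariance matrix estimated from a sample (the function name and the `data` parameter are this sketch's own conventions, not from the article):

```python
import numpy as np

def MahalanobisDistance(x, y, data):
    # data: sample matrix (observations in rows, variables in columns)
    # used to estimate the covariance matrix of the dimensions
    cov = np.cov(data, rowvar=False)
    inv_cov = np.linalg.inv(cov)          # inverse covariance matrix
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # sqrt((x - y)^T * Sigma^{-1} * (x - y))
    return float(np.sqrt(diff @ inv_cov @ diff))
```

When the dimensions are uncorrelated and have unit variance, Σ is the identity matrix and the formula reduces to the ordinary Euclidean distance.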
Hamming distance
Hamming distance (Hamming Distance) is a distance measure used in error-control coding for data transmission; it is the number of positions at which two equal-length strings differ. XOR the two strings and count the number of 1s in the result: that count is the Hamming distance. Equivalently, the Hamming distance is the minimum number of substitutions required to change one of two equal-length strings into the other.
Python implementation
def HammingDistance(x, y):
return sum(x_ch != y_ch for x_ch, y_ch in zip(x, y))
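For integers interpreted as bit strings, the XOR-and-count description above can also be sketched directly (a hypothetical variant, not the article's implementation):

```python
def HammingDistanceInt(a, b):
    # XOR the two integers, then count the set bits in the result:
    # each 1 bit marks a position where the operands differ
    return bin(a ^ b).count("1")
```

For example, `HammingDistanceInt(0b1011, 0b1001)` gives `1`, since the two bit patterns differ in exactly one position.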