Machine learning---common distance formulas (Euclidean distance, Manhattan distance, standardized Euclidean distance, cosine distance, Hamming distance, Jaccard distance, Mahalanobis distance, Chebyshev distance, Minkowski distance, KL divergence)

1. Euclidean distance

       The Euclidean metric (also called Euclidean distance) is a commonly used distance definition. It refers to the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In two- and three-dimensional space, the Euclidean distance is the actual straight-line distance between two points. For two n-dimensional points a(x11,x12,…,x1n) and b(x21,x22,…,x2n), it is d(a,b) = sqrt( (x11-x21)^2 + (x12-x22)^2 + … + (x1n-x2n)^2 ).

from scipy.spatial import distance
a = (1, 2, 3)
b = (4, 5, 6)

print(distance.euclidean(a, b))
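
For comparison with the formula-based computations in the later sections, here is a minimal NumPy sketch of the same calculation (the point values are the example tuples above):

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Method 1: directly from the formula, the square root of the sum of squared coordinate differences
dist1 = np.sqrt(np.sum((a - b) ** 2))

# Method 2: built-in linear-algebra norm (ord=2, the default, is the Euclidean norm)
dist2 = np.linalg.norm(a - b)

print(dist1, dist2)  # both are about 5.196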

2. Manhattan distance 

        The red line in the figure represents the Manhattan distance, the green line represents the Euclidean distance (the straight-line distance), and the blue and yellow lines represent equivalent Manhattan distances.

Manhattan distance is the distance between two points in the north-south direction plus the distance in the east-west direction, that is, d(i,j) = |xi - xj| + |yi - yj|. For a town whose streets are laid out on a regular grid running due north-south and due east-west, the distance from one point to another is the distance traveled in the north-south direction plus the distance traveled in the east-west direction; for this reason the Manhattan distance is also known as the taxicab distance. The Manhattan distance is not invariant under changes of the coordinate axes: when the axes are rotated, the distance between points changes.

In early computer graphics, the screen is made up of pixels with integer coordinates, and floating-point operations are expensive, slow and error-prone. Computing the Euclidean distance between A and B directly requires floating-point arithmetic, whereas computing the legs AC and CB requires only additions and subtractions. This greatly improves the calculation speed, and no matter how many times the calculation is accumulated, no rounding error is introduced.

import numpy as np
from scipy.spatial import distance

A = np.array([7,8,9])
B = np.array([4,5,6])

# Method 1: compute directly from the formula
dist1 = np.sum(np.abs(A-B))

# Method 2: built-in linear-algebra function
dist2 = np.linalg.norm(A-B,ord=1)  # ord is the norm type: 1 (L1 norm), 2 (L2 norm, the default), np.inf (infinity norm)

# Method 3: scipy
dist3 = distance.cityblock(A,B)

3. Standardized Euclidean distance 

The standardized Euclidean distance is an improvement that addresses a shortcoming of the Euclidean distance: since the distributions of the individual dimensional components of the data differ, each component is first "standardized" so that all components have the same mean and variance (mean 0 and variance 1), and the Euclidean distance is then computed on the standardized values.

With S_k denoting the standard deviation of the k-th dimension, the standardized Euclidean distance between two n-dimensional points a(x11,x12,…,x1n) and b(x21,x22,…,x2n) is d(a,b) = sqrt( ((x11-x21)/S1)^2 + ((x12-x22)/S2)^2 + … + ((x1n-x2n)/Sn)^2 ).

If the reciprocal of the variance is regarded as a weight, this can also be called the weighted Euclidean distance.

import numpy as np
from scipy.spatial.distance import pdist
Vec = np.vstack([[1, 2, 3], [4, 5, 6]])   # two example vectors stacked as rows
dist2 = pdist(Vec, 'seuclidean')
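
A minimal sketch of the same computation done by hand; the two example vectors are made up, and the per-dimension variance is computed with ddof=1, which I believe matches scipy's default for 'seuclidean':

import numpy as np
from scipy.spatial.distance import pdist

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
Vec = np.vstack([a, b])

# Per-dimension variance of the stacked samples
V = np.var(Vec, axis=0, ddof=1)

# Standardized Euclidean distance: sqrt(sum((a_k - b_k)^2 / V_k))
dist_manual = np.sqrt(np.sum((a - b) ** 2 / V))
dist_scipy = pdist(Vec, 'seuclidean')[0]

print(dist_manual, dist_scipy)  # the two values should agree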

4. Cosine distance 

       The cosine of the angle between two vectors can be used to measure the difference in their directions; in machine learning this concept is used to measure the difference in direction between sample vectors.

The cosine of the angle between vector A(x1,y1) and vector B(x2,y2) in two-dimensional space is cos(θ) = (x1*x2 + y1*y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2)).

The cosine of the angle between two n-dimensional sample points a(x11,x12,…,x1n) and b(x21,x22,…,x2n) is cos(θ) = (a · b) / (|a| |b|), that is, the dot product of the two vectors divided by the product of their lengths.

        The value range of the angle cosine is [-1,1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle. When the directions of the two vectors coincide, the cosine takes its maximum value 1; when the directions are completely opposite, the cosine takes its minimum value -1. The cosine distance is defined as 1 minus the cosine of the angle (the cosine similarity).

import numpy as np
from scipy.spatial import distance

A = np.array([7,8,9])
B = np.array([4,5,6])

# Method 1: compute the cosine similarity directly from the formula
dist1 = np.sum(A*B)/(np.sqrt(np.sum(A**2))*np.sqrt(np.sum(B**2)))

# Method 2: scipy's cosine() returns the cosine distance (1 - similarity), so subtract it from 1 to recover the similarity
dist2 = 1-distance.cosine(A,B)

5. Hamming distance

 The Hamming distance between two equal-length strings s1 and s2 is the minimum number of character substitutions required to change one into the other.

       Hamming weight: the Hamming distance of a string relative to the zero string of the same length, that is, the number of non-zero elements in the string. For a binary string this is the number of 1s, so the Hamming weight of 11101 is 4. Consequently, the Hamming distance between two elements a and b of a vector space is equal to the Hamming weight of their difference a-b.

Applications: Hamming weight analysis is used in fields including information theory, coding theory and cryptography. For example, when encoding information, the minimum Hamming distance between codewords should be made as large as possible in order to improve fault tolerance. If two strings of different lengths need to be compared, insertions and deletions are required in addition to substitutions; in that case a more complex measure, the edit distance (for example the Levenshtein distance), is usually used.

import numpy as np
from scipy.spatial import distance

A = np.array([1,2,3])
B = np.array([4,5,6])

# scipy returns the proportion of positions that differ (here 3/3 = 1.0); multiply by len(A) to get the count
dist1 = distance.hamming(A,B)
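
Since the definition above is stated for equal-length strings, here is a minimal sketch of the string case (the helper name hamming_str and the example strings are just for illustration):

def hamming_str(s1, s2):
    # Number of positions at which two equal-length strings differ
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_str("11101", "00000"))      # 4, the Hamming weight of 11101
print(hamming_str("karolin", "kathrin"))  # 3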

6. Jaccard distance 

       Jaccard similarity coefficient: the proportion of the intersection elements of two sets A and B within the union of A and B is called the Jaccard similarity coefficient of the two sets, denoted J(A,B) = |A ∩ B| / |A ∪ B|.

       Jaccard distance: the opposite of the Jaccard similarity coefficient, it uses the proportion of elements that appear in only one of the two sets among all elements to measure the dissimilarity of the two sets: d_J(A,B) = 1 - J(A,B).

import numpy as np
from scipy.spatial.distance import pdist

# Example binary vectors (each position marks the presence or absence of an attribute)
vec1 = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1])
vec2 = np.array([0, 1, 1, 0, 0, 0, 1, 1, 1])

# Method 1: from the formula -- positions where the vectors differ, among positions where at least one is non-zero
up = np.double(np.bitwise_and((vec1!=vec2),np.bitwise_or(vec1!=0,vec2!=0)).sum())
down = np.double(np.bitwise_or(vec1!=0,vec2!=0).sum())
dist1=(up/down)
print("Jaccard distance (formula): "+str(dist1))

# Method 2: scipy
Vec = np.vstack([vec1,vec2])
dist2 = pdist(Vec,'jaccard')
print("Jaccard distance (scipy): "+str(dist2))

 7. Mahalanobis distance

Mahalanobis distance is a distance based on the distribution of the samples.

Mahalanobis distance was proposed by the Indian statistician Mahalanobis and represents the covariance distance of the data. It is an effective method for computing the similarity of two unknown sample sets. Unlike the Euclidean distance, it takes the correlations between the various features into account and is independent of the measurement scale. For two points x and y drawn from a distribution with covariance matrix Σ, it is defined as d(x,y) = sqrt( (x - y)^T Σ^(-1) (x - y) ).

       Mahalanobis distance can also be defined as the degree of difference between two random variables that follow the same distribution with covariance matrix Σ. If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance; if the covariance matrix is diagonal, it becomes the standardized Euclidean distance.

When computing the Mahalanobis distance, the number of samples must be greater than the dimension of the samples; otherwise the inverse of the sample covariance matrix does not exist, and in that case the Euclidean distance can simply be used instead.

import numpy as np
from scipy.spatial.distance import pdist
a=np.random.random(10)
b=np.random.random(10)
# Mahalanobis distance requires more samples than dimensions, otherwise the covariance matrix cannot be inverted
X=np.vstack([a,b])
XT=X.T  # transpose so the data becomes 10 samples, each 2-dimensional

pdist(XT,'mahalanobis')
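
A minimal sketch of the underlying formula d(x,y) = sqrt((x - y)^T Σ^(-1) (x - y)), computed by hand for one pair of two-dimensional samples; it assumes scipy estimates the covariance matrix from the stacked samples with np.cov, which appears to be its default behaviour:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
XT = rng.random((10, 2))             # 10 samples, each 2-dimensional

# Inverse of the sample covariance matrix (np.cov expects variables in rows, hence the transpose)
VI = np.linalg.inv(np.cov(XT.T))

# d(x, y) = sqrt((x - y)^T Σ^(-1) (x - y)) for the first two samples
delta = XT[0] - XT[1]
d_manual = np.sqrt(delta @ VI @ delta)

# scipy returns all pairwise distances; the first entry is the pair (0, 1)
d_scipy = pdist(XT, 'mahalanobis')[0]
print(d_manual, d_scipy)             # the two values should agree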

8. Chebyshev distance 

        In mathematics, the Chebyshev distance (or L∞ metric) is a metric on a vector space in which the distance between two points is defined as the maximum of the absolute values of the differences of their coordinates: d(a,b) = max_k |x1k - x2k|. From a mathematical point of view, the Chebyshev distance is the metric induced by the uniform norm (also called the supremum norm), and it is a type of hyperconvex metric (injective metric space).

import numpy as np
from scipy.spatial.distance import pdist

vec1 = np.array([1, 2, 3])   # example vectors
vec2 = np.array([4, 7, 5])

# scipy
Vec = np.vstack([vec1,vec2])
dist2 = pdist(Vec,'chebyshev')
print('Chebyshev distance: ' + str(dist2))
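
The same value can be obtained directly from the definition with NumPy (using the same made-up example vectors):

import numpy as np

vec1 = np.array([1, 2, 3])
vec2 = np.array([4, 7, 5])

# Method 1: maximum absolute coordinate difference
dist1 = np.max(np.abs(vec1 - vec2))             # max(3, 5, 2) = 5

# Method 2: infinity norm of the difference vector
dist2 = np.linalg.norm(vec1 - vec2, ord=np.inf)

print(dist1, dist2)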

9. Minkowski distance 

       Minkowski distance, also known as Min's distance, is not a single distance but a family of distances: it summarizes several distance formulas (Manhattan distance, Euclidean distance, Chebyshev distance) in one formula.

The Minkowski distance between two n-dimensional variables a(x11,x12,…,x1n) and b(x21,x22,…,x2n) is defined as d(a,b) = ( |x11-x21|^p + |x12-x22|^p + … + |x1n-x2n|^p )^(1/p), where p is a variable parameter: p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance, and p → ∞ gives the Chebyshev distance.

import numpy as np
from scipy.spatial.distance import pdist

vec1 = np.array([1, 2, 3])   # example vectors
vec2 = np.array([4, 7, 5])

# scipy: the Minkowski metric with p=1 reduces to the Manhattan distance
Vec = np.vstack([vec1,vec2])
dist2 = pdist(Vec,'minkowski',p=1)
print('With p=1 this is the Manhattan distance: ' + str(dist2))

# From the formula with p=1: the sum of absolute coordinate differences
dist3 = np.sum(np.abs(vec1-vec2))
print('With p=1 this is the Manhattan distance: ' + str(dist3))
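
To see how the single formula covers the other cases, here is a short sketch that varies p on the same made-up vectors; p=2 gives the Euclidean distance and a large p approaches the Chebyshev distance:

import numpy as np
from scipy.spatial.distance import pdist

Vec = np.vstack([[1, 2, 3], [4, 7, 5]])

for p in (1, 2, 100):
    d = pdist(Vec, 'minkowski', p=p)[0]
    print('p=' + str(p) + ': ' + str(d))
# p=1   -> 10.0    (Manhattan distance)
# p=2   ->  6.164  (Euclidean distance, sqrt(9 + 25 + 4))
# p=100 ->  5.000  (approaches the Chebyshev distance, max |difference| = 5)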

10. KL divergence 

KL divergence (Kullback-Leibler divergence), also known as the KL distance or relative entropy, measures the difference between two probability distributions P(x) and Q(x): D(P||Q) = Σ P(x) log( P(x) / Q(x) ). The more similar P(x) and Q(x) are, the smaller the KL divergence.

KL divergence has two main properties:

(1) Asymmetry

Although the KL divergence intuitively looks like a metric or distance function, it is not a true metric or distance because it is not symmetric, that is, D(P||Q) != D(Q||P).

(2) Non-negativity

The value of the relative entropy is non-negative, that is, D(P||Q) >= 0.

import numpy as np
import scipy.stats

# Two example discrete distributions (non-negative, summing to 1)
px = np.array([0.1, 0.2, 0.3, 0.4])
py = np.array([0.25, 0.25, 0.25, 0.25])

# scipy: entropy(p, q) returns the KL divergence D(p || q)
KL = scipy.stats.entropy(px, py)
print(KL)

# Directly from the formula D(p || q) = sum(p * log(p / q))
KL = 0.0
for i in range(len(px)):
    KL += px[i] * np.log(px[i] / py[i])
print(KL)

Origin blog.csdn.net/weixin_43961909/article/details/132388832