Distance measures and their implementation in Python (1)

Reposted from: https://www.cnblogs.com/denny402/p/7027954.html

1. Euclidean distance (Euclidean Distance)
       Euclidean distance is the most intuitive distance measure; it comes straight from the formula for the distance between two points in Euclidean space.
(1) Euclidean distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

(2) Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$

(3) Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

$$d = \sqrt{\sum_{k=1}^{n}(x_{1k} - x_{2k})^2}$$

(4) It can also be expressed in vector form:

$$d = \sqrt{(a - b)(a - b)^T}$$
Implementation in Python:

import numpy as np

x = np.random.random(10)
y = np.random.random(10)

# Method 1: solve directly from the formula
d1 = np.sqrt(np.sum(np.square(x - y)))

# Method 2: solve with the scipy library
from scipy.spatial.distance import pdist
X = np.vstack([x, y])
d2 = pdist(X)

2. Manhattan distance (Manhattan Distance)
       You can guess how this distance is calculated from its name. Imagine driving from one intersection to another in Manhattan: is the driving distance the straight-line distance between the two points? Obviously not, unless you can cut through the buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from. The Manhattan distance is also known as the city block distance (City Block Distance).
(1) Manhattan distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

$$d = |x_1 - x_2| + |y_1 - y_2|$$

(2) Manhattan distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

$$d = \sum_{k=1}^{n}|x_{1k} - x_{2k}|$$
Implementation in Python:

import numpy as np

x = np.random.random(10)
y = np.random.random(10)

# Method 1: solve directly from the formula
d1 = np.sum(np.abs(x - y))

# Method 2: solve with the scipy library
from scipy.spatial.distance import pdist
X = np.vstack([x, y])
d2 = pdist(X, 'cityblock')

3. Chebyshev distance (Chebyshev Distance)
       Ever played chess? The king can move one step to any of the eight adjacent squares. How many steps, at minimum, does the king need to go from square (x1, y1) to square (x2, y2)? Try walking it yourself: you will find the minimum number of steps is always max(|x2 - x1|, |y2 - y1|). A distance measure of this form is called the Chebyshev distance.
(1) Chebyshev distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

$$d = \max(|x_1 - x_2|, |y_1 - y_2|)$$

(2) Chebyshev distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

$$d = \max_{k}(|x_{1k} - x_{2k}|)$$

  Another equivalent form of this formula is

$$d = \lim_{p \to \infty} \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p}$$

       Can't see why the two formulas are equivalent? Hint: try proving it with inequalities and the squeeze theorem.
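A quick numerical check of the equivalence (a sketch with an arbitrary difference vector):

import numpy as np

v = np.array([1.0, 3.0, 2.0])   # an arbitrary vector of component-wise differences |x1k - x2k|
for p in [1, 2, 10, 100]:
    # the p-norm of v approaches max(|v_k|) as p grows
    print(p, np.sum(np.abs(v) ** p) ** (1.0 / p))
print('max:', np.max(np.abs(v)))   # 3.0, the Chebyshev distance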

Implementation in Python:

import numpy as np

x = np.random.random(10)
y = np.random.random(10)

# Method 1: solve directly from the formula
d1 = np.max(np.abs(x - y))

# Method 2: solve with the scipy library
from scipy.spatial.distance import pdist
X = np.vstack([x, y])
d2 = pdist(X, 'chebyshev')

4. Minkowski distance (Minkowski Distance)
       The Minkowski distance is not a single distance but rather a family of distances.
(1) Definition of the Minkowski distance
       The Minkowski distance between two n-dimensional variables a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n) is defined as:

$$d = \sqrt[p]{\sum_{k=1}^{n}|x_{1k} - x_{2k}|^p}$$

It can also be written as

$$d = \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p}$$

where p is a variable parameter.
When p = 1, it is the Manhattan distance.
When p = 2, it is the Euclidean distance.
When p → ∞, it is the Chebyshev distance.
       Depending on the parameter p, the Minkowski distance can therefore represent a whole class of distances; a quick sketch of these special cases follows.
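A small sketch (using random test vectors) confirming the special cases of p:

import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)
X = np.vstack([x, y])

# p = 1 gives the Manhattan distance
print(pdist(X, 'minkowski', p=1), pdist(X, 'cityblock'))
# p = 2 gives the Euclidean distance
print(pdist(X, 'minkowski', p=2), pdist(X, 'euclidean'))
# a large p approximates the Chebyshev distance
print(pdist(X, 'minkowski', p=100), pdist(X, 'chebyshev'))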
(2) Drawbacks of the Minkowski distance
  The Minkowski distance, including the Manhattan, Euclidean, and Chebyshev distances as special cases, has significant drawbacks.
  For example: suppose each sample has two dimensions (height, weight), where height ranges over 150 to 190 and weight ranges over 50 to 60, and there are three samples: a(180, 50), b(190, 50), c(180, 60). Then the Minkowski distance between a and b (whether the Manhattan, Euclidean, or Chebyshev distance) equals the Minkowski distance between a and c. But are 10 cm of height really equivalent to 10 kg of weight? Using the Minkowski distance to measure the similarity between such samples is therefore very problematic.
       In short, the drawbacks of the Minkowski distance are mainly two: (1) it treats the scales (i.e., "units") of the different components as the same; (2) it does not take into account that the distributions (expectation, variance, etc.) of the components may differ.
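A concrete check of the height/weight example above:

import numpy as np

a = np.array([180, 50])
b = np.array([190, 50])
c = np.array([180, 60])

# Euclidean distances: d(a, b) == d(a, c) == 10.0,
# even though 10 cm of height and 10 kg of weight are not comparable
print(np.sqrt(np.sum((a - b) ** 2)))
print(np.sqrt(np.sum((a - c) ** 2)))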

Implementation in Python:

import numpy as np

x = np.random.random(10)
y = np.random.random(10)

# Method 1: solve directly from the formula, with p = 2
d1 = np.sqrt(np.sum(np.square(x - y)))

# Method 2: solve with the scipy library
from scipy.spatial.distance import pdist
X = np.vstack([x, y])
d2 = pdist(X, 'minkowski', p=2)

5. Standardized Euclidean distance (Standardized Euclidean Distance)
(1) Definition of the standardized Euclidean distance
  The standardized Euclidean distance is an improvement developed to address the drawbacks of the plain Euclidean distance. Its idea: since the distributions of the components differ from dimension to dimension, first "standardize" each component so that all components have equal mean and variance. To what mean and variance should they be standardized? Recall a bit of statistics: assuming the mean of sample set X is m and its standard deviation is s, the "standardized variable" of X is expressed as:

$$X^* = \frac{X - m}{s}$$

  standardized value = (value before standardization - mean of the component) / standard deviation of the component
  A simple derivation then gives the formula for the standardized Euclidean distance between two n-dimensional vectors a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n):

$$d = \sqrt{\sum_{k=1}^{n}\left(\frac{x_{1k} - x_{2k}}{s_k}\right)^2}$$

  If the reciprocal of the variance is regarded as a weight, this formula can be seen as a weighted Euclidean distance (Weighted Euclidean Distance).

Implementation in Python:

import numpy as np

x = np.random.random(10)
y = np.random.random(10)

X = np.vstack([x, y])

# Method 1: solve directly from the formula
sk = np.var(X, axis=0, ddof=1)   # per-component sample variance
d1 = np.sqrt(((x - y) ** 2 / sk).sum())

# Method 2: solve with the scipy library
from scipy.spatial.distance import pdist
d2 = pdist(X, 'seuclidean')
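To make the weighting explicit, scipy's 'seuclidean' metric also accepts the variance vector through its V argument; passing the variances computed as above reproduces the default result (a usage sketch):

import numpy as np
from scipy.spatial.distance import pdist

x = np.random.random(10)
y = np.random.random(10)
X = np.vstack([x, y])

sk = np.var(X, axis=0, ddof=1)     # per-component sample variance
# the weights are the reciprocals 1/sk of the variances
d3 = pdist(X, 'seuclidean', V=sk)
print(d3, pdist(X, 'seuclidean'))  # identical values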

6. Mahalanobis distance (Mahalanobis Distance)
(1) Definition of the Mahalanobis distance
       Given M sample vectors X1~Xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector X to μ is expressed as:

$$D(X) = \sqrt{(X - \mu)^T S^{-1} (X - \mu)}$$

       The Mahalanobis distance between two of the vectors, Xi and Xj, is defined as:

$$D(X_i, X_j) = \sqrt{(X_i - X_j)^T S^{-1} (X_i - X_j)}$$

       If the covariance matrix is the identity matrix (the sample vectors are independent and identically distributed), the formula becomes:

$$D(X_i, X_j) = \sqrt{(X_i - X_j)^T (X_i - X_j)}$$

       which is simply the Euclidean distance.
  If the covariance matrix is diagonal, the formula becomes the standardized Euclidean distance. A quick check of the identity case follows.
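A minimal sketch of the identity-covariance special case, plugging S = I into the formula directly:

import numpy as np

xi = np.random.random(2)
xj = np.random.random(2)
delta = xi - xj

SI = np.eye(2)   # the inverse of the identity matrix is the identity itself
d_maha = np.sqrt(delta @ SI @ delta)   # Mahalanobis formula with S = I
d_eucl = np.sqrt(np.sum(delta ** 2))   # plain Euclidean distance
print(d_maha, d_eucl)                  # identical values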
Implementation in Python:

import numpy as np

x = np.random.random(10)
y = np.random.random(10)

# The Mahalanobis distance requires more samples than dimensions;
# otherwise the covariance matrix cannot be inverted.
# Transpose here so that there are 10 samples, each 2-dimensional.
X = np.vstack([x, y])
XT = X.T

# Method 1: solve directly from the formula
S = np.cov(X)           # covariance matrix between the two dimensions
SI = np.linalg.inv(S)   # inverse of the covariance matrix
# The Mahalanobis distance is computed between pairs of samples;
# with 10 samples there are 45 pairwise distances.
n = XT.shape[0]
d1 = []
for i in range(0, n):
    for j in range(i + 1, n):
        delta = XT[i] - XT[j]
        d = np.sqrt(np.dot(np.dot(delta, SI), delta.T))
        d1.append(d)

# Method 2: solve with the scipy library
from scipy.spatial.distance import pdist
d2 = pdist(XT, 'mahalanobis')

Pros and cons of the Mahalanobis distance:

1) The Mahalanobis distance is computed with respect to the overall sample set, as can be seen from the covariance matrix in the explanation above. That is, if the same two samples are placed into two different populations, the Mahalanobis distance between them will generally come out different, unless the two populations happen to have the same covariance matrix.

2) Computing the Mahalanobis distance requires the total number of samples to be greater than the dimension of the samples; otherwise the inverse of the sample covariance matrix does not exist. In that case the Euclidean distance can be used instead.

3) In another case, the number of samples does exceed the dimension, yet the inverse of the covariance matrix still does not exist, for example the three sample points (3, 4), (5, 6) and (7, 8): all three lie on one straight line in the plane. In this case too, the Euclidean distance is used instead; see the sketch after this list.

4) In practice, the condition "the total number of samples is greater than the sample dimension" is easily satisfied, and the situation in 3), where all sample points are collinear, is rare, so in most cases the Mahalanobis distance can be computed without trouble. However, the Mahalanobis distance is unstable, and the source of the instability is the covariance matrix; this is also the biggest difference between the Mahalanobis distance and the Euclidean distance.

Advantages: it is not affected by dimensional scale; the Mahalanobis distance between two points is independent of the measurement units of the original data. The Mahalanobis distance computed from standardized data and from centered data (i.e., the raw data minus the mean) is the same as that computed from the raw data. The Mahalanobis distance can also eliminate interference from correlations between variables. Disadvantage: it exaggerates the role of variables that change only slightly.
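A sketch of case 3): the three collinear points above produce a singular covariance matrix, so its inverse does not exist:

import numpy as np

# three sample points that all lie on the line y = x + 1
XT = np.array([[3, 4], [5, 6], [7, 8]], dtype=float)

S = np.cov(XT.T)                    # 2x2 covariance matrix of the two dimensions
print(np.linalg.matrix_rank(S))     # 1: the matrix is singular
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError as e:
    print('covariance matrix is not invertible:', e)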

