Analysis of various distances

In classification tasks it is often necessary to estimate a similarity measure (Similarity Measurement) between samples, and a common approach is to compute the "distance" (Distance) between them. The choice of distance calculation matters a great deal, and can even determine whether samples are classified correctly.

  The purpose of this article is to summarize the commonly used similarity measures.

 

Contents of this article:

1. Euclidean distance

2. Manhattan distance

3. Chebyshev distance

4. Minkowski distance

5. Standardized Euclidean distance

6. Mahalanobis distance

7. Cosine of the angle

8. Hamming distance

9. Jaccard similarity coefficient & Jaccard distance

10. Correlation coefficient & correlation distance

11. Information Entropy

 

1. Euclidean distance (Euclidean Distance)

       Euclidean distance is the easiest distance calculation to understand: it is the straight-line distance between two points in Euclidean space, computed from the familiar formula.

(1) Euclidean distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

(2) Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space:

d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}

(3) Euclidean distance between two n-dimensional vectors a(x11, x12, …, x1n) and b(x21, x22, …, x2n):

d_{12} = \sqrt{\sum_{k=1}^{n}(x_{1k} - x_{2k})^2}

  It can also be expressed in vector form:

d_{12} = \sqrt{(a - b)(a - b)^T}

(4) Matlab computing the Euclidean distance

Matlab computes distances mainly with the pdist function. If X is an M × N matrix, pdist(X) treats each of the M rows of X as an N-dimensional vector and computes the distance between every pair of those M vectors.

Example: compute the pairwise Euclidean distances between the vectors (0,0), (1,0), and (0,2)

X= [0 0 ; 1 0 ; 0 2]

D= pdist(X,'euclidean')

result:

D=

    1.0000   2.0000    2.2361
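
The same numbers can be checked against the formula directly; a minimal sketch (the variable names are illustrative, not part of the original example):

a = [0 0]; b = [1 0]; c = [0 2];
d_ab = sqrt(sum((a - b).^2))   % 1.0000
d_ac = sqrt(sum((a - c).^2))   % 2.0000
d_bc = sqrt(sum((b - c).^2))   % 2.2361, i.e. sqrt(5)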

 

 

2. Manhattan distance (Manhattan Distance)

       The name hints at how this distance is computed. Imagine driving from one intersection to another in Manhattan. Is the driving distance the straight-line distance between the two points? Obviously not, unless you can drive through buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from; it is also known as the city block distance (City Block Distance).

(1) Manhattan distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

d_{12} = |x_1 - x_2| + |y_1 - y_2|

(2) Manhattan distance between two n-dimensional vectors a(x11, x12, …, x1n) and b(x21, x22, …, x2n):

d_{12} = \sum_{k=1}^{n}|x_{1k} - x_{2k}|

(3) Matlab computing the Manhattan distance

Example: compute the pairwise Manhattan distances between the vectors (0,0), (1,0), and (0,2)

X= [0 0 ; 1 0 ; 0 2]

D= pdist(X, 'cityblock')

result:

D=

     1    2     3
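
As in the Euclidean case, the result can be verified directly against the formula; a minimal sketch with illustrative variable names:

a = [0 0]; b = [1 0]; c = [0 2];
d_ab = sum(abs(a - b))   % 1
d_ac = sum(abs(a - c))   % 2
d_bc = sum(abs(b - c))   % 3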

 

3. Chebyshev distance (Chebyshev Distance)

       Have you played chess? A king can move one step to any of the eight adjacent squares. What is the minimum number of steps for the king to get from square (x1, y1) to square (x2, y2)? Try it yourself: you will find the minimum number of steps is always max(|x2 - x1|, |y2 - y1|). A similar distance measure is called the Chebyshev distance.

(1) Chebyshev distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane:

d_{12} = \max(|x_1 - x_2|, |y_1 - y_2|)

(2) Chebyshev distance between two n-dimensional vectors a(x11, x12, …, x1n) and b(x21, x22, …, x2n):

d_{12} = \max_{k}|x_{1k} - x_{2k}|

  An equivalent form of this formula is

d_{12} = \lim_{p \to \infty}\left(\sum_{k=1}^{n}|x_{1k} - x_{2k}|^p\right)^{1/p}

       Can't see why the two formulas are equivalent? A reminder: try proving it with inequality bounding and the squeeze theorem; a sketch is given below.
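
Writing m = \max_k |x_{1k} - x_{2k}|, the bounding step the reminder points to looks like this:

m = (m^p)^{1/p} \le \left(\sum_{k=1}^{n}|x_{1k} - x_{2k}|^p\right)^{1/p} \le (n\,m^p)^{1/p} = n^{1/p}\,m

Since n^{1/p} → 1 as p → ∞, both bounds converge to m, and the squeeze theorem gives the equivalence.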

(3) Matlab computing the Chebyshev distance

Example: compute the pairwise Chebyshev distances between the vectors (0,0), (1,0), and (0,2)

X= [0 0 ; 1 0 ; 0 2]

D= pdist(X, 'chebychev')

result:

D=

     1    2     2

 

 

4. Minkowski distance (Minkowski Distance)

The Minkowski distance is not a single distance but a family of distance definitions.

(1) Definition of the Minkowski distance

       The Minkowski distance between two n-dimensional variables a(x11, x12, …, x1n) and b(x21, x22, …, x2n) is defined as:

d_{12} = \left(\sum_{k=1}^{n}|x_{1k} - x_{2k}|^p\right)^{1/p}

where p is a variable parameter.

When p = 1, it is the Manhattan distance.

When p = 2, it is the Euclidean distance.

When p → ∞, it is the Chebyshev distance.

       Depending on the parameter p, the Minkowski distance can therefore represent a whole class of distances.

(2) Disadvantages of the Minkowski distance

  The Minkowski distances, including the Manhattan, Euclidean, and Chebyshev distances, have significant drawbacks.

  For example, consider two-dimensional samples (height, weight), where height ranges over 150 to 190 cm and weight over 50 to 60 kg, and three samples: a(180, 50), b(190, 50), c(180, 60). The Minkowski distance between a and b (whether Manhattan, Euclidean, or Chebyshev) equals the Minkowski distance between a and c. But are 10 cm of height really equivalent to 10 kg of weight? Using the Minkowski distance to measure similarity between such samples is therefore very problematic.

       In short, the Minkowski distance has two main shortcomings: (1) it treats the scales, i.e. the "units", of the different components as identical; (2) it ignores that the distributions (expectation, variance, and so on) of the components may differ. A numeric illustration of the height/weight example follows.
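
This Matlab sketch reproduces the height/weight example above (the sample values come from the text):

X = [180 50 ; 190 50 ; 180 60]      % samples a, b, c as rows
D = pdist(X, 'euclidean')           % 10.0000  10.0000  14.1421

so d(a, b) = d(a, c) = 10: a difference of 10 cm in height counts exactly as much as a difference of 10 kg in weight.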

(3) Matlab computing the Minkowski distance

Example: compute the pairwise Minkowski distances between the vectors (0,0), (1,0), and (0,2) (using the Euclidean case, with parameter p = 2)

X= [0 0 ; 1 0 ; 0 2]

D= pdist(X,'minkowski',2)

result:

D=

    1.0000   2.0000    2.2361

 

 

5. Standardized Euclidean distance (Standardized Euclidean Distance)

(1) Definition of the standardized Euclidean distance

  The standardized Euclidean distance is an improvement that addresses the shortcomings of the plain Euclidean distance. The idea: since the distributions of the components differ, first "standardize" each component so that all components have equal means and variances. Standardize to what mean and variance? A quick review of statistics: assume the sample set X has mean m and standard deviation s; then the "standardized variable" X* of X is expressed as:

X^* = \frac{X - m}{s}

  The standardized variable has mathematical expectation 0 and variance 1. The standardization of a sample set is therefore described by the formula:

  standardized value = (value before standardization - component mean) / component standard deviation

  After a simple derivation, we obtain the formula for the standardized Euclidean distance between two n-dimensional vectors a(x11, x12, …, x1n) and b(x21, x22, …, x2n):

d_{12} = \sqrt{\sum_{k=1}^{n}\left(\frac{x_{1k} - x_{2k}}{s_k}\right)^2}

where s_k is the standard deviation of the k-th component.

  If the reciprocal of the variance is regarded as a weight, this formula can be seen as a weighted Euclidean distance (Weighted Euclidean Distance).

(2) Matlab computing the standardized Euclidean distance

Example: compute the pairwise standardized Euclidean distances between the vectors (0,0), (1,0), and (0,2) (assuming the standard deviations of the two components are 0.5 and 1, respectively)

X= [0 0 ; 1 0 ; 0 2]

D= pdist(X, 'seuclidean',[0.5,1])

result:

D=

    2.0000   2.0000    2.8284
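
The result can be reproduced from the formula by dividing each component difference by its standard deviation; a minimal sketch with illustrative variable names:

s = [0.5 1];                          % component standard deviations
a = [0 0]; b = [1 0]; c = [0 2];
d_ab = sqrt(sum(((a - b) ./ s).^2))   % 2.0000
d_ac = sqrt(sum(((a - c) ./ s).^2))   % 2.0000
d_bc = sqrt(sum(((b - c) ./ s).^2))   % 2.8284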

 

 

6. Mahalanobis distance (Mahalanobis Distance)

(1) Definition of the Mahalanobis distance

       Given M sample vectors X1, …, Xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector X to μ is expressed as:

D(X) = \sqrt{(X - \mu)^T S^{-1} (X - \mu)}

       The Mahalanobis distance between two vectors Xi and Xj is defined as:

D(X_i, X_j) = \sqrt{(X_i - X_j)^T S^{-1} (X_i - X_j)}

       If the covariance matrix is the identity matrix (the sample vectors are independent and identically distributed), the formula reduces to:

D(X_i, X_j) = \sqrt{(X_i - X_j)^T (X_i - X_j)}

       which is exactly the Euclidean distance.

  If the covariance matrix is a diagonal matrix, the formula becomes the standardized Euclidean distance.

(2) Advantages and disadvantages of the Mahalanobis distance: it is independent of the measurement scales of the components and eliminates the interference of correlations between variables.

(3) Matlab computing the pairwise Mahalanobis distances between (1, 2), (1, 3), (2, 2), and (3, 1)

X = [1 2; 1 3; 2 2; 3 1]

Y = pdist(X,'mahalanobis')

 

result:

Y =

    2.3452   2.0000    2.3452    1.2247   2.4495    1.2247
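
A sanity check of the first value: pdist estimates the covariance matrix from the rows of X itself, so the same number can be reproduced manually (a minimal sketch; the variable names are illustrative):

S = cov(X);                          % sample covariance of the rows of X
diff12 = X(1,:) - X(2,:);
d12 = sqrt(diff12 / S * diff12')     % 2.3452, matches Y(1)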

 

 

7. The cosine of the angle (Cosine)

       Wait, what? This is not a geometry lesson, so how did the cosine of an angle get involved? Bear with me. In geometry, the cosine of the angle between two vectors measures the difference in their directions; machine learning borrows this concept to measure the difference between sample vectors.

(1) Cosine of the angle between vector A(x1, y1) and vector B(x2, y2) in two-dimensional space:

\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}

(2) Cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n)

       Similarly, for two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n), a concept analogous to the angle cosine can be used to measure how similar they are.

  That is:

\cos\theta = \frac{a \cdot b}{|a|\,|b|} = \frac{\sum_{k=1}^{n} x_{1k} x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^2}\,\sqrt{\sum_{k=1}^{n} x_{2k}^2}}

       The angle cosine ranges over [-1, 1]. A larger cosine means a smaller angle between the two vectors; a smaller cosine means a larger angle. When the two vectors point in the same direction, the cosine attains its maximum value 1; when they point in exactly opposite directions, it attains its minimum value -1.

       For a concrete application of the angle cosine, see reference [1].

(3) Matlab computing the angle cosine

Example: compute the pairwise angle cosines of (1,0), (1,1.732), and (-1,0)

X= [1 0 ; 1 1.732 ; -1 0]

D= 1- pdist(X, 'cosine')  % pdist(X,'cosine') in Matlab returns 1 minus the angle cosine

result:

D=

    0.5000  -1.0000   -0.5000
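
The first value can be verified with the dot-product formula; a minimal sketch:

a = [1 0]; b = [1 1.732];
cos_ab = dot(a, b) / (norm(a) * norm(b))   % 0.5000, i.e. a 60-degree angle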

 

 

8. Hamming distance (Hamming Distance)

(1) Definition of the Hamming distance

       The Hamming distance between two equal-length strings s1 and s2 is the minimum number of substitutions required to change one string into the other. For example, the Hamming distance between the strings "1111" and "1001" is 2.

       Application: information coding (to improve error tolerance, the minimum Hamming distance between codewords should be made as large as possible). A quick check of the string definition follows.
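
For strings, the substitution count in the definition can be computed with an element-wise comparison; a minimal sketch (the variable names are illustrative):

s1 = '1111'; s2 = '1001';
d = sum(s1 ~= s2)   % 2, the number of differing positions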

(2) Matlab computing the Hamming distance

  Matlab defines the Hamming distance between two vectors as the percentage of components in which the two vectors differ.

       Example: compute the pairwise Hamming distances between the vectors (0,0), (1,0), and (0,2)

X = [0 0 ; 1 0 ; 0 2];

D = pdist(X, 'hamming')

result:

D=

    0.5000   0.5000    1.0000

 

 

9. Jaccard similarity coefficient (Jaccard Similarity Coefficient)

(1) Jaccard similarity coefficient

       The proportion of the elements in the intersection of two sets A and B among the elements of their union is called the Jaccard similarity coefficient of the two sets, denoted J(A, B):

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

  The Jaccard similarity coefficient is one indicator of how similar two sets are.

(2) Jaccard distance

       The opposite concept to the Jaccard similarity coefficient is the Jaccard distance (Jaccard Distance), which can be expressed as:

J_\delta(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}

  The Jaccard distance measures how distinguishable two sets are by the proportion of non-shared elements among all elements.

(3) Applications of the Jaccard similarity coefficient and Jaccard distance

       The Jaccard similarity coefficient can be used to measure the similarity of samples.

  Let samples A and B be two n-dimensional vectors in which every component takes the value 0 or 1, for example A(0111) and B(1011). We treat each sample as a set: 1 means the set contains the corresponding element, 0 means it does not. Define:

p: the number of dimensions where both A and B are 1

q: the number of dimensions where A is 1 and B is 0

r: the number of dimensions where A is 0 and B is 1

s: the number of dimensions where both A and B are 0

 

Then the Jaccard similarity coefficient of samples A and B can be expressed as:

J = \frac{p}{p + q + r}

Here p + q + r can be understood as the number of elements in the union of A and B, and p as the number of elements in their intersection.

The Jaccard distance between samples A and B is then:

J' = \frac{q + r}{p + q + r}
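
For the example A(0111) and B(1011), p, q, r and the two measures can be computed as follows (a minimal sketch; the variable names follow the text):

A = [0 1 1 1]; B = [1 0 1 1];
p = sum(A == 1 & B == 1);     % 2
q = sum(A == 1 & B == 0);     % 1
r = sum(A == 0 & B == 1);     % 1
J  = p / (p + q + r)          % 0.5, the Jaccard similarity coefficient
Jd = (q + r) / (p + q + r)    % 0.5, the Jaccard distance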

(4) Matlab computing the Jaccard distance

The Jaccard distance defined by Matlab's pdist function differs somewhat from the definition given here: Matlab defines it as the proportion of differing dimensions among the "non-all-zero dimensions".

Example: compute the pairwise Jaccard distances between (1,1,0), (1,-1,0), and (-1,1,0)

X= [1 1 0; 1 -1 0; -1 1 0]

D= pdist( X , 'jaccard')

result:

D=

0.5000    0.5000   1.0000

 

 

10. Correlation coefficient (Correlation Coefficient) and correlation distance (Correlation Distance)

(1) Definition of the correlation coefficient

The correlation coefficient measures the degree of correlation between random variables X and Y; its value ranges over [-1, 1]. The larger the absolute value of the correlation coefficient, the more strongly X and Y are correlated. When X and Y are linearly related, the correlation coefficient takes the value 1 (positive linear correlation) or -1 (negative linear correlation). It is defined as:

\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}} = \frac{E\big((X - EX)(Y - EY)\big)}{\sqrt{D(X)}\,\sqrt{D(Y)}}

(2) Definition of the correlation distance

D_{XY} = 1 - \rho_{XY}

(3) Matlab computing the correlation coefficient and the correlation distance between (1, 2, 3, 4) and (3, 8, 7, 6)

X = [1 2 3 4 ; 3 8 7 6]

C = corrcoef( X' )   % returns the correlation coefficient matrix

D = pdist( X , 'correlation')

result:

C=

    1.0000   0.4781

    0.4781   1.0000

D=

0.5219

      Here 0.4781 is the correlation coefficient and 0.5219 is the correlation distance.
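
The same two numbers can be reproduced from the definition of the correlation coefficient; a minimal sketch:

x = [1 2 3 4]; y = [3 8 7 6];
dx = x - mean(x); dy = y - mean(y);
r = sum(dx .* dy) / (norm(dx) * norm(dy))   % 0.4781, the correlation coefficient
d = 1 - r                                    % 0.5219, the correlation distance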

 

11. Information entropy (Information Entropy)

       Information entropy is not actually a similarity measure. Then why is it in this article? Well... I don't know either. (╯▽╰)

Information entropy measures how disordered or spread out a distribution is. The more spread out (or the more uniform) the distribution, the larger the information entropy; the more ordered (or the more concentrated) the distribution, the smaller the information entropy.

       The formula for the information entropy of a given sample set X is:

\mathrm{Entropy}(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

Meaning of the parameters:

n: the number of classes in the sample set X

pi: the probability that an element of X belongs to the i-th class

       A larger information entropy means the classification of the sample set X is more spread out, while a smaller entropy means it is more concentrated. When the n classes in X occur with equal probability (each 1/n), the entropy attains its maximum value log2(n). When X contains only one class, the entropy attains its minimum value 0.
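
A minimal Matlab sketch of the two extreme cases (the probability vectors are illustrative):

p1 = [0.25 0.25 0.25 0.25];      % n = 4 classes, equally likely
H1 = -sum(p1 .* log2(p1))        % 2, i.e. log2(4): the maximum
p2 = 1;                          % only one class
H2 = -sum(p2 .* log2(p2))        % 0: the minimum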

Origin: www.cnblogs.com/navysummer/p/11293788.html