Common distance measures and similarity measures

In pattern recognition, data mining, machine learning and other fields, distance measures and similarity measures are widely used. A solid understanding of these measures helps us handle and optimize the problems encountered in these fields.

Distance measures and similarity measures are basic algorithms that are often used inside more advanced ones. For example, K-Nearest Neighbors (KNN) and K-Means can use the Manhattan distance or the Euclidean distance as their measure.

This article introduces some common distance measurement algorithms and similarity measurement algorithms.

Parts of the algorithm definitions below are adapted from Baidu Encyclopedia and are not cited individually.

Common Distance Metrics

Manhattan distance

Algorithm description

Manhattan Distance, also known as taxi distance, is a geometric term used in geometric metric spaces: it denotes the sum of the absolute differences of the coordinates of two points in a standard coordinate system.

Manhattan distance can be understood as path distance.

A diagram makes this more intuitive.

[Figure: Manhattan distance vs. Euclidean distance between two points]

As shown in the figure, the red line represents the Manhattan distance, the green line represents the Euclidean distance (the straight-line distance between two points in space, which will be introduced below), and the blue and yellow lines represent the equivalent Manhattan distance.

It can be seen that in two-dimensional space, the Manhattan distance is the sum of the absolute differences of the horizontal and vertical coordinates of the two points, that is, $D(A,B) = |A_x - B_x| + |A_y - B_y|$.

In a town whose streets are laid out on a regular north-south and east-west grid, the shortest travel distance from one point to another is the north-south distance plus the east-west distance; this is why the Manhattan distance is also called taxi distance.

Manhattan distance calculation formula:
$$D(A,B) = \sum_{i=1}^{n} \left| A_i - B_i \right|$$
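
To make the formula concrete, here is a minimal Python sketch of the Manhattan distance; the function name `manhattan_distance` is our own placeholder, not from any particular library.

```python
def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a, b))


# 2D example: moving from (1, 1) to (4, 5) along grid streets
print(manhattan_distance((1, 1), (4, 5)))  # |1-4| + |1-5| = 7
```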

Application Scenario

Computer Graphics

In early computer graphics, the screen was made up of pixels, whose coordinates are integers. The Manhattan distance can be used to measure the distance between two pixels A and B.

Using the Euclidean distance between A and B requires floating-point calculations, which are slower, more expensive, and subject to rounding error. Using the two legs AC and CB (as in the figure above) only requires addition and subtraction, which is much faster, and no matter how many such calculations are accumulated, no rounding error is introduced.

Similarity

Manhattan distance can also be used to measure the similarity of two vectors: the smaller the distance, the more similar they are.

Euclidean distance

Algorithm description

Euclidean Distance, more formally called the Euclidean metric, is a commonly used definition of distance: it is the true (straight-line) distance between two points in n-dimensional space, or equivalently the natural length of a vector (the distance from the point to the origin). In 2D and 3D space, the Euclidean distance is the actual distance between two points.

In the above figure introducing the Manhattan distance, the green line represents the Euclidean distance between two points in two-dimensional space.

Euclidean distance calculation formula:
$$D(A,B) = \sqrt{(A_1-B_1)^2 + (A_2-B_2)^2 + \dots + (A_n-B_n)^2} = \sqrt{\sum_{i=1}^{n}(A_i-B_i)^2}$$
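
A matching Python sketch of the Euclidean distance (again, the function name is ours):

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Same two points as the Manhattan example above
print(euclidean_distance((1, 1), (4, 5)))  # sqrt(3^2 + 4^2) = 5.0
```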

Application Scenario

Widely used to measure the distance between two vectors in a vector space.

It is also often used as a user-similarity measure in collaborative filtering systems: each user's ratings of items are abstracted into a user vector, and the Euclidean distance between two user vectors represents the similarity between the users. The smaller the distance, the more similar the users.

Chebyshev distance

Algorithm description

Chebyshev distance is a metric on a vector space in which the distance between two points is defined as the maximum of the absolute differences of their coordinate values.

Chebyshev distance calculation formula:
$$D(A,B) = \max_{i}\left( \left| A_i - B_i \right| \right)$$
In chess, the minimum number of moves a king needs to travel from one square to another equals the Chebyshev distance between the two squares in a two-dimensional Cartesian coordinate system. For this reason, Chebyshev distance is also called chessboard distance.

As shown in the figure below, the king is on f6, and the number on each square is the number of king moves from f6 to that square, i.e. the distance.

[Figure: chessboard with the king on f6, each square labeled with its Chebyshev distance from f6]
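
A minimal Python sketch of the Chebyshev distance; the board coordinates in the example are just an illustration of the king-move interpretation.

```python
def chebyshev_distance(a, b):
    """Maximum absolute coordinate difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return max(abs(x - y) for x, y in zip(a, b))


# A king on f6 = (6, 6) reaches b3 = (2, 3) in max(|6-2|, |6-3|) = 4 moves
print(chebyshev_distance((6, 6), (2, 3)))  # 4
```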

Minkowski distance

Algorithm description

Minkowski distance is a measure of the distance between two points in Minkowski space.

Minkowski space refers to the four-dimensional space-time of special relativity, composed of one time dimension and three space dimensions, first formulated by the mathematician Hermann Minkowski (1864-1909). The geometry of this space, expressed through a special distance quantity, is consistent with the requirements of special relativity.

Therefore, the Minkowski distance sometimes also refers to the space-time interval. In the context of this article, given two points A and B in n-dimensional space and a constant p, the Minkowski distance is defined as:
$$D(A,B) = \left( \sum_{i=1}^{n} \left| A_i - B_i \right|^p \right)^{\frac{1}{p}}$$

Note:

(1) The Minkowski distance is sensitive to the scale (units) of the feature parameters; computing it over features measured in different units is often meaningless.

(2) The Minkowski distance does not take the correlation between feature parameters into account; the Mahalanobis distance, introduced below, addresses this problem.

Special cases

The Minkowski distance can be seen as a family of distances: different values of p yield other distances.

When p=1, the Manhattan distance is obtained;

When p=2, the Euclidean distance is obtained;

When p → ∞, the Chebyshev distance is obtained.

The derivations for the Manhattan and Euclidean cases are straightforward; the Chebyshev case requires some knowledge of norms (it is the limit of the p-norm as p → ∞).

The following norm content is taken from Baidu Encyclopedia - Norm .
[Image: excerpt on norms from Baidu Encyclopedia - Norm]
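
The special cases above are easy to check numerically. Below is a rough Python sketch (the function name is ours); note how a large p already gets close to the Chebyshev distance.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (1, 1), (4, 5)
print(minkowski_distance(a, b, 1))    # 7.0   -> Manhattan distance
print(minkowski_distance(a, b, 2))    # 5.0   -> Euclidean distance
print(minkowski_distance(a, b, 100))  # ~4.0  -> approaches the Chebyshev distance
```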
As noted above, the scale and correlation of the feature parameters affect the Minkowski distance, and therefore also affect the Manhattan, Euclidean and Chebyshev distances.

Euclidean distance is more widely used than the other distance measures, so take it as an example. Suppose there are two vectors A = (1, 10) and B = (10, 1000). When computing the Euclidean distance between A and B, the second dimension affects the result far more than the first; this is how differences in scale across dimensions distort the distance measure.

To eliminate the influence of scale and correlation, you can first perform principal component analysis (PCA) to extract composite indicators and remove the correlation between dimensions, then perform feature scaling to remove the influence of scale, and finally compute the distance. Alternatively, consider using the Mahalanobis distance introduced next.
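
As a rough sketch of the scaling step only (PCA is left out), the example below standardizes each dimension to zero mean and unit variance before computing the Euclidean distance; the sample values are made up for illustration.

```python
import numpy as np

# Made-up samples whose two columns are on very different scales
X = np.array([[1.0, 10.0],
              [10.0, 1000.0],
              [5.0, 500.0]])

# Raw Euclidean distance between the first two samples: dominated by column 2
print(np.linalg.norm(X[0] - X[1]))

# z-score standardization per column, then the same distance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
```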

Mahalanobis distance

Algorithm description

Mahalanobis distance was proposed by the Indian statistician P. C. Mahalanobis and represents the distance between a point and a distribution. It is an effective method for computing the similarity of two unknown sample sets. Unlike the Euclidean distance, it takes the correlation between features into account (for example, information about height also carries information about weight, because the two are related) and is scale-invariant, that is, independent of the measurement scale.

The Mahalanobis distance formula is referenced from https://blog.csdn.net/qq_37053885/article/details/79359427.

Suppose there are M sample vectors $X_1 \sim X_M$ with covariance matrix $S$. The Mahalanobis distance between two vectors A and B is defined as:
$$D(A,B) = \sqrt{(A - B)^T S^{-1} (A - B)}$$
When the covariance matrix $S$ is the identity matrix (each dimension is independent and identically distributed), the formula reduces to:
$$D(A,B) = \sqrt{(A - B)^T (A - B)}$$
which is exactly the Euclidean distance.

When the covariance matrix is the identity matrix, take the Mahalanobis distance between the vectors A = (a_1, a_2, ..., a_n) and B = (b_1, b_2, ..., b_n) as an example. Writing the vectors as column matrices
$$A = \begin{bmatrix} a_1\\ a_2\\ \vdots\\ a_n \end{bmatrix}, \qquad B = \begin{bmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{bmatrix},$$
the formula expands to the familiar Euclidean distance:
$$\begin{aligned} D(A,B) &= \sqrt{(A - B)^T (A - B)}\\ &= \sqrt{\begin{bmatrix} a_1 - b_1, \; a_2 - b_2, \; \dots, \; a_n - b_n \end{bmatrix} \begin{bmatrix} a_1 - b_1\\ a_2 - b_2\\ \vdots\\ a_n - b_n \end{bmatrix}}\\ &= \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}\\ &= \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \end{aligned}$$

A vector can be written either as a row matrix or as a column matrix. Here the result of the matrix product must be a scalar, so the first factor must be a row matrix; it is therefore natural to treat both A and B as column matrices and transpose the difference on the left.
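
A rough NumPy sketch of the Mahalanobis distance, estimating the covariance matrix S from a set of samples; the function name and the sample values are ours and purely illustrative.

```python
import numpy as np


def mahalanobis_distance(a, b, samples):
    """Mahalanobis distance between vectors a and b, with the covariance
    matrix S estimated from the given sample matrix (one sample per row)."""
    cov = np.cov(samples, rowvar=False)   # covariance matrix S
    cov_inv = np.linalg.inv(cov)          # S^{-1}
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(diff @ cov_inv @ diff))


# Made-up, correlated features (think height and weight)
samples = np.array([[170.0, 60.0],
                    [180.0, 75.0],
                    [160.0, 50.0],
                    [175.0, 68.0]])
print(mahalanobis_distance(samples[0], samples[1], samples))
```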

Application Scenario

The application scenarios are the same as for the Euclidean distance; when the correlation between the feature parameters has a strong influence, consider using the Mahalanobis distance instead of the Euclidean distance.

Common Similarity Measures

Cosine similarity

Algorithm description

Cosine similarity evaluates the similarity of two vectors by computing the cosine of the angle between them. The formula is as follows:
$$\cos(\Theta) = \frac{A \cdot B}{\left\| A \right\| \left\| B \right\|} = \frac{\sum_{i=1}^{n} (A_i \times B_i)}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}$$
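
A minimal Python sketch of cosine similarity (the function name is ours); the example reproduces sim(A, B) from the collaborative filtering calculation below.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity((1, 2), (3, 4)))  # ~0.9839, matching sim(A, B) below
```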
Cosine similarity is commonly used to calculate text similarity and works very well there. In some scenarios, however, the results it gives are inaccurate, which is why the modified (adjusted) cosine similarity was introduced. The formula is as follows:
$$sim(A,B) = \frac{\sum_{i=1}^{n} (A_i - \bar{D_i})(B_i - \bar{D_i})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{D_i})^2} \times \sqrt{\sum_{i=1}^{n} (B_i - \bar{D_i})^2}}$$
where $\bar{D_i}$ denotes the expectation (mean) of the i-th dimension (component).

To compare cosine similarity with modified cosine similarity, consider a collaborative filtering example. Assume the rating range for items is [1,5] and two users (U1, U2) rate three items (A, B, C), giving the following rating matrix:

| Item \ User | U1 | U2 |
| --- | --- | --- |
| A | 1 | 2 |
| B | 3 | 4 |
| C | 3 | 5 |

User U1 generally does not give high scores: 1 means dislike and 3 means like; user U2 rates strictly on the full scale. Following the idea of collaborative filtering, from this rating matrix we would judge that A is not similar to B or C, while B and C are similar.

Now compute the item similarities. From the rating matrix, the item vectors are A = (1, 2), B = (3, 4), C = (3, 5).

Computing the item similarities with cosine similarity:
$$\begin{aligned} sim(A,B) &= \frac{1\times3+2\times4}{\sqrt{1^2+2^2}\times\sqrt{3^2+4^2}} = \frac{11}{\sqrt{5}\times\sqrt{25}} \approx 0.98386991\\ sim(A,C) &= \frac{1\times3+2\times5}{\sqrt{1^2+2^2}\times\sqrt{3^2+5^2}} = \frac{13}{\sqrt{5}\times\sqrt{34}} \approx 0.997054486\\ sim(B,C) &= \frac{3\times3+4\times5}{\sqrt{3^2+4^2}\times\sqrt{3^2+5^2}} = \frac{29}{\sqrt{25}\times\sqrt{34}} \approx 0.994691794 \end{aligned}$$
To compute the item similarities with the modified cosine similarity, first compute the expectation of each dimension, i.e. the mean rating of each user:

$$\bar{U_1} = \frac{1+3+3}{3} \approx 2.33, \qquad \bar{U_2} = \frac{2+4+5}{3} \approx 3.67$$
Next calculate the similarity:
s i m ( A , B ) = ( 1 − 2.33 ) × ( 3 − 2.33 ) + ( 2 − 3.67 ) × ( 4 − 3.67 ) ( 1 − 2.33 ) 2 + ( 2 − 3.67 ) 2 × ( 3 − 2.33 ) 2 + ( 4 − 3.67 ) 2 = − 1.4422 4.5578 × 0.5578 ≈ − 0.904500069 s i m ( A , C ) = ( 1 − 2.33 ) × ( 3 − 2.33 ) + ( 2 − 3.67 ) × ( 5 − 3.67 ) ( 1 − 2.33 ) 2 + ( 2 − 3.67 ) 2 × ( 3 − 2.33 ) 2 + ( 5 − 3.67 ) 2 = − 3.1122 4.5578 × 2.2178 ≈ − 0.978878245 s i m ( B , C ) = ( 3 − 2.33 ) × ( 3 − 2.33 ) + ( 4 − 3.67 ) × ( 5 − 3.67 ) ( 3 − 2.33 ) 2 + ( 4 − 3.67 ) 2 × ( 3 − 2.33 ) 2 + ( 5 − 3.67 ) 2 = 0.8878 0.5578 × 2.2178 ≈ 0.798205464 \begin{aligned} sim(A,B) &= \frac{(1-2.33)\times(3-2.33)+(2-3.67)\times(4-3.67)}{\sqrt{(1-2.33)^2+(2-3.67)^2}\times\sqrt{(3-2.33)^2+(4-3.67)^2}}\\ &= \frac{-1.4422}{\sqrt{4.5578}\times\sqrt{0.5578}} \approx -0.904500069\\ sim(A,C) &= \frac{(1-2.33)\times(3-2.33)+(2-3.67)\times(5-3.67)}{\sqrt{(1-2.33)^2+(2-3.67)^2}\times\sqrt{(3-2.33)^2+(5-3.67)^2}}\\ &= \frac{-3.1122}{\sqrt{4.5578}\times\sqrt{2.2178}} \approx -0.978878245\\ sim(B,C) &= \frac{(3-2.33)\times(3-2.33)+(4-3.67)\times(5-3.67)}{\sqrt{(3-2.33)^2+(4-3.67)^2}\times\sqrt{(3-2.33)^2+(5-3.67)^2}}\\ &= \frac{0.8878}{\sqrt{0.5578}\times\sqrt{2.2178}} \approx 0.798205464 \end{aligned} yes m ( A , _B)yes m ( A , _C)sim(B,C)=(12.33)2+(23.67)2 ×(32.33)2+(43.67)2 (12.33)×(32.33)+(23.67)×(43.67)=4.5578 ×0.5578 1.44220.904500069=(12.33)2+(23.67)2 ×(32.33)2+(53.67)2 (12.33)×(32.33)+(23.67)×(53.67)=4.5578 ×2.2178 3.11220.978878245=(32.33)2+(43.67)2 ×(32.33)2+(53.67)2 (32.33)×(32.33)+(43.67)×(53.67)=0.5578 ×2.2178 0.88780.798205464
The key to modified cosine similarity is the correction itself: center each dimension first, then compute the ordinary cosine similarity.

In fact, we can center A, B, and C first to get A1 = (-1.33, -1.67), B1 = (0.67, 0.33), C1 = (0.67, 1.33) and then compute the ordinary cosine similarity. In an actual implementation, this two-step approach can also be used to simplify the algorithm.
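
A sketch of this center-then-cosine approach, reusing the cosine_similarity function from the sketch above; the helper name center_columns is ours. The small differences from the hand calculation come from it rounding the means to 2.33 and 3.67.

```python
def center_columns(ratings):
    """Subtract each column's (each user's) mean rating from that column."""
    means = [sum(col) / len(col) for col in zip(*ratings)]
    return [[v - means[j] for j, v in enumerate(row)] for row in ratings]


# Rows are items A, B, C; columns are users U1, U2
ratings = [[1, 2], [3, 4], [3, 5]]
a1, b1, c1 = center_columns(ratings)
print(cosine_similarity(a1, b1))  # ~ -0.908
print(cosine_similarity(a1, c1))  # ~ -0.978
print(cosine_similarity(b1, c1))  #    0.800
```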

Application Scenario

Widely used in text similarity and item similarity in collaborative filtering systems.

Pearson correlation coefficient

Algorithm description

In statistics, the Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient (PPMCC or PCC for short), measures the correlation (linear correlation) between two variables X and Y, with values between -1 and 1.

The Pearson correlation coefficient between two variables is defined as their covariance divided by the product of their standard deviations:
$$\begin{aligned} \rho(X,Y) &= \frac{cov(X,Y)}{\sigma_X \sigma_Y}\\ &= \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}\\ &= \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} \end{aligned}$$
Note that this formula looks very similar to the modified cosine similarity; recall that formula:
$$sim(A,B) = \frac{\sum_{i=1}^{n} (A_i - \bar{D_i})(B_i - \bar{D_i})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{D_i})^2} \times \sqrt{\sum_{i=1}^{n} (B_i - \bar{D_i})^2}}$$

The difference between the two can be understood as follows:

Modified cosine similarity measures the similarity of vectors: each set of numbers is treated as a vector, and centering is done per dimension, i.e. by column. The Pearson correlation coefficient measures linear correlation: each set of numbers is treated as a variable, and centering is done per variable, i.e. by row.

The following collaborative filtering example computes user similarity and illustrates the centering used by the Pearson correlation coefficient.

Still taking the scoring matrix as an example, suppose there is the following scoring matrix:

| User \ Item | I1 | I2 | I3 |
| --- | --- | --- | --- |
| A | 1 | 2 | 3 |
| B | 4 | 4 | 1 |
| C | 3 | 4 | 5 |

This gives the user variables A = (1,2,3), B = (4,4,1), C = (3,4,5). Following the idea of collaborative filtering, from the rating matrix we would judge that A is similar to C, while B is similar to neither A nor C.

Compute the expectations (the mean rating of each user):
$$\bar{A} = \frac{1+2+3}{3} = 2, \qquad \bar{B} = \frac{4+4+1}{3} = 3, \qquad \bar{C} = \frac{3+4+5}{3} = 4$$

Centering first gives the new variables A1 = (-1, 0, 1), B1 = (1, 1, -2), C1 = (-1, 0, 1).

Then compute the Pearson correlation coefficients:
$$\begin{aligned} \rho(A_1,B_1) &= \frac{-1\times1+0\times1+1\times(-2)}{\sqrt{(-1)^2+0^2+1^2}\sqrt{1^2+1^2+(-2)^2}} = \frac{-3}{\sqrt{2}\sqrt{6}} \approx -0.866025404\\ \rho(A_1,C_1) &= \frac{-1\times(-1)+0\times0+1\times1}{\sqrt{(-1)^2+0^2+1^2}\sqrt{(-1)^2+0^2+1^2}} = \frac{2}{\sqrt{2}\sqrt{2}} = 1\\ \rho(B_1,C_1) &= \frac{1\times(-1)+1\times0+(-2)\times1}{\sqrt{1^2+1^2+(-2)^2}\sqrt{(-1)^2+0^2+1^2}} = \frac{-3}{\sqrt{6}\sqrt{2}} \approx -0.866025404 \end{aligned}$$
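
The same numbers can be checked with NumPy, which centers by row automatically when computing the correlation matrix (np.corrcoef treats each row as one variable):

```python
import numpy as np

# Rows are users A, B, C; columns are items I1, I2, I3
ratings = np.array([[1, 2, 3],
                    [4, 4, 1],
                    [3, 4, 5]])

r = np.corrcoef(ratings)  # 3x3 correlation matrix
print(r[0, 1])  # rho(A, B) ~ -0.866
print(r[0, 2])  # rho(A, C) =  1.0
print(r[1, 2])  # rho(B, C) ~ -0.866
```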

The three major correlation coefficients in statistics are the Pearson correlation coefficient, the Spearman correlation coefficient, and the Kendall correlation coefficient.

Application Scenario

Used for observing the correlation between variables and for data dimensionality reduction.

It can also be used for user similarity in collaborative filtering systems.

Jaccard coefficient

Algorithm description

The Jaccard index, also known as the Jaccard similarity coefficient, is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the higher the sample similarity.

Given two sets A and B, the Jaccard coefficient is defined as the size of the intersection of A and B divided by the size of their union:
$$J(A,B) = \frac{\left| A \cap B \right|}{\left| A \cup B \right|} = \frac{\left| A \cap B \right|}{\left| A \right| + \left| B \right| - \left| A \cap B \right|}$$
The value range of J(A,B) is [0,1]. When the sets A and B are both empty, J(A,B) is defined as 1.

A related index is the Jaccard distance, which describes the dissimilarity between sets; the larger the Jaccard distance, the lower the sample similarity. It is defined as:
$$D_j(A,B) = 1 - J(A,B) = \frac{\left| A \cup B \right| - \left| A \cap B \right|}{\left| A \cup B \right|} = \frac{\left| A \,\Delta\, B \right|}{\left| A \cup B \right|}$$
The Jaccard coefficient measures the similarity of asymmetric binary attributes.

In data mining, it is often necessary to compare the distance between two objects with Boolean attributes, and the Jaccard distance is a commonly used method. Given two objects A and B, each with n binary attributes (each attribute takes a value in {0,1}), define the following four statistics:

$M_{00}$: the number of attributes where A and B are both 0;

$M_{01}$: the number of attributes where A is 0 and B is 1;

$M_{10}$: the number of attributes where A is 1 and B is 0;

$M_{11}$: the number of attributes where A and B are both 1.

As shown in the following table:

| A \ B | 0 | 1 |
| --- | --- | --- |
| 0 | $M_{00}$ | $M_{01}$ |
| 1 | $M_{10}$ | $M_{11}$ |

Jaccard coefficient:
$$J(A,B) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$$
Jaccard distance:
$$D_j(A,B) = 1 - J(A,B) = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}}$$
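
A minimal Python sketch of both forms, the set version and the binary-attribute version; the function names are ours.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient of two sets (defined as 1 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_binary(x, y):
    """Jaccard coefficient of two equal-length 0/1 attribute vectors;
    M00 pairs are ignored, as in the table above."""
    m11 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    m01_m10 = sum(1 for xi, yi in zip(x, y) if xi != yi)
    return m11 / (m11 + m01_m10)


print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 2/4 = 0.5
print(jaccard_binary([1, 0, 1, 1], [1, 1, 0, 1]))            # 2/4 = 0.5
```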

Application Scenario

Compare text similarity for text duplication checking and deduplication;

Calculate the distance between objects, for data clustering, etc.

It can also be used to calculate item similarity in collaborative filtering systems. In related research, cosine similarity is the common choice for item-based collaborative filtering, but in many practical applications the rating data is very sparse, and the cosine similarity between items can then produce misleading results. Applying the Jaccard similarity measure to item-based collaborative filtering, together with a corresponding evaluation and analysis method, mitigates the drawback of cosine similarity that it considers only user ratings and ignores other information, and it is especially suitable for very sparse data.

Summary of Metric Differences

Euclidean distance and Mahalanobis distance

The Euclidean distance is affected by the scale and correlation of the feature parameters, whereas the Mahalanobis distance normalizes by the covariance matrix (via its inverse) and thereby removes the influence of scale. When the covariance matrix is the identity matrix, the Mahalanobis distance equals the Euclidean distance.

Euclidean distance and cosine similarity

Euclidean distance measures the distance between two points in a vector space, while cosine similarity measures the difference in direction between two vectors. For the same set of vectors, the two can give very different results. When choosing between them, decide from the actual scenario whether distance or direction is what needs to be measured, and pick the algorithm accordingly.

Cosine similarity and Pearson correlation coefficient

In a collaborative filtering system, cosine similarity is often used to calculate item similarity, and Pearson correlation coefficient is often used to calculate user similarity.

Cosine similarity and Jaccard coefficient

In collaborative filtering systems, cosine similarity is the most common choice for item similarity, but when the rating matrix is very sparse it can produce misleading results. In that case, combining the Jaccard coefficient with a suitable evaluation and analysis method may yield better item similarities.
