Similarity algorithm (reproduced)

In data analysis and data mining, we often need to know how large the differences between individuals are, in order to evaluate their similarity and group them into categories. This is the basis of many classification and clustering algorithms, such as k-nearest neighbors (KNN) and k-means. There are many ways to measure these individual differences; having recently reviewed the relevant material, I have organized and listed them here.

  For convenience in the explanations and examples below, suppose we want to compare the difference between two individuals X and Y, each described by n-dimensional features: X = (x_1, x_2, x_3, ..., x_n) and Y = (y_1, y_2, y_3, ..., y_n). The main methods for measuring the difference between the two divide into distance measures and similarity measures.

 

Distance Measurement

  A distance measure (Distance) quantifies how far apart individuals are in space: the farther apart they are, the greater the difference between them.

 

Euclidean distance (Euclidean Distance)

  Euclidean distance is the most common distance measure; it is the absolute distance between two points in a multidimensional space. The formula is:

d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

  Because the calculation is based on the absolute values of the features in each dimension, Euclidean distance requires all dimensions to be on the same scale. Mixing indicators with different units, such as height (cm) and weight (kg), can make the result meaningless, so the data should be standardized first.
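  To illustrate the scale problem, here is a minimal sketch with invented height/weight values; the raw distance is dominated by the centimeter-scale height gap until each dimension is standardized:

from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Invented sample: each person is (height in cm, weight in kg).
people = [(180.0, 70.0), (160.0, 75.0), (170.0, 60.0)]
print(euclidean(people[0], people[1]))  # ~20.6, dominated by the 20 cm height gap

def zscore(data):
    """Standardize every dimension to zero mean and unit variance."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in data]

scaled = zscore(people)
print(euclidean(scaled[0], scaled[1]))  # ~2.6, both indicators now contribute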

 

Minkowski distance (Minkowski Distance)

  Minkowski distance is a generalization of Euclidean distance: it is a general expression covering several distance measures. The formula is:

d(X, Y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}

  Here p is a variable parameter; when p = 2, it becomes the Euclidean distance above.
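  A minimal sketch, with function and parameter names of my own choosing; the special cases reproduce the distances discussed in this post:

def minkowski(x, y, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, p=2))   # 5.0 -> Euclidean distance
print(minkowski(x, y, p=1))   # 7.0 -> Manhattan distance (next section)
print(minkowski(x, y, p=50))  # ~4.0 -> approaches the Chebyshev distance max(3, 4)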

 

Manhattan distance (Manhattan Distance)

  Manhattan distance comes from city-block distance: it sums the distances along each dimension. It is the Minkowski distance obtained when p = 1, and the formula is:

d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|

 

Chebyshev distance (Chebyshev Distance)

  Chebyshev distance originates from the king's moves in chess: the king can move one step to any of the 8 surrounding squares, so what is the minimum number of steps needed to go from square A (x_1, y_1) to square B (x_2, y_2)? The answer is max(|x_2 - x_1|, |y_2 - y_1|). Extended to multidimensional space, the Chebyshev distance is in fact the Minkowski distance as p tends to infinity:

d(X, Y) = \lim_{p \to \infty}\left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p} = \max_{i} |x_i - y_i|
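  A quick sketch of the chess-king intuition, with invented board coordinates:

def chebyshev(x, y):
    """Chebyshev distance: the largest per-dimension gap."""
    return max(abs(a - b) for a, b in zip(x, y))

# Minimum king moves from board square (1, 1) to (6, 3):
print(chebyshev((1, 1), (6, 3)))  # 5: each move covers at most one step per axis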

  In fact, the Manhattan, Euclidean, and Chebyshev distances above are all special cases of the Minkowski distance.

 

Mahalanobis distance (Mahalanobis Distance)

  Since Euclidean distance cannot ignore differences in the scales of the indicators, the underlying data must be standardized before it is used. A distance measure derived from Euclidean distance by standardizing each dimension is the Mahalanobis distance (Mahalanobis Distance).
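  A minimal numpy sketch under the usual definition (sample data invented): the differences are rescaled by the inverse of the sample covariance matrix, which both standardizes the dimensions and accounts for their correlation:

import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance between x and y, with covariance estimated from data."""
    vi = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance matrix
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ vi @ d))

# Invented sample: rows are individuals, columns are (height cm, weight kg).
data = np.array([[180.0, 70.0], [160.0, 75.0], [170.0, 60.0], [175.0, 68.0]])
print(mahalanobis(data[0], data[1], data))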

 

Similarity Measurement

  A similarity measure (Similarity) calculates the degree of similarity between individuals. In contrast to a distance measure, a smaller similarity value means the individuals are less similar, i.e. the difference between them is greater.

 

Vector space cosine similarity (Cosine Similarity)

  Vector space cosine similarity uses the cosine of the angle between two vectors as the measure of the difference between two individuals. Compared with distance measures, cosine similarity focuses on the difference in the directions of the two vectors rather than on distance or magnitude. The formula is:

\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
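  A minimal sketch of the formula in plain Python (a dict-based variant, sim_distance_cos, appears in the appendix at the end of this post):

from math import sqrt

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return num / den if den else 0.0

print(cosine((3.0, 4.0), (4.0, 3.0)))  # 0.96: same length, different direction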

 

Pearson correlation coefficient (Pearson Correlation Coefficient)

  This is the correlation coefficient r from correlation analysis: the vector space cosine of the angle, computed after X and Y have each been centered on their own means. The formula is:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
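  To make the link with cosine similarity concrete, here is a small sketch (sample vectors invented) computing Pearson's r as the cosine of the mean-centered vectors:

from math import sqrt

def pearson(x, y):
    """Pearson's r: the cosine similarity of the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = sqrt(sum(a * a for a in xc)) * sqrt(sum(b * b for b in yc))
    return num / den if den else 0.0

print(pearson([1.0, 2.0, 3.0, 5.0], [2.0, 4.0, 5.0, 9.0]))  # ~0.99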

 

Jaccard coefficient (Jaccard Coefficient)

  The Jaccard coefficient is mainly used to compute similarity between individuals whose features are symbolic or Boolean measures. Since such features only identify categories, the size of the difference between specific values cannot be measured, only whether the values "are the same"; the Jaccard coefficient therefore only cares about whether the features the individuals share are consistent. To compute the Jaccard similarity of X and Y, we only compare how many of the x_i and y_i match: the number of features they share, over the number of features either of them has. The formula is:

J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}
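  A minimal set-based sketch with invented tag sets (note that the appendix function sim_distance_jaccard implements the numeric Tanimoto variant of this idea instead):

def jaccard(x, y):
    """Jaccard coefficient of two sets: shared features over all features."""
    x, y = set(x), set(y)
    union = x | y
    return float(len(x & y)) / len(union) if union else 0.0

# Invented example: tag-style (Boolean) features of two users.
print(jaccard({'sports', 'music', 'news'}, {'music', 'news', 'movies'}))  # 0.5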

 

Adjusted cosine similarity (Adjusted Cosine Similarity)

  Cosine similarity can only distinguish differences in direction between individuals; it is insensitive to the absolute value in each dimension, which can lead to errors. For example, with user ratings of content on a 5-point scale, suppose users X and Y rate two items (1, 2) and (4, 5) respectively. The cosine similarity is 0.98, so the two appear very similar; but judging from the ratings, X does not seem to like either item, while Y clearly does. Adjusted cosine similarity corrects this irrationality by subtracting a mean from the values in every dimension. If the mean of X's and Y's ratings is 3, the adjusted vectors are (-2, -1) and (1, 2), and the cosine similarity becomes -0.8: the similarity is negative and the difference is large, which clearly matches reality better.
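  A short sketch reproducing the numbers in this example (the 5-point ratings come from the text; the mean of 3 is the one the text assumes):

from math import sqrt

def cos(x, y):
    return sum(a * b for a, b in zip(x, y)) / (
        sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

x, y = (1, 2), (4, 5)               # 5-point ratings from the example
print(round(cos(x, y), 2))          # 0.98: direction alone says "very similar"

mean = 3                            # the mean rating the example assumes
x_adj = tuple(v - mean for v in x)  # (-2, -1)
y_adj = tuple(v - mean for v in y)  # (1, 2)
print(round(cos(x_adj, y_adj), 2))  # -0.8: after adjustment they disagree strongly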

 

Euclidean distance and cosine similarity

  Euclidean distance is the most common distance measure and cosine similarity is the most common similarity measure; many other distance and similarity measures are variations derived from these two. The comparison below therefore focuses on how the two differ in implementation and in the application environments where they measure individual differences.

  The difference between Euclidean distance and cosine similarity can be visualized in a two-dimensional coordinate system:

[Figure: points A and B in a two-dimensional coordinate system, comparing the Euclidean distance dist(A, B) with the angle θ between the vectors from the origin to A and B]

  As the figure shows, the distance measure is the absolute distance between the points in space and is directly tied to the position coordinates of each point (i.e., the values of the individual's features in each dimension); cosine similarity measures the angle between the space vectors, so the difference it reflects lies in direction, not position. If point A is held fixed while point B moves away from the origin along its original direction, the cosine similarity cos θ stays the same, because the angle is unchanged, while the distance between A and B clearly changes. This is the difference between Euclidean distance and cosine similarity.
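  This is easy to verify; a minimal sketch with invented points, scaling B threefold away from the origin:

from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    num = sum(a * b for a, b in zip(x, y))
    return num / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

A = (1.0, 2.0)
B = (3.0, 1.0)
B_far = (9.0, 3.0)  # B moved away from the origin along its original direction

print(cosine(A, B), cosine(A, B_far))        # identical: the angle is unchanged
print(euclidean(A, B), euclidean(A, B_far))  # ~2.24 vs. ~8.06: the distance grew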

 

  Because of how they are calculated and what they measure, Euclidean distance and cosine similarity suit different data-analysis models. Euclidean distance reflects absolute differences in individual feature values, so it fits analyses where the differences come from the magnitudes of the dimensions, such as using user-behavior metrics to analyze similarity or difference in customer value. Cosine similarity distinguishes differences by direction and is insensitive to absolute values, so it is better suited to distinguishing similarity of interest from users' ratings of content, and it also corrects for rating scales that may not be consistent across users (precisely because it ignores absolute values).

 

  The above is a summary of distance measures and similarity measures. Choosing the right distance or similarity measure covers a lot of the modeling work in data analysis and data mining; related topics will be introduced in follow-up posts.

Appendix: Python implementations of some of the similarity algorithms above:

#!/usr/bin/python
#coding=utf-8
critics = {
    'Lisa': {
        'Lady in the water': 2.5,
        'Snake on a plane': 3.5
    },
    'Tom': {
        'Lady in the water': 3.0,
        'Snake on a plane': 4.0
    },
    'Jerry': {
        'Lady in the water': 2.0,
        'Snake on a plane': 3.0
    },
    'WXM': {
        'Lady in the water': 3.3,
        'Snake on a plane': 4.2
    },
    'JHZ': {
        'Lady in the water': 3.9,
        'Snake on a plane': 4.5
    }
}

from math import sqrt

"""
Similarity based on Euclidean distance
"""
def sim_distance(p1, p2):
    # items rated by both people
    c = set(p1.keys()) & set(p2.keys())
    if not c:
        return 0
    sum_of_squares = sum([pow(p1.get(sk) - p2.get(sk), 2) for sk in c])
    # map the distance into (0, 1]: identical ratings give 1
    return 1 / (1 + sqrt(sum_of_squares))
 
"""
皮尔逊相关度
"""
def sim_distance_pearson(p1,p2):
    c = set(p1.keys())&set(p2.keys())
    if not c:
       return 0
    s1 = sum([p1.get(sk) for sk in c])
    s2 = sum([p2.get(sk) for sk in c])
    sq1 = sum([pow(p1.get(sk),2) for sk in c])
    sq2 = sum([pow(p2.get(sk),2) for sk in c])
    ss = sum([p1.get(sk)*p2.get(sk) for sk in c])
    n = len(c)
    num = ss-s1*s2/n
    den = sqrt((sq1-pow(s1,2)/n)*(sq2-pow(s2-2)/n))
    if den == 0:
       return 0
    p = num/den
    return p
 
"""
Jaccard系数
"""
def sim_distance_jaccard(p1,p2):
    c = set(p1.keys())&set(p2.keys())
    if not c:
       return 0
    ss = sum([p1.get(sk)*p2.get(sk) for sk in c])
    sq1 = sum([pow(sk,2) for sk in p1.values()])
    sq2 = sum([pow(sk,2) for sk in p2.values()])
    p = float(ss)/(sq1+sq2-ss)
    return p
 
"""
余弦相似度
"""
def sim_distance_cos(p1,p2):
    c = set(p1.keys())&set(p2.keys())
    if not c:
       return 0
    ss = sum([p1.get(sk)*p2.get(sk) for sk in c])
    sq1 = sqrt(sum([pow(p1.get(sk),2) for sk in p1.values()]))
    sq2 = sqrt(sum([pow(p2.get(sk),2) for sk in p2.values()]))
    p = float(ss)/(sq1*sq2)
    return p

"""
得到top相似度高的前几位
"""
def topMatches(prefs,person,n=5,similarity=sim_distance_pearson):
    scores = [Similarity (Prefs, Person, OTHER) for OTHER in Prefs IF ! = OTHER Person] 
   scores.sort () 
   scores.reverse () 
    return Scores [0: n-] 

"" " 
# evaluation values using a weighted average of all others, is advise a person. 
"" " 
DEF getRecommendations (Prefs, person, Similarity = sim_distance): 
    totals = {} 
    simSums = {} 
 
    for OTHER in Prefs:
        IF OTHER == person: Continue 
       SIM = Similarity (Prefs, person, OTHER)
        # ignore evaluation value is zero or less than zero. 
       IFthe SIM <= 0: the Continue 
       for Item in prefs [OTHER]:
            # only for myself have not yet seen the film were evaluated. 
           IF Item not  in prefs [the Person] or prefs [the Person] [Item] == 0: 
              totals.setdefault (Item, 0) 
              totals [Item] + = * SIM Prefs [OTHER] [Item]
               # sum of similarities 
              simSums.setdefault (Item, 0) 
              simSums [Item] + = SIM
        # establishing a normalized list. 
       Rankings = [(Total / simSums [Item], Item) \
                    for Item, Total in totals.items()]
       rankings.sort()
       rankings.reverse()
       return rankings
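
A quick usage sketch against the critics data above (the printed values depend on the sample ratings):

print(sim_distance(critics['Lisa'], critics['Tom']))        # Euclidean-based similarity
print(sim_distance_cos(critics['Lisa'], critics['Jerry']))  # cosine similarity
print(topMatches(critics, 'Lisa', n=2, similarity=sim_distance))
print(getRecommendations(critics, 'Lisa'))  # []: Lisa has already rated every item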

 



Origin: www.cnblogs.com/hellowzl/p/11496817.html