Item-based collaborative filtering (ItemCF) implemented with Spark DataFrames

Jianshu does not support Markdown math syntax; for the properly rendered version please see https://glassywing.github.io/2018/04/10/spark-itemcf/

Introduction

At present, the only collaborative filtering algorithm Spark supports is ALS (model-based collaborative filtering). For some problems ALS does not produce ideal results, and unlike Mahout, Spark does not offer a variety of recommendation algorithms. To enjoy the speed boost Spark brings and to meet certain business needs, I built an ItemCF algorithm on Spark. At the same time, Spark's DataFrame provides a new data type that makes the algorithm clearer and easier to implement.

Prerequisites

Common similarity formulas

The most important part of a collaborative filtering algorithm is computing the similarity between items. Different scenarios may call for different similarity formulas. The commonly used formulas are as follows:

Co-occurrence similarity (Co-occurrence)

Co-occurrence similarity formula

$$ w_{x,y} = \frac{|N(x) \cap N(y)|}{|N(x)|} $$

In this formula the denominator is the number of users who like item x, and the numerator is the number of users who are interested in both item x and item y. The formula can therefore be read as the probability that a user interested in item x is also interested in item y (similar in spirit to association rules).

However, the formula above has a problem: if item y is a popular item that many people like, w(x, y) will be large and close to 1, so every item ends up highly similar to popular items. To correct for this, we use the following formula instead:

Improved co-occurrence similarity formula

$$ w_{x,y} = \frac{|N(x) \cap N(y)|}{\sqrt{|N(x)||N(y)|}} $$

This formula penalizes the weight of the popular item y and thus reduces the chance that every item looks similar to popular items. (It also acts as a normalization.)
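As a quick worked example with made-up counts: if 100 users like item x, 400 users like the popular item y, and 20 users like both, the original formula gives 20 / 100 = 0.2, while the corrected formula gives

$$ w_{x,y} = \frac{20}{\sqrt{100 \times 400}} = \frac{20}{200} = 0.1 $$

so the popularity of y pulls the similarity down.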

Euclidean similarity (Euclidean Similarity)

Euclidean similarity is computed from the Euclidean distance: the closer two points are, the higher the similarity, and vice versa.

Definition of Euclidean distance

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" (i.e., straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. Earlier literature called it the Pythagorean metric.

Euclidean distance formula

$$ d_{X,Y}=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2} $$
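The article does not show how the distance is turned into a similarity; one common convention (an assumption here, not taken from the original) is

$$ sim_{X,Y} = \frac{1}{1 + d_{X,Y}} $$

which maps a distance of 0 to a similarity of 1 and larger distances to smaller similarities.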

Pearson similarity

The Pearson correlation coefficient is the correlation coefficient from probability theory and lies in the range [-1, +1]. A value greater than zero means the two variables are positively correlated; a value less than zero means they are negatively correlated.

Definition of the Pearson product-moment correlation coefficient

The Pearson correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:

Pearson product-moment correlation coefficient equation

$$ \rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E((X-\mu_X)(Y-\mu_Y))}{\sigma_X \sigma_Y} = \frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-E^2(X)}\sqrt{E(Y^2)-E^2(Y)}} $$
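The last equality follows from the standard identities

$$ \mathrm{cov}(X,Y) = E(XY) - E(X)E(Y), \qquad \sigma_X = \sqrt{E(X^2) - E^2(X)} $$

and the correlation helper in the code below uses the same form with the expectations replaced by sums over the ratings, since only sums, squared sums and dot products are available there.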

Cosine similarity (Cosine Similarity)

Cosine similarity extends the applicable range to multidimensional space. The cosine of the angle formed by the two points and the origin lies in [-1, 1]; the larger the angle, the farther apart the two points and the lower the similarity.

Definition of the cosine between vectors

The cosine of the angle formed, in multidimensional space, by the two points and the origin.

Cosine formula

$$ sim_{X,Y}=\frac{X \cdot Y}{\|X\|\|Y\|}=\frac{\sum_{i=1}^{n}x_i y_i}{\sqrt{\sum_{i=1}^{n}x_i^2}\sqrt{\sum_{i=1}^{n}y_i^2}} $$

In the formula, $x_i$ denotes the i-th user's rating of item X, and $y_i$ likewise for item Y.
The formula only takes the users' ratings into account, so highly rated items are likely to rank at the top regardless of any other information about the items. An improved cosine similarity is computed as follows:

Improved cosine similarity formula

$$ sim_{X,Y} = \frac{X \cdot Y \cdot num_{X \cap Y}}{\|X\|\|Y\| \cdot num_X \cdot \log_{10}(10 + num_Y)} $$

The improved formula takes into account the number of raters the two vectors have in common, the number of raters of X, and the number of raters of Y. Note that:
$$ sim_{X,Y} \neq sim_{Y,X} $$
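The asymmetry comes from the denominator: swapping X and Y gives

$$ sim_{Y,X} = \frac{X \cdot Y \cdot num_{X \cap Y}}{\|X\|\|Y\| \cdot num_Y \cdot \log_{10}(10 + num_X)} $$

and in general $num_X \cdot \log_{10}(10 + num_Y) \neq num_Y \cdot \log_{10}(10 + num_X)$.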

Tanimoto similarity (Jaccard coefficient)

Tanimoto similarity, also known as the Jaccard coefficient, is an extension of cosine similarity and is often used for document similarity. This measure ignores the rating values and considers only the number of individuals the two sets have in common.

Jaccard coefficient formula

$$ sim_{X,Y} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} $$
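For example, with made-up sets: if item X has 5 raters, item Y has 4 raters, and 3 users rated both, then

$$ sim_{X,Y} = \frac{3}{5 + 4 - 3} = 0.5 $$

which is exactly what the jaccardSimilarity function below computes from the three counts.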

User rating prediction formula

$$ pred_{u,p}=\frac{\sum_{i\in{ratedItems(u)}}{sim(i,p)r_{u,i}}}{\sum_{i\in{ratedItems(u)}}{sim(i,p)}} $$

In this formula, u is a user, p is an item, ratedItems(u) is the set of items the user u has rated, sim(i, p) is the similarity between items i and p, and $r_{u,i}$ is user u's rating of item i.
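A minimal Scala sketch of this weighted average, using made-up similarities and ratings rather than anything from the article's dataset:

// The user has rated items 1 and 2; predict the rating for a new item p.
val ratedItems = Seq((1, 4.0), (2, 2.0))   // (itemId, r_{u,i})
val simToP = Map(1 -> 0.8, 2 -> 0.3)       // sim(i, p) for each rated item

val pred = ratedItems.map { case (i, r) => simToP(i) * r }.sum /
           ratedItems.map { case (i, _) => simToP(i) }.sum
// pred = (0.8 * 4.0 + 0.3 * 2.0) / (0.8 + 0.3) ≈ 3.45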

Building the ItemCFModel

Class definition

// Item information
case class Item(itemId: Int, itemName: String)

// User - item - rating
case class Rating(userId: Int, itemId: Int, rating: Float)

// User information
case class User(userId: Int, userName: String)
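As a minimal sketch of how these case classes might be turned into Datasets, assuming a SparkSession named spark is already in scope (the article does not show its creation):

// Assumes: val spark = SparkSession.builder().getOrCreate() (not shown in the article)
import org.apache.spark.sql.Dataset
import spark.implicits._

val ratings: Dataset[Rating] = Seq(
  Rating(1, 10, 5.0f),
  Rating(1, 20, 3.0f),
  Rating(2, 10, 4.0f)
).toDS()

val users: Dataset[User] = Seq(User(1, "alice"), User(2, "bob")).toDS()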

Similarity measures

/**
  * Similarity measures
  */
object SimilarityMeasures {

  /**
    * The co-occurrence similarity between two vectors A, B is
    * |N(i) ∩ N(j)| / sqrt(|N(i)||N(j)|)
    */
  def cooccurrence(numOfRatersForAAndB: Long, numOfRatersForA: Long, numOfRatersForB: Long): Double = {
    numOfRatersForAAndB / math.sqrt(numOfRatersForA * numOfRatersForB)
  }

  /**
    * The correlation between two vectors A, B is
    * cov(A, B) / (stdDev(A) * stdDev(B))
    *
    * This is equivalent to
    * [n * dotProduct(A, B) - sum(A) * sum(B)] /
    * sqrt{ [n * norm(A)^2 - sum(A)^2] [n * norm(B)^2 - sum(B)^2] }
    */
  def correlation(size: Double, dotProduct: Double, ratingSum: Double,
                  rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double): Double = {
    val numerator = size * dotProduct - ratingSum * rating2Sum
    val denominator = scala.math.sqrt(size * ratingNormSq - ratingSum * ratingSum) *
      scala.math.sqrt(size * rating2NormSq - rating2Sum * rating2Sum)

    numerator / denominator
  }

  /**
    * Regularize correlation by adding virtual pseudocounts over a prior:
    * RegularizedCorrelation = w * ActualCorrelation + (1 - w) * PriorCorrelation
    * where w = # actualPairs / (# actualPairs + # virtualPairs).
    */
  def regularizedCorrelation(size: Double, dotProduct: Double, ratingSum: Double,
                             rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double,
                             virtualCount: Double, priorCorrelation: Double): Double = {
    val unregularizedCorrelation = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq)
    val w = size / (size + virtualCount)

    w * unregularizedCorrelation + (1 - w) * priorCorrelation
  }

  /**
    * The cosine similarity between two vectors A, B is
    * dotProduct(A, B) / (norm(A) * norm(B))
    */
  def cosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double): Double = {
    dotProduct / (ratingNorm * rating2Norm)
  }

  /**
    * The improved cosine similarity between two vectors A, B is
    * dotProduct(A, B) * num(A ∩ B) / (norm(A) * norm(B) * num(A) * log10(10 + num(B)))
    */
  def improvedCosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double,
                               numAjoinB: Long, numA: Long, numB: Long): Double = {
    dotProduct * numAjoinB / (ratingNorm * rating2Norm * numA * math.log10(10 + numB))
  }

  /**
    * The Jaccard similarity between two sets A, B is
    * |Intersection(A, B)| / |Union(A, B)|
    */
  def jaccardSimilarity(usersInCommon: Double, totalUsers1: Double, totalUsers2: Double): Double = {
    val union = totalUsers1 + totalUsers2 - usersInCommon
    usersInCommon / union
  }
}
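For a quick sanity check, the measures can be called directly with made-up counts (20 common raters, 100 raters of A, 400 raters of B):

SimilarityMeasures.cooccurrence(20L, 100L, 400L)          // 20 / sqrt(100 * 400) = 0.1
SimilarityMeasures.jaccardSimilarity(20.0, 100.0, 400.0)  // 20 / (100 + 400 - 20) ≈ 0.0417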

Computing item similarities

  def fit(ratings: Dataset[Rating]): ItemCFModel = {
    this.ratings = Option(ratings)

    val numRatersPerItem = ratings.groupBy("itemId").count().alias("nor")
      .coalesce(defaultParallelism)

    // Add the number of raters of each item to the original records
    val ratingsWithSize = ratings.join(numRatersPerItem, "itemId")
      .coalesce(defaultParallelism)

    // Self-join on userId to build all item pairs rated by the same user
    ratingsWithSize.join(ratingsWithSize, "userId")
      .toDF("userId", "item1", "rating1", "nor1", "item2", "rating2", "nor2")
      .selectExpr("userId"
        , "item1", "rating1", "nor1"
        , "item2", "rating2", "nor2"
        , "rating1 * rating2 as product"
        , "pow(rating1, 2) as rating1Pow"
        , "pow(rating2, 2) as rating2Pow")
      .coalesce(defaultParallelism)
      .createOrReplaceTempView("joined")

    // Compute the necessary intermediate data. Note the WHERE clause:
    // only half of the item pairs need to be computed.
    val sparseMatrix = spark.sql(
      """
        |SELECT item1
        |,  item2
        |,  count(userId) as size
        |,  sum(product) as dotProduct
        |,  sum(rating1) as ratingSum1
        |,  sum(rating2) as ratingSum2
        |,  sum(rating1Pow) as ratingSumOfSq1
        |,  sum(rating2Pow) as ratingSumOfSq2
        |,  first(nor1) as nor1
        |,  first(nor2) as nor2
        |FROM joined
        |WHERE item1 < item2
        |GROUP BY item1, item2
      """.stripMargin)
      .coalesce(defaultParallelism)
      .cache()

    // Compute the item similarities
    var sim = sparseMatrix.map(row => {
      val size = row.getAs[Long](2)
      val dotProduct = row.getAs[Double](3)
      val ratingSum1 = row.getAs[Double](4)
      val ratingSum2 = row.getAs[Double](5)
      val ratingSumOfSq1 = row.getAs[Double](6)
      val ratingSumOfSq2 = row.getAs[Double](7)
      val numRaters1 = row.getAs[Long](8)
      val numRaters2 = row.getAs[Long](9)

      val cooc = cooccurrence(size, numRaters1, numRaters2)
      val corr = correlation(size, dotProduct, ratingSum1, ratingSum2, ratingSumOfSq1, ratingSumOfSq2)
      val regCorr = regularizedCorrelation(size, dotProduct, ratingSum1, ratingSum2,
        ratingSumOfSq1, ratingSumOfSq2, PRIOR_COUNT, PRIOR_CORRELATION)
      val cosSim = cosineSimilarity(dotProduct, scala.math.sqrt(ratingSumOfSq1), scala.math.sqrt(ratingSumOfSq2))
      val impCosSim = improvedCosineSimilarity(dotProduct, math.sqrt(ratingSumOfSq1), math.sqrt(ratingSumOfSq2), size, numRaters1, numRaters2)
      val jaccard = jaccardSimilarity(size, numRaters1, numRaters2)

      (row.getInt(0), row.getInt(1), cooc, corr, regCorr, cosSim, impCosSim, jaccard)
    }).toDF("itemId_01", "itemId_02", "cooc", "corr", "regCorr", "cosSim", "impCosSim", "jaccard")

    // The final item similarities: mirror each (item1, item2) pair to
    // (item2, item1) and union, so lookups work in both directions
    sim = sim
      .withColumnRenamed("itemId_01", "itemId_tmp")
      .withColumnRenamed("itemId_02", "itemId_01")
      .withColumnRenamed("itemId_tmp", "itemId_02")
      .select("itemId_01", "itemId_02", "cooc", "corr", "regCorr", "cosSim", "impCosSim", "jaccard")
      .union(sim)
      .repartition(defaultParallelism) // repartition so the data is evenly distributed for downstream use
      .cache()

    similarities = Option(sim)
    this
  }
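The fit method references several members that are not shown in this excerpt (spark, defaultParallelism, PRIOR_COUNT, PRIOR_CORRELATION, ratings, similarities, similarityMeasure). A plausible sketch of what they might look like inside ItemCFModel, purely as an assumption and not taken from the original code:

// Hypothetical members of ItemCFModel referenced above (assumptions, not from the article)
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

private val spark: SparkSession = SparkSession.builder().getOrCreate()
private val defaultParallelism: Int = spark.sparkContext.defaultParallelism
private val PRIOR_COUNT = 10.0          // virtual pairs added when regularizing the correlation
private val PRIOR_CORRELATION = 0.0     // prior correlation to shrink towards
private var ratings: Option[Dataset[Rating]] = None
private var similarities: Option[DataFrame] = None
var similarityMeasure: String = "cooc"  // which similarity column recommendForUsers should use
import spark.implicits._                // needed for the map over Rows and for toDF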

Recommending items to users

  /**
    * Recommend num items for each of the specified users
    *
    * @param users the set of users
    * @param num   the number of items to recommend for each user
    * @return the recommendation table
    */
  def recommendForUsers(users: Dataset[User], num: Int): DataFrame = {
    // similarityMeasure is the name of the similarity column to use
    val sim = similarities.get.select("itemId_01", "itemId_02", similarityMeasure)

    // The rating table
    val rits = ratings.get

    // Project the users first; since the left table is much smaller than the
    // right one, perform a left join
    val project: DataFrame = users
      .selectExpr("userId as user", "userName")
      .join(rits, $"user" <=> rits("userId"), "left")
      .drop($"user")
      // Keep the items each user has rated together with the ratings
      .select("userId", "itemId", "rating")

    // Join with the similarities between the rated items and other items
    project.join(sim, $"itemId" <=> sim("itemId_01"))
      .selectExpr("userId"
        , "itemId_01 as relatedItem"
        , "itemId_02 as otherItem"
        , similarityMeasure
        , s"$similarityMeasure * rating as simProduct")
      .coalesce(defaultParallelism)
      .createOrReplaceTempView("tempTable")

    spark.sql(
      s"""
         |SELECT userId
         |,  otherItem
         |,  sum(simProduct) / sum($similarityMeasure) as rating
         |FROM tempTable
         |GROUP BY userId, otherItem
         |ORDER BY userId asc, rating desc
      """.stripMargin)
      // Keep only the top num results per user and drop NaN ratings
      .rdd
      .map(row => (row.getInt(0), (row.getInt(1), row.getDouble(2))))
      .groupByKey()
      .mapValues(xs => {
        var sequence = Seq[(Int, Double)]()
        val iter = xs.iterator
        var count = 0
        while (iter.hasNext && count < num) {
          val rat = iter.next()
          if (!rat._2.isNaN) // Double.NaN never equals itself, so use isNaN
            sequence :+= (rat._1, rat._2)
          count += 1
        }
        sequence
      })
      .toDF("userId", "recommended")
  }
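A minimal usage sketch, assuming the model exposes the two methods above and that its constructor takes the SparkSession (both assumptions; the article does not show them):

// Hypothetical end-to-end call (constructor and wiring assumed, not shown in the article)
val model = new ItemCFModel(spark)
model.fit(ratings)                                        // ratings: Dataset[Rating]
val recommendations = model.recommendForUsers(users, 10)  // users: Dataset[User]
recommendations.show(false)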

Similarity results

Data Sources

The data comes from MovieLens. The MovieLens dataset is a movie-rating dataset containing users' ratings of top movies from IMDB and The Movie DataBase.
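A sketch of loading the ratings into the Rating dataset defined earlier, assuming the MovieLens 100k layout (u.data is tab-separated: user id, item id, rating, timestamp); the path below is hypothetical:

import org.apache.spark.sql.Dataset
import spark.implicits._

val ratings: Dataset[Rating] = spark.read
  .option("sep", "\t")
  .csv("data/ml-100k/u.data") // hypothetical local path
  .map(row => Rating(row.getString(0).toInt, row.getString(1).toInt, row.getString(2).toFloat))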

Computed similarities between items

The following shows the similarities between other movies and "Star Wars (1977)" computed with co-occurrence similarity, cosine similarity, and the improved cosine similarity (you can test the other measures yourself). Only the first 20 results are shown.

Surprisingly, the cosine similarity results seem rather unsatisfactory, presumably because cosine similarity only considers user ratings (it is better suited to recommendations of the "highly rated 5-star movie" type that do not care about other movie information). It is also possible that my implementation contains an error; corrections are welcome.

Co-occurrence similarity results

movie1 movie2 cooccurrence
Star Wars (1977) Return of the Jedi (1983) 0.8828826458931883
Star Wars (1977) Raiders of the Lost Ark (1981) 0.7679353753201742
Star Wars (1977) The Empire Strikes Back (1980) 0.7458505006229118
Star Wars (1977) Godfather, The (1972) 0.7275434127191666
Star Wars (1977) Fargo (1996) 0.7239858668831711
Star Wars (1977) Independence Day (ID4) (1996) 0.723845113716724
Star Wars (1977) Silence of the Lambs, The (1991) 0.7025515983155468
Star Wars (1977) Indiana Jones and the Last Crusade (1989) 0.6920306174608959
Star Wars (1977) Pulp Fiction (1994) 0.6885437675802282
Star Wars (1977) Star Trek: First Contact (1996) 0.6850249237265413
Star Wars (1977) Back to the Future (1985) 0.6840536741086217
Star Wars (1977) Fugitive, The (1993) 0.6710463728397225
Star Wars (1977) Rock, The (1996) 0.6646215466055597
Star Wars (1977) Terminator, The (1984) 0.6636319257721421
Star Wars (1977) Forrest Gump (1994) 0.6564951869930893
Star Wars (1977) Terminator 2: Judgment Day (1991) 0.653467518885383
Star Wars (1977) Princess Bride, The (1987) 0.6534487891771482
Star Wars (1977) Alien (1979) 0.648232034779792
Star Wars (1977) E.T. the Extra-Terrestrial (1982) 0.6479990753086882
Star Wars (1977) Monty Python and the Holy Grail (1974) 0.6476896799641126

Cosine similarity results

Cosine similarity

movie1 movie2 cosSim
Star Wars (1977) Infinity (1996) 1.0
Star Wars (1977) Monster, The (1994) 1.0
Star Wars (1977) Boys, Les (1997) 1.0
Star Wars (1977) Stranger, (1994) 1.0
Star Wars (1977) Love is all (1996) 1.0
Star Wars (1977) Paris is a woman (1995) 1.0
Star Wars (1977) Victims, A (1937) 1.0
Star Wars (1977) Pie in the Sky (1995) 1.0
Star Wars (1977) Century (1993) 1.0
Star Wars (1977) Angel on my shoulder (1946) 1.0
Star Wars (1977) Here Cookies (1935) 1.0
Star Wars (1977) Power 98 (1995) 1.0
Star Wars (1977) Funny Girl (1943) 1.0
Star Wars (1977) Volcano (1996) 1.0
Star Wars (1977) Memorable summer (1994) 1.0
Star Wars (1977) Innocents, The (1961) 1.0
Star Wars (1977) Sleepover (1995) 1.0
Star Wars (1977) Jupiter's wife (1994) 1.0
Star Wars (1977) My Life and Times with Antonin Alto (En compagnie d'Antonin Artaud) (1993) 1.0
Star Wars (1977) Bent (1997) 1.0

Improved cosine similarity results

Improved cosine similarity

movie1 movie2 impCosSim
Star Wars (1977) Return of the Jedi (1983) 0.6151374130038775
Star Wars (1977) Raiders of the Lost Ark (1981) 0.5139215764696529
Star Wars (1977) Fargo (1996) 0.4978221397190352
Star Wars (1977) Empire Strikes Back, The (1980) 0.47719131109655355
Star Wars (1977) Godfather, The (1972) 0.4769568086870377
Star Wars (1977) Silence of the Lambs, The (1991) 0.449096021012343
Star Wars (1977) Independence Day (ID4) (1996) 0.4334888029282058
Star Wars (1977) Pulp Fiction (1994) 0.43054394420596026
Star Wars (1977) Contact (1997) 0.4093441266211224
Star Wars (1977) Indiana Jones and the Last Crusade (1989) 0.4080635382244593
Star Wars (1977) Back to the Future (1985) 0.4045977014813726
Star Wars (1977) Star Trek: First Contact (1996) 0.40036290288050874
Star Wars (1977) Fugitive, The (1993) 0.3987919640908379
Star Wars (1977) Princess Bride, The (1987) 0.39490206690864144
Star Wars (1977) Rock, The (1996) 0.39100622194841916
Star Wars (1977) Monty Python and the Holy Grail (1974) 0.3799595474408077
Star Wars (1977) Terminator, The (1984) 0.37881311350029406
Star Wars (1977) Forrest Gump (1994) 0.3755685058241706
Star Wars (1977) Terminator 2: Judgment Day (1991) 0.37184317281514295
Star Wars (1977) Jerry Maguire (1996) 0.370478212770262



Author: manlier
Link: https://www.jianshu.com/p/169aad3cfddd
