Jianshu does not support Markdown math syntax; for correctly rendered formulas, please see https://glassywing.github.io/2018/04/10/spark-itemcf/
Introduction
Spark currently supports only one collaborative filtering algorithm, ALS (a model-based collaborative filtering algorithm). ALS does not perform well on some problems, and unlike Mahout, Spark does not offer a variety of recommendation algorithms. To benefit from Spark's speed while still meeting certain business needs, this article builds an ItemCF algorithm on Spark. Spark's newer DataFrame data type also makes the algorithm clearer and easier to implement.
Prerequisites
- This article assumes a basic understanding of how item-based collaborative filtering (ItemCF) is computed. If you are not familiar with ItemCF, see "Neighborhood-based collaborative filtering algorithms: UserCF and ItemCF".
- This article develops with Spark's DataFrame data type rather than RDDs. If you are not familiar with DataFrames, see the "Spark SQL, DataFrames and Datasets Guide".
Common similarity formulas
The most important step in a collaborative filtering algorithm is computing the similarity between items. Different scenarios may call for different similarity formulas; the commonly used ones are listed below:
Co-occurrence similarity (Co-occurrence)
Co-occurrence similarity formula
$$ w(x, y) = \frac{|N(x) \cap N(y)|}{|N(x)|} $$
The denominator is the number of users who like item x, and the numerator is the number of users who like both item x and item y. The formula can therefore be read as the proportion of users interested in x who are also interested in y (similar to association rules).
The formula has a problem, however: if item y is popular and liked by many people, w(x, y) becomes large, close to 1. As a result, every item ends up highly similar to the popular items. The following formula corrects for this:
Improved co-occurrence similarity formula
$$ w(x, y) = \frac{|N(x) \cap N(y)|}{\sqrt{|N(x)||N(y)|}} $$
This formula penalizes the weight of item y, which reduces the chance that many items appear similar to popular items. (It also normalizes the value.)
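The effect of the penalty can be seen on a tiny example. The sketch below uses plain Scala collections rather than Spark; the `likedBy` map and its contents are made-up illustration data, not taken from MovieLens.

```scala
object CooccurrenceExample {
  // Hypothetical toy data: item -> ids of the users who like it.
  val likedBy: Map[String, Set[Int]] = Map(
    "x" -> Set(1, 2),
    "y" -> Set(1, 2, 3, 4, 5, 6, 7, 8) // a popular item
  )

  // Basic form: |N(x) ∩ N(y)| / |N(x)|
  def basic(x: String, y: String): Double =
    (likedBy(x) & likedBy(y)).size.toDouble / likedBy(x).size

  // Improved form: |N(x) ∩ N(y)| / sqrt(|N(x)| * |N(y)|)
  def improved(x: String, y: String): Double =
    (likedBy(x) & likedBy(y)).size / math.sqrt(likedBy(x).size.toDouble * likedBy(y).size)

  def main(args: Array[String]): Unit = {
    println(basic("x", "y"))    // 1.0: every user who likes x also likes the popular y
    println(improved("x", "y")) // 0.5: the popularity of y is penalized
  }
}
```

With the basic formula the popular item y looks like a perfect match for x; the improved formula halves the score because y has four times as many fans as x.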
Euclidean similarity (Euclidean Similarity)
Euclidean similarity is computed from the Euclidean distance: the smaller the distance, the higher the similarity, and vice versa.
Definition of Euclidean distance
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" (i.e. straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as the Pythagorean metric.
Euclidean distance formula
$$ d_{X,Y} = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $$
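A minimal sketch of the distance in plain Scala follows. The article does not say how the distance is turned into a similarity, so the mapping `1 / (1 + distance)` used in `similarity` is an assumption (one common convention), not the author's definition.

```scala
object EuclideanExample {
  // Euclidean distance between two rating vectors of equal length.
  def distance(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.size == y.size, "vectors must have the same length")
    math.sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)
  }

  // Assumed mapping (not from the article): similarity = 1 / (1 + distance),
  // so identical vectors score 1 and the score shrinks as the distance grows.
  def similarity(x: Seq[Double], y: Seq[Double]): Double =
    1.0 / (1.0 + distance(x, y))
}
```

For example, `EuclideanExample.distance(Seq(1.0, 2.0, 3.0), Seq(4.0, 6.0, 3.0))` is the square root of 9 + 16 + 0, i.e. 5.0.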
Pearson similarity
The Pearson correlation coefficient, i.e. the correlation coefficient from probability theory, takes values in [-1, +1]. A value greater than zero indicates a positive correlation between the two variables; a value less than zero indicates a negative correlation.
Definition of the Pearson product-moment correlation coefficient
The Pearson correlation coefficient between two variables is defined as their covariance divided by the product of their standard deviations:
Pearson product-moment correlation coefficient formula
$$ \rho_{X,Y} = \frac{cov(X, Y)}{\sigma_{x} \sigma_{y}} = \frac{E((X - \mu_x)(Y - \mu_y))}{\sigma_{x} \sigma_{y}} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)} \sqrt{E(Y^2) - E^2(Y)}} $$
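Multiplying the numerator and denominator of the last form by n² gives a version that needs only running sums, which is convenient when the sums come out of an aggregation. The following is a sketch in plain Scala; the function name `pearson` and its vector inputs are illustrative choices.

```scala
object PearsonExample {
  // Pearson correlation from sums:
  // [n * sum(xy) - sum(x) * sum(y)] /
  //   sqrt([n * sum(x^2) - sum(x)^2] * [n * sum(y^2) - sum(y)^2])
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.size == y.size && x.nonEmpty, "vectors must be non-empty and of equal length")
    val n = x.size.toDouble
    val sumX = x.sum
    val sumY = y.sum
    val sumXY = x.zip(y).map { case (a, b) => a * b }.sum
    val sumX2 = x.map(a => a * a).sum
    val sumY2 = y.map(b => b * b).sum
    val numerator = n * sumXY - sumX * sumY
    val denominator =
      math.sqrt(n * sumX2 - sumX * sumX) * math.sqrt(n * sumY2 - sumY * sumY)
    numerator / denominator // NaN when either vector has zero variance
  }
}
```

Note that the result is NaN when one of the vectors is constant (for instance, when every common rater gave the same score), so callers usually need to filter such values.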
Cosine similarity (Cosine Similarity)
Cosine similarity is the cosine of the angle that two points in a multidimensional space form with a chosen reference point, and ranges over [-1, 1]. The larger the angle, the farther apart the two points and the lower the similarity; a larger cosine value therefore means a higher similarity.
Definition of the cosine between vectors
The cosine of the angle formed by two points in a multidimensional space and the chosen reference point.
Cosine formula
$$ sim_{X,Y} = \frac{X \cdot Y}{\|X\| \|Y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}} $$
In the formula, $x_i$ is the i-th user's rating of item x; $y_i$ is defined likewise.
The formula takes only the users' ratings into account, so highly rated items tend to rank first regardless of any other information about them. An improved version of the cosine similarity is computed as follows:
Improved cosine similarity formula
$$ sim_{X,Y} = \frac{X \cdot Y \cdot num_{X \cap Y}}{\|X\| \|Y\| \cdot num_{X} \cdot \log_{10}(10 + num_{Y})} $$
The improved formula takes into account the number of users the two vectors have in common, the size of vector X, and the size of vector Y. Note:
$$ sim_{X,Y} \neq sim_{Y,X} $$
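Both variants can be sketched in plain Scala as below. The vectors are assumed to hold the ratings of users who rated both items, and the parameter names `numCommon`, `numX` and `numY` are illustrative stand-ins for the rater counts in the improved formula.

```scala
object CosineExample {
  // Plain cosine similarity over the ratings of users who rated both items.
  def cosine(x: Seq[Double], y: Seq[Double]): Double = {
    val dot = x.zip(y).map { case (a, b) => a * b }.sum
    val normX = math.sqrt(x.map(a => a * a).sum)
    val normY = math.sqrt(y.map(b => b * b).sum)
    dot / (normX * normY)
  }

  // Improved variant: scaled by the number of common raters (numCommon) and
  // by the total rater counts of each item (numX, numY).
  def improvedCosine(x: Seq[Double], y: Seq[Double],
                     numCommon: Long, numX: Long, numY: Long): Double =
    cosine(x, y) * numCommon / (numX * math.log10(10.0 + numY))
}
```

Two properties are worth noticing. With a single common rater the plain cosine is always 1.0, whatever the two ratings are, e.g. `cosine(Seq(5.0), Seq(3.0))`; this may be one reason pairs with very few common raters can reach the maximum score. And swapping the roles of X and Y changes the improved score, which illustrates the asymmetry noted above.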
Tanimoto similarity (Jaccard coefficient)
The Tanimoto similarity, also known as the Jaccard coefficient, is an extension of cosine similarity and is also widely used for document similarity. It ignores the rating values and considers only the number of elements the two sets have in common.
Jaccard coefficient formula
$$ sim(X, Y) = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} $$
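On sets of user ids the coefficient is a one-liner. In this sketch, returning 0.0 for two empty sets is an assumed convention chosen to avoid a division by zero; the article does not cover that case.

```scala
object JaccardExample {
  // |X ∩ Y| / (|X| + |Y| - |X ∩ Y|), computed on sets of user ids.
  def jaccard(x: Set[Int], y: Set[Int]): Double = {
    val common = (x & y).size
    val union = x.size + y.size - common
    // Assumed convention: two empty sets have similarity 0.
    if (union == 0) 0.0 else common.toDouble / union
  }
}
```

For instance, `JaccardExample.jaccard(Set(1, 2, 3), Set(2, 3, 4))` has two common users out of four distinct ones, giving 0.5.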
User rating prediction formula
$$ pred_{u,p}=\frac{\sum_{i\in{ratedItems(u)}}{sim(i,p)r_{u,i}}}{\sum_{i\in{ratedItems(u)}}{sim(i,p)}} $$
In the formula, u is the user, p is the item to be scored, ratedItems(u) is the set of items that user u has rated, sim is the similarity between two items, and $r_{u,i}$ is user u's rating of item i.
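The prediction is a similarity-weighted average of the user's own ratings. A minimal sketch in plain Scala is shown below; the two `Map` inputs are a hypothetical data layout chosen for illustration, and returning `None` when no similarity is available is a design choice of this sketch.

```scala
object PredictionExample {
  // ratedByUser: item id -> rating given by user u.
  // simToTarget: item id -> sim(i, p) for the target item p.
  def predict(ratedByUser: Map[Int, Double],
              simToTarget: Map[Int, Double]): Option[Double] = {
    // Only rated items with a known similarity to p contribute.
    val terms = ratedByUser.toSeq.collect {
      case (item, rating) if simToTarget.contains(item) =>
        val s = simToTarget(item)
        (s * rating, s)
    }
    val weightSum = terms.map(_._2).sum
    if (weightSum == 0.0) None // no usable neighbours, so no prediction
    else Some(terms.map(_._1).sum / weightSum)
  }
}
```

With ratings `Map(1 -> 4.0, 2 -> 2.0)` and similarities `Map(1 -> 0.5, 2 -> 0.5)`, the prediction is (0.5 * 4.0 + 0.5 * 2.0) / (0.5 + 0.5) = 3.0.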
Construction of ItemCFModel
Class definition

    // Item information
    case class Item(itemId: Int, itemName: String)

    // User-item rating
    case class Rating(userId: Int, itemId: Int, rating: Float)

    // User information
    case class User(userId: Int, userName: String)

Similarity metric

    /**
      * Similarity measures
      */
    object SimilarityMeasures {

      /**
        * The co-occurrence similarity between two vectors A, B is
        * |N(i) ∩ N(j)| / sqrt(|N(i)||N(j)|)
        */
      def cooccurrence(numOfRatersForAAndB: Long, numOfRatersForA: Long, numOfRatersForB: Long): Double = {
        numOfRatersForAAndB / math.sqrt(numOfRatersForA * numOfRatersForB)
      }

      /**
        * The correlation between two vectors A, B is
        * cov(A, B) / (stdDev(A) * stdDev(B))
        *
        * This is equivalent to
        * [n * dotProduct(A, B) - sum(A) * sum(B)] /
        * sqrt{ [n * norm(A)^2 - sum(A)^2] [n * norm(B)^2 - sum(B)^2] }
        */
      def correlation(size: Double, dotProduct: Double, ratingSum: Double,
                      rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double): Double = {
        val numerator = size * dotProduct - ratingSum * rating2Sum
        val denominator = scala.math.sqrt(size * ratingNormSq - ratingSum * ratingSum) *
          scala.math.sqrt(size * rating2NormSq - rating2Sum * rating2Sum)
        numerator / denominator
      }

      /**
        * Regularize correlation by adding virtual pseudocounts over a prior:
        * RegularizedCorrelation = w * ActualCorrelation + (1 - w) * PriorCorrelation
        * where w = # actualPairs / (# actualPairs + # virtualPairs).
        */
      def regularizedCorrelation(size: Double, dotProduct: Double, ratingSum: Double,
                                 rating2Sum: Double, ratingNormSq: Double, rating2NormSq: Double,
                                 virtualCount: Double, priorCorrelation: Double): Double = {
        val unregularizedCorrelation = correlation(size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq)
        val w = size / (size + virtualCount)
        w * unregularizedCorrelation + (1 - w) * priorCorrelation
      }

      /**
        * The cosine similarity between two vectors A, B is
        * dotProduct(A, B) / (norm(A) * norm(B))
        */
      def cosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double): Double = {
        dotProduct / (ratingNorm * rating2Norm)
      }

      /**
        * The improved cosine similarity between two vectors A, B is
        * dotProduct(A, B) * num(A ∩ B) / (norm(A) * norm(B) * num(A) * log10(10 + num(B)))
        */
      def improvedCosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double,
                                   numAjoinB: Long, numA: Long, numB: Long): Double = {
        dotProduct * numAjoinB / (ratingNorm * rating2Norm * numA * math.log10(10 + numB))
      }

      /**
        * The Jaccard similarity between two sets A, B is
        * |Intersection(A, B)| / |Union(A, B)|
        */
      def jaccardSimilarity(usersInCommon: Double, totalUsers1: Double, totalUsers2: Double): Double = {
        val union = totalUsers1 + totalUsers2 - usersInCommon
        usersInCommon / union
      }
    }

Computing item similarities

    // Assumes spark.implicits._ and SimilarityMeasures._ are imported in the enclosing class.
    def fit(ratings: Dataset[Rating]): ItemCFModel = {
      this.ratings = Option(ratings)

      val numRatersPerItem = ratings.groupBy("itemId").count().alias("nor")
        .coalesce(defaultParallelism)

      // Add the number of raters of each item to the original records
      val ratingsWithSize = ratings.join(numRatersPerItem, "itemId")
        .coalesce(defaultParallelism)

      // Self-join on userId to pair up the items rated by the same user
      ratingsWithSize.join(ratingsWithSize, "userId")
        .toDF("userId", "item1", "rating1", "nor1", "item2", "rating2", "nor2")
        .selectExpr("userId"
          , "item1", "rating1", "nor1"
          , "item2", "rating2", "nor2"
          , "rating1 * rating2 as product"
          , "pow(rating1, 2) as rating1Pow"
          , "pow(rating2, 2) as rating2Pow")
        .coalesce(defaultParallelism)
        .createOrReplaceTempView("joined")

      // Compute the necessary intermediate data. Note the WHERE clause:
      // it keeps only one ordering of each pair, halving the amount of data.
      val sparseMatrix = spark.sql(
        """
          |SELECT item1
          |, item2
          |, count(userId) as size
          |, sum(product) as dotProduct
          |, sum(rating1) as ratingSum1
          |, sum(rating2) as ratingSum2
          |, sum(rating1Pow) as ratingSumOfSq1
          |, sum(rating2Pow) as ratingSumOfSq2
          |, first(nor1) as nor1
          |, first(nor2) as nor2
          |FROM joined
          |WHERE item1 < item2
          |GROUP BY item1, item2
        """.stripMargin)
        .coalesce(defaultParallelism)
        .cache()

      // Compute the item similarities
      val sim = sparseMatrix.map(row => {
        val size = row.getAs[Long](2)
        val dotProduct = row.getAs[Double](3)
        val ratingSum1 = row.getAs[Double](4)
        val ratingSum2 = row.getAs[Double](5)
        val ratingSumOfSq1 = row.getAs[Double](6)
        val ratingSumOfSq2 = row.getAs[Double](7)
        val numRaters1 = row.getAs[Long](8)
        val numRaters2 = row.getAs[Long](9)

        val cooc = cooccurrence(size, numRaters1, numRaters2)
        val corr = correlation(size, dotProduct, ratingSum1, ratingSum2, ratingSumOfSq1, ratingSumOfSq2)
        val regCorr = regularizedCorrelation(size, dotProduct, ratingSum1, ratingSum2,
          ratingSumOfSq1, ratingSumOfSq2, PRIOR_COUNT, PRIOR_CORRELATION)
        val cosSim = cosineSimilarity(dotProduct, scala.math.sqrt(ratingSumOfSq1), scala.math.sqrt(ratingSumOfSq2))
        // Both norms are square roots of the sums of squares
        val impCosSim = improvedCosineSimilarity(dotProduct, math.sqrt(ratingSumOfSq1),
          math.sqrt(ratingSumOfSq2), size, numRaters1, numRaters2)
        val jaccard = jaccardSimilarity(size, numRaters1, numRaters2)
        (row.getInt(0), row.getInt(1), cooc, corr, regCorr, cosSim, impCosSim, jaccard)
      }).toDF("itemId_01", "itemId_02", "cooc", "corr", "regCorr", "cosSim", "impCosSim", "jaccard")

      // Final item similarities: mirror every (item1, item2) row so that both
      // directions are present. union matches columns by position, so the two
      // id columns are swapped with selectExpr rather than chained renames.
      val mirrored = sim.selectExpr("itemId_02 as itemId_01", "itemId_01 as itemId_02",
        "cooc", "corr", "regCorr", "cosSim", "impCosSim", "jaccard")
      val allSims = mirrored.union(sim)
        .repartition(defaultParallelism) // repartition to spread the data evenly for downstream use
        .cache()
      similarities = Option(allSims)
      this
    }

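The core of the step above is a self-join on the user followed by a per-pair aggregation. The same idea can be sketched without Spark on plain collections; `ToyRating`, `PairStats` and `pairStats` are hypothetical names introduced only for this illustration, and ratings are `Double` here for simplicity.

```scala
object PairStatsExample {
  final case class ToyRating(userId: Int, itemId: Int, rating: Double)
  final case class PairStats(size: Long, dotProduct: Double, sumSq1: Double, sumSq2: Double)

  // For every item pair (item1 < item2) rated by at least one common user,
  // accumulate the co-rater count, the dot product and the sums of squares.
  def pairStats(ratings: Seq[ToyRating]): Map[(Int, Int), PairStats] = {
    val pairs = for {
      byUser <- ratings.groupBy(_.userId).values.toSeq // "join" on userId
      a <- byUser
      b <- byUser
      if a.itemId < b.itemId // keep one ordering of each pair
    } yield ((a.itemId, b.itemId), (a.rating, b.rating))

    pairs.groupBy(_._1).map { case (key, rows) =>
      val rs = rows.map(_._2)
      key -> PairStats(
        size = rs.size.toLong,
        dotProduct = rs.map { case (r1, r2) => r1 * r2 }.sum,
        sumSq1 = rs.map { case (r1, _) => r1 * r1 }.sum,
        sumSq2 = rs.map { case (_, r2) => r2 * r2 }.sum
      )
    }
  }
}
```

Every similarity measure can then be derived from these few aggregates, which is why a single pass over the joined data is enough. The per-user cross product is quadratic in the number of items a user has rated, so very active users dominate the cost.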
User recommendations

    /**
      * Recommend a given number of items to each of the given users.
      *
      * @param users the set of users
      * @param num   number of items to recommend to each user
      * @return the recommendation table
      */
    def recommendForUsers(users: Dataset[User], num: Int): DataFrame = {
      // similarityMeasure is the name of the similarity algorithm
      val sim = similarities.get.select("itemId_01", "itemId_02", similarityMeasure)
      // Get the ratings table
      val rits = ratings.get

      val project: DataFrame = users
        .selectExpr("userId as user", "userName")
        // Sub-projection: the left table is much smaller than the right one, so use a left join
        .join(rits, $"user" <=> rits("userId"), "left")
        .drop($"user")
        // Select the user together with the rated items and ratings
        .select("userId", "itemId", "rating")

      // Get the similarity between the items the user rated and the other items
      project.join(sim, $"itemId" <=> sim("itemId_01"))
        .selectExpr("userId"
          , "itemId_01 as relatedItem"
          , "itemId_02 as otherItem"
          , similarityMeasure
          , s"$similarityMeasure * rating as simProduct")
        .coalesce(defaultParallelism)
        .createOrReplaceTempView("tempTable")

      spark.sql(
        s"""
           |SELECT userId
           |, otherItem
           |, sum(simProduct) / sum($similarityMeasure) as rating
           |FROM tempTable
           |GROUP BY userId, otherItem
           |ORDER BY userId asc, rating desc
        """.stripMargin)
        // Filter the results
        .rdd
        .map(row => (row.getInt(0), (row.getInt(1), row.getDouble(2))))
        .groupByKey()
        .mapValues(xs =>
          // groupByKey does not guarantee the order of values, so sort again here.
          // NaN never compares equal to itself, so test with isNaN rather than != Double.NaN.
          xs.filter(rat => !rat._2.isNaN).toSeq.sortBy(rat => -rat._2).take(num))
        .toDF("userId", "recommended")
    }

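The last stage, keeping the best `num` predictions for each user, can also be sketched without Spark. The flat `(userId, itemId, predictedRating)` triples below are an assumed layout for illustration only.

```scala
object TopNExample {
  // predictions: (userId, itemId, predictedRating)
  def topN(predictions: Seq[(Int, Int, Double)], num: Int): Map[Int, Seq[(Int, Double)]] =
    predictions
      .filterNot { case (_, _, score) => score.isNaN } // NaN never equals itself, so use isNaN
      .groupBy { case (user, _, _) => user }
      .map { case (user, rows) =>
        user -> rows
          .map { case (_, item, score) => (item, score) }
          .sortBy { case (_, score) => -score } // highest predicted rating first
          .take(num)
      }
}
```

Sorting inside each group, instead of relying on an earlier global ordering, keeps the result correct even if the grouping step reorders rows.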
Similarity results
Data source
The data comes from MovieLens. The MovieLens dataset is a dataset of movie ratings, containing users' ratings of movies obtained from IMDB and The Movie DataBase.
Computed item similarities
The following shows the similarity between other movies and "Star Wars (1977)" computed with co-occurrence similarity, cosine similarity, and improved cosine similarity (the other measures are left for you to try). Only the top 20 results are shown.
Surprisingly, the cosine similarity results look rather unsatisfactory. This seems to be because cosine similarity depends only on user ratings (it is better suited to recommending five-star movies regardless of genre and so on). It may also be that my implementation contains a mistake; if so, corrections are welcome.
Co-occurrence similarity results
movie1 | movie2 | cooccurrence |
---|---|---|
Star Wars (1977) | Return of the Jedi (1983) | 0.8828826458931883 |
Star Wars (1977) | Raiders of the Lost Ark (1981) | 0.7679353753201742 |
Star Wars (1977) | Empire Strikes Back, The (1980) | 0.7458505006229118 |
Star Wars (1977) | Godfather, The (1972) | 0.7275434127191666 |
Star Wars (1977) | Fargo (1996) | 0.7239858668831711 |
Star Wars (1977) | Independence Day (ID4) (1996) | 0.723845113716724 |
Star Wars (1977) | Silence of the Lambs, The (1991) | 0.7025515983155468 |
Star Wars (1977) | Indiana Jones and the Last Crusade (1989) | 0.6920306174608959 |
Star Wars (1977) | Pulp Fiction (1994) | 0.6885437675802282 |
Star Wars (1977) | Star Trek: First Contact (1996) | 0.6850249237265413 |
Star Wars (1977) | Back to the Future (1985) | 0.6840536741086217 |
Star Wars (1977) | Fugitive, The (1993) | 0.6710463728397225 |
Star Wars (1977) | Rock, The (1996) | 0.6646215466055597 |
Star Wars (1977) | Terminator, The (1984) | 0.6636319257721421 |
Star Wars (1977) | Forrest Gump (1994) | 0.6564951869930893 |
Star Wars (1977) | Terminator 2: Judgment Day (1991) | 0.653467518885383 |
Star Wars (1977) | Princess Bride, The (1987) | 0.6534487891771482 |
Star Wars (1977) | Alien (1979) | 0.648232034779792 |
Star Wars (1977) | E.T. the Extra-Terrestrial (1982) | 0.6479990753086882 |
Star Wars (1977) | Monty Python and the Holy Grail (1974) | 0.6476896799641126 |
Cosine similarity results
movie1 | movie2 | cosSim |
---|---|---|
Star Wars (1977) | Infinity (1996) | 1.0 |
Star Wars (1977) | Monster, The (1994) | 1.0 |
Star Wars (1977) | Boys, Les (1997) | 1.0 |
Star Wars (1977) | Stranger, The (1994) | 1.0 |
Star Wars (1977) | Love is all (1996) | 1.0 |
Star Wars (1977) | Paris Was a Woman (1995) | 1.0 |
Star Wars (1977) | Victims, A (1937) | 1.0 |
Star Wars (1977) | Pie in the Sky (1995) | 1.0 |
Star Wars (1977) | Century (1993) | 1.0 |
Star Wars (1977) | Angel on My Shoulder (1946) | 1.0 |
Star Wars (1977) | Here Comes Cookie (1935) | 1.0 |
Star Wars (1977) | Power 98 (1995) | 1.0 |
Star Wars (1977) | Funny Girl (1943) | 1.0 |
Star Wars (1977) | Volcano (1996) | 1.0 |
Star Wars (1977) | Memorable summer (1994) | 1.0 |
Star Wars (1977) | Innocents, The (1961) | 1.0 |
Star Wars (1977) | Sleepover (1995) | 1.0 |
Star Wars (1977) | Jupiter's Wife (1994) | 1.0 |
Star Wars (1977) | My Life and Times With Antonin Artaud (En compagnie d'Antonin Artaud) (1993) | 1.0 |
Star Wars (1977) | Bent (1997) | 1.0 |
Improved cosine similarity results
movie1 | movie2 | impCosSim |
---|---|---|
Star Wars (1977) | Return of the Jedi (1983) | 0.6151374130038775 |
Star Wars (1977) | Raiders of the Lost Ark (1981) | 0.5139215764696529 |
Star Wars (1977) | Fargo (1996) | 0.4978221397190352 |
Star Wars (1977) | Empire Strikes Back, The (1980) | 0.47719131109655355 |
Star Wars (1977) | Godfather, The (1972) | 0.4769568086870377 |
Star Wars (1977) | Silence of the Lambs, The (1991) | 0.449096021012343 |
Star Wars (1977) | Independence Day (ID4) (1996) | 0.4334888029282058 |
Star Wars (1977) | Pulp Fiction (1994) | 0.43054394420596026 |
Star Wars (1977) | Contact (1997) | 0.4093441266211224 |
Star Wars (1977) | Indiana Jones and the Last Crusade (1989) | 0.4080635382244593 |
Star Wars (1977) | Back to the Future (1985) | 0.4045977014813726 |
Star Wars (1977) | Star Trek: First Contact (1996) | 0.40036290288050874 |
Star Wars (1977) | Fugitive, The (1993) | 0.3987919640908379 |
Star Wars (1977) | Princess Bride, The (1987) | 0.39490206690864144 |
Star Wars (1977) | Rock, The (1996) | 0.39100622194841916 |
Star Wars (1977) | Monty Python and the Holy Grail (1974) | 0.3799595474408077 |
Star Wars (1977) | Terminator, The (1984) | 0.37881311350029406 |
Star Wars (1977) | Forrest Gump (1994) | 0.3755685058241706 |
Star Wars (1977) | Terminator 2: Judgment Day (1991) | 0.37184317281514295 |
Star Wars (1977) | Jerry Maguire (1996) | 0.370478212770262 |
Author: manlier
Link: https://www.jianshu.com/p/169aad3cfddd