Collaborative filtering algorithm in practice

1. Collaborative filtering idea:

Collaborative filtering is a well-known and widely used recommendation algorithm. It mines users' historical behavior data to discover their preferences, then predicts which products a user is likely to enjoy and recommends them. This is the idea behind common features such as "Guess you like" and "People who bought this product also liked". It is mainly realized in three ways:

●Recommend items based on people who share your preferences

●Recommend items similar to the ones you already like

●Combine the two approaches above into a blended recommendation

Users act as a bridge that reveals how different items relate to each other. If you are visiting itemA, look up which users have visited itemA, then look at which other items those users have viewed in the past, for example items B, C and D. Following the inverted-index idea, this yields an entry itemA = itemB, itemC, itemD that can be loaded directly into the database (an inverted table obtained by word segmentation can be loaded into the database in the same way).
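
The lookup described above can be sketched in a few lines of Python. This is only an illustration: the visit log, the user names and the helper related_items are made up, and a real system would read the log from storage and write the result to a database.

```python
from collections import defaultdict

# Hypothetical browsing log of (user, item) pairs; all names are made up.
visits = [
    ("u1", "itemA"), ("u1", "itemB"), ("u1", "itemC"),
    ("u2", "itemA"), ("u2", "itemD"),
    ("u3", "itemB"), ("u3", "itemD"),
]

# Forward index: user -> items seen. Inverted index: item -> users who saw it.
user_items = defaultdict(set)
item_users = defaultdict(set)
for user, item in visits:
    user_items[user].add(item)
    item_users[item].add(user)


def related_items(item):
    """Items reachable from `item` through any user who visited it."""
    related = set()
    for user in item_users[item]:
        related |= user_items[user]
    related.discard(item)  # an item is not related to itself
    return sorted(related)


# item -> related items, ready to be written to a key-value store.
related = {item: related_items(item) for item in sorted(item_users)}
print(related["itemA"])  # ['itemB', 'itemC', 'itemD']
```

With this toy log, itemA maps to itemB, itemC and itemD, which is the entry shape described in the paragraph above.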

Collaborative filtering algorithm classification:

This article discusses the following two cases:
1. User-based CF (user-based collaborative filtering): recommendations between like-minded users ("friends")
2. Item-based CF (item-based collaborative filtering): recommending items related to the ones a user bought in the past
A score expresses the degree of liking: the larger the number, the more the user likes the item.

User-based CF: not all friends are equally close, so first compute the similarity between users, then use it to predict (estimate) the score a user would give an item.
Item-based CF: recommend items related to those bought in the past by building an item-to-item similarity matrix.
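
As a rough illustration of the user-based case, the sketch below computes cosine similarity between users and predicts a missing score as a similarity-weighted average. The ratings, the user names and the choice of a weighted average are assumptions made for this example, not something prescribed by the article; the item-based case works the same way with the roles of users and items swapped.

```python
import math

# Toy ratings (user -> {item: score}); all names and numbers are made up.
ratings = {
    "alice": {"A": 5.0, "B": 3.0},
    "bob": {"A": 4.0, "B": 3.0, "C": 4.0},
    "carol": {"A": 1.0, "C": 2.0},
}


def cosine(r_u, r_v):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(r_u) & set(r_v)
    dot = sum(r_u[i] * r_v[i] for i in common)
    norm_u = math.sqrt(sum(s * s for s in r_u.values()))
    norm_v = math.sqrt(sum(s * s for s in r_v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(user, item):
    """Similarity-weighted average of the other users' scores for `item`."""
    num = den = 0.0
    for other, r_other in ratings.items():
        if other == user or item not in r_other:
            continue
        w = cosine(ratings[user], r_other)
        num += w * r_other[item]
        den += abs(w)
    return num / den if den else 0.0


# alice never rated C; bob is more similar to her than carol, so the
# estimate lands closer to bob's 4.0 than to carol's 2.0.
print(round(predict("alice", "C"), 3))
```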

Collaborative filtering practice:

Code download: link: https://pan.baidu.com/s/1oIA5sPsBhtYfCpRYg1idyA Extraction code: 1234

1. MapReduce (MR) implementation of collaborative filtering (more code to write), split into three map/reduce steps.
1. Normalize the UI (user-item) matrix

1_gen_ui_map.py
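#!/usr/local/bin/python
# Mapper: read "user,item,score" lines and emit "item<TAB>user<TAB>score"
# so that the reducer receives all scores of one item together.

import sys

for line in sys.stdin:
    # ss = line.strip().split('\t')  # use this line for tab-separated input
    ss = line.strip().split(',')
    if len(ss) != 3:
        continue
    u, i, s = ss
    print("%s\t%s\t%s" % (i, u, s))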

1_gen_ui_reduce
#!/usr/local/bin/python
# Reducer: for each item, divide every score by the L2 norm of the item's
# scores and emit "user<TAB>item<TAB>normalized_score".

import sys
import math

cur_item = None
user_score_list = []


def emit(item, user_scores):
    # Write the normalized scores of one item.
    norm = math.sqrt(sum(s * s for _, s in user_scores))
    if norm == 0:
        return  # all scores are zero, nothing to normalize
    for u, s in user_scores:
        print("%s\t%s\t%s" % (u, item, s / norm))


for line in sys.stdin:
    item, user, score = line.strip().split("\t")
    if cur_item is None:
        cur_item = item
    if item != cur_item:
        # Input is sorted by item, so a new key means the previous item is complete.
        emit(cur_item, user_score_list)
        user_score_list = []
        cur_item = item

    user_score_list.append((user, float(score)))

# Flush the last item.
if cur_item is not None:
    emit(cur_item, user_score_list)
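
Why normalize first: the following is a sketch of the reasoning, using notation that does not appear in the original ($r_{ui}$ is the score user $u$ gave item $i$, with a missing score counted as 0, and $\hat{r}_{ui}$ is the normalized score). Dividing each item's scores by their L2 norm here means that the later steps, which multiply scores pair by pair and sum the products, produce the cosine similarity between the score vectors of two items:

```latex
\[
\mathrm{sim}(a,b)
= \frac{\sum_{u} r_{ua}\, r_{ub}}
       {\sqrt{\sum_{u} r_{ua}^{2}}\;\sqrt{\sum_{u} r_{ub}^{2}}}
= \sum_{u} \hat{r}_{ua}\, \hat{r}_{ub},
\qquad
\hat{r}_{ui} = \frac{r_{ui}}{\sqrt{\sum_{u'} r_{u'i}^{2}}}
\]
```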

2. Pair by pair

2_gen_ii_pair_map
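#!/usr/local/bin/python
# Identity mapper: pass "user<TAB>item<TAB>score" through unchanged so that
# the shuffle groups the records by user.
import sys

for line in sys.stdin:
    u, i, s = line.strip().split('\t')
    print("%s\t%s\t%s" % (u, i, s))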

2_gen_ii_pair_reduce
#!/usr/local/bin/python
# Reducer: for each user, emit the score product of every ordered item pair.

import sys

cur_user = None
item_score_list = []


def emit_pairs(item_scores):
    # Emit both (a, b) and (b, a) so that each item later gets its own neighbour list.
    for i in range(len(item_scores) - 1):
        for j in range(i + 1, len(item_scores)):
            item_a, score_a = item_scores[i]
            item_b, score_b = item_scores[j]
            print("%s\t%s\t%s" % (item_a, item_b, score_a * score_b))
            print("%s\t%s\t%s" % (item_b, item_a, score_a * score_b))


for line in sys.stdin:
    user, item, score = line.strip().split("\t")
    if cur_user is None:
        cur_user = user
    if user != cur_user:
        # Input is sorted by user, so a new key means the previous user is complete.
        emit_pairs(item_score_list)
        item_score_list = []
        cur_user = user

    item_score_list.append((item, float(score)))

# Flush the last user.
emit_pairs(item_score_list)

3. Sum calculation, equivalent to wordcount

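cat 3_sum_map.py
#!/usr/local/bin/python
# Mapper: merge the two item ids into one key so that identical pairs are
# grouped together, exactly like words in word count.

import sys

# The separator character was lost in the original listing; "\x01" is an
# assumption. Any character that never occurs in an item id works, as long as
# the mapper and the reducer use the same one.
SEP = "\x01"

for line in sys.stdin:
    i_a, i_b, s = line.strip().split('\t')
    print("%s\t%s" % (i_a + SEP + i_b, s))

cat 3_sum_reduce.py
#!/usr/local/bin/python
# Reducer: sum the partial products of each item pair.

import sys

SEP = "\x01"  # assumption: must match the separator used in 3_sum_map.py

cur_ii_pair = None
score = 0.0


def emit(ii_pair, total):
    # Split the merged key back into the two item ids and write the sum.
    ss = ii_pair.split(SEP)
    if len(ss) != 2:
        return  # malformed key, skip it
    print("%s\t%s\t%s" % (ss[0], ss[1], total))


for line in sys.stdin:
    ii_pair, s = line.strip().split("\t")
    if cur_ii_pair is None:
        cur_ii_pair = ii_pair
    if ii_pair != cur_ii_pair:
        emit(cur_ii_pair, score)
        cur_ii_pair = ii_pair
        score = 0.0

    score += float(s)

# Flush the last pair.
if cur_ii_pair is not None:
    emit(cur_ii_pair, score)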
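
The three stages can be checked locally before running them on a cluster. The sketch below replays normalize, pair and sum in memory on a few made-up (user, item, score) triples and compares the result with a cosine similarity computed directly. It is a simplified stand-in for the streaming scripts, not a replacement for them; the data and the helper direct_cosine are assumptions for illustration.

```python
import math
from collections import defaultdict
from itertools import permutations

# Made-up (user, item, score) triples standing in for the input file.
triples = [
    ("u1", "A", 5.0), ("u1", "B", 3.0),
    ("u2", "A", 4.0), ("u2", "B", 3.0), ("u2", "C", 4.0),
    ("u3", "A", 1.0), ("u3", "C", 2.0),
]

# Stage 1: group by item and divide each score by the item's L2 norm.
by_item = defaultdict(list)
for u, i, s in triples:
    by_item[i].append((u, s))
normalized = []
for i, us in by_item.items():
    norm = math.sqrt(sum(s * s for _, s in us))
    normalized += [(u, i, s / norm) for u, s in us]

# Stage 2: group by user and emit the product for every ordered item pair.
by_user = defaultdict(list)
for u, i, s in normalized:
    by_user[u].append((i, s))
pairs = []
for items in by_user.values():
    for (a, sa), (b, sb) in permutations(items, 2):
        pairs.append(((a, b), sa * sb))

# Stage 3: sum the products per pair, as in word count.
sim = defaultdict(float)
for key, value in pairs:
    sim[key] += value


def direct_cosine(a, b):
    """Reference value: cosine between two item columns computed in one go."""
    ra, rb = dict(by_item[a]), dict(by_item[b])
    dot = sum(ra[u] * rb[u] for u in set(ra) & set(rb))
    norm_a = math.sqrt(sum(v * v for v in ra.values()))
    norm_b = math.sqrt(sum(v * v for v in rb.values()))
    return dot / (norm_a * norm_b)


for (a, b), s in sorted(sim.items()):
    print(a, b, round(s, 3), round(direct_cosine(a, b), 3))
```

The last two columns of each printed row should agree, which is the point of normalizing in the first stage.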

2. Spark collaborative filtering (the code is easier to write)

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer
import scala.math._

object cf {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[2]")
    conf.setAppName("CF Spark")

    val sc = new SparkContext(conf)
    val lines = sc.textFile(args(0))
    val output_path = args(1).toString

    val max_prefs_per_user = 20
    val topn = 5

    // Step 1. normalization
    val ui_rdd = lines.map { x =>
      val ss = x.split("\t")
      val userid = ss(0).toString
      val itemid = ss(1).toString
      val score = ss(2).toDouble

      (userid, (itemid, score))
    }.groupByKey().flatMap { x =>
      val userid = x._1
      val is_list = x._2

      // is_arr holds (itemid, score) pairs
      val is_arr = is_list.toArray
      var is_arr_len = is_arr.length
      // keep at most max_prefs_per_user items per user
      if (is_arr_len > max_prefs_per_user) {
        is_arr_len = max_prefs_per_user
      }

      var i_us_arr = new ArrayBuffer[(String, (String, Double))]
      for (i <- 0 until is_arr_len) {
        val itemid = is_arr(i)._1
        val score = is_arr(i)._2
        i_us_arr += ((itemid, (userid, score)))
      }
      i_us_arr
    }.groupByKey().flatMap { x =>
      val itemid = x._1
      val us_list = x._2
      val us_arr = us_list.toArray
      // L2 norm of the item's scores
      var sum: Double = 0.0
      for (i <- 0 until us_arr.length) {
        sum += pow(us_arr(i)._2, 2)
      }
      sum = sqrt(sum)

      var u_is_arr = new ArrayBuffer[(String, (String, Double))]
      for (i <- 0 until us_arr.length) {
        val userid = us_arr(i)._1
        val score = us_arr(i)._2 / sum
        u_is_arr += ((userid, (itemid, score)))
      }
      u_is_arr
    }.groupByKey()

    // Step 2. pairs
    val pairs_rdd = ui_rdd.flatMap { x =>
      val is_arr = x._2.toArray

      var ii_s_arr = new ArrayBuffer[((String, String), Double)]
      for (i <- 0 until is_arr.length - 1) {
        for (j <- i + 1 until is_arr.length) {
          val item_a = is_arr(i)._1
          val item_b = is_arr(j)._1
          val score_a = is_arr(i)._2
          val score_b = is_arr(j)._2

          // emit both directions so every item gets its own neighbour list
          ii_s_arr += (((item_a, item_b), score_a * score_b))
          ii_s_arr += (((item_b, item_a), score_a * score_b))
        }
      }
      ii_s_arr
    }.groupByKey()

    // Step 3. sum and output
    pairs_rdd.map { x =>
      val ii_pair = x._1
      val s_list = x._2
      val s_arr = s_list.toArray
      val len = s_arr.length
      var score: Double = 0.0

      for (i <- 0 until len) {
        score += s_arr(i)
      }
      val item_a = ii_pair._1
      val item_b = ii_pair._2
      (item_a, (item_b, score))
    }.groupByKey().map { x =>
      val item_a = x._1
      val rec_list = x._2
      // sort the neighbours by similarity, highest first
      val rec_arr = rec_list.toArray.sortWith(_._2 > _._2)

      var len = rec_arr.length
      // keep the top 5 (topn)
      if (len > topn) {
        len = topn
      }

      val s = new StringBuilder
      for (i <- 0 until len) {
        val item = rec_arr(i)._1
        val score = "%1.3f".format(rec_arr(i)._2)
        s.append(item + ":" + score)
        if (i != len - 1) {
          s.append(",")
        }
      }

      item_a + "\t" + s
    }.saveAsTextFile(output_path)
  }
}
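
Each line the job writes has the shape item_a, a tab, then comma-separated item:score fields. Below is a minimal Python sketch for reading such a line back, for example when loading the result into a lookup table. The function name parse_rec_line and the sample line are made up for illustration, and the sketch assumes item ids contain no commas.

```python
def parse_rec_line(line):
    """Split one output line into (item, [(neighbour, score), ...])."""
    item, _, rec_str = line.rstrip("\n").partition("\t")
    recs = []
    for field in rec_str.split(","):
        if not field:
            continue
        # rpartition tolerates ':' inside an item id; the score is the last piece.
        neighbour, _, score = field.rpartition(":")
        recs.append((neighbour, float(score)))
    return item, recs


# Made-up line in the same shape as the job's output.
sample = "101\t205:0.812,317:0.640,42:0.118"
item, recs = parse_rec_line(sample)
print(item, recs)
```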
3. Python implementation (using sklearn library)

import pandas as pd
import numpy as np
import warnings
import random, math, os
from tqdm import tqdm
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

# Evaluation metrics

# Recall: the share of the products the users actually clicked that the
# recommender system recommended correctly (returned as a percentage).
def Recall(Rec_dict, Val_dict):
    '''
    Rec_dict: recommendation lists returned by the algorithm, in the form {uid: {item1, item2, ...}, uid: {item1, item2, ...}, ...}
    Val_dict: products the users actually clicked, in the form {uid: {item1, item2, ...}, uid: {item1, item2, ...}, ...}
    '''
    hit_items = 0
    all_items = 0
    for uid, items in Val_dict.items():
        rel_set = items
        rec_set = Rec_dict[uid]
        for item in rec_set:
            if item in rel_set:
                hit_items += 1
        all_items += len(rel_set)

    return round(hit_items / all_items * 100, 2)

# Precision: the share of the products recommended to the users that the
# users actually clicked (returned as a percentage).
def Precision(Rec_dict, Val_dict):
    '''
    Rec_dict: recommendation lists returned by the algorithm, in the form {uid: {item1, item2, ...}, uid: {item1, item2, ...}, ...}
    Val_dict: products the users actually clicked, in the form {uid: {item1, item2, ...}, uid: {item1, item2, ...}, ...}
    '''
    hit_items = 0
    all_items = 0
    for uid, items in Val_dict.items():
        rel_set = items
        rec_set = Rec_dict[uid]
        for item in rec_set:
            if item in rel_set:
                hit_items += 1
        all_items += len(rec_set)

    return round(hit_items / all_items * 100, 2)

# Coverage: across all users who received recommendations, the number of distinct
# recommended products relative to the number of distinct products these users
# clicked in the training set (returned as a percentage).
def Coverage(Rec_dict, Trn_dict):
    '''
    Rec_dict: recommendation lists returned by the algorithm, in the form {uid: {item1, item2, ...}, uid: {item1, item2, ...}, ...}
    Trn_dict: products the users clicked in the training set, in the form {uid: {item1, item2, ...}, uid: {item1, item2, ...}, ...}
    '''
    rec_items = set()
    all_items = set()
    for uid in Rec_dict:
        for item in Trn_dict[uid]:
            all_items.add(item)
        for item in Rec_dict[uid]:
            rec_items.add(item)
    return round(len(rec_items) / len(all_items) * 100, 2)

# Popularity: average popularity is used to measure novelty. A high average
# popularity (the recommended products are popular ones) means the
# recommendations have low novelty.
def Popularity(Rec_dict, Trn_dict):
    '''
    Rec_dict: recommendation lists returned by the algorithm, in the form {uid: {item1, item2, ...}, uid: {item1, item2, ...}, ...}
    Trn_dict: products the users clicked in the training set, in the form {uid: {item1, item2, ...}, uid: {item1, item2, ...}, ...}
    '''
    # Count how many training users clicked each item.
    pop_items = {}
    for uid in Trn_dict:
        for item in Trn_dict[uid]:
            if item not in pop_items:
                pop_items[item] = 0
            pop_items[item] += 1

    pop, num = 0, 0
    for uid in Rec_dict:
        for item in Rec_dict[uid]:
            # Item popularity follows a long-tail distribution; taking the
            # logarithm makes the average more stable.
            pop += math.log(pop_items[item] + 1)
            num += 1
    return round(pop / num, 3)

# Call all the evaluation functions together
def rec_eval(val_rec_items, val_user_items, trn_user_items):
    print('recall:', Recall(val_rec_items, val_user_items))
    print('precision:', Precision(val_rec_items, val_user_items))
    print('coverage:', Coverage(val_rec_items, trn_user_items))
    print('popularity:', Popularity(val_rec_items, trn_user_items))

def get_data(root_path):
    # Read the data
    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_csv(os.path.join(root_path, 'ratings.dat'), sep='::', engine='python', names=rnames)

    # Split into training and validation sets
    trn_data, val_data = train_test_split(ratings, test_size=0.2)

    trn_data = trn_data.groupby('user_id')['movie_id'].apply(list).reset_index()
    val_data = val_data.groupby('user_id')['movie_id'].apply(list).reset_index()

    trn_user_items = {}
    val_user_items = {}

    # Turn the frames into dictionaries of the form {user_id: {item_id1, item_id2, ..., item_idn}}
    for user, movies in zip(trn_data['user_id'], trn_data['movie_id']):
        trn_user_items[user] = set(movies)

    for user, movies in zip(val_data['user_id'], val_data['movie_id']):
        val_user_items[user] = set(movies)

    return trn_user_items, val_user_items

def User_CF_Rec(trn_user_items, val_user_items, K, N):
    '''
    trn_user_items: training data, in the form {user_id1: {item_id1, item_id2, ..., item_idn}, user_id2: ...}
    val_user_items: validation data, in the form {user_id1: {item_id1, item_id2, ..., item_idn}, user_id2: ...}
    K: number of similar users; the K most similar users are selected for each user
    N: number of products to recommend; each user receives the N highest-scoring products
    '''

    # Build the item -> users inverted table, in the form
    # {item_id1: {user_id1, user_id2, ..., user_idn}, item_id2: ...},
    # i.e. for each item, the set of users who clicked it.
    # The inverted table makes it cheap to count the products two users have in common.
    print('Create an inverted list...')
    item_users = {}
    for uid, items in tqdm(trn_user_items.items()):  # each user's full set of interacted items
        for item in items:  # register the uid under every item the user interacted with
            if item not in item_users:
                item_users[item] = set()
            item_users[item].add(uid)

    # Compute the user co-occurrence ("collaborative filtering") matrix.
    # sim = {user_id1: {user_id2: num1}, user_id3: {user_id4: num2}, ...} is a two-level
    # dictionary holding the number of products each pair of users has in common.
    # num = {user_id1: num1, user_id2: num2, ...} records how many products each
    # user interacted with in total.
    sim = {}
    num = {}
    print('Build a collaborative filtering matrix...')
    for item, users in tqdm(item_users.items()):  # walk every item and count co-occurring users
        for u in users:
            if u not in num:  # initialize to 0 first, otherwise the increment raises a KeyError
                num[u] = 0
            num[u] += 1  # total number of items user u interacted with
            if u not in sim:  # initialize to an empty dict first, otherwise a KeyError follows
                sim[u] = {}
            for v in users:
                if u != v:  # similarity is only computed between different users
                    if v not in sim[u]:
                        sim[u][v] = 0
                    sim[u][v] += 1

    # Compute the user similarity matrix.
    # The co-occurrence counts are the numerator of the cosine similarity. Divide them by
    # the denominator: the square root of the product of the two users' item counts,
    # which come from the num dictionary above.
    print('calculate similarity...')
    for u, users in tqdm(sim.items()):
        for v, score in users.items():
            sim[u][v] = score / math.sqrt(num[u] * num[v])  # divide by the cosine denominator

    # Make top-N recommendations for every user in the validation data.
    # First take the K users most similar to the current user from the similarity matrix,
    # then score every product those K users interacted with, excluding products the
    # current user already interacted with in the training set.
    # A candidate's final score is the sum of the similarities of the neighbours who
    # interacted with it.
    print('Recommend to the test user...')
    items_rank = {}
    for u, _ in tqdm(val_user_items.items()):  # recommend for each user in the validation set
        items_rank[u] = {}  # candidate items of user u
        # sim.get guards against a validation user that is missing from the training split
        for v, score in sorted(sim.get(u, {}).items(), key=lambda x: x[1], reverse=True)[:K]:  # the K users most similar to u
            for item in trn_user_items[v]:  # products the similar user interacted with
                if item not in trn_user_items[u]:  # skip products u already interacted with in the training set
                    if item not in items_rank[u]:
                        items_rank[u][item] = 0  # initialize the score of this item for user u
                    items_rank[u][item] += score  # accumulate the scores from all similar users

    print('Select the N items with the highest similarity scores for each user...')
    items_rank = {k: sorted(v.items(), key=lambda x: x[1], reverse=True)[:N] for k, v in items_rank.items()}
    items_rank = {k: set([x[0] for x in v]) for k, v in items_rank.items()}  # convert to {uid: {item1, item2, ...}}

    return items_rank

if __name__ == "__main__":
    root_path = './data/ml-1m/'
    trn_user_items, val_user_items = get_data(root_path)
    rec_items = User_CF_Rec(trn_user_items, val_user_items, 80, 10)
    rec_eval(rec_items, val_user_items, trn_user_items)
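
To see what the script computes without downloading the dataset, here is a small hand-checkable sketch of the same idea: set-based cosine similarity between users, neighbour-score accumulation, then recall and precision. The click sets and the helper names set_cosine and recommend are made up for this example, and the metrics are plain fractions rather than the percentages returned above.

```python
import math

# Made-up click sets; the real script builds these from ratings.dat.
trn = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"c", "e"},
}
val = {"u1": {"d"}}  # held-out clicks we hope to recover


def set_cosine(items_u, items_v):
    """|N(u) & N(v)| / sqrt(|N(u)| * |N(v)|) for two users' click sets."""
    return len(items_u & items_v) / math.sqrt(len(items_u) * len(items_v))


def recommend(u, k, n):
    """Score unseen items by summing the similarity of the top-k neighbours."""
    neighbours = sorted(
        ((set_cosine(trn[u], trn[v]), v)
         for v in trn if v != u and trn[u] & trn[v]),
        reverse=True,
    )[:k]
    scores = {}
    for w, v in neighbours:
        for item in trn[v] - trn[u]:  # skip items u already clicked
            scores[item] = scores.get(item, 0.0) + w
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])


rec = {u: recommend(u, k=2, n=1) for u in val}
hits = sum(len(rec[u] & val[u]) for u in val)
recall = hits / sum(len(val[u]) for u in val)
precision = hits / sum(len(rec[u]) for u in val)
print(rec, recall, precision)
```

Here u2 shares two of three items with u1, so its unseen item d outranks u3's item e and the single held-out click is recovered.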


Origin blog.csdn.net/qq_36816848/article/details/114013653