[Recommender Systems in Practice] Movie Recommendation System (Part 1) - Overall Design

1. Project framework

1.1 System module design:

[Figure: system module design]

1.2 Project System Architecture

[Figure: project system architecture]



2. Data source analysis

2.1 Data source information

  • Movie information: movies.csv
Field | Description | Example
MID | Movie ID | 1
NAME | Movie name | Toy Story
DESCRI | Movie description | -
TIMELONG | Movie duration | 81 minutes
ISSUE | Release date | March 20, 2001
SHOOT | Shooting year | 1995
LANGUAGE | Movie language | English
GENRES | Movie genres | Adventure, Animation, Children, Comedy, Fantasy
ACTORS | Film actors | Tom Hanks, Tim Allen, Wallace Shawn
DIRECTOR | Film director | John Lasseter
  • User rating information: ratings.csv
Field | Description | Example
UID | User ID | 1
MID | Movie ID | 31
SCORE | Movie rating | 2.5
TIMESTAMP | Rating time | 1260759144
  • Movie tag information: tags.csv
Field | Description | Example
UID | User ID | 15
MID | Movie ID | 1995
TAG | Movie tag | dentist
TIMESTAMP | Tagging time | 1193435061

2.2 Main data model

[Figure: main data model]



3. Offline statistics module

[Figure: offline statistics module]

3.1 Historical Popular Movie Statistics

Count how many ratings each movie has received across all historical data. Only the number of ratings is considered here, not their values.

# RateMoreMovies  schema: mid, count
select mid,count(mid) as count from ratings group by mid

3.2 Recent popular movie statistics

The number of ratings a movie receives in each month represents its recent popularity.
changeDate: a helper function, called from the SQL below, that uses SimpleDateFormat to convert the rating time (TIMESTAMP) into a year-month string in the format "yyyyMM".

# ratingOfMonth   schema: mid, score, yearmonth
select mid,score,changeDate(timestamp) as yearmonth from ratings
# RateMoreRecentlyMovies   schema: mid, count, yearmonth
select mid,count(mid) as count, yearmonth from ratingOfMonth group by yearmonth,mid order by yearmonth desc,count desc
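The two queries above can be sketched in plain Python to make the data flow concrete. This is only an illustration, not the project's code: the function names change_date and rate_more_recently_movies are hypothetical, TIMESTAMP is assumed to be in seconds since the epoch (which matches the sample value 1260759144), and UTC is assumed as the time zone, whereas SimpleDateFormat would use the JVM default unless told otherwise.

```python
from collections import Counter
from datetime import datetime, timezone


def change_date(timestamp):
    """Convert a rating time in epoch seconds to a "yyyyMM" string (UTC assumed)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y%m")


def rate_more_recently_movies(ratings):
    """Count ratings per (movie, month).

    ratings: iterable of (uid, mid, score, timestamp) tuples.
    Returns (mid, count, yearmonth) rows ordered by yearmonth desc, count desc,
    mirroring the RateMoreRecentlyMovies query.
    """
    counts = Counter((mid, change_date(ts)) for _uid, mid, _score, ts in ratings)
    rows = [(mid, n, yearmonth) for (mid, yearmonth), n in counts.items()]
    rows.sort(key=lambda row: (row[2], row[1]), reverse=True)
    return rows
```

Because "yyyyMM" strings are zero-padded and of fixed width, sorting them as text gives the same order as sorting by date.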

3.3 Movie Average Rating Statistics

# AverageMovies   schema: mid, avg
select mid,avg(score) as avg from ratings group by mid

3.4 Statistics of Top 10 high-quality movies in each category

# movieWithScore  used to match every genre against every movie (unrated movies get a score of 0)
select a.mid,genres,if(isnull(b.avg),0,b.avg) score from movies a left join averageMovies b on a.mid=b.mid
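The query above only attaches an average score to each movie; the remaining step is to group by genre and keep the ten best. A minimal sketch of that step follows. It makes several assumptions that are not in the original: the function name top_movies_per_genre is hypothetical, the genres column is assumed to be a "|"-separated string, and instead of matching every genre against every movie it simply splits each movie's genre string, which yields the same groups.

```python
from collections import defaultdict


def top_movies_per_genre(movies, average_movies, n=10):
    """Return the n highest-rated movies for each genre.

    movies: iterable of (mid, genres), genres being a "|"-separated string (assumed).
    average_movies: dict mapping mid -> average score (the AverageMovies result).
    Movies without any rating get a score of 0, as in the left join above.
    """
    by_genre = defaultdict(list)
    for mid, genres in movies:
        score = average_movies.get(mid, 0.0)
        for genre in genres.split("|"):
            by_genre[genre.strip()].append((mid, score))
    return {
        genre: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for genre, rows in by_genre.items()
    }
```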




4. Offline recommendation module

4.1 Training the latent semantic model with the ALS algorithm

ALS stands for alternating least squares. In machine learning, ALS usually refers to a collaborative filtering recommendation algorithm whose matrix factorization is solved by alternating least squares.

The advantage of factorization models is that, once the model has been trained, computing recommendations is relatively easy, so such models usually perform very well. The drawback is that the number of latent factors k is hard to choose and usually has to be determined from the specific business and data volume. Generally speaking, k lies between 10 and 200. Note: the larger k is, the higher the computational complexity.
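To show what "alternating" means and where k enters, here is a small plain-Python sketch of ALS for explicit ratings. It is an illustration only: the names solve, update, als and predict are hypothetical, the regularization weight lam and the iteration count are arbitrary choices, and a real system would normally use a library implementation (for example Spark MLlib's ALS) rather than hand-written code.

```python
import random


def solve(A, b):
    """Solve the k x k linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def update(fixed, observed, k, lam):
    """Least-squares update of one side's factors while the other side is fixed.

    For each row this solves (lam * I + sum f f^T) x = sum r * f,
    where f are the fixed factor vectors of the rated items.
    """
    result = {}
    for row_id, pairs in observed.items():
        A = [[lam if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for col_id, rating in pairs:
            f = fixed[col_id]
            for i in range(k):
                b[i] += rating * f[i]
                for j in range(k):
                    A[i][j] += f[i] * f[j]
        result[row_id] = solve(A, b)
    return result


def als(ratings, k=2, lam=0.1, iters=20, seed=0):
    """Factorize (uid, mid, score) triples into k-dimensional user and movie factors."""
    rng = random.Random(seed)
    by_user, by_movie = {}, {}
    for uid, mid, score in ratings:
        by_user.setdefault(uid, []).append((mid, score))
        by_movie.setdefault(mid, []).append((uid, score))
    movie_f = {mid: [rng.random() for _ in range(k)] for mid in by_movie}
    user_f = {}
    for _ in range(iters):
        user_f = update(movie_f, by_user, k, lam)   # fix movies, solve users
        movie_f = update(user_f, by_movie, k, lam)  # fix users, solve movies
    return user_f, movie_f


def predict(user_f, movie_f, uid, mid):
    """Predicted score is the dot product of the user and movie factor vectors."""
    return sum(a * b for a, b in zip(user_f[uid], movie_f[mid]))
```

Each update solves one k x k system per user or movie, and building that system costs on the order of k squared per rating, which is why a larger k makes training noticeably more expensive.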


ALS recommendation model training:
[Figure: ALS recommendation model training]

4.2 Calculation of user recommendation matrix

[Figure: calculation of the user recommendation matrix]

4.3 Computing Movie Similarity Matrix

[Figure: computing the movie similarity matrix]

The magnitude of two movies' ratings does not reflect how similar the movies are, so here we measure the similarity of two movies with the cosine similarity of their feature vectors rather than with the Euclidean distance.
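A short sketch of the difference between the two measures, assuming each movie is represented by a numeric feature vector (for instance the latent factors learned by ALS). The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, the vectors [1, 2, 3] and [2, 4, 6] point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is clearly non-zero. Cosine similarity ignores the overall scale and only compares the direction of the two vectors.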


Store the movie similarity matrix:
[Figure: storing the movie similarity matrix]



5. Real-time recommendation module

  • Computation must be fast
  • Results do not need to be especially precise
  • A recommendation model has been built in advance

5.1 Real-time Recommendation Architecture

[Figure: real-time recommendation architecture]


5.2 Real-time recommendation priority calculation

Rationale: a user's tastes stay similar over a recent period of time.

The priority combines two factors: how similar a candidate movie is to the movies the user has watched recently, and how the user rated those movies (movies similar to a low-rated movie should not be recommended). The weighted sum of these two factors gives the recommendation priority.

Specific method:

  1. First, obtain a list of candidate movies;
  2. Then compute the recommendation priority of each candidate movie:
    $$E_{uq} = \frac{\sum_{r \in RK} sim(q,r) \times R_r}{sim\_num} + \lg\max(incount, 1) - \lg\max(recount, 1)$$
    where
    • $sim(q,r)$ is the similarity between the candidate movie $q$ and a movie $r \in RK$ that the user has watched recently;
    • $R_r$ is the user's rating of movie $r$;
    • $\frac{\sum_{r \in RK} sim(q,r) \times R_r}{sim\_num}$ weights each rating by the similarity, sums the results, and then averages them;
    • $\lg\max(incount, 1)$ is the reward term, where $incount$ is the number of high scores (threshold is custom) in the user's recent ratings;
    • $\lg\max(recount, 1)$ is the penalty term, where $recount$ is the number of low scores (threshold is custom) in the user's recent ratings.
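The priority formula above can be sketched as a small function. Several details are assumptions, since the text does not pin them down: sim_num is taken to be the number of recent movies that contribute a term (those whose similarity exceeds min_sim), incount and recount are counted over those same movies, a rating above high_score counts as "high", and lg is the base-10 logarithm. The names recommendation_priority, high_score and min_sim are hypothetical.

```python
import math


def recommendation_priority(candidate, recent_ratings, sim, high_score=3.0, min_sim=0.0):
    """Compute E_uq for one candidate movie.

    candidate: mid of the candidate movie q.
    recent_ratings: list of (mid, rating) for the user's recent movies RK.
    sim: function (mid_a, mid_b) -> similarity, e.g. a lookup in the
         precomputed movie similarity matrix.
    """
    weighted_sum = 0.0
    sim_num = 0
    incount = 0  # recent high ratings, used by the reward term
    recount = 0  # recent low ratings, used by the penalty term
    for mid, rating in recent_ratings:
        s = sim(candidate, mid)
        if s <= min_sim:
            continue  # too dissimilar to contribute
        weighted_sum += s * rating
        sim_num += 1
        if rating > high_score:
            incount += 1
        else:
            recount += 1
    if sim_num == 0:
        return 0.0
    return (weighted_sum / sim_num
            + math.log10(max(incount, 1))
            - math.log10(max(recount, 1)))
```

The max(..., 1) guards keep both logarithms at zero or above, so a candidate with no high or no low ratings nearby is neither rewarded nor penalized.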

[Figure: real-time recommendation priority calculation]



6. Content-based recommendation module

  • Movies similar to movie A: movies with the same tags
  • Item-CF: extract the content features of movie A from its tags, and select movies whose features are similar to A's
  • Feature extraction based on UGC: TF-IDF
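As a rough illustration of TF-IDF feature extraction, the sketch below turns each movie's tag or genre terms into a weighted feature vector. It uses the plain log(N / df) form of the inverse document frequency, which is one common variant and not necessarily the one used in the original project. The function name tf_idf and the input layout are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for each movie's terms.

    docs: dict mapping mid -> list of terms (tags or genres) for that movie.
    Returns dict mapping mid -> {term: weight}.
    """
    n_docs = len(docs)
    # Document frequency: in how many movies does each term occur?
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for mid, terms in docs.items():
        tf = Counter(terms)
        vectors[mid] = {
            term: (count / len(terms)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors
```

A term that appears on every movie gets a weight of zero, so very common tags stop dominating the comparison. The resulting vectors can then be compared with a similarity measure such as cosine similarity.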





Reference:
[1] Shang Silicon Valley Machine Learning and Recommender System Project Practical Tutorial

Origin blog.csdn.net/qq_42757191/article/details/126699421