Table of contents
1. Project framework
1.1 System module design:
1.2 Project System Architecture
2. Data source analysis
2.1 Data source information
- Movie information movies.csv
Movie ID (MID) | Movie name (NAME) | Movie description (DESCRI) | Movie duration (TIMELONG) | Release date (ISSUE) | Shooting year (SHOOT) | Movie language (LANGUAGE) | Movie category (GENRES) | Film actors (ACTORS) | Film director (DIRECTOR) |
---|---|---|---|---|---|---|---|---|---|
1 | Toy Story | - | 81 minutes | March 20, 2001 | 1995 | English | Adventure Animation Children Comedy Fantasy | Tom Hanks Tim Allen … Wallace Shawn | John Lasseter |
… | … | … | … | … | … | … | … | … | … |
- User rating information ratings.csv
User ID (UID) | Movie ID (MID) | Movie score (SCORE) | Rating time (TIMESTAMP) |
---|---|---|---|
1 | 31 | 2.5 | 1260759144 |
… | … | … | … |
- Movie tag information tags.csv
User ID (UID) | Movie ID (MID) | Movie tag (TAG) | Tagging time (TIMESTAMP) |
---|---|---|---|
15 | 1995 | dentist | 1193435061 |
… | … | … | … |
2.2 Main data model
3. Offline statistics module
3.1 Historical Popular Movie Statistics
Count the number of ratings each movie has received across all historical data. Only the number of ratings is considered here, not the rating values.
# RateMoreMovies data structure: mid, count
select mid,count(mid) as count from ratings group by mid
3.2 Recent popular movie statistics
The number of ratings a movie receives in each month is used as a measure of its recent popularity.
changeDate: uses SimpleDateFormat to convert the rating time (TIMESTAMP) into a year-month string in the format "yyyyMM".
# ratingOfMonth data structure: mid, score, yearmonth
select mid,score,changeDate(timestamp) as yearmonth from ratings
# RateMoreRecentlyMovies data structure: mid, count, yearmonth
select mid,count(mid) as count, yearmonth from ratingOfMonth group by yearmonth,mid order by yearmonth desc,count desc
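The two queries above can be mirrored in a few lines of plain Python, which may help when checking the logic on a small sample. This is only a sketch: the function names `change_date` and `rate_more_recently` are illustrative, TIMESTAMP is assumed to be a Unix time in seconds, and UTC is assumed for the conversion (SimpleDateFormat would use the JVM's default time zone unless told otherwise).

```python
from collections import Counter
from datetime import datetime, timezone


def change_date(timestamp):
    """Convert a Unix timestamp in seconds to a 'yyyyMM' string (UTC assumed)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y%m")


def rate_more_recently(ratings):
    """Count ratings per (yearmonth, mid).

    ratings: iterable of (uid, mid, score, timestamp) tuples.
    Returns (mid, count, yearmonth) rows ordered by yearmonth desc, count desc.
    """
    counts = Counter((change_date(ts), mid) for _, mid, _, ts in ratings)
    rows = [(mid, count, ym) for (ym, mid), count in counts.items()]
    # 'yyyyMM' strings sort chronologically, so one reversed sort covers both keys.
    rows.sort(key=lambda row: (row[2], row[1]), reverse=True)
    return rows
```

For the sample rating shown in section 2.1, `change_date(1260759144)` gives `"200912"`.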
3.3 Movie Average Rating Statistics
# AverageMovies data structure: mid, avg
select mid,avg(score) as avg from ratings group by mid
3.4 Statistics of Top 10 high-quality movies in each category
# movieWithScore: match all categories against all movies
select a.mid,genres,if(isnull(b.avg),0,b.avg) score from movies a left join averageMovies b on a.mid=b.mid
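The query above only produces movieWithScore; the step that picks the Top 10 per category is not shown. One possible way to finish it is sketched below in plain Python. The function name `top_movies_per_genre` and the genre separator `sep` are assumptions for illustration (the separator used in movies.csv is not stated here), so adjust them to the actual data.

```python
from collections import defaultdict


def top_movies_per_genre(movie_with_score, top_n=10, sep="|"):
    """Return the top_n highest-scored movies for each genre.

    movie_with_score: iterable of (mid, genres, score) rows, where genres
    is one string holding several genre names joined by `sep` (assumed).
    """
    by_genre = defaultdict(list)
    for mid, genres, score in movie_with_score:
        for genre in genres.split(sep):
            genre = genre.strip()
            if genre:
                # A movie is listed under every genre it belongs to.
                by_genre[genre].append((mid, score))
    return {
        genre: sorted(items, key=lambda item: item[1], reverse=True)[:top_n]
        for genre, items in by_genre.items()
    }
```

Movies without ratings carry a score of 0 from the left join, so they naturally sink to the bottom of each category.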
4. Offline recommendation module
4.1 Training the latent semantic model with the ALS algorithm
ALS is the abbreviation of alternating least squares. In machine learning, ALS usually refers to a collaborative filtering recommendation algorithm whose latent factor model is solved by alternating least squares.
The advantage of factorization models is that, once the model has been trained, generating recommendations is relatively cheap, and such models usually perform well. The drawback is that the number of latent factors k is hard to choose and usually has to be determined from the specific business scenario and data volume. In general, k ranges between 10 and 200. Note: the larger k is, the higher the computational complexity.
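For reference, the textbook form of the problem ALS solves can be written as follows. This is the standard regularized formulation rather than necessarily the exact variant used in this project; the symbols $x_u$ (user factor vector), $y_i$ (movie factor vector), $K$ (set of observed ratings) and $\lambda$ (regularization weight) are introduced here for illustration.

$$
\min_{X,Y} \sum_{(u,i)\in K} \left(r_{ui} - x_u^{\top} y_i\right)^2 + \lambda \left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)
$$

With the movie factors fixed, each user vector has a closed-form least-squares solution, where $Y_u$ stacks the factor vectors of the movies rated by user $u$, $r_u$ holds the corresponding ratings, and $I_k$ is the $k \times k$ identity matrix:

$$
x_u = \left(Y_u^{\top} Y_u + \lambda I_k\right)^{-1} Y_u^{\top} r_u
$$

The movie vectors are then updated symmetrically with the user factors fixed, and the two steps alternate until convergence. Each update inverts a $k \times k$ matrix, which is why the cost grows with k.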
ALS recommendation model training:
4.2 Calculation of user recommendation matrix
4.3 Computing Movie Similarity Matrix
Comparing the raw rating values of two movies does not reflect how similar the movies are, so the similarity between two movies is measured here with cosine similarity rather than Euclidean distance.
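A minimal sketch of the cosine similarity computation is given below. It assumes each movie is represented by a numeric feature vector of equal length (for example the latent factors learned by ALS, which is an assumption here); the function name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as not similar to anything.
        return 0.0
    return dot / (norm_a * norm_b)
```

Cosine similarity depends only on the direction of the vectors, not their length: `[1, 2]` and `[2, 4]` score approximately 1 even though their Euclidean distance is not zero.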
Store the movie similarity matrix:
5. Real-time recommendation module
- Computation must be fast
- The results do not need to be particularly precise
- It relies on pre-designed recommendation models
5.1 Real-time Recommendation Architecture
5.2 Real-time recommendation priority calculation
Rationale: a user's tastes stay roughly the same over a short recent period.
Two factors are considered together: how similar a candidate movie is to the movies the user has watched recently, and how the user rated those movies (if the rating is low, similar movies should not be recommended). The weighted sum of these two factors gives the recommendation priority.
Specific method:
- First, get a set of candidate movie lists;
- Then do calculations for each candidate movie and calculate its recommendation priority;
$$E_{uq} = \frac{\sum\limits_{r\in RK} sim(q,r)\times R_r}{sim\_num} + \lg\max(incount, 1) - \lg\max(recount, 1)$$
Among them:
- $sim(q,r)$ is the similarity between the candidate movie $q$ and a movie $r \in RK$ that the user has watched recently;
- $R_r$ is the user's rating of movie $r$;
- $\frac{\sum\limits_{r\in RK} sim(q,r)\times R_r}{sim\_num}$ means that the ratings are weighted by similarity, summed, and then averaged;
- $\lg\max(incount, 1)$ is the reward term, where $incount$ is the number of high scores (custom threshold) in the user's recent ratings;
- $\lg\max(recount, 1)$ is the penalty term, where $recount$ is the number of low scores (custom threshold) in the user's recent ratings.
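The priority formula above can be sketched in Python as follows. Several details are assumptions made for illustration, not taken from the text: lg is read as the base-10 logarithm, sim_num is taken to be the number of recent movies for which a similarity to the candidate is available, a rating counts as "high" when it is above `high_score` (3.0 by default), and the names `recommendation_priority`, `recent_ratings` and `sim` are hypothetical.

```python
import math


def recommendation_priority(candidate, recent_ratings, sim, high_score=3.0):
    """Compute E_uq for one candidate movie.

    candidate:      mid of the candidate movie q
    recent_ratings: list of (mid, rating) for the user's recent movies RK
    sim:            dict mapping (q, r) -> similarity between movies q and r
    high_score:     assumed threshold separating high from low ratings
    """
    weighted_sum = 0.0
    sim_num = 0
    for mid, rating in recent_ratings:
        similarity = sim.get((candidate, mid))
        if similarity is None:
            continue  # no similarity known for this pair, skip it
        weighted_sum += similarity * rating
        sim_num += 1
    if sim_num == 0:
        return 0.0  # nothing to base a priority on

    # Reward and penalty counts over the user's recent ratings.
    incount = sum(1 for _, rating in recent_ratings if rating > high_score)
    recount = sum(1 for _, rating in recent_ratings if rating <= high_score)

    return (
        weighted_sum / sim_num
        + math.log10(max(incount, 1))
        - math.log10(max(recount, 1))
    )
```

Taking `max(..., 1)` before the logarithm keeps both terms at zero when there are no high or low ratings, so the reward and penalty only start to matter from the second such rating onward.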
6. Content-based recommendation module
- Similar Movies to Movie A - Movies with the same tag
- Item-CF: extract the content features of movie A from its tags, and select movies whose features are similar to those of A
- Feature extraction based on UGC - TF-IDF
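As a rough illustration of TF-IDF feature extraction from user tags, the sketch below treats the tags attached to each movie as one document. It uses the plain textbook definition (term frequency times `log(N / df)`); library implementations often apply smoothing and will give slightly different numbers. The function name `tfidf_vectors` and the input layout are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(movie_tags):
    """Build a sparse TF-IDF vector for every movie.

    movie_tags: dict mapping mid -> list of tag strings (repeats allowed).
    Returns a dict mapping mid -> {tag: tf-idf weight}.
    """
    n_docs = len(movie_tags)
    doc_freq = Counter()
    for tags in movie_tags.values():
        doc_freq.update(set(tags))  # count each tag once per movie

    vectors = {}
    for mid, tags in movie_tags.items():
        counts = Counter(tags)
        total = len(tags)
        vectors[mid] = {
            tag: (count / total) * math.log(n_docs / doc_freq[tag])
            for tag, count in counts.items()
        }
    return vectors
```

A tag that appears on every movie gets weight 0 and so contributes nothing to similarity, while a tag that is rare across movies but frequent on one movie dominates that movie's vector. The resulting vectors can then be compared pairwise, for example with cosine similarity.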
Reference:
[1] Atguigu (尚硅谷), Machine Learning and Recommender System Project Practical Tutorial