Article directory
Preface
Main content
- Project framework
- Data source analysis
- Statistical recommendation module
- Offline recommendation module based on LFM
- Real-time recommendation module based on custom model
- Other forms of offline similar recommendation modules
- Content-based module recommendations
- Item-based collaborative filtering recommendation module
1. Project Framework
Big data processing pipeline
- Data sources: structured data (relational data), semi-structured data (log data), unstructured data (pictures and videos)
- Data collection: ETL tools, Scribe, Flume, Kafka, Sqoop
- Data storage: Oracle, Greenplum, Cassandra, HBase, HDFS
- Data computing: Mahout, Storm, Flink, Spark, MapReduce
- Data applications: business applications, Tableau, BI analysis, visualization (ECharts, D3)
Real-time processing flow
- User interface (business request)
- Backend server (front-end/back-end event tracking)
- Log file
- Log collection (Flume)
- Data bus (Kafka message queue)
- Real-time computation
- Data storage
- Data visualization
Offline processing flow
User interface -> Backend server -> Log file -> Log collection -> Log storage -> Log cleaning -> Data loading -> Data warehouse -> Data calculation -> Data storage -> Data visualization
2. Project system design
System module design
- Real-time recommendations
- Offline recommendations
- Popular recommendations
- Tags
- Similar recommendations
Project system architecture
Business system composition
- User visualization: AngularJS
  - Recommended results display
  - Product search
  - Product information details
  - Product tags
  - Product ratings
- Comprehensive business services: Spring
  - Recommended result query
  - Product search
  - Product information details
  - Product tags
  - Product ratings
- Business database: MongoDB (popular, handles large data volumes; a document database that stores JSON-like documents)
- Offline statistics service: historical popular product statistics, recent popular product statistics, product average score statistics
- Offline recommendation service:
  - ALS (LFM): UserRecs, ProductRecs
  - TF-IDF
- Cache database: Redis
Recommendation system composition
Offline recommendations (offline):
- Offline statistics service: Scala, Spark SQL
- Offline recommendation service: Scala, Spark MLlib
Real-time recommendations (online):
- Log collection service: Flume-ng
- Message buffering service: Kafka
- Real-time recommendation service: Spark Streaming
Project data flow diagram
Data source analysis
- Product information: products.csv
- Product ID (productId)
- Product name (name)
- Categories
- Product image URL (imageUrl)
- Product tags
- User rating data: ratings.csv
- User ID (uid)
- Product ID (productId)
- Product rating (score)
- Rating time (timestamp)
Main data model
- Product information table
- User rating information table
- User table
- Historical popular product statistics table
- Recent popular product statistics table
- Product average rating statistics table
- Offline (LFM-based) user recommendation list
- Offline (LFM-based) product similarity table (prepared for the subsequent real-time recommendations)
- Offline (content-based) product similarity table
- Offline (Item-CF-based) product similarity table
- Real-time user recommendation list
Module implementation
Statistical recommendation module
Historical popular product statistics
- Count the number of ratings each product has received across all historical data
select productId, count(productId) as count from ratings group by productId order by count desc
=> RateMoreProducts
- RateMoreProducts data structure: productId, count
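The counting step above can be sketched outside Spark. The following is a minimal plain-Python illustration, not the project's Spark SQL job; the function name `rate_more_products` and the sample tuples are made up for the example, and the tuple layout simply mirrors the ratings.csv columns (uid, productId, score, timestamp).

```python
from collections import Counter


def rate_more_products(ratings):
    """Count ratings per product, most-rated first.

    `ratings` is an iterable of (uid, productId, score, timestamp) tuples.
    Returns a list of (productId, count) pairs.
    """
    counts = Counter(product_id for _, product_id, _, _ in ratings)
    # most_common() sorts by count descending, like ORDER BY count DESC
    return counts.most_common()


# Toy data, invented for illustration only
sample_ratings = [
    (1, 101, 4.0, 1546300800),
    (2, 101, 5.0, 1546387200),
    (1, 102, 3.0, 1548979200),
]

print(rate_more_products(sample_ratings))  # [(101, 2), (102, 1)]
```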
Recent popular product statistics
- Count the number of ratings each product receives per month, as a measure of its recent popularity
select productId, score, changeDate(timestamp) as yearmonth from ratings
=> ratingOfMonth
select productId, count(productId) as count, yearmonth from ratingOfMonth group by yearmonth, productId order by yearmonth desc, count desc
=> RateMoreRecentlyProducts
- changeDate: a UDF that uses SimpleDateFormat to convert the rating timestamp into a "yyyyMM" string
- RateMoreRecentlyProducts data structure: productId, count, yearmonth
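As a rough illustration of the two queries above, here is a plain-Python sketch. It assumes the timestamp column holds Unix seconds and interprets it in UTC; both are assumptions, since the source does not state the unit or time zone. `change_date` stands in for the changeDate UDF, and the sample tuples are invented.

```python
from collections import Counter
from datetime import datetime, timezone


def change_date(timestamp):
    """Convert a Unix timestamp (assumed to be in seconds) to a yyyyMM integer."""
    return int(datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y%m"))


def rate_more_recently_products(ratings):
    """Count ratings per (month, product), newest month and highest count first.

    `ratings` is an iterable of (uid, productId, score, timestamp) tuples.
    Returns a list of (productId, count, yearmonth) rows.
    """
    counts = Counter((change_date(ts), pid) for _, pid, _, ts in ratings)
    rows = [(pid, count, yearmonth) for (yearmonth, pid), count in counts.items()]
    # ORDER BY yearmonth DESC, count DESC
    rows.sort(key=lambda row: (-row[2], -row[1]))
    return rows


# Toy data: two ratings in January 2019, one in February 2019
sample_ratings = [
    (1, 101, 4.0, 1546300800),
    (2, 101, 5.0, 1546387200),
    (1, 102, 3.0, 1548979200),
]

print(rate_more_recently_products(sample_ratings))
```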
Product average rating statistics
select productId, avg(score) as avg from ratings group by productId order by avg desc
=> AverageProducts
- AverageProducts data structure: productId, avg
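The average-score query can be mirrored in a few lines of plain Python. This is only a sketch of the aggregation, with a hypothetical function name and invented sample data, not the Spark SQL implementation.

```python
from collections import defaultdict


def average_products(ratings):
    """Average score per product, highest average first.

    `ratings` is an iterable of (uid, productId, score, timestamp) tuples.
    Returns a list of (productId, avg) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # productId -> [sum of scores, count]
    for _, pid, score, _ in ratings:
        totals[pid][0] += score
        totals[pid][1] += 1
    rows = [(pid, total / count) for pid, (total, count) in totals.items()]
    # ORDER BY avg DESC
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Toy data, invented for illustration only
sample_ratings = [
    (1, 101, 4.0, 1546300800),
    (2, 101, 5.0, 1546387200),
    (1, 102, 3.0, 1548979200),
]

print(average_products(sample_ratings))  # [(101, 4.5), (102, 3.0)]
```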
Offline recommendation module based on LFM
- Train the latent factor model (LFM) with the ALS algorithm
val model = ALS.train(trainData, rank, iterations, lambda)
- Required data structure: RDD/DataFrame
- trainData: training data
- rank: number of latent features k
- iterations: number of iterations
- lambda: regularization coefficient
- RMSE: root mean square error
- Parameter tuning: train with several parameter combinations, compute the RMSE for each, and keep the combination with the smallest RMSE
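The RMSE-based selection described above can be sketched as follows. This plain-Python example does not train ALS; it assumes predictions for each (rank, lambda) combination are already available and only shows how RMSE is computed and how the best combination is picked. The names `rmse` and `best_params`, and all numbers, are hypothetical toy values.

```python
import math


def rmse(predicted, actual):
    """Root mean square error over the (uid, productId) keys present in both dicts."""
    common = predicted.keys() & actual.keys()
    if not common:
        raise ValueError("no overlapping (uid, productId) pairs to evaluate")
    squared_error = sum((predicted[key] - actual[key]) ** 2 for key in common)
    return math.sqrt(squared_error / len(common))


def best_params(predictions_by_params, actual):
    """Return the parameter tuple whose predictions give the smallest RMSE."""
    return min(predictions_by_params,
               key=lambda params: rmse(predictions_by_params[params], actual))


# Toy held-out ratings and toy predictions for two (rank, lambda) settings
actual = {(1, 101): 4.0, (1, 102): 3.0}
predictions_by_params = {
    (5, 0.1): {(1, 101): 3.0, (1, 102): 3.0},
    (10, 0.1): {(1, 101): 3.5, (1, 102): 3.0},
}

print(best_params(predictions_by_params, actual))  # (10, 0.1)
```

In a real run, each entry of `predictions_by_params` would come from a model trained with that setting and evaluated on data held out from training.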
- Calculate the user recommendation matrix
- Calculate the product similarity matrix
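Once a latent factor model is trained, both matrices can be derived from the learned feature vectors. The sketch below assumes predicted scores are dot products of user and product vectors (the standard matrix factorization prediction) and that product similarity is the cosine of two product vectors; the choice of cosine and the `threshold` parameter are assumptions, as are all function names and toy vectors.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def user_recs(user_vector, product_features, n):
    """Top-n products for one user, scored by the dot product of latent vectors."""
    scores = [(pid, sum(u * p for u, p in zip(user_vector, vec)))
              for pid, vec in product_features.items()]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:n]


def product_similarities(product_features, threshold=0.0):
    """For each product, list (otherProductId, similarity) above the threshold."""
    sims = {pid: [] for pid in product_features}
    for p, q in combinations(product_features, 2):
        s = cosine(product_features[p], product_features[q])
        if s > threshold:
            sims[p].append((q, s))
            sims[q].append((p, s))
    for pid in sims:
        sims[pid].sort(key=lambda item: item[1], reverse=True)
    return sims


# Toy 2-dimensional latent vectors, invented for illustration
product_features = {101: [1.0, 0.0], 102: [1.0, 1.0], 103: [0.0, 1.0]}

print(user_recs([2.0, 1.0], product_features, n=2))  # [(102, 3.0), (101, 2.0)]
print(product_similarities(product_features)[101])
```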
Model-based real-time recommendation module
- Fast calculation speed
- The results may not be particularly accurate
- Relies on a pre-designed recommendation model
Recommendation priority calculation
- Basic principle: a user's preferences are assumed to stay similar over a recent period
- Inputs: product similarity and the user's rating scores
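The source names only the two ingredients of the priority (similarity and rating score) and does not give the formula. One simple reading, offered here purely as an assumption, is to score each candidate product by the similarity-weighted average of the user's most recent ratings. The function name, the `similarity` callback, and the `min_sim` cutoff are all hypothetical.

```python
def recommendation_priority(candidates, recent_ratings, similarity, min_sim=0.0):
    """Rank candidate products by similarity-weighted recent ratings.

    candidates:      iterable of candidate productIds
    recent_ratings:  list of (productId, score) the user rated recently
    similarity:      function (candidateId, ratedId) -> float (assumed precomputed offline)
    min_sim:         similarities at or below this value are ignored (assumed cutoff)
    """
    priorities = {}
    for candidate in candidates:
        weighted = [similarity(candidate, rated) * score
                    for rated, score in recent_ratings
                    if similarity(candidate, rated) > min_sim]
        if weighted:
            priorities[candidate] = sum(weighted) / len(weighted)
    return sorted(priorities.items(), key=lambda item: item[1], reverse=True)


# Toy similarity table and recent ratings, invented for illustration
toy_sims = {(201, 101): 0.9, (201, 102): 0.5, (202, 101): 0.2, (202, 102): 0.8}
recent = [(101, 5.0), (102, 2.0)]

ranked = recommendation_priority(
    [201, 202], recent, lambda q, r: toy_sims.get((q, r), 0.0))
print(ranked)
```

Product 201 ranks first here because it is most similar to the product the user just rated highly, which matches the stated principle that recent tastes carry over.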
Other forms of offline similar recommendations
Content-based recommendations
- Using the tags that users have attached to each product, extract a feature vector for each product with the TF-IDF algorithm
- Compute the cosine similarity between feature vectors to obtain a list of similar products for each product
- In practical applications, similar products are generally recommended on the product details page or product purchase page.
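The TF-IDF and cosine steps above can be sketched in plain Python. This uses the unsmoothed textbook weighting tf * log(N / df), which is an assumption; a library implementation may use a smoothed variant. The function names and the tag lists are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(product_tags):
    """Build a sparse TF-IDF vector (tag -> weight) for each product.

    product_tags maps productId -> list of tags attached to that product.
    """
    n_products = len(product_tags)
    # Document frequency: in how many products does each tag appear?
    df = Counter(tag for tags in product_tags.values() for tag in set(tags))
    vectors = {}
    for pid, tags in product_tags.items():
        tf = Counter(tags)
        vectors[pid] = {tag: (count / len(tags)) * math.log(n_products / df[tag])
                        for tag, count in tf.items()}
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b[tag] for tag, weight in a.items() if tag in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy tag data, invented for illustration
product_tags = {
    101: ["phone", "android"],
    102: ["phone", "ios"],
    103: ["book", "novel"],
}

vectors = tfidf_vectors(product_tags)
print(cosine(vectors[101], vectors[102]))  # positive: both carry the "phone" tag
print(cosine(vectors[101], vectors[103]))  # 0.0: no tags in common
```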
Item-based collaborative filtering recommendations
- Item-based collaborative filtering (Item-CF) only needs to collect users' routine behavioral data (such as clicks, favorites, purchases) to obtain the similarity between items, and is widely used in real projects.
- "Co-occurrence similarity": use behavioral data to calculate the similarity between different products
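The source does not spell out the co-occurrence formula. A commonly used form, assumed here, divides the number of users who interacted with both products by the square root of the product of each item's user count: |N_i ∩ N_j| / sqrt(|N_i| * |N_j|). The sketch below applies it to (uid, productId) behavior pairs; the function name and the data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def cooccurrence_similarity(behaviors):
    """Co-occurrence similarity for every product pair sharing at least one user.

    behaviors: iterable of (uid, productId) pairs (a click, favorite, purchase, ...).
    Returns a dict mapping (productA, productB), with productA before productB
    in sorted order, to the similarity value.
    """
    users_by_product = defaultdict(set)
    for uid, pid in behaviors:
        users_by_product[pid].add(uid)

    sims = {}
    for p, q in combinations(sorted(users_by_product), 2):
        common = len(users_by_product[p] & users_by_product[q])
        if common:
            sims[(p, q)] = common / math.sqrt(
                len(users_by_product[p]) * len(users_by_product[q]))
    return sims


# Toy behavior log, invented for illustration
behaviors = [(1, 101), (1, 102), (2, 101), (2, 102), (3, 101), (3, 103)]

print(cooccurrence_similarity(behaviors))
```

The square-root denominator damps very popular products, which would otherwise appear similar to everything simply because many users touched them.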
Hybrid recommendation: partitioned mixing
- Model-based recommendations
- Collaborative-filtering-based recommendations
- Content-based recommendations
- Statistics-based recommendations