Deep dive into recommendation engine components (based on Apache Mahout and Elasticsearch)

Recommendation engines help users narrow down their choices to the ones that fit their specific needs. In this post, we explore how the parts of a recommendation engine work together. We will use collaborative filtering to recommend movies based on movie rating data. The key components are the collaborative filtering algorithm from Apache Mahout, which builds and trains the machine learning model, and search technology from Elasticsearch, which simplifies deployment of the recommender.

What is recommendation?

Recommendation is a branch of machine learning that analyzes data to predict a user's preference for, or rating of, an item. Recommender systems are widely used in industry to recommend:

  • Books and other products (such as Amazon)
  • Music (like Pandora)
  • Movies (like Netflix)
  • Restaurants (like Yelp)
  • Jobs (e.g., LinkedIn)

Netflix's recommendation engine

Movie recommendations rely on the following observations:

  1. User behavior is the best clue to what users want.
  2. Co-occurrence is a simple basis that lets Apache Mahout compute significant identifiers (indicators) of what should be recommended.
  3. The weighting of the identifier scores that such a model outputs is mathematically similar to the weighting behind full-text search engines.
  4. This mathematical similarity makes it possible to deploy a Mahout recommender through text search, using a search engine such as Elasticsearch.

Recommendation engine architecture

Architecture of a Recommendation Engine

The architecture of the recommendation engine is as follows:


  1. Movie information is reformatted and stored in Elasticsearch so that it can be searched.
  2. The item similarity algorithm from Apache Mahout creates identifiers for movie recommendations based on users' existing movie ratings. These identifiers are added to the corresponding movie documents stored in Elasticsearch.
  3. Searching the identifier fields for the movies a user likes returns a list of other movies, sorted by relevance to the user's preferences.

Mahout-based collaborative filtering

The Mahout-based collaborative filtering engine looks at users' historical behavior and tries to predict what a user might like in the future. It does this by analyzing the products and content that users have interacted with in the past. Mahout focuses in particular on how items co-occur in user histories, because co-occurrence is the basis on which it computes significant identifiers of what to recommend. Suppose Ted likes movies A, B, and C, and Carol likes movies A and B. To recommend a movie to Bob, we note that Bob likes movie B, and since Ted and Carol also like movie B, movie A is a possible recommendation. Of course, this is a tiny example. In real applications, the information is mined from massive amounts of data.
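The Ted, Carol, and Bob example can be sketched in a few lines of Python. This is only an illustration of raw co-occurrence counting, not Mahout's actual algorithm; the `likes` dictionary and the `recommend` function are hypothetical names introduced here.

```python
from collections import Counter

# The example from the text: who likes which movies.
likes = {
    "Ted": {"A", "B", "C"},
    "Carol": {"A", "B"},
    "Bob": {"B"},
}

def recommend(user, likes):
    """Rank unseen items by how many like-minded users also liked them."""
    seen = likes[user]
    scores = Counter()
    for other, items in likes.items():
        # Skip the user and anyone who shares no liked item with them.
        if other == user or not (items & seen):
            continue
        for item in items - seen:
            scores[item] += 1
    return [item for item, _ in scores.most_common()]

print(recommend("Bob", likes))
```

Movie A ranks first for Bob because both Ted and Carol like it alongside movie B, while movie C is supported only by Ted.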

Recommendation grid

To obtain useful identifiers for recommendations, Mahout's ItemSimilarity job builds three matrices from users' historical behavior:

1. History matrix: Contains the interactions between users and items, as a two-dimensional user × item matrix.

History Matrix

2. Co-occurrence matrix: Transforms the history matrix into an item-by-item matrix that records which items appeared together in users' histories.

Co-occurrence matrix

In this example, Movie A and Movie B co-occur once, and Movie A and Movie C co-occur twice. The co-occurrence matrix cannot be used directly as a source of recommendation identifiers, because very popular items co-occur with a large number of other items.

3. Identifier matrix: Records only the anomalous (interesting) co-occurrences that can serve as cues for recommendation. Some items (in this case movies) are so popular that almost everyone likes them, so they co-occur with everything. Such co-occurrences are not anomalous and are of no interest to a recommender system. Co-occurrences that are too sparse are also unreliable and are therefore not recorded in the identifier matrix. In this example, Movie A is one of Movie B's identifiers.

identifier matrix
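The step from the history matrix to the co-occurrence matrix can be sketched as follows. The `history` data is hypothetical, chosen so that the counts match the figures quoted above (A and B co-occur once, A and C twice); the function name is also an assumption, and a real Mahout job computes this at scale rather than in memory.

```python
from collections import Counter
from itertools import combinations

# Hypothetical history matrix: each user maps to the set of movies they liked.
history = {
    "user1": {"A", "B", "C"},
    "user2": {"A", "C"},
    "user3": {"B"},
}

def cooccurrence_counts(history):
    """Count, for every unordered item pair, how many users liked both."""
    counts = Counter()
    for items in history.values():
        # sorted() gives each pair a canonical order, e.g. ("A", "B").
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts

cooc = cooccurrence_counts(history)
for pair, n in sorted(cooc.items()):
    print(pair, n)
```

Raw counts like these still favor popular items, which is why a further filtering step is needed before they can serve as identifiers.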

Mahout runs multiple MapReduce jobs in parallel to compute item co-occurrences (Mahout 1.0 runs on Apache Spark). Mahout's ItemSimilarity job uses the log-likelihood ratio (LLR) test to determine which co-occurrences are sufficiently anomalous to serve as recommendation identifiers. It outputs the item pairs whose similarity exceeds a configured threshold.
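To make the LLR test concrete, here is a minimal sketch of the log-likelihood ratio for a 2x2 contingency table, written in the entropy form commonly used for this statistic. It is an illustration under stated assumptions, not Mahout's code: the function names, the example counts, and the `THRESHOLD` value are all hypothetical. For a pair of items A and B, `k11` is the number of users who liked both, `k12` liked A but not B, `k21` liked B but not A, and `k22` liked neither.

```python
import math

def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score of a 2x2 co-occurrence table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

THRESHOLD = 5.0  # hypothetical cut-off, not a Mahout default

independent = llr(10, 10, 10, 10)  # the items look unrelated
associated = llr(10, 0, 0, 10)     # the items always appear together
print(independent > THRESHOLD, associated > THRESHOLD)
```

A table in which the two items are statistically independent scores close to zero, while a table in which they always appear together scores high and would pass the threshold.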

The output of the Mahout ItemSimilarity job shows which pairs of items co-occur anomalously often and can therefore be used as a basis for recommendation. For example, the row for Movie B has the column for Movie A marked, which means that liking Movie A is an identifier that the user will also like Movie B.

identifier matrix

Elasticsearch search engine


Elasticsearch is an open source search engine built on top of Apache Lucene, a full-text search engine library. Full-text search results are evaluated by precision and recall:

  • Precision = the ratio of the number of relevant documents retrieved to the total number of documents retrieved
  • Recall = the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document collection
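The two ratios above can be computed directly from a set of retrieved documents and a set of relevant documents. The function name and the document ids below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 3 relevant documents exist.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.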

Documents stored by Elasticsearch consist of several different fields. Each field has a corresponding name and content.

For our recommendation engine, we store movie metadata (such as the id, title, genre, and movie recommendation identifiers) in a JSON document:

{
  "id": "65006",
  "title": "Electric Horseman",
  "year": "2008",
  "genre": ["Mystery", "Thriller"]
}

The rows of the identifier matrix, which mark the significant or interesting co-occurrences, are stored in the identifier field of the movie documents in Elasticsearch. For example, since movie A is an identifier for movie B, movie A is stored in the identifier field of the movie B document. This means that when we search for movies that have movie A as an identifier, movie B is returned as a recommendation.
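As a rough sketch of this idea, the snippet below builds movie documents with an identifier field and a search body that looks for the movies a user likes in that field. The field name `identifiers`, the sample documents, and both function names are assumptions made for illustration; the source does not give the actual field name or query. The body uses a `bool` query with `should` and `term` clauses from the Elasticsearch query DSL and is only printed here rather than sent to a cluster. `local_search` is a simple in-memory stand-in that ranks by the number of matching ids; it does not reproduce Elasticsearch's relevance scoring.

```python
import json

# Hypothetical movie documents; "identifiers" is an assumed field name.
movies = [
    {"id": "A", "title": "Movie A", "identifiers": []},
    {"id": "B", "title": "Movie B", "identifiers": ["A"]},
    {"id": "C", "title": "Movie C", "identifiers": []},
]

def build_query(liked_ids, field="identifiers"):
    """Search body with one optional term clause per liked movie id."""
    clauses = [{"term": {field: movie_id}} for movie_id in liked_ids]
    return {"query": {"bool": {"should": clauses}}}

def local_search(docs, liked_ids, field="identifiers"):
    """In-memory stand-in: rank documents by number of matching ids."""
    scored = [(len(set(d[field]) & set(liked_ids)), d["title"]) for d in docs]
    return [title for score, title in sorted(scored, reverse=True) if score > 0]

body = build_query(["A"])
print(json.dumps(body, indent=2))
print(local_search(movies, ["A"]))
```

A user who likes movie A gets movie B back, because movie A appears in movie B's identifier field.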

Recommendation matrix

Search engines are optimized to find the documents whose fields best match the terms of a query. We use the search engine to find the movies whose identifier field best matches the query.

To learn more about building a recommendation engine, we recommend you check out the following resources:

Original link: https://www.mapr.com/blog/inside-look-at-components-of-recommendation-engine (translated by zhyhooo, edited by Zhou Jianding)

 

http://www.csdn.net/article/2015-05-14/2824676