Introduction to Recommendation System

Introduction

Notes migrated in 2021, mainly introducing the definition, development history, and several basic algorithms of recommender systems.

Recommender systems and search engines

What is the information overload problem?

Information overload refers to information in society exceeding what individuals or systems can accept, process, or use effectively; it is one of the negative effects of excessive information in the big-data era [6]. In such an era, how can we quickly and accurately obtain the high-quality information we need?

To solve the information overload problem caused by massive data, people have proposed two solutions: the search engine and the recommendation system. When recommendation systems are mentioned, the most commonly associated technology is the search engine, so, to parallel the search engine, people sometimes call the recommendation system a recommendation engine. Both technologies were proposed to solve information overload: one problem, two solutions with different starting points. Reference 1 jokingly calls them brothers, which is a vivid image.

Search engines favor people with a clear purpose. A user converts an information need into precise keywords, submits them to the search engine, receives a list of results, and can give feedback on those results. In this process the user plays an active role. But search has an obvious problem: the Matthew effect, that is, popular things become more popular as searches iterate, while unpopular things become even less popular.

The recommendation engine instead serves people with no clear purpose, or only a vague one; generally speaking, such users do not themselves know what they want. The recommendation system collects the user's historical behavior, interest preferences, or demographic characteristics, feeds them to the recommendation algorithm, and the algorithm generates a list of items the user may be interested in. In this process the user is passive toward the recommendation engine. The recommendation system has no very obvious Matthew effect (actually, I think it has one too), because it exploits the long-tail theory to a certain extent, which is also where its value lies.

The so-called long tail is actually a colloquial expression of the power law and Pareto distribution characteristics in statistics.

[Figure: the long-tail distribution]

Experiments have shown that the profits generated by low-exposure items in the long-tail position are no lower than those of high-exposure items, and are sometimes even greater. The theory was first put forward in 2004 by Chris Anderson, editor-in-chief of Wired magazine, who argued that the Internet era is an era of paying attention to the "long tail" and exploiting the long-tail effect. A recommendation system can give every item a chance at exposure, and thus tap the potential profit of the long tail.

The importance of recommendation systems in the contemporary Internet economy: 35% of Amazon's sales come from recommendations, Google News recommendations increase click-through rate by 38%, and two-thirds of the movie rentals of Netflix (in its DVD rental days) come from its recommendation system.

Development History

In the era of information overload, the main task of the recommendation system is to connect users and information: on the one hand helping users find information valuable to them, and on the other letting information reach the users who will be interested in it.

The earliest recognized recommender system is generally held to be Tapestry, a personalized email recommendation system proposed in 1992, though some scholars credit GroupLens, a 1994 news recommendation system based on collaborative filtering (because they believe the proposal of the collaborative filtering algorithm marks the true formation of the recommender systems field).


In 1994, the earliest automated collaborative filtering system was proposed: the GroupLens research group of the Department of Computer Science at the University of Minnesota, Twin Cities designed a news recommendation system called GroupLens. This work not only proposed the idea of collaborative filtering for the first time, but also established a formal model for the recommendation problem, with a huge impact on the development of recommender systems over the following decades (even now, recommender systems are almost equated with collaborative filtering). The research group later created the MovieLens recommendation website, an academic research platform for recommendation engines, whose dataset is by far the most cited in the recommendation field.

The origin of the recommendation system may be somewhat controversial, but if we ask who pushed recommender-system research to a historical climax, it must be Netflix's million-dollar competition. In 2006, Netflix announced a $1 million prize for the first entrant who could improve the accuracy of the company's existing recommendation algorithm (CineMatch) by more than 10%. (Friends, what was a million US dollars worth in 2006? The average exchange rate that year was about 1 USD = 7.97 RMB, while the average transaction price of commercial housing in Beijing was 6,000 to 8,000 RMB per square meter.)

There must be brave men where the reward is great; it has been so since ancient times.

After Netflix's announcement, more than 40,000 teams from 186 countries immediately entered the competition. In just two weeks Netflix received 169 submissions, and within a month, more than a thousand.

A few months after the competition started, some contestants had improved on the original CineMatch algorithm by 5%. A year later, the best entry was very close to 9%. However, it took another two years to break through the last 1%. In its later stages the competition evolved into an academic research event, and some participants even published their algorithms in full for peers to reference.

It was not until June 26, 2009 that the BellKor team broke through the 10% threshold for the first time, reaching 10.05%. This was the team that eventually won the prize. Interestingly, BellKor's win was not all smooth sailing. According to the competition rules, once a team breaks 10%, it wins only if no one submits a better algorithm within the following 30 days.

On July 26, 2009, the last day of the competition, the BellKor team submitted their latest algorithm at 10.06% improvement, with an RMSE (root mean square error) of 0.856704. Twenty minutes later, The Ensemble team (which included a Chinese member) also submitted an algorithm reaching 10.06%, with an RMSE of 0.856714. By that tiny final margin, the grand prize went to the BellKor team. Time really is money.

So, after three years of competition, BellKor, a seven-person team composed of engineers and statisticians, won the grand prize and got the million-dollar check, as shown in the picture:

[Photo: the BellKor team receiving the million-dollar check]

Here is a screenshot of the final leaderboard:

[Screenshots: the final Netflix Prize leaderboard]

There are two other interesting stories about this competition.

Two employees of AT&T Labs were on the winning team. Because they had used working hours to participate in the contest, the prize money they received ultimately belonged to AT&T. In the end, AT&T Labs donated the money to local educational charities and primary and secondary schools, to encourage young people to pursue science, technology, engineering, and mathematics (STEM) studies and careers.

After the first contest ended, Netflix, riding its popularity, quickly proposed a second million-dollar contest: recommending movies to customers who rate movies rarely or not at all. This required using users' real geographic information and behavioral data. The new contest dataset had 100 million records, including ratings, customer age, gender, ZIP code of residence, and previously watched movies. Although all the data was anonymized and could not be directly linked to any individual Netflix customer, making information such as age, gender, and ZIP code public made many people uncomfortable. The U.S. Federal Trade Commission grew concerned about the contest's damage to customer privacy, and a law firm filed a lawsuit against Netflix on behalf of clients. In March 2010, Netflix announced it was canceling the second contest to avoid legal trouble, and it appears to have held no such contest since.

Related Fields

A recommender system (RS) is an interdisciplinary research area spanning many fields, including but not limited to information retrieval (IR), data mining (DM), machine learning (ML), computer vision (CV), multimedia (MM), databases (DB), and, of course, artificial intelligence (AI).

By the way, here is an excerpt of relevant conferences involved in various fields:

RS (Recommender System): RecSys

IR (Information Retrieval): SIGIR

DM (Data Mining): SIGKDD, ICDM, SDM

ML (Machine Learning): ICML, NIPS

CV (Computer Vision): ICCV, CVPR, ECCV

MM (MultiMedia): ACM MM

DB (Database): CIKM, WIDM

AI (Artificial Intelligence): IJCAI, AAAI

Recommendation system classification

overview

The classification structure here follows reference 1:

[Figure: recommendation system classification tree]

To guard against the image being lost, the classification is also recorded as text:

Recommendation system classification:

  • Content-Based Recommendations
  • Recommendation Based on Collaborative Filtering
    • Memory-Based Collaborative Filtering
      • User-Based Collaborative Filtering
      • Item-Based Collaborative Filtering
    • Model-Based Collaborative Filtering
      • Bayesian network models
      • Latent semantic models
      • Graph-based models
      • Matrix factorization
      • …
  • Hybrid Recommendation

First, clarify a few concepts. In recommender systems, Item generally refers to the things being recommended, and User refers to the user. So you will sometimes see the concept of a UI matrix, which is simply the User-Item matrix describing interactions between users and items. (Taking shopping as an example, each element of the matrix records whether, or how many times, that user has bought that product.)

Content-Based Recommendations

Content-based recommendation makes recommendations from the attributes and content of the items themselves, for example the genre, type, or beat of music, or the style and category of movies. It does not need a UI matrix, because it recommends based on item content alone. Generally speaking, it does not rely on users' ratings of items; instead, it uses machine-learning methods to mine users' interests directly from the content.

In simple terms, the content-based recommendation process is as follows:

  1. Construct a content feature vector for each item and a feature vector for each user;
  2. Based on the items' content feature vectors, compute pairwise similarities (or some other score), find the list of items closest to the target item, and recommend to the user those that satisfy certain constraints (for example, items with high scores that the user has no behavior record for).

Of course, the actual process is slightly more complicated than this.
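As a minimal sketch of step 2 above (the item names and feature values are hypothetical, chosen only for illustration), the similarity computation might look like:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two content feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical item content vectors, e.g. weights over (action, romance, comedy).
items = {
    "movie_a": np.array([0.9, 0.1, 0.0]),
    "movie_b": np.array([0.8, 0.2, 0.1]),
    "movie_c": np.array([0.0, 0.9, 0.5]),
}

def most_similar(target, k=2):
    """Other items ranked by content similarity to the target item."""
    scores = {name: cosine_sim(items[target], vec)
              for name, vec in items.items() if name != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(most_similar("movie_a"))  # movie_b is closest in content to movie_a
```

From the ranked list one would then filter out items the user has already interacted with, as described in step 2.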

Reference 1 gives an example its author learned from Andrew Ng's machine learning course:

First define a utility function to evaluate the rating of a specific user c for a specific item s:

$u(c, s) = \theta_c^T x_s$

So how to learn a user attribute of the same dimension according to the content attribute of the project?

This requires defining another objective function:

$\min_{\theta_j} \ \frac{1}{2} \sum_{i:\, r(i,j)=1} \left( \theta_j^T x_i - y_{ij} \right)^2 + \frac{\lambda}{2} \sum_k \theta_{j,k}^2$

The above is a mean squared error with an L2 penalty term; as I recall, the L2 norm is used to reduce model complexity.

Then minimize this function by gradient descent or some other method, where $\theta_j$ is the user-dimension feature vector to be learned and $x_i$ is the content feature vector of item i. What we need to do is keep feeding all the items user j has interacted with into training, minimizing the error between observed and predicted data.

The formula computes the predicted score of user j on item i; $y_{ij}$ is user j's actual rating of item i, commonly known as the label, which here can be understood as 0/1. Thus the utility function u defined earlier is in fact $\theta^T x$.
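A tiny numeric sketch of this minimization with plain gradient descent (the feature values, ratings, and hyperparameters below are all made up for illustration):

```python
import numpy as np

# Toy content features x_i for three items (columns: bias, feature A, feature B)
# and user j's observed ratings y_ij.
X = np.array([[1.0, 0.9, 0.0],
              [1.0, 0.8, 0.2],
              [1.0, 0.0, 0.9]])
y = np.array([5.0, 4.5, 1.0])

lam, lr = 0.1, 0.05      # L2 penalty strength and learning rate
theta = np.zeros(3)      # the user-dimension feature vector to be learned

for _ in range(2000):
    # Gradient of 1/2 * sum (theta^T x_i - y_i)^2 + lam/2 * ||theta||^2
    grad = X.T @ (X @ theta - y) + lam * theta
    theta -= lr * grad

preds = X @ theta        # predicted utility u = theta^T x for each item
print(np.round(preds, 2))
```

After training, the learned `theta` reproduces the user's observed preferences: the items rated highly get high predicted utility, the low-rated item gets a low one.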

Recommendation Based on Collaborative Filtering

Collective intelligence.

Collaborative filtering can be regarded as the most classic and original recommendation algorithm. It has endured for decades, and academia is still introducing new variants of it.

The core idea can be described in one sentence: people who like product A also tend to like product B. If most people like both A and B, and you like A, then recommending B to you is reasonable. (In articles on collaborative filtering you will often see the saying: birds of a feather flock together.)

A more professional description: how do we find content of interest to a specific user? First find other users whose interests are similar to that user's, then recommend content those users like that the target user has not yet seen.

So the crux of the problem is: how to calculate, or quantify, the similarity of interests between users?

The most commonly used method is the nearest-neighbor technique. From the user's historical preferences and other information, build the user's interest feature vector, compute its distance to other users' interest vectors, find the one or more neighbors closest to the user, and use those neighbors' evaluations (possibly weighted) of an item to predict the target user's evaluation of it. This process requires building a User-Item matrix.

Collaborative-filtering recommendation algorithms, according to the object they operate on, are conventionally divided into two categories:

  • Item-based CF: Collaborative filtering based on product dimensions,
  • User-based CF: Collaborative filtering based on user dimensions.

However, as implementations of collaborative filtering multiplied, someone proposed, to keep things distinct, dividing CF into two categories from the perspective of implementation technique:

  • Memory-Based CF: memory-based collaborative filtering;
  • Model-Based CF: Model-based collaborative filtering.

As for why it is divided this way and where the boundary between the two lies, many reasons are given online, with no consensus. The most commonly accepted one is "whether machine-learning ideas are used in the implementation." After a brief look, I don't think that statement is very rigorous; after all, CF is itself a kind of machine-learning algorithm. I personally prefer to take "memory-based" literally: memory-based collaborative filtering loads the rating matrix into memory each time and performs statistical calculations on it directly. (Roughly like that; I'll write this part up in more detail after further reading.)

Memory-Based Collaborative Filtering

Memory-based collaborative filtering is mostly based on heuristic methods and relies on experience to make recommendations.

The most important step is the selection of the similarity function.

How to choose an appropriate similarity function to measure the similarity between two items or users is the key to the whole algorithm.

Another step is the recommendation strategy. The simplest recommendation strategy is to recommend items that most people (neighbors) have acted on but target users have not.

Item-Based CF:

  1. Build UI matrix;
  2. Calculate the similarity between columns (commodity dimensions) according to the UI matrix;
  3. Select the K products closest to a specific product (a product the user has purchased) to form a recommendation list;
  4. From the recommendation list, select a product that has not been purchased by a specific user and recommend it.

User-Based CF:

  1. Build UI matrix;
  2. Calculate the similarity between rows (user dimensions) according to the UI matrix;
  3. Select the K users most similar to a particular user;
  4. Recommend to the specific user products they have not yet purchased but that similar users frequently buy.

UI matrix example:

|               | knife | spear | armor | horse | shield |
|---------------|-------|-------|-------|-------|--------|
| Liu Xuande    | 1     | 1     | 1     | 1     | 1      |
| Guan Yunchang | 1     | 1     | 1     | 0     | 1      |
| Zhang Yide    | 1     | 1     | 1     | 0     | 0      |

With User-Based CF, Guan Yu and Liu Bei have high similarity and belong to the same group of people, so we recommend a horse to Guan Yu.

With Item-Based CF, knife, spear, and armor belong to one category. Now a user named Huang Zhong comes along; he has bought a knife and a spear, so we recommend armor to him.

For a product, the users who bought it are the product's features;

for a user, the products they bought are the user's features.
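The User-Based CF steps can be sketched on this same toy UI matrix (a minimal illustration, not a production implementation; the item names are approximate translations):

```python
import numpy as np

# The UI matrix from the example: rows = users, columns = items
# (knife, spear, armor, horse, shield); 1 = purchased.
ui = np.array([
    [1, 1, 1, 1, 1],   # Liu Xuande (Liu Bei)
    [1, 1, 1, 0, 1],   # Guan Yunchang (Guan Yu)
    [1, 1, 1, 0, 0],   # Zhang Yide (Zhang Fei)
])
items = ["knife", "spear", "armor", "horse", "shield"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def user_based_recommend(user):
    """Recommend items the nearest neighbor bought that this user has not."""
    sims = [(cosine(ui[user], ui[v]), v) for v in range(len(ui)) if v != user]
    _, neighbor = max(sims)
    return [items[i] for i in range(len(items))
            if ui[neighbor, i] == 1 and ui[user, i] == 0]

print(user_based_recommend(1))   # Guan Yu's nearest neighbor is Liu Bei
```

Guan Yu's row is closest (by cosine similarity) to Liu Bei's, and the horse is what Liu Bei has that Guan Yu lacks, matching the example in the text.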

Think of the story of beer and diapers, in a sense, it is a bit like Item-Based CF.

The story of beer and diapers is a classic case in data analysis, said to have happened in an American supermarket in the 1990s. While analyzing sales data, supermarket managers noticed a puzzling phenomenon: two seemingly unrelated items, beer and diapers, frequently appeared in the same shopping basket. It turned out that when a father bought diapers, he would often buy beer for himself at the same time.

The supermarket discovered this unique phenomenon and began to try to place beer and diapers in the same area in the store, so that young fathers can find these two items at the same time and complete shopping quickly.

Model-Based Collaborative Filtering

That is, recommendation implemented with machine-learning ideas.

This part is relatively broad.

First comes the characterization of the problem: "recommendation" can be regarded as a classification problem, or as a clustering problem.

Once the nature of the problem is determined, different techniques can be applied to solve it, such as regression, matrix factorization, neural networks, and graphical models.

  • Loss function + regularization term (as in the content-based example above)
  • Neural network + layers

Recommendation based on matrix factorization

Recommendation based on Matrix Factorization treats "recommendation" as a **matrix completion (filling)** task.

Assume we have M products and N users; the UI matrix is then of size M×N. It is of course a sparse matrix: users' ratings of products are far from complete, and essentially no one rates every product. Our task is to predict the unknown entries by analyzing the existing (observed) data, which is a matrix completion task.

The task of matrix completion can be accomplished by matrix factorization techniques.

Among matrix decomposition techniques, the most commonly used method is Singular Value Decomposition (SVD). From this, many matrix factorization techniques based on SVD are derived.

For example:

  • NMF (1999): enriched the theoretical basis for applying matrix factorization to recommender systems.

  • FunkSVD (2006).

  • PMF (2008): a probabilistic variant of FunkSVD.

  • BiasSVD (2009): some users have biases of their own, for example being generous with praise and easy to please.

  • SVD++ (2010; widely regarded as a gem, an improvement on BiasSVD): a user's historical rating or browsing records reflect their preferences indirectly, so implicit feedback is added.

  • timeSVD (2010): a user's interests and preferences are not fixed, but evolve dynamically over time.

  • NCRPD-MF (2014): matrix factorization plus text reviews, geographic neighbor information, item categories, popularity, and other information.

  • ConvMF (2016): matrix factorization plus a CNN to extract document information.

Since they are all based on matrix decomposition, the core routines are similar.

Taking FunkSVD as an example, the core is to decompose the UI matrix into two low-rank matrices, one for users and one for items, reducing computational complexity at the same time:

$UI_{m \times n} = U_{m \times k} \, I_{k \times n}$

Then the predicted score of the u-th user on the i-th product is:

$p_{ui} = U_u^T I_i$

i.e., row u of U multiplied by column i of I.

Its core goal is to minimize the difference between the predicted score and the actual score, such as:

$\min_{U, I} \sum_{u,i} (r_{ui} - p_{ui})^2$

As for the optimization method, gradient descent and the like are acceptable.
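A minimal FunkSVD-style sketch (toy matrix and made-up hyperparameters), training the two low-rank factor matrices by stochastic gradient descent over the observed entries only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy UI matrix; 0 marks an unobserved rating (made-up data).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
k = 2                                            # number of latent factors
U = 0.1 * rng.standard_normal((R.shape[0], k))   # user factors
V = 0.1 * rng.standard_normal((R.shape[1], k))   # item factors
lr, lam = 0.01, 0.02                             # learning rate, regularization

for _ in range(5000):
    for u, i in zip(*R.nonzero()):      # iterate over observed entries only
        err = R[u, i] - U[u] @ V[i]
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - lam * U[u])
        V[i] += lr * (err * u_old - lam * V[i])

pred = U @ V.T                          # completed matrix: p_ui = U_u^T I_i
print(np.round(pred, 1))
```

After training, `pred` reproduces the observed ratings closely, while the zeros (the unobserved entries) are filled in with predicted scores: exactly the matrix completion described above.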

Evaluation Metrics for Recommender Systems

The evaluation index is an intuitive measure of the quality of the recommendation system.

Generally speaking, according to different recommendation tasks, the most commonly used recommendation quality measurement methods can be divided into three categories:

  1. Evaluating predicted ratings, suitable for rating-prediction tasks;
  2. Evaluating the predicted item set, suitable for Top-N recommendation tasks;
  3. Evaluating the ranked recommendation list, applicable both to rating-prediction and to Top-N recommendation tasks.

The specific evaluation indicators corresponding to these three types of measurement methods are:

  1. Rating-prediction metrics, e.g. accuracy metrics: mean absolute error (MAE), root mean square error (RMSE), normalized mean absolute error (NMAE); plus coverage (Coverage), which can be roughly understood as diversity, promoting the long-tail effect (otherwise, just proceed as in ordinary regression prediction);

  2. Set-recommendation metrics, e.g. precision (Precision), recall (Recall), ROC and AUC;

  3. Ranking-recommendation metrics: half-life utility (HLU) and discounted cumulative gain (DCG).
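As a small illustration of the ranking-metric idea, here is a sketch of DCG (using one common discount convention; formulations vary across papers and libraries):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance scores.

    Uses the common 1/log2(rank + 1) discount with 1-based ranks.
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

# Ranking the most relevant items first yields a higher DCG:
good = dcg([3, 2, 0, 1])
bad = dcg([0, 1, 2, 3])
print(round(good, 3), round(bad, 3))
```

The same relevance scores in a worse order give a lower DCG, which is precisely what makes it a ranking metric rather than a set metric.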

Symbol definition:

  • U denotes the set of users in the test set;

  • I denotes the set of items in the test set;

  • $r_{ui}$ denotes user u's true rating of item i, with NULL marking a missing rating ($r_{ui}$ = NULL means user u has not rated item i);

  • $p_{ui}$ denotes the algorithm's predicted rating of item i by user u.

Then the root mean squared error (RMSE) is:

$\mathrm{RMSE} = \sqrt{\frac{\sum_{(u,i):\, r_{ui} \neq \mathrm{NULL}} (r_{ui} - p_{ui})^2}{N}}$

where N is the number of rated user-item pairs in the test set.

If you are interested, you can read reference 10 directly; it is comprehensive and clear, so I won't expand on it here (mainly because I haven't finished reading it myself).
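The rating-prediction metrics are straightforward to compute; a minimal sketch over (true, predicted) rating pairs:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true rating, predicted rating) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true rating, predicted rating) pairs."""
    return sum(abs(r - p) for r, p in pairs) / len(pairs)

# Hypothetical (r_ui, p_ui) pairs from a test set:
pairs = [(5, 4.5), (3, 3.5), (1, 2.0)]
print(round(rmse(pairs), 3), round(mae(pairs), 3))
```

RMSE squares the errors, so it penalizes large mistakes more heavily than MAE does.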

Problems with recommender systems

Although recommender systems have been developing for a long time, and have recently taken a great leap forward with the help of deep learning, many problems remain today:

  1. Data Sparsity
  2. Cold start problem (Cold start)
  3. Synonym problem (Synonymy)
  4. The Lonely User Problem (Gray Sheep)
  5. Shilling Attack
  6. Other issues like privacy, interpretability, novelty, etc.

data sparsity

Take Taobao as an example: there are a great many users and products, and the products any single user can encounter and learn about may be only 1% of the total. The Netflix Prize dataset contains about 480,000 users, 17,770 movies, and on the order of 100 million rating records, yet the resulting UI matrix is still extremely sparse. With such a sparse matrix, the information is scattered, and there are too many parameters and too much computation.

Solution: dimensionality reduction. Singular Value Decomposition (SVD) is used to reduce the dimensionality of the sparse matrix and obtain the best low-rank approximation of the original matrix. But it brings problems of its own: the computational cost on large data, and the impact on effectiveness (after SVD truncation, some inactive users or items are discarded, so recommendations for them suffer and niche groups are no longer represented), which runs against the original intention of mining the long tail.
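A sketch of this dimensionality-reduction idea with NumPy's SVD (toy matrix; note that in a real system the zeros mean "missing", not "rated 0", a distinction this naive sketch ignores):

```python
import numpy as np

# Sparse toy rating matrix (0 = missing; made-up data).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# Truncated SVD: keep only the top-k singular values/vectors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation of R

print(np.round(R_k, 1))
```

By the Eckart-Young theorem, `R_k` is the best rank-k approximation in the Frobenius norm; the reconstruction error equals the norm of the discarded singular values.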

cold start problem

There is no historical behavior information for new users and new products.

One solution is hybrid recommendation, that is, a mixture of multiple recommendation methods;

Another method is to introduce users' personal data, such as demographic characteristics: age, gender, place of residence, etc., to calculate user similarity. Although this approach will reduce the recommendation accuracy to a certain extent, it can also be considered when the data is very sparse.

Take Weibo as an example: when you log in after registering, or log in again after being away for a while, you are asked to select several areas of interest and are then recommended several corresponding accounts. This is also one way to handle the cold-start problem.

synonym problem

This problem manifests itself in the fact that the same category of items in the recommendation system is sometimes classified under different names, which further aggravates the data sparsity. Solutions include synonym mining, various semantic analysis, etc.

The lonely user problem

There are always some users whose preferences are different from anyone else's, and there are even some users whose preferences are completely opposite to normal people. This reality is difficult to solve.

Shilling attack problem

This is essentially a spam-filtering problem: some people give high scores to their own products, or products that benefit them, while giving competitors' products low scores, which disrupts the normal working of collaborative filtering. The active solution is cleaning and filtering; passive solutions include Item-Based CF (cheaters are always a small minority, so their influence on item-similarity computation is small) or a hybrid algorithm.

references

  1. Recommendation systems from getting started onward (very well written, especially the introduction's background on recommender systems and search engines)
  2. Summary of recommendation-system essentials (by the same author as reference 1)
  3. Baidu Encyclopedia: Long Tail Theory
  4. Chapter 1, Collaborative filtering recommendation algorithms: memory-based collaborative filtering
  5. Recommendation systems and their matrix factorization techniques (a good article, but takes some chewing)
  6. Baidu Encyclopedia: Information Overload
  7. Overview of recommendation-system development
  8. Reminiscing about the intense finish of the Netflix million-dollar prize
  9. The story of the Netflix million-dollar data-modeling prize
  10. Evaluation metrics commonly used in recommender-system research (detailed and clear)
  11. Summary of the twelve major evaluation metrics for recommender systems
  12. Summary of matrix factorization in recommender systems (a good summary, with the source paper for each algorithm attached; can be read together with reference 5)

Origin blog.csdn.net/wlh2220133699/article/details/131255551