Recommendation system evaluation method

What makes a good recommendation system? Take the book recommendation system as an example:

First of all, the recommendation system must meet users' needs: cover as many kinds of books as possible, collect high-quality user feedback, increase the interaction between users and the book website, and increase website revenue. It should not only predict user behavior accurately, but also broaden users' horizons by helping them discover items they may be interested in but would not easily find on their own. This article mainly proposes different indicators from the perspectives of users, the website, and content providers.

Three Recommender System Experiment Methods

1. Offline experiment

Implementation steps:

(1) Extract the user behavior data collected by the logging system and assemble it into a standard data set;

(2) Split the data set into a training set and a test set according to certain rules;

(3) Train the user interest model on the training set and test it on the test set;

(4) Evaluate the algorithm's prediction results on the test set using the predefined offline metrics.
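The whole offline procedure can be condensed into a minimal sketch like the one below. It assumes the log has already been parsed into (user, item) interaction records; the 1-in-m random split rule and the parameter names are illustrative assumptions, not a prescribed implementation.

```python
import random
from collections import defaultdict

def split_data(records, k=0, m=8, seed=42):
    """Assign each (user, item) record to the test set with probability 1/m
    and to the training set otherwise (one simple split rule among many)."""
    random.seed(seed)
    train, test = defaultdict(set), defaultdict(set)
    for user, item in records:
        if random.randint(0, m - 1) == k:
            test[user].add(item)
        else:
            train[user].add(item)
    return train, test

# Usage: train the user interest model on `train`, generate recommendation
# lists for each user, then score them against `test` with the offline
# metrics discussed later in this article (precision/recall, coverage, ...).
```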

Advantages: all experiments are completed on a data set extracted from the system logs, so they depend little on the live system or on user participation, and are convenient and fast.

Disadvantages: weak ability to obtain the indicators of commercial concern.

2. User study

A user study is an experimental method for understanding the performance of a recommender system by analyzing how surveyed users (real users) behave while completing tasks and answering questions on the test system. It serves as preparatory work for online testing, guarding against potential problems that could reduce user satisfaction during the online test.

Advantages: it can obtain indicators related to users' subjective experience, which offline experiments cannot provide; risks are easy to control.

Disadvantages: high experimental cost, difficult to organize large-scale tests; difficult to design double-blind experiments, which affects the evaluation results.

3. Online experiment

The online experimental method here refers to the AB test method.

Implementation steps:

(1) Randomly group users according to certain rules;

(2) Apply different algorithms to different groups of users;

(3) Compute the evaluation indicators for each group of users in order to compare the different algorithms.
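A minimal sketch of step (1), assuming users are assigned to groups by hashing their IDs so that the grouping is random yet stable across sessions; the experiment name, bucket names, and traffic shares below are hypothetical.

```python
import hashlib

# Hypothetical configuration: algorithm bucket -> share of traffic.
EXPERIMENT_BUCKETS = [("baseline", 0.5), ("new_algorithm", 0.5)]

def assign_bucket(user_id: str, experiment: str = "rec_ab_test") -> str:
    """Deterministically map a user to a bucket so that the same user always
    sees the same algorithm and the split roughly matches the traffic shares."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest, 16) / 16 ** len(digest)  # uniform value in [0, 1)
    cumulative = 0.0
    for name, share in EXPERIMENT_BUCKETS:
        cumulative += share
        if point < cumulative:
            return name
    return EXPERIMENT_BUCKETS[-1][0]

# Step (3) then aggregates the evaluation indicators (CTR, conversion, ...)
# separately for each bucket and compares them across algorithms.
```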

Advantages: it provides a fair comparison of the actual online performance of different algorithms, including the metrics of commercial interest.

Disadvantages: the test cycle is long; building an AB testing system requires a large amount of engineering, and traffic-splitting design is generally essential.

Evaluation indicators:

1. User satisfaction:

In practice, this indicator is basically unavailable. First, it is a rather subjective indicator, generally obtained through user research, and users mostly do not know what they really want; at best they can describe a superficial feeling, which has limited reference value. Second, the person most concerned about the quality of the recommendation system is often the project's PM, so the concept of user satisfaction can easily be quietly replaced by PM satisfaction or boss satisfaction. A friend of mine once worked at a game company where every requirement was driven by the boss's satisfaction, and it eventually shut down. Therefore, in actual situations, objective indicators that measure accuracy are used as proxies, for example using click-through-rate statistics to approximate users' satisfaction with the launched content.

2. Prediction accuracy:

In application, recommendation can be divided into TopN recommendation and rating prediction according to the scenario. Rating prediction is generally evaluated with RMSE (root mean square error) and MAE (mean absolute error). RMSE penalizes large prediction errors more heavily, so the evaluation is stricter. The accuracy of TopN recommendation is generally evaluated with Recall and Precision. When necessary, multiple precision/recall pairs can be computed and plotted as a PR curve for evaluation.
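A minimal sketch of these metrics, assuming rating predictions are given as (true, predicted) pairs and TopN results as per-user recommendation lists plus the ground-truth items each user interacted with in the test set.

```python
import math

def rmse(pairs):
    """Root mean square error over (true_rating, predicted_rating) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, predicted_rating) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

def precision_recall(recommendations, ground_truth):
    """TopN accuracy.
    recommendations: {user: [recommended items]}
    ground_truth:    {user: set of items the user actually liked in the test set}
    Returns (precision, recall) aggregated over all users."""
    hits = rec_count = truth_count = 0
    for user, rec_list in recommendations.items():
        truth = ground_truth.get(user, set())
        hits += len(set(rec_list) & truth)
        rec_count += len(rec_list)
        truth_count += len(truth)
    return hits / rec_count, hits / truth_count
```

Varying N (the length of the recommendation list) yields multiple precision/recall pairs, which can then be plotted as the PR curve mentioned above.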

3. Coverage:

Coverage describes the system's ability to surface long-tail items. Simply put, the more distinct items that appear across all users' recommendation lists, the higher the coverage, which leads to the simplest definition of coverage: the proportion of items the system is able to recommend out of the total item set. However, this calculation does not consider how often each item appears in the recommendation lists. If the proportion of covered items is large and, in addition, the items appear with similar frequencies, then the ability to mine the long tail is stronger. Two metrics that define coverage in terms of how often items appear in recommendation lists are information entropy and the Gini coefficient (sketched at the end of this subsection). Computing both involves item popularity, where the popularity of an item is the number of users who have had behaviors (interactions) with it.
The long tail effect is as follows:
[Figure: the long tail effect]
Long tail effect (Long Tail Effect): "head" and "tail" are two statistical terms. The raised part in the middle of the distribution curve is called the "head"; the relatively flat parts on either side are called the "tail". From the perspective of demand, most demand concentrates in the head, which can be called the popular part, while the demand distributed along the tail is personalized, scattered, and small in volume. This differentiated, small-volume demand forms a long "tail" on the demand curve, and the point of the long tail effect lies in its quantity: accumulated together, all of these non-popular markets can form a market even larger than the popular one.
Simply put, there is a huge "wealth gap" between popular items and unpopular items.
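A minimal sketch of the three coverage-style measures mentioned above (simple coverage, information entropy, and the Gini coefficient), assuming per-user recommendation lists and the full item catalog; the popularity counts here are taken from how often each item appears in the recommendation lists.

```python
import math
from collections import Counter

def coverage(recommendations, all_items):
    """Simplest coverage: fraction of the catalog appearing in any list."""
    recommended = {i for rec_list in recommendations.values() for i in rec_list}
    return len(recommended) / len(all_items)

def item_counts(recommendations):
    """How many times each item appears across all recommendation lists."""
    return Counter(i for rec_list in recommendations.values() for i in rec_list)

def entropy(recommendations):
    """Information entropy of item frequencies (natural log); higher means
    the recommendations are spread more evenly over items."""
    counts = item_counts(recommendations)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def gini(recommendations):
    """Gini coefficient of item frequencies; lower means a flatter, more
    long-tail-friendly distribution, higher means a few items dominate."""
    counts = sorted(item_counts(recommendations).values())
    n, total = len(counts), sum(counts)
    weighted = sum((2 * j - n - 1) * c for j, c in enumerate(counts, start=1))
    return weighted / (n * total)
```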

4. Diversity:
Diversity describes the dissimilarity between the items in a recommendation list and can be defined through pairwise similarity: the higher the average similarity between items in the list, the lower its diversity.
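A minimal sketch of this definition, assuming some similarity function sim(i, j) returning values in [0, 1] is already available (content-based or computed from co-occurrence); diversity is taken as one minus the average pairwise similarity within a list.

```python
from itertools import combinations

def list_diversity(rec_list, sim):
    """Diversity of one recommendation list: 1 minus the average pairwise
    similarity between the items it contains."""
    pairs = list(combinations(rec_list, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(sim(i, j) for i, j in pairs) / len(pairs)

def system_diversity(recommendations, sim):
    """Overall diversity: the average list diversity over all users."""
    return sum(list_diversity(r, sim) for r in recommendations.values()) / len(recommendations)
```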

5. Novelty:
Introduce content that users have not been exposed to before.
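The most basic way to achieve this is simply to filter out items the user has already interacted with before serving the list; a trivial sketch, assuming the user's history is available as a set:

```python
def filter_seen(rec_list, seen_items):
    """Drop items the user has already interacted with, keeping list order."""
    return [item for item in rec_list if item not in seen_items]

# A stricter notion of novelty also prefers less popular items, since very
# popular items are likely already known to the user anyway.
```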

6. Serendipity (surprise degree):
Recommend content that is unrelated to the user's historical interests but that the user still finds very good.

7. Trust:
The extent to which users trust and agree with the system's recommendation results and the reasons given for them.

8. Real-time:
Can the recommendation list be updated in real time after new user behavior is generated? Can newly added items be recommended to users immediately (item cold start)?

9. Robustness:
Robustness is the ability to resist cheating (spam or attack behavior) and can be evaluated by simulating attacks. Methods to improve robustness:
1. When designing the recommendation system, prefer user behaviors that are costly to fake (for example, purchase behavior is costlier than browsing behavior);
2. Perform attack detection before using the data to train the model, and clean the data.
