2021 MathorCup College Mathematical Modeling Big Data Competition Problem Solving Ideas

Track B: Sequence Evaluation Problems in Information Flow Intelligent Recommendation Algorithms

With the rapid growth of information on the Internet, users of Internet applications face information overload. Recommendation algorithms address this by meeting users' personalized content consumption needs and making it more efficient to find useful information, and they are now widely used in Internet apps. Information flow (feed) products are the main application scenario for recommendation algorithms and the main entry point through which users access Internet information. They are fully integrated into daily life and have become a primary way for people to learn about the world.
Figure 1 shows an example of an information flow product. In this example, after the user performs a refresh, the recommendation system returns K recommendation results, which constitute a recommendation sequence. The first 4 recommended items fill one mobile phone screen, and the user can keep scrolling down to browse the rest. A recommendation sequence can mix content types; for example, content 1 is graphic (image-and-text) content and content 4 is video content. Note that the number of items K returned by the recommendation system is not necessarily fixed: the system can adjust it dynamically according to the user's request environment to give the best browsing experience. How to determine K is outside the scope of this problem.

Figure 1 Example of information flow products

The core idea of traditional recommendation algorithms is to mine the match between the recommended content and the user's interests, together with the quality of the content itself, and to recommend the most relevant or highest-quality content to the user. As shown in Figure 2(a), the recommender system scores each candidate item individually, estimating the comprehensive revenue of recommending it from how well it matches the user's interests and from its quality (comprehensive revenue usually covers whether the user will click on the content and how long the user will spend on it). The score describes the comprehensive revenue brought by each single item. The system then selects the K items with the highest estimated scores and recommends them to the user in descending order of score. We call this recommendation method point-wise.
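The point-wise selection described above can be sketched in a few lines of Python. This is only an illustration: the function name `pointwise_top_k` and the score values are made up, and how the per-item scores are estimated is left entirely open.

```python
import heapq

def pointwise_top_k(scores, k):
    """Return the k content IDs with the highest estimated scores, best first.

    scores: dict mapping content ID -> estimated comprehensive revenue
            (hypothetical values; the estimator itself is not shown).
    """
    return heapq.nlargest(k, scores, key=scores.get)

# Made-up scores for four candidate items.
scores = {"A": 0.42, "B": 0.77, "C": 0.15, "D": 0.63}
top = pointwise_top_k(scores, 2)  # items are ranked independently of each other
```

Each item is ranked on its own score, so the result ignores any interaction between neighbouring items in the sequence.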
However, studies have found that, beyond the content itself, the arrangement and combination of content also affects the user's browsing experience, which in turn affects recommendation revenue. For example, a high concentration of similar content tends to get poor feedback, even if every item closely matches the user's interests or is of high quality. More and more research therefore focuses on choosing the optimal arrangement of content, not just the optimal content. As shown in Figure 2(b), the same three items A, B and C bring different revenue when recommended in different orders (A→B→C, A→C→B, ...). The recommendation system first generates candidate recommendation sequences from the candidate content, then scores each candidate sequence. Finally, it selects the sequence with the highest estimated score and recommends the content to the user in that order. We call this recommendation method list-wise. This problem asks contestants to design a mathematical model that evaluates the overall revenue of a given candidate sequence.
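To make the list-wise idea concrete, here is a minimal Python sketch that enumerates every ordering of a small candidate set and keeps the highest-scoring one. The scoring function is purely hypothetical (a position discount plus a penalty for adjacent items of the same type); it stands in for whatever sequence evaluation model a contestant actually builds, and all item scores and categories are made up.

```python
from itertools import permutations

def toy_sequence_score(seq, item_score, category):
    """Hypothetical list-wise score, for illustration only."""
    total = 0.0
    for pos, item in enumerate(seq):
        total += item_score[item] / (pos + 1)  # assumed position discount
        if pos > 0 and category[item] == category[seq[pos - 1]]:
            total -= 0.2                        # assumed penalty for adjacent similar items
    return total

def best_sequence(candidates, k, item_score, category):
    """Score every ordered selection of k candidates and return the best one."""
    return max(
        permutations(candidates, k),
        key=lambda s: toy_sequence_score(s, item_score, category),
    )

# Made-up item scores and content types.
item_score = {"A": 0.9, "B": 0.8, "C": 0.5}
category = {"A": "video", "B": "video", "C": "news"}
best = best_sequence(["A", "B", "C"], 3, item_score, category)
```

Under these assumed numbers the best order is A→C→B, whereas sorting by item score alone would give A→B→C. Exhaustive enumeration like this is only feasible for very small candidate sets.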


Figure 2 Single-content evaluation and sequential evaluation
The problem provides users' exposure history on an information flow product over the past week (train_data.txt) as a training set for modeling and analysis. The train_data.sample.txt file in the attachment gives an example of the data format. The fields involved are:

  1. User ID: the unique identifier of the user, for example 1000024368;

  2. Request ID: the unique identifier of a single request by the user to the recommendation service, for example 500012184_1635188998881_5305;

  3. Date: the date of the request, for example 20211026;

  4. Time: the hour of the request, for example 22 (i.e., 22:00);

  5. Recommendation sequence: the list of content returned by the recommendation service for the request, in the order in which the content was actually recommended. Items are separated by ";". Each item has three fields separated by ":": the content ID, whether the user clicked (0 means no click, 1 means click), and the user's browsing time in seconds. For example: 133672454001:0:0;508896132:1:111;508969800:0:0;508870333:1:10;
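The recommendation sequence field can be parsed as follows. This is a sketch that handles only that one field, since the separator between the top-level columns of train_data.txt is not stated here; the names `Exposure` and `parse_train_sequence` are illustrative.

```python
from typing import List, NamedTuple

class Exposure(NamedTuple):
    content_id: str
    clicked: int      # 0 = no click, 1 = click
    duration: float   # browsing time in seconds

def parse_train_sequence(field: str) -> List[Exposure]:
    """Parse 'id:click:seconds;id:click:seconds;...' into a list of exposures."""
    items = []
    for chunk in field.strip().split(";"):
        if not chunk:
            continue  # skip the empty piece left by the trailing ';'
        content_id, clicked, duration = chunk.split(":")
        items.append(Exposure(content_id, int(clicked), float(duration)))
    return items

seq = parse_train_sequence(
    "133672454001:0:0;508896132:1:111;508969800:0:0;508870333:1:10;"
)
total_clicks = sum(e.clicked for e in seq)     # sequence-level click label
total_duration = sum(e.duration for e in seq)  # sequence-level duration label
```

For the example sequence this yields 2 clicks and 121 seconds in total, which are the two sequence-level quantities the questions below ask contestants to model.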

The problem also provides the basic attributes of the content (doc_info.txt). The doc_info.sample.txt file in the attachment gives an example of the data format. The fields involved are:

  1. Content ID: the unique identifier of the content, such as 133342615958;

  2. Content type: content is one of two types, video (video) or graphic, i.e. image-and-text (news);

  3. Content category: the first-level and second-level categories of the content, for example variety show/mainland variety show.

Finally, the problem provides some user recommendation sequences from the period after the training set as a test set (test_data.txt). The test_data.sample.txt file in the attachment gives an example of the data format. Contestants need to predict the revenue of the test set sequences from the training set data. The fields involved are:

  1. Request ID: the unique identifier of a single request by the user to the recommendation service, for example 500012184_1635188998881_5305;

  2. User ID: the unique identifier of the user, for example 1000024368;

  3. Date: the date of the request, for example 20211026;

  4. Time: the hour of the request, for example 22 (i.e., 22:00);

  5. Recommendation sequence: the list of content returned by the recommendation service for the request, in the order in which the content was actually recommended. Items are separated by ";". Only the content ID is given for each item, for example: 508681374;133681260394;508767175;508767175;

The full dataset above can be downloaded from https://pan.yidian-inc.com/index.php/s/QB7lhh7YPKLJWfL. Contestants are asked to analyze the data above and build models to solve the following problems. The final solution must be written up as a paper covering the main models, algorithms and computed results, and the prediction results for Question 2 must be submitted to the competition system as a separate file, without changing the file format.
Question 1: Establish a mathematical model for evaluating the total click revenue of a recommended sequence (the sum of clicks on the individual items in the sequence) and its total duration revenue (the sum of browsing durations of the individual items in the sequence), and explain how to quantify the comprehensive revenue from the click revenue and the duration revenue. Unlike a model that evaluates the revenue of a single recommended item, the sequence evaluation model must explain how it accounts for the effect of different arrangements and combinations on revenue.
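One possible notation for the quantities in Question 1 is sketched below. The symbols are assumptions introduced here, not part of the problem statement: for a sequence S of K items, c_i ∈ {0, 1} is the click indicator of the i-th item and t_i its browsing duration in seconds. The weighted sum V(S) is just one hypothetical way to combine the two revenues; the problem leaves the choice of combination, and of the weights α and β, to the contestant.

```latex
C(S) = \sum_{i=1}^{K} c_i, \qquad
T(S) = \sum_{i=1}^{K} t_i, \qquad
V(S) = \alpha \, C(S) + \beta \, T(S), \quad \alpha, \beta \ge 0
```

Because clicks and seconds are on different scales, any such combination would need some form of normalization or a justified choice of weights.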
Question 2: Using the mathematical model designed in Question 1, predict the total number of clicks and the total browsing duration (in seconds) of each recommended sequence in the test set (test_data.txt), write the predictions to result.csv and submit it. The file contains three columns: request ID, total clicks, and total browsing duration. The request ID corresponds to the request ID in the test set, and the total clicks and total browsing duration are the predicted sums of clicks and of duration over the recommended sequence for that request ID.
The total clicks and total duration in the attached result.csv are randomly generated sample data; contestants need to replace them with their own predictions before submitting.
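A minimal sketch of writing predictions in a three-column CSV layout is shown below. The header names, the presence of a header row, and the prediction values are all assumptions made for illustration; the attached sample result.csv should be treated as the authority on the exact format, since the file format must not be changed.

```python
import csv
import io

def write_results(rows, out):
    """Write (request_id, total_clicks, total_duration) rows as CSV to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    # Assumed header names; match them to the attached sample result.csv.
    writer.writerow(["request_id", "total_clicks", "total_duration"])
    for request_id, clicks, duration in rows:
        writer.writerow([request_id, clicks, duration])

# Placeholder prediction values, not real model output.
buf = io.StringIO()
write_results([("500012184_1635188998881_5305", 1.3, 95.0)], buf)
csv_text = buf.getvalue()
```

To write the actual submission, the same function can be called with a file opened via `open("result.csv", "w", newline="")`.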

Question 3: Suppose there are N candidate items and the optimal recommendation sequence of length K (N≥K) is to be selected; the number of sequences that would need revenue evaluation is A_N^K, the number of ordered selections of K items out of N. In a real recommendation scenario, computing performance constraints mean the system cannot evaluate the revenue of every possible sequence. It usually has to prune the set of sequences first, using a method of lower computational complexity, and keep only a small number of candidate sequences for accurate revenue evaluation. The goal of the pruning strategy is to make the candidate set as likely as possible to contain the optimal sequence. Please describe your modeling ideas, as well as the accuracy and time complexity of your pruning strategy.
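To illustrate the scale of the problem and one common pruning approach, the Python sketch below counts the ordered sequences with `math.perm` (Python 3.8+) and prunes with beam search. Beam search is only one possible strategy, chosen here as an example; the problem does not prescribe it. The scoring function `toy_score` and its numbers are hypothetical stand-ins for a cheap approximate evaluator.

```python
import math

def num_candidate_sequences(n, k):
    """Number of ordered sequences of length k drawn from n items: n! / (n - k)!."""
    return math.perm(n, k)

def beam_search(candidates, k, score_prefix, beam_width):
    """Grow sequences one item at a time, keeping only the best `beam_width` prefixes."""
    beams = [()]
    for _ in range(k):
        expanded = [p + (c,) for p in beams for c in candidates if c not in p]
        expanded.sort(key=score_prefix, reverse=True)
        beams = expanded[:beam_width]  # pruning step
    return beams

# Made-up item scores and content types for a toy evaluator.
item_score = {"A": 0.9, "B": 0.8, "C": 0.5, "D": 0.3}
category = {"A": "video", "B": "video", "C": "news", "D": "news"}

def toy_score(prefix):
    """Hypothetical cheap score: position discount minus adjacent-similarity penalty."""
    total = 0.0
    for pos, item in enumerate(prefix):
        total += item_score[item] / (pos + 1)
        if pos > 0 and category[item] == category[prefix[pos - 1]]:
            total -= 0.2
    return total

kept = beam_search(list(item_score), k=3, score_prefix=toy_score, beam_width=2)
```

With N = 20 and K = 10 there are already 670,442,572,800 ordered sequences, while beam search with width B scores at most about K·B·N prefixes. The trade-off is accuracy: a narrow beam can discard a prefix that would have led to the optimal sequence, so the beam width controls the balance between cost and the chance of keeping the optimum.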


Origin blog.csdn.net/weixin_43292788/article/details/122363424