2020 Tencent Advertising Algorithm Competition Preliminary Summary

1. Problem description

The topic of this competition comes from an important and interesting question. As is well known, demographic attributes such as age and gender are important input features for recommendation systems, advertising platforms included. The underlying assumption is that users' preferences for ads vary with their age and gender, a hypothesis that practitioners in many industries have repeatedly verified. However, most verification methods feed demographic attributes into the recommender as input and then compare recommendation performance with and without those inputs, offline or online. This competition attempts to verify the hypothesis from the opposite direction: predicting users' demographic attributes from their interaction behavior in the advertising system.

We believe the "reverse thinking" of this problem has research value and interest in its own right, in addition to practical value and challenge. For example, practitioners who lack user information can infer user attributes from their own system data, enabling intelligent targeting or audience protection across a wider population. At the same time, contestants need to apply a variety of machine learning techniques to produce accurate predictions.

Specifically, during the competition we provide contestants with the ad click history of a set of users over a 91-day (3-month) window as the training data set. Each record contains the date (1 to 91), user information (age, gender), information about the clicked ad (creative id, ad id, product id, product category id, advertiser id, advertiser industry id, etc.), and the number of times the user clicked that ad on that day. The test data set is the ad click history of a different group of users and does not contain their age or gender. Contestants must predict the age and gender of the users appearing in the test data set and submit the predictions in the agreed format.

2. Introduction to the data set

The training data contains the 91-day ad click logs of a set of users, organized into three tables and provided as CSV files with a header row (UTF-8 encoding without BOM): click_log.csv, user.csv, and ad.csv. The test data contains the 91-day ad click logs of another group of users, organized the same way but without user.csv. The detailed format of each table is described below.

click_log.csv:

  • time: time in days granularity, integer value, value range [1, 91].
  • user_id: a unique encrypted user id, randomly numbered from 1 to N, where N is the total number of users (training and
    test sets combined).
  • creative_id: The id of the creative that the user clicked on, generated in a manner similar to user_id.
  • click_times: The number of times the user clicked on the creative that day.

user.csv:

  • user_id
  • age: the user's age bucket, an integer in the range [1, 10].
  • gender: the user's gender, with values in [1, 2].

ad.csv:

  • creative_id
  • ad_id: The id of the advertisement to which the creative belongs, generated in a manner similar to user_id. Each advertisement may contain multiple
    displayable creatives.
  • product_id: The id of the product advertised in the ad, which is generated in a manner similar to user_id.
  • product_category: The category id of the product promoted in the ad, generated in a manner similar to user_id.
  • advertiser_id: The id of the advertiser, generated in a manner similar to user_id.
  • industry: The id of the advertiser's industry, which is generated in a manner similar to user_id.
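The three tables join naturally on creative_id and user_id. A minimal pandas sketch of how the tables fit together, using tiny in-memory stand-ins for the real CSV files (real code would call pd.read_csv on the paths above; the toy values here are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for click_log.csv, ad.csv, and user.csv;
# column names follow the spec described above.
click_log = pd.DataFrame({
    "time": [1, 1, 2],
    "user_id": [1, 2, 1],
    "creative_id": [10, 11, 10],
    "click_times": [1, 2, 1],
})
ad = pd.DataFrame({
    "creative_id": [10, 11],
    "ad_id": [100, 101],
    "product_id": [5, 6],
    "product_category": [2, 3],
    "advertiser_id": [7, 8],
    "industry": [20, 21],
})
user = pd.DataFrame({"user_id": [1, 2], "age": [3, 7], "gender": [1, 2]})

# Join clicks with ad attributes on creative_id, then attach labels on user_id.
df = click_log.merge(ad, on="creative_id", how="left") \
              .merge(user, on="user_id", how="left")
```

After the two left joins, each click row carries all ad attributes plus the user's age and gender labels; for the test set the final merge would be skipped since user.csv is withheld.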

3. Solution

There are about 4.44 million distinct creative_ids in total, so directly one-hot encoding them and learning an embedding from scratch would produce too many parameters and be hard to train. Instead, we borrow the idea of Word2vec and treat each user's historical click record as a sentence: all the creative_ids a user visited form one "sentence", and each ad is one "word". For example, user 1234 becomes ['821396', '209778', '877468', '1683713', '122032', …]. This yields about 900,000 variable-length sequences, which we train with the Word2vec model (a self-supervised model) so that each creative_id maps to a K-dimensional vector. Treating each id as a word and each user's id sequence as a sentence turns the task into a multi-input text classification problem on desensitized (anonymized) data.

Network structure design

Network structure of the LSTM-based multi-input text classification model:

The whole model consists of 4 parts: input layer, LSTM layer, pooling layer, and fully connected layer.

1. Input layer (word embedding layer):
I chose 'creative_id', 'ad_id', 'advertiser_id', and 'product_id' as the input texts. A word2vec model is pre-trained for each id type, its vectors are loaded into the Embedding layer, and the layer's parameters are frozen, so they are not updated during training. Each id input is a text sequence with a fixed length of 100 tokens, which the Embedding layer (initialized with the pre-trained word2vec vectors) transforms into a sentence matrix of 100 word vectors.

2. LSTM layer:
The sentence vectors of the several id channels are concatenated and fed into a bidirectional LSTM to extract text features. From the fixed-length input sequence, local word-order information is used to extract primary features, which are then combined into higher-level features.

3. Pooling layer:
Average pooling is used, which reduces the number of model parameters and turns the variable-length LSTM output into a fixed-length input for the fully connected layer.

4. Fully connected layer:
The fully connected layer acts as the classifier. I used a fully connected network with one hidden layer, with log_softmax producing the final classification output.
The network structure is shown in Figure 1:
[Figure 1: network structure of the multi-input LSTM classification model]
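The classifier head described in step 4 can be sketched as a small PyTorch module; the layer widths and the 10-way output (age buckets; gender would use 2) are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One hidden layer, then log_softmax over the classes.
head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))

pooled = torch.randn(4, 64)                       # output of the pooling layer
log_probs = F.log_softmax(head(pooled), dim=-1)   # (4, 10) log-probabilities
```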

Loss function design
The loss function is CrossEntropyLoss (cross-entropy loss); the optimizer is Adam.
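A minimal training step with this loss and optimizer, using a toy stand-in model and random data as assumptions. One caveat worth noting: PyTorch's CrossEntropyLoss applies log_softmax internally, so it is normally fed raw logits; a model that already ends in log_softmax would instead pair with NLLLoss:

```python
import torch
import torch.nn as nn

# Toy stand-in for the full network (feeds raw logits to CrossEntropyLoss).
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
criterion = nn.CrossEntropyLoss()   # applies log_softmax internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 64)              # batch of pooled user features
y = torch.randint(0, 10, (8,))      # age-bucket labels in [0, 9]

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```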

Source link

4. Summary

This was my first time participating in a data competition, and I learned a lot. I wrote the training code by myself, so my scores improved relatively slowly, but I also gained knowledge of NLP text classification. In the end the model was trained with 3-fold cross-validation and ranked 200th in the preliminary round. Although I failed to advance to the semi-finals, I still improved my skills during the competition. After the competition I plan to study the open-source code of the top-ranked competitors, find the gaps, and keep improving.

Origin blog.csdn.net/qq_32505207/article/details/106923336