Xiaohongshu's big data practice for recommendation on Alibaba Cloud

Introduction: This article is divided into three parts. The first part covers how real-time computing is used in the recommendation business. The second part describes how Xiaohongshu uses some of Flink's new features. The third part covers real-time OLAP analysis scenarios and the collaboration with Alibaba Cloud MC-Hologres.

Author: Guo, head of the recommendation project at Xiaohongshu

Xiaohongshu's recommendation business architecture

[Figure 1]

This figure sketches some typical recommendation services. The main consumer of big data is the online recommendation engine on the far left. A recommendation engine is usually divided into steps such as recall, ranking, and re-ranking, which I will not detail here. From the big data point of view, the recommendation engine mainly uses predictive models to estimate how much a user will like each candidate note, and then decides which notes to recommend according to certain strategies. The model relies on note features at serving time, and those features are fed back into our training data to train new models. After the engine returns notes, the user's consumption of them, including impressions, clicks, likes and similar behaviors, forms a user behavior stream. These behavior streams are joined with the feature streams to generate training data for iterating the model. Combining user and note information also produces user and note profiles as well as analysis reports used by the recommendation business.
After more than a year of transformation, in Xiaohongshu's recommendation scenario every module is now updated in real time or near real time, with the exception of strategy iteration, where humans still have to go from analyzing the data to adjusting the strategy.

Real-time computing applications in the recommendation business

[Figure 2]

Let's expand on what happens after the feature and user-behavior data are reported back, and how we use the data they generate. The feature stream produced by the recommendation engine is extremely large: it contains every note returned by a recommendation, about a hundred per request, together with all of each note's features, several hundred in total. Our current approach is to write these features into a self-developed, efficient KV cache and keep them for several hours; once the user's behavior data is reported back from the client, we start the stream processing.
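Xiaohongshu's KV cache is an in-house system, so purely as an illustration of the idea, the sketch below keeps serving-time features in a process-local cache with a time-to-live of a few hours; the class, key layout, TTL and size are all assumptions, not the real system:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import java.util.Map;
import java.util.concurrent.TimeUnit;

public class FeatureCacheSketch {
    // Hypothetical layout: key = "userId:noteId" at serving time, value = the feature map used for ranking.
    private final Cache<String, Map<String, Double>> featureCache =
            CacheBuilder.newBuilder()
                    .expireAfterWrite(6, TimeUnit.HOURS) // keep features only for a few hours
                    .maximumSize(100_000_000L)           // bound memory; tune for real traffic
                    .build();

    public void put(String userId, String noteId, Map<String, Double> features) {
        featureCache.put(userId + ":" + noteId, features);
    }

    public Map<String, Double> lookup(String userId, String noteId) {
        return featureCache.getIfPresent(userId + ":" + noteId);
    }
}
```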
Our first step is to attribute and summarize the user behavior that the client manages. Here is what attribution and aggregation are. Because on the Xiaohongshu APP, the client's management is divided into pages. For example, the user reads and clicks on the note in the homepage recommendation. After clicking, the user will jump to the note page, and then the user browses on the note page This note and like it. At the same time, the user may click the author's avatar to enter the author's personal page, and follow the author on the personal page. Attribution means that this series of user behaviors are counted as behaviors generated by homepage recommendations, and will not be mixed with other businesses. Because the search user sees the same note in the search, it may return the same result. So we have to distinguish which business is responsible for the user's behavior. This is attribution.
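On the Flink side, a minimal sketch of this attribution step might key behaviors by user and note and stamp each event with the business that surfaced the note. The event schema, field names, and state layout below are illustrative assumptions, not Xiaohongshu's internal format:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Hypothetical client event; the real reporting schema is Xiaohongshu-internal. */
class ClientEvent {
    public String userId;
    public String noteId;
    public String type;             // "impression", "click", "like", "follow", ...
    public String source;           // set on impressions only, e.g. "homefeed" or "search"
    public String attributedSource; // filled in by attribution
}

/**
 * Keyed by (userId + ":" + noteId), e.g. events.keyBy(e -> e.userId + ":" + e.noteId).process(...):
 * remember which business surfaced this note to this user, then stamp every later behavior
 * on the same note with that source so it is not mixed up with other businesses.
 */
public class AttributionFunction extends KeyedProcessFunction<String, ClientEvent, ClientEvent> {
    private transient ValueState<String> impressionSource;

    @Override
    public void open(Configuration parameters) {
        impressionSource = getRuntimeContext().getState(
                new ValueStateDescriptor<>("impression-source", String.class));
    }

    @Override
    public void processElement(ClientEvent e, Context ctx, Collector<ClientEvent> out) throws Exception {
        if ("impression".equals(e.type) && e.source != null) {
            impressionSource.update(e.source);
        }
        String src = impressionSource.value();
        e.attributedSource = (src != null) ? src : "unknown";
        out.collect(e);
        // a real job would also expire this state with a timer or state TTL
    }
}
```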

Aggregation means that for this series of behaviors a user performs on the same note, we generate one summary record, which makes subsequent analysis much easier. After attribution we have a real-time stream of individual user behaviors; on the aggregation side, because there is a window period, the summary data is generally delayed by about 20 minutes. Once the attributed and aggregated streams are generated, we enrich them with dimension-table data: we look up the features we used at recommendation time for that user-note pair, and we also join basic user information and basic note information into the stream.

There are four important downstream scenarios. The first is generating breakdown information for each business, mainly so that we know a given user's click-through rate and other business metrics across different note dimensions, and likewise a given note's click-through rate across different users; this is an important feature in our real-time recommendation. The second, also very important, is the wide table we use for real-time analysis: user information, note information, and user-note interactions are summarized into a multi-dimensional table for real-time analysis, which I will describe in more detail later. The third is real-time training data: we take the user-note interaction, attach the features picked up at ranking time together with the labels we summarized, and use that to train and update the model. Finally, all the summary information also lands in the offline data warehouse for subsequent analysis and report processing.
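As a sketch of the aggregation step, assuming a tumbling 20-minute event-time window (the real job's windowing may differ) and reusing the illustrative ClientEvent type from the previous sketch, the per-(user, note) summary could be built roughly like this:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

/** Hypothetical summary record: one row per (user, note) per window. */
class NoteSummary {
    public boolean clicked;
    public boolean liked;
    public long stayMillis;    // total time spent on the note page, if reported
    public boolean validClick; // filled in later from the shared "valid click" rule
}

/** Folds one user's behaviors on one note into a single summary record. */
class SummaryAggregate implements AggregateFunction<ClientEvent, NoteSummary, NoteSummary> {
    @Override public NoteSummary createAccumulator() { return new NoteSummary(); }

    @Override public NoteSummary add(ClientEvent e, NoteSummary acc) {
        if ("click".equals(e.type)) acc.clicked = true;
        if ("like".equals(e.type))  acc.liked = true;
        return acc;
    }

    @Override public NoteSummary getResult(NoteSummary acc) { return acc; }

    @Override public NoteSummary merge(NoteSummary a, NoteSummary b) {
        a.clicked |= b.clicked;
        a.liked   |= b.liked;
        a.stayMillis += b.stayMillis;
        return a;
    }
}

class SummaryPipelineSketch {
    // assumes timestamps and watermarks were assigned upstream (event-time window)
    static DataStream<NoteSummary> summarize(DataStream<ClientEvent> attributed) {
        return attributed
                .keyBy(e -> e.userId + ":" + e.noteId)
                .window(TumblingEventTimeWindows.of(Time.minutes(20)))
                .aggregate(new SummaryAggregate());
    }
}
```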

Stream computing optimization: Flink batch-stream unification

[Figure 3]

Next I will talk about how we use some of Flink's new features to optimize the stream computing process. I will mainly cover two points here; the first is batch-stream unification.
As I just said, we aggregate a user's behaviors on a note into summary information, and that summary actually carries a lot of signals. Besides the simplest ones, such as whether the user liked or bookmarked the note, there are more complex labels, for example how long the user stayed on the note page, or whether the preceding click was a valid click: we count a click as valid only if the user stays on the note for more than 5 seconds after clicking. We want this kind of complex logic to be implemented only once in our system and then used by both the real-time and the batch computation. Traditionally that is difficult, because in most setups batch and stream are two separate implementations: we would implement the definition of a valid click once on Flink, and then implement an offline version of the same definition again, perhaps written in SQL. Xiaohongshu instead uses a new capability from FLIP-27: log files, which are batch data, can be read as a stream, so we can achieve batch-stream unification at the code level.
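A minimal sketch of what "implemented only once" can mean in practice: the valid-click rule lives in one plain Java class that both the streaming job and the batch backfill job call, so only the source (an unbounded Kafka source versus a bounded, FLIP-27-style file source) differs between the two modes. Apart from the 5-second threshold from the text, the names here are illustrative:

```java
/**
 * The "valid click" definition, written once so the exact same class can be used
 * by the real-time job and by the batch/backfill job.
 */
public final class ValidClick {
    private static final long MIN_STAY_MILLIS = 5_000L;

    private ValidClick() {}

    /** A click counts as valid only if the user stays on the note for more than 5 seconds. */
    public static boolean isValid(boolean clicked, long stayMillisAfterClick) {
        return clicked && stayMillisAfterClick > MIN_STAY_MILLIS;
    }
}
```

In the streaming path this would be invoked from the summary operator (for example to set NoteSummary.validClick), and the batch path can run the same operators over a bounded source, so the rule never has to be rewritten as offline SQL.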

Stream computing optimization: multi-sink optimization

[Figure 4]

The other Flink feature is the multi-sink optimization in Flink 1.11. It means that one copy of the data is written to several downstream applications; for example, I need to build the user-behavior wide table and generate the offline data at the same time. What the multi-sink optimization does is let us read from Kafka only once and, for the same key, do the dimension lookup only once, then produce multiple copies of the data and write them to multiple sinks simultaneously. This greatly reduces the pressure we put on Kafka and on the KV lookups.
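In Flink SQL terms, the pattern looks roughly like the sketch below, where several INSERT statements are planned together in one StatementSet so the shared source scan is reused. The table names are made up, and the datagen/blackhole connectors stand in for the real Kafka source and real sinks just to keep the example self-contained:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class MultiSinkSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Stand-in tables: in the real job the source would be Kafka and the sinks would be
        // the user wide table and the offline warehouse.
        tEnv.executeSql("CREATE TABLE behavior_source (user_id STRING, note_id STRING, action STRING) "
                + "WITH ('connector' = 'datagen')");
        tEnv.executeSql("CREATE TABLE user_wide_table (user_id STRING, note_id STRING, action STRING) "
                + "WITH ('connector' = 'blackhole')");
        tEnv.executeSql("CREATE TABLE offline_warehouse (user_id STRING, note_id STRING, action STRING) "
                + "WITH ('connector' = 'blackhole')");

        // Both INSERTs go into one StatementSet, so they are optimized as one job:
        // the source is scanned once and shared by both sinks.
        StatementSet set = tEnv.createStatementSet();
        set.addInsertSql("INSERT INTO user_wide_table SELECT user_id, note_id, action FROM behavior_source");
        set.addInsertSql("INSERT INTO offline_warehouse SELECT user_id, note_id, action FROM behavior_source");
        set.execute();
    }
}
```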

Typical OLAP scenarios at Xiaohongshu

[Figure 5]

Finally, let me talk about the collaboration between our OLAP scenarios and Alibaba Cloud MaxCompute and Hologres. Xiaohongshu has many OLAP scenarios in the recommendation business; here I will cover four common ones. The most common is real-time analysis based on comparing a user's experiment groups. In the recommendation business we constantly adjust strategies and update models, and every time we do so we open an experiment that places users into different A/B test groups so we can compare their behaviors. In fact, during recommendation a user is in many experiments at the same time, and in each experiment belongs to one experiment group. The per-experiment analysis we do is to take one experiment, aggregate the users' behaviors and data, and break them down by the experiment groups within that experiment to see how user metrics differ between groups. This scenario is very common, but it is also very computation-intensive, because it has to group by the user's experiment tags.
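Since Hologres (discussed below) speaks the PostgreSQL protocol, an experiment-group breakdown of this kind can be issued as an ordinary SQL query over JDBC. The sketch below only illustrates the shape of such a query: the endpoint, credentials, table name, and columns are invented, and the real wide table stores experiment tags in a more complex form than a flat exp_id/exp_group pair:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ExperimentBreakdownQuery {
    public static void main(String[] args) throws Exception {
        // Hologres is PostgreSQL-compatible, so the standard Postgres JDBC driver works.
        String url = "jdbc:postgresql://hologres-endpoint:80/recommend";
        try (Connection conn = DriverManager.getConnection(url, "access_id", "access_key")) {
            String sql =
                    "SELECT exp_group, " +
                    "       SUM(clicks)::float8 / NULLIF(SUM(impressions), 0) AS ctr, " +
                    "       SUM(likes) AS likes " +
                    "FROM user_note_wide_table " +
                    "WHERE exp_id = ? AND event_date >= current_date - 7 " +
                    "GROUP BY exp_group ORDER BY exp_group";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "exp_12345"); // which experiment to break down
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("group=%s ctr=%.4f likes=%d%n",
                                rs.getString("exp_group"), rs.getDouble("ctr"), rs.getLong("likes"));
                    }
                }
            }
        }
    }
}
```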
Another scenario is that Xiaohongshu's recommendation actually runs across multiple data centers. Different data centers often see changes, such as operations and maintenance changes, a new service being started, or a new model that needs to go live in one data center first. We therefore need an end-to-end way to verify whether the data in different data centers is consistent and whether users in different data centers get the same experience. In those cases we compare data centers against each other: we compare the behaviors of users in different data centers and check whether their final metrics agree. We use the same approach for model and code releases: for a release we compare the user behavior metrics produced by the old version and the new version and check that they are consistent. Finally, our OLAP layer is also used for real-time business metric alerting; if the users' click-through rate or the number of likes drops sharply, it triggers a real-time alert.

The scale of Xiaohongshu's OLAP data

[Figure 6]

At peak, our real-time computation records about 350,000 user behaviors per second. Our large wide table has about 300 fields, and we want to retain the data for more than two weeks, roughly 15 days, because when doing experiment analysis we often need to compare this week's data with the previous week's. We run about a thousand queries against it every day.
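As a rough upper bound on what that retention implies (rough because 350,000 behaviors per second is a peak rate, not a sustained average):

$$
3.5\times 10^{5}\ \text{rows/s}\times 86{,}400\ \text{s/day}\approx 3.0\times 10^{10}\ \text{rows/day},\qquad 15\ \text{days}\ \Rightarrow\ \text{at most}\approx 4.5\times 10^{11}\ \text{rows}
$$

each of them about 300 columns wide.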

Xiaohongshu + Hologres

[Figure 7]
We began cooperating with Alibaba Cloud's MaxCompute and Hologres in July. Hologres is a new-generation intelligent data warehouse solution that covers both real-time and offline computation in a one-stop fashion, and its main applications are real-time dashboards, Tableau, and data science. After studying it, we found it well suited to our recommendation scenarios.

Xiaohongshu's Hologres application scenarios

[Figure 8]
What Hologres does for us is mainly query acceleration over offline data: it provides interactive, table-level query responses on offline data, and because everything lives inside it, there is no need to move data from the offline warehouse into a separate real-time warehouse. On top of the whole real-time data warehouse, a user insight system monitors user data on the platform in real time and diagnoses users from different angles in real time, which helps implement refined operations. This fits our large user wide table scenario very well. Its real-time/offline federated computing can run interactive analysis across the real-time computing engine and the offline data warehouse MaxCompute, using real-time-offline federated queries to support refined operations across the whole pipeline.

Hologres vs. ClickHouse

[Figure 9]

Before cooperating with Alibaba Cloud, we built our own ClickHouse cluster. It was quite large, 1,320 cores in total. Because ClickHouse is not a solution that separates compute from storage, to save cost we only stored 7 days of data, and because ClickHouse does not handle the user experiment tag scenario very well, queries spanning more than three days of data were particularly slow at the time. Since this is an OLAP scenario, we want every query to return a result within about two minutes, so in practice we could only query the last three days of data. Another problem was deduplication: ClickHouse's support for it has some issues, so we did not configure deduplication on the cluster, and if the upstream data stream jittered and produced duplicates, the downstream ClickHouse ended up with duplicate data. We also had to dedicate a person to operating and maintaining ClickHouse, and our research showed that running ClickHouse as a cluster carries a high operations cost.

Therefore, in July we worked with Alibaba Cloud to migrate our largest recommendation user wide table to MaxCompute and Hologres. We have 1,200 cores in total on Hologres; because it separates compute from storage, 1,200 cores are enough for us, while our bigger need is on the storage side, where we keep 15 days of data. Because Hologres has made some customized optimizations for the experiment-based user grouping scenario, we can now comfortably query 7 to 15 days of data, and in this experiment-group scenario query performance is greatly improved compared with ClickHouse. Hologres also supports primary keys, so we configured one and write with the insert-or-ignore method; since the primary key is configured, writes are naturally deduplicated. As long as the upstream guarantees at-least-once delivery, there is no duplicate data downstream. And because we are on Alibaba Cloud, there is essentially no operations and maintenance cost on our side.
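As an illustration of the insert-or-ignore write path, and since Hologres is PostgreSQL-compatible, a standard-PostgreSQL-style sketch over JDBC could look like the following; the endpoint, credentials, table, columns, and primary-key choice are all assumptions, and the production write path (for example a Flink connector) may look different:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InsertOrIgnoreSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://hologres-endpoint:80/recommend";
        try (Connection conn = DriverManager.getConnection(url, "access_id", "access_key")) {
            // With (user_id, note_id, event_time) as the primary key, replays from an
            // at-least-once upstream simply hit the conflict clause and are dropped,
            // so the table stays duplicate-free.
            String sql = "INSERT INTO user_note_wide_table (user_id, note_id, event_time, clicked) "
                       + "VALUES (?, ?, ?, ?) "
                       + "ON CONFLICT (user_id, note_id, event_time) DO NOTHING";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "u_1");
                ps.setString(2, "n_1");
                ps.setTimestamp(3, new java.sql.Timestamp(System.currentTimeMillis()));
                ps.setBoolean(4, true);
                ps.executeUpdate();
            }
        }
    }
}
```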

 

This article is the original content of Alibaba Cloud and may not be reproduced without permission.
