Stream-batch unification in production: the construction practice of Bigo's real-time computing platform

This article is shared by Xu Shuai, head of the Bigo computing platform, and introduces the construction practice of the Bigo real-time computing platform. It covers:

  1. The development history of the Bigo real-time computing platform
  2. Features and improvements
  3. Typical business scenarios
  4. Efficiency gains
  5. Summary and outlook

1. The development history of the Bigo real-time computing platform

Today I will mainly share the construction process of the Bigo real-time computing platform, some of the problems we solved along the way, and some of the optimizations and improvements we made. Let's start with the first part: the development history of the Bigo real-time computing platform.

Let me briefly introduce Bigo's business. It has three major apps: Live, Likee and Imo. Live provides live-streaming services for users around the world. Likee is an app for creating and sharing short videos, very similar to Kuaishou and Douyin. Imo is a free global communication tool. All of these products revolve around users, so our business focuses on improving user conversion and retention. The real-time computing platform, as a foundational platform, mainly serves these businesses, and building the Bigo platform also means providing end-to-end solutions around these business scenarios.

The development of real-time computing at Bigo can be roughly divided into three stages.

  • Before 2018, there were very few real-time jobs. We used Spark Streaming for some real-time business scenarios.
  • From 2018 to 2019, with the rise of Flink, the general consensus was that Flink was the best real-time computing engine, and we started to adopt it in a scattered way: each business line set up its own Flink for simple use.
  • Starting in 2019, we unified all businesses using Flink onto the Bigo real-time computing platform. After two years of construction, all real-time computing scenarios now run on the Bigo platform.

As shown in the figure below, this is the current state of Bigo's real-time computing platform. On the data source side, the data consists of user behavior logs, mainly from the apps and clients, plus some user information stored in MySQL.

This data goes through message queues and is finally collected onto our platform. The message queue is mainly Kafka, and Pulsar is gradually being adopted. MySQL data enters the real-time computing platform mainly through BDP. Inside the platform, the bottom layer relies on the commonly used Hadoop ecosystem for dynamic resource management. The engine layer above has been unified onto Flink, on top of which we do our own development and optimization. On top of that we built BigoFlow, our internal one-stop management platform for development, operations and monitoring, where users can develop, debug and monitor their jobs. Finally, for data storage, we integrate with Hive, ClickHouse, HBase, and so on.

2. Features and improvements of the Bigo real-time computing platform

Next, let's look at the characteristics of the Bigo computing platform and the improvements we have made. As a fast-growing company, the focus of our platform construction is to make the platform as easy as possible for business teams to use, so as to drive business development and scale adoption. We want to build a one-stop platform for development, operations and monitoring.

First of all, users can develop very conveniently on BigoFlow. Our features and improvements on the development side include:

  1. A powerful SQL editor.
  2. Graphical topology adjustment and configuration.
  3. One-click multi-cluster deployment.
  4. Unified version management, converging versions as much as possible.

In addition, on the operations side, we have also made many improvements:

  1. A complete savepoint management mechanism.
  2. Logs are automatically collected to ES, with built-in troubleshooting rules for common errors.
  3. Task history is saved for easy comparison and problem tracking.

The last area is monitoring. Our features here are:

  1. Monitoring is added automatically; users basically do not need to configure it manually.
  2. Resource usage is analyzed automatically, and reasonable resource allocations are recommended to users.

Our metadata lives mainly in three places: Kafka, Hive and ClickHouse. We have fully connected the metadata of these storage systems, which greatly simplifies things for users and reduces the cost of use.

  • Once Kafka's metadata is connected, a topic is imported once and can be used everywhere afterwards without any DDL.
  • Flink and Hive are also fully connected: when users work with Hive tables, no DDL is needed and the tables can be used directly (see the sketch below).
  • ClickHouse is similar, and it can automatically track Kafka topics.
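As a rough illustration of what the no-DDL Hive integration looks like from a user's point of view, the sketch below registers Flink's standard HiveCatalog so that existing Hive tables become queryable directly. The catalog name, database, configuration path, table and column names are placeholders, not Bigo's actual setup.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveCatalogExample {
    public static void main(String[] args) {
        // Batch mode is enough for an ad-hoc query over an existing Hive table.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Point Flink at the Hive metastore; names and the conf dir are placeholders.
        HiveCatalog hive = new HiveCatalog("hive", "default", "/etc/hive/conf");
        tEnv.registerCatalog("hive", hive);
        tEnv.useCatalog("hive");

        // Hive tables are now visible to Flink without any CREATE TABLE statements.
        tEnv.executeSql("SELECT uid, event, ts FROM user_behavior_log LIMIT 10").print();
    }
}
```

On the platform this kind of catalog registration happens behind the scenes, so users only write the query itself.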

In fact, what we provide today is not just a platform; we also provide end-to-end solutions for common scenarios. In the ETL scenario, our solution includes:

  1. Common tracking events are onboarded fully automatically.
  2. Users do not need to write any code.
  3. Data lands in Hive (see the sketch after this list).
  4. Metadata is updated automatically.
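A minimal sketch of what such a Kafka-to-Hive ETL job boils down to in Flink SQL, driven from Java. The topic, servers, catalog, table and field names are all assumptions, and on the platform this pipeline is generated automatically rather than hand-written like this.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class KafkaToHiveEtl {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // Hive files are committed on checkpoint completion.
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Register the Hive metastore so the sink table can be referenced without DDL
        // (catalog name and conf dir are placeholders).
        tEnv.registerCatalog("hive", new HiveCatalog("hive", "default", "/etc/hive/conf"));

        // Source: a Kafka topic of JSON tracking events (all names are placeholders).
        tEnv.executeSql(
            "CREATE TABLE events (" +
            "  uid BIGINT, event STRING, ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'user_events'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'scan.startup.mode' = 'latest-offset'," +
            "  'format' = 'json')");

        // Sink: an existing partitioned Hive table resolved through the shared metadata,
        // so no sink DDL is needed here (catalog/database/table names are placeholders).
        tEnv.executeSql(
            "INSERT INTO hive.ods.user_events_hourly " +
            "SELECT uid, event, ts, DATE_FORMAT(ts, 'yyyyMMddHH') AS dt FROM events");
    }
}
```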

In the monitoring scenario, our features are:

  1. The data source is switched automatically.
  2. The monitoring rules stay unchanged.
  3. The results are automatically written to Prometheus.

The third scenario is ABTest. Traditionally ABTest is done offline, and the results are only available a day later. So we converted ABTest to real-time output and, through a stream-batch unified approach, greatly improved the efficiency of ABTest.

Our improvements to Flink itself are mainly reflected in the following aspects:

  • First, at the connector level, we customized many connectors and integrated with every system used in the company.
  • Second, at the data format level, we provide complete support for three formats: JSON, Protobuf and Baina. Users do not need to write parsing code; they can use the formats directly.
  • Third, all of the company's data lands directly in Hive, and our use of Hive is ahead of the community, including streaming reads, EventTime support, dimension table partition filtering, Parquet complex type support, and so on.
  • Fourth, we also made some optimizations at the State level, including SSD support and RocksDB tuning (see the sketch after this list).
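As a hedged sketch of what the SSD part can look like with Flink's standard APIs, the snippet below points RocksDB's working state at local SSD mounts and keeps durable checkpoints on HDFS. The paths are placeholders, and Bigo's actual RocksDB optimizations go beyond these settings.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbOnSsd {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keeps working state on local disk; pointing it at SSD mounts
        // (placeholder paths) eases the I/O bottleneck for jobs with large state.
        EmbeddedRocksDBStateBackend rocksDb = new EmbeddedRocksDBStateBackend(true); // incremental checkpoints
        rocksDb.setDbStoragePaths("/ssd1/flink/rocksdb", "/ssd2/flink/rocksdb");
        env.setStateBackend(rocksDb);

        // Durable checkpoints still go to a distributed filesystem (placeholder URI).
        env.enableCheckpointing(120_000);
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
    }
}
```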

3. Typical Bigo business scenarios

The traditional data warehouse link goes from Kafka to Flume, then to Hive, and finally to ClickHouse. Most of the ClickHouse data is imported from Hive, and some is written directly from Kafka.

This link is very old and has the following problems:

  • First, it is unstable: once Flume misbehaves, data loss and duplication often occur.
  • Second, it scales poorly: it is hard to expand in the face of sudden traffic peaks.
  • Third, business logic is not easy to adjust.

So after adopting Flink, we did a lot of work to replace the original Flume-to-Hive link. Today all ETL goes from Kafka through Flink. All tracking events land in the Hive offline data warehouse for historical preservation, so no data is lost. At the same time, because many jobs need real-time analysis, a second link goes directly from Flink into the ClickHouse real-time data warehouse for analysis.

During this process we made some core changes, divided into three parts. First, for user access, our changes include:

  1. Keep access as simple as possible.
  2. Common tracking events are onboarded fully automatically.
  3. Metadata is connected, so no DDL is needed.

In addition, in Flink itself, our changes include:

  1. Parquet write optimization.
  2. Concurrency adjustment.
  3. SSD disks to support jobs with very large state.
  4. RocksDB tuning for better memory control.

Finally, on the data sink side, we did a lot of custom development, supporting not only Hive but also ClickHouse; a simplified sketch of a ClickHouse sink follows.
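Bigo's ClickHouse sinker is custom-built; as a rough approximation of the idea, the sketch below batches rows into ClickHouse through Flink's JDBC sink and the ClickHouse JDBC driver. The table, columns, connection URL and driver coordinates are assumptions, not Bigo's implementation.

```java
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStream;

public class ClickHouseSinkSketch {

    // A simple event record (placeholder fields).
    public static class Event {
        public long uid;
        public String event;
        public long ts;
    }

    public static void addClickHouseSink(DataStream<Event> events) {
        events.addSink(JdbcSink.sink(
            "INSERT INTO dwd_user_events (uid, event, ts) VALUES (?, ?, ?)",
            (statement, e) -> {
                statement.setLong(1, e.uid);
                statement.setString(2, e.event);
                statement.setLong(3, e.ts);
            },
            JdbcExecutionOptions.builder()
                .withBatchSize(5000)        // write in large batches, which ClickHouse prefers
                .withBatchIntervalMs(2000)
                .withMaxRetries(3)
                .build(),
            new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:clickhouse://clickhouse-host:8123/default")
                .withDriverName("com.clickhouse.jdbc.ClickHouseDriver")
                .build()));
    }
}
```

Batching matters here because ClickHouse favors a few large inserts over many small ones.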

4. The efficiency gains Flink brings to the business

The following mainly introduces the changes we made in the ABTest scenario. Originally, after all the data lands in Hive, offline computation starts. After countless workflows, a large wide table is finally produced, with many dimensions recording the results of the grouped experiments. Data analysts take the results and analyze which experiments performed better.

Although this structure is simple, the process is too long, results arrive late, and adding dimensions is not easy. The main problem actually lies in Spark: the job consists of countless workflows, one workflow cannot be scheduled until the previous one finishes, and offline resources are not well guaranteed. Our biggest pain point was that the previous day's ABTest results could not be produced until the afternoon of the next day. Data analysts often complained that they could not work in the morning and could only start their analysis when the workday was almost over.

So we started to use Flink's real-time computing power to solve the timeliness problem. Unlike the Spark jobs, which have to wait for the previous stage's output, Flink consumes directly from Kafka, so results can basically be ready in the morning. But because the final output has a great many dimensions, possibly hundreds, the State becomes very large and we frequently ran into OOM.

Therefore, we made a compromise in the first step of the transformation. Instead of joining all the dimensions in a single Flink job, we split the work into several jobs, each computing part of the dimensions; HBase then joins these partial results, and the joined results are imported into ClickHouse.

During the transformation we found a problem: the job's logic needs to be adjusted frequently, and after each adjustment we have to check whether the results are correct, which requires a one-day time window. Reading historical data directly means Kafka has to retain data for a long time, and reading it back from disk puts a lot of pressure on Kafka. Not reading historical data is no better: since the window only fires at midnight, a logic change made today cannot be verified until a day later, which makes debugging iterations very slow.

As mentioned earlier, all of our data is in Hive. At the time we were still on Flink 1.9, and we added support for reading data from Hive as a stream. Because these windows are triggered by EventTime, we also added EventTime-based triggering on Hive data. To unify stream and batch we did not use Spark here, because using Spark for job verification would mean maintaining two sets of logic.

We use this stream-batch unified approach on Flink for offline backfill and offline job verification, while the real-time path produces the daily results; a sketch of the streaming Hive read follows.
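As a hedged sketch of what consuming a Hive table as a stream looks like with the community Hive connector's dynamic table options; the catalog, table, partition and column names are placeholders, and Bigo's EventTime triggering on Hive involved internal changes beyond these options.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveStreamingBackfill {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Register the Hive metastore (names and conf dir are placeholders).
        tEnv.registerCatalog("hive", new HiveCatalog("hive", "default", "/etc/hive/conf"));
        tEnv.useCatalog("hive");
        // Older Flink versions need dynamic table options (SQL hints) switched on explicitly.
        tEnv.getConfig().getConfiguration().setString("table.dynamic-table-options.enabled", "true");

        // Read the Hive table as an unbounded stream starting from a given partition, so the
        // same SQL used for the real-time path can also replay history for verification.
        tEnv.executeSql(
            "SELECT uid, event, ts " +
            "FROM ods.user_events_hourly " +
            "/*+ OPTIONS(" +
            "    'streaming-source.enable' = 'true'," +
            "    'streaming-source.monitor-interval' = '10 min'," +
            "    'streaming-source.consume-start-offset' = 'dt=2021022500') */"
        ).print();
    }
}
```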

As I said just now, this was actually a compromise, because it relied on HBase and did not give full play to Flink's capabilities. So we carried out a second round of transformation to completely remove the dependence on HBase.

After the second round of iteration, we can now handle day-level windows over the large wide table directly in Flink, and this stream-batch unified solution is in production. We use Flink to compute the complete wide table; when the daily window fires, the results are written directly into ClickHouse, and output is basically available in the early morning. A simplified sketch of this pattern follows.
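A minimal sketch of the pattern under assumed names: a day-level tumbling event-time window aggregates experiment metrics and, when the window fires, emits the wide-table rows to a sink. The real wide table carries far more dimensions, and in production the sink would be the ClickHouse sinker rather than the stand-in print table used here.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class DailyAbTestWideTable {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(180_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Kafka source with an event-time attribute (all names are placeholders).
        tEnv.executeSql(
            "CREATE TABLE abtest_events (" +
            "  uid BIGINT, experiment_id STRING, variant STRING, ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' MINUTE" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'abtest_events'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'scan.startup.mode' = 'latest-offset'," +
            "  'format' = 'json')");

        // Stand-in sink; in production this would be the custom ClickHouse sinker.
        tEnv.executeSql(
            "CREATE TABLE abtest_wide (" +
            "  experiment_id STRING, variant STRING, uv BIGINT, pv BIGINT, window_day TIMESTAMP(3)" +
            ") WITH ('connector' = 'print')");

        // Day-level tumbling window on event time; results are emitted when the window fires.
        tEnv.executeSql(
            "INSERT INTO abtest_wide " +
            "SELECT experiment_id, variant," +
            "       COUNT(DISTINCT uid) AS uv, COUNT(*) AS pv," +
            "       TUMBLE_START(ts, INTERVAL '1' DAY) AS window_day " +
            "FROM abtest_events " +
            "GROUP BY experiment_id, variant, TUMBLE(ts, INTERVAL '1' DAY)");
    }
}
```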

During the entire process, our optimizations to Flink included:

  1. State support for SSD disks.
  2. Streaming reads of Hive, with EventTime support.
  3. Hive dimension table joins, with support for loading by partition.
  4. A complete ClickHouse sinker.

After these optimizations, our hourly tasks are no longer delayed, and the day-level results now arrive before the workday starts instead of in the afternoon, which greatly accelerates iteration.

5. Summary and outlook

To summarize the current state of real-time computing at Bigo: first, it sits very close to the business. Second, it connects seamlessly with all the ecosystems used in the company, so users basically do not need to do any development. Third, the real-time data warehouse has taken shape. Finally, compared with the large tech companies, our scenarios are not yet rich enough; for some typical real-time scenarios, because business requirements are not that demanding, many businesses have not yet truly switched to real time.

Our development plan covers two major areas.

  • The first is to expand into more business scenarios, including real-time machine learning, advertising, risk control and real-time reporting. In these fields we need to promote the idea of real-time computing more broadly and connect it better with the business.
  • The other is on Flink itself. We have many things to do internally, such as support for large Hive dimension table joins, automatic resource configuration, cgroup isolation, and so on. This is the work we plan to do going forward.

This article is the original content of Alibaba Cloud and may not be reproduced without permission.
