NetEase Cloud Music's Flink-based real-time data warehouse practice

Background

As of today, the NetEase Cloud Music real-time computing platform runs on more than 150 machines, with 700 running tasks and a peak data throughput of 4 million QPS. Li Hanmiao revealed, "There are probably more than 180 developers using this real-time computing platform."

In terms of business coverage, essentially every real-time scenario is covered, including real-time reports, real-time features, real-time indexes, and real-time services. The real-time computing platform was launched in the first half of 2018 and has gone through two iterations since. By the first half of 2020, the number of tasks had grown by nearly 200%.

The first version of NetEase Cloud Music's real-time computing platform was designed on Flink 1.7. Flink has since been updated to version 1.11, which integrates many excellent features of Blink. After Alibaba acquired data Artisans, the company behind Flink, every release starting with version 1.9 has brought substantial code changes.

Because the community version of Flink 1.7 does not support SQL DDL, we built a custom SQL grammar on ANTLR for the convenience of users, covering DDL and dimension table JOIN. In addition, the first version of the real-time platform had no data lineage tracking, which made it difficult to locate problems.
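To make the idea of a dimension table JOIN concrete, here is a minimal Python sketch of what such a join does conceptually: each stream event is enriched with a row looked up from a dimension table. It is not the platform's ANTLR grammar or Flink code, and every name in it (`enrich`, `song_dim`, `lookup_song`, the field names) is a hypothetical chosen for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


def enrich(
    events: Iterable[dict],
    lookup: Callable[[str], Optional[dict]],
    key: str,
) -> Iterator[dict]:
    """Left-join each stream event with the dimension row found by `key`."""
    cache: Dict[str, Optional[dict]] = {}
    for event in events:
        k = event[key]
        if k not in cache:
            # Query the dimension store only on a cache miss.
            cache[k] = lookup(k)
        dim = cache[k] or {}
        # Event fields win on name clashes; unmatched events pass through.
        yield {**dim, **event}


# Hypothetical dimension table keyed by song_id.
song_dim = {"s1": {"artist": "A"}, "s2": {"artist": "B"}}
calls = []  # records which keys actually hit the dimension store


def lookup_song(song_id: str) -> Optional[dict]:
    calls.append(song_id)
    return song_dim.get(song_id)


plays = [
    {"song_id": "s1", "user": 1},
    {"song_id": "s1", "user": 2},
    {"song_id": "s9", "user": 3},
]
joined = list(enrich(plays, lookup_song, "song_id"))
```

The cache in this sketch never expires. A real dimension table join would likely need a TTL or refresh policy so that updates to the dimension data become visible.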

The first version of the real-time platform also had no metadata management, and Jar tasks ran outside the platform's control. Its task monitoring was incomplete: when a problem occurred, the metrics we collected were not enough to locate it. Looking back, the first version of the real-time platform had many problems.

Real-time data warehouse construction

The real-time data warehouse version is based on Flink 1.9. Its main features include: integration with the metadata center, so that users rarely need to define data formats themselves; both SQL and an SDK for users; end-to-end lineage collection; and comprehensive monitoring of data sources and tasks.

▲Architecture diagram of real-time data warehouse

Reading the architecture diagram from the top: SQL and the SDK are the inputs, and both go directly to the Planner. The Planner takes in the SQL and parses the complete statement. A Catalog is then injected, which connects to MetaHub (the metadata center).

All metadata is brought under control in a single system, the metadata center. It manages the storage of all existing metadata systems and provides independent module management for MQ metadata, plug-in metadata management, unified data types, and metadata retrieval.

Data warehouse construction is divided into three parts: a unified table naming format (catalog.db.table), data warehouse layering, and table permission management. When building the real-time data warehouse, we reuse the existing offline data warehouse model and create a matching set of real-time tables from it.
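As a rough illustration of the unified catalog.db.table format, the following Python sketch parses a fully qualified table name and resolves its schema from a stand-in for the metadata center. The names `TableId`, `parse_table_id`, `METADATA`, and `resolve_schema`, as well as the example table and columns, are assumptions made for this sketch and do not come from the platform.

```python
from typing import Dict, NamedTuple


class TableId(NamedTuple):
    catalog: str
    db: str
    table: str


def parse_table_id(name: str) -> TableId:
    """Split 'catalog.db.table' into its three parts, rejecting other shapes."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.db.table, got {name!r}")
    return TableId(*parts)


# Hypothetical stand-in for the metadata center: table id -> column types.
METADATA: Dict[TableId, Dict[str, str]] = {
    TableId("kafka", "music", "play_log"): {"song_id": "STRING", "ts": "BIGINT"},
}


def resolve_schema(name: str) -> Dict[str, str]:
    """Look up a table's schema so the user does not have to declare it."""
    table_id = parse_table_id(name)
    try:
        return METADATA[table_id]
    except KeyError:
        raise LookupError(f"unknown table: {name}") from None


schema = resolve_schema("kafka.music.play_log")
```

Because every table is addressed the same way, a query only needs the three-part name; the column definitions come from the metadata lookup rather than from user-written DDL.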

The SDK encapsulates the internal SQL execution logic and provides a simple API and data collection. The picture above is a demo built with the SDK, taken from real business code. The original implementation took nearly 190 lines of code; after encapsulation it takes fewer than a dozen, which makes it much easier for users.

Across data, data sources, and data writes, we provide fine-grained task-level metric monitoring and MQ data volume monitoring. Li Hanmiao believes, "This kind of cluster-level monitoring is indispensable once the platform reaches a certain scale. Thorough monitoring is a good complement to the platform."

Real-time data warehouse practice

The first typical use case for the real-time data warehouse is ABTest. Previously, ABTest stored the raw data in Hive, used Spark for cleaning and aggregation, and then wrote the results to upper-layer tables. It is worth noting that the layering of the real-time data warehouse is the same as that of the offline data warehouse.

The real-time data warehouse version of ABTest does away with the previous Hive + Spark processing mode, and it has worked very well in practice.

The second typical use case is real-time reports, as shown in the figure above: the statistics table for NetEase Cloud Music live streaming. With the real-time data warehouse, report tasks are easier to create and data problems are easier to locate.

The third typical use case is real-time features, with feature reuse and feature lineage display. Our statistics showed that many of the tasks run by the algorithm teams, and the features they output, are duplicated. This quietly wastes resources and raises the teams' development costs.

The real-time data warehouse is layered, and all tables are separated by business line under a unified standard. If an algorithm team wants to use certain features, it can search for them directly on the platform and then build on the information it finds.
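The feature lineage ("blood relationship") and reuse described above can be sketched as a small graph. This is only a minimal Python illustration under assumed names: the `Lineage` class, its methods, and the task and table names are all hypothetical and say nothing about how the platform actually collects lineage.

```python
from collections import defaultdict
from typing import Dict, List, Set


class Lineage:
    """Records which tables each task reads and writes."""

    def __init__(self) -> None:
        self._parents: Dict[str, Set[str]] = defaultdict(set)
        self._producers: Dict[str, Set[str]] = defaultdict(set)

    def record(self, task: str, inputs: List[str], output: str) -> None:
        """Register one task run: `inputs` are read, `output` is written."""
        self._parents[output].update(inputs)
        self._producers[output].add(task)

    def upstream(self, table: str) -> Set[str]:
        """Return every table that `table` depends on, directly or not."""
        seen: Set[str] = set()
        stack = [table]
        while stack:
            for parent in self._parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def producers(self, table: str) -> Set[str]:
        """Return the tasks that already write `table`."""
        return set(self._producers.get(table, ()))


lineage = Lineage()
lineage.record("clean_play_log", ["ods.play_log"], "dwd.play_log")
lineage.record(
    "user_play_count",
    ["dwd.play_log", "dim.user"],
    "feature.user_play_count_1h",
)

deps = lineage.upstream("feature.user_play_count_1h")
existing = lineage.producers("feature.user_play_count_1h")
```

With a record like this, a team can check whether a feature already has a producing task before writing a new one, and trace a bad value back through its upstream tables.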

 



Origin blog.csdn.net/Baron_ND/article/details/109645133