NetEase Cloud Music Real-time Data Warehouse Governance Optimization Practice

Today's topic is the governance and optimization practice of our real-time data warehouse.

Outline:

  1. Current Status and Problems

  2. Governance Practice

  3. Technical Optimization

  4. Future Plans

  5. Q&A

Speaker | Wang Lei, Data Development Expert, NetEase Cloud Music

01

Current Status and Problems

1. Current Status and Problems


The Cloud Music data warehouse platform has been online for more than six years. The cumulative number of users (including departed employees) exceeds 700, and daily UV exceeds 200. Users span data warehouse development, data products, analysts, algorithms, business development, QA, and other roles, and cover all business lines of Cloud Music. Typical workloads include metric construction, feature development, content monitoring, reporting, and online statistics. As the Cloud Music business has grown, every department has come to depend on big data processing, and almost all developers touch it to some degree. The platform currently runs 1,600+ real-time tasks and 7,000 to 8,000 offline tasks, of which more than 80% are SQL tasks. The cluster has roughly 2,000+ dedicated compute nodes, and the daily raw log volume exceeds 100 billion records.

2. Platform Positioning


Our idea in building the platform is to serve as a bridge between technology and business, integrating the two so that data can be used more efficiently. We position ourselves as the data platform team for the Cloud Music vertical: the demand comes mainly from within Music rather than from group-wide requirements. Compared with the group platform or a general-purpose cloud data platform, we sit closer to the business and our tools are more business-oriented. A general data development platform tends to expose generic capabilities and does not customize for business process specifications; we instead customize platform capabilities according to internal specifications and needs, go deep into the business to understand its requirements and pain points, and provide complete solutions. At the same time, we care about the business side's costs and aim to make overall usage more economical and save money for the business.

3. Overall Architecture


Our capabilities are built on top of cluster services that provide general data processing and governance capabilities. For example, the real-time task development platform Sloth, based on Flink, provides general real-time data processing and supports SQL for stream processing; the offline development platform Mammoth provides general offline task submission, scheduling, and management, and supports task types such as MR, Spark SQL, Jar, and Hive SQL; the metadata center provides general data warehouse metadata management and lineage tracking; and the security center, based on Ranger, provides basic permission management. On top of these capabilities provided by the group, we package and customize according to Cloud Music's internal specifications and requirements, implement business specifications on the platform, and bake in best practices, so that users can complete business data processing on the platform with higher efficiency and quality.

At present, more than 80% of the tasks on the platform are built with our custom components. For these tasks we understand the business requirements and task characteristics better, and we have a relatively high degree of control over them, which lets us optimize them in batches without users noticing and thereby improve development quality; this is a great help for subsequent governance work. Of course, it is also a double-edged sword: more involvement and intervention bring greater operation and maintenance pressure, and it is a big challenge for the development quality and capability of team members, who must consider the various application scenarios of each component comprehensively.

4. Why Governance

We undertook governance for several reasons:


  • First: Last year was a big year for cost reduction and efficiency improvement. Major companies everywhere were cutting costs and improving efficiency, so there was external pressure to optimize and govern resources.

  • Second: The water level was high. Huge business traffic kept the platform's Kafka utilization above 80% for a long time, and the problems were obvious: a sudden traffic peak could cause Kafka to jitter and affect downstream tasks.

  • Third: A new event-tracking system was launched inside Music. It supplements a lot of business information and solves many statistical problems, but the additional reported data tripled the traffic, putting very high pressure on the Kafka clusters and on all downstream Flink tasks and significantly affecting the stability of platform tasks.

  • Fourth: As mentioned earlier, Cloud Music has reached a state where everyone uses data, and developers in almost every role touch data development. Most platform users are not professional data developers, so stability, ease of use, and operations are all big challenges. Last year, 60% to 70% of the work-order issues on the platform were basic performance, conceptual, or configuration problems that could have been solved through documentation or simple training.

02

Governance Planning

Governance planning is mainly divided into four parts:

  • First: Assess the current state

Know what needs to be done and what the current situation is, so that governance is targeted and results can be obtained quickly and efficiently.

  • Second: Campaign-style governance

There are many historical tasks in stock. In the early stage, we need human effort to push governance forward in a campaign style, obtain results quickly, and bring the overall water level down.

  • Third: Technical optimization

During the campaign-style governance we also made technical optimizations to reduce task resource usage, improve task stability, and lower the resource water level of the compute clusters and the Kafka cluster.

  • Finally: Sustainability

After the above three parts are completed, we still need to consider the sustainability of the governance work; it is not a one-off effort. At the same time, we cannot always rely on manpower-driven campaign governance to solve the problem. We hope to convert the active gains of campaign governance into passive gains where users trigger governance behaviors themselves.

1. Assess the Current State


To assess the current state, we did the following:

We cooperated with the group's infrastructure team to integrate the group's resource monitoring service Smildon, obtain the resource usage of all tasks in the cluster in real time, and count the resources and costs used by every task, converting resources directly into money and feeding it back to users on the front end in real time. From the user's perspective, this gives the most direct perception of a task's cost, so users are more careful when requesting resources and more cooperative when the platform pushes governance. From the platform's perspective, we get an overall picture of resource usage, can start governance from the tasks with the highest resource usage, and quickly bring the resource water level down.

We also collected the relationship between task parallelism and input traffic and computed the per-parallelism throughput of every task. With this indicator we can quickly evaluate the overall processing capacity of the platform and quickly find tasks whose resource configuration may be problematic, so that they can be optimized and governed efficiently.
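For illustration only, a minimal sketch of such a per-parallelism throughput check is shown below; the task model, the record rates, and the 1,000 records/s threshold are assumptions, not the platform's actual values.

```java
import java.util.List;

/** Minimal sketch: flag tasks whose per-parallelism throughput looks too low. */
public class ThroughputAudit {

    /** Hypothetical task snapshot: name, parallelism, input records/second. */
    record TaskStat(String name, int parallelism, double recordsPerSecond) {}

    // Assumed threshold: a healthy subtask should process at least this many records/s.
    private static final double MIN_RECORDS_PER_SLOT = 1_000.0;

    static List<TaskStat> findOverProvisioned(List<TaskStat> stats) {
        return stats.stream()
                .filter(t -> t.recordsPerSecond() / t.parallelism() < MIN_RECORDS_PER_SLOT)
                .toList();
    }

    public static void main(String[] args) {
        List<TaskStat> stats = List.of(
                new TaskStat("dwd_user_action", 40, 8_000),   // 200 records/s per slot -> suspicious
                new TaskStat("ads_play_report", 4, 20_000));  // 5,000 records/s per slot -> fine
        findOverProvisioned(stats).forEach(t ->
                System.out.printf("Task %s: %.0f records/s per slot, candidate for downsizing%n",
                        t.name(), t.recordsPerSecond() / t.parallelism()));
    }
}
```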

To control the resource growth of each department, we use the department as the unit, integrate the real-time resource usage data collected from Smildon, and build logical virtual queues that count in real time the approximate resources each department uses. We then define an initial quota; if the quota is exceeded, the department must go through an application process to expand capacity, and this process keeps growth under control (the figure shows the resources of one virtual queue).

2. Efficient Governance


With the data indicators above, problematic tasks can be screened out quickly. Then, sorting tasks in descending order by resource usage, per-parallelism throughput, and other indicators, we can quickly optimize and govern the related tasks, converge resources, and obtain measurable results. Task governance falls mainly into the following categories:

(1) First: detecting and taking useless tasks offline

The key to this work is how to judge whether a task is still in use. Our judgment currently relies mainly on the following points:

  • Judging by lineage: if a task's output data is not consumed by any downstream task, the task is very likely useless. We obtain lineage in two main ways. For SQL tasks and for real-time tasks developed with our SDK, lineage is obtained through static SQL analysis, which is relatively accurate. For Jar tasks, key information is extracted from logs to derive lineage, which may not capture everything. For lineage collection we have therefore always advocated internally that convention matters more than technical cleverness: we push users to develop with SQL or our SDK so that lineage comes for free, rather than adapting to every development habit and spending effort extracting lineage in exotic ways. (A toy illustration of static SQL lineage extraction appears after this list.)

  • Judging by maintenance activity: if a task has gone unattended for a long time and its alarms are not handled, we ask the user to confirm whether it is still in use.

  • Judging by the business cycle: if the business itself has been retired, for example a one-off campaign, the related tasks can be pushed offline as well.
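As a toy illustration of the static-analysis idea (not the platform's actual parser), the sketch below pulls the sink and source tables out of a simple INSERT INTO ... SELECT statement with regular expressions; a production implementation would use a proper SQL parser.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy lineage extractor: finds sink and source tables in a simple INSERT ... SELECT. */
public class NaiveSqlLineage {

    private static final Pattern SINK =
            Pattern.compile("INSERT\\s+INTO\\s+([\\w.]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern SOURCE =
            Pattern.compile("(?:FROM|JOIN)\\s+([\\w.]+)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        String sql = "INSERT INTO dwd_user_action "
                + "SELECT o.userid, o.action FROM ods_log o JOIN dim_user d ON o.userid = d.id";

        Set<String> sources = new LinkedHashSet<>();
        Matcher src = SOURCE.matcher(sql);
        while (src.find()) {
            sources.add(src.group(1));
        }
        Matcher sink = SINK.matcher(sql);
        if (sink.find()) {
            System.out.println("sink:    " + sink.group(1));   // dwd_user_action
            System.out.println("sources: " + sources);         // [ods_log, dim_user]
        }
    }
}
```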

(2) Second: tasks whose resource allocation is unreasonable

Using the per-parallelism throughput indicator mentioned above, we quickly screen out tasks whose resource allocation is likely unreasonable, then adjust their parallelism and optimize their resources. Since most platform users do not have a data development background, such cases are still very common.

(3) Third: shrinking traffic leaves tasks with redundant resources. Many tasks had high traffic historically, but as the traffic gradually decreased their resources were never adjusted, which also leads to unreasonable overall allocation. As a follow-up, the historical processing volume of each task can be recorded and used to judge whether its overall resources are reasonable.

(4) Fourth: Technical optimization

To optimize overall resource usage we also did a lot of technical work, such as Flink SQL enhancements that add extra capabilities to improve overall performance, Kafka write optimization that lowers the overall Kafka water level through write batching, and the design and development of the partitioned stream table, which optimizes traffic usage, reduces useless message consumption, and lowers overall bandwidth and compute resources. These are introduced in detail later.

3. Sustainability


Sustainability means normalizing the governance work; this part of the functionality is still under development. We hope to encode the rules mentioned above into a governance platform, automatically scan out problematic tasks through automated processes, notify and nudge users, and let users govern their own tasks so that everyone participates in the governance work. In this way the platform moves from the active gains of campaign-style governance to the passive gains of automation.

03

Technical Optimization


Next, we introduce the technical optimizations in three parts: Flink SQL optimization, Kafka batch write optimization, and the partitioned stream table we designed and developed.

1. Flink SQL Optimization

Flink SQL has greatly lowered the development threshold for real-time computing and improved development efficiency, but it has also brought some problems. The logic behind the SQL is opaque to users, and users control less of it, which leads to unnecessary computation and fewer user-side optimization opportunities, and therefore to a lot of wasted resources in between. We illustrate this with some cases below.

(1) Case 1: Pre-filtering before message deserialization


Background: the log message format is userid\001os\001action\001logtime\001props, where props is JSON, so most of the performance cost of reading the stream table comes from JSON parsing. In the offline scenario we can do column pruning and read only the required data, but the real-time side is not as mature: whether or not the props field is needed, Flink SQL parses the entire message, which wastes a lot of resources.

To solve this problem, we made some optimizations in deserialization. Through configuration, users can filter messages before the complete log is parsed, as in the comparison of the two SQL statements in the figure above. Before the entire message is parsed, a keyword configuration filters out all messages that do not contain the 'user-fm' keyword; before props is parsed, os.list and action.list filter out redundant messages. These configurations eliminate a large amount of useless parsing, greatly improve overall task performance, and reduce CPU consumption. The optimization is effective in many scenarios and can cut performance loss by more than 50% in extreme cases.

Similar to column pruning in offline scenarios, this is on-demand parsing and on-demand deserialization. The optimization can go further: today users must configure it manually, and our ultimate goal is to apply format column pruning automatically based on the fields in the user's SELECT.
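For illustration only, a sketch of where such pre-filters would sit in a table definition is shown below. The 'filter.keyword', 'filter.os.list', and 'filter.action.list' option names and the 'custom-log' format are hypothetical stand-ins for Cloud Music's internal format options, not public Flink connector options.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/** Sketch of a source table with hypothetical pre-deserialization filter options. */
public class PreFilterExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // The 'filter.*' keys below are illustrative placeholders for the custom
        // format options described in the text; they are not built-in Flink options.
        tEnv.executeSql(
            "CREATE TABLE ods_user_log (" +
            "  userid STRING, os STRING, action STRING, logtime BIGINT, props STRING" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'ods_user_log'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'format' = 'custom-log'," +              // hypothetical custom format
            "  'filter.keyword' = 'user-fm'," +         // drop raw messages missing this keyword
            "  'filter.os.list' = 'android,iphone'," +  // drop rows before props is parsed
            "  'filter.action.list' = 'play,impress'" +
            ")");

        // Downstream SQL is unchanged; the filtering happens before the JSON in props is parsed.
        tEnv.executeSql("SELECT userid, action FROM ods_user_log").print();
    }
}
```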

(2) Case 2: Index construction scenario


The second case is index building. Many indexes are generated by joining multiple database tables into a large wide table, which is written into the index engine and then served to front-end users for query. The general process: the user subscribes to the database binlog through Flink to monitor data changes in the business database, joins the key data in the binlog with many business DB tables to build the wide table, and finally writes it into the index-building engine through Flink for users to query. There are several problems here:

  • First: Flink SQL reading Kafka is limited by the number of Kafka partitions; for example, 10 partitions can only be consumed by 10 parallel instances;

  • Second: when there are many dimension-table joins, because the upstream parallelism is capped, the downstream joins are limited by the query performance of the dimension tables; the more tables there are, the worse the processing performance of a single message.

Together, the two mean that overall processing performance cannot scale horizontally: no matter how much the Flink task's parallelism is increased, only 10 parallel instances ever process messages, resulting in serious task delays.


Our optimization plan is:

  • First: improve metrics monitoring. We collect all metrics related to dimension-table joins, such as the query RT of each dimension table, message deserialization performance, and the RT of writes to third-party storage, write them into task monitoring, and display them in Grafana. If a dimension table performs particularly poorly because of improper index design, it can be discovered and optimized quickly through the Grafana monitoring page.

  • Second: for the problem that more dimension tables mean worse performance, we added an asynchronous join configuration that enables Flink AsyncIO and improves the overall processing capacity of the task through asynchronous joins.

  • Third: for the problem that processing capacity is limited by the Kafka partition count: when reading Kafka messages, Flink automatically applies operator chaining and binds the read, parse, dimension-join, and write operators together, so the whole chain is limited by Kafka's parallelism and overall processing capacity is poor. Especially with many dimension-table joins, even with asynchronous optimization enabled the overall improvement is not particularly obvious. What is needed is the ability to separate these behaviors and set their parallelism independently. We therefore added a configuration in Flink SQL that inserts a parallelism change while reading table messages, separating message reading from the subsequent parsing and processing by adding a rescale or rebalance operation in between, so that the parallelism of reading and of subsequent processing can be set separately. When there is no need to shard by message content we recommend rescale, because its overhead is smaller. In this way the downstream dimension-table joins are no longer limited by the number of Kafka partitions, and processing capacity can be scaled horizontally by increasing the downstream parallelism. Of course, adding a rescale or rebalance operation in the middle reorders messages, so this optimization cannot be used in scenarios that require ordering. (A DataStream-level sketch of the same idea appears after this list.)

  • Finally: tasks of this type are IO-intensive and their input traffic is often not very large; increasing parallelism only increases the concurrency of DB dimension-table queries and improves overall throughput. So when optimizing this type of task we also tune the CPU configuration and do fine-grained CPU allocation through yarn.containers.vcores. By default one slot is assigned one CPU; this configuration controls the ratio. For example, with four slots and yarn.containers.vcores set to 2, each slot is allocated only 0.5 CPU, which also saves resources.
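The DataStream-level sketch below shows the same idea using only public Flink APIs: a source whose parallelism matches the Kafka partition count, a rescale() that breaks the operator chain, and an async dimension lookup that runs at a higher parallelism. The topic name, parallelism values, and the DimLookup stand-in are assumptions for illustration; the custom Flink SQL option the team added is not shown here.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Sketch: decouple source parallelism (bounded by Kafka partitions) from enrichment parallelism. */
public class RescaleAsyncJoinJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("binlog_topic")                      // hypothetical topic
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> raw = env
                .fromSource(source, WatermarkStrategy.noWatermarks(), "binlog-source")
                .setParallelism(10);                            // matches the 10 Kafka partitions

        // rescale() breaks the operator chain cheaply (no full shuffle, but ordering is lost),
        // so the async enrichment below can run at a much higher parallelism.
        DataStream<String> enriched = AsyncDataStream
                .unorderedWait(raw.rescale(), new DimLookup(), 1000, TimeUnit.MILLISECONDS, 100)
                .setParallelism(80);                            // scaled independently of Kafka

        enriched.print();
        env.execute("rescale-async-join-sketch");
    }

    /** Placeholder async dimension lookup; a real job would query the DB with an async client. */
    static class DimLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            CompletableFuture.supplyAsync(() -> key + "|dim_value")   // stand-in for a DB call
                    .thenAccept(v -> resultFuture.complete(Collections.singleton(v)));
        }
    }
}
```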

2. Kafka Batch Write Optimization


As mentioned earlier, our Kafka cluster has long been at a relatively high water level, reaching 80% at peak. In addition, the new business event-tracking system would soon go live, bringing a three-fold increase in traffic. To lower the overall Kafka cluster water level, we did the following:

  • First: improve Kafka monitoring

In the early days our Kafka operation and maintenance system was relatively simple and the monitoring indicators were incomplete, which made it difficult to start the optimization work. To better understand why the Kafka water level was so high, we referred to the Kafka community's approach and built fairly complete monitoring. The resulting data gave our optimization a direction, and the problems described below were all found through that monitoring data.

  • Second: traffic balance

A Kafka cluster serves many businesses; each business has many topics, and each topic has many partitions distributed across the cluster machines. The mapping between partitions and machines is maintained manually, the distribution of partitions is uneven, and the traffic of each partition differs, which directly causes some machines to be heavily loaded and others lightly loaded. The current solution is relatively crude: the traffic of each machine can be seen intuitively through monitoring, and PE then manually rebalances topic partition assignments with tools to keep machine traffic balanced and stable. This is a common problem in open-source Kafka; in the future we will consider replacing Kafka with Pulsar and using its storage-compute separation architecture to solve it.

  • Third: optimization of message sending

Monitoring showed that past Kafka high-water-level incidents were mostly due to insufficient processing thread pools and relatively heavy disk IO, while the overall message volume was fine. Digging deeper, we found that the batch.size configuration for message sending was not taking effect: in many cases only one or two messages were sent per request, so sending 100 messages required 100 requests, which drove Kafka's message-processing thread usage very high, and the disk IO frequency along with it. But why did batch.size not take effect? Investigation showed that Kafka batching depends on both batch.size and linger.ms, the maximum tolerable delay, whose default is 0. Yet even after tuning linger.ms the batching effect was still not obvious; in the end we found it is also related to the producer's partitioner strategy.


The partitioner strategy has to balance the following considerations:

  • Partition balance: messages must be spread evenly across all partitions, otherwise data skew puts pressure on the brokers and on downstream consumers.

  • The accumulated batch should not be too large, otherwise message latency increases and so does the downstream pressure of processing a single batch.

  • Maximum tolerance time: if the batch has not reached the configured maximum size within the maximum tolerance time, a send request is triggered anyway.

There is a trade-off among the three: too large hurts latency, too small and the IO batching does no good. Kafka 2.4 introduced a new partitioning strategy, the Sticky Partitioner, and added an onNewBatch method to the public Partitioner interface, which is called every time a new batch is created. When it is called, the Sticky Partitioner randomly selects a partition and accumulates subsequent messages for it; at the next onNewBatch the accumulated messages are packed into one batch and sent to that partition, a new partition is randomly selected, and accumulation continues. This strategy preserves partition balance while maximizing the batching effect. In practice the performance improvement from the Sticky Partitioner batching strategy was very obvious: the water level of the entire Kafka cluster dropped from 80% to 30%.
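A minimal producer configuration along these lines is sketched below; the broker address, topic, and the specific batch.size and linger.ms values are illustrative assumptions. On Kafka clients 2.4 and later the sticky behavior is already the default partitioner for keyless records, so it usually does not need to be set explicitly.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

/** Sketch: producer settings that make batching actually take effect. */
public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");       // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);   // 64 KB per batch (illustrative)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);           // wait up to 50 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // fewer bytes per request

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // No key: with clients 2.4+ the default (sticky) partitioner fills one
                // partition's batch at a time instead of round-robining single records.
                producer.send(new ProducerRecord<>("ods_user_log", "message-" + i));
            }
        }
    }
}
```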

3. Partitioned Stream Table Optimization

(1) Data warehouse processing flow


In the offline scenario we can reduce unnecessary data reading through columnar storage, bucketing, partitioning, indexing, and so on, thereby improving program performance. To reduce the overall cost of the real-time cluster and withstand the three-fold traffic impact of the new event tracking, we referred to Hive's partition design and implemented a set of real-time partitioned stream tables.

The figure above shows a fairly conventional real-time data warehouse processing flow. The normal log pipeline goes through DS archiving (a NetEase internal service that collects logs to Kafka and HDFS), then cleaning and formatting through Flink into the ODS layer, then into the DWD layer; business applications consume DWD and produce the ADS layer to serve upper-layer applications. In this process the ODS log volume is very large, and every DWD task must consume the full ODS layer. If the ODS layer's log traffic is 700 MB/s, then every downstream DWD task must consume that 700 MB/s; handling such traffic takes about 900 cores, equivalent to 9 newly provisioned physical servers, and every additional DWD table task needs 9 more physical servers. Moreover, at such traffic the stability of the tasks cannot be guaranteed: any log fluctuation has a relatively large impact on downstream tasks, the pressure on Kafka is very high, and the cost is unacceptable.

(2) Previous solution


Our previous solution was to split the original log at the source: a separate distribution program split the raw log into different topics according to business needs. Some companies in the industry do the same, but this approach has the following problems:

① Operation and maintenance costs are high, the splitting granularity is relatively coarse, downstream tasks still consume a certain amount of useless traffic, and further splitting later is difficult.

② Users need a lot of prior knowledge to use the real-time stream tables: they must understand the distribution rules to read the right table, which makes the tables expensive to use.

③ Real-time and offline data warehouse modeling cannot be unified. If you later want stream-batch integration, the real-time and offline warehouses cannot share the same schema, so one set of code cannot serve both real-time and offline processing.

④ It cannot be migrated or reused. The distribution program is a one-off customization at the source; the distributed data that downstream consumers read may itself still have a large-traffic problem, and downstream users cannot easily reuse the solution, so it cannot keep generating value.

(3) Partitioned stream table optimization


Referring to the partition design of Hive tables, we redesigned the metadata structure of the real-time stream table so that it also has the concept of partitions; the partition metadata contains the mapping between partitions and Kafka topics. We then customized the Kafka connector: when writing to the stream table, messages are routed to different topics according to the partition field in the message and the partition metadata; when reading, we implemented partition pushdown on top of the Kafka connector, which automatically infers the required partition topics from the partition conditions in the user's query SQL, prunes the unneeded partition topics, and reduces unnecessary message reading and resource waste.

With the partitioned stream table, we archive logs directly from DS, and all downstream DWD tasks enjoy the traffic-optimization dividend of the partitioned stream table without even being aware of it. Users only need to set the partition rules and read and write with SQL; no separate distribution program is needed, and the cost of reuse is very low. The construction of downstream high-traffic DWD-layer tables can also reuse the partitioned stream table technique at very low cost, further reducing overall traffic.


In the example, you only need to define the partition field when writing, and messages are automatically written to the corresponding topic according to that field; when reading with a query condition on the partition field, the source topics are inferred automatically.
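The sketch below is purely illustrative: the 'partitioned-kafka' connector, its option names, and the partition-to-topic mapping syntax are invented placeholders for the internal implementation described above, which is not open source.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/** Illustrative sketch of a partitioned stream table; the connector and options are hypothetical. */
public class PartitionedStreamTableExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical DDL: 'os' is the partition field, and the partition metadata maps each
        // partition value to its own Kafka topic (e.g. dwd_user_action_android).
        tEnv.executeSql(
            "CREATE TABLE dwd_user_action (" +
            "  userid STRING, action STRING, logtime BIGINT, os STRING" +
            ") PARTITIONED BY (os) WITH (" +
            "  'connector' = 'partitioned-kafka'," +           // invented connector name
            "  'topic-pattern' = 'dwd_user_action_{os}'," +    // invented partition-to-topic mapping
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'format' = 'json'" +
            ")");

        // Writes route each row to the topic of its 'os' partition value.
        // Reads with a predicate on 'os' are pruned to only the matching topics,
        // so an 'android'-only consumer never reads the traffic of other partitions.
        tEnv.executeSql(
            "SELECT userid, action FROM dwd_user_action WHERE os = 'android'").print();
    }
}
```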

04

Future Plans

The future plan consists of two main parts: containerization of big data workloads and an automated governance platform.


1. Containerized Transformation

We hope to gain the following capabilities through containerization:

  • First: Excellent resource isolation. K8S containers use cgroups, which provide relatively good resource isolation and solve the problem of tasks interfering with each other in the Yarn environment. Cgroups can also be enabled on Yarn, but the configuration is inflexible and harder to maintain.

  • Second: Refined resource allocation

In the Yarn environment we can only do limited fine-tuning through yarn.containers.vcores, and the overall granularity is relatively coarse. On K8S, resources can be allocated at a granularity of 1/1000 of a core, so resource configuration is finer and more flexible.

  • Third: Macro monitoring system

In the Yarn environment, because there is no good resource isolation and no container-level resource utilization metrics, it is difficult to judge from macro indicators such as machine load, CPU utilization, memory utilization, and bandwidth usage whether a task uses resources reasonably. In the K8S environment, with good resource isolation and complete resource metrics, we can build a task-level macro monitoring system and, as with ordinary web applications, use macro metrics such as CPU utilization, IO utilization, and memory utilization to quickly assess whether a Flink application's resource request is reasonable and to govern it quickly.

  • Fourth: Flexible resource scheduling

K8S supports highly customizable scheduling strategies, which makes it convenient to choose different machine types according to task characteristics; it also allows mixed deployment with other workloads (such as machine learning, online services, and offline computing) to maximize overall resource utilization.

2. Automated Governance Platform

As mentioned earlier, we hope to build a metadata warehouse by collecting metadata about data warehouses, tasks, and platform users, and to configure rules on top of it. Before a job goes live, the rules run legality checks, such as whether the SQL is standardized and whether alarms are complete; a service also scans periodically with the rules to find problems automatically, such as whether resources are reasonable and whether a task can be taken offline, and pushes users to govern them. After a task is governed, the effect is measured and the results are published as red and black lists, forming a virtuous closed loop.
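A toy sketch of such a pre-launch rule check is shown below; the Rule interface, the example rules, and the task model are invented for illustration and are not the platform's actual design.

```java
import java.util.List;

/** Toy pre-launch rule check: each rule inspects a task and reports a violation. */
public class GovernanceRuleCheck {

    /** Hypothetical task descriptor. */
    record TaskMeta(String name, String sql, boolean alarmConfigured, int parallelism) {}

    interface Rule {
        /** Returns a violation message, or null if the task passes. */
        String check(TaskMeta task);
    }

    static final List<Rule> RULES = List.of(
            t -> t.alarmConfigured() ? null : "no alarm configured",
            t -> t.sql().toUpperCase().contains("SELECT *") ? "SELECT * is not allowed" : null,
            t -> t.parallelism() <= 128 ? null : "parallelism above quota, needs approval");

    public static void main(String[] args) {
        TaskMeta task = new TaskMeta("dwd_user_action",
                "INSERT INTO dwd SELECT * FROM ods", false, 64);
        RULES.stream()
             .map(r -> r.check(task))
             .filter(msg -> msg != null)
             .forEach(msg -> System.out.println("[pre-launch check] " + task.name() + ": " + msg));
    }
}
```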

05

Q&A

Q1: Is there any stream-batch integration work based on the partitioned stream table?

A: Our current solution is to build a data model layer that associates offline data warehouse tables with real-time stream tables: offline scenarios read the offline table, and real-time scenarios read the real-time stream table. The partitioned stream table technique described above solves the problem of excessive traffic on real-time warehouse tables and unifies the modeling of the offline and real-time warehouses, so it is easy to map the two one-to-one at the data model level.

We also have a development tool called FastX, a low-code development tool built on top of the data model. Data models can be managed in FastX, and development is then done through low-code configuration based on the model: a single set of unified computing logic is generated from one logic configuration and runs in both the stream and batch environments.

Q2: How do you shield the differences between SQL dialects?

A: FastX produces a unified DSL from the low-code configuration. In the real-time scenario it selects the real-time data warehouse source and generates Flink SQL from the DSL; in the offline scenario it selects the offline source and generates Spark SQL for execution. The upper-layer interaction is limited and the set of operators is controllable, so we cover business scenarios gradually: one set of logic runs in both the offline and real-time environments. At present FastX is positioned as a development platform for specific business scenarios; covering every scenario is certainly difficult, and following the 80/20 principle we hope to cover 80% of business scenarios.

Q3: What are the similarities and differences in methodology between real-time and offline data warehouse governance?

A: The methodology feels similar, but compared with the offline data warehouse, the real-time data warehouse has a much shorter history. Offline data warehouse governance has many indicators for evaluating warehouse quality, such as penetration rate, reuse rate, and idle rate, and structural improvements are made toward these quantitative goals. In the real-time scenario the warehouse structure is relatively simple and has fewer layers; generally no DWS layer is built, at most ODS to DWD and then to the business layer, because building a DWS layer is very costly and the storage is hard to choose. Flink tasks in real-time scenarios are sensitive to resources, stability, and latency, and when necessary the data warehouse construction specifications must yield to resource and performance concerns. Therefore, in real-time data warehouse governance we pay more attention to stability and resource governance.


About the Speaker

Wang Lei

NetEase Cloud Music

Data Development Expert

Wang Lei holds a bachelor's degree from Hangzhou Dianzi University and joined NetEase in 2013. He has 10 years of data development experience, participated in building the Cloud Music data platform from 0 to 1, and is currently responsible for building Cloud Music's real-time and offline platforms.
