VLDB 2023 Paper Interpretation: How ByteDance Manages Operations for Ultra-Large-Scale Streaming Jobs

This article interprets the paper "StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance", jointly published at VLDB 2023, the top international conference on databases and data management, by Professor Ma Tianbai's team at the National University of Singapore and ByteDance's stream computing team (Infrastructure - Computing). The paper presents a runtime management solution for streaming jobs, distilled from ByteDance's experience operating tens of thousands of Flink streaming jobs. It effectively handles the runtime issues that arise from changes in traffic and the execution environment while a streaming job runs, many of which previously required manual intervention, and advances a NoOps-style operational model. StreamOps manages ultra-large-scale streaming jobs and provides capabilities including automatic scaling, automatic slow-node migration, and intelligent diagnosis of latency and failures; it can also be extended with plugins. StreamOps has been validated at scale inside ByteDance: it saves 15% of computing resources daily, migrates slow nodes about 1,000 times a day, reduces manual oncalls by 75%, and significantly lowers the maintenance cost of streaming jobs at ultra-large scale.

Paper link: https://www.vldb.org/pvldb/vol16/p3501-mao.pdf

Introduction

In recent years, stream computing has been widely used for large-scale real-time data processing and decision making. ByteDance uses Flink as its stream processing engine; tens of thousands of Flink jobs run on internal clusters every day, with peak traffic reaching 9 billion records per second. Because streaming jobs often run for days or longer, their workloads and execution environments change over time. Inside ByteDance, traffic at peak hours is on average 4-5x higher than at trough hours, and jobs constantly face problems such as contention for underlying resources and heterogeneous machine types. These changes cause various runtime problems, such as data backlog and assorted failures, which in turn demand frequent manual intervention or wasteful over-provisioning of resources. With the rapid growth of stream computing, a runtime management system that automatically resolves these problems is urgently needed. However, designing such a service at ByteDance's scale is challenging: it must scale well enough to manage all streaming jobs uniformly at the cluster level and apply effective control policies to different runtime problems, and it must also be extensible so that new control policies can be developed for new runtime problems.

To address these challenges, the paper proposes StreamOps, a cloud-native runtime management system for streaming jobs that effectively reduces the maintenance cost of users' streaming jobs at large scale. StreamOps is designed as a lightweight, scalable control service that runs independently of the streaming jobs it manages, so that it can manage large numbers of jobs uniformly. It abstracts a programming paradigm for control policies to support rapidly building new ones. Based on ByteDance's long-term production experience, it ships with three core control policies: automatic scaling of streaming jobs, automatic slow-node migration, and intelligent diagnosis of latency and failures. This article presents the design decisions behind StreamOps and the lessons learned, and reports experiments in an internal production environment that validate its effectiveness.

Introduction to StreamOps

The figure above shows the overall architecture and workflow of StreamOps, which consists of three main components:

  1. Control Plane Service: a horizontally scalable, stateless service that manages streaming jobs at the cluster level. It is deployed independently of the streaming jobs, decoupling the control plane from the stream computing engine for better flexibility and scalability.

  2. Global Storage: stores job metrics, logs, and other data needed for control policy decisions, as well as the state of the control plane service itself.

  3. Runtime Management Trigger: each streaming job is equipped with a runtime management trigger that sends requests to the control plane service to initiate management operations. Requests can be triggered periodically, when a specific condition is met, or manually.

The overall workflow is:

  1. A streaming job sends a control request to the control plane service according to its trigger policy.

  2. Upon receiving the request, the control plane service pulls job metrics and the state of the control policy itself from global storage to make a policy decision.

  3. Once the control policy reaches a decision, it either applies a configuration change to the running streaming job or issues an alert to the user (a minimal code sketch of this loop follows the list).
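To make this loop concrete, here is a minimal Java sketch of what the control plane's request handling could look like. All type and method names (`MetricStore`, `ControlPolicy`, `Reconfigurer`, `onTrigger`) are our own illustrative assumptions, not StreamOps's actual API.

```java
import java.util.Map;

/** Hypothetical sketch of the StreamOps control loop; all names are illustrative. */
public final class ControlPlaneService {

    /** Abstraction over the global storage holding metrics and policy state. */
    interface MetricStore {
        Map<String, Double> fetchJobMetrics(String jobId);
    }

    /** A control policy turns metrics into a decision (rescale, migrate, alert, ...). */
    interface ControlPolicy {
        Decision decide(String jobId, Map<String, Double> metrics);
    }

    /** Applies configuration changes to a running job, or alerts the user. */
    interface Reconfigurer {
        void apply(String jobId, Decision decision);
    }

    record Decision(String action, Map<String, String> params) {
        static final Decision NO_OP = new Decision("no-op", Map.of());
    }

    private final MetricStore store;
    private final ControlPolicy policy;
    private final Reconfigurer reconfigurer;

    ControlPlaneService(MetricStore store, ControlPolicy policy, Reconfigurer reconfigurer) {
        this.store = store;
        this.policy = policy;
        this.reconfigurer = reconfigurer;
    }

    /** Invoked when a job's runtime management trigger fires (periodic, conditional, or manual). */
    public void onTrigger(String jobId) {
        Map<String, Double> metrics = store.fetchJobMetrics(jobId); // step 2: pull data
        Decision decision = policy.decide(jobId, metrics);          // step 2: decide
        if (!"no-op".equals(decision.action())) {
            reconfigurer.apply(jobId, decision);                    // step 3: act or alert
        }
    }
}
```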

Control plane service

StreamOps follows the design principle of separating policy from mechanism, splitting the overall control process into two parts: control policies and control mechanisms. Control policies are responsible for decision making and are implemented under a common programming paradigm that abstracts three steps: detect, diagnose, and resolve. Control mechanisms are responsible for interacting with external systems, fetching metrics, and executing the configuration changes that decisions call for. Common metric-fetching and reconfiguration mechanisms are encapsulated for reuse. Together, these measures let a new control policy be implemented and rolled out at low cost.
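A minimal sketch of what such a detect-diagnose-resolve abstraction could look like in Java follows; the interface and type names are assumptions for illustration, since the paper does not publish its code.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical detect-diagnose-resolve policy abstraction; all names are illustrative. */
interface RuntimePolicy {

    /** A decision produced by a policy, e.g. ("rescale", {"parallelism": "128"}). */
    record Decision(String action, Map<String, String> params) {
        static final Decision NO_OP = new Decision("no-op", Map.of());
    }

    /** Step 1: detect — does the job exhibit a symptom worth acting on? */
    Optional<String> detect(String jobId, Map<String, Double> metrics);

    /** Step 2: diagnose — identify the root cause behind the symptom. */
    String diagnose(String jobId, String symptom, Map<String, Double> metrics);

    /** Step 3: resolve — map the root cause to a concrete action or an alert. */
    Decision resolve(String jobId, String rootCause);

    /** Shared orchestration: the control plane only ever calls run(). */
    default Decision run(String jobId, Map<String, Double> metrics) {
        return detect(jobId, metrics)
                .map(symptom -> resolve(jobId, diagnose(jobId, symptom, metrics)))
                .orElse(Decision.NO_OP);
    }
}
```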

Control policy

The figure above shows the overall decision process of a control policy. The policy first obtains the streaming job's runtime metrics and configuration from the metric collector, then walks through the three steps of detect, diagnose, and resolve to reach a decision, which is finally handed to the job reconfigurer for execution. Common actions include scaling up or down, migrating nodes, or simply sending an alert for the user to handle manually.
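As an illustration only, here is a toy backlog policy expressed against the `RuntimePolicy` interface sketched above; the metric names and thresholds are invented for the example.

```java
import java.util.Map;
import java.util.Optional;

/** Toy backlog policy in the detect-diagnose-resolve paradigm; metrics/thresholds invented. */
class BacklogPolicy implements RuntimePolicy {

    private static final double BACKLOG_THRESHOLD = 1_000_000; // pending records, assumed

    @Override
    public Optional<String> detect(String jobId, Map<String, Double> metrics) {
        double backlog = metrics.getOrDefault("source.backlog", 0.0);
        return backlog > BACKLOG_THRESHOLD ? Optional.of("backlog") : Optional.empty();
    }

    @Override
    public String diagnose(String jobId, String symptom, Map<String, Double> metrics) {
        // If all subtasks are uniformly busy, overall resources are insufficient;
        // if only a few are, suspect a slow node or data skew instead.
        double maxBusy = metrics.getOrDefault("subtask.busy.max", 0.0);
        double avgBusy = metrics.getOrDefault("subtask.busy.avg", 0.0);
        return (maxBusy - avgBusy) < 0.2 ? "under-provisioned" : "load-imbalance";
    }

    @Override
    public Decision resolve(String jobId, String rootCause) {
        if (rootCause.equals("under-provisioned")) {
            return new Decision("rescale", Map.of("factor", "2")); // hand off to the rescaler
        }
        return new Decision("alert", Map.of("reason", rootCause)); // e.g. skew: notify the user
    }
}
```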

Control mechanism

1. Metric collection

Besides the metrics of the compute engine itself, control policies commonly use source-related metrics from the MQ side and resource-related metrics from the Kubernetes side. ByteDance caches all three kinds of metrics in a central time-series database. StreamOps integrates with this internal time-series database, so control policies can run rich queries over different kinds of metrics as needed.
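As a hypothetical illustration of the kind of query a policy might run (the client facade and metric names below are invented; the paper does not specify the time-series database interface):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Invented TSDB client facade; illustrates the kind of queries a policy might issue. */
interface TimeSeriesClient {
    /** Returns one value per timestamped sample of the given metric for the job. */
    List<Double> range(String metric, String jobId, Instant from, Instant to);
}

class MetricCollector {
    private final TimeSeriesClient tsdb;

    MetricCollector(TimeSeriesClient tsdb) { this.tsdb = tsdb; }

    /** Average source backlog over the last window, using an MQ-side lag metric. */
    double avgBacklog(String jobId, Duration window) {
        Instant now = Instant.now();
        List<Double> samples = tsdb.range("mq.consumer.lag", jobId, now.minus(window), now);
        return samples.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }
}
```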

2. Runtime configuration changes for streaming jobs

Configuration changes can always be applied by restarting the job, but restarts are highly disruptive to users. To speed up changes, we first use hot-update APIs to apply job updates in place. Our analysis also found substantial room for optimization in these operations. First, for operations that change resources, a large share of the time goes to resource allocation, up to 70% for jobs with small state; we therefore implemented a resource pre-allocation mechanism and integrated it with StreamOps. For jobs with large state, most of the time goes to state recovery; we optimized RocksDB's DB merging and pruning mechanisms, speeding up overall state recovery by 10x. With all these optimizations, total downtime drops from the minutes a full restart requires to seconds, which is almost imperceptible to users.
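Conceptually, resource pre-allocation reorders the rescale steps so the slow resource request overlaps with normal processing, and downtime only begins once the new resources are ready. A rough sketch, with all APIs invented:

```java
/** Rough sketch of rescaling with resource pre-allocation; all APIs are invented. */
class Rescaler {
    interface ResourcePool { void preAllocate(int slots); }   // ask K8s for slots up front
    interface Job { void stopWithSavepoint(); void restoreAndRun(int parallelism); }

    void rescale(Job job, ResourcePool pool, int newParallelism) {
        // 1. Pre-allocate: overlap the (slow) resource request with normal processing.
        pool.preAllocate(newParallelism);
        // 2. Only once resources are ready, stop the job; downtime starts here.
        job.stopWithSavepoint();
        // 3. Restore state (accelerated by RocksDB merge/prune optimizations) and resume.
        job.restoreAndRun(newParallelism);
    }
}
```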

Implementation of the core control policies

The goals of StreamOps's core control policies are: (1) to ensure a job's processing keeps up with its incoming data rate, resolving message backlog and runtime exceptions; and (2) to improve cluster resource utilization and reduce cost. ByteDance's production experience shows two main causes of message backlog: insufficient overall resources, and load imbalance, where the imbalance is in turn caused either by machines that run slowly (slow nodes) or by data skew. Runtime exceptions have many causes and often cannot be resolved from inside the compute engine. StreamOps therefore implements the three core control policies shown in the figure above:

  1. Automatic scaling: resolves under- or over-provisioning of a job's overall resources.

  2. Automatic slow-node migration: resolves message backlog caused by machines that run slowly.

  3. Intelligent diagnosis: provides diagnoses and suggestions for problems such as data skew and runtime anomalies that often cannot be resolved from inside the compute engine.

Automatic scaling

We extended the DS2 [1] model to build an auto-scaling model suited to ByteDance's workloads; the overall decision process is shown above. The model considers both a job's message backlog and its operators' load to decide whether a scaling operation is needed. For scaling decisions, it additionally examines the workload over a recent window to rule out anomalies such as severe data skew or job failures, avoiding wrong decisions. During execution, it combines mechanisms such as job hot update and accelerated RocksDB DB merging and pruning to achieve fast recovery with close to zero downtime.
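For intuition, DS2 estimates each operator's "true" processing rate, the rate an instance achieves while actually doing useful work, and sizes parallelism so that capacity matches the observed input rate. The sketch below adds a simple backlog-draining term in the spirit of the extension described here; the drain-time knob is an invented example parameter, not the paper's formula.

```java
/** Simplified DS2-style parallelism estimate; the backlog term and constants are illustrative. */
final class Ds2Scaler {

    /**
     * @param trueRatePerInstance records/s one parallel instance processes while actually busy
     *                            (observed rate divided by the fraction of time spent working)
     * @param inputRate           current records/s arriving at the operator
     * @param backlog             pending records to drain
     * @param drainSeconds        how quickly the backlog should be cleared (invented knob)
     */
    static int optimalParallelism(double trueRatePerInstance, double inputRate,
                                  double backlog, double drainSeconds) {
        double targetRate = inputRate + backlog / drainSeconds; // keep up AND drain backlog
        return (int) Math.max(1, Math.ceil(targetRate / trueRatePerInstance));
    }

    public static void main(String[] args) {
        // One instance handles 50k rec/s when busy; 300k rec/s arriving; 6M records backlogged.
        int p = optimalParallelism(50_000, 300_000, 6_000_000, 120);
        System.out.println("suggested parallelism = " + p); // 7 (6 for input + 1 to drain)
    }
}
```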

Automatic slow-node migration

For slow nodes caused by a small number of problematic machines in the environment, the detection model uses algorithms to filter out normal variation, identify abnormally slow subtasks that cluster on the same machines in the job topology, and then migrate and recover them. During execution, optimizations such as node blocklisting and resource pre-allocation make the migration faster and more stable.
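One plausible shape for such a detector is sketched below: flag subtasks whose throughput falls far below the median of their peers, then keep only hosts where several outliers cluster. The statistics and thresholds are our illustrative assumptions, not the paper's exact algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative slow-node detector: robust outliers + machine aggregation; thresholds invented. */
final class SlowNodeDetector {

    record Subtask(String id, String host, double rate) {}

    /** Hosts where at least minHits subtasks run far below the median peer rate. */
    static List<String> suspectHosts(List<Subtask> subtasks, double slowFraction, int minHits) {
        double median = median(subtasks.stream().map(Subtask::rate).sorted().toList());
        Map<String, Integer> hits = new HashMap<>();
        for (Subtask st : subtasks) {
            if (st.rate() < slowFraction * median) {          // e.g. < 50% of median throughput
                hits.merge(st.host(), 1, Integer::sum);
            }
        }
        List<String> out = new ArrayList<>();
        hits.forEach((host, n) -> { if (n >= minHits) out.add(host); }); // must cluster on a host
        return out;
    }

    private static double median(List<Double> sorted) {
        int n = sorted.size();
        return n % 2 == 1 ? sorted.get(n / 2) : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        List<Subtask> sts = List.of(
                new Subtask("0", "hostA", 100), new Subtask("1", "hostA", 98),
                new Subtask("2", "hostB", 30),  new Subtask("3", "hostB", 25),
                new Subtask("4", "hostC", 101));
        System.out.println(suspectHosts(sts, 0.5, 2)); // [hostB]
    }
}
```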

Intelligent diagnosis

StreamOps also implements an intelligent diagnosis system (Job Doctor) with a visualization platform for users and operations staff. It covers four categories of diagnostic rules: resource usage analysis and suggestions, runtime exception collection and analysis with suggestions, Flink configuration analysis and suggestions, and bottleneck analysis and suggestions. Users can run diagnoses on demand, and the system also performs periodic inspections, alerting users and offering corresponding remediation suggestions.
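To illustrate the rule-based style (the specific rules and thresholds below are invented examples, not Job Doctor's actual rule set):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy rule-based job diagnosis in the Job Doctor style; rules and thresholds are invented. */
final class JobDoctor {

    interface Rule {
        /** Returns a human-readable suggestion if the rule fires, else null. */
        String check(Map<String, Double> metrics);
    }

    private static final List<Rule> RULES = List.of(
            m -> m.getOrDefault("heap.usage", 0.0) > 0.9
                    ? "Heap usage > 90%: consider increasing TaskManager memory." : null,
            m -> m.getOrDefault("gc.time.ratio", 0.0) > 0.2
                    ? "GC takes > 20% of time: tune GC or reduce state held on heap." : null,
            m -> m.getOrDefault("backpressure.ratio", 0.0) > 0.5
                    ? "Severe backpressure: inspect the bottleneck operator or rescale." : null);

    static List<String> diagnose(Map<String, Double> metrics) {
        List<String> suggestions = new ArrayList<>();
        for (Rule r : RULES) {
            String s = r.check(metrics);
            if (s != null) suggestions.add(s);
        }
        return suggestions;
    }

    public static void main(String[] args) {
        System.out.println(diagnose(Map.of("heap.usage", 0.95, "backpressure.ratio", 0.7)));
    }
}
```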

Experimental results

Overall effect of the control plane

We first evaluate StreamOps's ability to manage jobs at cluster scale. We deployed a 50-node StreamOps cluster in the production environment for stress testing, each node configured with a 16-core CPU and 32 GB of memory. As the figure below shows, StreamOps sustains up to 33k requests per second with a P95 response time within 60 s, demonstrating good scalability.

Effect of automatic scaling

The figure shows automatic scaling on a high-traffic production job: panel (a) plots the job's parallelism against its input rate, and panel (b) plots the message backlog over the same period. StreamOps scales the job up and down effectively with the input rate, saving at least 60% of CPU resources. The scaling operation itself does cause a temporary backlog, so each job can tune parameters at a fine granularity to trade off the cost of a scaling action against the sensitivity of auto-scaling.

Effect of automatic slow-node migration

The figure below shows two representative production jobs inside ByteDance that suffered message backlog due to slow nodes. StreamOps accurately identified and automatically migrated the slow nodes, effectively resolving the backlog they caused.

The next figure further shows the top-5 partitions by backlog for the two jobs above, together with their backlogs. More than 80% of the message backlog is concentrated in these top-5 partitions, indicating that the subtasks consuming them ran on slow nodes. StreamOps accurately identified these slow nodes, and the backlog was indeed resolved after migration.

Effect of intelligent diagnosis

Panel (a) shows the number of runtime problems of each type that StreamOps successfully diagnosed per day over a period of time. Panel (b) compares, over one week, the number of problems successfully diagnosed per day against those that still went on to manual oncall handling after diagnosis was enabled. Previously, users typically opened a manual oncall directly whenever a runtime problem occurred; with intelligent diagnosis in place, the number of manual oncalls drops markedly.

Summary

This article presented StreamOps, a cloud-native runtime management system for streaming jobs. The authors implement it as a stateless service independent of the streaming jobs so that it can manage large numbers of jobs uniformly and efficiently. They split the overall control process into policies and a set of reusable mechanisms for interacting with external systems, and abstract the policy side into a universal three-step detect-diagnose-resolve programming paradigm, so that new control policies can be implemented quickly and at low cost. StreamOps ships three major control policies, automatic scaling, automatic slow-node migration, and intelligent latency/failure diagnosis, which address the production pain points of message backlog, runtime failures, and resource waste; its efficiency and effectiveness have been validated in ByteDance's internal production environment.

References

[1] Vasiliki Kalavri et al. Three Steps is All You Need: Fast, Accurate, Automatic Scaling Decisions for Distributed Streaming Dataflows (DS2). OSDI 2018. https://www.usenix.org/conference/osdi18/presentation/kalavri

Author information:

  • Chen Zhanghao, ByteDance infrastructure engineer, stream computing expert, and Apache Flink Contributor. He holds a master's degree from the University of Illinois at Urbana-Champaign and has worked on stream computing R&D since graduation.

  • Zhang Yifan, ByteDance infrastructure engineer and stream computing expert. He holds a master's degree from Hangzhou Dianzi University, previously worked at NetEase, and now works full-time on stream computing systems and services at ByteDance.
