ByteDance's Cloud-Native Big Data Platform: Operation and Maintenance Management Practice

Cloud-native big data is the next-generation architecture and operating model for big data platforms. As ByteDance's internal business grew rapidly, the drawbacks of the traditional big data operation and maintenance platform became increasingly apparent: numerous components, complex installation and operation, and excessive coupling with the underlying environment; for the business side, out-of-the-box logging, monitoring, and alerting were missing. Against this background, we carried out a series of cloud-native big data operation and maintenance management practices. By managing operations in a cloud-native way, we can ultimately reduce the business side's need to be aware of running state, shield it from environmental differences, and provide a unified experience across environments.
Author | Luo Laifeng, Senior R&D Engineer, ByteDance
 

Business Status and Background

Over the past few years, ByteDance has accumulated many big data engines and tools while supporting its own business, and is now exploring how to deliver the capabilities of these engines and tools as standardized products. The main difficulties in this process are:
  • Many components: in the big data domain, completing a job requires many components: distributed big data storage; task execution engines such as Flink and Spark; OLAP tools and various ETL jobs; task schedulers that orchestrate the ETLs; plus a log and monitoring system supporting the tool engines, and auxiliary systems for project, user, and permission management;
  • Complex deployment: these components cooperate in complex ways, which makes deployment difficult. Deploying a complete production environment can involve many dependencies and a lot of configuration management. There are strong dependencies, such as the task engines' dependence on the underlying big data storage; weak dependencies, such as the task engines' dependence on the log and monitoring system; and even circular dependencies, for example message middleware may need its logs collected while log collection itself depends on the message middleware, and their configurations end up nested inside each other;
  • Environment coupling: for example, the task execution engines may need to embed big data storage configuration, and log collection may need to know each component's log directories and formats. A complicated deployment leads to coupling with the environment, because the daily maintenance of these complex configurations and dependencies gradually builds a deep coupling with that particular environment, making migration difficult.
With the rise of cloud-native concepts in recent years, we have been trying to make these tools cloud-native to solve the problems above.
 

Features of Cloud Native Scenarios

  • No perception of service state: users can use the functionality without paying attention to the running state behind it or caring about the underlying logic;
  • Extreme elastic scaling: once the running state is hidden from users, scaling can be pushed much further in cloud-native scenarios, and on-demand usage significantly reduces cost;
  • Fast failover: when a failure occurs, the extreme elastic scaling makes it possible to quickly take the faulty node offline and add a healthy one, achieving fast failover that is imperceptible and harmless to users.
These three characteristics reinforce one another, forming a virtuous circle.

Cloud Native Evolution Direction

The cloud-native transformation mentioned above proceeds along the following directions:
  • Component microservices: splitting the overall service into multiple small components by responsibility makes the architecture more cohesive and loosely coupled, reduces the complexity of environment changes, and facilitates large-scale collaborative development;
  • Application containerization: containers provide portability and ensure consistency across environments;
  • Immutable infrastructure: by packaging everything and isolating the underlying infrastructure, the infrastructure stays immutable, which brings consistency, reliability, and simplicity to deployment and keeps the state of the environment controllable;
  • Declarative API: with a declarative API, users only declare the state they want to reach and the back-end service does its best to satisfy it. Users do not need to be aware of the concrete process, the overall environment is more stable, functionality is easier to change and evolve, and the barrier to use is lower (a minimal sketch of this pattern follows below).
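To make the declarative pattern concrete, the sketch below shows the control-loop shape behind it: the caller only submits a desired state, and a controller repeatedly compares it with the observed state and converges toward it. This is a minimal, self-contained illustration; the type names and the observe/converge helpers are hypothetical placeholders, not any specific ByteDance component.

```go
package main

import (
	"fmt"
	"time"
)

// DesiredState is what the user declares; ObservedState is what actually runs.
// Both types are illustrative placeholders, not a real API.
type DesiredState struct{ Replicas int }
type ObservedState struct{ Replicas int }

// observe would normally query the cluster; here it is stubbed.
func observe() ObservedState { return ObservedState{Replicas: 1} }

// converge would normally create or delete instances to close the gap.
func converge(diff int) { fmt.Printf("adjusting by %+d replicas\n", diff) }

func main() {
	desired := DesiredState{Replicas: 3} // the user's declaration, nothing more

	// The control loop: compare desired vs. observed and act, repeatedly.
	for i := 0; i < 3; i++ {
		current := observe()
		if diff := desired.Replicas - current.Replicas; diff != 0 {
			converge(diff)
		}
		time.Sleep(time.Second)
	}
}
```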

Architecture Evolution

Introduction to Cloud Native Big Data

Cloud-native big data is built mainly on containers. The container layer here can be a public cloud's container service or a private cloud's container base, and the latter can be open-source K8s or a base derived from K8s. The whole cloud-native big data stack can be divided into three major platforms plus one major supporting system; the three platforms are the scheduling layer, the engine layer, and the platform layer. Directly above the containers is the resource scheduling layer, responsible for unified management and scheduling of the cluster's compute, storage, and network resources. Above the scheduling layer is the core engine layer, centered on a unified big data storage system developed by ByteDance that is compatible with HDFS semantics and can also connect to standard S3 object storage. On top of the storage layer sit Flink, Spark, and other self-developed or optimized compute engines, message middleware, log search and real-time analysis engines, and other tools. At the top is the platform service layer, which packages and integrates these engine capabilities into externally facing products.
The operation and maintenance management platform introduced here supports the three platforms above and provides the management functions for day-to-day component operation. To better fit the cloud-native transformation of the whole big data stack, we also made cloud-native improvements to the operation and maintenance management module itself.

Operation and Maintenance Practice on Cloud Native

In cloud-native scenarios, the operation and maintenance management module needs to meet the following requirements:
  • Low resource footprint: operation and maintenance management is not a core, user-facing product function, so its presence should be as low as possible and its share of resources small enough to be negligible even in small-scale scenarios;
  • Strong scalability: logging and monitoring volume grows with cluster size, so every function related to operation and maintenance management must be able to scale horizontally with the environment;
  • High stability: operation and maintenance management has high stability requirements; even when a failure occurs it must recover quickly, and it also needs to provide disaster recovery capabilities for other components;
  • Strong portability: a major goal of the operation and maintenance management module is to support rapid migration of the entire cloud-native big data product, so it must not be coupled to the environment, and all related functions need a pluggable design that can flexibly deliver a complete set of operation and maintenance capabilities in different environments;
  • Weak environment awareness: shield the upper-layer business from the operational differences caused by environmental differences, so that the upper layer can use the operation and maintenance functions in a unified way in every environment.
To meet these requirements, we identified the following focus areas. For environment management, we need to abstract a unified environment model that adapts to different deployments. We also need a flexible and convenient component service to manage component metadata, dependencies, configuration, and other information in a unified way. Finally, we need the ability to abstract common functions such as logging, monitoring, and alerting so that environmental differences are hidden from the upper-layer business.
 

Environment Management and Component Services

Environment Management

The entire environment can be divided by function into three logical areas: the control plane, the system plane, and the data plane. Note that these are logical areas, not physically isolated environments; in some scenarios the control plane can be merged with the system plane, and in small-scale scenarios all three planes can even be merged into a single physical cluster.
  • Control plane: carries only light business load and is globally unique; it is responsible for environment control, cost accounting, and the service gateway;
  • System plane: unique within each logical unit, but there can be multiple logical units in the whole system. For example, in same-city or remote multi-active scenarios, each logical unit may correspond to a machine room or region, and coordination among multiple logical units relies on the control plane;
  • Data plane: provides the compute, storage, network, and other resources that engines need to run; under the unified coordination of the system plane, multiple data planes can form a logical federated cluster.
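As a purely illustrative way to pin down the cardinality rules above (one global control plane, one system plane per logical unit, several data planes per unit), a hypothetical environment model might look like the sketch below; this is not the platform's actual schema.

```go
package main

import "fmt"

// DataPlane supplies compute, storage, and network resources for engines.
type DataPlane struct{ Name string }

// LogicalUnit groups one system plane with the data planes it federates,
// e.g. one unit per machine room or region in a multi-active deployment.
type LogicalUnit struct {
	Name       string
	DataPlanes []DataPlane
}

// Environment has exactly one control plane coordinating many logical units.
type Environment struct {
	ControlPlane string
	Units        []LogicalUnit
}

func main() {
	env := Environment{
		ControlPlane: "global-control",
		Units: []LogicalUnit{
			{Name: "room-a", DataPlanes: []DataPlane{{"dp-a1"}, {"dp-a2"}}},
			{Name: "room-b", DataPlanes: []DataPlane{{"dp-b1"}}},
		},
	}
	fmt.Printf("%d logical units under %s\n", len(env.Units), env.ControlPlane)
}
```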
 

Component Services

Components are layered according to the environment areas above, mainly into system level, cluster level, tenant level, and project level. The system level carries most of the business management and control logic; the cluster level mainly contains the agents that collect log and monitoring data, together with the internally developed schedulers and operators; the tenant level supports components dedicated to specific large customers; and the lowest, project level holds user job instances, middleware instances, and other third-party tools. This division turns the whole deployment into a grid: each component only needs to care about the grid cell it lives in, and the coupling between components and environment information is well shielded.
 

Component Services: Helm Customization Improvements

K8s supports individual resources very well and offers rich operations in specific areas, but even a simple service requires multiple resources to cooperate. For example, a Deployment that carries the business logic needs a ConfigMap to store its configuration and a Service as an entry point to expose it externally, yet K8s itself does not provide a good tool for orchestrating these resources together. In the open-source world, most components ship a Helm Chart for deployment on K8s, so to integrate better with this ecosystem we built our own component service on top of Helm.
Since the open-source Helm command-line tool is not well suited to API calls between components in cloud-native scenarios, we made deep service-oriented customizations to open-source Helm: the common deploy, uninstall, upgrade, and rollback operations are exposed through APIs, and a visual interface was added. It also supports simulated deployments, allowing users to deploy, verify, and debug quickly, so that upper-layer business components can focus on their own business domain.
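As a rough illustration of what "service-oriented Helm" can look like, the sketch below wraps a chart installation from Helm's v3 Go SDK behind an HTTP handler. The endpoint path, request fields, and chart location are assumptions made for this example; the platform's real API is not described in the article.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart/loader"
	"helm.sh/helm/v3/pkg/cli"
)

// installRequest is a hypothetical request body for a deployment API.
type installRequest struct {
	Release   string                 `json:"release"`
	Namespace string                 `json:"namespace"`
	ChartPath string                 `json:"chartPath"`
	Values    map[string]interface{} `json:"values"`
}

func installHandler(w http.ResponseWriter, r *http.Request) {
	var req installRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Initialize a Helm action configuration against the target namespace.
	settings := cli.New()
	cfg := new(action.Configuration)
	if err := cfg.Init(settings.RESTClientGetter(), req.Namespace,
		os.Getenv("HELM_DRIVER"), log.Printf); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Load the chart and run the install, much as the helm CLI would.
	chrt, err := loader.Load(req.ChartPath)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	install := action.NewInstall(cfg)
	install.ReleaseName = req.Release
	install.Namespace = req.Namespace
	rel, err := install.Run(chrt, req.Values)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	json.NewEncoder(w).Encode(map[string]string{
		"release": rel.Name,
		"status":  string(rel.Info.Status),
	})
}

func main() {
	http.HandleFunc("/api/v1/releases", installHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```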
 

Disk Management

Native K8s supports stateless workloads very well, but its support for stateful workloads, mainly their use of local disks, is less satisfactory. We analyzed and summarized the following pain points:
  • Environment coupling: to use a local disk, K8s must know the disk's mount point, type, and size in advance, which creates a degree of coupling;
  • Low utilization: the lack of a global, unified storage scheduling and management component means components cannot be co-located efficiently, so overall disk utilization is low;
  • Poor isolation: if disks are allocated as whole disks, utilization is low; if they are not, there is little isolation between components;
  • Difficult maintenance: components' disk requirements change dynamically, so capacity adjustments are frequent, but coordinating the operations of every party in advance leads to long lead times, long chains of steps, and high maintenance cost.

Unified Scheduling

To address this, we developed a unified CSI (Container Storage Interface) driver that not only collects all the disk information of the cluster in a unified way but also manages it centrally. On this basis, we divide disk usage into three categories: shared capacity volumes, shared disk volumes, and exclusive disk volumes.
A shared capacity volume shares capacity. These scenarios are not IO-sensitive and do not need strict capacity limits, but they demand high flexibility; typical examples are the temporary data and logs of big data jobs.
A shared disk volume is not very IO-sensitive either, but it has some requirements for isolation and persistence: the data should be recoverable after a failure, although losing it would not be catastrophic. The most typical scenario is caching.
An exclusive disk volume requires a high degree of IO isolation; typical scenarios include message middleware such as Kafka, and HDFS.

Disk Management Overview

Disk management is divided into two large areas. The first area is maintained by K8s itself, such as the commonly used EmptyDir, and is recommended only for configuration data or small amounts of temporary data.
The remaining area is managed uniformly through the CSI described above and is subdivided into three parts corresponding to the three volume types. A shared capacity volume is backed simply by a local path. For shared disk volumes, all the disks are first assembled into a Volume Group, and a logical volume is created when a business component applies for one, which achieves isolation.
An exclusive disk volume owns an entire disk. These volume types are then abstracted into a series of StorageClasses through the unified CSI, and upper-layer business components apply for the storage volumes they need. For public cloud disks or scenarios with centralized storage, we still recommend providing the various volumes to the business through this CSI, which enables capacity control and also decouples disk information from the components.
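For the shared disk volume path, the provisioning step described above essentially amounts to carving a logical volume out of a shared volume group when a component asks for space. The sketch below shows that step using the standard LVM command line invoked from Go; the volume group name, size, and filesystem are placeholders, and in the real system this logic would live inside the unified CSI driver's provisioning flow rather than a standalone program.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// createSharedDiskVolume carves an isolated logical volume for one component
// out of a shared LVM volume group, then formats it. "vg_data" and the size
// are placeholders for this sketch.
func createSharedDiskVolume(name, size string) error {
	// Equivalent to: lvcreate -L <size> -n <name> vg_data
	if out, err := exec.Command("lvcreate", "-L", size, "-n", name, "vg_data").CombinedOutput(); err != nil {
		return fmt.Errorf("lvcreate failed: %v: %s", err, out)
	}
	device := "/dev/vg_data/" + name
	// Format the new logical volume so it can be mounted into the Pod.
	if out, err := exec.Command("mkfs.ext4", device).CombinedOutput(); err != nil {
		return fmt.Errorf("mkfs failed: %v: %s", err, out)
	}
	log.Printf("provisioned %s (%s)", device, size)
	return nil
}

func main() {
	if err := createSharedDiskVolume("cache-vol-demo", "20G"); err != nil {
		log.Fatal(err)
	}
}
```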
 

Unified Logging, Monitoring, and Alerting

Logs

Logs are another factor that makes portability difficult. We therefore implemented unified management of the log collection link to achieve business isolation, efficient collection, fair distribution, and safety and reliability.
Two collection methods are currently supported. The first is intrusive collection, which provides various Collectors and mainly supports Java and Python. Because this method is invasive and most components are used to file-based logging, we also support file-based collection through Filebeat. After collection, logs are aggregated to the cluster's log agent for traffic control and then to a unified centralized storage, and log search is served through a unified API. Targeted APIs are also provided for customized engines so that users can choose the right API for their scenario.
The second method is Filebeat collection, which in the container scenario is file-based. It differs from collection on physical machines mainly in perspective: inside a container, the log path is not the same as the actual path on the host. To solve this, a component first declares its own collection rules through a customized log rule CRD and deploys them together with the service. As components are created and updated, Filebeat's discovery mechanism dynamically detects the creation, change, or deletion of log rule CRDs and, through Filebeat's hot-reload mechanism, generates and loads the corresponding collection rules.
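Assuming Filebeat's standard config-reload mechanism (filebeat.config.inputs with reload.enabled), the sketch below illustrates the final step of that flow: turning a log-rule object into an input file that Filebeat hot-loads without a restart. The LogRule fields, file paths, and the idea of driving this from CRD watch events are illustrative assumptions, not the internal CRD schema.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// LogRule mirrors the fields a hypothetical log-rule CRD might carry;
// the real CRD schema is internal and not shown in the article.
type LogRule struct {
	Name     string
	PathGlob string // log file path inside the container volume
	Project  string // used for tenant/project isolation downstream
}

// writeFilebeatInput renders one Filebeat input and drops it into the
// directory that filebeat.config.inputs watches with reload.enabled: true,
// so Filebeat picks it up without restarting.
func writeFilebeatInput(dir string, r LogRule) error {
	content := fmt.Sprintf(
		"- type: log\n  paths:\n    - %s\n  fields:\n    project: %s\n",
		r.PathGlob, r.Project)
	return os.WriteFile(filepath.Join(dir, r.Name+".yml"), []byte(content), 0o644)
}

func main() {
	// In the real system this would be driven by CRD watch events
	// (create/update/delete); here a single rule is written once.
	rule := LogRule{Name: "spark-driver", PathGlob: "/var/log/app/*.log", Project: "demo"}
	if err := writeFilebeatInput("/etc/filebeat/inputs.d", rule); err != nil {
		panic(err)
	}
}
```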
As for Filebeat's deployment form: if the cluster's node information is visible and we have the necessary permissions, Filebeat is deployed as a DaemonSet, which keeps the resource overhead low and lets the whole cluster share one set of Filebeat for collection. In public cloud container service scenarios, where node information is not visible and there is no permission to deploy a DaemonSet, a sidecar is injected into the specific Pods to collect logs. In this way, file-based log collection in containers is unified.

Log Data Link

In cloud-native scenarios, log collection is far more than a unified collection link: logs must be collected efficiently with as little resource consumption as possible. Because cloud-native scenarios are naturally multi-tenant, traffic varies greatly between tenants and components, and abnormal traffic from a single tenant must not disturb log collection as a whole. Therefore, the log agent inside each cluster controls traffic per tenant and applies rate limiting or circuit breaking when abnormally large traffic is detected. We also need to guarantee fair distribution in multi-tenant scenarios, failover of log collection, and correct behavior across Pod rebuilds and rolling upgrades; these are the main directions of future investment.
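One common way to implement the per-tenant flow control described above is a token bucket per tenant. The sketch below uses golang.org/x/time/rate with illustrative limits; it is a stand-in for the idea rather than the platform's actual implementation.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// tenantLimiter keeps one token-bucket limiter per tenant so that a single
// tenant's log burst cannot starve the shared collection link. The limits
// here are illustrative values, not the platform's actual quotas.
type tenantLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func newTenantLimiter() *tenantLimiter {
	return &tenantLimiter{limiters: map[string]*rate.Limiter{}}
}

func (t *tenantLimiter) allow(tenant string) bool {
	t.mu.Lock()
	lim, ok := t.limiters[tenant]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(1000), 2000) // 1000 events/s, burst 2000
		t.limiters[tenant] = lim
	}
	t.mu.Unlock()
	return lim.Allow()
}

func main() {
	tl := newTenantLimiter()
	dropped := 0
	for i := 0; i < 5000; i++ {
		if !tl.allow("tenant-a") { // events beyond the budget are dropped or buffered
			dropped++
		}
	}
	fmt.Println("dropped:", dropped)
}
```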
 

Alarms

The overall alarm system is built on top of the open-source Nightingale. In outline, Nightingale consists of Prometheus for storing metric data, a database for storing alarm business data, and two core components, WebApi and Server. WebApi handles user interaction, such as creating, deleting, modifying, and querying rules and executing metric queries; Server is responsible for loading rules, generating alarm events, sending alarm notifications, and so on. In open-source Nightingale, Server also takes on the role of a Pushgateway for Prometheus. ByteDance's products have their own user system and monitoring system, so our alarm customizations are concentrated in WebApi and Server.

Process Overview

Users first create their alarm rules through WebApi and persist them in the database. Server then loads the rules into memory, decides which rules each instance handles (via consistent hashing), converts them into metric queries, and determines whether an alarm event has occurred. When an alarm event occurs, the corresponding module is called to send the alarm notification, and the event is written back to the database. Our main optimizations are the following:
  • First, our product line has a unified user system shared with the operation and maintenance management platform; on top of it we added user groups and on-call duty lists, which better match usage habits in the alerting domain;
  • Second, open-source Nightingale loads alarm rules in full, which carries a potential performance risk; we changed the full loading into incremental loading to eliminate it;
  • Third, the alarm notification module is strongly coupled with the environment, because notification channels vary greatly between environments, such as DingTalk, WeCom (Enterprise WeChat), Feishu, and SMS; even for the same SMS alarm, different environments may use different SMS providers with different interfaces. We therefore provide dynamic message templates.

Message Templates

  • Dynamic message templates can reference information from the alarm event, assembling alarm messages with rich context, which makes the alarm system more flexible and the experience better.
  • The notification method is designed as a plug-in: users only need to develop different sending plug-ins for different environments, which also keeps the core process consistent (see the sketch below).
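To make these two ideas concrete, here is a small sketch that combines a dynamic message template (using Go's text/template) with a pluggable Sender interface. The event fields, template text, and console plug-in are illustrative assumptions, not Nightingale's actual schema.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Sender is the plug-in point: one implementation per environment
// (SMS vendor, Feishu, DingTalk, ...), swapped without touching core logic.
type Sender interface {
	Send(target, message string) error
}

// consoleSender is a stand-in plug-in used only for this sketch.
type consoleSender struct{}

func (consoleSender) Send(target, message string) error {
	fmt.Printf("to %s: %s\n", target, message)
	return nil
}

// AlertEvent carries the context a dynamic template can reference.
// The field names are illustrative, not Nightingale's actual schema.
type AlertEvent struct {
	Rule     string
	Cluster  string
	Severity string
	Value    float64
}

func main() {
	// A dynamic message template referencing fields of the alarm event.
	tmpl := template.Must(template.New("alert").Parse(
		"[{{.Severity}}] rule {{.Rule}} fired on {{.Cluster}}, value={{.Value}}"))

	var buf bytes.Buffer
	event := AlertEvent{Rule: "disk_usage_high", Cluster: "dp-a1", Severity: "P1", Value: 0.93}
	if err := tmpl.Execute(&buf, event); err != nil {
		panic(err)
	}

	var s Sender = consoleSender{} // pick a different plug-in per environment
	_ = s.Send("oncall-group", buf.String())
}
```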

Notification Module

In the notification module, the Server generates an alarm event, renders the message template described above into a concrete alarm message, and hands it to the notification module, which creates notification records based on the notification method and recipients and puts them into a queue. To adapt to various environments, this queue can be a real message queue or one simulated on top of a database. Several Workers then consume the records concurrently and call the different sending plug-ins to deliver the messages. Besides the Workers, background threads regularly poll and inspect the overall sending status, retry failed messages, and raise an operation and maintenance alarm when the retry count exceeds a threshold.
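The sketch below illustrates the worker-and-retry shape of this flow, with an in-memory channel standing in for the queue. In the real system the queue is persistent and retries are driven by separate polling threads, so this is only a simplified illustration under those assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// notification is one pending send; Attempts counts delivery tries.
type notification struct {
	Target   string
	Message  string
	Attempts int
}

const maxRetries = 3

func main() {
	// An in-memory channel stands in for the real (or DB-simulated) queue.
	queue := make(chan notification, 100)
	var wg sync.WaitGroup

	// send is a stub for calling an environment-specific sending plug-in;
	// it fails on the first attempt so the retry path is exercised.
	send := func(n notification) error {
		if n.Attempts == 0 {
			return fmt.Errorf("transient send failure")
		}
		fmt.Printf("sent to %s: %s\n", n.Target, n.Message)
		return nil
	}

	// Several workers consume notifications concurrently and retry inline.
	for w := 0; w < 3; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range queue {
				var err error
				for n.Attempts = 0; n.Attempts < maxRetries; n.Attempts++ {
					if err = send(n); err == nil {
						break
					}
				}
				if err != nil {
					// Give up and raise an operation and maintenance alarm.
					fmt.Println("escalate: ops alarm for", n.Target)
				}
			}
		}()
	}

	queue <- notification{Target: "oncall", Message: "disk_usage_high fired"}
	close(queue) // the real queue stays open; closed here so the demo exits
	wg.Wait()
}
```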
Another major feature of the overall alarm system is dynamic thresholds. In use, events occur, alarms fire, and manual feedback flows into training and analysis, forming a supervised learning loop that continuously adjusts how dynamic thresholds are generated.

Monitoring

In the overall monitoring architecture, the top is the data plane described earlier. Each cluster in the data plane has a Prometheus that collects and aggregates the monitoring data of all components in that physical cluster. Here Prometheus is essentially an agent: it does not take on any storage or query responsibilities, and it remote-writes the collected monitoring data to the monitoring system on the system plane.
The monitoring system on the system plane stores all monitoring data. To make data storage easy to scale horizontally, a storage abstraction layer was added; the actual implementation behind it can be an existing public cloud monitoring service, S3 object storage, our self-developed big data storage service, or, in some private cloud scenarios, a user-defined storage service. On top of this unified storage layer, a Query service handles all monitoring data queries and supports dashboards and front-end interaction.
The query service also received targeted optimizations, based on two observations: monitoring data is append-only and never updated, and data from a recent time window is queried repeatedly. First, horizontal splitting was introduced: a query with a large time span is split into multiple small-span sub-queries that run concurrently and are then aggregated, which speeds up queries. Second, because monitoring data is immutable, a cache was introduced for part of the queried data, which also benefits predictable query scenarios. These two optimizations reinforce each other and improve overall query efficiency.
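As an illustration of the horizontal splitting idea, the sketch below splits a 24-hour range query into concurrent four-hour sub-queries against a Prometheus-compatible query_range endpoint. The service address, metric, step, and split factor are assumptions for the example, and the merge of partial results is omitted.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"sync"
	"time"
)

// queryRange issues one Prometheus range query for [start, end).
// The base address stands in for the platform's unified Query service.
func queryRange(base, promql string, start, end time.Time, step time.Duration) (string, error) {
	q := url.Values{}
	q.Set("query", promql)
	q.Set("start", fmt.Sprintf("%d", start.Unix()))
	q.Set("end", fmt.Sprintf("%d", end.Unix()))
	q.Set("step", fmt.Sprintf("%d", int(step.Seconds())))
	resp, err := http.Get(base + "/api/v1/query_range?" + q.Encode())
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	const base = "http://query-service:9090" // hypothetical address
	promql := `sum(rate(container_cpu_usage_seconds_total[5m]))`

	// Split a 24h query into 6 four-hour sub-queries executed concurrently;
	// the caller would then merge the partial results (merge step omitted).
	end := time.Now()
	start := end.Add(-24 * time.Hour)
	const parts = 6
	span := end.Sub(start) / parts

	var wg sync.WaitGroup
	results := make([]string, parts)
	for i := 0; i < parts; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s := start.Add(time.Duration(i) * span)
			if r, err := queryRange(base, promql, s, s.Add(span), time.Minute); err == nil {
				results[i] = r
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("collected", len(results), "partial results")
}
```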
The core advantages of the optimized monitoring system are: it supports one-click metric collection in various environments; for performance it supports data pre-aggregation, down-sampling, and similar capabilities, enriching the overall feature set; it connects to big data storage and thus gains a degree of storage-compute separation, which greatly improves the system's horizontal scalability; and it is deeply integrated with other operation and maintenance tools such as logs, alarms, and tracing, improving the overall product experience.
 

Panorama of Volcano Engine Cloud Native Computing Products

 
 