Apache Kyuubi & Celeborn (Incubating) Help Spark Embrace Cloud Native

This article is based on a talk given by Cheng Pan, software engineer at NetEase Shufan, at ASF CommunityOverCode Asia 2023 (Beijing). It covers four topics: 1) the benefits and challenges of cloud-native Spark; 2) how to build a unified Spark task gateway on Apache Kyuubi; 3) how to build a Shuffle Service on Apache Celeborn (Incubating); 4) NetEase's improvements to Spark on Kubernetes in other areas.

Over the past few years, NetEase has explored cloud-native big data extensively. This article focuses on how to build an enterprise-grade Spark on Kubernetes offline computing platform on open-source technologies such as Apache Kyuubi and Celeborn, covering technology selection, architecture design, lessons learned, defect fixes, and cost reduction and efficiency gains.

01 Benefits and Challenges of Spark on Kubernetes

As the de facto standard for offline big data computing, Apache Spark is widely used in NetEase's internal systems and commercial products such as its data platform. Spark on YARN is currently the most mainstream and mature deployment mode in the industry, but with the rise of cloud-native technology represented by Kubernetes, Spark on K8s is gaining favor with more and more users.

NetEase has been exploring Spark on K8s since 2018. Compared with Spark on YARN, Spark on K8s has significant advantages in many respects; at the same time, as a relatively new technology, it is less complete in others. Let's briefly compare some of the more critical aspects.

In terms of isolation, thanks to container technology, Spark on K8s has clear advantages over YARN's process-level job isolation. Containerization greatly simplifies dependency management for Spark jobs, especially the isolation of Python dependencies and dynamic link libraries; combined with the cgroup mechanism, it also enforces stricter, finer-grained resource limits on jobs.

In terms of cluster-level resource management, applications rarely use 100% of the resources they request, so oversubscription is a common strategy to improve cluster resource utilization. Taking CPU as an example, YARN can set the ratio of vCores to physical cores (the CPU oversubscription ratio) only cluster-wide, whereas K8s supports per-job CPU oversubscription ratios. Tasks vary widely in CPU utilization; for the many IO-heavy jobs typified by data transfer, a higher CPU oversubscription ratio can save substantial CPU resources.
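As a sketch of per-job oversubscription on K8s (the API server address, image, and resource values below are all illustrative), Spark lets the executor's K8s CPU request differ from its number of task slots:

```shell
# Hypothetical values: each executor runs 4 concurrent tasks but only
# requests 1 CPU core from K8s, an effective 4x oversubscription ratio
# suitable for IO-heavy jobs with low CPU utilization.
spark-submit \
  --master k8s://https://kubernetes.example.com:6443 \
  --deploy-mode cluster \
  --conf spark.executor.cores=4 \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.kubernetes.container.image=example.com/spark:3.4.1 \
  ...
```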

Dynamic resource allocation is a very important feature for improving the resource utilization of Spark jobs. In Spark on YARN, the External Shuffle Service resides in each NodeManager process as a plugin and serves shuffle data for its node, so an Executor can exit at any time without worrying about how downstream reduce tasks will read its shuffle data. On K8s there is no such built-in component; instead there are several alternative solutions, discussed in detail later.

Spark on YARN also provides many auxiliary functions: YARN natively has the concept of an Application, offers log aggregation, supports proxying the Spark live UI, and so on. None of these is available out of the box in Spark on K8s.

In terms of deployment, Spark on YARN offers one standardized approach, whereas Spark on K8s can be set up in many different ways: the shuffle solutions mentioned above, for instance, or task submission, where options range from YAML submission via the Spark Operator to Spark's native spark-submit.

At the same time, we face a very common challenge: users run different Kubernetes infrastructures. How can we exploit the characteristics of each, and maximize the benefits, while supporting all of them?

For example, infrastructures represented by public cloud and on-premises deployment differ significantly. In the spirit of reducing cost and increasing efficiency, consider their cost profiles:

  • Besides reserved (subscription) pricing, public clouds offer a pay-as-you-go mode. Depending on the resource type, when overall usage is below roughly 30% to 60% of the time, pay-as-you-go can significantly reduce costs. Public cloud spot instances are very competitively priced, but come with uncertainty and the risk of being preempted at any time;
  • On-premises hardware naturally lacks the elasticity of public cloud and generally must be purchased in advance. To maximize resource utilization, we often turn to co-locating online and offline workloads: online services usually peak during the day and offline tasks at night, so hybrid deployment and time-shared resources raise cluster utilization and lower overall cost.

Storage deserves particular attention in Spark on K8s. Public clouds generally offer network disks of various specifications to meet all kinds of remote-mount requirements, while on-premises deployments are often much more constrained, mostly using local disks bound to physical nodes; correspondingly, for the same IO performance, local disks tend to be cheaper per unit of capacity.

Other hardware, such as network cards, CPUs, and memory, is similar: public clouds can flexibly provide various ratios, while on-premises deployments are mostly limited to specific specifications and models, but often at a lower unit price.

02 How to build a unified Spark task gateway based on Apache Kyuubi

Within NetEase, all Spark services are hosted, and we use Apache Kyuubi as the unified Spark task submission gateway. Kyuubi provides multiple user interfaces and supports various types of Spark tasks. Typical usage scenarios include connecting with JDBC/BeeLine or various BI tools for interactive data analysis, and submitting SQL/Python/Scala/Jar batch jobs to Kyuubi through the RESTful API.
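As a sketch of the batch path (the host, port, jar path, and class name are all illustrative; check the Kyuubi REST API reference for the exact payload accepted by your version), a Jar batch job can be submitted with a single HTTP call:

```shell
# Submit a Spark jar batch job via Kyuubi's REST API.
# kyuubi.example.com:10099 and the jar location are hypothetical.
curl -X POST http://kyuubi.example.com:10099/api/v1/batches \
  -H 'Content-Type: application/json' \
  -d '{
        "batchType": "SPARK",
        "name": "sample-batch",
        "resource": "hdfs://warehouse/jars/app.jar",
        "className": "org.example.SparkApp",
        "conf": {"spark.executor.memory": "4g"}
      }'
```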

As an enterprise-grade big data gateway, Kyuubi also fully supports multi-tenancy and security. For example, Kyuubi has deeply adapted its Kerberos support: it simplifies how JDBC clients use Kerberos authentication; it supports Kerberos and LDAP at the same time, letting clients choose either method; it supports the Hadoop proxy-user mechanism, which preserves security while avoiding the management of a large number of user keytabs; and it supports Hadoop Delegation Token renewal, meeting the authentication needs of long-running Spark applications.

Kyuubi has many community users and contributors from financial institutions and European and American companies with more stringent security requirements, such as encrypting internal communication between service components and supporting permission control and SQL auditing. Kyuubi is well suited to such scenarios too.

Beyond its main role as a gateway, Kyuubi also provides a series of independently usable Spark plugins offering enterprise-grade functions such as small-file management, Z-Order, SQL lineage extraction, and limits on the amount of data a query may scan.

Architecturally, Kyuubi's two key components are the Server and the Engine. The Server is a lightweight resident service, while the Engine is a Spark application started and stopped on demand. When a client connects, the Kyuubi Server looks for a suitable Engine according to its routing rules; if none matches, it calls spark-submit to launch a new Spark application, which actively exits to release resources after being idle for a period of time. Kyuubi connects to Kubernetes through Spark's native support rather than the Spark Operator, so the same spark-submit path works consistently across resource management systems such as YARN, Mesos, and Standalone. This design suits enterprises that already have big data infrastructure and want a smooth transition to a cloud-native architecture.
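Conceptually, the Engine bootstrap resembles an ordinary native-K8s submission like the following sketch (the API server address, namespace, service account, image, and jar path are illustrative; the actual command Kyuubi assembles carries considerably more configuration):

```shell
# Roughly what Kyuubi's spark-submit to K8s looks like; all values are examples.
spark-submit \
  --master k8s://https://kubernetes.example.com:6443 \
  --deploy-mode cluster \
  --class org.apache.kyuubi.engine.spark.SparkSQLEngine \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=example.com/spark:3.4.1 \
  local:///opt/kyuubi/externals/engines/spark/kyuubi-spark-sql-engine.jar
```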

For interactive sessions, Kyuubi creatively introduced the concept of engine share levels, with four built-in options: CONNECTION, USER, GROUP, and SERVER, in order of decreasing isolation and increasing sharing. They can be combined to serve various workloads. For example, the CONNECTION share level launches a separate Spark application for each session, which effectively guarantees isolation between sessions and is typically used for large-scale ETL scheduling tasks. The USER share level lets the same user reuse one Spark application, which both speeds up the startup of new sessions and ensures that other users are unaffected if that Spark application exits unexpectedly (such as an OOM caused by a large result set query). Batch tasks only support semantics similar to the CONNECTION share level; in that mode, Kyuubi behaves more like a task scheduling system.
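The share level can be chosen per connection. A sketch with BeeLine (the endpoint and database are illustrative; as we understand Kyuubi's JDBC URL format, settings placed after the `#` are passed as session configuration):

```shell
# Open a session whose engine is dedicated to this single connection.
# kyuubi.example.com:10009 is a hypothetical endpoint.
beeline -u 'jdbc:hive2://kyuubi.example.com:10009/default;#kyuubi.engine.share.level=CONNECTION'
```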

Kyuubi Server is designed as a lightweight, highly stable gateway. By contrast, the stability of a Kyuubi Engine is somewhat lower; for example, a query returning a huge result set can easily cause an OOM. At the same time, the engine share level design confines the blast radius when an Engine crashes.

In the internal implementation, Kyuubi's interactive sessions involve two important concepts, Session and Operation, which correspond to Connection and Statement in JDBC, and to SparkSession and QueryExecution in Spark, respectively.
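A minimal sketch of such client code (the endpoint, credentials, and query are illustrative, and the Kyuubi Hive JDBC driver must be on the classpath) makes the mapping concrete: opening a JDBC Connection creates a Kyuubi Session backed by a SparkSession, and each Statement execution becomes a Kyuubi Operation backed by a QueryExecution:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class KyuubiJdbcExample {
    public static void main(String[] args) throws Exception {
        // Connection <-> Kyuubi Session <-> SparkSession on the engine side.
        // The endpoint below is hypothetical.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://kyuubi.example.com:10009/default", "user", "");
             // Statement <-> Kyuubi Operation <-> QueryExecution in Spark.
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM t LIMIT 10")) {
            // Fetching pulls rows from the Spark driver via Kyuubi Server
            // in micro-batches rather than all at once.
            while (rs.next()) {
                System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```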

The above is typical code for connecting to Kyuubi through the JDBC driver to execute Spark SQL, and the correspondence between client-side JDBC calls and the Spark engine side is clear. In particular, when the result set is pulled, it is returned from the Spark driver to the client via the Kyuubi Server in micro-batches, which effectively reduces memory pressure on the Kyuubi Server and protects its stability. In version 1.7, Kyuubi added an Apache Arrow-based result set serialization, which greatly improves transmission efficiency for large result sets.

03 How to build Shuffle Service based on Apache Celeborn (Incubating)

As mentioned above, the shuffle solution plays a vital role in dynamic resource allocation for Spark on K8s: an Executor can only be released once downstream reads of its shuffle data are guaranteed to be unaffected. In recent years, the community and major companies have produced a steady stream of shuffle solutions; here is a brief introduction to several of the more mainstream ones.

The first is Shuffle Tracking combined with decommissioning, a lightweight solution built into Spark that requires no additional service to maintain. Shuffle Tracking analyzes the RDD lineage to determine which shuffle data may still be consumed downstream, and keeps the corresponding Executors alive so they can continue serving shuffle reads. Obviously, delayed exit wastes some resources, and this cannot handle Executor OOMs. Decommissioning is a complementary mechanism: when an Executor has been idle for a period of time, its shuffle data is migrated to Executors that have not timed out before it exits. In our practice, this solution does not perform well when data volume is large and cluster load is high.
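A sketch of the relevant configuration for this approach, enabling dynamic allocation with shuffle tracking and shuffle-block migration on decommission (availability and defaults of these flags depend on the Spark version):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  ...
```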

Another natural idea is to reproduce the YARN solution on K8s, that is, to start an External Shuffle Service process on each K8s node through a DaemonSet to serve shuffle reads. This solution matches Spark on YARN in performance and reliability, and NetEase ran it at some scale in the early days. But it has drawbacks: for example, it does not work with spot instances (which only provide Pods and do not allow DaemonSets to be started on their nodes), and it requires the host network.

In the past year or two, the Remote Shuffle Service approach has been increasingly favored. As network card technology advances, the gap between network IO and disk IO has gradually narrowed; in theory, turning Spark's native local-disk shuffle reads and writes into network reads and writes does not necessarily hurt performance. Most importantly, deploying the shuffle service separately and scaling it on demand fits the cloud-native philosophy better. It also opens more room for operation, such as improving utilization by balancing storage across nodes, and using tiered storage to guarantee performance while reducing the demand for high-performance disk capacity. However, we should be clear: shuffle has been continuously refined as a core feature since the birth of the Spark project, and as a relatively new technology, Remote Shuffle Service still has a long way to go in stability, correctness, and performance.

For the specific Remote Shuffle Service technology, NetEase chose to build its internal shuffle service platform on Apache Celeborn (Incubating). The core features we care about include:

  • The Celeborn server side includes two roles, Master and Worker. The Master plays a coordinating role; it is a Raft cluster with good disaster recovery that supports rolling upgrades. Workers act as data nodes serving shuffle reads and writes, and can be scaled out and in at any time according to load. Heartbeats and health checks between components quickly discover and evict faulty Worker nodes;
  • Celeborn provides an asynchronous, efficient replication mechanism with little performance impact when enabled: the client only needs to write successfully to the primary Worker before returning, and the primary Worker asynchronously replicates the shuffle data to the backup Worker;
  • Shuffle partitions can be allocated intelligently according to Worker load, keeping the cluster balanced. This matters greatly when Workers run on heterogeneous nodes, for example some with SSDs and others with HDDs, or when disks differ in performance due to mixing old and new servers and hardware aging;
  • Tiered storage is supported, and for distributed storage the client can read data directly from the storage system, reducing pressure on the Workers.
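On the Spark side, wiring a job to a Celeborn cluster is mostly configuration. A sketch follows (the endpoint is illustrative; the shuffle manager class name is the one we understand Celeborn 0.3.x to use, and earlier releases used a different one, so check the client docs for your version):

```shell
spark-submit \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
  --conf spark.celeborn.master.endpoints=celeborn-master-0.example.com:9097 \
  ...
```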

To summarize the evolution of Spark on Kubernetes at NetEase:

Early plan:

1. Only SQL tasks submitted through JDBC and BeeLine are supported

2. The Kyuubi cluster is deployed on physical machine nodes outside the K8s cluster

3. Spark jobs run in Client mode

4. The External Shuffle Service is started as a DaemonSet on each node

5. Spark jobs, ESS, etc. all run in host network mode

6. An SSD is installed on each node and mounted into Pods via hostPath

Improved plan:

1. SQL/Jar tasks can be submitted through JDBC, BeeLine, and RESTful APIs

2. Kyuubi is deployed inside the K8s cluster as a StatefulSet

3. Kyuubi uses MySQL to store state data

4. Spark jobs run in Cluster mode

5. Celeborn is deployed inside the K8s cluster as a StatefulSet to serve as the Remote Shuffle Service

6. On the public cloud, spot-instance Pods provide computing resources for Spark jobs; in particular, their extremely low cost plays a vital role in reducing costs and increasing efficiency

04 NetEase's improvements to Spark on Kubernetes in other aspects

As mentioned earlier, Spark on Kubernetes does not natively provide a log aggregation service like YARN's, which is very unfriendly for Spark job analysis and troubleshooting.

We give Spark on Kubernetes a log-navigation experience similar to Spark on YARN through the following methods:

1. Use Grafana Loki to build a log storage and query service, and configure Grafana as a log display service

2. Use log4j-loki-appender to write Spark Application logs to the remote log service

3. In SPARK-40887, we improved Spark so that jump links to external log services can be added to the Spark UI via configuration; the links can be templated, for example using variables such as POD_NAME as query conditions in the jump link

Pod allocation strategy is another interesting topic; for example, the following two scenarios call for different allocation strategies.

In an on-premises scenario, for some network- and IO-heavy tasks, scheduling a large number of Executors onto the same node is likely to create a hotspot and a performance bottleneck on the hardware. In this case, we can use anti-affinity so that Executor Pods are spread across all nodes as much as possible during allocation.

In the online/offline co-location scenario, we prefer a bin-packing Pod allocation strategy that concentrates Executor Pods on as few nodes as possible, so that when nodes are handed back, machines can be vacated quickly with less impact on Spark tasks.
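For the first scenario, a sketch using a Spark executor pod template (the label names and file path are illustrative) that asks the scheduler to spread executors of one application across nodes:

```shell
# Write a pod template with soft anti-affinity between executors of this
# app, then point Spark at it. Labels and paths are examples.
cat > /tmp/executor-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: spark-exec
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: spark-exec
          topologyKey: kubernetes.io/hostname
EOF

spark-submit \
  --conf spark.kubernetes.executor.podTemplateFile=/tmp/executor-template.yaml \
  ...
```

The bin-packing strategy for the second scenario is typically a scheduler-side concern (for example, a scoring strategy that favors the most-allocated nodes) rather than something expressed in the pod template.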

Developers from NetEase and the Kyuubi community have also contributed many important improvements to Spark on K8s, which cannot be covered in detail here for reasons of space; you can find the corresponding pull requests in the community by their JIRA tickets. We are also very grateful to the Spark community's developers for their help with code review and more!

Live Q&A

Q: We have deployed Kyuubi on K8s to submit Spark tasks to K8s. Next, we plan to use Kyuubi to submit Spark and Flink tasks to YARN. In this scenario, do you recommend deploying a separate Kyuubi service for each workload, or sharing one Kyuubi service?

A: First, to be clear, a single Kyuubi instance or cluster can manage multiple Spark versions, use multiple compute engines, and submit tasks to different resource management systems. As mentioned above, Kyuubi Server is a lightweight and stable service, and in practice we recommend using a single Kyuubi Server cluster to manage multiple engines wherever possible, realizing a unified gateway. We recommend deploying independent Kyuubi clusters only when users have extremely high SLA requirements or must be physically isolated for security and compliance reasons.

Q: The talk mentioned that Celeborn supports rolling upgrades, but in our testing, tasks failed after restarting a Celeborn Worker node. What might be the problem?

A: Celeborn is designed to support rolling restarts. The Master is a Raft cluster and naturally supports rolling upgrades. Celeborn 0.3.0 added a graceful shutdown feature for Worker nodes to support rolling upgrades. Specifically, when a Worker receives the graceful shutdown signal: clients that are writing are notified that the Worker is shutting down, suspend writes to the current partition, and use the revive mechanism to request a new slot for subsequent data; after all write connections are closed, the Worker flushes its in-memory data and state to disk and then exits; clients that are reading automatically switch to the replica node; after the Worker restarts, it recovers state from disk and resumes serving data reads. In short, rolling upgrades of Workers require: version 0.3.0 or above, data replication enabled, and graceful shutdown enabled.
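The prerequisites above map to configuration roughly like the following sketch (the key names are the ones we understand Celeborn 0.3.x to use; verify them against the configuration reference for your release):

```shell
# Worker side: enable graceful shutdown (key name per our reading of 0.3.x docs).
echo 'celeborn.worker.graceful.shutdown.enabled true' \
  >> "$CELEBORN_HOME/conf/celeborn-defaults.conf"

# Spark client side: enable replication so in-flight reads can fail over
# to the backup Worker during a restart.
spark-submit --conf spark.celeborn.client.push.replicate.enabled=true ...
```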

Source: my.oschina.net/u/4565392/blog/10102788