An Incomplete Comparative Analysis of Spark on Kubernetes and Spark on YARN

Foreword

Apache Spark is one of the most widely used big data analysis and computing tools. It excels at batch and real-time stream processing, and powers machine learning, artificial intelligence, natural language processing, and data analytics applications. As Spark's popularity and usage grow, the Hadoop (MapReduce) technology stack in the narrow sense keeps shrinking. Moreover, both general opinion and practical experience show that, outside of big data workloads, Hadoop (YARN) lacks the flexibility to integrate with the broader enterprise technology stack, for example to host online services, which is exactly the field where Kubernetes (K8s) excels. In fact, the advent of Kubernetes has opened up a whole new world of possibilities for Spark. Running all online and offline jobs on a single unified cluster is also very attractive.

Spark on Kubernetes was first introduced in Spark 2.3 [1], and by the time the community marked it GA in Spark 3.1 [2], it was essentially ready for large-scale use in production environments.

In the industry, companies such as Apple [3], Microsoft [4], Google, NetEase, Huawei, Didi, and JD.com have all published classic success stories of large-scale internal adoption or external services built on it.

Spark on Kubernetes Application Architecture

From the perspective of Spark's overall computing framework, Kubernetes support merely adds one more scheduler at the resource management level; all other interfaces can be reused as-is. On the one hand, the introduction of Kubernetes, alongside Spark Standalone, YARN, Mesos, and Local, forms a richer resource management ecosystem.

On the other hand, while adding Kubernetes support, the Spark community has also maximized compatibility with existing user APIs, which greatly eases the migration of user workloads. For example, for a traditional Spark job, we can switch between the two scheduling platforms at runtime simply by setting the --master parameter to yarn or k8s://xxx. Other settings, such as the container image, queue, and Shuffle local disks, are isolated between YARN and K8s and can easily be maintained in configuration files.
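As a sketch of this configuration isolation, the snippet below keeps the cluster-specific settings in two separate property files; the property keys are standard Spark options, while the file names, image, queue, and paths are hypothetical:

```properties
# yarn.conf (hypothetical file): settings that only apply on YARN
spark.master                       yarn
spark.yarn.queue                   root.analytics
spark.local.dir                    /mnt/disk1/tmp,/mnt/disk2/tmp

# k8s.conf (hypothetical file): settings that only apply on Kubernetes
spark.master                       k8s://https://kube-apiserver:6443
spark.kubernetes.namespace         spark
spark.kubernetes.container.image   my-registry/spark:3.2.1
```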


Spark on Kubernetes vs Spark on YARN

Usability Analysis

Spark Native API

With spark-submit, the traditional way of submitting jobs, and thanks to the configuration isolation mentioned above, users can easily submit to either a K8s or a YARN cluster, and the two are essentially equally simple to use. This approach is very friendly to users who are already familiar with the Spark API and ecosystem, and places essentially no hard requirements on knowledge of the K8s technology stack.
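For illustration, the two submissions below run the same example job and differ only in --master and one isolated setting; the API server address, image, and jar path are hypothetical:

```bash
# Submit to YARN
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar

# Submit the same job to Kubernetes
spark-submit \
  --master k8s://https://kube-apiserver:6443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=my-registry/spark:3.2.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar
```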

It can be seen that, if we ignore the underlying details of K8s or YARN, it is basically the familiar recipe with the familiar taste.

Spark Operator

Beyond this approach, Kubernetes also offers a much richer API. We can create and manage Spark on K8s applications through the Spark Operator [6], e.g. with kubectl apply -f <YAML file path>. This approach is undoubtedly the most elegant for the Kubernetes cluster and its administrators, but for Spark users with no Kubernetes experience it carries a certain learning cost. Another advantage of this approach is that all Spark-related libraries can be shipped through a Docker registry, so no separate Spark client environment is needed to submit jobs. A separate client environment can easily drift out of sync with the Docker images, which raises operations and maintenance costs and plants hidden risks of unnecessary production problems.
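A minimal manifest for the Spark Operator might look like the sketch below, following the operator's v1beta2 SparkApplication CRD; the namespace, image, service account, and jar path are hypothetical:

```yaml
# spark-pi.yaml: a minimal SparkApplication managed by the Spark Operator
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark                      # hypothetical namespace
spec:
  type: Scala
  mode: cluster
  image: my-registry/spark:3.2.1        # hypothetical image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar
  sparkVersion: "3.2.1"
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark               # hypothetical service account
  executor:
    instances: 2
    cores: 2
    memory: 2g
```

Applying it with kubectl apply -f spark-pi.yaml hands submission, restarts, and cleanup over to the operator.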

Serverless SQL

Of course, whether through Spark's native submission or the Operator, both are still too low-level for most users, who inevitably have to perceive some underlying details. In the Data Lake / Lakehouse scenario, where data becomes democratized and data applications become diverse, this makes large-scale adoption difficult. For ease of use, consider using Apache Kyuubi (Incubating) [7] to build a serverless Spark/SQL service. In most cases, users can then manipulate data directly with BI tools or plain SQL.
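Kyuubi speaks the HiveServer2 Thrift/JDBC protocol (listening on port 10009 by default), so any standard SQL client works; the host name, user, and query below are hypothetical:

```bash
# Connect to a Kyuubi server with plain beeline; whether the engine runs on
# YARN or K8s is invisible to the user
beeline -u 'jdbc:hive2://kyuubi.example.com:10009/' -n my_user \
        -e 'SELECT count(*) FROM sales WHERE dt = current_date()'
```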

Generally speaking, most enterprises have many offline Hive or Spark tasks running on YARN clusters, and smoothly migrating this large body of historical tasks to Kubernetes is a headache. Kyuubi's service-oriented approach can provide load-balanced nodes through its service discovery mechanism and enable a smooth transition on top of a highly available service. For the few tasks that misbehave after migration, we can also easily roll back to the old cluster to guarantee execution, which buys us time and room to diagnose the problem.
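With Kyuubi's high-availability mode, clients address the service through ZooKeeper discovery rather than a fixed host, so engines backed by YARN or K8s can be swapped (or rolled back) behind the same connection string; the ZooKeeper quorum below is hypothetical:

```bash
# Connect via ZooKeeper service discovery; a healthy Kyuubi instance is chosen
# automatically, which is what makes smooth migration and rollback possible
beeline -u 'jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=kyuubi'
```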

Performance Comparison

In principle, both Kubernetes and YARN only play the role of resource scheduler and do not change the computing model or task scheduling, so the performance difference should be insignificant. In terms of deployment architecture, Spark on Kubernetes generally adopts a storage-compute separated architecture, while YARN clusters are generally co-located with HDFS. The former loses "data locality" when reading and writing HDFS, which may hurt performance where network bandwidth is a limiting factor. However, in the roughly ten years since the storage-compute separated architecture emerged, growing network performance and the support of various efficient columnar storage formats and compression algorithms have made this impact small.

Terasort Benchmark (by the author)

TPC-DS Benchmark (by Data Mechanics)

TPC-DS Benchmark (by AWS)

Although these results are not official data certified by the TPC organization, the fact that they come from different institutions makes them convincing enough. Setting aside the effects of deployment architecture, the performance gap between the two can be said to be basically non-existent.

Cost Comparison

Migrating Spark jobs to a Kubernetes cluster enables hybrid deployment of offline and online services, exploiting the tidal, peak-shifting nature of the two workloads to share computing resources, which can save up to 50% of total cost of ownership (TCO).

On the other hand, at different stages of an enterprise data platform's development, the storage-to-compute ratios planned for clusters differ, which makes server selection difficult. From the perspective of storage-compute separation, scaling the compute clusters and storage clusters independently is more reasonable and controls IT costs more effectively.

In addition, Spark on Kubernetes allocates Executors as Pods. The number of execution threads (spark.executor.cores) and the Pod's requested CPU are decoupled, which allows fine-grained, job-level control to improve the efficiency of computing resources. In our practice at NetEase, without affecting overall computing performance, the overall CPU of Spark on Kubernetes jobs can reach an oversubscription ratio of over 200%.
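A sketch of this decoupling with real Spark properties (the values are illustrative): four task threads per executor while the Pod requests only two CPUs, i.e. roughly the 200% oversubscription mentioned above:

```properties
# 4 concurrent task threads per executor...
spark.executor.cores                      4
# ...but the executor Pod only requests 2 CPUs from Kubernetes
spark.kubernetes.executor.request.cores   2
# optional hard cap so a busy executor cannot starve co-located Pods
spark.kubernetes.executor.limit.cores     4
```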

Of course, the dynamic resource allocation (Dynamic Resource Allocation) feature of Spark on Kubernetes is still missing or imperfect, which may cause Spark to hold resources without using them. Since this feature traditionally depends directly on the external Shuffle Service, you may need to build a Remote/External Shuffle Service (RSS/ESS) yourself.
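Since Spark 3.0, shuffle tracking offers a stop-gap that enables dynamic allocation on K8s without an external shuffle service, at the cost of keeping executors alive while they still host shuffle data; a minimal sketch with illustrative bounds:

```properties
# Dynamic allocation without an external shuffle service (Spark 3.0+)
spark.dynamicAllocation.enabled                   true
spark.dynamicAllocation.shuffleTracking.enabled   true
spark.dynamicAllocation.minExecutors              2
spark.dynamicAllocation.maxExecutors              50
```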

In the Spark on Kubernetes scenario, temporary storage can be decoupled from the computing process based on RSS/ESS. First, it eliminates local storage dependencies, so that compute nodes can be scheduled dynamically on heterogeneous nodes and scaled more flexibly in complex physical or virtualized environments. Second, scattered local storage is consolidated into a centralized storage service whose capacity is shared by all compute nodes, improving storage resource utilization. Third, it reduces the disk failure rate, dynamically reducing the number of compute nodes marked unavailable and improving the overall resource utilization of the compute cluster. Finally, the lineage of temporary shuffle data is handed off so that it is no longer maintained by the Executor Pod compute nodes, allowing idle Executor Pods to be released back to the resource pool in time and improving cluster resource utilization.

Other Comparisons

| | Spark on K8s | Spark on YARN |
|---|---|---|
| First supported in | Spark 2.3.0 | Spark 0.6.0 |
| Large-scale production use | Less common | Mainstream |
| Scheduler architecture | Shared-state scheduling | Centralized scheduler |
| Community activity | Active | Inactive |
| Minimum scheduling unit | Pod | Container |
| Log aggregation | None | Available |
| Spark Web UI | port-forward | proxy |
| Multi-tenancy | Namespace | Queue |
| Data locality | HDFS read/write: poor; Shuffle read/write: good | HDFS read/write: good in theory, average in production; Shuffle read/write: good |
| HDFS integration maturity | Requires Spark 3.0 and above | Complete |

Summary

It has been four years since Spark on Kubernetes was first released with version 2.3.0 in early 2018, and the current version, 3.2, is the fifth major release since then. Under continuous polishing by the community and its users, it has become a very mature feature.

With the continuous development of the Apache Spark open source ecosystem, including projects such as Apache Kyuubi, ease of use has improved greatly no matter which scheduling framework is used.

The year-over-year growth in the total cost of ownership (TCO) of IT infrastructure has long plagued many enterprises. The flexibility and cost-effectiveness of combining Spark with Kubernetes give us much more room for imagination.

References

  1. https://issues.apache.org/jira/browse/SPARK-18278
  2. https://issues.apache.org/jira/browse/SPARK-33005
  3. https://www.youtube.com/watch?v=xX2z8ndp_zg
  4. https://www.youtube.com/watch?v=hcGdW_6xTKo
  5. https://ieeexplore.ieee.org/document/9384578
  6. https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
  7. https://github.com/apache/incubator-kyuubi
  8. https://aws.amazon.com/cn/blogs/containers/optimizing-spark-performance-on-kubernetes/

Author: Kent Yao, NetEase Shufan Technical Expert, Apache Kyuubi (Incubating) PPMC member, Apache Spark Committer
