MiHoYo Big Data Cloud Native Practice

In recent years, cloud-native technologies such as containers, microservices, and Kubernetes have matured rapidly, and more and more companies are choosing to embrace cloud native and deploy enterprise applications such as AI and big data on cloud-native infrastructure. Taking Spark as an example, running Spark in the cloud makes full use of the public cloud's elastic resources, operations and maintenance tooling, and storage services, and many excellent Spark on Kubernetes practices have emerged in the industry.

At the recently concluded 2023 Yunqi Conference, Du Anming, a big data technology expert in MiHoYo's Data Platform Group, shared the goals, explorations, and practices involved in upgrading MiHoYo's big data architecture to cloud native, and how a Spark on K8s architecture built on Alibaba Cloud Container Service for Kubernetes (ACK) delivers value in elastic computing, cost savings, and the separation of storage and compute.

01 Background introduction

With the rapid growth of MiHoYo's business, the volume of offline big data storage and the number of computing tasks increased quickly, and the early offline big data architecture could no longer meet the new scenarios and requirements.

To address the original architecture's lack of elasticity, complex operations and maintenance, and low resource utilization, in the second half of 2022 we began investigating how to make the big data infrastructure cloud native, and eventually launched a Spark on K8s + OSS-HDFS solution on Alibaba Cloud. It has now been running stably in production for about a year and has delivered three major benefits: elastic computing, cost savings, and separation of storage and compute.

1. Elastic computing

As the game business goes through periodic version updates, promotional events, and new game launches, the demand for and consumption of offline computing resources fluctuates sharply and can reach dozens or even hundreds of times the usual level. Leveraging the native elasticity of a K8s cluster and scheduling Spark computing tasks onto K8s makes it relatively easy to absorb the resource consumption peaks in such scenarios.

2. Cost savings

Relying on the powerful elasticity of Alibaba Cloud Container Service for Kubernetes (ACK) clusters, all computing resources are requested on demand and released after use. Combined with our customized changes to the Spark components and full use of ECI Spot instances, we achieve cost savings of around 50% for the same computing tasks and resource consumption.

3. Separation of storage and compute

Spark runs entirely on K8s and uses the K8s cluster's computing resources. Data access has been gradually switched from HDFS and OSS to OSS-HDFS, and intermediate shuffle data is read and written through Celeborn. The whole architecture decouples compute from storage and is easy to maintain and extend.

02 Spark on K8s architecture evolution

As is well known, the Spark engine can run on a variety of resource managers, such as Yarn, K8s, and Mesos. In big data scenarios, most companies in China still run their Spark tasks on Yarn clusters. Spark first supported K8s in version 2.3, and the feature officially became GA in Spark 3.1, released in March 2021.

Compared with Yarn, Spark on K8s started later and still has some gaps in maturity and stability, but it brings outstanding benefits such as elastic computing and cost savings, so major companies keep trying and exploring it, and in the process the Spark on K8s runtime architecture keeps iterating and evolving.

1. Mixed online-offline deployment

Currently, most companies still run Spark tasks on K8s through a mixed online-offline deployment. The design rests on the fact that different business systems peak at different times: the typical peak for an offline big data system is from midnight to 9 a.m., while for the various application microservices, web-facing BI systems, and so on, the peak is during the day. Outside its own peak hours, a business system's nodes can be added to the K8s namespace used by Spark. As shown in the figure below, Spark and the online application services are deployed on the same K8s cluster.

The advantage of this architecture is that mixed deployment and off-peak scheduling of offline workloads improve machine utilization and reduce costs. The disadvantages are just as obvious: the architecture is complex to implement, maintenance costs are relatively high, and strict resource isolation is difficult, especially at the network level, so the workloads inevitably affect each other to some degree. In addition, we believe this approach does not fit the cloud-native philosophy or future trends.

2. Spark on K8s + OSS-HDFS

Given the drawbacks of mixed online-offline deployment, we designed and adopted a new implementation architecture that is more in line with cloud native: the underlying storage uses OSS-HDFS (JindoFS), the computing cluster uses Alibaba Cloud Container Service for Kubernetes (ACK), and we chose Spark 3.2.3, a relatively feature-rich and stable version.

OSS-HDFS is fully compatible with the HDFS protocol. Besides OSS's virtually unlimited capacity and support for hot and cold data tiering, it also supports atomic directory operations and millisecond-level renames, which makes it well suited to offline data warehouses and allows existing HDFS and OSS data to be migrated smoothly.
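Because OSS-HDFS speaks the HDFS protocol, Spark jobs can address it with an oss:// path just like any other Hadoop-compatible file system. A minimal sketch (the bucket name and paths are hypothetical, and the OSS-HDFS/JindoSDK connector and endpoint configuration are assumed to be present on the cluster):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oss-hdfs-demo").getOrCreate()

// Read from and write back to OSS-HDFS through the same oss:// scheme
// that an HDFS path would normally occupy.
val events = spark.read.parquet("oss://demo-bucket/warehouse/ods/events/dt=2023-10-01")
events.groupBy("game_id").count()
  .write
  .mode("overwrite")
  .parquet("oss://demo-bucket/warehouse/dws/event_counts/dt=2023-10-01")
```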

Alibaba Cloud's ACK cluster provides high-performance, scalable container application management and supports full lifecycle management of enterprise-grade Kubernetes containerized applications. ECS is the familiar Alibaba Cloud elastic compute server, while Elastic Container Instance (ECI) is a serverless container offering whose instances can be requested and released in seconds.

This architecture is simple and easy to maintain. The bottom layer uses ECI's elasticity, so Spark tasks can comfortably handle traffic peaks; scheduling Spark Executors onto ECI nodes maximizes the elasticity of computing tasks and achieves the best cost-reduction results. A schematic diagram of the overall architecture is shown below.

03 Cloud native architecture design and implementation

1. Basic principles

Before going into the implementation details, here is a brief review of how Spark runs on K8s. A Pod is the smallest schedulable unit in K8s; the Driver and each Executor of a Spark task run as separate Pods, and each Pod is assigned a unique IP address. A Pod can contain one or more containers, and the Driver and Executor JVM processes are all started, run, and destroyed inside containers.

After a Spark task is submitted to the K8s cluster, the Driver Pod starts first. The Driver then requests Executors on demand from the apiserver, and the Executors execute the actual Tasks. When the job finishes, the Driver cleans up all of the Executor Pods. Below is a simple schematic of the relationship between them.
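As an illustration of the settings that drive this Driver/Executor Pod lifecycle, here is a minimal sketch in client mode (production submission at MiHoYo goes through spark-k8s-cli in cluster mode, described later); the apiserver address, image, namespace, and service account names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// The master URL points at the K8s apiserver; the spark.kubernetes.* settings
// control how the Driver and Executor Pods are created.
val spark = SparkSession.builder()
  .appName("spark-on-k8s-demo")
  .master("k8s://https://apiserver.example.com:6443")
  .config("spark.kubernetes.namespace", "spark-jobs")
  .config("spark.kubernetes.container.image", "registry.example.com/spark:3.2.3")
  .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
  .config("spark.executor.instances", "4") // number of Executor Pods the Driver will request
  .getOrCreate()
```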

2. Execution process

The figure below shows the complete job execution flow. After developing a Spark job, the user publishes the task to the scheduling system and configures its runtime parameters. The scheduling system periodically submits the task to our in-house Launcher middleware, which calls spark-k8s-cli, and the CLI finally submits the task to the K8s cluster. Once the task has been submitted successfully, the Spark Driver Pod starts first and requests Executor Pods from the cluster. While running the actual Tasks, the Executors access and interact with many external big data components such as Hive, Iceberg, OLAP databases, and OSS-HDFS, while shuffle data between Spark Executors is handled by Celeborn.

3. Task submission

Companies differ in how they submit Spark tasks to K8s clusters. Below is a brief overview of the common approaches, followed by the task submission and management method we currently use in production.

3.1 Using native spark-submit

Jobs are submitted directly with the spark-submit command, which Spark supports natively. This is relatively simple to integrate and matches users' habits, but it is inconvenient for tracking and managing job status, cannot automatically configure the Service and Ingress for the Spark UI, and cannot automatically clean up resources after the task finishes, so it is not suitable for production environments.

3.2 Using spark-on-k8s-operator

This is a commonly used way to submit jobs. The K8s cluster needs spark-operator installed in advance, and the client runs Spark jobs by submitting YAML files through kubectl. It is essentially an extension of the native approach, and the final submission still goes through spark-submit; the added capabilities include job management, Service/Ingress creation and cleanup, task monitoring, and Pod enhancement. This method can be used in production, but it does not integrate well with big data scheduling platforms, and for users unfamiliar with K8s the complexity and learning curve are relatively high.

3.3 Using spark-k8s-cli

In our production environment we use spark-k8s-cli to submit tasks. spark-k8s-cli is essentially an executable file; it is based on Alibaba Cloud's emr-spark-ack submission tool, which we refactored, functionally enhanced, and deeply customized.

spark-k8s-cli combines the advantages of the spark-submit and spark-operator submission methods: all jobs can be managed through spark-operator, interactive spark-shell sessions and submission of local dependencies are supported, and its usage is exactly the same as the native spark-submit syntax.

In the early days of going live, the Spark Submit JVM processes of all tasks were started inside a single Gateway Pod. After running this way for a while, we found it was not stable enough: once the Gateway Pod became abnormal, all Spark tasks running on it would fail, and the log output of Spark tasks was also hard to manage. For these reasons, we changed spark-k8s-cli to start a separate Submit Pod for each task. The Submit Pod requests the Driver to start the task; both Submit Pods and Driver Pods run on fixed ECS nodes, Submit Pods are completely independent of one another, and each Submit Pod is automatically released when its task finishes. The submission and operation principle of spark-k8s-cli is shown in the figure below.

Beyond the basic task submission described above, we have also added several other enhancements and customized features to spark-k8s-cli:

  • Submit tasks to multiple K8s clusters in the same region, to achieve load balancing and failover between clusters
  • Automatic queuing and waiting similar to Yarn's behavior when resources are insufficient (with a K8s resource Quota configured, a task would otherwise fail immediately once the Quota limit is reached)
  • Exception handling for network communication with K8s, retries on creation or startup failures, and fault tolerance for occasional cluster jitter and network anomalies
  • Rate limiting and flow control for large-scale backfill tasks, by department or business line
  • Built-in alerting for task submission failures, container creation or startup failures, run timeouts, and more

4. Log collection and display

A K8s cluster itself does not provide automatic log aggregation and display the way Yarn does; Driver and Executor log collection has to be done by the user. The common solution today is to deploy an agent on each K8s node that collects logs and stores them in third-party storage such as ES or SLS. However, for users and developers accustomed to clicking through to logs on the Yarn UI, this is very inconvenient: they have to jump to a third-party system to search for and view logs.

To make it convenient to view the logs of Spark tasks on K8s, we modified the Spark code so that Driver and Executor logs are ultimately written to OSS. Users can then open the log files directly from the Spark UI and the Spark History Server.

The figure above shows how log collection and display work. When a Spark task starts, the Driver and Executors first register a shutdown hook; when the task ends and the JVM exits, the hook uploads the complete log to OSS. In addition, to view logs in full, the Spark History Server code has to be modified so that the history page shows stdout and stderr, and clicking a log pulls the corresponding Driver or Executor log file from OSS and renders it in the browser. For running tasks, we also provide a Spark running Web UI: after a task is submitted successfully, spark-operator automatically creates the Service and Ingress so users can view the running details, and the running logs of the corresponding Pod can be fetched through the K8s API.
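The upload step can be sketched roughly as a JVM shutdown hook that copies the local log file to OSS through the Hadoop FileSystem API. This is only an illustration of the idea, not MiHoYo's actual patch inside Spark; the log path, bucket, and directory layout are hypothetical, and the OSS/OSS-HDFS Hadoop connector is assumed to be configured on the image.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object LogUploadHook {
  /** Register a JVM shutdown hook that uploads a local log file to an OSS directory. */
  def register(localLog: String, ossDir: String): Unit = {
    sys.addShutdownHook {
      val fs = FileSystem.get(URI.create(ossDir), new Configuration())
      val target = new Path(ossDir, new Path(localLog).getName)
      fs.copyFromLocalFile(new Path(localLog), target) // upload stdout/stderr on JVM exit
    }
  }
}

// Hypothetical usage in the Driver/Executor startup path:
// LogUploadHook.register("/var/log/spark/stderr", "oss://demo-bucket/spark-logs/app-123/driver/")
```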

5. Elasticity and cost reduction

Thanks to the elastic scaling capability of the ACK cluster, combined with making full use of ECI, the total cost of running Spark tasks of the same scale on K8s is significantly lower than on a fixed Yarn cluster, and resource utilization is greatly improved as well.

Elastic Container Instance (ECI) is a serverless container runtime service. The biggest difference between ECI and ECS is that ECI is billed by the second and can be requested and released within seconds, which makes it very well suited to computing scenarios like Spark with pronounced load peaks and troughs.

The figure above shows how Spark tasks request and use ECI on an ACK cluster. The prerequisite is to install the ack-virtual-node component in the cluster and configure the VSwitch and related settings. When a task runs, Executors are scheduled onto the virtual node, and the virtual node requests, creates, and manages the ECI instances.

ECI comes in regular instances and preemptible instances. Preemptible instances are low-cost spot instances with a default protection period of one hour, which suits most Spark batch processing scenarios; after the protection period, a preemptible instance may be forcibly reclaimed. To further improve the cost-reduction effect and take full advantage of preemptible pricing, we modified Spark to implement automatic conversion of ECI instance types: the Executor Pods of a Spark task preferentially run on preemptible ECI instances, and when a preemptible instance cannot be created due to insufficient inventory or other reasons, it automatically switches to a regular ECI instance so the task keeps running normally. The specific implementation principle and conversion logic are shown in the figure below.
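For reference, requesting spot ECI for Executors can be expressed through Spark's standard executor-annotation configuration; the sketch below shows only that request, not the automatic spot-to-regular fallback, which is MiHoYo's in-house Spark modification. The k8s.aliyun.com/* annotation keys and values are taken from Alibaba Cloud ECI documentation and should be treated as assumptions to verify against your environment.

```scala
import org.apache.spark.SparkConf

// Ask the virtual node to create Executor Pods as spot ECI instances.
// spark.kubernetes.executor.annotation.* is standard Spark-on-K8s configuration;
// the annotation names below are assumed from Alibaba Cloud ECI docs.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.annotation.k8s.aliyun.com/eci-spot-strategy", "SpotAsPriceGo")
// Or cap the bid price instead (also an assumed annotation):
//  .set("spark.kubernetes.executor.annotation.k8s.aliyun.com/eci-spot-strategy", "SpotWithPriceLimit")
//  .set("spark.kubernetes.executor.annotation.k8s.aliyun.com/eci-spot-price-limit", "0.5")
```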

6. Celeborn

Because the disk capacity of K8s nodes is very small and nodes are requested and released with the workload, large amounts of Spark shuffle data cannot be kept on them. Mounting a cloud disk to each Executor Pod is also problematic: the disk size is hard to determine, and taking factors such as data skew into account, disk utilization ends up fairly low, so it is complicated to use. In addition, although the Spark community added features such as Reuse PVC in 3.2, our research found them incomplete and insufficiently stable.

To solve the shuffle problem for Spark on K8s, after thoroughly researching and comparing several open-source products, we adopted Alibaba's open-source Celeborn. Celeborn is an independent service dedicated to storing Spark's intermediate shuffle data, so that Executors no longer depend on local disks; the service can be used from both K8s and Yarn. Celeborn uses a push-based shuffle model in which shuffle data is written append-only and read sequentially, improving data read/write performance and efficiency.
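On the Spark side, pointing shuffle at a Celeborn cluster is a configuration change. The sketch below uses the class and property names from the Apache Celeborn documentation; they have changed between Celeborn releases and the earlier RSS naming, so verify them against the version you deploy, and the master endpoint is hypothetical. A Celeborn Master/Worker cluster is assumed to be running already.

```scala
import org.apache.spark.SparkConf

// Route shuffle data through Celeborn instead of Executor local disks.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
  .set("spark.celeborn.master.endpoints", "celeborn-master-0:9097") // hypothetical endpoint
  .set("spark.shuffle.service.enabled", "false") // the external shuffle service is not needed with Celeborn
```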

On top of the open-source Celeborn project, we have also done internal work to enhance data network transmission, enrich the metrics, improve monitoring and alerting, and fix bugs, and we now maintain a stable internal version.

7. Kyuubi on K8s

Kyuubi is a distributed, multi-tenant gateway that provides SQL and other query services for Spark, Flink, or Trino. In the early days, our Spark ad hoc queries were sent to Kyuubi for execution. To keep users' query SQL from failing to submit and run when Yarn queue resources were insufficient, we also made Kyuubi Server deployable and runnable on K8s, so that when Yarn resources ran short, Spark queries automatically switched over to K8s. Because the Yarn cluster is gradually shrinking and query resources on it can no longer be guaranteed, and to keep the user query experience consistent, we have now moved all Spark SQL ad hoc queries to K8s.

To let users' ad hoc queries run smoothly on K8s, we also made some source-level modifications to Kyuubi, including rewriting the docker-image-tool.sh, Deployment.yaml, and Dockerfile files in the Kyuubi project, redirecting logs to OSS, supporting management by the Spark Operator, permission control, and convenient viewing of the task's running UI.

8. K8s Manager

In the Spark on K8s scenario, the K8s cluster's own monitoring and alerting cannot fully meet our needs. In production we care more about the Spark tasks on the cluster: their running status, Pod status, resource consumption, and ECI usage. Using the K8s Watch mechanism, we implemented our own monitoring and alerting service, K8s Manager; the figure below is a schematic of the service.

K8s Manager is a relatively lightweight Spring Boot service we implemented in-house. Its job is to watch and aggregate information about Pods, Quotas, Services, ConfigMaps, Ingresses, Roles, and other resources on each of our K8s clusters, generate customized metrics from them, and provide dashboards and anomaly alerts. These include total cluster CPU and memory usage, the number of currently running Spark tasks, top-N statistics on Spark task memory consumption and runtime, daily Spark task counts, the total number of Pods in the cluster, Pod status statistics, the distribution of ECI machine types and availability zones, monitoring of expired resources, and more; the full list is omitted here.
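As a rough illustration of the Watch-based approach (not the actual K8s Manager code, which is a Spring Boot service), the sketch below uses the fabric8 Kubernetes client to watch Spark Driver Pods in one namespace and maintain a simple running-task counter; the namespace and client version are assumptions, while spark-role is the label Spark itself puts on its Pods.

```scala
import java.util.concurrent.atomic.AtomicInteger
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientBuilder, Watcher, WatcherException}
import io.fabric8.kubernetes.client.Watcher.Action

object SparkPodWatch {
  // Would feed a custom "currently running Spark tasks" metric in K8s Manager.
  val runningDrivers = new AtomicInteger(0)

  def start(client: KubernetesClient): Unit = {
    client.pods()
      .inNamespace("spark-jobs")          // hypothetical namespace
      .withLabel("spark-role", "driver")  // label Spark stamps on Driver Pods
      .watch(new Watcher[Pod] {
        override def eventReceived(action: Action, pod: Pod): Unit = action match {
          case Action.ADDED   => runningDrivers.incrementAndGet()
          case Action.DELETED => runningDrivers.decrementAndGet()
          case _              => // MODIFIED / ERROR: inspect pod.getStatus for alerting
        }
        override def onClose(cause: WatcherException): Unit = {
          // Watches can expire or drop; a real service re-establishes them here.
        }
      })
  }
}

// Usage sketch: SparkPodWatch.start(new KubernetesClientBuilder().build())
```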

9. Other work

9.1 Automatic switching of scheduled tasks

In our scheduling system, Spark tasks support three execution policies: Yarn, K8s, and Auto. If a user's task specifies which resource manager it must run on, the task runs only on Yarn or only on K8s. If the user selects Auto, where the task runs depends on the current resource usage of the Yarn queue, as shown in the figure below. Because the total task volume is large and Hive tasks are still being migrated to Spark, some tasks still run on the Yarn cluster, but in the final state all tasks will be hosted on K8s.
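Conceptually, the Auto policy is a routing decision made on top of the Yarn ResourceManager's metrics. The sketch below reads cluster memory usage from the RM REST API and picks a target; the RM address, threshold, crude JSON handling, and the decision rule itself are illustrative assumptions, not the scheduler's actual logic.

```scala
import scala.io.Source

object ExecutionPolicyRouter {
  /** Decide whether an Auto task should go to Yarn or K8s based on Yarn cluster load. */
  def chooseTarget(rmHost: String = "yarn-rm.example.com", threshold: Double = 0.85): String = {
    // /ws/v1/cluster/metrics returns JSON containing allocatedMB / totalMB fields.
    val body = Source.fromURL(s"http://$rmHost:8088/ws/v1/cluster/metrics").mkString
    def metric(name: String): Double = {
      val pattern = ("\"" + name + "\"\\s*:\\s*(\\d+)").r
      pattern.findFirstMatchIn(body).map(_.group(1).toDouble).getOrElse(0.0)
    }
    val used  = metric("allocatedMB")
    val total = metric("totalMB")
    if (total > 0 && used / total < threshold) "YARN" else "K8S"
  }
}
```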

9.2 Support for multiple availability zones and multiple VSwitches

ECI is used heavily while Spark tasks run. Successful ECI creation has two prerequisites: an IP address can be allocated, and the current availability zone has inventory. In practice, the number of IPs a single VSwitch can provide is limited, and the total number of preemptible instances in a single availability zone is also limited. Therefore, whether you use regular ECI or Spot ECI in a real production environment, it is better practice to configure support for multiple availability zones and multiple VSwitches.

9.3 Cost calculation

Since each Executor's CPU, memory, and instance-type information is explicitly specified when a Spark task is submitted, we can obtain each Executor's actual runtime from the task before the SparkContext is closed at the end of the task, and, combined with unit prices, calculate the approximate cost of the Spark task. Because ECI Spot prices change with the market and inventory at any time, the single-task cost calculated this way is an upper bound and is mainly used to reflect trends.
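The calculation itself is straightforward; below is a simplified sketch in which the per-second unit prices, the executor shapes, and the billing model are illustrative assumptions.

```scala
final case class ExecutorUsage(vcores: Int, memoryGiB: Int, runtimeSeconds: Long)

object SparkTaskCost {
  // Hypothetical per-second unit prices for regular ECI. Spot prices fluctuate,
  // so pricing at the regular rate gives the upper-bound cost described above.
  val pricePerVcoreSecond = 0.000049
  val pricePerGiBSecond   = 0.0000061

  /** Upper-bound task cost = sum over Executors of (CPU + memory price) x runtime. */
  def estimate(executors: Seq[ExecutorUsage]): Double =
    executors.map { e =>
      (e.vcores * pricePerVcoreSecond + e.memoryGiB * pricePerGiBSecond) * e.runtimeSeconds
    }.sum
}

// Example: 10 executors of 4 vcores / 16 GiB that each ran for 30 minutes.
// SparkTaskCost.estimate(Seq.fill(10)(ExecutorUsage(4, 16, 1800)))
```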

9.4 Optimizing Spark Operator

At launch, when the number of tasks was small, the Spark Operator service ran well. But as the number of tasks kept growing, the Operator processed events more and more slowly, and the ConfigMaps, Ingresses, Services, and other resources generated while jobs ran piled up in the cluster because they could not be cleaned up in time, to the point where the Web UI of newly submitted Spark tasks could not be opened. After identifying the problem, we increased the Operator's number of worker goroutines and implemented batch processing of Pod events, filtering of irrelevant events, TTL-based deletion, and similar measures, which resolved the Spark Operator's performance shortfall.

9.5 Upgrade Spark K8s Client

Spark 3.2.2 uses fabric8 (the Kubernetes Java Client) to access and operate on resources in the K8s cluster, with client version 5.4.1 by default. In this version, when a task ends and its Executors are released all at once, the Driver sends a large number of Delete Pod API requests to the K8s apiserver, which puts considerable pressure on the apiserver and etcd and causes the apiserver's CPU to spike.

Our internal Spark version has now upgraded kubernetes-client to 6.2.0, which supports batch deletion of Pods and resolves the cluster jitter caused by the flood of delete API requests when Spark tasks release Executors all at once.
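To give a sense of what batch deletion looks like with the newer client (this is not the exact patch inside Spark), fabric8 6.x can delete all of an application's Executor Pods with a single label-selector call. The namespace and application ID are hypothetical; spark-app-selector and spark-role are labels Spark itself stamps on its Pods.

```scala
import io.fabric8.kubernetes.client.KubernetesClientBuilder

// One request removes every Executor Pod of a given Spark application,
// instead of issuing a separate Delete Pod call per Executor.
val client = new KubernetesClientBuilder().build()
client.pods()
  .inNamespace("spark-jobs")                        // hypothetical namespace
  .withLabel("spark-app-selector", "spark-a1b2c3")  // app id label (value hypothetical)
  .withLabel("spark-role", "executor")
  .delete()
```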

04 Problems and Solutions

Throughout the design and implementation of the Spark on K8s solution, we encountered various problems, bottlenecks, and challenges. Here is a brief description of them and of our solutions.

1. Slow release of elastic network interfaces

The slow release of elastic network interfaces (ENIs) is a performance bottleneck when ECI is used at large scale. It leads to heavy consumption of IP addresses on the VSwitch and ultimately causes Spark tasks to get stuck or fail to submit; the specific trigger is shown in the figure below. The Alibaba Cloud team has since resolved the issue through technical upgrades, greatly improving the release speed and overall performance.

2. Watcher failure

When a Spark task starts its Driver, it creates an event watcher on the Executors to obtain the running status of all Executors in real time. For some long-running Spark tasks, this watcher often fails due to resource-version expiry, network anomalies, and similar causes, in which case the Watcher needs to be recreated, otherwise the task may run away. This problem is a Spark bug; our internal version has fixed it, and the PR has been contributed to the Spark community.

3. Tasks getting stuck

As shown in the figure above, the Driver obtains Executor status via List and Watch. Watch is a passive listening mechanism, and because of network and other issues events can occasionally be missed, although the probability is fairly low. List is an active request: for example, every 3 minutes the Driver can ask the apiserver for the information on all of its task's current Executors.

Because a List request fetches all Pod information for a task, frequent Listing puts great pressure on the K8s apiserver and etcd when there are many tasks. Early on, we turned off the periodic List and relied on Watch alone. When a Spark task ran abnormally, for example with many Executors OOMing, there was a certain probability that the Driver's Watch state became incorrect: even though Tasks had not finished, the Driver no longer requested Executors to run them and the task hung. Our solutions are as follows:

  • Keep the Watch mechanism enabled, but turn the List mechanism back on with a longer interval, requesting once every 5 minutes
  • Modify the code related to ExecutorPodsPollingSnapshotSource so that the List can be served from the apiserver's cache, fetching the full Pod information from the cache and reducing the pressure List puts on the cluster (see the configuration sketch below)
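In configuration terms, the first change maps onto Spark's standard executor polling interval, shown below; the second is a source change to ExecutorPodsPollingSnapshotSource (roughly, listing with resourceVersion set to "0" so the apiserver can answer from its watch cache) and is not reachable through configuration alone in Spark 3.2.

```scala
import org.apache.spark.SparkConf

// Lengthen the polling (List) interval so that full Pod listings happen rarely;
// 5 minutes matches the interval described above, versus a much shorter default.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.apiPollingInterval", "300s")
// The cached List (resourceVersion = "0") is part of the in-house change to
// ExecutorPodsPollingSnapshotSource, not a stock configuration switch.
```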

4. Celeborn read/write timeouts and failures

Apache Celeborn is an Alibaba open-source project, formerly known as RSS (Remote Shuffle Service). It was still somewhat immature early on, and its handling of network latency, packet loss, and similar anomalies was not complete enough, which caused some online Spark tasks with large shuffle volumes to run for a very long time or even fail. The following three measures are how we addressed this problem.

  • Optimize Celeborn into an internal version with improved code for network packet transmission
  • Tune the Celeborn Master and Worker parameters to improve shuffle data read/write performance
  • Upgrade the ECI underlying image version to fix an ECI Linux kernel bug

5. Quota lock conflict when submitting tasks in batches

To prevent resources from being used without limit, we set a Quota cap on each K8s cluster. In K8s, a Quota is itself a resource, and every Pod request or release modifies the Quota's contents (CPU/memory values). When many tasks are submitted concurrently, Quota lock conflicts can occur, which affects the creation of task Driver Pods and causes task startup to fail.

To deal with task startup failures caused by this, we modified the creation logic of the Spark Driver Pod and added configurable retry parameters: when Driver Pod creation is detected to have failed because of a Quota lock conflict, the creation is retried. Executor Pod creation can also fail due to Quota lock conflicts, but this case needs no special handling: if an Executor fails to be created, the Driver automatically requests a new one, which amounts to an automatic retry.

6. UnknownHost errors when submitting tasks in batches

When a large number of tasks is submitted to the cluster in a burst, many Submit Pods start at the same time and simultaneously request IPs from the Terway component and bind elastic network interfaces. With a certain probability, a Pod starts and the ENI binds successfully but is not yet fully ready, so the Pod's network cannot be used normally; when the task then queries CoreDNS, the request cannot get out, and the Spark task reports an UnknownHost error and fails. We avoided and solved this problem with the following two measures:

  • Allocate a Terway Pod on every ECS node
  • Enable Terway's caching feature so that IPs and ENIs are allocated in advance; new Pods obtain them directly from the cache pool and return them to the pool after use

7. Network packet loss between availability zones

To ensure sufficient inventory, each K8s cluster is configured with multiple availability zones. However, cross-AZ network communication is slightly less stable than communication within the same AZ, that is, there is a certain probability of packet loss between availability zones, which shows up as unstable task runtimes. For cross-AZ packet loss, you can try setting the ECI scheduling policy to VSwitchOrdered; that way all Executors of a task end up essentially in the same availability zone, avoiding communication anomalies between Executors in different zones and the resulting instability in task runtimes.
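As with the spot settings earlier, the scheduling policy can be attached to Executor Pods through annotations; the annotation key and value below are taken from Alibaba Cloud ECI documentation and should be treated as an assumption to verify against your ECI version.

```scala
import org.apache.spark.SparkConf

// Ask ECI to fill VSwitches (and therefore availability zones) in order,
// so that one task's Executors tend to land in the same zone.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.annotation.k8s.aliyun.com/eci-schedule-strategy", "VSwitchOrdered")
```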

05 Summary and Outlook

Finally, we are very grateful to the colleagues on the Alibaba Cloud Container, ECI, EMR, and other related teams for the many valuable suggestions and the professional technical support they provided during the rollout and actual migration of the entire solution.

The new cloud-native architecture has now been running stably in production for nearly a year. Going forward, we will continue to optimize and improve the overall architecture, mainly in the following areas:

1. Continue to optimize the overall cloud-native solution and further improve the system's capacity and disaster-recovery capabilities

2. Upgrade the cloud-native architecture and containerize more big data components, making the overall architecture more thoroughly cloud native

3. More fine-grained resource management and precise cost control

Author: MiHoYo Big Data Development

Original link

This article is original content from Alibaba Cloud and may not be reproduced without permission.

Origin my.oschina.net/yunqi/blog/10150099