LAS Spark + cloud native: a new solution for data analysis

With the rapid growth of data volumes and the continuous evolution of data processing requirements, cloud-native architecture and lakehouse analytics have become important trends in modern data processing. Enterprises face both the challenges and the opportunities of massive data, and building a scalable, flexible, and efficient data analytics platform has become a pressing need.
This article introduces the cloud-native lakehouse analytics practice of Volcengine's Lakehouse Analytics Service (hereinafter referred to as LAS), built on Spark. By combining Spark's capabilities with the advantages of cloud-native technology, LAS provides an efficient, scalable, and flexible data analytics platform that meets modern enterprises' needs for data insight and offers them a powerful solution.
 
The outline of this article is as follows:
  • Spark on K8S
  • Kyuubi: Spark SQL Gateway
  • CatalogService: Lakehouse metadata architecture in practice
  • LAS Batch Scheduler: Cloud-native batch scheduler
  • UIService: Cloud-native Spark History Server
  • Falcon: Remote Shuffle Service
  • Summary

Spark on K8S

As the de facto standard for today's cloud-native infrastructure, Kubernetes plays an important role in LAS Spark. We first share the practical optimization work of LAS Spark based on Kubernetes.
Kubernetes (often abbreviated as K8s) is an open-source container orchestration platform used to automate the deployment, scaling, and management of containerized applications. It provides a powerful orchestration and management system that simplifies how applications are deployed, scaled, and operated.
Kubernetes was originally developed by Google and open-sourced in 2014. It is based on the experience and technology of Google's internal Borg system and, with contributions and feedback from the community, has gradually become the de facto standard in container orchestration.
Kubernetes has a highly extensible architecture consisting of a set of core components and plug-ins. Developers can extend and enhance its functionality through the plug-in mechanism. Kubernetes is widely used to deploy and manage cloud-native applications; its functionality and flexibility make it easier for development and operations teams to build, deploy, and manage containerized applications and to achieve high availability, scalability, and elastic scaling.
Volcengine LAS uses Kubernetes as its infrastructure and, combined with a series of self-developed extensible plug-ins, builds Serverless Spark capabilities, thereby delivering cloud-native, integrated lakehouse service capabilities.
LAS Spark uses Spark Operator to manage the execution of each Spark job on Kubernetes. Operator is an extension mechanism of Kubernetes that uses custom resources to manage applications and their components. Operator follows the design philosophy of Kubernetes controller.
The concept of Operator pattern allows extending the functionality of a cluster by associating controllers with custom resources without modifying the core Kubernetes code. Operator acts as a client of the Kubernetes API and is also a controller of custom resources.
A common way to deploy an Operator is to add a custom resource and its associated controller to the cluster. Similar to deploying containerized applications, the controller usually runs outside the control plane; for example, it can run as a Deployment in the cluster.
The role of Spark Operator is to describe Spark jobs as custom resources. Users or programs can submit Spark jobs and view the running status of the jobs through the pure Kubernetes interface. This makes managing Spark jobs as easy as managing other Kubernetes resources, converting Spark jobs into standard Kubernetes job workload types.
A user or program submits a Spark job to the Kubernetes cluster through SparkApplication CRD (custom resource definition). Spark Operator subscribes to the status updates of all SparkApplications in the cluster, submits jobs to the Kubernetes cluster by calling spark-submit, and maintains the entire life cycle of the corresponding Spark jobs.
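As an illustration only, the sketch below shows how a program might create such a SparkApplication custom resource with the Kubernetes Python client. The image, main class, namespace, and resource sizes are hypothetical placeholders, not LAS-internal values.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (inside a cluster one would use
# config.load_incluster_config() instead).
config.load_kube_config()
api = client.CustomObjectsApi()

# A hypothetical SparkApplication manifest for the spark-on-k8s-operator CRD.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "demo-etl", "namespace": "spark-jobs"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "registry.example.com/spark:3.3.0",      # hypothetical image
        "mainClass": "com.example.DemoEtl",                # hypothetical class
        "mainApplicationFile": "local:///opt/jobs/demo-etl.jar",
        "sparkVersion": "3.3.0",
        "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
        "executor": {"instances": 4, "cores": 2, "memory": "4g"},
    },
}

# Creating the custom resource is all that is needed; the Operator watches
# SparkApplications and calls spark-submit on our behalf.
api.create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark-jobs",
    plural="sparkapplications",
    body=spark_app,
)
```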
Under the hood, Volcengine LAS runs on Volcengine's container service VKE (Volcengine Kubernetes Engine), an enterprise-grade container cloud management platform based on Kubernetes.
By building a cloud-native cluster on VKE, LAS Spark provides a multi-tenant, isolated runtime environment. At the logical level, LAS partitions user resources through its queue design, while at the physical level it uses container isolation policies to keep tenant jobs isolated at runtime.
Furthermore, LAS provides securely isolated sandbox containers based on Volcengine Container Instance (VCI). VCI is a serverless containerized compute service that integrates seamlessly with the managed VKE container service to provide Kubernetes orchestration capabilities.
On top of VKE/VCI, LAS builds a tidal quota capability, monitoring resource usage at the cluster level to achieve peak shaving and valley filling. Building on VCI's pod-granularity elasticity, LAS will further enhance its elastic scaling capabilities in the future, providing fully lossless, real-time elastic scaling at the granularity of individual Spark jobs.
 

Kyuubi: Spark SQL Gateway

Based on the previous sections, we have made Spark cloud native. To further expose Spark's capabilities as a service, LAS Spark uses Apache Kyuubi to encapsulate the complete Spark engine. Kyuubi is a distributed, multi-tenant gateway that primarily provides SQL entry points on top of data warehouses and data lakes. It can serve different big data scenarios within an enterprise, such as ETL and BI reporting. Kyuubi provides standard ODBC/JDBC interfaces, allowing users to query various data sources with SQL. With multi-tenancy, security, and high availability built in, it is well suited to high-concurrency, enterprise-level big data query and analysis. Kyuubi's architecture consists of the following components:
  • Server Discovery / Load Balance: Uses ZooKeeper/etcd for service discovery and load balancing. When a client submits a job, the ZooKeeper/etcd-based load-balancing policy routes it to a KyuubiServer, which manages the job's execution.
  • Servers: Multiple KyuubiServers are supported; they register themselves with ZooKeeper/etcd at startup for service discovery and load balancing. Running multiple servers also provides cold-standby HA.
  • Engine Discovery: Within a KyuubiServer, a client request locates its own Engine through Engine Discovery, and the request is then forwarded to the corresponding Engine.
  • Engines: The concrete execution engines, such as Spark, Trino, and others.
LAS builds Spark's near-real-time query and analysis capabilities based on Kyuubi, effectively supporting multi-tenant and high-concurrency scenarios. Kyuubi supports different levels of isolation capabilities such as Connection, User, and Group. By combining it with the LAS tenant queue capability, it fully realizes resource isolation and ensures fair resource allocation between Spark tasks of different tenants. Based on Kyuubi, LAS provides a simple and easy-to-use interface. Users can interact through the JDBC/ODBC client or LAS Console, and can easily run Spark SQL queries on LAS.
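As a minimal sketch of the JDBC/Thrift entry point, the example below connects to a Kyuubi endpoint with PyHive, which speaks the HiveServer2 protocol that Kyuubi exposes. The hostname, user, and table names are hypothetical; 10009 is Kyuubi's default frontend port.

```python
from pyhive import hive  # Kyuubi exposes the HiveServer2 Thrift protocol

# Connect to a (hypothetical) Kyuubi endpoint; the gateway routes the session
# to a Spark engine according to its isolation level (connection/user/group).
conn = hive.connect(host="kyuubi.example.internal", port=10009, username="analyst")
cursor = conn.cursor()

# Run an ordinary Spark SQL query through the gateway.
cursor.execute("SELECT dt, COUNT(*) AS cnt FROM demo_db.orders GROUP BY dt")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```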
To adapt to more engine types (such as Presto), LAS has built extensive in-house extensions around Kyuubi, providing a unified SQL capability (codename: ByteQuery) with many optimizations in the parsing layer. Due to limited space, we will share more details in a later article.

CatalogService: Lakehouse metadata architecture in practice

As an integrated lakehouse analytics service, the next challenge LAS faces is how to hide the differences between metadata sources from the Spark engine. To solve this problem, LAS developed its own unified metadata service, CatalogService. CatalogService provides an interface compatible with HMS (Hive Metastore) and presents a unified metadata view to all query engines, solving the metadata management problem across heterogeneous data sources.
CatalogService is divided into three layers. The first layer is Catalog Federation, which provides a unified view and cross-region data access, and routes metadata requests: based on the type of each metadata request, it maps the request to the corresponding underlying metadata service instance.
The second layer consists of the concrete metadata service implementations beneath CatalogService, such as the Hive MetaStore Service and the ByteLake MetaStore Service. These metadata services connect to CatalogService and together provide metadata to the engines above.
The last layer is the MetaStore storage layer, which provides different storage engines in a pluggable manner to meet the storage requirements of the various metadata service instances above it.
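From the engine's point of view, the HMS-compatible interface means that connecting to CatalogService looks just like connecting to a Hive Metastore. The sketch below is a minimal illustration with a hypothetical endpoint address, not the actual LAS configuration:

```python
from pyspark.sql import SparkSession

# Point Spark's Hive catalog at an HMS-compatible metadata endpoint
# (the thrift address below is a hypothetical placeholder).
spark = (
    SparkSession.builder
    .appName("catalog-demo")
    .config("spark.hadoop.hive.metastore.uris",
            "thrift://catalogservice.example.internal:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables from the unified metadata view can now be queried directly.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM demo_db.orders LIMIT 10").show()
```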

LAS Batch Scheduler: Cloud-native batch scheduler

After a Spark job is submitted to the Kubernetes cluster, efficiently scheduling its resources becomes the next issue LAS Spark needs to solve. The Kubernetes default scheduler was originally designed for orchestrating container services; although the community has made many improvements since, it is still not the best choice for batch jobs in terms of scheduling features or throughput. Therefore, LAS enhances resource scheduling for Spark jobs on top of this cloud-native base.
LAS Batch Scheduler provides all the scheduling capabilities batch jobs rely on, such as Gang Scheduling, FIFO/Fair Scheduling, min/maxQuota, priority preemption, overcommitment, and mixed CPU/GPU scheduling, and improves batch scheduling throughput through a global scheduling cache. Architecturally, it adopts a composable design that is highly extensible and makes further improvements to the batch scheduler easy in the future.
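As a rough, non-authoritative illustration of the gang-scheduling idea (all-or-nothing placement) mentioned above, the sketch below shows the core admission check; it is purely conceptual and not the LAS Batch Scheduler implementation.

```python
# Conceptual sketch of gang scheduling: a job's pods are admitted only when the
# whole gang fits, so a partially started job never holds resources idle.
def try_schedule_gang(job_pods, free_cpu):
    """Return a pod->decision map if every pod of the job can be placed, else None."""
    demand = sum(pod["cpu"] for pod in job_pods)
    if demand > free_cpu:
        return None  # admit nothing; keep the job queued until capacity frees up
    return {pod["name"]: "bind" for pod in job_pods}

executors = [{"name": f"executor-{i}", "cpu": 2} for i in range(5)]
print(try_schedule_gang(executors, free_cpu=8))   # None: the gang does not fit yet
print(try_schedule_gang(executors, free_cpu=16))  # all five executors are bound together
```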
Given the execution characteristics of Spark jobs of different sizes, LAS uses the Batch Scheduler to allocate resource quotas in real time at the workload level in a resident-service mode, enabling warm starts for small and medium Spark jobs and cold starts for large ones, with real-time quota sharing at the tenant queue level. In addition, LAS has made many optimizations around multi-tenancy, security, and real-time elasticity, which we will expand on when the opportunity arises.

UIService: Cloud-native Spark History Server

Once resource scheduling completes, the Spark job officially enters the execution phase. LAS has done a great deal of optimization for the execution phase, the details of which will be covered in dedicated articles. Here we focus on UIService, a new cloud-native Spark history service developed in-house by LAS. Compared with the open-source Spark History Server (SHS), UIService reduces storage usage and access latency by more than 90%.
The native Spark History Server is built on Spark's event system. While a Spark application runs, it generates a large number of SparkListenerEvents containing runtime information, such as ApplicationStart, StageCompleted, and MetricsUpdate, each with a corresponding SparkListenerEvent implementation. All events are sent to the ListenerBus and received by every listener registered on it. Among them, EventLoggingListener is the listener dedicated to producing the event log: it serializes events into a JSON-format event log file and writes it to a file system (such as HDFS).
On the History Server side, the core logic lives in FsHistoryProvider. FsHistoryProvider maintains a thread that periodically scans the configured event log path, traverses the event log files, extracts summary information (mainly application_id, user, status, start_time, end_time, event_log_path), and maintains a listing. When a user accesses the UI, the requested application is looked up in that listing; if it exists, the corresponding event log file is read and parsed in full. Parsing is a replay process: each line in the event log file is a serialized event, which is deserialized line by line, and a ReplayListener feeds the information into the KVStore to restore the state of the application.
Whether at runtime or in the History Server, task state is held in instances of a limited set of classes, which are stored in the KVStore, an in-memory KV store in Spark that can hold arbitrary class instances. The front end queries the required objects from the KVStore to render the page.
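To make the event-log/replay mechanism concrete, the sketch below walks an event log file line by line, the way replay conceptually works, and collects a little summary information. The file name is a hypothetical placeholder, and the real History Server replays through Spark's listener machinery rather than hand-rolled parsing.

```python
import json

# Conceptual sketch: each line of a Spark event log is one serialized
# SparkListenerEvent in JSON; replay means reading the lines back in order
# and rebuilding state from them. The file name below is hypothetical.
summary = {"app_id": None, "completed_stages": 0, "finished_jobs": 0}

with open("application_1700000000000_0001") as event_log:
    for line in event_log:
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerApplicationStart":
            summary["app_id"] = event.get("App ID")
        elif kind == "SparkListenerStageCompleted":
            summary["completed_stages"] += 1
        elif kind == "SparkListenerJobEnd":
            summary["finished_jobs"] += 1

print(summary)
```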
The native Spark History Service has the following problems:
  • Large storage space overhead
Spark's event system is very fine-grained, so the event log records a very large number of events, most of which are useless for UI display. Moreover, the event log is generally stored as plain-text JSON, which takes up a lot of space.
  • Poor replay efficiency and high latency
The History Server restores the Spark UI by replaying and parsing the event log, which involves substantial compute overhead. For large applications the response delay is noticeable: after a big job finishes, users may have to wait ten minutes or even half an hour before they can see its history in the History Server, which greatly hurts the user experience.
  • Poor scalability
Before replaying and parsing the files, the History Server's FsHistoryProvider must scan the configured event log path, traverse the event logs, and load the metadata of all files into memory, which makes the native service stateful. Every time the service restarts it must reload the entire path before it can serve requests, and after each application finishes it must wait for the next scan round before it becomes accessible. This makes horizontal scaling difficult.
  • Not cloud native
Spark History Server is not a cloud-native service. Tenant workloads vary greatly, and the cost of adapting and maintaining it in public cloud scenarios is high.
In order to solve the previous problems, we tried to transform the History Server.
Whether in the running Spark driver or in the History Server, the process is the same: listen to events, reflect the task state changes they carry onto instances of a few UI-related classes, and store those instances in the KVStore for UI rendering. In other words, the KVStore already holds all the information the UI needs. For History Server users, in most cases we only care about an application's final state, not the specific events that caused each state change. Therefore we can persist only the KVStore, without storing large amounts of redundant event information. In addition, the KVStore natively supports Kryo serialization, whose performance is significantly better than JSON serialization. Based on this idea, we rewrote a new History Server system and named it UIService.
We collectively refer to the UI-related classes whose instances live in the KVStore as UIMeta classes; concretely, this covers the information in AppStatusStore and SQLAppStatusStore. We define a UIMetaStore class to abstract them: a UIMetaStore is the collection of all UI information for one application.
Similar to EventLoggingListener, a dedicated Listener has been developed for UIMeta - UIMetaLoggingListener, which is used to listen to events and write UIMeta files.
Compared with EventLoggingListener, which triggers a write for every event it receives and writes out serialized events, UIMetaLoggingListener is triggered only by specific events (currently stageEnd and jobEnd), and each write is a batch operation that fully persists the UIMetaStore information of the preceding stage.
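The write-on-boundary behaviour can be sketched conceptually as follows; this is only an illustration of the batching idea, not UIService's actual Scala implementation.

```python
# Conceptual sketch of batched persistence: state is updated on every event,
# but flushed to storage only at stage/job boundaries, in one batch write.
class BatchedUiMetaWriter:
    BOUNDARY_EVENTS = {"SparkListenerStageCompleted", "SparkListenerJobEnd"}

    def __init__(self, sink):
        self.sink = sink          # e.g. an object with a write(batch) method
        self.pending = []         # UI state accumulated since the last flush

    def on_event(self, event):
        self.pending.append(event)
        if event["Event"] in self.BOUNDARY_EVENTS:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.write(list(self.pending))  # one batch instead of many small writes
            self.pending.clear()
```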
We use UIMetaProvider to replace the original FsHistoryProvider. The main differences are:
  • Instead of reading event log files and replaying them to build the KVStore, it reads the UIMetaFile and deserializes the UIMetaStore directly.
  • The path-scanning logic of FsHistoryProvider has been removed; each UI access reads and parses the UIMetaFile directly, locating it from the application ID and path rules. UIService therefore no longer needs to preload the metadata of all files, server resources no longer need to grow with the number of applications, and horizontal scaling becomes easy.
By building UIService, we have greatly reduced the storage space of Spark UI-related events and significantly improved UI access latency. Architecturally, UIService also gives us multi-tenant access isolation, cloud-native deployment, and elastic scalability.
 

Falcon: Remote Shuffle Service

In addition to UIService, optimization at the Shuffle level is also a topic worth sharing. Shuffle is a process used in Spark jobs to connect upstream and downstream data interactions. The service that provides Shuffle capabilities is called Shuffle Service. Initially, Spark implemented a hash-based Shuffle Service internally, and later introduced a sort-based Shuffle Service. Although Spark continues to iterate and improve the Shuffle mechanism internally, due to the coupling limitation between storage and computing, the usability of the Shuffle mechanism implemented within Spark is limited in certain scenarios.
In order to solve this problem, the industry has proposed a Shuffle design that separates Shuffle Service from Spark, usually called Remote Shuffle Service (RSS). RSS allows Shuffle Service to run outside of Spark, decoupling storage and computing, providing better availability and performance.
 
Falcon is the Remote Shuffle Service on LAS. It adopts a highly available, storage-compute separated architecture, supports the Spark engine in reading and writing remote shuffle data, and can be deployed and used in cloud environments.
Falcon aggregates data for the reducer side, solving the fragmented-read problem of the shuffle stage. This greatly reduces the dependence on disk IOPS, lowers the risk of OOM (out of memory) in Spark executors at runtime, and ensures the stability of jobs with large shuffles. In addition, Falcon provides Tiered Storage: different media (memory/SSD/HDD) can be chosen to store shuffle data depending on job size, further improving the execution performance of small and medium jobs.
On top of Falcon, LAS provides a CRC verification scheme to guarantee the integrity of shuffle data and avoid correctness problems caused by data loss.
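For context on how a remote shuffle service is typically wired into a Spark job, the sketch below replaces the shuffle manager with an RSS client class. The class name and the coordinator configuration key are hypothetical placeholders; Falcon's actual client configuration is internal to LAS.

```python
from pyspark.sql import SparkSession

# Minimal sketch: Spark allows the shuffle implementation to be swapped via
# spark.shuffle.manager. The class name and coordinator address below are
# hypothetical; they stand in for an RSS client such as Falcon's.
spark = (
    SparkSession.builder
    .appName("rss-demo")
    .config("spark.shuffle.manager", "com.example.falcon.FalconShuffleManager")
    .config("spark.falcon.coordinator.address", "falcon.example.internal:9097")
    .getOrCreate()
)

# Shuffle-heavy operations (joins, aggregations) would then read and write
# shuffle data through the remote shuffle service instead of local disks.
```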

Summary

The above covers LAS Spark's practice and optimization in cloud-native lakehouse analytics. Whether for large-scale data processing, real-time analytics, or complex AI workloads, Spark-based cloud-native lakehouse analytics provides enterprises with a powerful solution. Through this practical overview, we hope to help readers understand the designs and practices behind LAS, the Spark-based cloud-native integrated lakehouse analytics service, and to apply Spark and cloud-native technology in real projects to support enterprises' data-driven decision-making and innovation.
Due to space limitations, this article cannot elaborate on many practical details; we will cover them in more depth in subsequent articles. Interested readers are welcome to follow our future sharing.
 
Lakehouse Analytics Service (LAS) is a serverless data processing and analytics service for the integrated lakehouse architecture. Based on ByteDance's best practices, it provides one-stop, EB-scale data storage, computing, and interactive analysis, is compatible with the Spark and Presto ecosystems, and helps enterprises easily build intelligent real-time lakehouses.
 