How to make good use of a cloud-native data lake?

Introduction: Data lakes help companies cope with an ever-growing number of data scenarios, increasingly complex data structures, and increasingly diverse data processing requirements. Alibaba Cloud has been investing in data lakes since 2018 and launched the cloud-native Data Lake Analytics (DLA) service, spanning data lake management (helping customers efficiently manage and build data lakes), Serverless Spark (cost-effective large-scale computing), and Serverless SQL (cost-effective online interactive analysis) to help customers tap the value of their data. This article shares the related technical challenges and solutions.


1. Opportunities and challenges of the data lake

Data lakes help companies cope with ever more data scenarios, ever more complex data structures, and ever more diverse data processing requirements. A Gartner report released in 2020 shows that 39% of users are already using a data lake and a further 34% are considering adopting one within a year.

Alibaba Cloud has been investing in data lakes since 2018. Combined with the object storage service OSS, the cloud-native Data Lake Analytics (DLA) product delivers a competitive platform featuring elastic scaling, pay-as-you-go pricing, and a fully managed service. By separating storage from compute, both are billed purely on demand, so users pay only for the computation that actually generates value; DLA deeply customizes cloud-native elasticity to achieve job-level scaling, spinning up as many as 300 nodes within one minute. Compared with traditional data analysis solutions, cloud-native DLA is a major improvement in cost, elasticity, and delivery capability.


Thousands of companies already use data lake services for their data applications on the cloud. For example, Umeng+'s U-DOP data open platform builds on Umeng+'s years of accumulated experience in big data to collect and process multi-terminal data from apps, web pages, mini programs, ad marketing, social sharing, and push notifications, turning it into standardized multi-terminal data assets for customers. In particular, the elasticity of the data lake helped absorb the business surges during the Double Eleven peak: by analyzing changes in search keywords and in homepage recommendations, segmenting active users and ordering users by channel, and adjusting promotional strategies in time, customers attracted more new purchases and repeat purchases.

The integration of databases and big data is accelerating: traditional database users and DBAs can now use and maintain big data systems and solve big data problems in an integrated way. Concretely, DLA seamlessly connects database data with big data, for example through its one-click warehouse building feature, and DLA Serverless SQL is compatible with the MySQL protocol and part of its syntax.

With the DLA Serverless product form, developers only use platform interfaces, such as DLA SQL's JDBC interface for submitting SQL (see the sketch after the list below) and DLA Spark's OpenAPI for submitting Spark jobs. Developers can focus on the business logic itself rather than on the platform's internal complexity, and many of the pain points of using open source components directly simply disappear:

High barriers to entry

The Hadoop ecosystem often requires using several components at once, such as YARN, HDFS, Spark, Hive, Kerberos, and ZooKeeper. Developers have to understand all of them, because these components are constantly exposed during development.

Difficulty in development and maintenance

During development, developers run into usage problems caused by each of these components and must understand them all to diagnose the issues, which adds a considerable burden.

Stability is difficult to guarantee

Open source components run well only after careful tuning and appropriately sized hardware; many bugs have to be fixed by hand, and when problems arise no one is obligated to provide an answer.

Lack of performance optimization for the cloud

Cloud services such as OSS and PolarDB are cloud-native components, but the open source stack has not been adequately adapted to them and therefore fails to exploit the higher performance they can offer.
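As promised above, here is a minimal sketch of submitting SQL through DLA SQL's JDBC interface. Since DLA Serverless SQL is compatible with the MySQL protocol, a standard MySQL JDBC driver suffices; the endpoint, credentials, database, and table names below are hypothetical placeholders, not real DLA values.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SubmitDlaSql {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and credentials; DLA Serverless SQL speaks the
        // MySQL protocol, so an ordinary MySQL JDBC driver can connect.
        String url = "jdbc:mysql://service.example-region.datalakeanalytics.aliyuncs.com:10000/my_lake_db";
        try (Connection conn = DriverManager.getConnection(url, "<accessUser>", "<accessPassword>");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) AS cnt FROM oss_events GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString("category") + ": " + rs.getLong("cnt"));
            }
        }
    }
}
```

The point is that the developer's entire surface area is one SQL connection; nothing of YARN, HDFS, or the rest of the stack leaks through.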

DLA helps customers tap the value of their data from three angles: data lake management (helping customers efficiently manage and build data lakes), Serverless Spark (cost-effective large-scale computing), and Serverless SQL (cost-effective online interactive analysis). The overall structure is shown below. The rest of this article describes the technical challenges and solutions in each of these three areas.

[Figure: overall architecture of DLA]

2. How to manage and build a data lake?

The difficulty of data management in a data lake shows up in two ways:

  • How to efficiently build metadata for the data already stored on OSS?
  • How to efficiently ingest non-OSS data into the lake and build warehouses from it?

The main data lake management functions include metadata management, metadata discovery, database ingestion and warehouse building, and real-time data ingestion. Below we focus on two key technologies: automatic metadata construction for massive files, and data organization for building warehouses in the lake.

1 Automatic metadata construction technology for massive files

When OSS is used as the data lake storage, the data files have the following characteristics:

  • Rich formats: CSV, Text, JSON, Parquet, ORC, Avro, Hudi, Delta Lake, and more, where CSV and Text may use a variety of custom separators.
  • Millions of files: thanks to OSS's scalability and cost-effectiveness, users commonly store files at the million level.
  • Dynamic uploads: data files are uploaded to OSS continuously, so metadata for new files must be updated quickly and incrementally.

To efficiently build metadata for the massive data on OSS, Alibaba Cloud DLA proposed and implemented an automatic metadata construction technology for massive files, shown in the figure below. At its core it solves two problems: recognizing tens of thousands of tables and tens of thousands of partitions, and incrementally sensing and updating metadata.

[Figure: automatic metadata construction for massive files]

Multi-table, multi-partition recognition

The number of files in a user's OSS bucket can reach the millions. The files come in different formats such as JSON, CSV, and Text, and even files of the same format differ in schema because they belong to different businesses. Through a file schema recognizer and a file classifier, this technology automatically generates tens of thousands of tables and partitions. The schema recognizer identifies a single JSON file in about 0.15 s and a single CSV file in about 0.2 s; with a pluggable intelligent sampling strategy and distributed execution, schema recognition over millions of files completes within minutes. The file classifier aggregates, prunes, and compresses results through a tree structure, classifying millions of files in about 290 ms.
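DLA's recognizer itself is proprietary, but a minimal sketch conveys the sampling idea: read the first few lines of a JSON-lines file and union the observed fields and types. The class and its type-mapping rules below are illustrative assumptions, not DLA's actual logic.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a JSON "file schema recognizer": sample the first N lines
// of a JSON-lines file and union the observed field names and types.
public class JsonSchemaSampler {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static Map<String, String> inferSchema(Reader source, int sampleLines) throws Exception {
        Map<String, String> schema = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            int seen = 0;
            while ((line = reader.readLine()) != null && seen++ < sampleLines) {
                JsonNode record = MAPPER.readTree(line);
                record.fields().forEachRemaining(field -> {
                    String type = field.getValue().isNumber() ? "double"
                            : field.getValue().isBoolean() ? "boolean" : "string";
                    // A field seen with conflicting types degrades to string.
                    schema.merge(field.getKey(), type, (a, b) -> a.equals(b) ? a : "string");
                });
            }
        }
        return schema;
    }
}
```

Distributing this sampling across workers and capping the lines read per file is what keeps million-file recognition at the minute level.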

Incremental sensing and updates

Users keep uploading files to OSS. Automatic metadata construction must both propagate schema changes of files belonging to existing tables and create new tables for newly added files. The file schema recognizer detects changed files by observing additions and deletions on OSS, while the file classifier applies generation and change strategies for the new file schemas against the existing tables. Four strategies are currently supported: adding partitions, adding fields, keeping fields unchanged, and ignoring file deletions; more strategies can be added over time.
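Under the simplifying assumption that incremental sensing can be modeled as a diff between two listing snapshots (object key to ETag), the detection step looks roughly like the sketch below; the strategy hooks in the comments map onto the four strategies above, and none of this is DLA's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of incremental sensing: diff two OSS listing snapshots
// (object key -> ETag) to classify files as added, changed, or deleted,
// then hand each change to a pluggable update strategy.
public class ListingDiff {
    public enum Change { ADDED, CHANGED, DELETED }

    public static Map<String, Change> diff(Map<String, String> previous, Map<String, String> current) {
        Map<String, Change> changes = new HashMap<>();
        current.forEach((key, etag) -> {
            String old = previous.get(key);
            if (old == null) {
                changes.put(key, Change.ADDED);        // may create a new table or partition
            } else if (!old.equals(etag)) {
                changes.put(key, Change.CHANGED);      // may add fields to an existing table
            }
        });
        previous.keySet().forEach(key -> {
            if (!current.containsKey(key)) {
                changes.put(key, Change.DELETED);      // the "ignore deletion" strategy drops this
            }
        });
        return changes;
    }
}
```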

2 Data organization technology for building warehouses in the lake

Storing database and message log data uniformly in the data lake storage OSS supports business needs such as compute acceleration, warehouse archiving, and hot/cold data separation. DLA's data organization technology for building warehouses in the lake offers three organization and management modes: mirror mode, partition mode, and incremental mode, which can be combined to cover these business scenarios.


Mirror mode

Each synchronization copies all tables of a database in the source in full to the data lake storage OSS, keeping the load on the source database below 10% during synchronization. A globally unified data-shard scheduling algorithm keeps the data in the lake consistent with the source database.

Partition mode

For archiving scenarios, the source database is synchronized to the data lake in daily full and incremental batches and organized into time-based partitions for easy archive management. This mode achieves hour-level latency.

Incremental mode

This mode achieves T+10 min data ingestion through hybrid row/column storage, a commit log, and index management. Delta incremental files combined with asynchronous compaction solve the small-file problem; delta incremental files combined with indexes support database updates and real-time incremental writing of delete logs; and recording the partition-to-file mapping in a commit log solves the slow listing of millions of partitions under traditional catalog management.
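The commit-log idea can be sketched as follows: writers append (partition → files) entries, and readers replay the log instead of listing every partition. This in-memory class is only an illustration of the mechanism under our own simplifications; in a real system the log would live alongside the data on OSS.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of commit-log based partition management: instead of listing
// millions of partition directories, readers replay an append-only log of
// (partition -> files) entries to reconstruct the current file mapping.
public class PartitionCommitLog {
    public static class Entry {
        final String partition;
        final List<String> addedFiles;
        Entry(String partition, List<String> addedFiles) {
            this.partition = partition;
            this.addedFiles = addedFiles;
        }
    }

    private final List<Entry> log = new ArrayList<>();

    // A writer appends one entry per commit.
    public synchronized void commit(String partition, List<String> files) {
        log.add(new Entry(partition, new ArrayList<>(files)));
    }

    // A reader replays the log once instead of listing every partition.
    public synchronized Map<String, List<String>> snapshot() {
        Map<String, List<String>> mapping = new HashMap<>();
        for (Entry e : log) {
            mapping.computeIfAbsent(e.partition, p -> new ArrayList<>()).addAll(e.addedFiles);
        }
        return mapping;
    }
}
```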

3. The cloud-native data lake platform needs to integrate with the cloud infrastructure

DLA is a multi-tenant architecture deployed per region, with all users in a region sharing one set of control logic. A virtual cluster (VC) is the unit of logical isolation. The platform hosts engines such as Serverless Spark and Serverless SQL as cloud-native services.

[Figure: DLA multi-tenant platform architecture]

As shown in the figure above, the platform's main challenges are efficient resource supply, security protection, and guaranteed bandwidth to data sources.

1 Efficient supply of resources

The cloud-native platform is built on Alibaba Cloud's ECS, ACK, and ECI base and connects directly to the Alibaba Cloud IaaS resource pool. Cross-availability-zone scheduling within a region ensures resource supply: 300 nodes can be provisioned in one minute, and a single customer is guaranteed up to 50,000 compute nodes in a large region.

2 Security protection

Users can run arbitrary code on the platform, some of which may be deliberately malicious. Without protection the platform would face real security risks, so we ensure safety with the following techniques:

  • One-time token: each job applies to the TokenServer for a temporary token that expires when the job ends. If an attack is detected, the platform expires the token immediately, and access to services such as Meta is denied (see the sketch after this list).
  • DDoS and injection protection: all requests to platform services pass through the Security Protection Center, which closes the network port as soon as it detects an attack or injection attempt.
  • Compute container isolation: compute nodes run in Alibaba Cloud's self-developed secure containers, which provide the same security isolation level as a VM.
  • Security whitelist: the network between users is completely isolated.
  • ENI virtual NIC: to connect to a user's VPC, the user configures a security group and a virtual switch (VSwitch) under their own account; the compute-node container is then assigned an IP from the VSwitch subnet of the user's VPC and attached to the user's security group.
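The one-time-token mechanism referenced in the list above can be sketched as follows; the class shape, TTL handling, and revocation interface are illustrative assumptions rather than DLA's actual TokenServer.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the one-time-token idea: a token server issues a short-lived
// token per job, downstream services (e.g. Meta) accept a request only while the
// token is live, and the platform can revoke it immediately on any sign of attack.
public class TokenServer {
    private static class Lease {
        final long expiresAtMillis;
        Lease(long t) { expiresAtMillis = t; }
    }

    private final Map<String, Lease> liveTokens = new ConcurrentHashMap<>();

    public String issue(String jobId, long ttlMillis) {
        String token = jobId + ":" + UUID.randomUUID();
        liveTokens.put(token, new Lease(System.currentTimeMillis() + ttlMillis));
        return token;
    }

    // Called by Meta and other services on every request.
    public boolean validate(String token) {
        Lease lease = liveTokens.get(token);
        if (lease == null || System.currentTimeMillis() > lease.expiresAtMillis) {
            liveTokens.remove(token);
            return false;   // expired or revoked: the request is denied
        }
        return true;
    }

    // Invoked when the job finishes or an attack is detected.
    public void revoke(String token) {
        liveTokens.remove(token);
    }
}
```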

3 High-throughput network bandwidth

  • OSS is accessed through a high-throughput bandwidth service.
  • ENI technology is used to access the user's VPC, which gives the same bandwidth as deploying the compute engine on an ECS instance inside that VPC, i.e. VPC-intranet bandwidth.

4. Technical challenges of the Serverless Spark service

Apache Spark is currently the community's most popular open source compute engine. It offers streaming, SQL, machine learning, and graph processing, and it connects to a rich set of data sources. In data lake scenarios, however, the traditional cluster deployment of Spark faces the problems described above: difficult data management, high operation and maintenance cost, inelastic compute resources, and weak enterprise-level capabilities, plus poor performance when accessing OSS and difficulty debugging complex jobs.

The data lake management mechanisms from Section 2 solve the data management problems, and the multi-tenant security platform from Section 3 lets DLA Spark deliver a brand-new cloud-native serverless product form, addressing elasticity, operation and maintenance cost, and enterprise-level requirements. This section expands on the optimization of Spark's access to OSS and on the multi-tenant UI service.

1 Optimizing Spark access to OSS

Problems with the community version

By default, open source Spark accesses OSS data through the Hadoop FileFormat interface connected directly to an OSS FileSystem implementation. In practice this approach shows poor performance and cannot guarantee consistency.

(1) Poor performance when Spark accesses OSS

The root cause is the mismatch between OSS's KV model and HDFS's file-tree model. The FileFormat algorithms were designed for the HDFS file system, whereas object stores such as OSS use a KV model for the sake of scalability. The two differ substantially: a RenameDirectory call is just a pointer move in HDFS, but on a KV store every sub-file and directory must be renamed individually, which is expensive and not atomic. Hadoop FileOutputFormat first writes data to a temporary directory and only at the end moves it to the final directory; merging the file trees along the way involves a large number of rename operations.
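The cost difference is easy to see in a toy model that emulates RenameDirectory over a plain KV map: every key under the prefix must be copied and deleted individually, and a crash mid-loop leaves the "directory" half-moved. This sketch is our own illustration, not OSS internals.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of why renameDirectory is expensive on a KV-style object store:
// there is no directory pointer to move, so every key under the prefix must be
// copied and deleted one object at a time, with no atomicity across objects.
public class KvRename {
    public static void renameDirectory(Map<String, byte[]> store, String fromPrefix, String toPrefix) {
        // Snapshot the keys first to avoid mutating the map while iterating.
        for (String key : new TreeMap<>(store).keySet()) {
            if (key.startsWith(fromPrefix)) {
                String newKey = toPrefix + key.substring(fromPrefix.length());
                store.put(newKey, store.get(key));   // copy: O(object size) per key
                store.remove(key);                   // delete: a crash here leaves both halves
            }
        }
        // On HDFS the same operation is a single metadata pointer update.
    }
}
```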

(2) Consistency is difficult to guarantee

In the FileOutputFormat v1 algorithm, all file-tree merge operations happen at a single point in the AppMaster, which is very inefficient, especially in dynamic partition scenarios. To remove this single point the community provides algorithm v2, whose core idea is to parallelize the merge across tasks. That improves performance to a degree, but if the job fails, tasks that had already succeeded have written their data to the final directory, leaving dirty data behind.

Optimizing Spark's OSS access

[Figure: FileOutputFormat implementation based on OSS MultipartUpload]

(1) A FileOutputFormat implementation based on MultipartUpload

Targeting the characteristics of Spark's access to OSS, we re-implemented the Hadoop FileOutputFormat interface, as shown in the figure above. The improvement centers on the merge operation, whose essence is deciding when a file becomes visible. OSS provides a MultipartUpload interface (resumable upload): a file can be uploaded in parts, and until the upload is completed the parts remain invisible. Exploiting this property, tasks write data directly into the final directory, and files become visible only once the job succeeds. This removes the temporary directory entirely and drastically cuts metadata operations. Part files left behind by failed tasks are removed with an Abort operation at the end of the job, reclaiming the space.
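A simplified sketch of the visibility trick, using the Aliyun OSS Java SDK's multipart-upload API (this is not DLA's actual FileOutputFormat code, and the endpoint, credentials, bucket, and key are placeholders): parts written under the final path stay invisible until the job-level complete call, and a failed job simply aborts them.

```java
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.model.*;

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the committer idea: tasks upload parts straight under the
// FINAL object key; nothing is visible until completeMultipartUpload runs at
// job commit, and a failed job aborts the invisible parts.
public class MultipartCommitSketch {
    public static void main(String[] args) {
        OSS oss = new OSSClientBuilder().build(
                "https://oss-cn-hangzhou.aliyuncs.com", "<accessKeyId>", "<accessKeySecret>");
        String bucket = "my-lake-bucket";
        String key = "warehouse/events/part-00000.parquet";   // final location, no temp dir

        String uploadId = oss.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
        try {
            List<PartETag> etags = new ArrayList<>();
            byte[] data = "task output bytes".getBytes();     // stands in for real task output
            UploadPartRequest part = new UploadPartRequest();
            part.setBucketName(bucket);
            part.setKey(key);
            part.setUploadId(uploadId);
            part.setPartNumber(1);
            part.setInputStream(new ByteArrayInputStream(data));
            part.setPartSize(data.length);
            etags.add(oss.uploadPart(part).getPartETag());    // uploaded, but still invisible

            // "Job commit": only now does the object become visible to readers.
            oss.completeMultipartUpload(
                    new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
        } catch (Exception e) {
            // "Job abort": discard the invisible parts written by failed tasks.
            oss.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId));
        } finally {
            oss.shutdown();
        }
    }
}
```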

On Terasort, a typical Spark ETL benchmark, with 1 TB of input data, DLA FileOutputFormat reduces execution time by 62%, a 163% performance improvement. In dynamic partition scenarios, community algorithm v1 fails outright while v2 completes; compared with v2, DLA FileOutputFormat improves performance by a further 124%.

(2) OSS Metadata Cache

When Spark reads OSS, the ResolveRelation phase traverses OSS directories to parse the table structure, the partition structure, and the schema, generating many metadata operations in which the metadata of the same OSS object is fetched repeatedly. To address this we implemented an OSS metadata cache: metadata for an OSS object is cached locally on first access, and subsequent accesses read the local cache, minimizing calls to OSS. The cache roughly doubles ResolveRelation performance and improves typical Spark query scenarios by about 60% overall.
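The cache itself can be as simple as a map from object path to metadata, as in this sketch; the loader function stands in for the real OSS metadata call, and eviction policy is omitted for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal sketch of an OSS metadata cache: the first lookup of an object's
// metadata (size, modification time, schema fragment, ...) goes to OSS; every
// later lookup within the same query-planning phase is served locally.
public class OssMetaCache<M> {
    private final ConcurrentHashMap<String, M> cache = new ConcurrentHashMap<>();
    private final Function<String, M> loader;   // stands in for the real OSS metadata call

    public OssMetaCache(Function<String, M> loader) {
        this.loader = loader;
    }

    public M get(String objectPath) {
        // computeIfAbsent invokes the loader at most once per object path.
        return cache.computeIfAbsent(objectPath, loader);
    }

    public void invalidate(String objectPath) {
        cache.remove(objectPath);   // e.g. after the object is rewritten
    }
}
```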

2 Multi-tenant UI service

UI services matter a great deal to developers, who rely on them to debug jobs and troubleshoot production issues. A good UI service speeds up development considerably.

Pain points of HistoryServer

The HistoryServer provided by the Spark community serves the UI and logs of historical Spark jobs, but it has several pain points in practice:

(1) Large Eventlog space overhead

HistoryServer requires the Spark engine to record every runtime event to a FileSystem, then replays them in the background to render the UI pages. For complex or long-running jobs the Eventlog becomes very large, reaching the hundred-GB or even TB level.

(2) Complex and long-running jobs are not supported

The Eventlog of a complex or long-running job is so large that HistoryServer fails to parse it or even OOMs. Combined with the space overhead, users generally end up disabling Eventlog altogether.

(3) Inefficient replay and high latency

HistoryServer restores the Spark UI by replaying the Eventlog in the background, which effectively re-runs every engine event. This is expensive and slow; with many or complex jobs the latency can reach minutes or even tens of minutes.

DLA multi-tenant SparkUI


The SparkUI service is a multi-tenant UI service developed by the DLA platform itself, deeply optimized relative to the community solution:

(1) Eliminating the Eventlog

DLA Spark removes the Eventlog dependency: just before the job ends, the Spark Driver dumps the UI's meta information to OSS. Compared with the Eventlog, this information is drastically smaller; even very complex jobs stay at the MB level. UiServer then reads the UI meta from OSS and deserializes it to render the SparkUI pages.
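A minimal sketch of the dump step, assuming a tiny stand-in UI model and Jackson for serialization; the real UI meta format and OSS layout are internal to DLA, and the endpoint and credentials below are placeholders.

```java
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.ByteArrayInputStream;
import java.util.List;

// Minimal sketch of the "no Eventlog" idea: at job end the driver serializes a
// compact UI snapshot (not the full event stream) and writes it to OSS, where
// UiServer can later read and deserialize it to render the page.
public class UiMetaDump {
    public static class UiMeta {            // a tiny stand-in for the real UI model
        public String jobId;
        public List<String> completedStages;
        public long durationMillis;
    }

    public static void dump(UiMeta meta, String bucket) throws Exception {
        byte[] json = new ObjectMapper().writeValueAsBytes(meta);   // MBs at most, vs GB/TB of Eventlog
        OSS oss = new OSSClientBuilder().build(
                "https://oss-cn-hangzhou.aliyuncs.com", "<accessKeyId>", "<accessKeySecret>");
        try {
            oss.putObject(bucket, "ui-meta/" + meta.jobId + ".json",
                    new ByteArrayInputStream(json));
        } finally {
            oss.shutdown();
        }
    }
}
```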

(2) Horizontally scalable UIServer

UIServer parses historical UI meta and serves Stderr and Stdout logs. It is lightweight and stateless, so it scales horizontally and can serve tens of thousands of customers online simultaneously. UIServer URLs carry an encrypted token as a parameter; the token encodes the user identity and job ID, which is how UIServer implements multi-tenant service.

(3) Automatic local log rolling

For long-running jobs, Stderr and Stdout output accumulates over time and can eventually fill the disk. The DLA Spark secure container runs a built-in background process that rolls the logs, preserving the most valuable recent output.

5. Technical challenges of the Serverless SQL service

DLA Serverless SQL is a cloud-native data lake engine based on PrestoDB, which is now hosted under the Linux Foundation. Alibaba is a member of the Presto Foundation and continuously contributes to and optimizes Presto. The PrestoDB engine itself has excellent properties:

  • Extreme speed from fully in-memory computation.
  • Strong expressiveness from complete SQL semantics.
  • An easy-to-use plugin mechanism that enables federated queries over any data source.
  • A strong community, so there are no worries about long-term support.

However, community PrestoDB is a single-tenant engine: it assumes deployment within a single company, so it invests little in compute isolation, high availability, and the like. Using it directly as the engine of a cloud-native service raises two problems:

  • A user who submits many large queries can consume all cluster resources, making the service unavailable to everyone else.
  • The single Coordinator means the availability of the whole service cannot be guaranteed.

We made a series of optimizations and modifications so that it can serve all users in cloud-native form. Here we focus on two key features: multi-tenant isolation and multi-Coordinator.

First, let's look at the overall architecture of DLA Serverless SQL:

[Figure: overall architecture of DLA Serverless SQL]

Around the core PrestoDB cluster we built an access layer, a unified metadata service, and other supporting services to give users a stable and convenient experience. Below we analyze the multi-tenant isolation and multi-Coordinator technologies in detail.

1 Multi-tenant isolation technology

PrestoDB natively supports resource groups, which impose a degree of CPU and memory limits between groups, but several problems prevent us from building compute isolation on top of them:

  • At the global scheduling level: a tenant that uses too much compute is not penalized in time; only its new queries are blocked.
  • At the worker scheduling level: the splits of all tenants are scheduled in a single queue, so a tenant with too many splits degrades everyone else.

Our multi-tenant compute isolation solution is as follows:

[Figure: multi-tenant compute isolation solution]

We introduced a ResourceManager module that collects resource usage information for every tenant from all Coordinators, compares it against preconfigured compute thresholds to decide which tenants should be penalized, and pushes the penalty list to all workers. When a worker schedules splits, it consults the penalty list to decide which tenants' queries run and which wait, isolating compute between tenants. In our tests, a tenant that overuses resources is penalized within at most 1.3 seconds, releasing resources to other tenants, whereas the community default effectively "penalizes" only after the tenant's queries have finished executing. With both metadata and compute isolated, we can confidently serve all users from one cluster.
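A minimal sketch of the worker-side half of this scheme: per-tenant split queues plus a penalty set pushed by the ResourceManager. The class and its data structures are illustrative assumptions; Presto's real split scheduler is considerably more involved.

```java
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal sketch of penalty-aware split scheduling on a worker: one split queue
// per tenant; tenants currently on the ResourceManager's penalty list are
// skipped, so their splits wait while everyone else keeps running.
public class PenaltyAwareScheduler {
    private final ConcurrentHashMap<String, Queue<Runnable>> splitsByTenant = new ConcurrentHashMap<>();
    private final Set<String> penalized = ConcurrentHashMap.newKeySet();

    public void submit(String tenant, Runnable split) {
        splitsByTenant.computeIfAbsent(tenant, t -> new ConcurrentLinkedQueue<>()).add(split);
    }

    // Pushed periodically by the ResourceManager (within about a second).
    public void updatePenalties(Set<String> tenantsOverBudget) {
        penalized.clear();
        penalized.addAll(tenantsOverBudget);
    }

    // Called by the worker's execution loop to pick the next split to run.
    public Runnable nextSplit() {
        for (Map.Entry<String, Queue<Runnable>> entry : splitsByTenant.entrySet()) {
            if (penalized.contains(entry.getKey())) {
                continue;            // penalized tenants yield capacity immediately
            }
            Runnable split = entry.getValue().poll();
            if (split != null) {
                return split;
            }
        }
        return null;                 // nothing runnable right now
    }
}
```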

2 Multi-Coordinator technology

In the community version of Presto, the Coordinator is a single point of failure, which causes two problems:

  • Availability risk: once the Coordinator goes down, the whole cluster is unavailable for 5 to 10 minutes.
  • No seamless upgrades: the upgrade process disrupts every user's queries.

We adopted the following architectural plan:

[Figure: multi-Coordinator architecture]

First, we placed a new FrontNode module in front of Presto's Coordinators: users connect to FrontNode rather than directly to the underlying Coordinators, so how many Coordinators exist, and which one is currently serving a given user, is completely transparent to users. This makes the architecture flexible enough to scale out the Coordinators underneath.

After FrontNode receives a user's query, it forwards it to the underlying Coordinators in round-robin fashion so that multiple Coordinators share the load. Some global duties still need a single Coordinator, such as Presto's worker status monitoring and the OOM killer, so we introduced ZooKeeper for Coordinator election. After election, the main Coordinator's responsibilities resemble those in community Presto: global worker status monitoring, OOM killing, and executing the queries assigned to it. The other Coordinators carry a lighter duty: they only execute the queries assigned to them.
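Leader election of this kind is commonly built with Apache Curator on top of ZooKeeper. The following sketch shows the shape of it with a hypothetical ZooKeeper address and election path; it is not DLA's actual election code.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Minimal sketch of Coordinator election with ZooKeeper via Apache Curator:
// every Coordinator runs a LeaderLatch; whichever holds leadership takes on
// the global duties, and a crash triggers re-election within seconds.
public class CoordinatorElection {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1.example:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        String coordinatorId = args.length > 0 ? args[0] : "coordinator-1";
        try (LeaderLatch latch = new LeaderLatch(zk, "/dla/sql/leader", coordinatorId)) {
            latch.start();
            while (true) {   // the real service would hook this into its main loop
                if (latch.hasLeadership()) {
                    // Main Coordinator: global worker monitoring, OOM killer,
                    // plus the queries routed to it by FrontNode.
                } else {
                    // Standby Coordinator: only executes queries routed to it.
                }
                Thread.sleep(1000);
            }
        }
    }
}
```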

If a Coordinator goes down for any reason, ZooKeeper detects the failure within seconds and re-elects the leader; users only need to retry the affected queries. We are also working on automatic query retry: failures determined to be system-caused are retried automatically, so a Coordinator failure has little impact on users.

With the multi-Coordinator architecture, seamless upgrades become simple: we proactively remove one Coordinator or Worker from the cluster, upgrade it, and rejoin it afterwards. Users perceive nothing, because a working cluster keeps serving them throughout the upgrade. For example, while upgrading a Coordinator, the cluster looks like this:

[Figure: the cluster during a Coordinator upgrade]

Through multi-tenant isolation, the multi-Coordinator architecture, and other optimizations, we built the Alibaba Cloud cloud-native data lake Serverless SQL engine on PrestoDB, capable of serving all users.

6. End-to-end best practices for the cloud-native data lake

[Figure: end-to-end cloud-native data lake solution]

As shown in the scheme above, DLA provides an end-to-end solution. Facing the management and ingestion difficulties that come with OSS's data openness, DLA data lake management helps you build a secure data lake in one stop:

  • A unified, open Meta service manages OSS data and supports database and table permissions.
  • The metadata crawling function builds metadata for OSS data with one click, automatically recognizing formats such as CSV/JSON/Parquet and creating database and table information for downstream compute engines.
  • One-click synchronization from databases such as RDS/PolarDB/MongoDB to OSS storage builds a tiered hot/cold business architecture and enables insight analysis over massive multi-source data.
  • Streaming ingestion in Hudi format meets the T+10 min latency requirement, greatly shortening end-to-end analysis latency.
  • Serverless SQL analysis lets you use the data lake the moment it is opened: users run standard SQL queries without purchasing any resources.
  • OSS cache acceleration for data lake storage improves performance by 10x.
  • Analysis of ten types of data sources is supported, such as RDS, PolarDB, ADB, and MongoDB.
  • Compared with traditional Presto or Impala solutions, price-performance improves by 10x.
  • Serverless Spark computing lets you work the data lake independently: users get the cloud-native Spark service without purchasing any resources, covering OSS data cleansing, machine learning, and user-programmable jobs over the data lake.
  • 500 nodes can be spun up per minute to join a computation.
  • Compared with a traditional self-built Spark solution, cost-effectiveness improves by 3x.

DLA involves many more technical points than this article can cover in detail; for more, please follow the cloud-native Data Lake Analytics (DLA) product page:
https://www.aliyun.com/product/datalakeanalytics

Original link: https://developer.aliyun.com/article/776439?

