How to build an extremely fast data lake analysis engine

"   Author: Alibaba Cloud EMR Open Source Big Data OLAP Team,

StarRocks Community Data Lake Analysis Team  

Foreword

As digital industrialization and industrial digitalization become important driving forces for the economy, enterprises face increasingly rich data analysis scenarios and place ever higher demands on their data analysis architecture. These new scenarios give rise to new requirements, which mainly fall into three areas:

  • Users want to import and store any amount of relational data (for example, data from operational databases and line-of-business applications) and non-relational data (for example, data from mobile applications, IoT devices, and social media)

  • Users expect their data assets to be tightly protected

  • Users expect data analysis to become faster, more flexible, and more real-time 

The emergence of data lakes satisfies the first two needs well: users can import any amount of data obtained in real time, collect data from multiple sources, and store it in the data lake in its raw form. The data lake has extremely high horizontal scalability, enabling users to store data at any scale, and its bottom layer usually relies on cheap storage, which greatly reduces storage costs. Through measures such as sensitive data identification, classification, privacy protection, resource permission control, encrypted transmission, encrypted storage, data risk identification, and compliance auditing, the data lake helps users establish a security early warning mechanism, strengthens overall security protection, and keeps data usage secure and compliant.

To further meet users' requirements for data lake analysis, we need an analysis engine designed for data lakes: one that can utilize more data from more sources in a shorter period of time, and that enables users to collaboratively process and analyze data in different ways so they can make better, faster decisions. This article will reveal the key technologies of such a data lake analysis engine, and use StarRocks to help readers further understand how such a system is architected.

After that, we will publish two more articles to introduce the internals and use cases of the ultra-fast data lake analysis engine in more detail:

  • Code walkthrough: by reading the key data structures and algorithms in the source code of StarRocks, an open source analytical database, readers can further understand the principles and concrete implementation of the ultra-fast data lake analysis engine.

  • Case study: introduces how large enterprises use StarRocks to gain real-time and flexible insight into the value of data on the data lake, helping businesses make better decisions and helping readers further understand how the theory works in practical scenarios.

What is a data lake

What is a data lake? According to Wikipedia, "A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files". Generally speaking, a data lake can be understood as a layer on top of cheap object storage or a distributed file system, so that the discrete objects or files in these storage systems can be combined to present unified semantics, such as the "table" semantics common in relational databases.

After understanding the definition of a data lake, we naturally want to know what unique capabilities a data lake can provide us, and why should we use a data lake?

Before the concept of the data lake emerged, many enterprises or organizations used HDFS or S3 to store all kinds of data generated in the daily operation of their business (for example, a company that builds an app might want to keep detailed records of the click events generated by its users). Because the value of this data may not be apparent in the short term, they put it in a cheap storage system for the time being, hoping that valuable information can be extracted from it some day in the future. However, the semantics provided by HDFS and S3 are relatively simple (HDFS exposes file semantics, and S3 exposes object semantics). Over time, engineers may no longer be able to answer what data they have stored there, and to use the data later they would have to parse it piece by piece to work out its meaning. Smart engineers therefore thought of organizing data with consistent definitions and then using additional data to describe it. This additional data is called "meta" data, because it is data that describes data. In this way, the specific meaning of the stored data can later be answered by parsing the metadata. This is the most primitive role of the data lake.
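
To make the role of metadata concrete, here is a minimal sketch in Python. The table name, file paths, and schema are hypothetical, and the structure is purely illustrative rather than the format of any particular catalog; it only shows how a metadata entry can turn a pile of raw files into something with table semantics:

# A minimal, hypothetical metadata entry: it records which raw files belong
# to a logical table and what schema should be used to interpret them.
click_events_table = {
    "table": "click_events",                      # logical table name
    "format": "parquet",                          # how the raw files are encoded
    "schema": [                                   # column names and types
        ("user_id", "bigint"),
        ("event_time", "timestamp"),
        ("page", "string"),
    ],
    "files": [                                    # the otherwise opaque objects/files
        "s3://my-bucket/click_events/part-0000.parquet",
        "s3://my-bucket/click_events/part-0001.parquet",
    ],
}

def describe(table_meta):
    """Answer 'what data is stored here?' by reading metadata instead of raw bytes."""
    cols = ", ".join(f"{name} {typ}" for name, typ in table_meta["schema"])
    return f"{table_meta['table']}({cols}) in {len(table_meta['files'])} {table_meta['format']} files"

print(describe(click_events_table))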

As users' requirements for data quality grew, data lakes began to add other capabilities. For example, they provide users with database-like ACID semantics, so that users get a point-in-time view while data is continuously written and avoid various errors while reading data; they also provide higher-performance data import capabilities, and so on. Data lakes have thus evolved from simple metadata management to offering richer, more database-like semantics.

To describe the data lake with a somewhat imprecise analogy, it is an "AP database" with cheaper storage. However, a data lake only provides data storage and organization capabilities, while a complete database must provide not only data storage but also data analysis capabilities. Therefore, how to build an efficient analysis engine for the data lake and give users the ability to gain insight into their data is the focus of this article. The following chapters gradually take apart the internal structure and implementation of a modern OLAP analysis engine:

  • How to perform ultra-fast analysis on the data lake

  • Architecture of a modern data lake analytics engine

How to perform ultra-fast analysis on the data lake

Starting from this section, let's go back to the database textbook: an analysis engine for a data lake has the same architecture as an analysis engine for a database. It is usually divided into the following parts:

  • Parser: Parses the query statement entered by the user into an abstract syntax tree

  • Analyzer: Analyze whether the syntax and semantics of the query statement are correct and conform to the definition

  • Optimizer: Generate higher-performance, lower-cost physical query plans for queries

  • Execution Engine: Execute physical query plans, collect and return query results

For a data lake analysis engine, the Optimizer and the Execution Engine are the two core modules that determine its performance. Below, we dissect the core technical principles of these two modules along three dimensions and compare different technical solutions, to help readers understand the ins and outs of a modern data lake analysis engine.

RBO vs CBO

Basically, the job of the optimizer is to generate the least expensive (or at least a relatively cheap) execution plan for a given query. Different execution plans can differ in performance by a factor of thousands; the more complex the query and the larger the data volume, the more important query optimization becomes.

Rule Based Optimization (RBO) is an optimization strategy commonly used by traditional analysis engines. The essence of RBO is to transform the query through a set of pre-established rules based on equivalent transformations of relational algebra, so as to obtain a lower-cost execution plan. Common RBO rules include predicate pushdown, Limit pushdown, constant folding, and so on. RBO follows a strict set of rules: as long as you write the query statement according to the rules, the generated execution plan is fixed regardless of the content of the data tables. In real business environments, however, data volume seriously affects query performance, and RBO has no way to use this information to obtain a better execution plan.
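
As a concrete illustration of the rule-based idea, the following minimal Python sketch applies a predicate-pushdown rule to a toy logical plan. The plan node classes and the single rule are simplified stand-ins invented for this example, not the implementation of any particular engine:

# Toy logical plan nodes (illustrative only).
class Scan:
    def __init__(self, table): self.table = table

class Predicate:
    def __init__(self, expr, tables): self.expr, self.tables = expr, set(tables)

class Filter:
    def __init__(self, predicate, child): self.predicate, self.child = predicate, child

class Join:
    def __init__(self, left, right): self.left, self.right = left, right

def push_down_filter(node):
    # RBO rule: Filter(Join(A, B)) -> Join(Filter(A), B) when the predicate only
    # references table A. Purely structural; no statistics are consulted.
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        if isinstance(join.left, Scan) and node.predicate.tables == {join.left.table}:
            return Join(Filter(node.predicate, join.left), join.right)
    return node  # rule does not apply, keep the plan as-is

# Example: SELECT ... FROM A JOIN B WHERE A.x < 10
plan = Filter(Predicate("A.x < 10", ["A"]), Join(Scan("A"), Scan("B")))
optimized = push_down_filter(plan)  # the filter now runs on A before the join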

To overcome the limitations of RBO, the Cost Based Optimization (CBO) strategy came into being. CBO estimates the cost of execution plans by collecting statistics on the data, including the size of the dataset, the number of columns, and the cardinality of the columns. For example, suppose we have three tables A, B, and C and want to execute A join B join C. Without statistics, we cannot judge the cost difference between different join orders. If we collect statistics on the three tables and find that tables A and B each contain 1M rows while table C contains only 10 rows, then executing B join C first greatly reduces the amount of data in the intermediate result, a judgment that is essentially impossible to make without statistics.
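
The following Python sketch shows how statistics let the optimizer compare the two join orders above. The row counts match the example, but the cost model is deliberately crude and invented for illustration: it assumes a key join whose output is bounded by the smaller input, and scores a plan by the size of its first intermediate result.

# Hypothetical statistics collected for each table: row counts only.
stats = {"A": 1_000_000, "B": 1_000_000, "C": 10}

def join_size(left_rows, right_rows):
    # Assumption: joins are on a key of the smaller side, so the output is
    # roughly bounded by the smaller input. Real CBOs use cardinality stats,
    # histograms, and selectivity estimates instead.
    return min(left_rows, right_rows)

# Crude cost: the size of the intermediate result produced by the first join
# (the data that must be kept in memory or shuffled before the second join).
cost_ab_first = join_size(stats["A"], stats["B"])   # ~1,000,000 intermediate rows
cost_bc_first = join_size(stats["B"], stats["C"])   # ~10 intermediate rows

print(cost_ab_first, cost_bc_first)  # CBO picks the order with the smaller intermediate result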

As query complexity increases, the state space of possible execution plans becomes huge. Anyone who has worked on algorithm problems knows that once the state space is very large, brute-force search is no longer feasible, and a good search algorithm becomes particularly important. CBO usually uses a dynamic programming algorithm to obtain the optimal solution while avoiding the cost of repeatedly computing sub-plans. When the state space grows beyond a certain point, we can only fall back to a greedy algorithm or some other heuristic to obtain a local optimum. Essentially, the choice of search algorithm is a trade-off between search time and result quality.
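
As a sketch of the heuristic end of that trade-off, the following greedy join-order search simply keeps joining in the table that keeps the estimated intermediate result smallest. It reuses the crude "smaller input bounds the output" assumption from the previous example and is not how any production optimizer estimates cardinality:

def greedy_join_order(table_rows):
    """Greedy heuristic: start from the smallest table, then repeatedly add the
    table that keeps the estimated intermediate result smallest.
    table_rows: dict mapping table name -> estimated row count."""
    remaining = dict(table_rows)
    current = min(remaining, key=remaining.get)   # start with the smallest table
    order = [current]
    size = remaining.pop(current)
    while remaining:
        # Pick the next table minimizing the estimated join output
        # (assumption: output ~ min of the two input sizes).
        nxt = min(remaining, key=lambda t: min(size, remaining[t]))
        size = min(size, remaining.pop(nxt))
        order.append(nxt)
    return order

print(greedy_join_order({"A": 1_000_000, "B": 1_000_000, "C": 10}))  # e.g. ['C', 'A', 'B']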

(Common CBO implementation architecture)

Record Oriented vs Block Oriented

An execution plan can be viewed as a chain of operators (operators of relational algebra) connected end to end, where the output of one operator is the input of the next. Traditional analysis engines are Row Oriented, meaning that an operator's input and output are row-by-row data.

As a simple example, suppose we have the following table and query: 

CREATE TABLE t (n int, m int, o int, p int); 

SELECT o FROM t WHERE m < n + 1;

Example source: GitHub - jordanlewis/exectoy

When the above query statement is expanded into an execution plan, it is roughly as shown in the following figure:

Usually, in the Row Oriented model, the execution process of the execution plan can be represented by the following pseudocode:

next: 
  for: 
    row = source.next() 
    if filterExpr.Eval(row): 
      // return a new row containing just column o 
      returnedRow = newRow() 
      for col in selectedCols: 
        returnedRow.append(row[col]) 
      return returnedRow

According to the evaluation of DBMSs On A Modern Processor: Where Does Time Go?, this execution method has a large number of L2 data stalls and L1 I-cache stalls, and the efficiency of branch prediction is low.

With the vigorous development of hardware such as disks, and with the widespread use of compression algorithms, encoding algorithms, and other storage techniques that trade CPU for IO, the CPU has gradually become the bottleneck of the analysis engine. To solve the problems of Row Oriented execution, the academic community began to look for solutions. The paper Block oriented processing of Relational Database operations in modern Computer Architectures proposes passing data between operators block by block, which amortizes the time-consuming work of condition checking and branch prediction over a block. MonetDB/X100: Hyper-Pipelining Query Execution goes a step further and proposes changing the data layout from Row Oriented to Column Oriented, which further improves CPU cache efficiency and is also friendlier to compiler optimizations. In the Column Oriented model, the execution of the plan can be represented by the following pseudocode:

// first create an n + 1 result, for all values in the n column 
projPlusIntIntConst.Next(): 
  batch = source.Next() 

  for i < batch.n: 
    outCol[i] = intCol[i] + constArg 

  return batch 

// then, compare the new column to the m column, putting the result into 
// a selection vector: a list of the selected indexes in the column batch 

selectLTIntInt.Next(): 
  batch = source.Next() 

  for i < batch.n: 
    if int1Col[i] < int2Col[i]: 
      selectionVector.append(i) 

  return batch with selectionVector 

// finally, we materialize the batch, returning actual rows to the user, 
// containing just the columns requested: 

materialize.Next(): 
  batch = source.Next() 

  for s < batch.n: 
    i = selectionVector[s] 
    returnedRow = newRow() 
    for col in selectedCols: 
      returnedRow.append(cols[col][i]) 
    yield returnedRow

It can be seen that Column Oriented has better data locality and instruction locality, which is beneficial to improve the hit rate of CPU Cache, and it is easier for the compiler to perform SIMD optimization.
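
To make the contrast concrete outside of pseudocode, here is a small sketch using Python and NumPy (the batch size and random data are made up, and NumPy stands in for a real columnar engine) that evaluates the same query, SELECT o FROM t WHERE m < n + 1, over a columnar batch. Each operation runs over a whole column at a time, which is exactly the pattern that cache-friendly, SIMD-friendly engines exploit:

import numpy as np

# One columnar batch of table t(n, m, o, p); the values are made up.
batch_size = 1024
n = np.random.randint(0, 100, batch_size)
m = np.random.randint(0, 100, batch_size)
o = np.random.randint(0, 100, batch_size)

# Projection: compute n + 1 for the whole column in one pass.
n_plus_1 = n + 1

# Selection: build a selection vector of passing row indexes in one pass.
selection_vector = np.flatnonzero(m < n_plus_1)

# Materialization: gather only column o for the selected rows.
result = o[selection_vector]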

Pull Based vs Push Based

In a database system, the input SQL statement is usually converted into a series of operators, from which a physical execution plan is generated to perform the actual computation and return the result. In the generated physical execution plan, operators are usually pipelined. There are two common pipelining approaches:

  • Data-driven Push Based mode: upstream operators push data to downstream operators

  • Demand-driven Pull Based mode: downstream operators actively pull data from upstream operators. The classic Volcano model is a Pull Based model.

The Push Based execution mode improves cache efficiency and can generally deliver better query performance. A minimal sketch of the two styles follows the reference below.

Reference: Push vs. Pull-Based Loop Fusion in Query Engines
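
The following Python sketch contrasts the two interfaces on a trivial scan-filter pipeline. The operator names and the list-based "scan" are invented for illustration; real engines work on batches and generate or compile these loops rather than interpreting them tuple by tuple:

# Pull Based (Volcano style): each operator exposes next(), and the consumer
# drives execution by repeatedly asking its child for the next tuple.
class PullScan:
    def __init__(self, rows): self.rows, self.pos = rows, 0
    def next(self):
        if self.pos >= len(self.rows): return None
        row = self.rows[self.pos]; self.pos += 1
        return row

class PullFilter:
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def next(self):
        while (row := self.child.next()) is not None:
            if self.pred(row): return row
        return None

# Push Based: each operator exposes consume(), and the producer drives
# execution by pushing every tuple it produces into its parent.
class PushFilter:
    def __init__(self, parent, pred): self.parent, self.pred = parent, pred
    def consume(self, row):
        if self.pred(row): self.parent.consume(row)

class PushCollect:
    def __init__(self): self.out = []
    def consume(self, row): self.out.append(row)

rows = [1, 5, 9, 3]
# Pull: the root loops, pulling from below.
pull_root = PullFilter(PullScan(rows), lambda r: r > 4)
pulled = []
while (r := pull_root.next()) is not None: pulled.append(r)
# Push: the scan loops, pushing upward.
sink = PushCollect()
push_root = PushFilter(sink, lambda r: r > 4)
for r in rows: push_root.consume(r)
assert pulled == sink.out == [5, 9]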

Architecture of a modern data lake analytics engine

Through the introduction in the previous section, readers should now have a corresponding understanding of the cutting-edge theory behind data lake analysis engines. In this section, we take StarRocks as an example to further introduce how a data lake analysis engine organically combines the above theory and presents it to users through an elegant system architecture.

As shown in the figure above, the architecture of StarRocks is very simple. The core of the whole system consists of only two kinds of processes, Frontend (FE) and Backend (BE), and it does not depend on any external components, which makes deployment and maintenance easy. FE is mainly responsible for parsing query statements (SQL), optimizing queries, and query scheduling, while BE is mainly responsible for reading data from the data lake and performing operations such as Filter and Aggregate.

Frontend

 

The main function of FE is to turn SQL statements, through a series of transformations and optimizations, into Fragments that BE can execute. A less-than-precise but easy-to-understand analogy: if the BE cluster is regarded as a distributed thread pool, then a Fragment is a Task in that thread pool. Going from SQL text to Fragments, the main work of FE consists of the following steps (a sketch of the pipeline follows the list):

  • SQL Parse: Convert SQL text into an AST (Abstract Syntax Tree)

  • Analyze: Syntactic and semantic analysis based on AST

  • Logical Plan: Convert AST to Logical Plan

  • Optimize: Rewrite and convert the logical plan based on relational algebra, statistical information, and Cost model, and select the physical execution plan with the "lowest" Cost

  • Generate Fragment: Convert the physical execution plan selected by the Optimizer into a Fragment that BE can directly execute

  • Coordinate: Schedule Fragment to the appropriate BE for execution 
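
As a minimal sketch of that pipeline in Python, with every stage stubbed out: the function, the Fragment class, and the backend names are placeholders invented for illustration, not the actual StarRocks FE interfaces (which are implemented in Java):

from dataclasses import dataclass

@dataclass
class Fragment:
    plan: str        # placeholder for the executable plan piece handed to a BE
    backend: str     # which BE the coordinator assigned it to

def frontend_compile(sql_text, backends):
    """Illustrative FE pipeline; each stage here is a stand-in for the real one."""
    ast = ("AST", sql_text)                       # SQL Parse: text -> abstract syntax tree
    analyzed = ("analyzed", ast)                  # Analyze: syntactic and semantic checks
    logical_plan = ("logical", analyzed)          # Logical Plan: AST -> relational operators
    physical_plan = ("physical", logical_plan)    # Optimize: RBO/CBO picks the "lowest"-cost plan
    # Generate Fragment + Coordinate: split the plan and schedule the pieces onto BEs.
    return [Fragment(f"{physical_plan[0]} part {i}", be) for i, be in enumerate(backends)]

fragments = frontend_compile("SELECT o FROM t WHERE m < n + 1", ["be-1", "be-2", "be-3"])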

Backend

BE is the backend node of StarRocks. It is responsible for receiving Fragments from FE, executing them, and returning the results to FE. All BE nodes in StarRocks are completely equal, and FE assigns data to the corresponding BE nodes according to a certain strategy. A typical Fragment workflow is to read some files in the data lake, call the corresponding Reader (for example, a Parquet Reader for Parquet files or an ORC Reader for ORC files) to parse the data in these files, then use the vectorized execution engine to further filter and aggregate the parsed data, and finally return the result to another BE or to the FE.
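
A minimal sketch of that workflow, using Python and pyarrow purely for illustration: the file location, filter, and aggregation are made up, and the real BE is a C++ vectorized engine rather than anything like this.

import pyarrow.dataset as ds
import pyarrow.compute as pc

# One "Fragment" worth of work: scan some files from the lake, filter, aggregate.
dataset = ds.dataset("s3://my-bucket/click_events/", format="parquet")  # hypothetical location

# The reader parses the Parquet files into columnar data; the filter is applied
# while scanning, analogous to the BE pushing predicates into the scan.
table = dataset.to_table(columns=["page", "user_id"], filter=pc.field("user_id") > 0)

# Vectorized aggregation over the columnar data: count events per page.
result = table.group_by("page").aggregate([("user_id", "count")])
print(result)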

Summary

This article has introduced the core technical principles of the ultra-fast data lake analysis engine and compared different technical implementations along several dimensions. To pave the way for the follow-up in-depth articles, it has also introduced the system architecture design of the open source data lake analysis engine StarRocks. We look forward to discussing and exchanging ideas with our peers.

Appendix

Benchmarks

This test uses the standard TPC-H 100 GB test set to compare the query performance of StarRocks local tables, StarRocks On Hive, and Trino (PrestoSQL) On Hive.

The comparison was run on the TPC-H 100 GB dataset with a total of 22 queries; the results are as follows:

 

StarRocks was tested with both local storage queries and Hive external table queries. StarRocks On Hive and Trino On Hive query the same data, which is stored in ORC format with zlib compression. The test environment was built on Alibaba Cloud EMR.

In the end, StarRocks local storage queries took 21s in total, StarRocks Hive external table queries took 92s, and Trino queries took 307s. It can be seen that StarRocks On Hive far exceeds Trino in query performance, but there is still a long way to go compared with local storage queries. The main reason is that accessing remote storage adds network overhead, and the latency and IOPS of remote storage are usually not as good as those of local storage. The plan is to compensate for this through mechanisms such as caching, further narrowing the gap between StarRocks local tables and StarRocks On Hive.
