In-depth analysis of the five service components of the ZNBase distributed SQL engine architecture

Guided reading

Compared with a traditional relational database, a distributed database system is characterized by multiple clusters, multiple nodes, and high concurrency. Beyond serving ordinary SQL requests, the SQL engine of a distributed database must therefore provide multi-cluster, multi-node collaborative computing power to improve query efficiency. This article introduces the features of the SQL engine architecture of the distributed database ZNBase, along with the technical principles and workflows of its major service components.

Distributed Database Architecture

At present, the most popular distributed databases in the industry fall into two main architectures. One is the shared-nothing architecture represented by Google Spanner; the other is the compute/storage separation architecture represented by AWS Aurora.

Spanner uses a shared-nothing architecture and provides automatic sharding, distributed transactions, and elastic scaling. Data is still sharded across machines, and query planning and execution can span multiple machines, which necessarily involves distributed computing and distributed transactions.

Aurora's main idea is to separate compute from storage using shared-storage technology, which improves disaster tolerance and total capacity. At the protocol layer, however, as long as storage is not involved, it is essentially a single-machine SQL engine instance that involves neither distributed storage nor distributed computing, so it is highly compatible with traditional databases.

The Inspur Yunxi NewSQL database ZNBase inherits Spanner's design concept and implements a distributed SQL engine on a peer-to-peer architecture.

ZNBase's SQL engine

Building on a traditional SQL engine, the SQL engine of ZNBase introduces the concept of distribution and executes user SQL queries more efficiently through collaborative computing across multiple cluster nodes. The overall architecture is shown below:

SQL engine static structure, including five services

Each node in the cluster has three node-local services: the Connectivity Service, the Compile Service, and the Cache Service, which together complete the front-end preparation for executing a user's SQL query.

At the same time, all the nodes together form the Distributed Catalog Service and the Distributed Execute Service. These two services coordinate execution across multiple nodes and improve the performance of the distributed SQL engine. Finally, structured data is converted into KV pairs recognizable by the underlying storage and sent to the transaction layer for processing in batches.

SQL engine execution flow

These five services are described below.

1. Connectivity Service

The distributed database ZNBase adopts a peer-to-peer architecture, and any node in the cluster can be used as an access node. At the same time, ZNBase supports the PostgreSQL protocol, and SQL queries can be sent to the cluster through various drivers that support the PostgreSQL protocol.

The connection service process is as follows:

  1. A background daemon manages client connections and builds a new Executor for each client.
  2. When the user issues a statement from the client, the byte stream is received from the client and unpacked.
  3. After execution completes, the result is packed and returned to the client.
  4. Each user operation is treated as a separate transaction.
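The steps above can be sketched as follows. This is a hypothetical Python simulation of the connection-service flow (the actual engine is written in Go); the names `Executor`, `Transaction`, and `accept_client` are illustrative, not ZNBase's real API.

```python
class Transaction:
    """Stand-in for an implicit per-statement transaction."""
    def __init__(self):
        self.committed = False

    def commit(self):
        self.committed = True


class Executor:
    """Per-client executor created by the connection daemon."""
    def __init__(self, client_id):
        self.client_id = client_id
        self.history = []

    def execute(self, stmt):
        txn = Transaction()       # each statement runs as its own transaction
        result = f"ok: {stmt}"    # placeholder for parse / plan / execute
        txn.commit()
        self.history.append((stmt, txn.committed))
        return result


def accept_client(client_id):
    # The daemon builds a fresh Executor for every new connection.
    return Executor(client_id)


if __name__ == "__main__":
    ex = accept_client("conn-1")
    print(ex.execute("SELECT 1"))
```

In the real system the daemon reads pgwire packets off a TCP socket and the transaction is handed to the distributed transaction layer; the sketch only captures the one-executor-per-client and one-transaction-per-statement structure.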

2. Dist Catalog Service

ZNBase's Dist Catalog Service implements the schema metadata of a traditional relational database, covering common metadata such as databases, tables, columns, and schemas, and adds high availability and distributed access for that metadata. Metadata is stored in multiple replicas distributed across the cluster, so it remains available as long as fewer than half of the replicas are unavailable. In addition, each peer node caches the first-level Root Meta Range of the metadata routing table at startup, ensuring that any node can locate the metadata it needs.

When catalog information changes, the write node of the metadata store is updated first, and the change is synchronized to the other replicas through the Raft protocol. At the same time, the catalog cache on each node is invalidated and refreshed asynchronously on next use, ensuring that the nodes see consistent data.
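The invalidate-then-lazily-refresh pattern described above can be illustrated with a minimal Python sketch. A version counter stands in for the Raft-replicated catalog store; the class names are hypothetical, not ZNBase internals.

```python
class CatalogStore:
    """Authoritative catalog (stands in for the Raft-replicated store)."""
    def __init__(self):
        self.version = 1
        self.data = {}

    def update(self, key, value):
        self.data[key] = value
        self.version += 1          # any change invalidates cached copies


class NodeCatalogCache:
    """Per-node cache that refreshes lazily on next use."""
    def __init__(self, store):
        self.store = store
        self.cached_version = 0    # 0 = nothing cached yet
        self.cache = {}

    def get(self, key):
        if self.cached_version != self.store.version:
            # Stale: refresh asynchronously in the real system,
            # synchronously here for simplicity.
            self.cache = dict(self.store.data)
            self.cached_version = self.store.version
        return self.cache.get(key)
```

A node never serves catalog entries older than the version it last observed, which mirrors the consistency goal stated above.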

3. Compile Service

The compile service of ZNBase covers the SQL front end and the SQL middle end. The SQL front end implements the scanner, parser, SQL syntax and semantic analysis of a traditional database, handles database objects and permission checks, and generates the AST (Abstract Syntax Tree).

The SQL middle end implements the database's optimizer. The optimizer is responsible for providing input to the execution engine: it receives the parsed AST from the SQL front end and selects the lowest-cost plan from all candidate plans for the execution engine.

ZNBase's optimizer is a search framework based on the Cascades paper. From the perspective of database evolution, Cascades-based search frameworks have become the industry standard; the commercial database SQL Server and the open-source GP/ORCA are both Cascades implementations. The overall architecture of the compile service is as follows:

SQL engine compilation service structure diagram

As shown in the figure above, the SQL statement from the client passes through the lexical, grammatical, and semantic analysis of the go-yacc layer to produce an AST, which Memo construction then converts into the initial CBO Memo tree. A Memo consists of a series of equivalence groups, where each group represents a set of logically equivalent expressions. The Memo itself is tree-structured and can represent the query, but since it does not carry much metadata it can be cached to improve execution efficiency; this caching is covered under the Cache Service. The constructed Memo first goes through the basic RBO transformations; afterwards, CBO optimizes the Memo (equivalence discovery and cost optimization) according to statistics and selects the plan on the lowest-cost path.
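The Memo's structure of equivalence groups can be sketched in a few lines of Python. This is a simplified illustration of the Cascades-style data structure, not ZNBase's actual implementation: child links point at groups rather than at concrete expressions, so one group can hold many logically equivalent alternatives.

```python
class Expr:
    """One concrete expression; children are group IDs, not expressions."""
    def __init__(self, op, child_groups=()):
        self.op = op
        self.child_groups = tuple(child_groups)


class Group:
    """One Memo group: a set of logically equivalent expressions."""
    def __init__(self):
        self.exprs = []


class Memo:
    def __init__(self):
        self.groups = []

    def add_group(self, expr):
        g = Group()
        g.exprs.append(expr)
        self.groups.append(g)
        return len(self.groups) - 1   # group ID

    def add_equivalent(self, gid, expr):
        # Output of a transformation rule joins the existing group.
        self.groups[gid].exprs.append(expr)
```

For example, after adding `join(t1, t2)` as a group, a commutativity rule can insert `join(t2, t1)` into the same group; cost-based search later picks the cheapest member of each group.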

RBO chooses the execution plan for a given table according to a fixed priority order of rules; for example, one rule gives an index a higher priority than a full table scan.

When an SQL statement is written in a way that is not conducive to quickly fetching data from storage, RBO transforms it accordingly. For example:

SELECT * FROM t1, t2 WHERE t1.a > 4 AND t2.b > 5;

If the Cartesian product is computed first and the filter conditions applied afterwards, many unnecessary tuples are produced. If instead the rows of t1 and t2 are filtered first and the Cartesian product taken afterwards, the cost of evaluating the expression drops greatly. The general rule for filters is: push a predicate down into a scan (select) operator whenever possible; otherwise, apply it as soon as the columns it needs become available. For example, given predicates such as a.a > 5 AND b.b > 10 AND a.c > b.b (column names qualified here for clarity), the first two conditions can each be pushed down into the scan of the corresponding table, while a.c > b.b references both tables and is applied as a filter immediately above the join of the two operators.
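The saving from filtering before the Cartesian product can be demonstrated concretely. The following Python sketch (an illustration of the optimization, not engine code) runs the example query both ways over two 10-row tables: the naive plan materializes all 100 intermediate pairs, while the pushed-down plan materializes only 20.

```python
from itertools import product

# Two small tables matching the example: t1.a and t2.b in 0..9.
t1 = [{"a": i} for i in range(10)]
t2 = [{"b": j} for j in range(10)]


def naive(t1, t2):
    # Cartesian product first (100 pairs), filter afterwards.
    rows = [(r, s) for r, s in product(t1, t2)]
    return [(r, s) for r, s in rows if r["a"] > 4 and s["b"] > 5]


def pushed_down(t1, t2):
    # Filter each table first, then take the (much smaller) product.
    t1f = [r for r in t1 if r["a"] > 4]   # 5 rows survive
    t2f = [s for s in t2 if s["b"] > 5]   # 4 rows survive
    return [(r, s) for r, s in product(t1f, t2f)]   # 20 pairs total
```

Both plans return the same 20 result rows, but the pushed-down plan never builds the 80 doomed intermediate tuples.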

CBO estimates costs based on statistics and thereby finds a better query path. For example, when joining three tables, statistics tell us which two tables to join first so that the rest of the execution costs less. Likewise, for a hash join we always want the smaller input to be read first and built into a small hash table; because the hash table is small, probing it with the larger table is cheaper.
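The build-side choice described above can be sketched as follows. This is an illustrative Python hash join that applies the CBO heuristic of building on the smaller input; it is not ZNBase's execution code.

```python
def hash_join(left, right, key):
    """Equi-join two lists of dict rows on `key`.

    Builds the hash table on the smaller input and probes with the
    larger one, mirroring the cost-based heuristic described above.
    """
    build, probe = (left, right) if len(left) <= len(right) else (right, left)

    # Build phase: hash table over the smaller relation.
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)

    # Probe phase: stream the larger relation past the table.
    out = []
    for row in probe:
        for match in table.get(row[key], []):
            out.append((match, row))
    return out
```

In a real optimizer the sizes would come from table statistics (estimated row counts) rather than `len()`, and the same estimates would drive the choice of which pair of tables to join first in a three-way join.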

4. Cache Service

ZNBase provides two types of caching services, which are mainly used to improve data access efficiency and reduce repeated consumption. 

The first is the session-level Query Cache, which caches the Memo tree corresponding to a SQL statement's fingerprint, reducing the cost of rebuilding logical plans for repeated SQL statements within the same session. A SQL statement's fingerprint includes verification information such as the catalog objects the statement references and the permissions it requires.

Before reusing a Memo, the engine checks whether it is stale: it re-resolves every data source and schema the Memo depends on to verify that each fully qualified object name still resolves to the same version of the same object, re-checks how time-dependent types are constructed and compared, and verifies that the user still has sufficient privileges on these objects. If any dependency is no longer current, the Memo is judged stale and must be rebuilt.
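The staleness check can be reduced to a version comparison over the Memo's dependencies, sketched below in Python. The class and its keying scheme are illustrative assumptions, not ZNBase's actual cache.

```python
class QueryCache:
    """Session-level cache: SQL fingerprint -> (memo, dependency versions)."""
    def __init__(self):
        self.entries = {}

    def put(self, fingerprint, memo, dep_versions):
        # dep_versions: object name -> catalog version seen at build time.
        self.entries[fingerprint] = (memo, dict(dep_versions))

    def get(self, fingerprint, current_versions):
        hit = self.entries.get(fingerprint)
        if hit is None:
            return None
        memo, deps = hit
        # Stale if any object the memo depends on has changed version
        # (a schema change, permission change, etc.).
        if any(current_versions.get(name) != ver for name, ver in deps.items()):
            del self.entries[fingerprint]   # evict; caller rebuilds the memo
            return None
        return memo
```

A permission revocation or schema change bumps the object's catalog version, so the next lookup misses and forces a rebuild, matching the behavior described above.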

The second is the cluster-level metadata-related Cache. The Catalog information contains the schema information and metadata routing information commonly used in the database. Metadata routing information is provided by the Dist Catalog service. Through metadata routing information, any node in the cluster can access all required metadata or data.

5. Dist Execution Service

The overall design of ZNBase's SQL engine follows the Volcano model [1]. The Volcano model was proposed by Goetz Graefe, who published the paper in 1994 and won the Edgar F. Codd Innovations Award (named after the founder of the relational model) in 2017.

The distributed execution of ZNBase borrows some concepts from Map-Reduce, but its execution model is completely different from Map-Reduce's.

ZNBase's logical plan is built bottom-up from the optimized Memo into a plan-node tree, with some additional table and column information attached for the subsequent construction of the physical plan.

The key question in distributed execution is how to move from the logical execution plan to the physical execution plan. This mainly involves two aspects: the distributed processing of computation, and the distributed processing of data.

Once the physical plan is generated, the system needs to split it up and distribute it among the nodes for execution. Each node is responsible for locally scheduling its data processors and input synchronizers. Nodes also need to communicate with each other to connect output routers to input synchronizers; in particular, a streaming interface is required to connect these components. To avoid extra synchronization costs, the execution environment must be flexible enough that, beyond the initial scheduling of the plan, each node can start its data-processing work relatively independently, without being driven by further orchestration from the gateway node.

The Gateway node in the ZNBase cluster creates a Scheduler that accepts a set of flows, sets up their input and output information, creates the local processors, and starts execution. While a node processes input and output data, flow control is applied; through this control, some requests can be rejected under overload.

Execute Flow Diagram

Each Flow represents a complete fragment of the physical plan executed on one node. It consists of processors and streams, and handles the fragment's data pulling, computation, and final output. As shown below:

plan execution diagram

For cross-node execution, the gateway node first serializes the corresponding FlowSpec into a SetupFlowRequest and sends it to the remote node via gRPC. On receipt, the remote node restores the flow, creates the processors it contains and the streams that connect them (TCP channels), and completes the construction of the execution framework. Multi-node computation then starts, driven by the gateway node; flows are scheduled asynchronously through a buffer pool to achieve parallel execution across the whole distributed framework.

For local execution, that is, parallel execution within one node, each processor, synchronizer, and router runs as a goroutine, interconnected by channels. These buffered channels synchronize producers and consumers.

To realize distributed concurrent execution, ZNBase introduces the concept of a Router. For complex operators such as JOIN and AGGREGATOR, three data-redistribution methods are implemented according to the data-distribution characteristics: mirror_router, hash_router, and range_router. Through data redistribution, a processor operator is split internally into two stages of execution. In the first stage, each node processes the portion of the data stored locally; the results are then redistributed according to the operator type. In the second stage, the redistributed data is aggregated and processed, so that a single operator executes cooperatively across multiple nodes.
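The two-stage pattern can be illustrated with a hash_router feeding a distributed COUNT-by-key. This Python sketch simulates the nodes as lists (the real routers ship rows over streams between goroutines); function names are illustrative.

```python
def hash_router(rows, key, n_nodes):
    """Partition rows across n_nodes by hashing the grouping key.

    Every row with the same key value lands on the same node, which is
    what makes per-node aggregation correct.
    """
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[hash(row[key]) % n_nodes].append(row)
    return parts


def two_stage_count(rows, key, n_nodes):
    # Stage 1: each node partially aggregates its local partition.
    partials = []
    for part in hash_router(rows, key, n_nodes):
        local = {}
        for row in part:
            local[row[key]] = local.get(row[key], 0) + 1
        partials.append(local)

    # Stage 2: merge the partial results. Because hash_router sends each
    # key to exactly one node, the merge is a plain union of the maps.
    final = {}
    for local in partials:
        for k, c in local.items():
            final[k] = final.get(k, 0) + c
    return final
```

A mirror_router would instead copy every row to all consumers (e.g. the build side of a broadcast join), and a range_router would partition by key ranges rather than hash values.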
 

Summary

This article introduced the SQL engine architecture of ZNBase, a distributed NewSQL database based on the Google Spanner paper, and described in detail the technical principles and workflows of its five service components: the per-node connection service, compile service, and cache service, together with the cluster-wide distributed catalog service and distributed execution service. In the next article, we will introduce the ZNBase team's optimizations and improvements to the compile service, the distributed execution service, and other components of the original SQL engine architecture.

Next: In-depth analysis of the SQL engine optimization of the distributed database ZNBase

More details about ZNBase can be found at:

Official code repository: https://gitee.com/ZNBase/zn-kvs

ZNBase official website: http://www.znbase.com/ 

If you have any questions about related technologies or products, please submit an issue or leave a message in the community for discussion. At the same time, developers who are interested in distributed databases are welcome to participate in the construction of the ZNBase project.

Contact email: [email protected]

 
