Powerful Internet Genes: An In-Depth Look at TBase, Tencent Cloud's New-Generation Enterprise HTAP Database

About the author: Li Yuesen is a senior database expert at Tencent Cloud, a big data expert, and a PostgreSQL community member. He is responsible for the architecture design and R&D of TBase, has more than 10 years of experience in database kernel design and development, and has participated in and delivered the architecture design and development of several databases.

At the end of 2017, Tencent Cloud PostgreSQL-XZ (PGXZ) was officially renamed TBase, and it has since been deployed for more than a dozen customers in government, public security, fire protection, telecommunications, finance, and other industries. TBase is widely recognized by customers for its rich functionality, stable operation, and strong Internet genes.

In 2016, driven by changing requirements inside and outside Tencent Cloud, pre-research on TBase's HTAP solution began, and it has since been applied for many customers, including WeChat Pay. TBase's HTAP solution will be officially released in April 2018.

TBase core concepts:

The important technical features and concepts of TBase mainly cover the following aspects:

Enterprise-grade capability:

Enterprise-grade features include the following:

  • User-friendly transaction behavior: the business does not need to handle distributed transaction logic itself; the database kernel supports complete distributed transactions and guarantees ACID properties.

  • User-friendly database features: primary keys, foreign keys, sequences, constraints, partitioned tables, stored procedures, triggers, subqueries, and other enterprise-level features are fully supported (a small SQL sketch follows this list).

  • User-friendly SQL interface: TBase is currently compatible with the SQL:2003 standard and with common Oracle syntax, which eases migration for heavy Oracle users; there are already external Oracle migration cases.

  • User-friendly distributed query capability: strong distributed query support; the database kernel handles distributed JOINs efficiently.

  • Efficient online linear scaling: changing the cluster size does not affect the running business.
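
To make the feature list above concrete, here is a minimal sketch in plain PostgreSQL-compatible SQL of the kind of DDL and multi-statement transaction TBase is expected to accept transparently. The table, sequence, and column names are made up for illustration, and nothing here uses TBase-specific syntax.

```sql
-- Illustrative only: ordinary PostgreSQL-compatible SQL; the table, sequence,
-- and column names are made up for this sketch.
CREATE SEQUENCE order_id_seq;

CREATE TABLE orders (
    order_id   bigint PRIMARY KEY DEFAULT nextval('order_id_seq'),
    user_id    bigint NOT NULL,
    amount     numeric(12,2) CHECK (amount >= 0),
    created_at timestamptz DEFAULT now()
);

-- A multi-statement transaction: the database kernel, not the application,
-- is responsible for keeping it ACID even when the rows involved live on
-- different data nodes.
BEGIN;
INSERT INTO orders (user_id, amount) VALUES (1001, 99.50);
UPDATE orders SET amount = amount + 10 WHERE user_id = 1001;
COMMIT;
```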

HTAP Capability:

Hybrid Transactional/Analytical Processing (HTAP) means handling transactional and analytical workloads, two business types with conflicting resource demands, in the same database.

TBase has been specially designed to deliver HTAP, providing efficient OLAP capability and massive OLTP processing capability at the same time.

The following are our benchmark results for the TPC-C transaction test model. The system completed more than 3.1 million transactions per minute, and testing of a larger cluster is still in progress. Based on the current architecture design, if the hardware allows, the transaction throughput of the system will scale quasi-linearly with the cluster size:

The following figure compares TBase in row-storage mode against an industry-benchmark MPP data warehouse on the TPC-H 1 TB OLAP test set:

From this figure we can see TBase's OLAP capability intuitively: in each of the 22 queries, TBase's elapsed time is better than Greenplum's, and in some queries it is significantly better.

With HTAP, a business can run OLTP transactions and OLAP analysis in the same TBase cluster, which greatly reduces the complexity of the business system and lowers operation and maintenance costs.

High data security:

In our communication with customers, those in many industries raised data security requirements. The TBase team designed TBase's data security system around these needs and around advanced database security practices in the industry. The system mainly includes the following aspects:

  • Separation of powers: the traditional DBA role is decomposed into three mutually independent roles: security administrator, audit administrator, and data administrator. These three roles constrain one another, eliminating the all-powerful "god" account in the system; this role design addresses the data security problem at its root.

  • Mandatory security rules: drawing on advanced database security solutions in the industry, TBase provides mandatory security rules. Rules defined by the security administrator can enforce row-level and column-level visibility, restricting which data each user can see. This enables differentiated permission control for different users, effectively prevents unauthorized access to data, and protects key data (see the sketch after this list).

  • Transparent data desensitization: industries with special data security requirements, such as finance and public security, often need data masking. Many existing solutions require deep involvement from the business side, which raises the adoption threshold. TBase is specially designed for this pain point and makes desensitization transparent to the business: the business only designs its logic according to its own rules together with TBase's desensitization syntax, and the masking itself is performed inside TBase. Combined with the mandatory security rules above, the security administrator can target desensitization at specific users, so that users with a high security level see the original data while users with a low security level see desensitized data.

  • Audit capability: many customers raised database auditing requirements. TBase designed its auditing system against industry audit standards and implemented the core auditing functions in the kernel, so that system performance is preserved while fine-grained, high-precision auditing is supported. In addition, for specific problems encountered by businesses, dedicated designs provide real-time notification of audit results.
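
The article does not show TBase's own mandatory-rule syntax, so as a rough analogue the sketch below uses stock PostgreSQL row-level security to express the same idea of administrator-defined row-level visibility; the table, policy, and setting names are hypothetical.

```sql
-- Rough analogue of administrator-defined row-level visibility using stock
-- PostgreSQL row-level security; TBase's mandatory security rules are set by
-- the security administrator and their exact syntax may differ.
CREATE TABLE payroll (
    emp_id int PRIMARY KEY,
    dept   text NOT NULL,
    salary numeric(12,2)
);

ALTER TABLE payroll ENABLE ROW LEVEL SECURITY;

-- Each session sees only the rows of its own department; 'app.current_dept'
-- is a hypothetical session setting supplied by the application.
CREATE POLICY dept_visibility ON payroll
    USING (dept = current_setting('app.current_dept', true));
```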

Multi-tenancy capability:

TBase provides multi-tenancy both at the cluster (platform) level and within a single cluster. The cluster-level multi-tenancy capability helps businesses quickly build a private database cloud and lets customers quickly offer TBase-based DCDB (distributed database) services. The cluster-level multi-tenant architecture is shown in the following figure:

In addition, the TBase database cluster provides an in-cluster multi-tenancy solution based on node groups, so that businesses and resources inside the cluster are isolated and multiple businesses run in isolation within TBase. As shown in the figure below, APP1, APP2, and APP3 run in the same database cluster and are isolated from each other through node groups without affecting one another.

TBase product architecture:

The above is a general introduction to TBase's technical characteristics, which may still leave readers new to TBase a little unclear. To make the following content easier to follow, the overall architecture of TBase is introduced here; the old and new versions are similar in overall architecture:

There are three types of nodes in the cluster. Each type undertakes different functions, and they are connected into one system through the network. The three node types are:

  • Coordinator: the coordinator node (CN) provides the external interface and is responsible for data distribution and query planning. Multiple CNs are equivalent in status, and each provides the same database view. Functionally, a CN stores only the global metadata of the system and does not store actual business data.

  • Datanode: a data node (DN) processes and stores the metadata related to that node, and each DN also stores a shard of the business data. Functionally, a DN executes the requests distributed to it by the coordinator nodes.

  • GTM: the global transaction manager. It manages cluster transaction information and cluster-wide global objects such as sequences; beyond that, GTM provides no other functions (a sample catalog query follows this list).
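
TBase descends from PostgreSQL-XC/PGXZ; assuming it retains the pgxc_node catalog of that lineage (an assumption, since the article does not say), the coordinator and datanode topology can be inspected from any CN as sketched below. GTM is configured separately and does not appear in this catalog.

```sql
-- Assuming TBase keeps the pgxc_node catalog of its Postgres-XC lineage,
-- coordinators ('C') and datanodes ('D') can be listed from any CN.
SELECT node_name, node_type, node_host, node_port
FROM   pgxc_node
ORDER  BY node_type, node_name;
```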

Through the above architecture, TBase provides a database cluster with a friendly interface. This architecture has the following advantages:

  • Write scalability: multiple CNs can be deployed, and write operations can be issued to all of them simultaneously.

  • Multi-master: every CN node can initiate write operations and provides a unified, complete, and consistent database view;

  • Automatic data synchronization (synchronous): for the business, a write performed on one CN node is immediately visible on all other CN nodes;

  • Data transparency (transparent): although the data resides on different DN nodes, the business can query through a CN with SQL written exactly as against an ordinary database, without caring which node holds the data; the TBase kernel automatically schedules the execution of the SQL and guarantees transaction properties (a small sketch follows this list).
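
A minimal sketch of data transparency, assuming a Postgres-XC-style distribution clause (TBase's own shard DDL may differ); once the table exists, queries are written exactly as against a single-node database. Table and column names are illustrative.

```sql
-- The DISTRIBUTE BY clause below is Postgres-XC-style DDL; TBase's own shard
-- syntax may differ. Table and column names are illustrative.
CREATE TABLE user_account (
    user_id bigint PRIMARY KEY,
    name    text,
    balance numeric(12,2)
) DISTRIBUTE BY HASH (user_id);

-- The application neither knows nor cares which DN holds user 1001;
-- the CN plans, routes, and executes this like an ordinary query.
SELECT name, balance FROM user_account WHERE user_id = 1001;
```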

Detailed explanation of TBase features

OLAP capability improvement:

To explain the improvement in OLAP capability, we first need to describe how the new version of TBase differs from the old one in handling OLAP requests. The differences mainly concern the execution method and whether DN nodes communicate with each other:

Execution method: in the old version, when an OLAP request is executed, the CN sends the SQL statement to the DNs; each DN plans and executes the statement and reports its results to the CN, which aggregates them. In the new version, the CN collects cluster-wide statistics, builds a cluster-level distributed query plan for the OLAP query, and sends it to each DN for execution. In other words, the CN issues the execution plan and the DNs are only responsible for executing it.

Whether data is exchanged between DNs: in the old version there was no communication channel between DNs, so data could not be exchanged. The new version establishes an efficient data exchange channel between DN nodes.

The difference is shown in the image below:

On top of the new OLAP framework, we developed a complete and efficient multi-threaded data transmission mechanism. When OLAP queries run, this framework ensures that data can be exchanged efficiently between nodes, greatly improving OLAP processing efficiency.

At the algorithm level, building on the multi-core parallel execution capability of PostgreSQL 10, we systematically redesigned the commonly used JOIN and AGGREGATE algorithms for the cluster environment so as to fully exploit the existing hardware. On the TPC-H 1 TB OLAP benchmark at the same cluster scale, the average performance exceeds the industry benchmark Greenplum by 2 to 5 times.

OLTP capability optimization and improvement:

GTM is the module responsible for processing transaction information in the TBase cluster, and its processing capability directly determines the transaction throughput of the system. GTM is also the only single point in the system, so its processing limit directly sets the ceiling of the system's overall capacity.

To this end, we specially optimized the design of GTM, focusing on the following four aspects:

  • Network bandwidth optimization: the cluster snapshot is removed and a logical clock is used to determine the cluster-wide visibility of transactions, which greatly reduces GTM's network bandwidth usage and also reduces its CPU usage.

  • CPU usage optimization: by reusing thread resources, the number of GTM threads is greatly reduced, lowering the CPU cost of system scheduling and greatly improving GTM's processing efficiency.

  • Lock optimization: when system throughput reaches the million-transaction level, the system mutex used by GTM consumes most of the CPU. We implemented a user-mode mutex that brings this CPU usage down to about one-tenth of the original, raising the upper limit of the system's processing capacity.

  • Lock-free queues: lock-free queues replace the original locked queues, reducing lock usage in the system and greatly improving its processing efficiency.

In addition, we propose a patented distributed transaction consistency technology to guarantee transaction consistency in a fully distributed environment. With the above optimizations, the TPC-C processing capability of a single TBase cluster has been greatly improved, and capacity grows quasi-linearly with cluster size. The following is our TPC-C result at a cluster scale of 60 nodes: peak throughput reached 3.1 million transactions per minute. At that point DN and CN resources were tight while GTM still had considerable headroom, so system throughput can continue to grow as the cluster scales.

TBase HTAP processing capability:

Before discussing TBase's HTAP capability, let's first analyze the HTAP capability of the mainstream distributed database architectures on the market. All-in-one appliance solutions such as Exadata and HANA are not discussed here.

First, let's talk about the common sharding distributed architecture in Internet companies:

This architecture uses middleware to present multiple single-machine database instances behind a unified database access interface, achieving the effect of a distributed database through physical sharding of databases and tables. It is competitive for simple SQL requests, but it is often unable to handle complex SQL such as distributed joins and subqueries, so it is not well suited to HTAP-style workloads.

The second architecture is the classic MPP architecture, typified by Pivotal Greenplum. There is only a single master node in this architecture, and the master node provides the query service. This architecture was born for OLAP: because of the single-master design, the processing capacity of the system is limited by the master, and the minimum lock granularity is the table level, which directly limits transaction processing. This architecture is therefore only suitable for OLAP workloads, not OLTP workloads.

The architectures above are shared-nothing; next we analyze the shared-everything architecture, shown below:

Typical products are Sybase IQ and Oracle RAC. As a classic data warehouse product, Sybase IQ was once all the rage, and many of today's data warehouse concepts and solutions can be traced back to it. Since it was positioned as a data warehouse product from the start, its architecture never considered handling transactional requests, so it can only be used for OLAP requests.

Oracle RAC, as a popular database product today, performs well on both OLAP and OLTP requests, but the high price of the software and its required hardware, together with a complicated and lengthy expansion process, draw complaints from many customers.

After analyzing the solutions in the industry, we reached a basic conclusion: we need a distributed HTAP solution that can efficiently process OLTP and OLAP workloads at the same time while remaining easy to use and low cost, and the existing solutions cannot meet this need. Beyond the basic capabilities there is another problem to watch: OLTP requests are sensitive to latency and throughput, while OLAP queries have a completely different resource usage pattern, and processing both efficiently in the same cluster while achieving good resource isolation is a difficult problem.

After weighing the factors above, the TBase team carefully designed TBase's HTAP solution. The overall architecture is as follows:

TBase divides HTAP into two scenarios:

Case 1: OLAP and OLTP access different business data. In this scenario we can use TBase's node-group isolation, so that, with physical isolation naturally satisfied, TBase delivers efficient OLAP capability and massive OLTP capability respectively.

Case 2: OLAP and OLTP access the same data; OLTP and OLAP operations must run on the same data at the same time, and the efficiency of both must be guaranteed. To achieve resource isolation in this case, TBase runs OLTP services on the primary DN nodes and runs OLAP services on dedicated standby DN nodes, achieving natural resource isolation.

TBase security architecture introduction:

During the TBase team's communication with customers, customers in many industries raised data security demands. The TBase team designed the TBase security system around these business pain points, combined with leading security concepts in the database industry.

The TBase data security system is built on the separation of the database's three powers, decomposing the traditional DBA's privileges among a security administrator, an audit administrator, and a data administrator. For security rules, mandatory security rules and transparent data desensitization rules are added for the security administrator. For auditing, object auditing, user auditing, and fine-grained auditing are added, based on industry audit standards and business scenarios. The data administrator retains the data management and database operation-and-maintenance functions of the traditional DBA.

TBase multi-tenancy capability:

As an enterprise-level distributed database, TBase also provides the multi-tenancy capability that enterprises frequently need, allowing customers to run multiple businesses in one database environment without mutual interference.

The overall multi-tenant architecture of TBase is as follows:

TBase's multi-tenant management is divided into three levels. The bottom layer is the resource management layer, which manages the underlying physical machines, slices and pools the physical resources, isolates the slices from one another, and is responsible for allocating and releasing resources for the upper layers.

The second layer is the tenant management layer, which manages permissions within each tenant; every tenant has its own complete separation-of-powers system. A tenant can own multiple projects, each project corresponds to a cluster, and tenants are unaware of where their physical resources actually reside; each tenant sees only its own projects and clusters.

The top layer is system management, which creates tenants and clusters and manages resource allocation and release for the whole platform; it can see all tenants and clusters on the platform as well as the status of the physical machines.

In addition to the cluster-level multi-tenancy provided by the platform, a single TBase cluster also provides an in-cluster multi-tenancy solution based on node groups, for example as shown below:

Within one database cluster, the three APP businesses use different node groups and separate CNs to isolate resources, achieving multi-tenancy within the cluster (a DDL sketch follows).
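
As an illustration only, the node-group DDL below follows Postgres-XC-style syntax; the exact TBase statements may differ, and the group, node, and table names are hypothetical.

```sql
-- Illustrative node-group isolation in Postgres-XC-style DDL; exact TBase
-- statements may differ. dn01..dn04, the groups, and the tables are hypothetical.
CREATE NODE GROUP app1_group WITH (dn01, dn02);
CREATE NODE GROUP app2_group WITH (dn03, dn04);

-- Each application's tables live only on the datanodes of its own group.
CREATE TABLE app1_orders (id bigint, payload text)
    DISTRIBUTE BY HASH (id) TO GROUP app1_group;

CREATE TABLE app2_logs (id bigint, msg text)
    DISTRIBUTE BY HASH (id) TO GROUP app2_group;
```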

Through the above two solutions, a business can build a multi-tenant environment suited to its own needs and deploy quickly.

TBase online elastic expansion:

For a distributed system, elastic scaling is a hard requirement, and TBase does not compromise here. The old version of TBase introduced the shardmap and shard tables; through shard tables, TBase provides online linear scaling capability.

In the old kernel, shard records were stored in allocation order, so records with the same shardid were not necessarily stored contiguously. To support expansion, the expansion workflow had to make many adaptations to the storage layer, and the process was lengthy.

The new kernel introduces shard clustering in the storage layer, that is, records with the same shardid are stored contiguously at the bottom layer, which greatly simplifies the design of the upper-layer expansion workflow and greatly improves expansion efficiency.

TBase partition table:

In the old version of TBase we introduced the cluster partition table. Compared with the community partition table, its performance in OLTP scenarios is 1 to 3 orders of magnitude better, especially when the number of child tables is large. The overall structure is as follows:

Performance comparison with the community partition table:

In this architecture:

Coordinator: responsible for the vertical split (distributing data across nodes) and unaware of the horizontal partitioning logic.

DataNode: responsible for the horizontal split, dividing a logical table into multiple physical tables according to the partition key.

TBase retains this excellent built-in partition table capability. In addition, the new version based on PostgreSQL 10 also inherits the community's RANGE and LIST partitioning. With these two newly added partition types, businesses have more choices when building partitioned tables (see the example below).
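
The RANGE and LIST partitioning mentioned above is the standard PostgreSQL 10 declarative syntax, so a minimal example looks like the following; the table and partition names are illustrative.

```sql
-- Standard PostgreSQL 10 declarative partitioning; names are illustrative.
CREATE TABLE measurement (
    city_id  int  NOT NULL,
    logdate  date NOT NULL,
    peaktemp int
) PARTITION BY RANGE (logdate);

CREATE TABLE measurement_2018q1 PARTITION OF measurement
    FOR VALUES FROM ('2018-01-01') TO ('2018-04-01');

CREATE TABLE cities (
    city_id int  NOT NULL,
    region  text NOT NULL
) PARTITION BY LIST (region);

CREATE TABLE cities_south PARTITION OF cities
    FOR VALUES IN ('guangdong', 'guangxi');
```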

Other new features of TBase:

  • Crash-safe hash indexes:

PostgreSQL 10 officially introduced WAL (XLOG) logging for hash indexes, which means hash indexes can now be used with confidence; for operations such as equality and IN lookups, we have a better choice than btree (a brief example follows).
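
A minimal example of using a now crash-safe hash index for equality lookups, in standard PostgreSQL 10 syntax; the table and index names are illustrative.

```sql
-- With WAL-logged (crash-safe) hash indexes in PostgreSQL 10, an
-- equality-lookup workload can use a hash index; names are illustrative.
CREATE TABLE session_store (
    session_key text,
    payload     jsonb
);

CREATE INDEX session_key_hash_idx ON session_store USING hash (session_key);

-- Equality predicates like this one can be served by the hash index.
SELECT payload FROM session_store WHERE session_key = 'abc123';
```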

  • Efficient expression evaluation:

In PostgreSQL 10, to prepare for JIT, expression evaluation changed from the traditional tree-walking executor to a flattened (expanded) execution form, which greatly improves expression execution efficiency. The team made a dedicated comparison: we built a JIT demo for typical expression operations and compared it against PG10's expanded expression execution in the kernel, and found that PG10's results were no worse than the JIT-generated program.

  • Multi-core scalability enhancements:

Transaction scalability: since version 9.6, PostgreSQL has added a number of OLTP enhancements, including parallel XLOG flushing and snapshot mechanism optimizations. In our team's tests, transaction throughput improves quasi-linearly from 24-core servers up to 96-core servers. In other words, whether on an ordinary 24-core server or a high-end server with more cores, TBase can exploit the device's full potential and make full use of its resources.

Multi-core parallel execution: PostgreSQL 9.6 introduced the parallel execution framework, and by PostgreSQL 10 operators such as aggregate, hash join, merge join, sequential scan, and bitmap heap scan can be executed in parallel (a short sketch follows).
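
As a quick way to see these parallel operators, the sketch below uses standard PostgreSQL settings and the illustrative orders and user_account tables from the earlier sketches; the exact plan chosen depends on table size and cost settings.

```sql
-- Allow up to 4 parallel workers per Gather node (standard PostgreSQL setting).
SET max_parallel_workers_per_gather = 4;

-- On sufficiently large tables the plan shows Gather, Parallel Seq Scan,
-- a hash join executed under Gather, and Partial/Finalize Aggregate nodes.
EXPLAIN (COSTS OFF)
SELECT o.user_id, count(*) AS order_cnt, sum(o.amount) AS total_amount
FROM   orders o
JOIN   user_account u ON u.user_id = o.user_id
GROUP  BY o.user_id;
```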

Conclusion:

Of course, TBase has many other features that cannot all be covered here. It is with the support of these capabilities that TBase handles OLAP analysis over massive data with ease. Tencent Cloud TBase is about to meet you, and the Tencent Cloud team warmly welcomes everyone to try it out and exchange ideas.

PS: From May 23 to 24, the 2018 Tencent "Cloud + Future" Summit, with the theme "Huan·kai", will be held in Guangzhou. This year's summit will gather internationally renowned experts and scholars, business elites, and industry and technical guests, bringing more than 80 keynote speeches on the development of cloud computing and the future of the digital industry. Registration: https://cloud.tencent.com/developer/summit/2018-guangzhou?fromSource=gwzcw.914482.914482.914482
