Can distributed-database HTAP unify OLTP and OLAP?

Traditionally, OLTP and OLAP are connected through ETL. To make OLAP queries fast, the ETL process performs a large amount of pre-computation, including:

  • Restructuring the data
  • Applying business-logic transformations

The benefit is predictable OLAP query latency and a better user experience. However, to avoid disturbing the OLTP system during extraction, the ETL process has to start in the low-traffic window at the end of the trading day. As a result, OLAP data usually lags OLTP by at least one day, a timeliness commonly written as T+1:

  • Day T: the date on which the OLTP system generates the data
  • Day T+1: the date on which that data becomes available in OLAP
  • The interval between the two is one day

The main problem with this architecture is the timeliness of OLAP data: T+1 is too slow. Since the arrival of the big-data era, business decisions have leaned ever more heavily on data, and analytics keeps penetrating front-line operations, all of which demands that OLAP systems reflect business changes faster.

1 Solution approaches

HTAP targets exactly this OLAP timeliness problem, but it is not the only option. There are two broad approaches:

  1. Replace the original batch ETL with quasi-real-time computation, rebuilding the OLAP system
  2. Weaken or even remove the separate OLAP system and extend the OLTP system to handle analytics directly, i.e., HTAP

1.1 Rebuild the OLAP system

Emphasizing the timeliness of data processing has been the main direction of big-data technology in recent years, and the Kappa architecture is the representative of this new school. It was first proposed by Jay Kreps of LinkedIn in a 2014 article:

Kafka entirely replaces the original batch file transfer, a stream-computing system performs the fast processing of the data, and the results finally land in a Serving DB that answers queries. "Serving DB" here refers generically to storage such as HBase, Redis, or MySQL.
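To make the data flow concrete, here is a minimal Go sketch of the Kappa idea, with a channel standing in for the Kafka log, a goroutine standing in for the stream-computing stage, and a plain map standing in for the Serving DB. The `Event` schema and the aggregation are invented for illustration; this is not any real Kafka or Flink API.

```go
package main

import (
	"fmt"
	"sync"
)

// Event stands in for a message on a Kafka topic (hypothetical schema).
type Event struct {
	UserID string
	Amount float64
}

func main() {
	// The channel plays the role of the Kafka log; in a real Kappa
	// pipeline this would be a durable, replayable topic.
	events := make(chan Event, 16)

	// servingDB plays the role of the Serving DB (HBase/Redis/MySQL):
	// continuously updated aggregates, ready to answer queries.
	servingDB := make(map[string]float64)
	var mu sync.Mutex

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // the stream-computing stage (Flink's role)
		defer wg.Done()
		for e := range events {
			mu.Lock()
			servingDB[e.UserID] += e.Amount // incremental aggregation, no batch ETL
			mu.Unlock()
		}
	}()

	// The OLTP side emits events as transactions commit.
	for _, e := range []Event{{"u1", 10}, {"u2", 5}, {"u1", 7}} {
		events <- e
	}
	close(events)
	wg.Wait()

	fmt.Println(servingDB["u1"]) // prints 17: queryable moments after the writes
}
```

The point of the sketch is the absence of any end-of-day batch step: each event updates the serving state as it arrives, which is exactly what shrinks T+1 toward real time.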

The Kappa architecture has never been fully realized: in practice, stream computing cannot completely replace batch computing, and a Serving DB cannot satisfy every kind of analytical query.

The Kappa architecture therefore needs improvement on two fronts:

  1. Stronger stream computing, built on Kafka together with engines such as Flink
  2. Stronger real-time computation inside the Serving DB, hoping for breakthroughs in OLAP databases such as ClickHouse

This new OLAP school tries to improve real-time computing capability, remove batch ETL, and shrink data delay. It is an opportunity for stream computing, and a chance at self-rescue for OLAP databases.

1.2 Build a new HTAP system (Hybrid Transaction/Analytical Processing)

Hybrid Transaction/Analytical Processing first appeared in a Gartner report in 2014, the same year as the Kappa architecture. Gartner used HTAP to describe a new type of database that breaks down the wall between OLTP and OLAP, supporting both transactional and analytical scenarios in a single system. The idea is attractive: HTAP saves the tedious ETL work, avoids the lag of batch processing, and lets the freshest data be analyzed sooner.

The idea quickly showed its aggressive side. Since the data originates in OLTP systems, HTAP soon became the banner under which OLTP databases, especially NewSQL-style distributed databases, marched into OLAP territory.

Having largely solved high concurrency and strong consistency for OLTP scenarios, can NewSQL also cover OLAP?

Hard to say. Judging from technical practice, the technologies behind rebuilding OLAP seem to be developing faster, involve a broader set of vendors, and keep maturing in real production environments.

HTAP, by contrast, is progressing slowly and has few production-grade deployments, yet many vendors still treat it as the direction of product evolution:

  • Vendors that officially position their products as HTAP include at least TiDB and TBase
  • OceanBase has also announced OLAP-oriented features in recent versions

For business-strategy reasons, more distributed databases are likely to raise the HTAP banner in the future.

2 HTAP storage design

Architecturally, OLTP and OLAP differ in two places:

  • Compute

    The computing engines differ. Both aim to schedule multi-node computing resources for maximum parallelism, but OLAP pursues high throughput over massive data while OLTP focuses on low latency over small amounts of data, so the engines emphasize different things.

  • Storage

    Data is organized differently on disk, and that organization directly determines access efficiency. OLTP and OLAP typically use row storage and column storage respectively; the differences are explained in detail below.

The mainstream design of distributed databases separates compute from storage, and compute is comparatively easy to make stateless, so building multiple computing engines into one HTAP system is not hard. The biggest challenge in turning the HTAP concept into a runnable system is storage. Facing this challenge, the industry has two answers:

  1. Unified storage: the PAX (Partition Attributes Across) layout used by Spanner tries to accommodate both kinds of workload at once
  2. Separated storage: the design in TiDB 4.0, which adds column storage alongside the original row storage and keeps the two consistent through an innovative replication design

2.1 Spanner: unified storage

Spanner's 2017 paper "Spanner: Becoming a SQL System" introduced its new-generation storage engine, Ressi, which uses a PAX-like layout. PAX is not Spanner's invention; it was proposed as early as the VLDB 2002 paper "Data Page Layouts for Relational Databases on Deep Memory Hierarchies", which examines storage layouts from the perspective of CPU-cache friendliness and covers the NSM, DSM, and PAX formats.

NSM (row storage)

NSM (N-ary Storage Model) is row-based storage, the default for OLTP databases, and has accompanied relational databases throughout their history. Common OLTP databases such as MySQL (InnoDB), PostgreSQL, Oracle, and SQL Server all use row storage.

NSM stores each record contiguously, which maps closely onto the relational model. Writes are efficient, and a complete record can be fetched quickly on read. This property is called intra-record spatial locality.

[Figure: NSM page layout, with each record stored contiguously on the page]

Row storage, however, is unfriendly to OLAP queries. OLAP data is often merged from multiple OLTP systems, and a single table may have hundreds of fields, yet:

  • A query usually touches only a few of those fields, so reading in units of rows makes most of the fetched fields useless and wastes a large share of the I/O
  • The flood of irrelevant data also pollutes the CPU cache, further degrading performance

The figure illustrates what happens in the CPU cache: a large amount of invalid data is pulled into it, crowding out data that could have been reused.
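The I/O and cache waste is easy to see in code. Below is a toy Go sketch of a row-store scan, with invented field names; the wide `Note` field travels through memory and cache even though the aggregate never reads it.

```go
// A toy row-store page: each record's fields live contiguously
// (intra-record spatial locality). The schema is hypothetical.
type OrderRow struct {
	ID       int64
	Customer int64
	Amount   float64
	Note     [64]byte // a wide field the scan below never needs
}

// Summing a single column still walks every complete record,
// dragging the unused ID, Customer, and Note bytes through the
// CPU cache along the way.
func sumAmountsRowStore(page []OrderRow) float64 {
	var total float64
	for i := range page {
		total += page[i].Amount
	}
	return total
}
```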

DSM (column storage)

DSM (Decomposition Storage Model) appeared later than row storage. Its typical representative is C-Store, an open-source project led by Michael Stonebraker, later commercialized as Vertica.

Storing all values of a column together not only suits OLAP access patterns but is also much friendlier to the CPU cache. This property is called inter-record spatial locality. Columnar storage can dramatically improve query performance; ClickHouse, famous for its speed, is a columnar store.

The first problem with columnar storage is higher write overhead. Under the relational model the logical unit of data is still the row, so after switching to columns the same amount of data spreads across more data pages; and since data pages map directly to physical sectors, disk I/O overhead naturally rises.

[Figure: DSM page layout, with each column stored on its own pages]

The second problem is that efficiently joining different columns back together is hard. Most applications use more than a single column or a single table, and once the data has been scattered by column, reassembly costs more.
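For contrast, here is the same toy table in a columnar layout, again with invented names. The scan now touches only the bytes it needs, while the insert function shows the write amplification described above: one logical row fans out to every column array.

```go
// The same table in a toy column layout: each column is a dense,
// separately stored array (inter-record spatial locality).
type OrderColumns struct {
	ID       []int64
	Customer []int64
	Amount   []float64
	Note     [][]byte
}

// The same aggregate as before, but every cache line pulled in is
// now filled with useful Amount values.
func sumAmountsColumnStore(cols OrderColumns) float64 {
	var total float64
	for _, a := range cols.Amount {
		total += a
	}
	return total
}

// The write-side cost: one logical row must touch every column
// array, and on disk, every column's data pages.
func insertRow(cols *OrderColumns, id, cust int64, amount float64, note []byte) {
	cols.ID = append(cols.ID, id)
	cols.Customer = append(cols.Customer, cust)
	cols.Amount = append(cols.Amount, amount)
	cols.Note = append(cols.Note, note)
}
```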

PAX

[Figure: PAX page layout, with per-column minipages inside each page]

PAX introduces the concept of the minipage, a secondary unit inside the original data page: a record's fields all stay on the same page, so the basic distribution of a row across pages is preserved, while values of the same column are packed together within their minipage. In essence PAX remains closer to row storage, but it tries to balance intra-record locality against inter-record locality, which improves OLAP performance.
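A rough sketch of the page layout may help, continuing the toy schema from earlier; it simplifies away headers, offsets, and compression, so it illustrates the idea rather than Spanner's actual format.

```go
// A toy PAX page: the page still holds complete rows (row i is the
// i-th entry of every minipage), but within the page each column is
// packed into its own minipage.
type PAXPage struct {
	IDMini       []int64   // minipage for the ID column
	CustomerMini []int64   // minipage for the Customer column
	AmountMini   []float64 // minipage for the Amount column
}

// OLTP-style access: reconstructing row i never leaves the page.
func (p *PAXPage) Row(i int) (id, customer int64, amount float64) {
	return p.IDMini[i], p.CustomerMini[i], p.AmountMini[i]
}

// OLAP-style access: scanning one column walks a contiguous
// minipage, which is what makes PAX cache-friendlier than NSM.
func (p *PAXPage) SumAmount() float64 {
	var total float64
	for _, a := range p.AmountMini {
		total += a
	}
	return total
}
```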

In theory, PAX offers a more versatile storage layout, but what shakes confidence somewhat is that, although proposed as early as 2002, it saw almost no adoption before Spanner.

A design in a similar spirit is HyPer's DataBlock (SIGMOD 2016), which constructs a single data structure serving both OLTP and OLAP scenarios.

2.2 TiFlash: separated storage

If the underlying storage holds a single copy of the data, consistency between OLTP and OLAP is guaranteed by construction; that is PAX's biggest advantage. But with two very different access patterns, some mutual performance interference seems unavoidable, and the best one can do is pick a balance point. TiDB takes a different path, sitting between PAX and a traditional standalone OLAP system: OLTP and OLAP use different, physically separated storage formats, and an innovative replication strategy keeps the two copies of the data consistent.

TiDB declared HTAP as a goal in earlier versions and added TiSpark as an OLAP computing engine, but TiSpark still shared TiKV, the OLTP storage, so resource contention between the two workloads remained unavoidable. Only in the recent version 4.0 did TiDB officially launch TiFlash as dedicated OLAP storage.

[Figure: TiDB architecture with TiKV as row storage and TiFlash as column storage]

Our focus is the synchronization mechanism between TiFlash and TiKV, which is still built on the Raft protocol. TiDB adds a Learner role to Raft's original Leader and Follower roles. This Learner has responsibilities similar to the role of the same name in Paxos: it learns the agreed-upon state but does not vote. That is, when a Raft group counts the majority during a write, Learners are excluded. The benefit is that a Learner can never slow down writes; the downside is that a Learner's data inevitably lags behind the Leader's.
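The quorum rule is the crux, so here is a minimal sketch of it in Go. This is an illustration of the general idea, not TiKV's actual Raft implementation.

```go
// Member is a hypothetical view of one replica in a Raft group.
type Member struct {
	IsLearner bool // e.g., a TiFlash replica
	Acked     bool // has this replica acknowledged the log entry?
}

// committed reports whether a log entry has reached a majority of
// *voting* members. Learners receive the log but are never counted,
// so a slow Learner cannot stall the write path.
func committed(members []Member) bool {
	voters, acks := 0, 0
	for _, m := range members {
		if m.IsLearner {
			continue
		}
		voters++
		if m.Acked {
			acks++
		}
	}
	return acks > voters/2 // strict majority of voters only
}
```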

Isn't this just asynchronous replication in a new vest? Where is the innovation? Surely this cannot guarantee data consistency between the AP and TP sides?

Raft achieves data consistency precisely because only the leader node serves requests; if Learners, or even Followers, served reads directly, consistency could not be satisfied. So there is one more piece to the design.

Every time the Learner receives a request, it first confirms that its local data is fresh enough before executing the query. How? The Learner sends the Leader a request carrying the read transaction's timestamp to obtain the Leader's latest Commit Index, the sequence number of the last committed log entry. It then waits for the local log to keep applying until the local applied position reaches that Commit Index, at which point the data is fresh enough. If the local Region replica cannot finish catching up in time, the request waits until it times out.
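In pseudocode terms the read path looks roughly like the sketch below. The function names, the polling loop, and the 3-second timeout are all invented stand-ins; TiFlash's real implementation differs in its details.

```go
package learner

import (
	"context"
	"errors"
	"time"
)

// learnerRead waits until the local replica has applied everything the
// Leader had committed as of the read timestamp, then allows the query.
// leaderCommitIndex stands in for the RPC to the Leader; appliedIndex
// stands in for the Learner's local apply progress.
func learnerRead(ctx context.Context, readTS uint64,
	leaderCommitIndex func(ts uint64) (uint64, error),
	appliedIndex func() uint64) error {

	// 1. Ask the Leader which log position this read must observe.
	target, err := leaderCommitIndex(readTS)
	if err != nil {
		return err
	}

	// 2. Wait for the local log to apply up to that position, or fail
	//    the query if the replica cannot catch up before the timeout.
	timeout := time.After(3 * time.Second) // illustrative value
	for appliedIndex() < target {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-timeout:
			return errors.New("region replica lagging: read timed out")
		case <-time.After(10 * time.Millisecond):
			// poll apply progress; a real engine would be notified
		}
	}
	return nil // local data is now fresh enough to serve the query
}
```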

This synchronization mechanism works only if TiFlash never falls too far behind; otherwise every request triggers a catch-up and large numbers of requests time out, making the system unusable in practice. But TiFlash is columnar, and columnar stores usually write slowly, so how does TiFlash keep its write speed close to TiKV's? Its storage engine, Delta Tree, borrows from both the B+ tree and the LSM tree and is divided into two layers, a Delta Layer and a Stable Layer; the Delta Layer guarantees high write performance. Since the background on storage engines has not yet been introduced, Delta Tree will not be expanded on here.
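Still, the two-layer shape of the idea can be sketched without those details. The following is a deliberately naive Go illustration of a delta-plus-stable design, assuming a simple key-value record; it is not TiFlash's actual Delta Tree.

```go
package deltatree

import "sort"

// KV is a hypothetical record type for the sketch.
type KV struct {
	Key   int64
	Value float64
}

// DeltaTree splits data into a small write-optimized layer and a
// large read-optimized layer, echoing the Delta/Stable split.
type DeltaTree struct {
	delta  []KV // Delta Layer: unsorted, cheap to append
	stable []KV // Stable Layer: sorted, optimized for scans
}

// Put keeps the write path fast: a single append, no in-place
// reorganization of the sorted data.
func (t *DeltaTree) Put(k int64, v float64) {
	t.delta = append(t.delta, KV{k, v})
}

// Compact folds the delta into the stable layer in the background,
// keeping the delta small so reads mostly hit sorted data.
func (t *DeltaTree) Compact() {
	t.stable = append(t.stable, t.delta...)
	sort.Slice(t.stable, func(i, j int) bool {
		return t.stable[i].Key < t.stable[j].Key
	})
	t.delta = t.delta[:0]
}
```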

TiFlash is an OLAP system, so its first priority is read performance; however important writes are, they must yield to read optimization. As a distributed system, it also has a last resort: scaling out to relieve single-node write pressure.

Summary

  1. OLTP feeds OLAP through ETL, so OLAP data timeliness is usually T+1, which cannot reflect business changes promptly. There are two ways to solve this: rebuild the OLAP system, replacing batch processing with stream computing to shorten the data delay, with the Kappa architecture as the typical representative; or adopt HTAP as proposed by Gartner.
  2. The design points of HTAP are the computing engine and the storage engine, with storage as the foundation. Storage in turn has two schools of solutions. One, represented by PAX and adopted by Spanner, fuses row and column characteristics in a single physical layout. The other, TiDB's TiFlash, keeps separate row storage for OLTP and column storage for OLAP, with an innovative synchronization mechanism guaranteeing data consistency.
  3. TiDB's synchronization mechanism is still based on Raft, achieving asynchronous replication by adding the Learner role. Asynchronous replication inevitably introduces data lag, so before responding to a request the Learner restores data consistency by syncing incremental logs from the Leader, at the cost of extra communication.
  4. As a column store, TiFlash must put read performance first, yet the data-consistency requirement also demands high write performance. TiFlash balances the two with its Delta Tree design, which we have not expanded on here and will return to in Lecture 22.

Overall, HTAP is one idea for solving the problems of traditional OLAP, but its promoters are only a handful of OLTP database vendors. On the other side, the new stream-computing-based OLAP systems belong to the broader big-data ecosystem, with more participants and new results landing in production all the time. As for data consistency, where HTAP holds an absolute advantage, business scenarios may not actually demand it that rigidly. So judged on practical effect, I am more optimistic about the latter: the new OLAP systems.

Of course, HTAP does hold a relative advantage: as an all-in-one solution it spares users from integrating multiple products and lowers overall technical complexity. Finally, TiDB's solution is genuinely innovative, but whether it can cover a large enough share of OLAP scenarios remains to be seen.

FAQ

This lecture introduced TiFlash, TiDB's OLAP component. To maintain data consistency, every time TiFlash receives a request it asks the TiKV Leader for the latest log increment and continues processing only after replaying the log locally. This keeps the data fresh enough, but compared with TiFlash serving requests independently it adds one network round trip, which noticeably affects latency. My question: can this model be optimized? Under what circumstances is communication with the Leader unnecessary?

Selected replies

A background thread could poll for log increments and trigger actual synchronization once the gap exceeds some threshold; or a version number could be carried in the heartbeat for comparison, triggering active synchronization when the difference grows large. Either way, synchronization no longer waits for a request to arrive, saving that waiting latency. Since the replica is a non-voting Raft member some data divergence remains, but that should suffice for most OLAP analysis scenarios.

No exposure to OLAP.

Could we avoid requesting the "latest" log increment every time and instead request data on demand? Keep a freshness timestamp for the local data; if the read request's timestamp is earlier, there is no need to contact the Leader.

Or set a quality factor: sample the incoming requests, use something like a moving average to dynamically compute a target metric, and stop requesting data once the quality requirement is met.

When the client's read timestamp is guaranteed to be no later than a timestamp the server already covers locally, no communication is needed. The hard part is ensuring clock synchronization between client and server.
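Pulling these replies together, the common shape of the optimization is a locally tracked "safe timestamp". Here is a sketch, assuming the Leader pushes such a watermark via heartbeats; the field and method names are invented.

```go
// LearnerState is a hypothetical view of what a Learner tracks.
type LearnerState struct {
	SafeTS uint64 // highest timestamp known to be fully replicated locally
}

// NeedsLeaderCheck: only reads newer than the watermark pay the
// extra round trip to the Leader; older reads are served locally.
func (s *LearnerState) NeedsLeaderCheck(readTS uint64) bool {
	return readTS > s.SafeTS
}
```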
