GaussDB Technical Interpretation Series: Advanced Compression for OLTP Tables

This article is shared from the Huawei Cloud Community post "DTCC 2023 Expert Interpretation | GaussDB Technical Interpretation Series: Advanced Compression OLTP Table Compression", by the GaussDB database team.

On August 16, the 14th China Database Technology Conference (DTCC 2023) was held at the Beijing International Convention Center. In the session on GaussDB's "five highs and two easies" core technologies, "a better choice for the world", Feng Ke, Chief Architect of Huawei Cloud Database GaussDB, gave a detailed interpretation of the advanced compression technology of the Huawei Cloud GaussDB database.


The following is the transcript of the speech:

Distinguished guests, good afternoon! I am very glad to open this year's series of technical interpretations of GaussDB's new features. The feature I will interpret is the first in the series: Advanced Compression.

GaussDB Advanced Compression Panorama

Advanced compression is a database compression solution for all business scenarios. Its applicable scenarios fall into two main categories. The first is storage compression, which provides capacity control for the business and reduces the probability and cost of scale-out. The second is transmission compression, which matches the actual network bandwidth of the business in cross-region and cross-AZ scenarios to provide a more stable SLA guarantee. There are many subdivided scenarios within these, covering both TP and AP.


There are many challenges here: one is how to design the compression algorithm, another is how to judge which data is hot and which is cold. For storage compression we use selective compression: based on the system's automatic discovery of hot and cold data, we compress only the relatively cold data in the business and do not touch the relatively hot data. Achieving zero intrusion into the business and integrating with the storage engine pose many further technical challenges.

Typical scenarios and design goals

Different scenarios call for different compression algorithms, with different compression ratios, business impact, and tolerance for business intrusion. Here we introduce the technical details of our first released capability, OLTP table compression. Before diving in, let us look at the customer scenarios OLTP table compression solves, since these determine our overall technical goals.

We encounter two typical scenarios in real business. In the first, the customer's business comes from IBM minicomputers, and a single database reaches dozens of TB, which is quite large. If the business is migrated to an open platform, the biggest problem is that the single-node capacity is too large and the operation and maintenance window becomes very long. There are different options. One is to split the database, that is, sharding into sub-tables and sub-databases. But splitting means a full distributed transformation, and for a key business that has run for many years the stakes of such a transformation are very high. The second option is compression. Compression can reduce capacity, but the customer did not separate hot and cold data when the business was designed; for example, the data was not partitioned along the time dimension. If compression is used, the customer's primary demand is that the impact of compression on the business be low enough, followed by the compression ratio. This is the first typical scenario.

In the second typical scenario, the customer's business is deployed on a distributed cluster, and capacity grows very fast, already exceeding one PB and still growing. For the customer this is a very big problem, requiring regular scale-out. Compression can help reduce the frequency of scale-out and the risk of changes. But the problem is the same: the customer's data is not separated into hot and cold. It is designed for scalability, for example sharded by user ID so that the load of different users is evenly distributed across data nodes. Since there is no hot/cold separation, if compression is used, the question is again whether the impact on the business is low enough, followed by the compression ratio. This is also a very typical OLTP compression scenario we have seen.

Analyzing these two scenarios, we derived three basic design goals. First, the entire compression scheme must be zero-intrusive: it cannot make assumptions about the existing data distribution of the business, such as requiring a partition that already distinguishes hot from cold, because the business does not have such conditions. No assumptions can be made about the data distribution or the logical model. Second, once compression is enabled, the impact on the business must be extremely low; we define this as within 10%, or even within 5%, which is very important. Third, the compression ratio must be reasonable, 2:1 or 3:1; without a meaningful compression ratio, there is no value in any of this. These three goals determine the design and engineering implementation of our entire technical solution.

Key challenge 1: How to determine the hot and cold data of the business?

With the goals set, three key issues need to be resolved: first, how to determine which business data is hot and which is cold; second, once that is determined, how to integrate with the existing storage engine and store the compressed data effectively; and third, how to implement a competitive compression algorithm.

When judging hot and cold, we first determine the granularity of the judgment: it can be by table, by partition, by block, or by row. The finer the granularity, the lower the intrusion into the business, since no assumptions are made about the business's data distribution, but the greater the implementation challenge. Based on our technical goals, the first decision for OLTP table compression was that the hot/cold judgment must be at the row level, so that intrusion into the business is minimal.

We leverage the existing mechanisms of GaussDB's storage engine. Like other engines, GaussDB stores not only user data but also metadata about that data, including transaction information. This transaction information is normally used for transaction visibility: it records the ID of the transaction that last modified the row. When that transaction ID becomes old enough to be visible to all current transactions, we replace it with a physical timestamp, which expresses when the row was last modified. If this time is early enough, old enough to truly meet the cold condition, then we can compress the row, and users can implement the hot/cold judgment with very simple logic.
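
To make the mechanism concrete, here is a minimal sketch in Python of the row-level hot/cold judgment described above. It is an illustration of the idea only, not GaussDB's actual implementation; the names `RowMeta`, `freeze_xid`, `is_cold`, and the 3-month threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

COLD_AGE_SECONDS = 90 * 24 * 3600  # hypothetical threshold: "cold after ~3 months"

@dataclass
class RowMeta:
    xid: Optional[int]              # ID of the transaction that last modified the row
    last_modified: Optional[float]  # physical timestamp, once the xid is replaced

def freeze_xid(meta: RowMeta, oldest_active_xid: int, now: float) -> None:
    """Once the last-modifying xid is visible to all current transactions,
    reuse its slot to store a physical timestamp (a conservative upper
    bound on when the row was last modified)."""
    if meta.xid is not None and meta.xid < oldest_active_xid:
        meta.xid = None
        meta.last_modified = now

def is_cold(meta: RowMeta, now: float) -> bool:
    """A row is cold only if its xid has been frozen and the stored
    timestamp is old enough to satisfy the cold condition."""
    return (meta.xid is None
            and meta.last_modified is not None
            and now - meta.last_modified >= COLD_AGE_SECONDS)
```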


The second mechanism is that users can customize the hot/cold conditions. If a row has not been modified for a long time, the system can compress it; otherwise it is left alone. This is a very simple strategy. If certain fields in the customer's business have very clear hot/cold semantics, such as transaction time or transaction completion status, that field can be specified for the hot/cold judgment. Or perhaps most of a customer's transactions satisfy the rule that transactions from 3 months ago are cold, but some special types, such as secured transactions, do not. In that case the customer can also customize the conditions, for example requiring that the transaction status be completed, or that the transaction type not be a specific type. By combining custom conditions with the last-modification time, what data should be compressed can be defined flexibly. This is the first point: how to judge hot and cold.
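
Continuing the sketch above, a user-defined condition composes with the age rule as a simple conjunction. The field names `status` and `order_type` and the rule itself are invented for illustration:

```python
from typing import Any, Callable, Dict

def user_condition(row: Dict[str, Any]) -> bool:
    # Hypothetical business rule: only completed, non-secured transactions
    # may be treated as cold, however old they are.
    return row["status"] == "COMPLETED" and row["order_type"] != "SECURED"

def eligible_for_compression(row: Dict[str, Any], meta: RowMeta, now: float,
                             condition: Callable[[Dict[str, Any]], bool] = user_condition) -> bool:
    # A row is compressed only when it is cold by age AND cold by business semantics.
    return is_cold(meta, now) and condition(row)
```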

Key challenge 2: How to effectively store the compressed data?

The second point is how to store the compressed data. Given the overall design goal that intrusion into the business be as low as possible, we chose to do intra-block compression: all rows in a block that satisfy the hot/cold judgment are compressed in one pass, and the compressed packet is stored in the current data block. This is not the optimal choice for compression ratio, but it is the better choice for business impact, because even with hot/cold conditions defined, the business still accesses cold data with some probability. Intra-block compression ensures that the cost of accessing cold data has a deterministic upper bound. This is the basic thinking behind intra-block compression.
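
The following rough sketch illustrates the layout idea, with `zlib` standing in for the real compressor; the structure is illustrative, not GaussDB's on-disk format. Hot rows stay uncompressed, all cold rows of the block are packed into one compressed packet kept inside the same block, so reading any cold row costs at most one packet decompression:

```python
import zlib
from typing import Any, Dict, List

def compress_block(rows: List[bytes], cold: List[bool]) -> Dict[str, Any]:
    """Compress all cold rows of one block into a single packet that is
    stored back inside the same block, next to the untouched hot rows."""
    hot_rows = [r for r, c in zip(rows, cold) if not c]
    cold_rows = [r for r, c in zip(rows, cold) if c]
    return {
        "hot_rows": hot_rows,
        "cold_packet": zlib.compress(b"".join(cold_rows)),
        "cold_sizes": [len(r) for r in cold_rows],  # to split the rows back apart
    }

def read_cold_row(block: Dict[str, Any], index: int) -> bytes:
    """The worst-case cost of a cold read is bounded: decompress one
    packet that lives in the very block being read."""
    data = zlib.decompress(block["cold_packet"])
    start = sum(block["cold_sizes"][:index])
    return data[start:start + block["cold_sizes"][index]]
```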


Key challenge 3: How to implement a competitive compression algorithm?

Why selective compression? The reason is simple: no compression algorithm can compress data without affecting the business; no such black technology exists today. That is our basic technical judgment, so we must balance compression ratio against business impact. The first thing we do is selective compression. Business data typically follows an 80/20 distribution: 80% of the data occupies 80% of the storage capacity but consumes only 20% of the computing power. In bank transactions, for example, the access frequency of an order drops rapidly over time, a very typical business with hot/cold characteristics.

With selective compression, we compress only the cold data that occupies 80% of the storage capacity but consumes only 20% of the computing power, which means we achieve 80% of the storage-saving goal; and we do not compress the hot data that occupies only 20% of the storage capacity but consumes 80% of the computing power, which means we achieve 80% of the goal of reducing the impact on the user's business. This is a very simple technical trade-off.

We also examined existing compression algorithms. LZ4, for example, is the algorithm with the best performance, and we used it at the beginning, but its big problem is a relatively low compression ratio. If you analyze the algorithm carefully, LZ4 is an implementation of the LZ77 algorithm, which treats the data as a continuous byte stream, searches backward from the current position for a matching string, and encodes the match as a length and offset in place of the matched string to achieve compression. In principle, LZ77 is well suited to long text and relatively unsuited to structured data, which contains a large number of numeric types and short strings, exactly the characteristics of database data.
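
The toy sketch below shows the LZ77 principle just described: scan a sliding window for the longest match of the upcoming bytes and emit (offset, length) tokens in place of repeated strings. It is a deliberately naive O(n·window) illustration, not LZ4's optimized implementation:

```python
from typing import List, Tuple, Union

Token = Union[Tuple[str, int], Tuple[str, int, int]]

def lz77_tokens(data: bytes, window: int = 4096, min_match: int = 4) -> List[Token]:
    """Emit ('match', offset, length) for repeated strings found in the
    sliding window, or ('literal', byte) when no useful match exists."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Overlapping matches (j + k >= i) are legal in LZ77: a decoder
            # copies byte by byte, so the referenced bytes already exist.
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            out.append(("match", best_off, best_len))
            i += best_len
        else:
            out.append(("literal", data[i]))
            i += 1
    return out

# "abcd" repeats: its second occurrence collapses into one (offset=4, length=4) token.
assert lz77_tokens(b"abcdabcd") == [("literal", 97), ("literal", 98),
                                    ("literal", 99), ("literal", 100),
                                    ("match", 4, 4)]
```

One can see why short numeric fields gain little here: a 4-byte integer rarely produces matches long enough to beat the token overhead, which is exactly the motivation for the encoding layer described next.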

We did a lot of optimization, such as differential encoding for numeric types, so the compression framework actually has two layers: the first layer encodes the data, and the second layer applies the LZ77 algorithm. The native LZ77 algorithm has many optimizations for long text, including 3-byte encodings; we did a lot of engineering optimization to suit short text better, such as 2-byte short encodings and built-in row boundaries. We cannot give all the details here, but there are two main threads: general-purpose compression algorithms are not particularly suitable for the structured data of relational databases, and the engineering optimizations we made, while not necessarily optimal for other scenarios, are especially suitable for relational data.
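
As an illustration of the first layer, here is a minimal delta-encoding sketch for a numeric column (the second layer would then run an LZ77-family compressor over the encoded bytes). This is a generic textbook transform, given as one example of the kind of encoding described, not GaussDB's exact scheme. Nearly sequential values such as order IDs become small, highly repetitive deltas that the byte-stream layer compresses far better than the raw values:

```python
from typing import List

def delta_encode(values: List[int]) -> List[int]:
    """Layer 1: keep the first value, then store only successive differences."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas: List[int]) -> List[int]:
    """Inverse transform: a running sum restores the original column."""
    out = deltas[:1]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Nearly sequential order IDs turn into tiny, repetitive deltas.
ids = [1000001, 1000002, 1000003, 1000007, 1000009]
assert delta_encode(ids) == [1000001, 1, 1, 4, 2]
assert delta_decode(delta_encode(ids)) == ids
```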


Competitiveness Assessment

Finally, a simple evaluation: we compared compression ratios with the current commercial database O* using TPC-C and TPC-H tests. O*, like GaussDB, provides complete hot/cold judgment capabilities, but for reasons of how it developed, it actually compresses the data first and then makes the hot/cold judgment, so the compression ratio of its algorithm is relatively low. Using standard TPC-H data, our tests show that our compression ratio is on average 50% higher than O*'s, and these numbers can be verified directly.


Some other vendors, including open-source databases and domestic vendors, also provide compression solutions, but their common problem is that they make no hot/cold judgment. Users can specify a table or a partition, and the data in it is either all compressed or not compressed at all. Compression saves storage cost but hurts performance; no compression is the other option. This seemingly simple choice is the hardest one for customers, which is why there are many compression solutions today that users do not actually enable: nobody knows what the consequences will be after turning compression on. This is a fairly big problem.

Here we also ran a standard TPC-C evaluation, applying selective compression on the standalone edition of GaussDB. By the semantics of TPC-C, delivered orders are never changed again but still have a certain probability of being accessed, an access model very close to real business scenarios. Our compression therefore targets flow-type data such as orders, while status-type data such as stock and accounts is not compressed. Within the flow data we compress only delivered orders, not orders that have yet to be delivered. The final results show that the overall impact on the business after compression is about 1.5%. We believe we are the first product in the industry that can keep compression enabled at a peak of 1.5 million tpmC with essentially no performance degradation.

Next step: semantic compression

We have pushed the boundary between data encoding and compression algorithms, but the way compression algorithms are used is essentially unchanged: the data is still treated as a one-dimensional byte stream. Relational data, however, is two-dimensional, structured data, with very rich correlations between rows and columns. These correlations come mainly from two sources. One is correlation introduced by the business itself during modeling; for example, to eliminate joins, the data model is designed to be flat or lightly normalized, which introduces very common correlations. The other comes from service-oriented transformation of the business: data is passed continuously between different service layers, producing correlations across layers. We use algorithms to automatically discover such correlations in structured data, not for product recommendation or service governance, but to achieve compression by eliminating them. In many scenarios this semantics-based correlation elimination yields better compression than general-purpose algorithms, and it is where we will focus our competitiveness next.
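
To make the direction concrete, here is a hypothetical sketch of eliminating one inter-column correlation: when one column is almost a deterministic function of another, store only the rule plus the exceptions instead of the full column. All names, the example rule, and the data are invented for illustration:

```python
from typing import Any, Callable, Dict, List

def encode_column(rows: List[Dict[str, Any]], src: str, dst: str,
                  rule: Callable[[Any], Any]) -> Dict[int, Any]:
    """Drop the dst column, keeping only the rows where the rule fails."""
    exceptions = {}
    for i, row in enumerate(rows):
        if row[dst] != rule(row[src]):
            exceptions[i] = row[dst]
        del row[dst]
    return exceptions

def decode_column(rows: List[Dict[str, Any]], src: str, dst: str,
                  rule: Callable[[Any], Any], exceptions: Dict[int, Any]) -> None:
    """Recompute dst from src via the rule, overriding with the exceptions."""
    for i, row in enumerate(rows):
        row[dst] = exceptions.get(i, rule(row[src]))

# Hypothetical correlation: ship_date is usually order_date + 2 days.
orders = [{"order_date": 10, "ship_date": 12},
          {"order_date": 11, "ship_date": 13},
          {"order_date": 12, "ship_date": 20}]  # the one exception
exc = encode_column(orders, "order_date", "ship_date", lambda d: d + 2)
decode_column(orders, "order_date", "ship_date", lambda d: d + 2, exc)
assert exc == {2: 20} and orders[2]["ship_date"] == 20
```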

Summary

Why build the advanced compression feature? Because we hope to achieve industry leadership in three areas.

First, in performance-sensitive scenarios, under the premise of a reasonable compression ratio, keep the impact on the business (the smaller the better) industry-leading.

Second, in cost-sensitive scenarios, under the premise of reasonable compression and decompression performance, achieve an industry-leading compression ratio (the higher the better).

Third, as you may have noticed, hot/cold judgment itself is useful for more than data compression, for example for multiple storage media and load awareness. We hope that our hot/cold judgment, including its models and methods, will be industry-leading in the breadth of business domains it supports. That is a basic purpose of our advanced compression feature.

Extra!


Huawei will hold the 8th HUAWEI CONNECT 2023 at the Shanghai World Expo Exhibition Hall and Shanghai World Expo Center on September 20-22, 2023. Under the theme "Accelerating Industry Intelligence", the conference invites thought leaders, business elites, technical experts, partners, developers, and other industry colleagues to discuss how to accelerate industry intelligence across business, industry, and ecosystem dimensions.

We sincerely invite you to attend, share the opportunities and challenges of intelligentization, discuss its key measures, and experience the innovation and application of intelligent technology. You can:

  • Collide viewpoints on accelerating industry intelligence in 100+ keynote speeches, summits, and forums
  • Visit the 17,000-square-meter exhibition area and experience the innovation and application of intelligent technology in industry at close range
  • Meet technical experts face to face to learn about the latest solutions, development tools, and hands-on practice
  • Seek business opportunities with customers and partners

Thank you for your continued support and trust; we look forward to meeting you in Shanghai.

Official website of the conference: https://www.huawei.com/cn/events/huaweiconnect

Welcome to follow the "Huawei Cloud Developer Alliance" official account for the conference agenda, activities, and cutting-edge content.

 
