Business and technical requirements for mixed workloads

Table of Contents

Open source

Strong consistency

Fine-grained resource management

Row storage and column storage

Indexes

Relational and non-relational models

Continuous optimization of transactional and analytical workloads


Diversified business means diverse scenarios and metrics, and these businesses can be distinguished along several dimensions. In terms of execution time, there are real-time queries that demand millisecond responses, second-level operations, and complex queries that run for minutes or even hours. In terms of data scale, some businesses handle small data volumes of a few hundred megabytes, while others face big data at terabyte or petabyte scale. In terms of data freshness, some services work on newly generated data, while others analyze many years of historical data or summarize and transform it. In terms of data operations, some businesses only query, while others need support for inserts, deletes, and updates. Looking at query statements alone, some businesses need efficient single-table filtering, while others require complex multi-table join analysis or even advanced analytics combined with machine learning. In terms of the data model, beyond relational data there are services that rely on document formats such as JSON and XML, as well as unstructured text, video, and audio. In layman's terms, to satisfy so many different business scenarios, a data product must be as inclusive as the ocean that takes in a hundred rivers, scaling out to a very large size, while also possessing the needle-in-a-haystack ability to quickly locate and process specific records.

Specifically, an ideal data product for mixed workloads has the following characteristics:

  • Open source

The benefits of open source have been discussed at length; here are three that are crucial for mixed-workload businesses. Replacing an enterprise's multiple data products with a single one that supports many business scenarios is a major investment: it commits those business systems, and the associated development, support, and operations staff, to that product for many years to come. Open source offers the best protection for such an investment. About ten years ago, my project team needed to write a plug-in for a company's closed-source product to support our own product. After a period of development, we discovered a serious problem that could crash the entire product. Because the product and its plug-in framework were closed source, the reported errors were vague. The team analyzed the problem from every angle for three weeks without finding the cause. We even contacted the vendor's technical support team, but because our usage scenario was unusual, several rounds of communication still failed to pinpoint the cause or a solution. In the end, the architecture had to be completely redesigned to work around the problem with a different implementation. Internal R&D staff familiar with the product and its architecture could surely have found the root cause, but to people in other organizations the product was a black box: when a problem occurred, users had no insight into its operating logic or the specific cause of the error. An open source product is completely different. It is a white box whose entire operating logic is open to users: developers can analyze the code around the point of failure to determine the specific cause and possible fixes, avoiding the kind of rework described above and protecting the time and manpower already invested.

The second benefit of open source is that product development progress is completely transparent. Once customers adopt a product, they hope its backers and developers will keep investing in and improving it. Some vendors describe an ambitious roadmap, but actual delivery is another matter. As the inventor of the Linux kernel put it: "Talk is cheap, show me the code." With open source products there is no such gap: users can inspect not only the code but also the daily commits. Companies that dare to open-source their products are in fact confident in both the products and their R&D capability. They are willing to expose their products' strengths along with all known problems, and they firmly believe they can do better than every competitor; otherwise they would be easy to copy and surpass.

The third benefit of open source is that users can build custom extensions on top of the product to match their own business logic or workload characteristics. A well-designed open source product can be extended along different dimensions and at different stages. With a closed-source product, when your business grows to the point of needing a special feature, you have no choice but to wait or switch to another product.

  • Strong consistency

One goal of database design is to delegate the responsibility of data management entirely to the database. The application only cares about what data it needs; it does not need to care about how the data is accessed, nor worry about data loss, inconsistency, or conflicts. Some data products provide only eventual consistency, guarantee consistency only for single-row operations, or support only a limited set of isolation levels. Such weakened consistency leads to nondeterministic behavior when multiple operations or heavy concurrency are involved, forcing users and applications to track, analyze, and repair data consistency and validity themselves, which makes applications more complex. These problems become even more prominent under mixed workloads with users from varied business backgrounds. Only a data product with strong consistency can simplify business applications and deliver the consistency expected of transactional operations.

  • Fine-grained resource management

When a large number of statements of different sizes and types run in the same cluster, how can all of them meet their business requirements? This needs support from many directions, and one of the core elements is fine-grained resource management. The resources include CPU, memory, disk, and network, and they must be allocated promptly and dynamically to different users, organizations, and statements. If one statement monopolizes a resource, many other statements will queue at that bottleneck. Fine-grained resource management can therefore govern key resources effectively: it can impose limits at the resource-group, user, and statement levels, and it can also allocate resources across the different operators within a statement and the different stages of its execution.
A statement can thus make full use of system resources without encroaching on the resources of other statements. Not all data products can manage CPU, memory, disk, and network equally well: in some products, disk and network I/O are fully synchronous with statement execution, so limiting a statement's CPU also throttles its disk and network at the same time. Beyond the essential characteristics above, the following product features can also help address the mixed-workload problem.
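The transactional guarantee described in the strong-consistency discussion above can be sketched in a few lines. The example below uses Python's standard-library sqlite3 purely as a stand-in for any ACID-compliant database (a real Greenplum deployment would be reached through a PostgreSQL driver instead); the table and function names are illustrative only. The point is that a multi-statement transaction either commits as a whole or rolls back as a whole, so the application never observes a half-applied state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` between accounts atomically: both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            balance = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                   (src,)).fetchone()[0]
            if balance < 0:
                raise ValueError("insufficient funds")  # forces rollback of the debit above
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # transaction was rolled back; balances are unchanged

transfer(conn, "alice", "bob", 60)   # succeeds: alice 100 -> 40, bob 0 -> 60
transfer(conn, "alice", "bob", 60)   # would overdraw alice, so it rolls back entirely

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 40, 'bob': 60}
```

With weaker consistency guarantees, the bookkeeping inside `transfer` (and its failure handling) would have to live in every application that touches the data; a strongly consistent engine keeps it in one place.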

  • Row and column storage

For transactional operations, you usually need to access a small number of entire rows, so row storage is most appropriate; for analytical operations, you usually need to access a few columns across many rows, so column storage is a better fit. Looking at a typical data life cycle, row storage works best while data is frequently accessed or repeatedly modified; as access frequency drops or the data stops changing, it gradually becomes better to store it in columns. Therefore, for both the transactional and analytical sides of a mixed workload to perform well, a data product must support row storage and column storage at the same time, with solid optimizations for storing, accessing, and computing over both formats, all completely transparent to users.
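The access-pattern difference can be illustrated with a toy model in plain Python. This is only a sketch of the two layouts, not how any real engine stores data (real engines add pages, compression, and so on), and all names here are made up for illustration.

```python
# Row storage: each record's fields sit together, so fetching one whole
# record touches one contiguous unit.
rows = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]

# Column storage: each column's values sit together, so scanning one
# column touches only that column's data.
columns = {
    "id":     [1, 2, 3],
    "name":   ["a", "b", "c"],
    "amount": [10, 20, 30],
}

# Transactional pattern -- fetch one full record:
record = rows[1]                                            # one lookup in row storage
record_from_columns = {c: v[1] for c, v in columns.items()}  # one lookup per column

# Analytical pattern -- aggregate one column over all rows:
total = sum(columns["amount"])                   # reads only the 'amount' values
total_from_rows = sum(r["amount"] for r in rows)  # must touch every full record

print(record, total)  # {'id': 2, 'name': 'b', 'amount': 20} 60
```

Both layouts yield the same answers; what differs is how much irrelevant data each access pattern has to touch, which is exactly why a mixed-workload engine benefits from supporting both.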

  • Indexes

For operations that must finish in seconds or milliseconds, indexes are the best choice. They help SQL statements quickly locate the relevant data blocks and rows, and combining partitioning with indexes can filter out large amounts of irrelevant data during a scan, so that only relevant data needs to be processed. With data products that lack indexes, or whose index support is incomplete, you must scan the whole table or a large portion of it. That is acceptable for some analytical statements, but for the many operations that touch only a small amount of data in a large table, the cost is prohibitive.
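The effect is visible in any engine that exposes its query plans. The sketch below again uses stdlib sqlite3 as a stand-in (Greenplum/PostgreSQL would use `EXPLAIN` instead, and the exact plan wording varies by engine and version); table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, user TEXT, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(i, f"user{i % 100}", "x") for i in range(10_000)])

def plan(conn, sql):
    """Return the engine's query plan as one string."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM events WHERE user = 'user42'"

plan_before = plan(conn, query)  # mentions SCAN: every row must be read
conn.execute("CREATE INDEX idx_events_user ON events(user)")
plan_after = plan(conn, query)   # mentions USING INDEX idx_events_user:
                                 # only matching rows are located and read
print(plan_before)
print(plan_after)
```

The same statement goes from touching all 10,000 rows to touching only the ~100 matching ones, which is the difference between acceptable and prohibitive cost for small lookups in a large table.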

  • Relational and non-relational models

The relational table structure describes many real-world problems very well and is a simple, efficient, and widely used form of expression. Its drawback is that it is somewhat inflexible: although columns can be added or dropped, in some cases such structural changes are expensive. The non-relational model handles flexible, changing requirements well. Taking JSON as an example, you can define a JSON column that stores arbitrarily complex structures, and after the system has been running for a while you can freely add new fields to a JSON document as the business requires. But along with this flexibility, JSON has some inherent defects, such as the inability to constrain field types and formats at the data-structure level: a JSON document may contain data that is invalid or mismatched, so the application must do more validation. The relational and non-relational models each have their strengths and should be weighed together in practice.
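Both sides of that trade-off fit in a few lines of Python (the field names and the `validate_order` helper are hypothetical, chosen only to illustrate the point):

```python
import json

# Flexibility: a later document can add a new field ("coupon") with no
# schema change at all.
doc_v1 = json.loads('{"order_id": 1, "amount": 25.0}')
doc_v2 = json.loads('{"order_id": 2, "amount": "oops", "coupon": "SPRING"}')

# The inherent drawback: nothing at the data-structure level constrains
# field types, so type checking becomes the application's job.
def validate_order(doc):
    return (isinstance(doc.get("order_id"), int)
            and isinstance(doc.get("amount"), (int, float)))

print(validate_order(doc_v1), validate_order(doc_v2))  # True False
```

A relational `amount NUMERIC NOT NULL` column would have rejected the second document at write time; with JSON, that check must live in code like `validate_order` in every application that reads the data.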

  • Continuous optimization of transactional and analytical workloads

Even though transactional and analytical data products are converging, and transactional and analytical workloads increasingly need to run mixed in the same system, the two kinds of workload still differ in many technical details. For a product to support both well, it needs solid implementation and accumulated experience in optimizing each category, together with the R&D capability and sustained investment to keep improving both. Otherwise, as the business evolves and technology iterates, the product will lose competitiveness in at least one direction and fail to provide the best user experience. In short, many products appear to run both transactional and analytical workloads, but supporting both well is not easy.

Greenplum originated from PostgreSQL, so it naturally supports transactional insert, delete, update, and query operations. It has also adopted a long-term strategy of merging the latest PostgreSQL releases; in 2018 alone it merged six PostgreSQL versions, a pace that greatly excited the community. Going forward, the rich transactional features contributed by the PostgreSQL community will also be available to Greenplum. On the analytical side, Greenplum has long been deeply engaged: thousands of large customers run huge clusters supporting complex enterprise business intelligence and advanced analytics, with some customers running single SQL statements over trillions of records. The powerful analytical features accumulated this way have been tested by a large customer base for decades, making Greenplum a leader in this field. Greenplum also implements MVCC-based distributed transactions, providing full ACID semantics and supporting serializable transaction isolation.

At the same time, Greenplum implements fine-grained resource management on top of the cgroup feature of newer Linux kernels. It supports B-tree and bitmap indexes, and a single table can even mix row storage and column storage. Besides structured relational tables, Greenplum supports rich non-relational data, such as JSON, XML, key-value and text types, geospatial types, and user-defined types, along with a large set of functions for manipulating them. In the eyes of many existing users, Greenplum is the best choice for supporting mixed workloads.


Origin blog.csdn.net/MyySophia/article/details/113760617