A First Choice for Building a Real-Time Data Warehouse: Demystifying Cloud-Native Data Warehouse Technology

Alibaba Cloud's analytical database, AnalyticDB for MySQL, has launched a Basic Edition that greatly lowers the barrier for users to build a data warehouse. It is highly compatible with MySQL, with very low cost of use and very high performance, so that small and medium-sized enterprises can easily build a real-time data warehouse and realize the online value of their enterprise data.

The AnalyticDB for MySQL product line includes a Basic Edition (single-node) and a Cluster Edition. The Basic Edition serves from a single node, and this minimalist architecture greatly reduces its cost. A storage-compute separation architecture, hybrid row-column storage, lightweight index construction, and a distributed hybrid computing engine give the Basic Edition strong analytical performance. With it, a real-time data warehouse can be built for less than 10,000 yuan a year, without setting up a dedicated big data team, saving enterprises millions in costs.

1. Basic Edition technical architecture

The diagram below shows the Basic Edition architecture. It consists of a Coordinator and Workers; their respective responsibilities are introduced below.

[Figure: Basic Edition architecture diagram]

1.1 Coordinator: front-end control node. Its responsibilities include:

(1) MySQL protocol access and SQL parsing

(2) Authentication and authorization, providing a complete and fine-grained permission model, whitelists and cluster-level RAM control, plus audit and compliance records of all SQL operations

(3) Cluster management: membership management, metadata, data consistency, route synchronization, backup and recovery (data and log management)

(4) Background asynchronous task management

(5) Transaction management

(6) Optimizer and execution plan generation

(7) Compute scheduling: responsible for task scheduling

1.2 Worker: storage and compute node. It includes:

(1) Compute module

The distributed MPP + DAG hybrid computing engine, together with the optimizer, delivers stronger complex-computation and mixed-workload management capabilities. Leveraging the flexible resource scheduling of the Alibaba Cloud computing platform, compute resources can be scheduled elastically: compute-only Worker nodes can be brought up independently and scaled out in minutes or even seconds in response to business needs, making the most efficient use of resources.

(2) Storage module

The storage module is lighter-weight, with real-time write and read capabilities that sustain higher data throughput. Write performance is about 50% higher than the previous version at the same specification, and writes become visible within milliseconds, meeting customers' real-time analysis needs.

Storage nodes provide full and incremental backup and recovery. Periodic snapshots of the cloud disks and their logs are synchronized to OSS in real time, providing higher security for user data and helping users recover to the greatest extent possible when database problems occur.
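The snapshot-plus-log scheme above can be sketched as point-in-time recovery: start from the newest snapshot taken at or before the target time, then replay later log entries up to it. The data structures and names below are illustrative, not the real AnalyticDB backup format.

```python
# Toy point-in-time recovery: snapshot at-or-before target_time, then log replay.
# (Illustrative only; not the actual AnalyticDB snapshot/log representation.)

def restore(snapshots, logs, target_time):
    """Rebuild state at target_time from the newest usable snapshot plus logs."""
    base = max((s for s in snapshots if s["time"] <= target_time),
               key=lambda s: s["time"])
    state = dict(base["state"])  # copy so the snapshot itself stays immutable
    for entry in sorted(logs, key=lambda e: e["time"]):
        # Replay only log entries after the snapshot and up to the target time.
        if base["time"] < entry["time"] <= target_time:
            state[entry["key"]] = entry["value"]
    return state

snapshots = [{"time": 0, "state": {"a": 1}},
             {"time": 10, "state": {"a": 2, "b": 3}}]
logs = [{"time": 5, "key": "b", "value": 3},
        {"time": 12, "key": "a", "value": 9},
        {"time": 20, "key": "c", "value": 7}]
print(restore(snapshots, logs, 15))  # {'a': 9, 'b': 3}
```

Restoring to t=15 picks the t=10 snapshot and replays only the t=12 log entry, which is why frequent snapshots keep recovery fast.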

(3) Worker Group

Worker nodes that carry storage modules are organized into Worker Groups. The Cluster Edition stores three replicas, which act as a whole through the Raft distributed consensus protocol, so service continues even when some Worker nodes fail; the Basic Edition provides only a single replica.
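The availability property of the three-replica Worker Group comes from Raft's majority rule, which can be illustrated with a trivially small sketch (this is the quorum arithmetic only, not a Raft implementation):

```python
# Majority-quorum rule behind Raft replication in a three-replica Worker Group:
# a write is durable once more than half of the replicas acknowledge it, so
# losing one node out of three does not block service. Illustrative only.

def is_committed(acks, replicas=3):
    """A log entry commits when a strict majority of replicas acknowledge it."""
    return acks > replicas // 2

print(is_committed(2))               # True: 2 of 3 is a majority
print(is_committed(1))               # False: one ack of three is not enough
print(is_committed(1, replicas=1))   # True: single-replica Basic Edition
```

This is also why the Basic Edition, with its single replica, trades away this fault tolerance in exchange for lower cost.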

2. Basic Edition optimizer

The optimizer processes the syntax tree produced by the parser and hands the lowest-cost plan found by its optimization algorithms to the compute engine. Plan quality directly affects query performance, so the optimizer is one of the core modules of a database. The Basic Edition uses the same powerful optimizer as the Cluster Edition, combining multiple optimization techniques based on rules, costs, and patterns.

[Figure: AnalyticDB optimizer framework]

Complex analytical queries often involve multi-table joins, and the join order directly affects query performance. The AnalyticDB optimizer uses a join-order optimization algorithm based on cost estimation and real-time sampling, which is aware of how data is distributed in the underlying storage. The optimizer exploits AnalyticDB's full-index feature to improve the accuracy of filter-factor estimation. For complex joins, it dynamically adjusts the join order according to data-distribution information while also evaluating the cost of data reshuffling, selecting the optimal execution plan from the perspective of global cost.
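The idea of cost-based join ordering can be shown with a deliberately tiny sketch: enumerate left-deep join orders and pick the one with the smallest estimated total intermediate result size. The tables, cardinalities, and selectivities below are invented for illustration and have nothing to do with AnalyticDB's actual cost model.

```python
# Toy cost-based join ordering: enumerate orders, estimate intermediate sizes,
# pick the cheapest. Statistics here are hypothetical.
from itertools import permutations

rows = {"orders": 1_000_000, "users": 10_000, "regions": 100}
# Hypothetical join selectivities for each joinable pair of tables.
sel = {frozenset(("orders", "users")): 1e-4,
       frozenset(("users", "regions")): 1e-2,
       frozenset(("orders", "regions")): 1e-2}

def plan_cost(order):
    """Sum of estimated intermediate cardinalities for a left-deep join order."""
    card, cost = rows[order[0]], 0
    joined = {order[0]}
    for t in order[1:]:
        # |R join S| is estimated as |R| * |S| * selectivity.
        s = min(sel[frozenset((t, j))] for j in joined
                if frozenset((t, j)) in sel)
        card = card * rows[t] * s
        cost += card
        joined.add(t)
    return cost

best = min(permutations(rows), key=plan_cost)
print(best)  # ('users', 'regions', 'orders')
```

Starting with the two small tables keeps the intermediate result small before the large `orders` table joins in, which is exactly the kind of decision accurate statistics make possible.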

The AnalyticDB optimizer adds cost estimation and iterative optimization on top of a classic Rule-Based Optimizer and integrates a Cascades-style CBO (Cost-Based Optimizer) framework. The CBO search framework calls the Property Enforcement module to generate distributed execution plans, then calls the cost estimation module to evaluate each candidate plan and select the optimal distributed execution plan. To further improve the effectiveness and efficiency of join-order optimization, the AnalyticDB optimizer also employs a History-Based Optimizer, a Pattern-Based Optimizer driven by common SQL patterns, and a data-driven Auto Analyze module that automatically collects statistics, providing accurate data support for the optimizer's search for the best plan.

In addition, the AnalyticDB optimizer applies a series of optimizations to the combined filter conditions, aggregation operators, and correlated subqueries that frequently appear in complex queries. For example, push-down optimization moves filter conditions and aggregation operators in the plan as far down the execution path as possible, which both improves the efficiency of the low-level operators and reduces the amount of data upstream operators must process, improving overall query performance. For correlated subqueries, the optimizer rewrites them into semantically equivalent uncorrelated plans through relational-algebra transformations, so the compute engine can pipeline them efficiently.
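Filter push-down can be illustrated with a minimal plan-tree sketch: a Filter sitting above a Join is re-parented onto whichever join input produces the filtered column, so less data flows into the join. The dict-based plan nodes, table names, and columns below are all illustrative, not real AnalyticDB operators.

```python
# Toy filter push-down on a dict-based plan tree (illustrative only).

def push_down(plan):
    """Move a Filter directly above a Join onto the matching join input."""
    if plan["op"] == "filter" and plan["input"]["op"] == "join":
        join = plan["input"]
        for side in ("left", "right"):
            if plan["column"] in join[side]["columns"]:
                # Re-parent the filter below the join, on the matching side.
                join[side] = {"op": "filter", "column": plan["column"],
                              "pred": plan["pred"], "input": join[side],
                              "columns": join[side]["columns"]}
                return join
    return plan

scan_orders = {"op": "scan", "table": "orders",
               "columns": ["order_id", "amount"]}
scan_users = {"op": "scan", "table": "users",
              "columns": ["user_id", "city"]}
plan = {"op": "filter", "column": "city", "pred": "= 'Hangzhou'",
        "input": {"op": "join", "left": scan_orders, "right": scan_users}}

optimized = push_down(plan)
print(optimized["op"], optimized["right"]["op"])  # join filter
```

After the rewrite the `city` filter runs directly over the `users` scan, so the join only sees the qualifying rows.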

3. Basic Edition compute engine

[Figure: AnalyticDB compute engine]

The AnalyticDB compute engine adopts a massively parallel processing (MPP) + DAG architecture with a memory-based pipelined execution model, giving it high concurrency and low latency. To speed up the evaluation of complex expressions and optimize execution performance, the engine uses Runtime Codegen to generate JVM bytecode at runtime and dynamically loads instances of the generated classes, reducing virtual function calls during execution and improving the efficiency of CPU-intensive tasks. The engine also evaluates expressions with a vectorized execution model, using the CPU SIMD instruction set to accelerate the computation.
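The pipelined, batch-at-a-time execution model can be sketched with generators: each operator pulls batches from its child, so data streams through the plan without materializing whole intermediate results. This is a conceptual sketch only, not AnalyticDB's engine code.

```python
# Toy pipelined execution: operators as generators passing batches downstream.

def scan(data, batch_size=4):
    """Source operator: emit fixed-size batches of the input."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def filter_gt(batches, threshold):
    """Filter operator: apply the predicate batch by batch (batch-wise,
    in the spirit of vectorized evaluation)."""
    for batch in batches:
        yield [x for x in batch if x > threshold]

def agg_sum(batches):
    """Sink operator: consume the pipeline and aggregate."""
    return sum(x for batch in batches for x in batch)

result = agg_sum(filter_gt(scan(range(10)), 4))
print(result)  # 5 + 6 + 7 + 8 + 9 = 35
```

Because each stage yields as soon as a batch is ready, memory usage stays bounded by the batch size rather than the table size, which is the core benefit of pipelining.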

4. Basic Edition storage engine

[Figure: hybrid row-column storage layout]

The AnalyticDB storage engine uses a hybrid row-column storage design, as shown in the figure. For every k rows of a table (a Row Group), each column's data is stored contiguously in its own Data Block, and the column blocks of each row group are stored contiguously on disk. The column blocks within a row group can be sorted and stored by a specified column, which significantly reduces random disk IO when querying by that column. The unique advantage of this design is that it combines the strengths of row storage (suited to OLTP point queries) and column storage (suited to OLAP multi-dimensional analysis), satisfying different types of workloads:

  • OLTP point queries, which select full rows of detail data: under the hybrid layout, what would be fully random reads in a pure column store become sequential reads
  • OLAP multi-dimensional analysis: the layout avoids the read amplification that row storage suffers in large-scale statistical analysis, turns single-row IO over column blocks into sequential skip reads, and turns multi-row random IO into sequential reads
  • High write throughput: the random writes of a pure column store become sequential writes
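The layout described above can be sketched in a few lines: cut the rows into groups of k, and within each group store every column contiguously. The in-memory dict representation below is an illustration of the idea, not AnalyticDB's on-disk format.

```python
# Toy row-group hybrid layout: k rows per group, one contiguous block per
# column inside each group. Illustrative only.

def to_row_groups(rows, columns, k):
    """Split rows into groups of k; store each column contiguously per group."""
    groups = []
    for i in range(0, len(rows), k):
        chunk = rows[i:i + k]
        # One contiguous block per column within this row group.
        groups.append({col: [r[j] for r in chunk]
                       for j, col in enumerate(columns)})
    return groups

rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0), (4, "d", 40.0)]
groups = to_row_groups(rows, ["id", "name", "price"], k=2)
print(groups[0]["name"])  # first group's 'name' column block: ['a', 'b']
print(groups[1]["id"])    # second group's 'id' column block: [3, 4]
```

A point query touches a single row group (row-store-like locality), while a column scan reads one small block per group (column-store-like locality), which is the dual benefit the article describes.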

The AnalyticDB storage engine uses an intelligent full index that builds, for every column, an inverted index from value to row number. At query time, the ANDs and ORs of the SQL's conditional expressions are converted into a Boolean Query whose indexes are searched simultaneously, yielding the row numbers of the result set that satisfies the WHERE condition. It supports fast multi-way merging and can find the qualifying result set at millisecond latency.
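The value-to-row-number index and its Boolean combination can be sketched with plain Python sets: each condition looks up a posting set, and AND/OR become set intersection/union. The columns and values below are made up for illustration.

```python
# Toy per-column inverted index: value -> set of row numbers; WHERE conditions
# combine posting sets with AND (intersection) and OR (union). Illustrative only.
from collections import defaultdict

def build_index(column_values):
    """Map each distinct value to the set of row numbers that hold it."""
    index = defaultdict(set)
    for row_id, value in enumerate(column_values):
        index[value].add(row_id)
    return index

city = build_index(["hz", "sh", "hz", "bj", "sh"])
vip = build_index([True, False, True, True, False])

# WHERE city = 'hz' AND vip = TRUE  ->  intersect the posting sets
and_rows = city["hz"] & vip[True]
# WHERE city = 'hz' OR city = 'bj'  ->  union the posting sets
or_rows = city["hz"] | city["bj"]
print(sorted(and_rows), sorted(or_rows))  # [0, 2] [0, 2, 3]
```

Only the final set of matching row numbers is then used to fetch rows from storage, which is why highly selective conditions resolve so quickly.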

5. Basic Edition advantages

The Basic Edition greatly lowers the barrier for users to build a data warehouse. Compared with big data stacks (Hadoop, Spark, and EMR) and OLTP-based warehouse approaches, it is highly cost-effective.

(1) Lower barrier to entry

The Basic Edition starts as low as 1.75 yuan/hour, or 860 yuan/month, a starting price about one third lower than the Cluster Edition. Disk space costs only 0.6 yuan/GB, with an upper limit of 4 TB, and can be expanded at any time as needed, greatly lowering the barrier for SMEs to run complex analysis and build real-time data warehouses.

(2) High performance

At the same configuration, its query performance is about 10 times that of MySQL, helping users solve the pain point of slow complex analysis on MySQL.

(3) Rich specifications

The Basic Edition offers four specifications: T8, T16, T32, and T52, which can be selected and adjusted according to business needs.

(4) Ecological transparency

The upstream and downstream ecosystem is fully compatible with the Cluster Edition and transparent to users.

6. Target customers

The Basic Edition is especially suitable for the following groups:

(1) Small and medium-sized enterprises that find Hadoop/Spark and similar stacks too complex and want to achieve data-driven transformation quickly;

(2) SMEs whose report database queries are slow and who need interactive BI analysis;

(3) Users who need to quickly set up a test environment for data warehouse selection;

(4) Learners and users who want to quickly get to know AnalyticDB for MySQL.

Learn more

Watch the live broadcast: https://yq.aliyun.com/live/2527
Product details: https://promotion.aliyun.com/ntms/act/adbformysqljichuban.html

This article is original content from the Yunqi Community and may not be reproduced without permission.


Source: blog.csdn.net/yunqiinsight/article/details/105420059