[Kylin] (1) First encounter with Apache Kylin

1. What is Kylin?

The kylin (qilin) is a sacred beast in Chinese mythology. The ancients regarded it as one of the four auspicious spirits, a benevolent creature whose appearance was believed to herald good fortune.

In the field of big data processing, what users most commonly want is to obtain query results from a big data platform quickly and in a very simple way, and to connect traditional business intelligence tools directly to the platform so that those tools can be used for data analysis. Many excellent SQL on Hadoop engines have emerged, including Hive, Impala, and SparkSQL; their appearance and adoption have greatly lowered the barrier to using the Hadoop platform.

To go further and satisfy the scenario of "standard SQL queries over aggregated results returning at the millisecond level, under high concurrency and large data volumes," Apache Kylin came into being. It was incubated at eBay and eventually contributed to the open source community. Apache Kylin is an open source OLAP engine built on the Hadoop big data platform, started in 2013 by a team of Chinese engineers at eBay in Shanghai.


It adopts multi-dimensional cube pre-calculation technology, trading space for time to raise the speed of many big data queries from minutes or even hours to the sub-second level, greatly improving the efficiency of data analysis and filling a gap in the industry. This engine opens the door to interactive big data analysis on super-large data sets.

2. Why use Kylin?

Since Hadoop's birth a decade ago, the problems of big data storage and batch processing have been properly solved, and how to analyze data at high speed has become the next challenge. As a result, a variety of "SQL on Hadoop" technologies have emerged, represented first by Hive and followed by Impala, Presto, Phoenix, Drill, and SparkSQL. Their main techniques are "massively parallel processing" (MPP) and "columnar storage". Massively parallel processing mobilizes many machines to compute in parallel, trading a linear increase in resources for a linear decrease in computing time. Columnar storage stores records by column, which not only lets a query read only the columns it needs, but also exploits the storage device's sequential-read capability, greatly increasing the read rate. These two key technologies have reduced SQL query times on Hadoop from hours to minutes.

However, minute-level query response is still far from the actual needs of interactive analysis. The analyst types a query, presses Enter, and then has to pour a cup of coffee and quietly wait for the result. Only after the result arrives can the query be adjusted and the next round of analysis begin. Going back and forth like this, a single analysis scenario often takes hours or even days to complete, which is inefficient. This is because, although massively parallel processing and columnar storage increase the speed of computation and storage, they change neither the time complexity of the query problem itself nor the fact that query time grows linearly with data volume.

Suppose querying 100 million records takes 1 minute; then querying 1 billion records takes 10 minutes, and querying 10 billion records takes at least 1 hour and 40 minutes. Of course, many optimization techniques can shorten query time, such as faster storage and more efficient compression algorithms, but in general the linear relationship between query time and data volume cannot be changed. Massively parallel processing does allow the computing cluster to be scaled out tenfold or a hundredfold to keep query speed at the minute level, but purchasing and deploying a cluster ten or a hundred times larger is hardly easy, to say nothing of the expensive hardware and the cost of operating and maintaining it.

In addition, for analysts, a complete and validated data model matters more than raw analysis performance. Directly accessing complicated raw data and analyzing it is not a friendly experience, especially on super-large data sets: the analyst ends up spending more energy waiting for query results than on the more important work of building the domain model.

3. How does Kylin solve the key problems?

The original intention of Apache Kylin was to solve second-level queries over hundreds of billions or trillions of records. The key is to break the rule that query time grows linearly with data volume. Thinking carefully about big data OLAP, two facts stand out.

  • Big data queries generally ask for statistical results, that is, values produced by applying an aggregate function over many records. The original records themselves are not needed, or are accessed with extremely low frequency and probability.
  • Aggregation is performed along dimensions. Since business scope and analysis requirements are limited, the meaningful combinations of dimensions are also relatively limited and generally do not grow as the data grows.

Based on these two points, a new idea emerges: "pre-calculation". Aggregate results should be calculated in advance as much as possible, and at query time the precomputed results should be used to answer the query, avoiding a direct scan of raw records that may grow without bound.

For example, use the following SQL to query the highest-selling products on October 1st:

select item, sum(sell_amount)
from sell_details
where sell_date = '2016-10-01'
group by item
order by sum(sell_amount) desc;

With the traditional approach, all records must be scanned to find the sales on October 1, the sales are then aggregated by product, and finally the result is sorted and returned. If there were 100 million transactions on October 1, the query must read and accumulate at least 100 million records, and it will only get slower as sales grow. If daily transaction volume doubles to 200 million, query execution time may also roughly double.

With pre-calculation, sum(sell_amount) is computed in advance along the dimensions [sell_date, item] and stored; at query time, the sales for October 1 can simply be looked up, sorted, and returned. The number of records read will never exceed the number of distinct [sell_date, item] combinations, which is obviously far smaller than the number of raw sales records. For example, if the 100 million transactions on October 1 involve 1 million distinct items, only 1 million records remain after pre-calculation, one percent of the original, and those records are already aggregated by product, so the aggregation at query time is avoided. Looking ahead, query speed changes only with the number of dates and products and is no longer directly tied to the total number of sales records: even if daily transactions double to 200 million, as long as the number of products stays the same, the number of pre-calculated records does not change, and neither does the query speed.
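To make the space-for-time idea concrete, here is a minimal Python sketch (not Kylin code; table and column names follow the example above) that pre-aggregates sales by the [sell_date, item] dimensions offline and then answers the query from the small aggregate instead of scanning the raw records:

from collections import defaultdict

# Raw fact records: (sell_date, item, sell_amount). In reality these could
# number in the hundreds of millions per day.
raw_records = [
    ("2016-10-01", "item_A", 30.0),
    ("2016-10-01", "item_B", 12.5),
    ("2016-10-01", "item_A", 7.5),
    ("2016-10-02", "item_B", 20.0),
]

# Offline pre-calculation: aggregate sum(sell_amount) by [sell_date, item].
# The result has at most (#dates x #items) rows, regardless of raw volume.
pre_aggregated = defaultdict(float)
for sell_date, item, amount in raw_records:
    pre_aggregated[(sell_date, item)] += amount

# Query time: answer "top-selling items on 2016-10-01" by reading only the
# small aggregate and sorting it; no scan of the raw records is needed.
day = "2016-10-01"
result = sorted(
    ((item, total) for (d, item), total in pre_aggregated.items() if d == day),
    key=lambda x: x[1],
    reverse=True,
)
print(result)  # [('item_A', 37.5), ('item_B', 12.5)]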

" Pre- computation " is Kylin's third key technology for big data analysis in addition to "large-scale parallel processing" and "column storage".

4. Kylin features

Kylin's main features include a standard SQL interface, support for very large data sets, sub-second response, scalability, high throughput, and BI tool integration.

1) Standard SQL interface : Kylin uses standard SQL as its external service interface.

2) Support for very large data sets : Kylin's support for big data is arguably the most advanced among current technologies. As early as 2015, eBay's production environment already supported second-level queries over tens of billions of records, and later mobile application scenarios reached second-level queries over hundreds of billions of records.

3) Sub-second response : Kylin offers excellent query response speed thanks to pre-calculation. Many complex calculations, such as joins and aggregations, are completed during offline pre-calculation, which greatly reduces the amount of computation required at query time and improves response speed.

4) Scalability and high throughput : a single Kylin node can serve 70 queries per second, and Kylin can also be deployed as a cluster.

5) BI tool integration

Kylin can be integrated with existing BI tools, including the following.

  • ODBC : integration with Tableau, Excel, Power BI, and other tools
  • JDBC : integration with Java tools such as Saiku and BIRT
  • REST API : integration with JavaScript and web pages

The Kylin development team has also contributed a Zeppelin plugin, so Zeppelin can be used to access Kylin services as well.

5. Kylin architecture


1) REST Server

The REST Server is the set of entry points for application development on the Kylin platform. Applications built on it can issue queries, obtain results, trigger Cube build jobs, retrieve metadata, obtain user permissions, and so on. SQL queries can also be executed through the RESTful interface.
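As an illustration of this entry point, the sketch below submits a SQL query to Kylin's REST API using Python's requests library. The host, port, project name, and ADMIN/KYLIN credentials are assumptions for a typical default deployment and will likely differ in your environment:

import requests

# Assumed Kylin host and default-style credentials; adjust for your setup.
KYLIN_BASE = "http://localhost:7070/kylin/api"
AUTH = ("ADMIN", "KYLIN")

payload = {
    "sql": "select part_dt, sum(price) from kylin_sales group by part_dt",
    "project": "learn_kylin",  # assumed sample project name
    "limit": 50,
}

# Submit the query through the REST Server; the JSON response carries column
# metadata and the result rows.
resp = requests.post(KYLIN_BASE + "/query", json=payload, auth=AUTH, timeout=60)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row)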

2) Query Engine

When a Cube is ready, the query engine can receive and parse user queries. It then interacts with the other components in the system and returns the corresponding results to the user.

3) Routing

Routing is responsible for converting the execution plan generated from the parsed SQL into queries against the Cube cache. The Cube is pre-calculated and cached in HBase, and such queries can be completed at the second or even millisecond level. Some queries, however, also touch the original data (stored in Hadoop's HDFS and accessed via Hive); that part of a query has higher latency.

4) Metadata management tool (Metadata)

Kylin is a metadata-driven application. The metadata management tool is the key component that manages all metadata stored in Kylin, including the most important Cube metadata. All other components depend on the metadata management tool to operate normally. Kylin's metadata is stored in HBase.

5) Task Engine (Cube Build Engine)

This engine handles all offline tasks, including shell scripts, Java APIs, and MapReduce jobs. The task engine manages and coordinates all tasks in Kylin to ensure that each task executes effectively and to handle any failures that occur along the way.

6. How Kylin works

At its core, Apache Kylin works as a MOLAP (Multidimensional On-Line Analytical Processing) Cube, i.e. multi-dimensional cube analysis, a classic theory in data analysis that is briefly introduced below.

6.1 Dimensions and measures

Dimension: the angle from which data is observed . For example, employee data can be analyzed by gender, or in more detail, by hire date or region. A dimension is a set of discrete values, such as male and female for gender, or each individual date in the time dimension. Therefore, during analysis, records with the same dimension value can be grouped together, and aggregate functions can then compute aggregates such as sums, averages, maxima, and minima.

Measure: the statistical value being aggregated (observed), i.e. the result of the aggregation operation . Examples include the number of employees of each gender, or the number of employees hired in the same year.
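As a toy illustration (hypothetical employee records, not Kylin code), gender and hire year below are dimensions, and the employee count aggregated along them is the measure:

from collections import Counter

# Hypothetical employee records; "gender" and "hire_year" are dimensions.
employees = [
    {"name": "a", "gender": "F", "hire_year": 2015},
    {"name": "b", "gender": "M", "hire_year": 2015},
    {"name": "c", "gender": "F", "hire_year": 2016},
]

# Measure: employee count, aggregated along the gender dimension.
print(Counter(e["gender"] for e in employees))     # Counter({'F': 2, 'M': 1})

# The same measure aggregated along the hire_year dimension.
print(Counter(e["hire_year"] for e in employees))  # Counter({2015: 2, 2016: 1})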

6.2 Cube and Cuboid

With dimensions and measures defined, every field on a data table or data model can be classified: it is either a dimension or a measure (something that can be aggregated). From this comes the Cube theory of pre-calculation based on dimensions and measures.

Given a data model, we can aggregate over all of its dimensions. N dimensions yield 2^N possible combinations. For each combination of dimensions, the measures are aggregated, and the result is saved as a materialized view called a Cuboid. The set of Cuboids over all dimension combinations, taken as a whole, is called the Cube.

A simple example: suppose an e-commerce sales data set has the dimensions time [time], product [item], region [location], and supplier [supplier], and the measure is sales amount. Then all dimension combinations number 2^4 = 16 in total:


  • One-dimensional (1D) combinations: [time], [item], [location], [supplier], 4 kinds;
  • Two-dimensional (2D) combinations: [time, item], [time, location], [time, supplier], [item, location], [item, supplier], [location, supplier], 6 kinds;
  • Three-dimensional (3D) combinations, likewise 4 kinds;

Finally, there is one zero-dimensional (0D) combination and one four-dimensional (4D) combination, giving 16 kinds in total.

Note: each combination of dimensions is a Cuboid, and the 16 Cuboids together form a Cube.
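The 2^N relationship is easy to check programmatically. The following small Python sketch enumerates every Cuboid (dimension combination) for the four example dimensions:

from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Every subset of the dimension set is one Cuboid; all 2^N subsets together
# form the Cube.
cuboids = [
    combo
    for k in range(len(dimensions) + 1)   # 0D up to 4D
    for combo in combinations(dimensions, k)
]

print(len(cuboids))  # 16, i.e. 2 ** 4
for c in cuboids:
    print(c if c else "(0D: grand total)")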

6.3 Core Algorithm

Kylin's working principle is to do Cube pre-calculation on the data model and use the calculated results to speed up queries:

1) Specify the data model, define dimensions and measures ;

2) Pre-compute Cube, calculate all Cuboids and save them as materialized views ;

During pre-calculation, Kylin reads the original data from Hive, computes along the dimensions we selected, and saves the result set to HBase. The default calculation engine is MapReduce, and Spark can also be chosen. The result of one build is called a segment. The build process creates multiple Cuboids ; the exact build procedure is determined by the kylin.cube.algorithm parameter, whose value can be auto, layer, or inmem . The default is auto, meaning Kylin dynamically selects an algorithm (layer or inmem) based on collected statistics; users who understand Kylin, their own data, and their cluster can set the algorithm directly.

3) Execute the query: read the Cuboids, compute, and produce the query result .

6.3.1 Layer-by-layer build algorithm (layer)


An N-dimensional Cube consists of one N-dimensional sub-cube, N (N-1)-dimensional sub-cubes, N*(N-1)/2 (N-2)-dimensional sub-cubes, ..., N one-dimensional sub-cubes, and one zero-dimensional sub-cube: 2^N sub-cubes in total. In the layer-by-layer algorithm, the number of dimensions decreases layer by layer, and each layer's calculation (except for the first layer, which is aggregated from the original data) is based on the results of the previous layer. For example, the result of [Group by A, B] can be obtained from the result of [Group by A, B, C] by aggregating away C, which avoids repeated calculation. When the zero-dimensional Cuboid has been calculated, the calculation of the entire Cube is complete.

Each round of calculation is a MapReduce job, and the rounds execute serially; an N-dimensional Cube therefore requires at least N+1 MapReduce jobs.
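The core step of the layer-by-layer algorithm can be sketched in a few lines of Python (ignoring the MapReduce mechanics): a lower-dimensional Cuboid such as [Group by A, B] is derived from its parent Cuboid [Group by A, B, C] rather than from the raw data. The dimension values and measures below are made up for illustration:

from collections import defaultdict

# Parent Cuboid [A, B, C] -> aggregated measure, assumed to have been
# computed in the previous layer (or from the raw data in the first layer).
cuboid_abc = {
    ("a1", "b1", "c1"): 10,
    ("a1", "b1", "c2"): 5,
    ("a1", "b2", "c1"): 7,
    ("a2", "b1", "c1"): 3,
}

def aggregate_away(parent_cuboid, drop_index):
    """Build a child Cuboid by removing one dimension and re-aggregating."""
    child = defaultdict(int)
    for key, measure in parent_cuboid.items():
        child_key = key[:drop_index] + key[drop_index + 1:]
        child[child_key] += measure
    return dict(child)

# Next layer: [A, B] is computed from [A, B, C] by aggregating away C (index 2).
cuboid_ab = aggregate_away(cuboid_abc, drop_index=2)
print(cuboid_ab)  # {('a1', 'b1'): 15, ('a1', 'b2'): 7, ('a2', 'b1'): 3}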

Algorithm advantages:

1) The algorithm makes full use of MapReduce's capabilities, letting the framework handle the complex intermediate sorting and shuffling, so the algorithm's code is clear, simple, and easy to maintain;

2) Benefiting from Hadoop's maturity, the algorithm makes low demands on the cluster and runs stably; during Kylin's internal maintenance, errors in these steps are rarely encountered, and even when the Hadoop cluster is relatively busy, the jobs can still be completed.

Algorithm disadvantages:

1) When the Cube has many dimensions, the number of required MapReduce jobs increases accordingly; because Hadoop job scheduling itself consumes extra resources, especially on a large cluster, the overhead of repeatedly submitting jobs can be considerable;

2) The algorithm outputs a lot of data to Hadoop MapReduce; although a Combiner is used to reduce the data transferred from the Mapper side to the Reducer side, all data still has to be sorted and combined by Hadoop MapReduce before it can be aggregated, which quietly adds pressure on the cluster;

3) There are many reads and writes on HDFS: since the output of each layer's calculation serves as the input of the next layer, these key-value pairs have to be written to HDFS; when all calculations finish, Kylin needs an additional round of jobs to convert these files into HBase's HFile format for bulk loading into HBase;

Overall, the efficiency of this algorithm is relatively low, especially when the Cube has many dimensions.

6.3.2 Fast build algorithm (inmem)

Also known as the "by segment" or "by split" algorithm, it has been available since Kylin 1.5.x. Most of the aggregation is done on the Mapper side, and the aggregated results are then handed to the Reducer, which relieves the pressure on the network bottleneck. The main idea is that each Mapper computes the data split assigned to it into a complete small Cube segment (containing all Cuboids); each Mapper then outputs its computed segment to the Reducer, which merges them into one large Cube, the final result.

Compared with the old algorithm, the fast algorithm has two main differences:

1) The Mapper pre-aggregates in memory, computing all dimension combinations; every key output by a Mapper is distinct, which reduces the amount of data output to Hadoop MapReduce;

2) A single round of MapReduce completes the calculation of all layers, reducing the scheduling overhead of Hadoop jobs.
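A rough Python sketch of the idea, with plain functions standing in for the Mapper and Reducer (dimension names and records are made up): each "mapper" pre-aggregates its own data split into a complete mini-cube in memory, containing every Cuboid, and the "reducer" merges the mini-cubes into the final Cube:

from collections import defaultdict
from itertools import combinations

dimensions = ("time", "item")

def build_segment(records):
    """'Mapper': pre-aggregate one data split into all Cuboids in memory."""
    segment = defaultdict(float)
    for rec in records:
        for k in range(len(dimensions) + 1):
            for dims in combinations(dimensions, k):
                key = (dims, tuple(rec[d] for d in dims))
                segment[key] += rec["amount"]
    return segment

def merge_segments(segments):
    """'Reducer': merge the per-split mini-cubes into the final Cube."""
    cube = defaultdict(float)
    for seg in segments:
        for key, value in seg.items():
            cube[key] += value
    return cube

split_1 = [{"time": "2016-10-01", "item": "A", "amount": 3.0}]
split_2 = [{"time": "2016-10-01", "item": "A", "amount": 2.0},
           {"time": "2016-10-02", "item": "B", "amount": 5.0}]

cube = merge_segments([build_segment(split_1), build_segment(split_2)])
print(cube[(("time", "item"), ("2016-10-01", "A"))])  # 5.0
print(cube[((), ())])  # 10.0, the 0D grand total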

6.4 Principle

The working principle of Apache Kylin is to pre-calculate a Cube on the data model and use the calculated results to speed up queries. The specific workflow is as follows.

  1. Specify the data model, define dimensions and measures.
  2. Precompute Cube, calculate all cuboids and save them as materialized views.
  3. When the query is executed, the Cuboid is read, calculated, and the query result is generated.

Because Kylin's query process does not scan the original records but completes complex operations such as table joins and aggregations in advance through pre-calculation, executing queries against the pre-calculated results, it is generally one to two orders of magnitude faster than non-pre-calculating query technologies, and the advantage is even more obvious on very large data sets. When a data set reaches the hundred-billion or even trillion-record level, Kylin can be more than 1,000 times faster than other non-pre-calculating technologies.

7. Summary

Through pre-calculation, Kylin saves the result sets in HBase, converting the original row-based relational model into columnar, key-value storage; the dimension combinations serve as the HBase row key, eliminating the expensive full-table scans at query time and making high-speed, high-concurrency analysis possible. Kylin provides a standard SQL query interface, supports most SQL functions, and also integrates seamlessly with mainstream BI products via ODBC/JDBC.

This article has introduced the historical background and technical characteristics of Apache Kylin, in particular its principle of pre-calculation-based big data querying, which in theory can approach O(1) constant query time at any data scale. This is also the key difference between Apache Kylin and traditional query technologies.

Traditional technologies such as massively parallel computing and columnar storage have O(N) query time, linear in the size of the data: if the data grows tenfold, query speed drops to one tenth, which cannot keep up with ever-growing data demands. With Apache Kylin, we no longer need to worry about query speed slowing down as data volume grows, and we can face future data challenges with more confidence.


Origin blog.csdn.net/BeiisBei/article/details/107893132