Study Notes: Apache Kylin Overview

One, what key problems does Kylin solve?

Apache Kylin's original goal was to enable second-level queries over hundreds of billions, even trillions, of records. One of the keys to this is breaking the law that query time grows linearly with data volume.

For big data OLAP, we can note two facts:

• Big data queries are generally statistical: they compute aggregate values over many records using aggregate functions. The original records themselves are not needed, or are accessed only with very low frequency and probability.

• Aggregation is carried out along dimensions, and the possible dimension combinations are limited; in general they do not grow linearly as the data expands.

 

Based on these two points, we arrive at a new idea: "pre-calculation." We pre-compute aggregation results as much as possible, and at query time try to answer queries from those pre-computed results, thereby avoiding direct scans of the potentially unbounded original records.

For example, the following SQL finds the best-selling items on October 1:

SELECT item, SUM(sell_amount) FROM sell_details WHERE sell_date='2016-10-01' GROUP BY item ORDER BY SUM(sell_amount) DESC

The traditional method scans all records, finds the sales records of October 1, aggregates sales by item, and finally sorts and returns the result.

If there were 100 million transactions on October 1, then this query must read at least 100 million records, and query speed will gradually decline as sales grow: if daily transaction volume doubles to 200 million, query execution time may double as well.

The pre-calculation method instead computes SUM(sell_amount) grouped by the dimensions [sell_date, item] in advance and stores the results. At query time, it looks up the October 1 sales of each item from the stored results, sorts them, and returns directly.

The number of records read is at most the number of distinct [sell_date, item] combinations. Obviously, this figure is far smaller than the actual number of sales records. For example, if the 100 million transactions on October 1 cover 1 million kinds of commodities, then there are only 1 million pre-computed records, one percent of the original. Moreover, these records are already aggregated by item, so the aggregation step is skipped at runtime. Looking forward, query speed now varies only with the number of dates and commodities, and is no longer tied directly to the total number of sales records. If daily transaction volume doubles to 200 million but the total number of commodities stays the same, the number of pre-computed records does not change, and query speed does not change either.
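The contrast between scanning raw records and reading pre-aggregated results can be sketched as follows. This is a toy illustration of the idea, not Kylin's implementation; the record layout stands in for the `sell_details` table in the example above.

```python
from collections import defaultdict

# Toy transaction records: (sell_date, item, sell_amount),
# standing in for the "sell_details" table in the text.
records = [
    ("2016-10-01", "apple", 10.0),
    ("2016-10-01", "apple", 5.0),
    ("2016-10-01", "pear", 7.0),
    ("2016-10-02", "apple", 3.0),
]

# Pre-calculation: aggregate SUM(sell_amount) by [sell_date, item]
# once, ahead of query time.
pre_agg = defaultdict(float)
for sell_date, item, amount in records:
    pre_agg[(sell_date, item)] += amount

def top_items(sell_date):
    """Answer the example query by scanning only the pre-aggregated
    rows for that date -- never the raw records."""
    rows = [(item, total) for (d, item), total in pre_agg.items() if d == sell_date]
    return sorted(rows, key=lambda r: r[1], reverse=True)

print(top_items("2016-10-01"))  # [('apple', 15.0), ('pear', 7.0)]
```

The loop over `pre_agg` touches one entry per (date, item) pair, so its cost tracks the number of distinct combinations, not the number of raw transactions.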

"Pre-calculation" is the third key technology, alongside "massively parallel processing" and "columnar storage," that Kylin brings to big data analysis.

 

Two, how Kylin works

2.1 Introduction to dimensions and metrics

Before describing the MOLAP Cube, we need to introduce two concepts: dimension and measure.

Simply put, a dimension is an angle from which to observe data. For example, an e-commerce site's sales data can be viewed from the time dimension (left panel of Figure 1-2), or further refined to be observed from both the time and region dimensions (right panel of Figure 1-2).

A dimension typically takes a discrete set of values, such as each individual date in the time dimension, or each individual product in the commodity dimension. Records with the same dimension values can therefore be gathered together and aggregate functions applied: accumulation, averaging, deduplicated counting, and other aggregate calculations.

A measure is the statistical value to be aggregated, that is, the result of the aggregate operation. It is generally a continuous value, such as the sales amount in Figure 1-2, or the total number of items sold. By comparing and evaluating measures, analysts can assess the data: for example, how much this year's sales grew over last year's, whether the growth met expectations, and whether the growth proportions across product categories are reasonable.
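The dimension/measure split can be made concrete with a small sketch: grouping by one dimension and computing the aggregate functions mentioned above (sum, average, deduplicated count). The field names here are illustrative, not from Kylin.

```python
# Toy sales rows: "item" is the dimension; "price" and "user" feed
# the measures. All names here are illustrative, not from Kylin.
rows = [
    {"item": "apple", "price": 4.0, "user": "u1"},
    {"item": "apple", "price": 6.0, "user": "u1"},
    {"item": "pear",  "price": 3.0, "user": "u2"},
]

def aggregate(rows, dim):
    """Group rows by one dimension and compute SUM, AVG and
    COUNT(DISTINCT user) -- the aggregate calculations in the text."""
    groups = {}
    for r in rows:
        groups.setdefault(r[dim], []).append(r)
    out = {}
    for key, grp in groups.items():
        prices = [g["price"] for g in grp]
        out[key] = {
            "sum": sum(prices),
            "avg": sum(prices) / len(prices),
            "distinct_users": len({g["user"] for g in grp}),
        }
    return out

print(aggregate(rows, "item")["apple"])  # {'sum': 10.0, 'avg': 5.0, 'distinct_users': 1}
```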

 

2.2 Cube and Cuboid

Once dimensions and measures are understood, every field in a data table or data model can be classified: each is either a dimension or a measure (something that can be aggregated). From this follows the theory of pre-computing a Cube based on dimensions and measures.

Given a data model, we can combine all of its dimensions. For N dimensions, there are 2^N possible combinations. For each combination of dimensions, the operation of aggregating the measures and storing the results as a materialized view is called a Cuboid. The set of Cuboids over all dimension combinations, taken as a whole, is called a Cube. Simply put, a Cube is the collection of materialized views aggregated over many combinations of dimensions.

As a concrete example: given a set of e-commerce sales data with the dimensions time (Time), commodity (Item), location (Location), and supplier (Supplier), and a measure of sales (GMV), there are 2^4 = 16 dimension combinations in total (see figure). The one-dimensional (1D) combinations are [Time], [Item], [Location], and [Supplier], four kinds; the two-dimensional (2D) combinations are [Time, Item], [Time, Location], [Time, Supplier], [Item, Location], [Item, Supplier], and [Location, Supplier], six kinds; there are likewise four three-dimensional (3D) combinations; finally, with one zero-dimensional (0D) and one four-dimensional (4D) combination each, there are 16 combinations in total.
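The 16 combinations above are simply the power set of the four dimensions; a short sketch enumerates them and confirms the counts per level:

```python
from itertools import combinations

# The four dimensions from the text's example.
dims = ("Time", "Item", "Location", "Supplier")

def all_cuboids(dims):
    """Enumerate every dimension combination (cuboid), from 0D to ND."""
    return [combo for k in range(len(dims) + 1)
            for combo in combinations(dims, k)]

cuboids = all_cuboids(dims)
print(len(cuboids))  # 16 == 2 ** 4

# Count cuboids per dimensionality: 1, 4, 6, 4, 1.
by_size = {}
for c in cuboids:
    by_size[len(c)] = by_size.get(len(c), 0) + 1
print(by_size)  # {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
```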

Computing a Cuboid means aggregating the sales measure (GMV) by the Cuboid's dimensions. Expressed in SQL, computing the Cuboid [Time, Location] is:

SELECT Time, Location, SUM(GMV) AS GMV FROM Sales GROUP BY Time, Location

  

The result of this calculation is saved as a materialized view, and the collection of all Cuboid materialized views is collectively called the Cube.

 

2.3 How it works

Apache Kylin's working principle is to pre-compute a Cube over the data model, and then use the pre-computed results to speed up queries. The process is as follows:

(1) Specify a data model and define its dimensions and measures.

(2) Pre-compute the Cube: calculate all Cuboids and save them as materialized views.

(3) When a query is executed, read the relevant Cuboid, perform further processing, and generate the query result.
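Step (3) hinges on choosing which Cuboid to read: any Cuboid whose dimensions cover the query's GROUP BY can answer it, and the cheapest (smallest) one is preferred. The catalogue below is a hypothetical sketch of this routing idea, not Kylin's actual planner.

```python
# Hypothetical cuboid catalogue: dimension set -> number of stored rows.
# Routing picks the smallest materialized cuboid that covers the query.
cuboids = {
    frozenset(): 1,
    frozenset({"Time"}): 365,
    frozenset({"Time", "Item"}): 365 * 1000,
    frozenset({"Time", "Item", "Location"}): 365 * 1000 * 50,
}

def route(query_dims):
    """Pick the cheapest cuboid whose dimensions cover the query's GROUP BY."""
    need = frozenset(query_dims)
    candidates = [(rows, dims) for dims, rows in cuboids.items() if need <= dims]
    if not candidates:
        raise LookupError("no cuboid covers %s" % sorted(need))
    return min(candidates, key=lambda c: c[0])[1]

print(sorted(route({"Time"})))  # ['Time']
# No [Item]-only cuboid exists, so the query falls back to the smallest
# covering cuboid and aggregates away Time at query time:
print(sorted(route({"Item"})))  # ['Item', 'Time']
```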

Since Kylin's query process does not scan the original records, but instead completes the expensive joins and aggregations in advance through pre-computation and executes queries against the pre-computed results, its speed is generally one to two orders of magnitude faster than non-pre-computed query technologies. The advantage is even more pronounced on very large data sets: when a data set reaches the hundred-billion or even trillion-record level, Kylin can be more than 1,000 times faster than non-pre-computed technologies.

 

Three, Kylin's technical architecture

The Apache Kylin system can be divided into two parts, offline build and online query; its technical architecture is shown in the figure. Online query mainly involves the upper half, and offline build the lower half.

Look first at the offline build part. As can be seen from Figure 1-4, the data sources are on the left, mainly Hadoop, Hive, Kafka, and RDBMSs, where the user data to be analyzed is stored.

Following the metadata definitions, the build engine below extracts data from the data source and builds the Cube.

Data is input in the form of relational tables and must conform to the star schema (Star Schema) or snowflake schema (Snowflake Schema).

Users can choose to build with either MapReduce or Spark.

After the Cube is built, it is saved in the storage engine on the right; currently HBase is the default storage engine.

After the offline build is complete, users can send SQL from the query system above to perform query analysis. Kylin provides a variety of REST APIs and JDBC/ODBC interfaces. No matter which interface a query enters through, the SQL ultimately arrives at the REST service layer and is then handed to the query engine for processing. Note that SQL statements are written against the relational model of the data source, not against the Cube. Kylin deliberately shields query users from the concept of the Cube: analysts only need to understand a simple relational model to use Kylin, with no extra learning curve, and traditional SQL applications are easier to migrate. The query engine parses the SQL, generates a logical execution plan based on the relational tables, translates it into a physical execution plan based on the Cube, and finally queries the pre-computed Cube to produce the result. The whole process never accesses the original data source.

Note: for the routing step below the query engine, the original design considered redirecting queries that Kylin could not execute to Hive. In practice, however, the execution-speed gap between Hive and Kylin was too large for users to form consistent expectations about query speed: most statements returned in a few seconds, while some took minutes to tens of minutes, making for a very poor user experience. In the end, this routing feature was disabled by default in the release.

Apache Kylin v1.5 introduced the concept of an "extensible architecture." Figure 1-4 shows the abstraction layers represented by the REST Server, the Cube Build Engine, and the data sources. "Extensible" means that Kylin can arbitrarily extend or replace its three main dependency modules: the data source, the build engine, and the storage engine. At the start of the design, as a member of the Hadoop family, these three were Hive, MapReduce, and HBase respectively. But as Apache Kylin spread and usage deepened, users found shortcomings in them.

For example, real-time analysis may want to import data from Kafka rather than Hive; the rapid rise of Spark forced consideration of replacing MapReduce with Spark to speed up Cube builds; and as for HBase, its read performance may not match that of Cassandra and others. Clearly, whether one technology can be swapped for another became a common question. So the system architecture was refactored in Apache Kylin v1.5, abstracting the three main dependency modules (data source, build engine, and storage engine) into interfaces, with Hive, MapReduce, and HBase merely the default implementations. Other implementations exist: the data source can also be Kafka, Hadoop, or an RDBMS; the build engine can also be Spark or Flink. Advanced users can do their own development as needed, replacing one or more of these technologies with ones better suited to their requirements.
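The interface-plus-default-implementation pattern described above can be sketched in miniature. The class and method names here are hypothetical (Kylin's real interfaces are Java and differ); the point is only that each module can be swapped independently behind a stable interface.

```python
from abc import ABC, abstractmethod

# Hypothetical interfaces for the three pluggable modules.
class DataSource(ABC):
    @abstractmethod
    def read_rows(self):
        """Yield (key, value) input rows for the build."""

class BuildEngine(ABC):
    @abstractmethod
    def build(self, source):
        """Aggregate source rows into cuboid-like records."""

class Storage(ABC):
    @abstractmethod
    def save(self, cuboids):
        """Persist built records."""

# Default-style implementations; a Kafka-backed source or Spark-backed
# engine could replace any of these behind the same interface.
class ListSource(DataSource):
    def __init__(self, rows):
        self.rows = rows
    def read_rows(self):
        return iter(self.rows)

class SumEngine(BuildEngine):
    def build(self, source):
        total = {}
        for key, value in source.read_rows():
            total[key] = total.get(key, 0) + value
        return total

class DictStorage(Storage):
    def __init__(self):
        self.data = {}
    def save(self, cuboids):
        self.data.update(cuboids)

source = ListSource([("apple", 2), ("apple", 3), ("pear", 1)])
store = DictStorage()
store.save(SumEngine().build(source))
print(store.data)  # {'apple': 5, 'pear': 1}
```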

This also lays the foundation for Kylin to keep pace with the times. If in the future a more advanced distributed computing technology can replace MapReduce, or a more efficient storage system comprehensively surpasses HBase, Kylin can swap out that subsystem at relatively low cost, keeping Kylin abreast of the latest technology trends and at the highest technical level.

The extensible architecture also brings extra flexibility; for example, it allows multiple engines to coexist. Kylin can connect to Hive, Kafka, and other third-party data sources at the same time, or users can specify different build engines or storage engines for different Cubes to achieve the ultimate in performance and functional customization.

 

Four, Kylin's features

The main features of Apache Kylin include:

  • SQL interface support
  • Support for extremely large data sets
  • Second-level query latency
  • Scalability
  • High throughput
  • Integration with BI and visualization tools


Origin www.cnblogs.com/shwang/p/12066525.html