What is Apache Kylin

1. What is Apache Kylin? 

In the era of big data, more and more enterprises are using Hadoop to manage their data, but existing business analysis tools (such as Tableau and MicroStrategy) often have serious limitations: they are hard to scale horizontally, cannot process ultra-large-scale data, and lack support for Hadoop. Using Hadoop itself for data analysis also presents obstacles: for example, most analysts are only accustomed to SQL, and it is difficult to achieve fast interactive queries on Hadoop. Apache Kylin, the "mythical beast", was designed to solve these problems.

Apache Kylin, whose Chinese name is Qilin (a shenshou, or mythical beast), is an important member of the Hadoop zoo. It is an open source distributed analytics engine, originally developed by eBay and contributed to the open source community. It provides a SQL query interface and multidimensional analysis (OLAP) capabilities on Hadoop for large-scale data: it can handle analysis tasks at the terabyte or even petabyte level, query huge Hive tables with sub-second latency, and support high concurrency.

Apache Kylin was open sourced on GitHub in October 2014 and joined the Apache Incubator soon after, in November 2014. It later officially graduated as an Apache top-level project, becoming the first Apache top-level project designed and developed entirely by a Chinese team. In March 2016, the core developers of Apache Kylin founded Kyligence to better promote the rapid development of the project and its community.

Kyligence is a data technology company focused on big data analysis. It provides enterprise-level intelligent analysis platforms and products based on Apache Kylin, along with reliable, professional, source-level commercial support. It also offers Apache Kylin developer training and issues the world's only Apache Kylin developer certification.

2. The basic principles and architecture of Kylin

To put it simply, the core idea of Kylin is precomputation: the metrics that may be needed in multidimensional analysis are computed in advance and saved as Cubes for direct access at query time. Turning expensive aggregations, multi-table joins, and similar operations into lookups on precomputed results is what gives Kylin its fast query speed and high concurrency.

The figure above is an example of a Cube. Suppose we have 4 dimensions. Each node in this Cube (called a Cuboid) is a different combination of these 4 dimensions. Each combination defines a set of dimensions to analyze (as in a GROUP BY), and the aggregated results of the measures are stored in each Cuboid. At query time, Kylin finds the Cuboid that matches the SQL, reads the measure values, and returns them.

To better fit the big data environment, Kylin reads source data from Hive, the most commonly used data warehouse on Hadoop; uses MapReduce as the Cube build engine; saves the precomputed results in HBase; and exposes REST API, JDBC, and ODBC query interfaces. Because Kylin supports standard ANSI SQL, it integrates seamlessly with common analysis tools (such as Tableau and Excel). The following is the architecture diagram of Kylin.

For Cube construction, Kylin provides an algorithm called Layer Cubing. Simply put, it starts from the Base Cuboid, which contains all dimensions, and proceeds layer by layer in order of decreasing dimension count, computing each layer by re-aggregating the results of the previous layer's Cuboids. Each layer is computed by a separate MapReduce job, as illustrated in the accompanying figure.
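
The layer-by-layer idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Kylin's implementation: the rows, the dimension names, and the functions `aggregate` and `build_cube` are all made up for the example, the only measure is a SUM, and each pass of the outer loop stands in for what would be one MapReduce job.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: three dimensions plus one measure ("sales").
DIMS = ("year", "region", "category")
ROWS = [
    {"year": 2015, "region": "east", "category": "toys", "sales": 10},
    {"year": 2015, "region": "west", "category": "toys", "sales": 5},
    {"year": 2016, "region": "east", "category": "books", "sales": 7},
    {"year": 2016, "region": "east", "category": "toys", "sales": 3},
]

def aggregate(cuboid, parent_dims, child_dims):
    """Re-aggregate a parent cuboid onto a subset of its dimensions (SUM)."""
    keep = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, total in cuboid.items():
        out[tuple(key[i] for i in keep)] += total
    return dict(out)

def build_cube(rows, dims, measure):
    """Return {dimension tuple: {dimension values: summed measure}}."""
    # First layer: the base cuboid groups by every dimension.
    base = defaultdict(int)
    for row in rows:
        base[tuple(row[d] for d in dims)] += row[measure]
    cube = {dims: dict(base)}
    # Each later layer has one dimension fewer and reads only the previous layer.
    for size in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, size):
            # Any cuboid of the previous layer that contains `child` can be the parent.
            parents = [p for p in cube if len(p) == size + 1 and set(child) <= set(p)]
            cube[child] = aggregate(cube[parents[0]], parents[0], child)
    return cube

cube = build_cube(ROWS, DIMS, "sales")
print(len(cube))          # number of cuboids: 2 ** 3
print(cube[("year",)])    # totals per year
```

With three dimensions the sketch produces 2^3 = 8 cuboids; with the 4 dimensions of the earlier example it would produce 16. The raw rows are read only once, for the base cuboid.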

The MapReduce results are finally stored in HBase. The rowkey of each HBase row is composed of the dimension values, and the measures are stored in a column family. To reduce storage cost, both dimensions and measures are encoded. At query time, HBase's column-oriented storage helps Kylin deliver fast responses and high concurrency.
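
To make the encoding idea concrete, here is a minimal sketch that assumes simple dictionary encoding with one byte per dimension. The article does not describe Kylin's actual encodings or rowkey layout, so the column values and the helpers `build_dictionary`, `encode_rowkey`, and `decode_rowkey` are purely hypothetical.

```python
def build_dictionary(values):
    """Map each distinct value to a small integer ID, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Hypothetical dimension columns.
regions = ["east", "west", "east", "north"]
categories = ["toys", "books", "toys", "toys"]

region_dict = build_dictionary(regions)        # {"east": 0, "north": 1, "west": 2}
category_dict = build_dictionary(categories)   # {"books": 0, "toys": 1}

def encode_rowkey(region, category):
    """Concatenate the fixed-width (1 byte each) dimension IDs into a rowkey."""
    return bytes([region_dict[region], category_dict[category]])

def decode_rowkey(rowkey):
    """Invert encode_rowkey using the reversed dictionaries."""
    region_rev = {i: v for v, i in region_dict.items()}
    category_rev = {i: v for v, i in category_dict.items()}
    return region_rev[rowkey[0]], category_rev[rowkey[1]]

rowkey = encode_rowkey("west", "toys")
print(rowkey, decode_rowkey(rowkey))
```

In this toy case the pair ("west", "toys") takes 2 bytes instead of 8 characters, which is the kind of saving that encoding the dimensions is meant to provide.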

With these precomputed results in place, when Kylin receives a user's SQL request it builds a query plan and rewrites the joins, SUM, COUNT DISTINCT, and other operations that would otherwise have to be executed into query operations on the Cube.
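
As a rough illustration of what "rewriting into a Cube query" means, the sketch below answers a GROUP BY / SUM query purely by lookup in precomputed cuboids. The `CUBE` contents and the function `query_sum` are assumptions made for the example; Kylin's real query planner is far more involved.

```python
# Hypothetical precomputed cuboids: {group-by dimensions: {values: SUM(sales)}}.
CUBE = {
    ("region", "year"): {("east", 2015): 10, ("west", 2015): 5, ("east", 2016): 10},
    ("region",): {("east",): 20, ("west",): 5},
    ("year",): {(2015,): 15, (2016,): 10},
    (): {(): 25},
}

def query_sum(group_by, filters=None):
    """Answer SELECT <group_by>, SUM(sales) ... WHERE <filters> GROUP BY <group_by>."""
    filters = filters or {}
    # The cuboid must contain every dimension the query groups or filters on.
    needed = tuple(sorted(set(group_by) | set(filters)))
    cuboid = CUBE[needed]
    result = {}
    for key, total in cuboid.items():
        row = dict(zip(needed, key))
        if all(row[d] == v for d, v in filters.items()):
            out_key = tuple(row[d] for d in group_by)
            result[out_key] = result.get(out_key, 0) + total
    return result

# SELECT region, SUM(sales) FROM sales GROUP BY region
print(query_sum(["region"]))
# SELECT region, SUM(sales) FROM sales WHERE year = 2015 GROUP BY region
print(query_sum(["region"], {"year": 2015}))
```

Neither query touches the raw rows: the first reads the ("region",) cuboid directly, and the second reads the ("region", "year") cuboid and filters it.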

Kylin provides a native web interface where users can easily create and configure Cubes, monitor the progress of Cube builds, run SQL queries, and view basic visualizations of the results.

According to public data, Kylin's query performance holds not just for individual SQL statements but as an average over tens of thousands of them. In production, 90th-percentile queries return within 3s.

At the Apache Kylin Meetup held last month, Internet companies such as Meituan, JD.com, and Baidu shared their experience. For example, in the JD Yunhai case, a single Cube has up to 8 dimensions and up to 400 million rows; the largest Cube takes 800 GB of storage, and the 30 Cubes together occupy about 4 TB. In terms of query performance, at around 50 QPS the average response time is within 200ms, and at around 200 QPS it is within 1s.

Beijing Mobile also presented a case study of Kylin at a telecom operator at the meetup. The data showed that Kylin can achieve better query performance than Hive/SparkSQL on weaker hardware. More and more companies in China and abroad now use Kylin as an important component of their production big data environments, including eBay, UnionPay, Baidu, and China Mobile. For more community case studies and news, visit the Apache Kylin official website or the Kyligence blog.

3. The latest features of Kylin

The latest version of Kylin, 1.5.x, introduces many long-awaited features. Its extensible architecture fully decouples Kylin's three major dependencies (data source, Cube engine, and storage engine). Kylin no longer depends directly on Hadoop/HBase/Hive; instead, it acts as an extensible platform that exposes abstract interfaces, with the concrete data source, engine, and storage supplied as plug-ins.

Through custom development, developers and users can connect Kylin to big data systems other than Hadoop/HBase/Hive. For example, using Kafka instead of Hive as the data source, Spark instead of MapReduce as the computing engine, or Cassandra instead of HBase as storage becomes much simpler. This also ensures that Kylin can evolve and keep up with technology trends.
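
The plug-in idea can be pictured as three abstract interfaces with interchangeable implementations. The sketch below is not Kylin's actual (Java) API: the class names `DataSource`, `CubeEngine`, `Storage`, the in-memory stand-ins, and `run_build` are all hypothetical and exist only to show how the build flow stays the same when one component is swapped.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    @abstractmethod
    def read_rows(self):
        """Yield (key, measure) pairs from the source system."""

class CubeEngine(ABC):
    @abstractmethod
    def build(self, rows):
        """Aggregate rows into a cuboid."""

class Storage(ABC):
    @abstractmethod
    def save(self, cuboid): ...

    @abstractmethod
    def load(self): ...

class ListSource(DataSource):
    """Stand-in for a Hive or Kafka source: serves rows from memory."""
    def __init__(self, rows):
        self.rows = rows

    def read_rows(self):
        return iter(self.rows)

class SumEngine(CubeEngine):
    """Stand-in for a MapReduce or Spark engine: sums the measure per key."""
    def build(self, rows):
        totals = {}
        for key, value in rows:
            totals[key] = totals.get(key, 0) + value
        return totals

class MemoryStorage(Storage):
    """Stand-in for HBase or Cassandra: keeps the cuboid in a dict."""
    def __init__(self):
        self.data = {}

    def save(self, cuboid):
        self.data = dict(cuboid)

    def load(self):
        return self.data

def run_build(source: DataSource, engine: CubeEngine, storage: Storage):
    """The build flow depends only on the abstract interfaces."""
    storage.save(engine.build(source.read_rows()))
    return storage.load()

result = run_build(ListSource([("east", 10), ("west", 5), ("east", 7)]),
                   SumEngine(), MemoryStorage())
print(result)
```

Replacing `ListSource` with another `DataSource` subclass would require no change to `run_build`, which is the property the plug-in architecture is after.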

In Kylin 1.5.x, the HBase storage structure has also been adjusted: large Cuboids are stored in shards, and the former linear scan becomes a parallel scan. Test comparisons over tens of thousands of queries show that the sharded storage structure speeds up previously slow queries by 5 to 10 times, while previously fast queries see little improvement; the average speedup is about 2 times.
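
A minimal sketch of the shard-then-scan-in-parallel pattern follows. It assumes hash-based sharding with a fixed shard count, which the article does not specify; `NUM_SHARDS`, `shard_of`, `write_sharded`, and `parallel_scan` are hypothetical names, and Python threads only show the structure of the scan, not the speedup a cluster of region servers would give.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # hypothetical shard count

def shard_of(rowkey: bytes) -> int:
    """Pick a shard from a stable hash of the rowkey."""
    return zlib.crc32(rowkey) % NUM_SHARDS

def write_sharded(rows):
    """Distribute (rowkey, value) pairs of one large cuboid across shards."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for rowkey, value in rows:
        shards[shard_of(rowkey)].append((rowkey, value))
    return shards

def scan_shard(shard, predicate):
    """Linear scan of a single shard, returning a partial sum."""
    return sum(value for rowkey, value in shard if predicate(rowkey))

def parallel_scan(shards, predicate):
    """Scan all shards concurrently and merge the partial sums."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: scan_shard(shard, predicate), shards)
        return sum(partials)

rows = [(f"key{i:03d}".encode(), i) for i in range(100)]
shards = write_sharded(rows)
total = parallel_scan(shards, lambda key: key.startswith(b"key01"))
print(total)  # same answer as one linear scan over all rows
```

Each shard is scanned independently, so a large Cuboid no longer has to be read end to end by a single scanner.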

In addition, 1.5.x introduces the Fast Cubing algorithm, which completes most of the aggregation on the Mapper side and then passes the aggregated results to the Reducer, reducing pressure on the network bottleneck. Experiments on more than 500 Cube build tasks show that, with Fast Cubing, overall Cube construction is 1.5 times faster.
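
The mapper-side aggregation described above can be sketched as follows. This is a simplified picture under stated assumptions, not Kylin's Fast Cubing code: the input splits, the single SUM measure, and the functions `all_cuboids`, `mapper`, and `reducer` are invented for the example.

```python
from collections import Counter
from itertools import combinations

DIMS = ("year", "region")  # hypothetical dimensions

def all_cuboids(dims):
    """Every combination of dimensions, from the base cuboid down to ()."""
    return [c for size in range(len(dims), -1, -1) for c in combinations(dims, size)]

def mapper(split):
    """Pre-aggregate one input split for every cuboid before anything is sent."""
    partial = Counter()
    for row in split:
        for cuboid in all_cuboids(DIMS):
            key = (cuboid, tuple(row[d] for d in cuboid))
            partial[key] += row["sales"]
    return partial

def reducer(partials):
    """Merge the already-aggregated partial results from all mappers."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values for matching keys
    return merged

splits = [
    [{"year": 2015, "region": "east", "sales": 10},
     {"year": 2015, "region": "east", "sales": 5}],
    [{"year": 2015, "region": "east", "sales": 7},
     {"year": 2016, "region": "west", "sales": 3}],
]
partials = [mapper(split) for split in splits]
cube = reducer(partials)
print(len(partials[0]), cube[((), ())])
```

The first split contains two rows with identical dimension values, so its mapper emits 4 aggregated records instead of 2 rows x 4 cuboids = 8. That reduction in shuffled data is the network saving the paragraph refers to.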

The community is currently preparing the release of Apache Kylin 1.5.2, which is in the voting stage on the Apache mailing list and is expected to be available for download from the Kylin official website this week.

Version 1.5.2 brings a total of 36 bug fixes, 33 improvements, and 6 new features. Major improvements include more efficient HyperLogLog computation, a faster "Convert data to hfile" step during Cube construction, better function hints in the UI, and support for Hive views as lookup tables.

Another piece of news is that Kylin will support the MapR and CDH Hadoop distributions; see KYLIN-1515 and KYLIN-1672 for details. The versions tested are MapR 5.1 and CDH 5.7.

An important UI update allows users to set custom configuration at the Cube level that overrides the global configuration in kylin.properties. For example, defining kylin.hbase.region.count.max on a Cube sets the maximum number of HBase regions for that Cube.
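
The override behaviour amounts to a two-level lookup: Cube-level settings first, global settings second. The sketch below illustrates that precedence only. The key kylin.hbase.region.count.max comes from the text, but its values, the second key, and the variable names are made up, and this is not how Kylin itself loads configuration.

```python
from collections import ChainMap

# Global settings, as they might appear in kylin.properties (values are illustrative).
global_config = {
    "kylin.hbase.region.count.max": "500",
    "hypothetical.other.setting": "default",  # placeholder key, not a real Kylin property
}

# Cube-level overrides entered for one particular Cube.
cube_overrides = {"kylin.hbase.region.count.max": "100"}

# ChainMap searches cube_overrides first and falls back to global_config.
effective = ChainMap(cube_overrides, global_config)

print(effective["kylin.hbase.region.count.max"])  # Cube-level value wins
print(effective["hypothetical.other.setting"])    # falls back to the global value
```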

Another important feature is Diagnosis. Users often run into thorny problems, such as Cube build failures, SQL query failures, Cube builds that take too long, or SQL queries that take too long. However, because operations staff often lack a deep understanding of Kylin's internals, it is difficult for them to locate the root cause quickly. We also often see users asking for help on the mailing list, where the lack of sufficient information makes it hard for the community to give precise advice.

When a user runs into a problem with queries or with Cube/Model management, clicking the Diagnosis button on the System page makes the system automatically collect the information related to the current Project and package it into a zip file that is downloaded to the user's machine. The package contains the relevant metadata, logs, HBase configuration, and so on, and can also be attached when asking for help on the mailing list.
