Apache Kylin (Kirin)

Why Kylin?

       Hadoop solved the problem of storing massive amounts of data.

       Early on, computation relied on Hadoop's MapReduce model, which was too slow: it was suitable only for offline computing and could not handle real-time or iterative workloads.

       Spark emerged in response (and boosted the adoption of Scala along the way). Its computation model improves on MapReduce and runs many times faster than Hadoop's MapReduce.

       In today's enterprises, however, daily data increments are measured in hundreds of megabytes or in gigabytes. Growth on that scale drives up operating and hardware costs and slows response times, to the point where even Spark struggles.

       In this situation, companies generally divide queries into two kinds: ad hoc queries and custom queries.

       Ad hoc queries: engines such as Hive and SparkSQL make data analysis easier to some extent, but they are suited only to ad hoc scenarios. Their advantage is flexibility: users choose the query conditions according to their own needs, whereas ordinary application queries are fixed in advance and developed for a specific purpose. As data volume and computational complexity grow, however, response times can no longer be guaranteed.

       Custom (real-time) queries: in most cases these must respond to user actions in real time, which query engines such as Hive can hardly do. The usual approach is to extract and compute over the warehouse data, store the results in a relational database such as MySQL, and serve user queries from there. As the volume of data behind these queries grows, this approach becomes very costly.

       Kylin differs from Hive and other massively parallel processing architectures in that it is based on precomputation. We define the query dimensions in advance, and Kylin computes the results and stores them in HBase; when we later query and analyze massive data sets, results come back in sub-second time.

       Kylin is plainly trading space for time. The combinations of the defined fields are computed ahead of time and stored, so the amount of data scanned at query time is much smaller. If a query matches what was precomputed, the result can be returned directly, which is why Kylin queries are fast.

       Before reading on, please read: https://blog.csdn.net/Su_Levi_Wei/article/details/89501304

 

What is Kylin?

       Apache Kylin (Extreme OLAP Engine for Big Data), named after the qilin (the Chinese unicorn), is an important member of the Hadoop ecosystem. It is an open source distributed analytics engine, originally developed by eBay, that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop. It supports highly concurrent processing of data sets at TB to PB scale and can query huge Hive tables in sub-second time.

       Kylin was open sourced on GitHub in October 2014, joined Apache in November 2014, and became a top-level project in November 2015, the first top-level Apache project designed and developed entirely by a team in China. In March 2016, Kylin's core developers founded the company Kyligence to promote the development of the project and its community.

Cube & Cuboid

       The Cube is the core of Kylin: it is by building Cubes that Kylin achieves sub-second queries over huge amounts of data.

       Before building a Cube, first design the data warehouse and its architecture and decide which dimensions and measures (metrics) to analyze. A Cube can then be built from the defined dimensions and measures.

       A Cube is the set of results computed for every combination of dimensions in a given data model.

       For N dimensions there are 2^N possible combinations, and the measures are aggregated for each combination of dimensions.

       Each combination of dimensions is called a Cuboid; a Cuboid holds the values of all measures for one particular combination of dimensions.

       The following walks through building a four-dimensional Cube.

       Suppose we have a set of sales data with four dimensions (time, commodity, location, supplier) and sales as the measure. There are then 2^4 = 16 combinations of dimensions, as shown in the corresponding figure.
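
       The count of combinations can be checked with a short Python sketch. It is only an illustration of the combinatorics, not Kylin code; the function name enumerate_cuboids is made up for this example, and the dimension names are the ones from the sales example.

```python
from itertools import combinations

# Dimension names from the sales example; the measure (sales) is not needed here.
DIMENSIONS = ["time", "commodity", "location", "supplier"]


def enumerate_cuboids(dimensions):
    """Return every dimension combination, from most dimensions to fewest."""
    cuboids = []
    for size in range(len(dimensions), -1, -1):
        cuboids.extend(combinations(dimensions, size))
    return cuboids


cuboids = enumerate_cuboids(DIMENSIONS)
print(len(cuboids))  # number of cuboids for 4 dimensions
print(cuboids[0])    # the base cuboid: all four dimensions
print(cuboids[-1])   # the empty combination: a single grand total
```

       Every extra dimension doubles the number of combinations, which is why the choice of dimensions matters so much for build time and storage.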

       If all of these are calculated in advance, then even a SQL query that joins tables can return its result right away.

 

Cube & Cuboid build process

Kylin's core idea is precomputation: the measures that multidimensional analysis might need are computed ahead of time and the results are saved as a Cube, to be read directly at query time. Highly complex aggregations, multi-table joins and similar operations are thereby turned into lookups of precomputed results, which is what gives Kylin its fast queries and high concurrency.

Kylin provides a Cube construction method called Layer Cubing (by-layer cubing). The algorithm proceeds in order of decreasing dimension count: starting from the Base Cuboid, each layer is produced by aggregating the previous layer's results along one dimension, and each layer is computed by a separate MapReduce job.

       The Map and Reduce steps are relatively simple. The Mapper takes the Cuboid results of the previous layer as input. Because each Key is the dimension values concatenated together, the Mapper finds the dimension to be aggregated away, removes its value to form a new Key, and emits the new Key with the Value. All new Keys are then sorted and shuffled. The Reducer receives each group of Values that share a Key, aggregates them, and outputs the result with the Key, completing one round of computation.

       Each round is one MapReduce job, and the rounds run serially, so an N-dimensional Cube needs at least N MapReduce jobs.
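
       The layer-by-layer idea can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not Kylin's implementation: the sample rows, the SUM measure, and the names aggregate and build_by_layer are all invented here, and where Kylin runs one MapReduce job per layer, this sketch simply loops.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, commodity, location) -> sales amount.
ROWS = [
    (("2019", "phone", "Beijing"), 10),
    (("2019", "phone", "Shanghai"), 5),
    (("2019", "laptop", "Beijing"), 7),
    (("2020", "phone", "Beijing"), 3),
]


def aggregate(records):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def build_by_layer(rows, n_dims):
    """Return {tuple of dimension indexes: {key: sum}} for every cuboid."""
    base = tuple(range(n_dims))
    cube = {base: aggregate(rows)}  # layer 0: the base cuboid
    layer = [base]
    while layer and layer[0]:       # stop once the 0-dimension cuboid is reached
        next_layer = []
        for parent in layer:
            for pos in range(len(parent)):
                child = parent[:pos] + parent[pos + 1:]
                if child in cube:
                    continue        # already derived from another parent
                # Map step: drop one dimension value from each parent key.
                mapped = ((key[:pos] + key[pos + 1:], value)
                          for key, value in cube[parent].items())
                cube[child] = aggregate(mapped)
                next_layer.append(child)
        layer = next_layer
    return cube


cube = build_by_layer(ROWS, 3)
print(len(cube))      # number of cuboids for 3 dimensions
print(cube[(0,)])     # sales by time only
print(cube[()][()])   # grand total
```

       Each child cuboid is computed from an already aggregated parent rather than from the raw rows, which is the point of working layer by layer.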

       The final results of the MapReduce computation are stored in HBase. For queries that span time ranges (year, quarter, month and so on), Kylin manages storage by partitioning the Cube into Data Segments.

       The RowKey of each HBase row is composed of the dimension values, and the Cuboid's measure values are stored as Values in a Column Family. To reduce storage cost, both dimensions and measures are encoded.
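
       To show why encoding saves space, here is a small Python sketch of dictionary encoding, one common way to shrink dimension values. The byte layout (one byte per dimension) and the names build_dictionary and encode_row_key are assumptions made for illustration; this is not Kylin's actual RowKey format.

```python
def build_dictionary(values):
    """Assign each distinct value a small integer ID, in sorted order."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def encode_row_key(dim_values, dictionaries):
    """Concatenate one-byte dictionary IDs into a compact row key.

    Assumes fewer than 256 distinct values per dimension (illustration only).
    """
    return bytes(d[v] for v, d in zip(dim_values, dictionaries))


# Hypothetical dimension values.
years = build_dictionary(["2019", "2020"])
cities = build_dictionary(["Beijing", "Shanghai", "Beijing", "Shenzhen"])

key = encode_row_key(("2020", "Shanghai"), (years, cities))
print(len("2020" + "Shanghai"), "characters ->", len(key), "bytes")
```

       Because the IDs are assigned in sorted order, encoded keys keep the same ordering as the original values, which suits a store such as HBase that keeps rows sorted by key.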

       In the query phase, HBase's column-oriented storage helps Kylin deliver fast responses and high concurrency.

 

Kylin Technology Architecture

 

Data Source

Kylin supports a variety of data sources; the default data source is Hive.

Storage Engine

       Kylin works by precomputation, and the default storage engine for the precomputed results is HBase.

REST Server

       The REST Server is the entry point for application development. Applications can use it to submit queries, fetch results, trigger Cube build jobs, and access metadata and user permissions. SQL queries can also be issued through the RESTful interface.
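
       As a rough sketch of what a SQL query over the REST interface might look like, the Python below builds (but does not send) a request to Kylin's query endpoint. The host and port, the credentials, the project name learn_kylin and the table kylin_sales are assumptions taken from a typical sample installation, and the helper build_query_request is invented for this example; check the REST API documentation of your Kylin version before relying on the exact fields.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, project, sql, limit=100):
    """Build a POST request for the query endpoint using HTTP Basic auth."""
    body = json.dumps({"sql": sql, "project": project, "limit": limit})
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# All values below are placeholders for a local test installation.
request = build_query_request(
    "http://localhost:7070", "ADMIN", "KYLIN", "learn_kylin",
    "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
)

# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         result = json.load(response)
```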

Query Engine

       Once a Cube is ready, the query engine receives and parses user queries, interacts with the other components, and returns the corresponding results to the user.

Routing

       Routing converts the execution plan generated from the parsed SQL into a query against the Cube. The Cube is precomputed and cached in HBase, and user queries are answered using the router's query optimization together with HBase Coprocessors.

Metadata

       Kylin stores and manages all of its metadata, including Cube metadata, and the other components all build on it. Both technical metadata and business metadata are stored in Kylin's HBase.

Cube Build Engine (job engine)

       Handles and coordinates all offline tasks, including shell scripts, Java API calls, and MapReduce jobs.

Three ways to build a Cube

       Kylin Cubes can be built in three ways: full build, incremental build, and streaming build.

       Full build: every build covers the whole Hive table. This is rarely used in practice beyond initialization, because in most business scenarios the fact table keeps growing.

       Incremental build: each build processes only the new data in the Hive table rather than all of it, which lowers the cost of building. Kylin divides the Cube into multiple Segments, each identified by a start time and an end time.
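
       The Segment bookkeeping can be pictured with a small Python model. This is a conceptual sketch, not Kylin's data structures: the Segment class, the half-open [start, end) convention, and the functions next_segment and segments_for_query are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Segment:
    start: date  # inclusive
    end: date    # exclusive (assumed convention)


def next_segment(segments, partition_start, build_until):
    """Return the Segment an incremental build would add: new data only."""
    # The first build starts at the partition start date; later builds
    # continue from the end of the latest existing Segment.
    start = max((s.end for s in segments), default=partition_start)
    if build_until <= start:
        raise ValueError("no new data to build")
    return Segment(start, build_until)


def segments_for_query(segments, query_start, query_end):
    """Select the Segments whose range overlaps [query_start, query_end)."""
    return [s for s in segments if s.start < query_end and query_start < s.end]


segments = []
for day in (2, 3, 4):  # three daily incremental builds
    segments.append(next_segment(segments, date(2019, 1, 1), date(2019, 1, day)))

hits = segments_for_query(segments, date(2019, 1, 2), date(2019, 1, 3))
print(len(segments), hits)
```

       A query over a time range only needs the Segments that overlap it, and each build touches only the data that arrived since the previous one.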

       Incremental builds cope with steadily growing business data, but they cannot deliver results in near real time (at the level of minutes). They use Hive as the data source, and data is loaded into Hive by ETL on a schedule (for example, once a day), while timeliness is an important part of the value of data.

       Differences between a full build and an incremental build:

              1. When creating the Model, you must specify a Partition Date Column; the Cube is segmented by this date.

              2. When creating the Cube, you must specify a Partition Start Date, which is the default start time of the Cube's first Segment.

              Official documentation: http://kylin.apache.org/docs20/tutorial/create_cube.html

                              http://kylin.apache.org/docs20/tutorial/cube_build_job.html

       Streaming build: to handle data that grows in real time, a streaming build uses Kafka as the data source. The build engine pulls data from Kafka at regular intervals, a design very similar to Spark Streaming's micro-batches. This feature has been available since Kylin 1.6.

 

Kylin features

SQL interface

       Kylin's main external interface is SQL. Because SQL is easy to use, it greatly lowers the cost of learning Kylin.

Support for massive data sets

       With Hive, SparkSQL, or Impala, query time grows linearly with data volume. Apache Kylin uses precomputation to break this relationship: its limits depend on the number of dimensions and their cardinality (the number of distinct values in each dimension) rather than on the size of the data set, so Kylin is better suited to querying massive data sets.

Sub-second response

       Thanks to precomputation, queries are very fast, because the expensive joins and aggregations have already been done while the Cube was being built.

Horizontal scaling

       Apache Kylin can be deployed as a cluster to scale horizontally. Adding nodes, however, only increases Kylin's capacity to serve queries; it does not increase its precomputation capacity.

Integration with visualization tools

       Kylin integrates with BI tools such as Tableau, Power BI / Excel, MSTR, QlikSense, Hue, and Superset.

Building multidimensional cubes (Cube)

       Users can define a data model in Kylin and build cubes over data sets of more than ten billion records.

 

Kylin server mode

       Kylin instances are stateless: all runtime state is kept in the metadata stored in HBase (specified by kylin.metadata.url in conf/kylin.properties), so all instances share the same state (job status, Cube status, and so on).

       Each Kylin instance has a kylin.server.mode entry in conf/kylin.properties that specifies the mode it runs in.

       job: runs the job engine, which manages the cluster's jobs.

       query: runs only the query engine, which receives and responds to SQL queries.

       all: runs both the job engine and the query engine in the same instance.

       NOTE: only one instance may run the job engine (all or job mode); the remaining instances must use query mode. The default setup is therefore similar to a Master / Slave arrangement.


Origin blog.csdn.net/Su_Levi_Wei/article/details/89516462