Kylin Series (1) - Getting Started

 
 
 
I usually just use Kylin without understanding the principles behind it, which is why I wrote this article. It is not entirely original: I read a lot of material and many blog posts, added my own understanding, and collected what I consider the essential parts. Treat it as my own study summary of Kylin. Reference links are listed at the end for anyone who needs them.
 
Foreword
Enterprise queries can be divided into two kinds: ad hoc queries and predefined (custom) queries. Many OLAP engines, including Hive, Presto, and SparkSQL, greatly reduce the difficulty of data analysis, but they really only fit the ad hoc query scenario. As data volume and computational complexity grow, their response time can no longer be guaranteed, which runs counter to what the business needs: data analysts and business users want near real-time feedback from the data in order to make better business decisions.
So how does Kylin solve OLAP queries over massive data?
Kylin is an open-source distributed analysis engine that provides a SQL query interface and multidimensional analysis (OLAP) capability on top of Hadoop; in other words, Kylin is built on the Hadoop platform.
Built on the Hadoop distributed computing platform, Kylin takes full advantage of MapReduce's parallel processing capability and scalable infrastructure to handle large data volumes efficiently (in fact, the cuboids in Kylin are computed by MapReduce jobs), and the architecture scales out with the size of the data.
Process:
- Data source: Hive, Kafka
- Compute: build the Cube with MapReduce
- Storage: HBase
- SQL query parsing: Kylin's SQL parser
Kylin uses a pre-computation model: the user defines the query dimensions in advance, Kylin performs the calculations for us, and the results are stored in HBase, giving sub-second query and analysis over massive data. It is a space-for-time trade-off.
In effect, it exhaustively pre-aggregates every combination of the dimensions that queries may involve; at query time the SQL is parsed and the data is pulled from HBase, whose read performance gives quite good results.
Incidentally, Apache Kylin is the first Apache top-level project led by a Chinese team.
 
Key concepts
 
Data warehouse
A data warehouse (DW for short) is a core part of BI. Its main job is to integrate data from different data sources and, through multidimensional analysis, support business decisions and report generation.
A data warehouse is used differently from a traditional database. A traditional database is transaction-oriented, while a DW is analysis-oriented. A traditional database mostly serves real-time business responses involving CRUD operations, so it has to follow the three normal forms and needs ACID. The data in a data warehouse is mostly historical, and its main purpose is to support business decisions, so there can be a lot of data redundancy, but that redundancy makes querying along multiple dimensions easier and gives decision makers more angles to look from.
In traditional BI the warehouse data lives in databases such as MySQL or SQL Server, while in the big data world Hive is the common choice. Hive is also Kylin's default data source.
 
Difference between a traditional data warehouse and a big data warehouse
In passing, let me mention the distinction between a traditional data warehouse and a big data warehouse, and why a big data warehouse is needed at all.
Here are a few reasons:
1. Diversified data sources
The original data sources were mostly transactional data, but they may now also include behavioral data, financial data, and so on.
2. Skyrocketing data volume
The original data sources may have been fairly simple, but once the sources diversify, the data volume soars and single-machine processing can no longer keep up. Behavioral log data, for example, simply cannot be handled that way. With Hive, distributed processing and partitioning can be used to speed things up.
3. Data types
A traditional data warehouse can handle structured data but not unstructured data. A big data warehouse can accept unstructured data in HBase and read it through Hive external tables.
4. Who it serves
A traditional warehouse mostly serves executives, operations, and finance staff. A big data warehouse serves those people too, but it may also provide data interfaces to other systems, such as recommendation systems and internal risk-control systems.
5. Processing speed
A big data warehouse uses a distributed architecture, so its computing efficiency is higher than that of a traditional warehouse, and it can scale out dynamically on demand without worrying about performance.
 
OLAP and OLTP
OLAP (Online Analytical Processing) is oriented toward multidimensional analysis and decision support, and is mostly used in data warehouses. OLTP (Online Transaction Processing) is what conventional databases such as MySQL, Oracle, and SQL Server do; it focuses on the business system's need to insert, delete, update, and query individual rows in real time.
 
Dimensions and measures
A dimension is an angle from which to observe the data. For an order table, for example, the dimensions could be order creation time, region, product category, product, and so on.
Dimension values are generally discrete, such as each individual date of a time dimension or each individual region of a location dimension, so records with the same dimension values can be aggregated together and the aggregation computed over them.
A measure is the aggregated value, i.e. the result of the aggregation. For an order, sales volume and sales amount are two measures, the values that need to be aggregated.
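To make the dimension/measure distinction concrete, here is a minimal sketch in Python with pandas; the table, column names, and numbers are made up for illustration. Grouping by the dimensions and summing the measures is exactly the kind of aggregation Kylin pre-computes.

import pandas as pd

# Toy order table: region and category are dimensions, amount and qty are measures.
orders = pd.DataFrame({
    "region":   ["North", "North", "South", "South"],
    "category": ["Phone", "Laptop", "Phone", "Phone"],
    "amount":   [3000, 8000, 2500, 2800],
    "qty":      [1, 2, 1, 3],
})

# Aggregate the measures over one combination of dimensions (region, category).
summary = orders.groupby(["region", "category"], as_index=False).agg(
    total_amount=("amount", "sum"),
    total_qty=("qty", "sum"),
)
print(summary)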
 
Dimension cardinality
Cardinality is the number of distinct values a dimension takes in the data set.
For example, if country is a dimension and it has 200 distinct values, the cardinality of that dimension is 200.
The cardinality of a dimension usually ranges from tens to tens of thousands, while a few dimensions such as "User ID" can have a cardinality in the millions or even tens of millions.
A dimension with a cardinality above one million is often called an ultra-high-cardinality dimension.
The cardinalities of all the dimensions in a Cube determine its complexity: if several dimensions have ultra-high cardinality, the probability of Cube explosion is high.
 
Fact and dimension tables
The fact table (FactTable) stores the fact records: the concrete elements of each event and what actually happened, such as system logs, sales records, and inventory records.
The data volume of the fact table is much larger than that of the other tables.
A dimension table (DimensionTable) describes the elements of the events in the fact table.
It holds the attribute values of a dimension and can be joined to the fact table; in effect, attributes that recur frequently in the fact table are extracted into a separate, normalized table.
Common dimension tables: a date table (each date with its corresponding week, month, and quarter attributes), a location table (country, province/state, city).
Benefits of dimension tables:
- they reduce the size of the fact table
- dimensions are easier to manage and maintain; changes to a dimension table do not require large changes to the fact table
- a dimension table can be reused by multiple fact tables.
By the way, Kylin uses the star model, in which every dimension table is joined directly to the fact table.
 
Star model
The star model is a multidimensional data model commonly used in data warehouses. Its characteristics are a single fact table and zero or more dimension tables; the dimension tables are joined to the fact table through foreign keys, and there are no joins between dimension tables.
(Note that in Kylin the primary key of a dimension table must be unique; apart from the join field, the fact table and a dimension table are not allowed to have fields with the same name, nor may two dimension tables share a field name; and the fields that join the fact table and a dimension table must have the same type. These are errors you will frequently run into when building a cube.)
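As a rough illustration of the star model (the table and column names below are hypothetical, not from any real schema), here is a small pandas sketch in which one fact table joins two dimension tables directly, and the dimension tables never join each other:

import pandas as pd

# Hypothetical fact table: one row per sale, with foreign keys into the dimension tables.
fact_sales = pd.DataFrame({
    "date_id": [20190101, 20190101, 20190102],
    "loc_id":  [1, 2, 1],
    "amount":  [120.0, 80.0, 200.0],
})

# Dimension tables: each describes one angle of observation and has a unique primary key.
dim_date = pd.DataFrame({
    "date_id": [20190101, 20190102],
    "month":   ["2019-01", "2019-01"],
    "weekday": ["Tue", "Wed"],
})
dim_location = pd.DataFrame({
    "loc_id": [1, 2],
    "city":   ["Beijing", "Shanghai"],
})

# Star join: the fact table links to each dimension table; dimensions do not link to each other.
star = fact_sales.merge(dim_date, on="date_id").merge(dim_location, on="loc_id")
print(star.groupby(["month", "city"], as_index=False)["amount"].sum())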
 
Kylin in the dimension table design
Kylin dimension table is for certain requirements.
To have data consistency, primary key values ​​must be unique. That is the dimension table associated with the column must be unique in the dimension table, otherwise it will error. This will resolve the error mentioned in the subsequent kylin.
Dimension tables as small as possible, because the dimension tables will Kylin loading into memory to check for; default threshold is 300MB.
Change the low frequency, Kylin attempt to reuse the snapshot dimension table in each building, if often change, reuse fail, which often leads to sexual dimension table to create a quick find.
Do not be hive dimension table view, because the view is actually a logical structure, does not actually exist, each use need to be materialized, resulting in additional overhead of time.
 
Cube and Cuboid
Once you understand dimensions and measures, every field in the data model can be classified: it is either a dimension or a measure; there is no third kind of field. From the definitions of the dimensions and measures you can then build a cube.
For a given data model, all the combinations of its N dimensions number 2 to the power of N. In other words, an N-dimensional cube is composed of 1 N-dimensional subcube, N (N-1)-dimensional subcubes, N*(N-1)/2 (N-2)-dimensional subcubes, ..., N 1-dimensional subcubes, and 1 0-dimensional subcube. It is simply counting combinations.
For each combination of dimensions, the measures are aggregated, and the result of that computation is saved as a materialized view called a cuboid. All the cuboids over all dimension combinations, taken as a whole, are called the Cube.
For example, given dimensions A, B, and C, 2 to the power of 3 gives 8 combinations:
0-dimensional (0D): 1 cuboid (the grand total)
1-dimensional (1D): [A] [B] [C]
2-dimensional (2D): [AB] [AC] [BC]
3-dimensional (3D): [ABC]
The SQL that computes cuboid [A, B]:
select A, B, sum(amount) from table1
group by A, B
The results of the computation are stored as materialized views, and the collective name for all the cuboid materialized views is the Cube.
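To see where the 2 to the power of N figure comes from, here is a small Python sketch (using the dimension names A, B, C from the example above) that enumerates every cuboid as one subset of the dimension set:

from itertools import combinations

dimensions = ["A", "B", "C"]

# One cuboid per subset of dimensions: C(N,0) + C(N,1) + ... + C(N,N) = 2^N in total.
cuboids = [
    combo
    for k in range(len(dimensions) + 1)
    for combo in combinations(dimensions, k)
]

for c in cuboids:
    print(list(c) if c else "0-D cuboid (grand total)")
print("total cuboids:", len(cuboids))  # 2 ** 3 == 8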
 
Kylin technology architecture
The Apache Kylin system can be divided into two parts, offline build and online query; the architecture is shown below:

Let's look at the offline build part first. As the figure shows, the data source sits on the left; Kylin's current default data source is Hive, and it holds the data to be analyzed. Following the metadata definitions, the build engine extracts data from the data source and builds the Cube. The input data takes the form of relational tables conforming to the star model, and the main build technology is MapReduce. Once built, the Cube is saved in the storage engine on the right; Kylin's current default storage engine is HBase.
After the offline build completes, users can issue SQL queries from the query system at the top. Kylin provides a RESTful API and JDBC/ODBC interfaces for users to call. Whichever interface a query comes in through, the SQL eventually reaches the REST service layer and is handed to the query engine. The query engine parses the SQL, generates a logical execution plan based on the relational tables, translates it into a physical execution plan based on the Cube, and finally queries the pre-computed Cube to produce the result. The whole process never touches the original data source. If a user submits a query that was not predefined in Kylin, Kylin returns an error.
It is worth mentioning that the data source, the Cube storage, and the execution engine are three core modules abstracted out of Kylin's architecture, which means all three can be freely extended or replaced.
 
Kylin core modules
 
REST Server
The REST Server is the entry point for application development. Applications can use it to issue queries, fetch results, trigger cube build jobs, and access metadata and user permissions, among other things. SQL queries can also be issued through the RESTful interface.
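As a hedged sketch of querying through that RESTful interface: the host, project, credentials, and sample table below are placeholders, and the exact endpoint and payload fields can differ between Kylin versions, so check the REST API documentation of your release. In Python with the requests library it might look like this:

import requests

KYLIN = "http://localhost:7070/kylin/api"   # placeholder Kylin host
AUTH = ("ADMIN", "KYLIN")                   # Kylin's default account; change in production

payload = {
    "sql": "select part_dt, sum(price) from kylin_sales group by part_dt",  # sample-data table
    "project": "learn_kylin",               # the sample project shipped with Kylin
    "offset": 0,
    "limit": 100,
}

# POST /kylin/api/query submits SQL; the answer comes from the pre-built cube, not from Hive.
resp = requests.post(f"{KYLIN}/query", json=payload, auth=AUTH, timeout=60)
resp.raise_for_status()
print(resp.json().get("results", [])[:5])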
 
Query Engine
Once a cube is ready, the query engine can receive and parse user queries. It then interacts with the other components in the system to return the corresponding results to the user.
 
Routing
Routing is responsible for converting the execution plan generated from the parsed SQL into queries against the cube cache; the cube is pre-computed and cached in HBase.
 
Metadata management tool
Kylin is a metadata-driven application. The metadata management tool is a key component that manages all the metadata stored in Kylin, including the all-important cube metadata. All the other components rely on the metadata management tool to function properly. Kylin's metadata is stored in HBase.
 
Task engine (Cube Build Engine)
This engine handles all offline tasks, including shell scripts, Java APIs, MapReduce jobs, and so on. The task engine manages and coordinates all of Kylin's tasks, making sure every task gets executed and that problems arising along the way are dealt with.
 
Three ways to build a Kylin Cube
Now let's talk about Cube building. Kylin offers three ways to build a Cube: full build, incremental build, and streaming build. The simplest is the full build, which rebuilds the entire Hive table on every build. Full builds are not commonly used in the real world, though, because in most business scenarios the fact data keeps growing, so the most common approach is actually the incremental build.
Here is an example where a full build is needed. Some of the Lianjia-related data must be built with a full build, such as deal-related performance figures, because a deal cycle runs about two months and a deal may be modified along the way, in which case the performance figures change too. Only a full build works there: the DM table holds the full year's performance. The range of dimensions is kept small, so the pressure is manageable. Overall this is handled by controlling the DM table and keeping only one fresh segment in Kylin.
With incremental builds, each build only processes the newly added data in the Hive table rather than all of it, which greatly reduces the build cost. Kylin divides the Cube into multiple segments, each identified by a start time and an end time.
How an incremental build differs from a full build:
1. When creating the model, you need to specify the Partition Date Column, the date field used to split the Cube into segments.
2. When creating the Cube, you need to specify the Partition Start Date, i.e. the start time of the Cube's first segment.
See these articles:
Kylin Cube Creation
Kylin Cube Build and Job Monitoring
Incremental builds solve the problem of continuously growing business data, but they cannot meet the demand for near-real-time results within minutes, because incremental builds use Hive as the data source, and data reaches Hive through scheduled ETL (for example, once a day). The timeliness of data matters a great deal to its value, so Kylin offers the streaming build as a solution.
The streaming build uses Kafka as the data source; the build engine periodically pulls data from Kafka to build. This design is very similar to the micro-batch idea of Spark Streaming. Note that the streaming build has been available since version 1.6.
Kylin provides:
- Cube building through the Web UI and a RESTful API
- data querying through the Web UI, the RESTful API, and JDBC/ODBC interfaces
Users can choose the build and query methods that suit their own situation.
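To illustrate the RESTful build path mentioned above, here is a hedged sketch that triggers one segment build for a hypothetical cube. The host, cube name, and credentials are placeholders, and the path is /build in older releases and /rebuild in newer ones, so verify against your version's REST API documentation.

from datetime import datetime, timezone
import requests

KYLIN = "http://localhost:7070/kylin/api"   # placeholder Kylin host
AUTH = ("ADMIN", "KYLIN")                   # default account, placeholder only
CUBE = "kylin_sales_cube"                   # hypothetical cube name

# Segment boundaries are epoch milliseconds on the Partition Date Column.
start_ms = int(datetime(2019, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
end_ms   = int(datetime(2019, 1, 2, tzinfo=timezone.utc).timestamp() * 1000)

payload = {"startTime": start_ms, "endTime": end_ms, "buildType": "BUILD"}

# PUT /kylin/api/cubes/{cube}/build (or /rebuild) submits an incremental build job.
resp = requests.put(f"{KYLIN}/cubes/{CUBE}/build", json=payload, auth=AUTH, timeout=60)
resp.raise_for_status()
print(resp.json())  # the returned job instance can be polled for status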
 
Reference blogs
Basic introduction to Kylin
http://cxy7.com/articles/2018/06/09/1528544157772.html
https://www.jianshu.com/p/abd5e90ab051
http://www.liuhaihua.cn/archives/451581.html
The difference between a traditional data warehouse and a big data platform
https://blog.csdn.net/Gospelanswer/article/details/78208761
https://support.huaweicloud.com/dws_faq/dws_03_0005.html
The Inmon vs. Kimball data warehouse architecture debate
https://blog.csdn.net/paicMis/article/details/53236869
https://blog.csdn.net/yanshu2012/article/details/55254300
Introduction to OLTP and OLAP
https://www.cnblogs.com/hhandbibi/p/7118740.html
 
