Apache Kylin from Entry to Proficiency: A Case Practice Series

1. Basic knowledge of Kylin

1.1. Understand the basic concepts, principles and architecture of Kylin

1.1.1. Kylin definition

Apache Kylin is an open-source distributed analysis engine that provides a SQL query interface and multidimensional analysis (OLAP) capability on Hadoop/Spark for ultra-large-scale data. It was originally developed by eBay Inc. and contributed to the open-source community, and it can query huge Hive tables at sub-second latency.

Built on Hadoop and HBase, Kylin supports querying and analyzing ultra-large-scale data, with the advantages of low latency, high concurrency, and high scalability.

1.1.2. Kylin architecture

(Figure: Kylin architecture)

  1. REST Server: the entry point for applications built on the Kylin platform. Such applications can submit queries, fetch results, trigger cube build jobs, retrieve metadata, manage user permissions, and so on. SQL queries can also be issued directly through the RESTful interface.
  2. Query Engine: once a cube is ready, the query engine receives and parses user queries. It then interacts with the other components in the system and returns the corresponding results to the user.
  3. Router (Routing): the initial design considered routing queries that Kylin could not execute to Hive for execution, but practice showed that the speed gap between Hive and Kylin was so large that users could not form consistent expectations about query latency: most queries returned within seconds, while some had to wait minutes to tens of minutes, which made for a very poor experience. This routing feature is therefore turned off by default in recent releases.
  4. Metadata management tool (Metadata): Kylin is a metadata-driven application. The metadata management tool is the key component that manages all metadata stored in Kylin, the most important being the cube metadata. All other components depend on it to operate normally. Kylin's metadata is stored in HBase.
  5. Task Engine (Cube Build Engine): this engine handles all offline tasks, including shell scripts, Java APIs, and MapReduce jobs. It manages and coordinates all tasks in Kylin, ensuring that each task executes effectively and resolving failures that occur along the way.

1.2. Be familiar with the main features and advantages of Kylin

The main features of Kylin include support for SQL interface, support for ultra-large-scale data sets, sub-second response, scalability, high throughput, BI tool integration, etc.

  • Standard SQL interface: Kylin uses standard SQL as its external service interface (see the query sketch after this list).
  • Support for very large datasets: Kylin's capacity for big data is among the strongest of current technologies. As early as 2015, eBay's production environment supported second-level queries over tens of billions of records, and second-level queries over hundreds of billions of records were later reported in mobile application scenarios.
  • Sub-second response: Kylin's excellent query latency comes from precomputation. Complex operations such as joins and aggregations are completed during offline precomputation, which greatly reduces the computation needed at query time and thus improves response speed.
  • Scalability and high throughput: a single Kylin node can serve about 70 queries per second, and Kylin can also be deployed as a cluster.
  • BI tool integration: Kylin can be integrated with existing BI tools, including the following.
    • ODBC: integration with Tableau, Excel, Power BI, and similar tools
    • JDBC: integration with Java tools such as Saiku and BIRT
    • REST API: integration with JavaScript and web pages
    • The Kylin development team also contributed a Zeppelin plugin, so Zeppelin can be used to access Kylin services.
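
As a minimal sketch of the standard SQL interface (the first bullet above), here is an ordinary aggregate query; Kylin serves it from the precomputed cube rather than by scanning the raw Hive table. The emp table is the one used in the JDBC example later in this article:

-- a plain ANSI SQL aggregate query, answered from the cube
select deptno, sum(sal) as total_sal
from emp
group by deptno;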

1.3. Master the installation and configuration methods of Kylin

1.3.1. Kylin environment dependencies

Before installing Kylin, Hadoop, Hive, ZooKeeper, and HBase must already be deployed, and the environment variables HADOOP_HOME, HIVE_HOME, and HBASE_HOME must be configured in /etc/profile. Remember to run source /etc/profile to make them take effect.

1.3.2. Kylin installation

  1. Upload the Kylin installation package apache-kylin-3.0.2-bin.tar.gz.
  2. Decompress the package: tar -zxvf apache-kylin-3.0.2-bin.tar.gz -C /opt/model/
  3. Resolve Kylin compatibility issues: modify /opt/model/kylin-3.0.2/bin/find-spark-dependency.sh and exclude the conflicting jar packages by adding ! -name '*jackson*' ! -name '*metastore*' to the find expression.

1.3.3. Kylin startup

  1. Before starting Kylin, start Hadoop (HDFS, YARN, JobHistoryServer), ZooKeeper, HBase, and Hive.
  2. Start Kylin:
[song@hadoop102 kylin]$ bin/kylin.sh start
  3. Open the web UI at http://hadoop102:7070/kylin; the user name is ADMIN and the password is KYLIN.
  4. Stop the Kylin service:
[song@hadoop102 kylin]$ bin/kylin.sh stop

2. Kylin data modeling and management

2.1. Related terms

2.1.1. OLAP

OLAP is the abbreviation of On-Line Analytical Processing. It is an analysis technology and toolset based on the multidimensional data model, used for fast querying, aggregation, and computation over large amounts of data to support decision-making in business, finance, sales, and other domains.

Key features of OLAP technology include:

  • Multidimensionality: the OLAP data model is based on a multidimensional data space that can relate several different dimensions at once, such as time, location, product, and user, to support multidimensional data analysis and exploration.
  • Speed: the OLAP engine uses techniques such as precomputation and data caching to return complex multidimensional analysis results at second-level latency, improving analysis efficiency and decision-making speed.
  • Flexibility: OLAP tools provide flexible query and analysis operations, such as slicing, drilling, pivoting, filtering, sorting, and calculation (see the SQL sketch after this list), helping users dig deeply into the data and discover latent business value and opportunities.
  • Visualization: OLAP tools can present analysis results as charts, reports, and dashboards, making the results easier to understand and use.
  • Business orientation: OLAP technology emphasizes support for business needs and can be integrated with an enterprise's BI systems and data warehouses, providing repeatable, standardized analysis processes for decision support.
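
To make the flexibility point concrete, here is a minimal SQL sketch of slicing and drilling down, assuming a hypothetical sales table with year, month, location, and product columns and an amount measure:

-- Slice: fix the time dimension to one value and analyze the rest
select location, product, sum(amount)
from sales
where year = 2020
group by location, product;

-- Drill down: move from the coarser year level to the finer month level
select month, sum(amount)
from sales
where year = 2020
group by month;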

The specific implementation methods of OLAP are as follows:

  • OLAP based on multidimensional arrays: the earliest OLAP technology, which stores data in multidimensional arrays and performs slicing, drilling, and pivoting along dimensions. Its advantage is fast queries; its disadvantages are inflexibility and difficulty handling denormalized data.

  • OLAP based on relational databases: implements OLAP by extending and optimizing a relational database, usually organizing the tables in a star or snowflake schema to support multidimensional analysis and queries. Its advantage is high flexibility and the ability to handle denormalized data; its disadvantage is slower queries.

  • OLAP based on MOLAP: implements OLAP by storing data on a dedicated MOLAP server, which typically uses techniques such as compression and indexing to support efficient querying and analysis. Its advantage is fast queries; its disadvantage is the need for additional hardware and software.

  • OLAP based on ROLAP: implements OLAP on a relational database by translating OLAP queries and calculations into SQL statements that run against the fact and dimension tables. Its advantage is high flexibility; its disadvantage is slower queries.

  • OLAP based on HOLAP: a hybrid of MOLAP and ROLAP that combines MOLAP's query speed and processing power with ROLAP's flexibility and scalability, adaptively choosing MOLAP or ROLAP techniques according to the data.

2.1.2. Dimensions and measures

  • Dimension: the angle from which data is observed. For example, employee data can be analyzed by gender, or in more detail by hire date or region. A dimension is a discrete set of values, such as male and female for the gender dimension, or each individual date for a time dimension. During statistics, records with the same dimension value can therefore be grouped together, and aggregate functions such as sum, average, maximum, and minimum can be applied.
  • Measure: the statistical value being aggregated (observed), that is, the result of the aggregation operation. For example, the number of employees of each gender, or how many employees joined the company in a given year.
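
In SQL terms, dimensions are the group-by columns and measures are the aggregate expressions computed over them. A minimal sketch, assuming a hypothetical employee table:

-- gender is the dimension; count(*) is the measure over it
select gender, count(*) as emp_count
from employee
group by gender;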

2.1.3. Cubes and cuboids

A cube is a physical representation of multidimensional data, consisting of multiple dimensions and measures; it can be implemented with multidimensional arrays or relational database tables. A cube is usually built from one fact table and several dimension tables: the dimension tables contain the categories used for analysis, such as time, geographic location, and product, while the fact table contains the measures related to those categories, such as sales, profit, and quantity. Multidimensional analysis and querying are carried out by performing operations such as slice, drill down, and pivot on the cube.

A cuboid is a subset of a cube that contains only part of the dimension and measure information and can be used for analysis and queries about specific questions. For example, if a sales cube is analyzed along the three dimensions "time", "product", and "location", cuboids at three different levels can be constructed, as follows:

  • 3D cuboid: contains all "time", "product", and "location" data
  • 2D cuboid: contains data for any two dimensions, such as "time" and "product", "time" and "location", or "product" and "location"
  • 1D cuboid: contains data for a single dimension, such as "time", "product", or "location"

By performing operations such as slicing, drilling, and rotating on Cuboid, data query and analysis of specific dimensions can be realized, so as to achieve more flexible and efficient multi-dimensional data analysis and query.

To give a simple example, suppose there is an e-commerce sales dataset whose dimensions are time [time], product [item], region [location], and supplier [supplier], and whose measure is sales. All dimension combinations then number 2^4 = 16, as shown in the figure below:

(Figure: the 16 cuboids formed by the 4 dimensions)

  • One-dimensional (1D) combinations include: [time], [item], [location] and [supplier];
  • Two-dimensional (2D) combinations include: [time, item], [time, location], [time, supplier], [item, location], [item, supplier], [location, supplier];
  • Three-dimensional (3D) combinations also number 4;
  • Finally, there is one zero-dimensional (0D) combination and one four-dimensional (4D) combination, giving 16 combinations in total.

Each combination of dimensions is a Cuboid, and 16 Cuboids as a whole are a Cube.
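
Each cuboid corresponds to one group-by combination. A sketch of the mapping, assuming a hypothetical sales_fact table for the dataset above:

-- 2D cuboid [time, item]
select time, item, sum(sales) from sales_fact group by time, item;

-- 1D cuboid [location]
select location, sum(sales) from sales_fact group by location;

-- 0D cuboid: the grand total
select sum(sales) from sales_fact;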

2.2. Understand Kylin's data modeling method and OLAP analysis principles

Kylin's data modeling method is based on the Cube (multidimensional cube) concept.

A cube can be seen as a multidimensional array that contains all the data (measures) together with their attributes (dimensions). Each cube corresponds to a family of SQL query templates over the OLAP data. In essence, Kylin works as a MOLAP (Multidimensional Online Analytical Processing) cube, i.e., multidimensional cube analysis.

Kylin's cubes can be built from different data sources, with support for Hive, HBase, Kafka, and others, and support various measure types, such as SUM, AVG, MIN, MAX, COUNT, and TOPN; an example follows.
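
For instance, a TOPN measure precomputes top-N rankings. The query below shows the kind of pattern it accelerates; the table and column names here are hypothetical:

-- top 10 sellers by total sales, answered from the TOPN measure
select seller_id, sum(sales) as total_sales
from sales_fact
group by seller_id
order by total_sales desc
limit 10;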

In cube data modeling, both dimension modeling and measure modeling are required.

  • Dimension modeling models the business dimensions, such as time, user, and product.
  • Measure modeling models the business indicators, such as sales volume and profit.

In Cube, each dimension has a corresponding hierarchy, and each hierarchy corresponds to one or more dimension columns, which are used to aggregate and group OLAP data.

The principle of OLAP analysis mainly includes three aspects: dimension, measure and aggregation.

  • Dimension is the basis of OLAP analysis, it is used to describe the attributes of data, such as time, place, user and so on.
  • Measures are the results of calculations in OLAP analysis, such as sales, costs, profits, and so on.
  • Aggregation is the core operation in OLAP analysis, which is used to summarize and aggregate measurement data to obtain higher-level analysis results.

Kylin's OLAP analysis relies on the cube's multidimensional precomputation and on distributed computing. During a query, Kylin first locates the required data fragments in the cube according to the dimensions and measures selected by the user, then performs parallel computation and aggregation on the cluster, and finally returns the results to the user. In this way, Kylin supports fast query and analysis over ultra-large-scale data, greatly reducing analysis time and cost, with good scalability and ease of use.

2.3. Kylin use case

2.3.1. Requirement

Use dwd_order_detail in the gmall data warehouse as the fact table and dim_user_info, dim_sku_info, and dim_base_province as the dimension tables; build a star schema and demonstrate how to use Kylin for OLAP analysis. A sketch of the star join behind this model is shown below.
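
The model to be built corresponds to the following star-schema join. This is only a sketch: the join keys (user_id, sku_id, province_id) and the dimension columns gender and name are assumed from the gmall schema and may differ in your warehouse (category1_name does appear in the view created below):

-- fact table joined to its dimension tables (star schema)
select
    od.sku_id,
    od.user_id,
    od.province_id,
    s.category1_name,
    u.gender,
    p.name as province_name
from dwd_order_detail od
left join dim_sku_info s on od.sku_id = s.id
left join dim_user_info u on od.user_id = u.id
left join dim_base_province p on od.province_id = p.id;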

2.3.2. Create a project

  1. Click the "+" button on the project page.
  2. Fill in the project name and description, and click the Submit button.

2.3.3. Load the data source

  1. Click DataSource.
  2. Click the Load Table button to import Hive tables.
  3. Select the tables below and click the Sync button:
dwd_order_detail
dim_sku_info
dim_user_info
dim_base_province

Note: Kylin cannot handle complex data types (Array, Map, Struct) in Hive tables, even if the complex-type fields are not involved in the computation. Therefore, tables with complex-type fields cannot be loaded directly as data sources. The dim_sku_info table has two complex-type fields (platform attributes and sales attributes), so it cannot be loaded directly and requires the following workaround.

  • Create a view on the Hive client as follows. This view removes the complex-type fields from the dim_sku_info table. In all subsequent computation, dim_sku_info_view is used instead of dim_sku_info.
hive (gmall)>
create view dim_sku_info_view
as
select
    id,
    price,
    sku_name,
    sku_desc,
    weight,
    is_sale,
    spu_id,
    spu_name,
    category3_id,
    category3_name,
    category2_id,
    category2_name,
    category1_id,
    category1_name,
    tm_id,
    tm_name,
    create_time
from dim_sku_info;
  • Re-import the dim_sku_info_view view in Kylin's DataSource.

2.3.4. Create a model

  1. Click Models, click the "+New" button, then click "★New Model".
  2. Fill in the model information and click Next.
  3. Specify the fact table.
  4. Select the dimension tables, specify the join conditions between the fact table and each dimension table, and click OK.
  5. After the dimension tables are added, click Next.
  6. Specify the dimension fields and click Next.
  7. Specify the measure fields and click Next.
  8. Specify the fact table's partition field (only time partitions are supported) and click the Save button. The model is now created.

2.3.5. Build the cube

  1. Click New, then New Cube.
  2. Fill in the cube information, select the model this cube depends on, and click Next.
  3. Select the desired dimensions.
  4. Select the desired measures.
  5. Configure automatic cube merging. The cube is built daily by the date partition field, and each build result is saved as a table in HBase. To keep queries efficient, the daily segments need to be merged periodically; the merge cycle is set here.
  6. Configure the aggregation-group optimizations.
  7. Override Kylin property configuration where needed.
  8. Review the cube overview and click Save. The cube is now created.
  9. Build the cube (run the computation): click the Action button of the cube and select Build.
  10. Select the time range to build and click Submit.
  11. Click Monitor to view the build progress.

2.3.6. Problems encountered

How should the duplicate-key problem of daily-full dimension tables and zipper dimension tables be handled? Following the process above, a duplicate-key error occurs during cube construction.

Cause analysis: the dimension table dim_user_info in the model is a zipper table, and dim_sku_info (dim_sku_info_view) is a daily-full table. If such a table is used as a dimension table in its entirety, a single user_id or sku_id in the order detail table inevitably matches multiple dimension rows. The problem can be solved as follows.

Solution: on the Hive client, create views over the zipper table and the daily-full dimension table, filtering the data during view creation so that each view returns exactly one complete, up-to-date row per key.

  1. Create the dimension table views
-- View over the zipper dimension table: keep only the current records
create view dim_user_info_view as select * from dim_user_info where dt='9999-99-99';

-- View over the daily-full dimension table (note: exclude the complex-type fields)
create view dim_sku_info_view
as
select
    id,
    price,
    sku_name,
    sku_desc,
    weight,
    is_sale,
    spu_id,
    spu_name,
    category3_id,
    category3_name,
    category2_id,
    category2_name,
    category1_id,
    category1_name,
    tm_id,
    tm_name,
    create_time
from dim_sku_info
where dt=date_add(current_date,-1);

-- For the current case we first need a view as of 2020-06-15. Since dim_sku_info_view
-- was already created above, there is no need to recreate it; just alter the existing view.
alter view dim_sku_info_view
as
select
    id,
    price,
    sku_name,
    sku_desc,
    weight,
    is_sale,
    spu_id,
    spu_name,
    category3_id,
    category3_name,
    category2_id,
    category2_name,
    category1_id,
    category1_name,
    tm_id,
    tm_name,
    create_time
from dim_sku_info
where dt='2020-06-15';
  2. Import the newly created views in DataSource; the old dimension tables can optionally be deleted.
  3. Recreate the model and the cube.

2.3.7. Automatically build the cube every day

Kylin provides a RESTful API, so the cube build command can be written into a script and handed to a scheduling tool such as Azkaban or Oozie for daily scheduled builds. In the script below, the Authorization header value QURNSU46S1lMSU4= is the Base64 encoding of ADMIN:KYLIN.

#!/bin/bash
cube_name=order_cube
do_date=`date -d '-1 day' +%F`

#Get the 00:00 timestamp (08:00 local time in UTC+8 corresponds to 00:00 UTC, which Kylin uses)
start_date_unix=`date -d "$do_date 08:00:00" +%s`
start_date=$(($start_date_unix*1000))

#Get the 24:00 timestamp (start of day plus 86,400,000 ms)
stop_date=$(($start_date+86400000))

curl -X PUT -H "Authorization: Basic QURNSU46S1lMSU4=" -H 'Content-Type: application/json' -d '{"startTime":'$start_date', "endTime":'$stop_date', "buildType":"BUILD"}' http://hadoop102:7070/kylin/api/cubes/$cube_name/build

3. Cube construction algorithm

3.1. Cube Construction Algorithm

3.1.1. Layer-by-layer construction algorithm (layer)

(Figure: layer-by-layer cubing)

An N-dimensional cube consists of one N-dimensional sub-cube, N (N-1)-dimensional sub-cubes, N*(N-1)/2 (N-2)-dimensional sub-cubes, ..., N one-dimensional sub-cubes, and one 0-dimensional sub-cube: 2^N sub-cubes in total.
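
In formula form, this count is just the binomial expansion; for the 4-dimensional example in section 2.1.3 it gives 1 + 4 + 6 + 4 + 1 = 16:

$$\sum_{k=0}^{N} \binom{N}{k} = 2^N$$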

In the layer-by-layer algorithm, the cuboids are computed by reducing the dimensionality one layer at a time; each layer's computation (except the first, which aggregates from the raw data) is based on the previous layer's results, going from the highest-dimensional cuboid down to the lowest. For example, the result of [Group by A, B] can be aggregated from the result of [Group by A, B, C] by removing C, which avoids repeated computation. When the 0-dimensional cuboid has been computed, the computation of the whole cube is complete.

Each round of computation is one MapReduce job, and the rounds run serially; an N-dimensional cube requires at least N rounds of MapReduce jobs.
Algorithm advantages:

  1. This algorithm makes full use of MapReduce, which handles the complicated intermediate sorting and shuffling, so the algorithm code is clear, simple, and easy to maintain;
  2. Benefiting from Hadoop's maturity, this algorithm is very stable; even when cluster resources are tight, it can be expected to finish.

Algorithm disadvantages:

  1. When the cube has many dimensions, the number of MapReduce jobs grows accordingly; since Hadoop task scheduling itself costs extra resources, the overhead of repeatedly submitting jobs becomes considerable, especially on large clusters;
  2. Since the Mapper logic performs no aggregation, each MR round shuffles a heavy workload, resulting in low efficiency;
  3. There are many HDFS reads and writes: each layer's output serves as the next layer's input, so these key-value pairs must be written to HDFS; when all computation is done, Kylin needs an extra round of tasks to convert these files into HBase HFile format for bulk-loading into HBase.

Overall, the efficiency of this algorithm is low, especially when the cube has many dimensions.

3.1.2. Fast construction algorithm (inmem)


Also known as the "by segment" or "by split" algorithm, this algorithm was introduced in Kylin 1.5.x. Its main idea is that each Mapper computes the data block assigned to it into a complete small cube segment (containing all cuboids); each Mapper then outputs its computed segment to a Reducer, which merges the segments into one large cube, the final result. The figure below illustrates the flow.

(Figure: fast cubing: each Mapper builds a partial cube segment, and the Reducer merges the segments into the final cube)
Compared with the old algorithm, the fast algorithm differs in two main ways:

  1. The Mapper pre-aggregates in memory, computing all cuboid combinations; every key a Mapper outputs is distinct, which reduces the amount of data shuffled to MapReduce, and a Combiner is no longer needed;
  2. A single MapReduce round completes all layers of computation, minimizing Hadoop job scheduling overhead.

4. Kylin Cube construction optimization

4.1. Using derived dimensions (derived dimension)

Derived dimensions exclude the non-primary-key columns of a dimension table from the effective dimensions and replace them with the dimension table's primary key (in practice, the corresponding foreign key on the fact table). Kylin records the mapping between the dimension table's primary key and its other columns, so that at query time it can dynamically "translate" the primary key into those non-primary-key dimensions and aggregate in real time.

Although derived dimensions are very attractive, this does not mean that every column on a dimension table should become a derived dimension. If the aggregation from the dimension table's primary key to a given column is very expensive, a derived dimension is not recommended. A sketch of how derived dimensions behave follows.
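
A sketch of the idea using the tables from the case above; the measure column order_amount is hypothetical. Suppose category1_name is modeled as a derived dimension of the sku dimension:

-- The user groups by a derived dimension...
select s.category1_name, sum(od.order_amount)
from dwd_order_detail od
join dim_sku_info_view s on od.sku_id = s.id
group by s.category1_name;

-- ...but the cube only materializes the cuboid keyed by sku_id.
-- At query time Kylin maps each sku_id to its category1_name and
-- re-aggregates the per-sku partial results.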

4.2. Using aggregation groups (Aggregation group)

Aggregation groups are a powerful pruning tool. The assumption is that all dimensions of a cube can be divided into several groups according to business requirements (possibly just one group); dimensions in the same group are more likely to be used together in the same query and are thus more intrinsically related. Each group's dimension set is a subset of all the cube's dimensions; different groups may or may not share dimensions with one another. Each group independently contributes, by its own rules, a batch of cuboids that need to be materialized, and the union over all groups is the set of cuboids materialized for the current cube. Different groups may contribute the same cuboid; the build engine detects this and ensures each cuboid is materialized only once, no matter how many groups it appears in.

For the dimensions within each group, users can choose among the following three optional definitions.

4.2.1. Mandatory dimension (Mandatory)

If a dimension is defined as a mandatory dimension, then every cuboid generated by this group will contain it. Each group can have zero, one, or more mandatory dimensions. If, according to the group's business logic, related queries always carry a dimension in their filter or group-by conditions, that dimension can be set as mandatory in this group.

4.2.2. Hierarchy dimension (Hierarchy)

Each hierarchy contains two or more dimensions. If a hierarchy contains the n dimensions D1, D2, ..., Dn, then in any cuboid generated by this group, these n dimensions appear only as one of the n+1 prefix forms (), (D1), (D1, D2), ..., (D1, D2, ..., Dn). Each group can have zero, one, or more hierarchies, and different hierarchies must not share dimensions. If, according to the group's business logic, several dimensions stand in a hierarchical relationship (for example country, province, and city), they can be set as a hierarchy in this group.

4.2.3. Joint dimension (Joint)

Each joint contains two or more dimensions; if some columns form a joint, then in any cuboid generated by this group, these dimensions either all appear or all do not appear. Each group can have zero or more joints, but different joints must not share dimensions (otherwise they could be merged into a single joint). If, according to the group's business logic, several dimensions always appear together in queries, they can be set as a joint dimension in this group; a cuboid-count formula follows.
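From the three rules above, the number of cuboids produced by one aggregation group can be derived (a sketch: a mandatory dimension contributes no factor, a hierarchy of $n_i$ dimensions contributes a factor of $n_i + 1$, and each joint or ordinary dimension contributes a factor of 2):

$$N_{\text{cuboid}} = 2^{\,\text{ordinary dims} + \text{joints}} \times \prod_i (n_i + 1)$$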
These settings are made in the Aggregation Groups area of the Cube Designer's Advanced Setting page.
The design of aggregation groups is very flexible and can even be used to describe some extreme designs.

Assuming the business requirement is very simple and only a few specific cuboids are needed, multiple aggregation groups can be created, each representing exactly one cuboid. The method is to include in an aggregation group only the dimensions that the desired cuboid needs, and then mark all of them as mandatory; that group then produces only the cuboid we want.

As another example, a cube sometimes has a dimension with very high cardinality. Without special treatment, it combines with the other dimensions and generates a large number of cuboids containing it; cuboids containing a high-cardinality dimension tend to be very large in both row count and volume, inflating the expansion rate of the whole cube. If business requirements say this high-cardinality dimension is only ever queried together with a few specific dimensions rather than all of them, it can be "isolated" through an aggregation group: put it in its own group together with only the dimensions it may be queried with. The dimensions never queried with it then never share a group with it, so no redundant cuboids are generated; the number of cuboids containing the high-cardinality dimension drops sharply, effectively controlling the cube's expansion rate.

4.3. Row Key Optimization

Kylin concatenates all the dimensions, in order, into a complete rowkey and sorts all rows of a cuboid in ascending rowkey order. A well-designed rowkey completes query filtering and row location more effectively, reduces I/O, and improves query speed; the order of dimensions within the rowkey has a significant impact on query performance.

The rowkey design principles are as follows:

  1. Dimensions used for filtering come first (see the sketch after this list).
  2. High-cardinality dimensions come before low-cardinality dimensions.
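
A sketch of why filter dimensions belong first, using the cube from the case above (order_amount is a hypothetical measure column): when dt leads the rowkey, the filter below maps to one narrow, contiguous rowkey range scan in HBase, whereas if dt came last, the matching rows would be scattered across the whole cuboid:

-- served by a narrow rowkey range scan when dt is the leading dimension
select province_id, sum(order_amount)
from dwd_order_detail
where dt = '2020-06-15'
group by province_id;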

5. Kylin BI tool integration

5.1. JDBC

  1. Create a new project and import dependencies
    <dependencies>
        <dependency>
            <groupId>org.apache.kylin</groupId>
            <artifactId>kylin-jdbc</artifactId>
            <version>3.0.2</version>
        </dependency>
    </dependencies>
  2. Write the code:
package com.song;

import java.sql.*;

public class TestKylin {

    public static void main(String[] args) throws Exception {

        // Kylin JDBC driver
        String KYLIN_DRIVER = "org.apache.kylin.jdbc.Driver";

        // Kylin URL
        String KYLIN_URL = "jdbc:kylin://hadoop102:7070/FirstProject";

        // Kylin user name
        String KYLIN_USER = "ADMIN";

        // Kylin password
        String KYLIN_PASSWD = "KYLIN";

        // Register the driver
        Class.forName(KYLIN_DRIVER);

        // Open a connection
        Connection connection = DriverManager.getConnection(KYLIN_URL, KYLIN_USER, KYLIN_PASSWD);

        // Prepare the SQL statement
        PreparedStatement ps = connection.prepareStatement("SELECT sum(sal) FROM emp GROUP BY deptno");

        // Execute the query
        ResultSet resultSet = ps.executeQuery();

        // Iterate over and print the results
        while (resultSet.next()) {
            System.out.println(resultSet.getInt(1));
        }

        // Release resources
        resultSet.close();
        ps.close();
        connection.close();
    }
}
  3. The query results are printed to the console.

5.2. Zeppelin

5.2.1. Zeppelin installation and startup

  1. Upload zeppelin-0.8.0-bin-all.tgz to Linux.
  2. Decompress zeppelin-0.8.0-bin-all.tgz to /opt/model:
[song@hadoop102 sorfware]$ tar -zxvf zeppelin-0.8.0-bin-all.tgz -C /opt/model/
  3. Rename the directory:
[song@hadoop102 module]$ mv zeppelin-0.8.0-bin-all/ zeppelin
  4. Start it up:
[song@hadoop102 zeppelin]$ bin/zeppelin-daemon.sh start

You can then log in to the web UI at http://hadoop102:8080 (Zeppelin's default web port is 8080).

5.2.2. Configure Zeppelin to support Kylin

  1. Click "anonymous" in the upper right corner and select Interpreter.
  2. Search for the Kylin interpreter and modify its configuration.
  3. After the modification is complete, click Save.

  4. Case practice

Requirement: query employee details and display them with various charts.

  1. Click Notebook and create a new note.
  2. Fill in the Note Name and click Create.
  3. Execute the query, as sketched below.
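The query is entered in a note paragraph using the Kylin interpreter prefix. A minimal sketch, reusing the emp table from the JDBC example (the table and column names are assumptions about the demo data):

%kylin
select deptno, sum(sal) from emp group by deptno;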

  4. View the result display.
  5. Switch among the other available chart formats as desired.
