Local Data Warehouse Project (4) - Ad Hoc Query

1 Background

This article covers the ad hoc query layer of the local data warehouse project, focusing on three ad hoc query tools: Presto, Druid, and Kylin.
It builds on the earlier articles "Local Data Warehouse Project (1) - Detailed Process of Local Data Warehouse Construction", "Local Data Warehouse Project (2) - Detailed Process of Building System Business Data Warehouse", and "Local Data Warehouse Project (3) - Data Visualization and Task Scheduling".

2 Presto

2.1 Presto concept

Presto is an open-source distributed SQL query engine. It handles data volumes from GB to PB scale and is mainly used for interactive, second-level query scenarios.

2.2 Presto Architecture

(Architecture diagram not reproduced here: a Presto cluster consists of one coordinator node and multiple worker nodes; see step 7 of section 2.4.1.)

2.3 Advantages and Disadvantages of Presto

(Summary diagram not reproduced here.)

2.4 Presto installation

2.4.1 Presto Server installation

Official website address
https://prestodb.github.io/
Download address
https://repo1.maven.org/maven2/com/facebook/presto/presto-server/
1) Upload the installation package, decompress it, and rename the extracted directory:

tar -zxvf presto-server-0.196.tar.gz
mv presto-server-0.196 presto-server
2) Create the data and etc directories
[root@wavehouse-1 presto-server]# pwd
/root/soft/presto-server
[root@wavehouse-1 presto-server]# mkdir data
[root@wavehouse-1 presto-server]# mkdir etc
3) Create a jvm.config file in the etc directory and add the following content:
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
4) Presto supports multiple data sources, which it calls catalogs. Here we configure a catalog for Hive:

mkdir etc/catalog
vim etc/catalog/hive.properties

Add the following content to hive.properties:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://wavehouse-1:9083
5) Distribute the Presto installation package to every node of the cluster
6) After distribution, create a node.properties file in the etc directory of each node and add the following content. Note: node.id must be set to a different value on each node; a hexadecimal string is used here.
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/opt/module/presto/data
7) Presto consists of one coordinator node and multiple worker nodes. Configure the master node as the coordinator and the other nodes as workers.
vim etc/config.properties

Add the following content to the master node

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery-server.enabled=true
discovery.uri=http://wavehouse-1:8881

Add the following content on the other nodes

coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery.uri=http://wavehouse-1:8881

8) Start Hive Metastore

nohup bin/hive --service metastore >/dev/null 2>&1 &

9) Start Presto on every node where it is installed

# start in the foreground
bin/launcher run

or

# start in the background
bin/launcher start


2.4.2 Presto Command Line Client Installation

Download address:
https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/

1) Upload the downloaded presto-cli-xxxx-executable.jar to the Presto installation directory on the master node
2) Rename the jar to presto-cli and grant it execute permission
3) Add the jar that supports LZO compression
Since the data warehouse data is LZO-compressed, Presto must be able to read LZO-format data, so the LZO jar needs to be placed into Presto:
cp /root/soft/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar ./
4) Start the client:
./presto-cli --server wavehouse-1:8881 --catalog hive --schema default

5) Presto command-line operations
Presto command-line operations are similar to Hive command-line operations, except that every table must be referenced with its catalog and schema, for example:

select * from hive.gmall.ads_back_count limit 10;


2.4.3 Presto Visual Client Installation

1) Upload yanagishima-18.0.zip to the soft directory
2) Unzip it:

unzip yanagishima-18.0.zip

3) Enter the conf folder, create yanagishima.properties, and add the following content:
jetty.port=7080
presto.datasources=chen-presto
presto.coordinator.server.chen-presto=http://wavehouse-1:8881
catalog.chen-presto=hive
schema.chen-presto=default
sql.query.engines=presto

4) Start:

nohup bin/yanagishima-start.sh >y.log 2>&1 &
5) Visit http://wavehouse-1:7080 in a browser to use the visual query interface.

2.4.4 Efficiency comparison

Execute the same SQL on the Hive side and on the Presto side, and compare the elapsed time.

2.4.4.1 Test 1: count(*)

select count(*) from hive.gmall.dws_uv_detail_day

Hive uses the Tez engine. Ignoring the time it takes to start Tez for the first time, the Hive-on-Tez query takes 6.89 seconds.

The same query on Presto takes 0.99 seconds, a clear performance improvement and a second-level response.

2.4.4.2 Test 2: max(dt)

select max(dt) from hive.gmall.dws_uv_detail_day

The Hive query takes 4.65 seconds, while the Presto query takes 0.92 seconds, again a clear performance improvement and a second-level response.

Note: these tests run on a local virtual machine with only 4 GB of memory, so performance is limited. In a real production environment with 64 GB or more of memory, performance would be much better.

2.5 Presto optimization

2.5.1 Reasonably set partitions

Like Hive, Presto reads partitioned data based on metadata. Reasonable partitioning reduces the amount of data Presto has to read and improves query performance, for example:
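A minimal sketch, assuming the dws_uv_detail_day table from section 2.4.4 is partitioned by dt in Hive: filtering on the partition column lets Presto read only the matching partitions instead of scanning the whole table.

-- Only the dt='2023-01-04' partition is scanned, assuming dt is the
-- Hive partition column of this table.
select count(*)
from hive.gmall.dws_uv_detail_day
where dt = '2023-01-04';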

2.5.2 Using columnar storage

Presto has specific optimizations for reading ORC files, so tables that Presto will query should be stored in ORC format when they are created in Hive. Presto supports ORC better than Parquet. A sketch follows below.
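A minimal Hive DDL sketch (the table name and columns are hypothetical, not from the project):

-- Store the table as ORC so Presto can use its optimized ORC reader.
create table gmall.ads_demo_orc (
    dt       string,
    uv_count bigint
)
stored as orc;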

2.5.3 Using compression

Data compression reduces the I/O bandwidth pressure of transferring data between nodes. For ad hoc queries, which need fast decompression, Snappy compression is recommended; a sketch combining ORC and Snappy follows below.
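A sketch combining sections 2.5.2 and 2.5.3 (again with a hypothetical table name and columns):

-- ORC storage with Snappy compression, set through the standard
-- orc.compress table property.
create table gmall.ads_demo_orc_snappy (
    dt       string,
    uv_count bigint
)
stored as orc
tblproperties ("orc.compress" = "SNAPPY");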

3 Druid

3.1 Introduction to Druid

Druid is a fast, columnar, distributed data store that supports real-time analytics. Compared with traditional OLAP systems, it performs significantly better at PB-scale data, millisecond-level queries, and real-time data processing.

3.2 Druid features and application scenarios

① Columnar storage
② Scalable distributed system
③ Large-scale parallel processing
④ Real-time or batch ingestion
⑤ Self-healing, self-balancing, easy to operate
⑥ Effective pre-aggregation or pre-computation of data
⑦ Bitmap compression algorithms applied to result data

Application scenarios:
① Suitable for cleaned records ingested in real time, with no need for update operations
② Suitable for wide tables that do not require joins (i.e., a single table)
③ Suitable for rolling up basic statistical indicators that can be expressed as a single field (see the sketch after this list)
④ Suitable for applications with high real-time requirements
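As a hedged illustration of points ② and ③, the Druid SQL below does an hourly rollup over the single wide datasource topic_start that is loaded from Kafka in section 3.5.3 (the uid column is assumed from the query shown there); __time is Druid's built-in timestamp column.

-- Time-bucketed rollup on one wide table, no joins.
select
    floor(__time to hour) as event_hour,
    count(*)              as events,
    sum(uid)              as uid_sum
from "topic_start"
group by floor(__time to hour)
order by floor(__time to hour);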

3.3 Druid framework

(Druid framework diagram not reproduced here.)

3.4 Druid data structure

Complementing the Druid architecture is its data structure: data is organized into DataSources and Segments, which together underpin Druid's high performance.

3.5 Druid installation

3.5.1 Download the installation package

Download the latest version installation package from https://imply.io/get-started

3.5.2 Installation and deployment

1) Upload imply-2.7.10.tar.gz to the /opt/software directory of hadoop102, and decompress it

tar -zxvf imply-2.7.10.tar.gz

2) Modify the name of imply-2.7.10 to imply
3) Modify the configuration file
(1) Modify the ZK configuration of Druid

vim imply/conf/druid/_common/common.runtime.properties

(2) Modify the startup command parameters so that it skips verification and does not start the built-in ZooKeeper

vim imply/conf/supervise/quickstart.conf

4) Start
(1) Start Zookeeper

./zkServer.sh start

(2) Start Imply

bin/supervise  -c conf/supervise/quickstart.conf

3.5.3 Web page use

1) Visit wavehouse-1:9095 in a browser
2) Click Load data, then Apache Kafka, and configure the Kafka cluster and topic
3) Confirm the data sample format
4) Load the data; a time field is required
5) Select the item to be loaded
6) Specify the datasource (table) name
7) Confirm the configuration
8) Connect to Kafka's topic_start
9) Query the metrics with SQL

select sum(uid) from "topic_start"


4 Kylin

4.1 Introduction to Kylin

Apache Kylin is an open-source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) capabilities on top of Hadoop/Spark for extremely large datasets. Originally developed by eBay and contributed to the open-source community, it can query huge Hive tables at sub-second latency.

4.2 Kylin architecture

1) REST Server
The REST Server is a set of entry points for applications built on top of the Kylin platform. Such applications can submit queries, fetch results, trigger cube build jobs, retrieve metadata, manage user permissions, and so on. SQL queries can also be issued through the RESTful interface.
2) Query Engine
Once a cube is ready, the query engine receives and parses user queries, interacts with the other components in the system, and returns the corresponding results to the user.
3) Router (Routing)
The router was originally designed to forward queries that Kylin could not answer to Hive for execution. In practice, however, the speed gap between Hive and Kylin proved too large for users to form a consistent expectation of query latency: most queries would return within a few seconds, while some would take minutes or even tens of minutes, making for a very poor experience. For this reason, the routing feature is turned off by default in recent releases.
4) Metadata Management Tool (Metadata)
Kylin is a metadata-driven application. The metadata management tool is the key component that manages all metadata stored in Kylin, the most important being the cube metadata. All other components rely on the metadata management tool to function properly. Kylin's metadata is stored in HBase.
5) Task Engine (Cube Build Engine)
This engine handles all offline tasks, including shell scripts, Java API calls, and MapReduce jobs. The task engine manages and coordinates all tasks in Kylin to ensure that each task executes successfully and to handle any failures that occur along the way.

4.3 Features of Kylin

Kylin's main features include a standard SQL interface, support for extremely large datasets, sub-second response times, scalability, high throughput, and BI tool integration.
1) Standard SQL interface: Kylin uses standard SQL as its external service interface.
2) Support for extremely large datasets: Kylin's ability to support big data is arguably the most advanced available. As early as 2015, eBay's production environment supported second-level queries over tens of billions of records, and there have since been cases of second-level queries over hundreds of billions of records in mobile application scenarios.
3) Sub-second response: Kylin's excellent query response speed comes from precomputation. Many complex calculations, such as joins and aggregations, are completed during offline precomputation, which greatly reduces the amount of computation needed at query time and improves response speed.
4) Scalability and high throughput: a single Kylin node can achieve 70 queries per second, and Kylin clusters can also be built.
5) BI tool integration
Kylin can be integrated with existing BI tools, including the following.
ODBC: Integrate with tools such as Tableau, Excel, and PowerBI
JDBC: Integrate with Java tools such as Saiku and BIRT
RestAPI: Integrate with JavaScript and Web pages
The Kylin development team has also contributed a Zeppelin plug-in, so Zeppelin can be used to access Kylin services as well

4.4 Kylin installation

Before installing Kylin, Hadoop, Hive, ZooKeeper, and HBase must already be deployed, and the environment variables HADOOP_HOME, HIVE_HOME, and HBASE_HOME need to be configured in /etc/profile (remember to source the file so they take effect).
See the separate article for details on installing HBase.
1) Download the Kylin installation package
Download address: http://kylin.apache.org/cn/download/
2) Unzip apache-kylin-2.5.1-bin-hbase1x.tar.gz
3) Start
(1) Before starting Kylin, start Hadoop (HDFS, YARN, JobHistoryServer), ZooKeeper, and HBase
(2) Start Kylin

bin/kylin.sh start

Kylin has started successfully once its web page can be opened.
4) Visit the URL
View the web page at http://wavehouse-1:7070/kylin
User name: ADMIN, password: KYLIN (pre-filled by the system)

4.5 Kylin use

Using dwd_payment_info in the gmall data warehouse as the fact table and dwd_order_info_his and dwd_user_info as dimension tables, build a star schema and demonstrate how to use Kylin for OLAP analysis.

4.5.1 Create a project

1) Click the '+' button
2) Fill in the project name and description

4.5.2 Get data source

1) Select the Data Source tab
2) Select Import Table
3) Select the required tables and click the Sync button

4.5.3 Create a model

1) Click Models, click the "+New" button, and then click the "New Model" button
2) Fill in the model information and click Next
3) Specify the fact table
4) Add the dimension tables and specify the join conditions between the fact table and each dimension table, then click OK. After adding the dimension tables, click Next
5) Specify the dimension fields and click Next
6) Specify the measure fields and click Next
7) Specify the partition field of the fact table (only time partitioning is supported) and click the Save button; the model is now created

4.5.4 Build cubes

1) Click New, then click New Cube
2) Fill in the cube information, select the model the cube depends on, and click Next
3) Select the required dimensions
4) Select the required measures
5) Configure automatic cube merging. The cube is built every day according to the date partition field, and each build result is saved as a table in HBase. To improve query efficiency, the daily cubes need to be merged periodically; the merge cycle can be set here.
6) Kylin advanced configuration (optimization related, skipped for now)
7) Kylin property configuration overrides
8) Review the cube information overview and click Save; the cube is now created
9) Build the cube (i.e., run the computation): click the Action button for the cube and select Build
10) Select the time interval to build and click Submit
11) Click Monitor to view the build progress

4.5.6 Advanced usage

After executing the above process, an error is encountered.
Cause: the dimension table dwd_order_info_his in the model is a zipper (slowly changing) table, and dwd_user_info is a daily full-snapshot table, so using these whole tables as dimension tables inevitably means that the same order_id or user_id matches multiple rows. There are two solutions:
Solution 1: Create a temporary table for the dimension table in Hive that stores only the latest full snapshot of the dimension data, and select this temporary table as the dimension table when creating the model in Kylin (a sketch follows below).
Solution 2: Same idea as Solution 1, but instead of a physical temporary table, use a view to achieve the same effect.
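A minimal Hive sketch of Solution 1 (the table name dwd_user_info_tmp is hypothetical; the dt value follows the snapshot date used in section 4.5.7):

-- Keep only the latest full snapshot of the daily-full dimension table,
-- then select this table as the dimension table in the Kylin model.
create table dwd_user_info_tmp as
select * from dwd_user_info
where dt = '2023-01-04';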

4.5.7 Adopting Solution 2

(1) Create a dimension table view

CREATE VIEW dwd_user_info_view as select * from dwd_user_info
WHERE dt='2023-01-04';
CREATE VIEW dwd_order_info_view as select * from dwd_order_info
WHERE dt='2023-01-04';

(2) Import the newly created views in Data Source; the previous dimension tables can optionally be deleted.
(3) Recreate model and cube
(4) Wait for reconstruction
(5) Query results
Example 1:

select user_level,sum(TOTAL_AMOUNT) from DWD_PAYMENT_INFO t1 join DWD_USER_INFO_VIEW t2 on t1.USER_ID = t2.ID
group by user_level

The query returns in 0.15 seconds, a sub-second response.

Example 2: Add a gender dimension query

select user_level,gender,sum(TOTAL_AMOUNT) from DWD_PAYMENT_INFO t1 join DWD_USER_INFO_VIEW t2 on t1.USER_ID = t2.ID
group by user_level,gender

This query takes only 0.09 seconds, again a sub-second response.

4.5.8 Kylin BI tools

4.5.8.1 JDBC

Import the Maven dependency below into your project to develop against Kylin over JDBC; the details are not expanded here.

<dependencies>
    <dependency>
        <groupId>org.apache.kylin</groupId>
        <artifactId>kylin-jdbc</artifactId>
        <version>2.5.1</version>
    </dependency>
</dependencies>

4.5.8.2 Zeppelin

1) Zeppelin installation and startup
(1) Upload zeppelin-0.8.0-bin-all.tgz to Linux
(2) Unzip zeppelin-0.8.0-bin-all.tgz
(3) Modify the name
(4) Start

bin/zeppelin-daemon.sh start

You can then log in to the web page to verify; the default web port is 8080:
http://wavehouse-1:8080

2) Configure Zeppelin to support Kylin
(1) Click anonymous in the upper right corner and select Interpreter
(2) Search for the Kylin plug-in and modify the corresponding configuration
(3) Click Save to complete the modification
3) Create a new note
(1) Click "Create new note" on the Zeppelin home page
(2) Fill in the Note Name and click Create

(3) Enter SQL to query
(4) View query results

5 Summary

5.1 Ad hoc query comparison

Comparison of Druid, Impala, Presto, Elasticsearch, Kylin, and Spark SQL (comparison table not reproduced here).
The next article builds a data warehouse project based on CDH; see "CDH Data Warehouse Project (1) - Detailed Process of CDH Installation and Deployment" for details.
