1 Background
This article covers ad hoc querying for the local data warehouse project, focusing on three ad hoc query tools: Presto, Druid, and Kylin.
This article builds on "Local Data Warehouse Project (1) - Detailed Process of Local Data Warehouse Construction", "Local Data Warehouse Project (2) - Detailed Process of Building the System Business Data Warehouse", and "Local Data Warehouse Project (3) - Data Visualization and Task Scheduling".
2 Presto
2.1 Presto concept
Presto is an open-source distributed SQL query engine that handles data volumes ranging from GB to PB. It is mainly used for interactive, second-level query scenarios.
2.2 Presto Architecture
2.3 Advantages and disadvantages of Presto
2.4 Presto installation
2.4.1 Presto Server installation
Official website address
https://prestodb.github.io/
Download address
https://repo1.maven.org/maven2/com/facebook/presto/presto-server/
1) Upload the installation package, decompress it, and rename the extracted directory
tar -zxvf presto-server-0.196.tar.gz
mv presto-server-0.196 presto-server
2) Create the data and etc directories
[root@wavehouse-1 presto-server]# pwd
/root/soft/presto-server
[root@wavehouse-1 presto-server]# mkdir data
[root@wavehouse-1 presto-server]# mkdir etc
3) Create a jvm.config file in the etc directory and add the following content:
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
4) Presto supports multiple data sources, which it calls catalogs. Here we configure a catalog for Hive:
mkdir etc/catalog
vim etc/catalog/hive.properties
Add the following content to hive.properties:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://wavehouse-1:9083
5) Distribute the Presto installation package to each node of the cluster
6) After distribution, create a node.properties file in the etc directory on each node and add the following content. Note: node.id must be a different value on each node; hexadecimal is used here.
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/opt/module/presto/data
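One simple way to produce a distinct hexadecimal node.id for each node is to generate a random UUID. A minimal Python sketch (the property name comes from node.properties above; everything else is illustrative):

```python
# Each Presto node needs a unique node.id in node.properties.
# A random UUID gives a distinct lowercase-hex value per invocation.
import uuid

node_id = str(uuid.uuid4())
print(f"node.id={node_id}")
```

Run once per node (or use the `uuidgen` command) and paste the value into that node's node.properties.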
7) Presto consists of one coordinator node and multiple worker nodes. Configure the master node as the coordinator and the other nodes as workers:
vim etc/config.properties
Add the following content to the master node
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery-server.enabled=true
discovery.uri=http://wavehouse-1:8881
Other nodes add the following content
coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery.uri=http://wavehouse-1:8881
8) Start Hive Metastore
nohup bin/hive --service metastore >/dev/null 2>&1 &
9) All nodes with presto installed start presto
# Start in the foreground
bin/launcher run
or
# Start in the background
bin/launcher start
2.4.2 Presto Command Line Client Installation
Download address:
https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/
1) Upload the downloaded presto-cli-xxxx-executable.jar to the Presto installation directory on the master node
2) Rename the jar and grant it executable permission
3) Add the jar that supports LZO compression
Since the warehouse data is LZO-compressed, Presto needs the LZO jar in order to read it:
cp /root/soft/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar ./
4) Start
./presto-cli --server wavehouse-1:8881 --catalog hive --schema default
5) Presto command line operation
Presto command-line operation is similar to Hive's, except that every table must be qualified with its catalog and schema:
select * from hive.gmall.ads_back_count limit 10;
2.4.3 Presto Visual Client Installation
1) Upload yanagishima-18.0.zip to the soft directory
2) Unzip
unzip yanagishima-18.0.zip
3) Enter the conf folder, edit yanagishima.properties, and add the following content:
jetty.port=7080
presto.datasources=chen-presto
presto.coordinator.server.chen-presto=http://wavehouse-1:8881
catalog.chen-presto=hive
schema.chen-presto=default
sql.query.engines=presto
4) start
nohup bin/yanagishima-start.sh >y.log 2>&1 &
5) Visit http://wavehouse-1:7080
2.4.4 Efficiency comparison
Execute the same SQL on both the Hive side and the Presto side:
2.4.4.1 count(*) query
select count(*) from hive.gmall.dws_uv_detail_day
Hive uses the Tez engine. Ignoring the one-time cost of starting Tez, Hive's query takes 6.89 seconds.
Presto query
Presto takes 0.99 seconds: a clear improvement, reaching second-level query speed.
2.4.4.2 max(dt) query
select max(dt) from hive.gmall.dws_uv_detail_day
Hive query takes 4.65 seconds
Presto query
Presto takes 0.92 seconds: again second-level query speed.
Note: these tests ran on a local virtual machine with 4 GB of memory, so performance is limited. In a production environment with 64 GB+ of memory, performance will be much better.
2.5 Presto optimization
2.5.1 Reasonably set partitions
Similar to Hive, Presto reads partitioned data based on metadata information. Reasonable partitioning can reduce the amount of Presto data read and improve query performance.
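A toy illustration of why partition pruning helps: when a query filters on the partition key (dt), only the matching partitions are read instead of the whole table. The partition names and rows below are made up:

```python
# Made-up table, keyed by dt partition -> list of (user, value) rows.
partitions = {
    "dt=2023-01-02": [("u1", 5), ("u2", 3)],
    "dt=2023-01-03": [("u1", 7)],
    "dt=2023-01-04": [("u3", 2), ("u4", 9)],
}

def scan(pred=None):
    """Return rows, reading only partitions that satisfy the predicate."""
    rows = []
    for part, data in partitions.items():
        if pred is None or pred(part):
            rows.extend(data)
    return rows

full = scan()                                   # full table scan: 5 rows
pruned = scan(lambda p: p == "dt=2023-01-04")   # pruned scan: 2 rows
print(len(full), len(pruned))
```

With a real Hive table, the same effect comes from writing `where dt = '2023-01-04'` so Presto can skip the other partitions entirely.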
2.5.2 Using columnar storage
Presto has specific optimizations for reading ORC files. Therefore, when creating tables in Hive that Presto will query, ORC storage is recommended; Presto's support for ORC is better than its support for Parquet.
2.5.3 Using compression
Data compression can reduce the IO bandwidth pressure of data transmission between nodes. For ad hoc queries that need fast decompression, Snappy compression is recommended.
3 Druid
3.1 Introduction to Druid
Druid is a fast, columnar, distributed data store that supports real-time analysis. Compared with traditional OLAP systems, it performs significantly better at PB-scale data processing, millisecond-level queries, and real-time data ingestion.
3.2 Druid features and application scenarios
① Columnar storage
② Scalable distributed system
③ Large-scale parallel processing
④ Real-time or batch ingestion
⑤ Self-healing, self-balancing, easy to operate
⑥ Pre-aggregation and pre-calculation of data
⑦ Bitmap compression algorithms applied to the data
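Druid's bitmap feature (⑦ above) can be illustrated with a toy bitmap index: one bitmap per dimension value, with filters combined by cheap bitwise operations. The dimension values below are made up:

```python
# Toy bitmap index: for each value of a dimension, keep an integer whose
# bit i is set iff row i has that value. Filters become bitwise AND/OR.
rows = [
    {"city": "beijing",  "os": "ios"},
    {"city": "shanghai", "os": "android"},
    {"city": "beijing",  "os": "android"},
    {"city": "beijing",  "os": "ios"},
]

def build_bitmaps(rows, dim):
    bitmaps = {}
    for i, r in enumerate(rows):
        bitmaps.setdefault(r[dim], 0)
        bitmaps[r[dim]] |= 1 << i      # set bit i for this row's value
    return bitmaps

city = build_bitmaps(rows, "city")
os_ = build_bitmaps(rows, "os")
# Filter: city = 'beijing' AND os = 'ios'
match = city["beijing"] & os_["ios"]
hits = [i for i in range(len(rows)) if match >> i & 1]
print(hits)   # [0, 3]
```

Real systems (Druid uses Roaring/Concise bitmaps) additionally compress these bitmaps, but the filtering idea is the same.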
Application scenarios:
① Suitable for ingesting cleaned records in real time, with no need for updates
② Suitable for wide tables without joins (that is, a single table)
③ Suitable for aggregating basic statistical indicators represented by a single field
④ Suitable for scenarios with high real-time requirements
3.3 Druid framework
3.4 Druid data structure
Complementary to Druid's architecture is its data structure based on DataSource and Segment, which together underpin Druid's high performance.
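A minimal sketch of the DataSource/Segment idea: a DataSource's rows are chunked by time into segments, so a time-bounded query only touches the relevant segments. The events and the hourly granularity here are illustrative:

```python
# Made-up events: (timestamp, value). Druid partitions a DataSource
# into Segments by time interval; here we use one segment per hour.
events = [
    ("2023-01-04T10:05", 1),
    ("2023-01-04T10:40", 2),
    ("2023-01-04T11:10", 3),
]

segments = {}
for ts, value in events:
    hour = ts[:13]                      # segment key: truncate to the hour
    segments.setdefault(hour, []).append(value)

print(sorted(segments))   # ['2023-01-04T10', '2023-01-04T11']
```

A query restricted to 10:00-11:00 would then read only the first segment, never scanning the rest of the DataSource.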
3.5 Druid installation
3.5.1 Download the installation package
Download the latest version installation package from https://imply.io/get-started
3.5.2 Installation and deployment
1) Upload imply-2.7.10.tar.gz to the /opt/software directory of hadoop102, and decompress it
tar -zxvf imply-2.7.10.tar.gz
2) Modify the name of imply-2.7.10 to imply
3) Modify the configuration file
(1) Modify the ZK configuration of Druid
vim imply/conf/druid/_common/common.runtime.properties
(2) Modify the parameters of the startup command so that it does not verify and does not start the built-in ZK
vim imply/conf/supervise/quickstart.conf
4) Start
(1) Start Zookeeper
./zkServer.sh start
(2) start imply
bin/supervise -c conf/supervise/quickstart.conf
3.5.3 Web page use
1) Log in to wavehouse-1:9095 to view
2) Click Load data, then Apache Kafka, and set the Kafka cluster and topic
3) Confirm the data sample format
4) Load data, there must be a time field
5) Select the item to be loaded
6) Create Database table name
7) Confirm the configuration
8) Connect to topic_start of Kafka
9) Select SQL to query indicators
select sum(uid) from "topic_start"
4 Kylin
4.1 Introduction to Kylin
Apache Kylin is an open source distributed analysis engine that provides SQL query interface and multidimensional analysis (OLAP) capabilities on Hadoop/Spark to support ultra-large-scale data. It was originally developed by eBay and contributed to the open source community. It can query huge Hive tables in sub-seconds.
4.2 Kylin architecture
1) REST Server
REST Server is a set of entry points for application development, designed to implement application development for the Kylin platform. Such applications can provide queries, get results, trigger cube build tasks, get metadata, get user permissions, and more. In addition, SQL queries can be implemented through the Restful interface.
2) Query Engine
When the cube is ready, the query engine can obtain and parse user queries. It then interacts with the other components in the system and returns the corresponding results to the user.
3) Router (Routing)
The router was originally designed to forward queries that Kylin could not execute to Hive for execution. In practice, however, the speed difference between Hive and Kylin proved so large that users could not form a consistent expectation of query speed: most queries returned within a few seconds, while some had to wait minutes to tens of minutes, making for a very poor experience. This routing feature is therefore turned off by default in releases.
4) Metadata management tool (Metadata)
Kylin is a metadata-driven application. The metadata management tool is the key component that manages all metadata stored in Kylin, including the most important cube metadata. All other components depend on it to operate normally. Kylin's metadata is stored in HBase.
5) Task engine (Cube Build Engine)
This engine handles all offline tasks, including shell scripts, Java API calls, and MapReduce jobs. The task engine manages and coordinates all tasks in Kylin to ensure that each task executes effectively and to recover from failures that occur along the way.
4.3 Features of Kylin
The main features of Kylin include support for SQL interface, support for ultra-large-scale data sets, sub-second response, scalability, high throughput, BI tool integration, etc.
1) Standard SQL interface: Kylin uses standard SQL as the interface for external services.
2) Support for very large data sets: Kylin's support for huge data sets may be the most advanced among current technologies. As early as 2015, eBay's production environment supported second-level queries over tens of billions of records, and there have since been cases of second-level queries over hundreds of billions of records in mobile application scenarios.
3) Sub-second response: Kylin offers excellent query response speed thanks to precomputation. Many complex calculations, such as joins and aggregations, are completed during offline precomputation, which greatly reduces the amount of computation needed at query time and improves response speed.
4) Scalability and high throughput: Single-node Kylin can achieve 70 queries per second, and Kylin clusters can also be built.
5) BI tool integration
Kylin can be integrated with existing BI tools, including the following.
ODBC: Integrate with tools such as Tableau, Excel, and PowerBI
JDBC: Integrate with Java tools such as Saiku and BIRT
RestAPI: Integrate with JavaScript and Web pages
The Kylin development team also contributed a Zeppelin plug-in, so you can use Zeppelin to access Kylin services as well.
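The precomputation that features 2) and 3) above rely on can be sketched in miniature: aggregate every combination of dimensions (each combination is a cuboid) offline, then answer group-by queries by lookup instead of scanning the fact table. The tiny fact table below is made up:

```python
# Toy version of Kylin's cube precomputation: build sums for every
# cuboid (subset of dimensions) offline; queries become dict lookups.
from itertools import combinations

facts = [
    {"user_level": 1, "gender": "M", "amount": 100},
    {"user_level": 1, "gender": "F", "amount": 200},
    {"user_level": 2, "gender": "M", "amount": 50},
]
dims = ("user_level", "gender")

cube = {}
for r in range(len(dims) + 1):
    for cuboid in combinations(dims, r):       # (), (user_level,), ...
        for row in facts:
            key = (cuboid, tuple(row[d] for d in cuboid))
            cube[key] = cube.get(key, 0) + row["amount"]

# "select sum(amount) ... group by user_level" for level 1 is now a lookup:
print(cube[(("user_level",), (1,))])   # 300
```

With n dimensions there are 2^n cuboids, which is why cube build time and storage grow quickly and why Kylin's advanced settings let you prune cuboids.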
4.4 Kylin installation
Before installing Kylin, Hadoop, Hive, Zookeeper, and HBase must be deployed first, and the following environment variables HADOOP_HOME, HIVE_HOME, and HBASE_HOME need to be configured in /etc/profile. Remember to source them to make them take effect.
See this article for details on HBASE installation
1) Download the Kylin installation package
Download address: http://kylin.apache.org/cn/download/
2) Unzip apache-kylin-2.5.1-bin-hbase1x.tar.gz
3) Start
(1) Before starting Kylin, you need to start Hadoop (hdfs, yarn, jobhistoryserver), Zookeeper, Hbase
(2) Start Kylin
bin/kylin.sh start
The following page indicates that Kylin started successfully.
4) Visit http://wavehouse-1:7070/kylin to view the web page
User name: ADMIN, password: KYLIN (the system has been filled in)
4.5 Kylin use
Using dwd_payment_info in the gmall data warehouse as the fact table, and dwd_order_info_his and dwd_user_info as dimension tables, build a star schema and demonstrate how to use Kylin for OLAP analysis.
4.5.1 Create a project
1) Click the '+' button
2) Fill in the project name and description
4.5.2 Get data source
1) Select Data Source
2) Select import table
3) Select the required data tables and click the Sync button
4.5.3 Create a model
1) Click Models, click the "+New" button, and click the "★New Model" button.
2) Fill in the Model information and click Next
3) Specify the fact table
4) Select the dimension table, and specify the association conditions between the fact table and the dimension table, click Ok
After adding the dimension table, click Next
5) Specify the dimension field, and click Next
6) Specify the measurement field, and click Next
7) Specify the fact table Partition field (only supports time partition), click the Save button, the model is created
4.5.4 Build cubes
1) Click new, and click new cube
2) Fill in the cube information, select the model that the cube depends on, and click next
3) Select the required dimension, as shown in the figure below
4) Select the required measurement value, as shown in the figure below
5) Cube auto-merge settings. The cube is built daily according to the date partition field, and each build result is saved in a table in HBase. To improve query efficiency, the daily cubes need to be merged; the merge cycle can be set here.
6) Kylin advanced configuration (optimization related, temporarily skipped)
7) Kylin property configuration overrides
8) Cube information overview, click Save, the Cube is created
9) Build the Cube (calculation), click the action button corresponding to the Cube, and select build
10) Select the time interval to be built and click Submit
11) Click Monitor to view the construction progress
4.5.6 Advanced usage
After executing the above process, the build fails with the following error:
Cause: the dimension table dwd_order_info_his in the model is a zipper table, and dwd_user_info is a daily full snapshot table. If either whole table is used as a dimension table, the same order_id or user_id will inevitably correspond to multiple rows. There are two solutions:
Solution 1: Create a temporary table in Hive that stores only the latest complete snapshot of the dimension table, and select this temporary table as the dimension table when creating the model in Kylin.
Solution 2: Same idea as Solution 1, but use a view instead of a physical temporary table to achieve the same effect.
4.5.7 Adopting Solution 2
(1) Create a dimension table view
CREATE VIEW dwd_user_info_view as select * from dwd_user_info
WHERE dt='2023-01-04';
CREATE VIEW dwd_order_info_view as select * from dwd_order_info
WHERE dt='2023-01-04';
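The view-based fix can be sketched in miniature with SQLite standing in for Hive (table contents are made up): the view exposes only the latest partition, so each id maps to exactly one row and the dimension join becomes unambiguous.

```python
# Demonstrate the dimension-view idea: filter a daily-snapshot table
# down to its latest dt partition via a view. SQLite stands in for Hive.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dwd_user_info (id INT, name TEXT, dt TEXT)")
con.executemany(
    "INSERT INTO dwd_user_info VALUES (?, ?, ?)",
    [(1, "a", "2023-01-03"), (1, "a", "2023-01-04"), (2, "b", "2023-01-04")],
)
con.execute(
    "CREATE VIEW dwd_user_info_view AS "
    "SELECT * FROM dwd_user_info WHERE dt = '2023-01-04'"
)
# Each id now appears exactly once, as a dimension table requires.
rows = con.execute(
    "SELECT id, name FROM dwd_user_info_view ORDER BY id"
).fetchall()
print(rows)   # [(1, 'a'), (2, 'b')]
```

In Hive the date literal would normally be supplied by the scheduling system rather than hard-coded, so the view always points at the most recent partition.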
(2) Import the newly created views in Data Source; the previous dimension tables can be deleted if desired.
After modification:
(3) Recreate model and cube
(4) Wait for reconstruction
(5) Query results
Example 1:
select user_level,sum(TOTAL_AMOUNT) from DWD_PAYMENT_INFO t1 join DWD_USER_INFO_VIEW t2 on t1.USER_ID = t2.ID
group by user_level
The query now takes 0.15 seconds, returning at sub-second level.
Example 2: Add a gender dimension query
select user_level,gender,sum(TOTAL_AMOUNT) from DWD_PAYMENT_INFO t1 join DWD_USER_INFO_VIEW t2 on t1.USER_ID = t2.ID
group by user_level,gender
It takes only 0.09 seconds, again returning at sub-second level.
4.5.8 Kylin BI tools
4.5.8.1 JDBC
Import the following Maven dependency into your project to develop against Kylin over JDBC; the details are not covered here.
<dependencies>
<dependency>
<groupId>org.apache.kylin</groupId>
<artifactId>kylin-jdbc</artifactId>
<version>2.5.1</version>
</dependency>
</dependencies>
4.5.8.2 Zeppelin
1) Zeppelin installation and startup
(1) Upload zeppelin-0.8.0-bin-all.tgz to Linux
(2) Unzip zeppelin-0.8.0-bin-all.tgz
(3) Modify the name
(4) Start
bin/zeppelin-daemon.sh start
You can then log in to the web page; the default web port is 8080:
http://wavehouse-1:8080
2) Configure Zeppelin to support Kylin
(1) Click anonymous in the upper right corner and select Interpreter
(2) Search for the Kylin plug-in and modify the corresponding configuration
(3) Click Save to complete the modification
3) Create a new note
(2) Fill in the Note Name and click Create
(3) Enter SQL to query
(4) View query results
5 Summary
5.1 Ad hoc query comparison
Druid/Impala/Presto/Es/Kylin/Spark SQL comparison
Next up is building a data warehouse project based on CDH. For details, see "CDH Data Warehouse Project (1) - Detailed Process of CDH Installation and Deployment".