Presto Installation and Usage

Table of Contents

Introduction to Presto

Presto architecture

Advantages and disadvantages of Presto

Performance comparison between Presto and Impala

Presto installation

Presto command line client installation

Presto Visual Client installation

About the lack of LZO support

Presto data storage optimization

Presto query SQL optimization


Introduction to Presto

Presto is an open-source distributed SQL query engine that handles data volumes ranging from gigabytes to petabytes. It is mainly used for interactive queries that return results in seconds.

Note: Although Presto can parse SQL, it is not a standard database. It is not a substitute for MySQL or Oracle, nor can it be used for online transaction processing (OLTP).

Presto architecture

Presto consists of a Coordinator and multiple Workers.

Advantages and disadvantages of Presto

Advantages:

  • Presto computes in memory, which reduces disk I/O and makes queries faster.
  • It can connect to multiple data sources and join tables across them; for example, you can query a large volume of website access records from Hive and then join them against device information from MySQL (see the sketch below).
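
As a hedged sketch of such a cross-source query: it assumes a mysql catalog has been configured alongside the hive catalog, and all table and column names here are hypothetical.

-- Join Hive access logs against device information stored in MySQL
select l.device_id, d.device_model, count(*) as pv
from hive.default.access_log l
join mysql.device_db.device_info d
  on l.device_id = d.device_id
group by l.device_id, d.device_model;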

Disadvantages:

Presto can handle PB-scale data analysis, but this does not mean it holds PB-scale data in memory all at once. Depending on the scenario, aggregations such as Count and AVG are computed while the data is being read, and memory is released as soon as each batch has been processed, so memory consumption stays low. However, join queries may generate large amounts of intermediate data, so they run more slowly.

Performance comparison between Presto and Impala

Tests concluded that Impala's performance is slightly ahead of Presto's, but Presto supports a much richer set of data sources, including Hive, graph databases, traditional relational databases, Redis, and more.

Presto installation

1. Download

Official website address: https://prestodb.github.io/

Download link: https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.196/presto-server-0.196.tar.gz

2. Install and unzip

tar -zxvf presto-server-0.196.tar.gz -C /opt/module/

3. Rename the directory

mv presto-server-0.196/ presto

4. Enter the presto directory and create a folder for storing data and a folder for storing configuration files

mkdir data

mkdir etc

5. Add the jvm.config configuration file in the presto/etc directory

vim jvm.config
Add the following content:
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

6. Presto supports multiple data sources, which are called catalogs in Presto. Here we configure a Hive catalog so Presto can use Hive as a data source.

# Create the catalog directory
mkdir catalog
# Configure hive.properties
vim hive.properties
# Add the following content
connector.name=hive-hadoop2
hive.metastore.uri=thrift://bigdata02:9083

7. Distribute presto to other nodes

xsync presto

8. After distribution, enter the presto/etc path on each of the three hosts and configure the node properties; node.id must be different on each node.

# This must be modified on every node
vim node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/opt/module/presto/data

9. Presto is composed of one coordinator node and multiple worker nodes. Configure 01 as the coordinator, and configure 02 and 03 as workers.

First, configure the coordinator node on 01:

vim config.properties
Add the following content:
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery-server.enabled=true
discovery.uri=http://bigdata02:8881

Configure worker nodes on 02 and 03

# Configure on 02
vim config.properties
Add the following content:
coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery.uri=http://bigdata02:8881

# Configure on 03
vim config.properties
Add the following content:
coordinator=false
http-server.http.port=8881
query.max-memory=50GB
discovery.uri=http://bigdata02:8881

 

10. In the /opt/module/hive directory on 01, start the Hive metastore using the hadoop account:

nohup bin/hive --service metastore >/dev/null 2>&1 &

11. Start Presto Server on 01, 02, and 03 respectively

Start Presto in the foreground; logs are printed to the console:

 bin/launcher run

Start Presto in the background

bin/launcher start

12. Logs can be viewed under presto/data/var/log.

Presto command line client installation

1. Download

https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.196/presto-cli-0.196-executable.jar

2. Upload presto-cli-0.196-executable.jar to the presto installation directory on 01

3. Rename the file

mv presto-cli-0.196-executable.jar prestocli

4. Add execute permission

chmod +x prestocli

5. Start prestocli

./prestocli --server bigdata02:8881 --catalog hive --schema default

6. Presto command line operation

Presto's command-line usage is similar to Hive's, except that every table must be qualified with its schema.

For example:

select * from schema.table limit 100
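
A few other useful statements for exploring the configured catalogs; these are standard Presto SQL, and the table name in the describe example is hypothetical:

show catalogs;
show schemas from hive;
show tables from hive.default;
describe hive.default.access_log;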

Presto Visual Client installation

1. Upload yanagishima-18.0.zip to the /opt/module directory on 01

2. Unzip yanagishima

unzip yanagishima-18.0.zip
cd yanagishima-18.0

3. Enter the /opt/module/yanagishima-18.0/conf folder and write the yanagishima.properties configuration

vim yanagishima.properties
Add the following content:
jetty.port=7080
presto.datasources=hadoop-presto
presto.coordinator.server.hadoop-presto=http://bigdata02:8881
catalog.hadoop-presto=hive
schema.hadoop-presto=default
sql.query.engines=presto

4. Start yanagishima in the /opt/module/yanagishima-18.0 path

nohup bin/yanagishima-start.sh >y.log 2>&1 &

5. Open the web page

http://bigdata02:7080

6. View the table structure

For example, execute select * from hive.gmall.ads_user_topic limit 5. In this statement, the hive prefix can be omitted; it is the catalog configured above.

Each table has a Copy button next to it. Click it to copy the fully qualified table name, then enter the SQL statement in the box above and press Ctrl+Enter or click the Run button to execute the query and view the results.

About the lack of LZO support

Much of our data is in the LZO format, which Presto does not support out of the box. However, if the data uses columnar storage with LZO compression, Presto can support it with a little configuration: copy the hadoop-lzo jar from Hadoop into Presto's hive-hadoop2 plugin directory, distribute it to the other Presto nodes, and then restart Presto.

cp /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar /opt/module/presto/plugin/hive-hadoop2/

# Switch to the /opt/module/presto/plugin/hive-hadoop2/ directory
cd /opt/module/presto/plugin/hive-hadoop2/

# Distribute to the other nodes
xsync hive-hadoop2

Presto data storage optimization

1. Reasonably set up partitions

Like Hive, Presto reads partition data based on metadata. Reasonable partitioning reduces the amount of data Presto has to read and improves query performance.

2. Use columnar storage

Presto has specific optimizations for reading ORC files, so it is recommended to create the tables Presto will query in Hive using the ORC storage format. Presto supports ORC better than Parquet.

3. Use compression

Data compression can reduce the pressure on IO bandwidth caused by data transmission between nodes. For ad hoc queries that require fast decompression, Snappy compression is recommended.
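
Putting the three points above together, here is a minimal Hive DDL sketch that creates a partitioned, ORC-stored, Snappy-compressed table for Presto to query; the table and column names are hypothetical:

create table dws_visit_log (
    user_id    string,
    visit_time string
)
partitioned by (acct_day string)  -- partition field, see point 1
stored as orc                     -- columnar storage, see point 2
tblproperties ("orc.compress" = "SNAPPY");  -- compression, see point 3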

Presto query SQL optimization

1. Select only the fields you use

With columnar storage, selecting only the required fields speeds up reading and reduces the amount of data processed. Avoid using * to read all fields, as shown below.
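
For example, on the table used earlier (the column names are hypothetical):

-- Reads every column: slow on wide tables
select * from hive.gmall.ads_user_topic;
-- Reads only the needed columns: faster
select user_id, user_level from hive.gmall.ads_user_topic;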

2. Filter conditions must include the partition field

For partitioned tables, filter on the partition field in the where clause first. In the example below, acct_day is the partition field and visit_time is the specific visit time.
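
A hedged illustration using the fields named above; the table name is hypothetical:

-- Good: prunes partitions via the partition field
select user_id from hive.gmall.dws_visit_log
where acct_day = '2020-11-28';

-- Bad: filtering only on visit_time scans every partition
select user_id from hive.gmall.dws_visit_log
where visit_time >= '2020-11-28 00:00:00';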

3. Group By statement optimization

Arranging the field order in the Group By statement reasonably can improve performance to some extent. Order the fields in the Group By clause by their number of distinct values, in descending order (highest cardinality first), as illustrated below.
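
A hedged sketch, assuming uid has far more distinct values than gender (both table and columns are hypothetical):

-- Preferred: higher-cardinality field first
select count(*) from hive.gmall.dws_visit_log group by uid, gender;
-- Slower: low-cardinality field first
select count(*) from hive.gmall.dws_visit_log group by gender, uid;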

4. Use Limit when order by

Order by requires the data to be gathered on a single worker node for sorting, which can consume a lot of memory on that worker. For Top-N or Bottom-N queries, adding a limit reduces the sorting work and the memory pressure.
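
For example, a Top-100 query on the hypothetical table used above:

-- limit bounds the amount of sorting the worker must do
select user_id, count(*) as pv
from hive.gmall.dws_visit_log
group by user_id
order by pv desc
limit 100;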

5. Put the large table on the left when using the Join statement

Presto's default join algorithm is the broadcast join: the table on the left of the join is split across multiple workers, while the entire table on the right of the join is copied to every worker for computation. If the right-hand table is too large, this can cause an out-of-memory error.
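
For example, with hypothetical tables where access_log is much larger than device_info:

select l.device_id, d.device_model
from hive.default.access_log l    -- large table on the left
join hive.default.device_info d   -- small table on the right, broadcast to workers
  on l.device_id = d.device_id;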

 

Note:

  • Presto does not support the insert overwrite syntax; you can only delete the data first and then insert into.
  • Presto currently supports querying the Parquet format, but does not support inserting into Parquet tables.
  • When comparing timestamp values, Presto requires an explicit Timestamp keyword (a typed literal), whereas MySQL can compare timestamps against plain strings directly (see the sketch below).
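
A hedged illustration of the timestamp note; the table and column names are hypothetical:

-- Presto: the comparison needs an explicit timestamp literal
select * from hive.gmall.dws_visit_log
where visit_time > timestamp '2020-11-28 00:00:00';

-- MySQL would accept a plain string comparison instead:
-- where visit_time > '2020-11-28 00:00:00'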
