01 Presto Overview: Features, Advantages and Disadvantages, Use Cases, Architecture


Keywords: MPP, multi-source ad hoc query, unified SQL execution engine, distributed SQL engine, data analysis

1. What is Presto?

  • Presto is an open source distributed SQL query engine based on a massively parallel processing (MPP) architecture. It targets interactive analysis in the big data field, where queries over data sources ranging from gigabytes to petabytes are expected to return within seconds.
  • Presto was created because the MapReduce model is too slow to serve HDFS data interactively through tools such as BI dashboards.
  • Presto is a compute engine and does not store data. It reads data from third-party systems through a rich set of Connectors, and new Connectors can be added to extend it.

Official introduction:
Apache Presto is a distributed parallel query execution engine, optimized for low latency and interactive query analysis. Presto runs queries easily and scales without down time even from gigabytes to petabytes.
A single Presto query can process data from multiple sources like HDFS, MySQL, Cassandra, Hive and many more data sources. Presto is built in Java and easy to integrate with other data infrastructure components. Presto is powerful, and leading companies like Airbnb, DropBox, Groupon, Netflix are adopting it.

2. Advantages and disadvantages of Presto

2.1. Advantages

  • Presto supports standard SQL, which lowers the barrier to entry for analysts and developers.
  • Presto supports pluggable Connectors, so it can connect to multiple data sources and run join queries across them. Examples: Hive, MySQL, Oracle, Kafka, MongoDB, Elasticsearch, PostgreSQL…
  • Presto is a low-latency, high-concurrency in-memory computing engine, and its execution efficiency is much higher than that of Hive.
  • Deployment is simple and monitoring is rich.


2.2. Disadvantages

  1. Memory limitation: Presto can analyze PB-level data, but that does not mean it holds PB-level data in memory. For aggregations such as count and avg, it computes while reading: it reads a batch of data, updates the result, frees the memory, and then reads the next batch, so memory consumption stays low. A join query, however, may generate a large amount of temporary data and slow down; in that case Hive may perform better.

  2. No fault tolerance: A query is distributed to multiple Workers for execution. If any Worker fails for any reason, the master detects the failure and the entire query fails.

  3. Parallel execution: Because all tasks run in parallel, a single slow Worker slows down the entire query.

  4. Concurrency limit: Because all computation happens in memory and memory is limited, the amount of data that can be processed at the same time is limited, which constrains concurrency.
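
The "compute while reading" behavior described in point 1 can be illustrated with a minimal Python sketch. It is not Presto code: `read_in_batches` and `streaming_count_avg` are hypothetical helpers, and the batch size is an arbitrary assumption. The point is only that count and avg need two running numbers, no matter how much data flows through.

```python
from typing import Iterable, Iterator, List, Tuple


def read_in_batches(values: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Yield the input in small batches, standing in for data read from a source."""
    batch: List[float] = []
    for value in values:
        batch.append(value)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def streaming_count_avg(values: Iterable[float], batch_size: int = 1000) -> Tuple[int, float]:
    """Compute count and avg while keeping only two running numbers in memory."""
    count = 0
    total = 0.0
    for batch in read_in_batches(values, batch_size):
        count += len(batch)
        total += sum(batch)
        # The batch goes out of scope here; only count and total survive.
    avg = total / count if count else float("nan")
    return count, avg


count, avg = streaming_count_avg(range(1, 1_000_001))
print(count, avg)
```

A join cannot be reduced to a couple of running numbers in the same way, because one side has to be kept around to match against the other, which is why the memory pressure described above shows up there.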

3. Presto application scenarios

  • Accelerating Hive queries. Presto uses a pure in-memory MPP execution model, which is at least 5 times faster than the MapReduce model with disk-based shuffle used by Hive.
  • Unified SQL execution engine. Presto is compatible with the ANSI SQL standard and can connect to multiple RDBMS and data warehouse sources, so the same SQL syntax and SQL functions work across all of them.
  • Bringing SQL to storage systems that lack it. For example, Presto can bring SQL execution capabilities to HBase, Elasticsearch, and Kafka, and even to local files, memory, JMX, and HTTP interfaces.
  • Building a virtual unified data warehouse for federated queries across data sources. If the data to be analyzed is scattered across different RDBMSs, data warehouses, or even other RPC systems, Presto can join these sources directly (SQL Join) without first copying the data into one place.
  • Data migration and ETL. Presto connects to multiple data sources and has rich SQL functions and UDFs, which helps data engineers extract (E), transform (T), and load (L) data from one data source into another.
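
To make the federated query idea concrete, here is a toy Python sketch. It assumes two independent in-memory SQLite databases as stand-ins for two separate data sources; the table names, columns, and the `federated_join` helper are all hypothetical. It does not show how Presto implements joins, only the idea of joining rows from two systems without first copying them into one store.

```python
import sqlite3

# Two independent in-memory databases stand in for two separate data sources.
orders_db = sqlite3.connect(":memory:")  # hypothetical "data warehouse" source
users_db = sqlite3.connect(":memory:")   # hypothetical "RDBMS" source

orders_db.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 5.0), (1, 7.5), (3, 1.0)],
)

users_db.execute("CREATE TABLE users (id INTEGER, user_type TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "vip"), (2, "regular")])


def federated_join(orders_conn, users_conn):
    """Inner hash join: build a lookup from one source, probe it with the other."""
    user_types = dict(users_conn.execute("SELECT id, user_type FROM users"))
    joined = []
    for user_id, amount in orders_conn.execute(
        "SELECT user_id, amount FROM orders ORDER BY rowid"
    ):
        if user_id in user_types:  # rows without a match are dropped
            joined.append((user_id, user_types[user_id], amount))
    return joined


rows = federated_join(orders_db, users_db)
print(rows)
```

In Presto the same result is expressed as a single SQL statement across two catalogs, and the engine takes care of reading from each source.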

4. Presto data model

  • Catalog: a data source. Hive and MySQL are both data sources, and Presto can connect to multiple Hive instances and multiple MySQL instances.
  • Schema: similar to a database. A Catalog contains multiple Schemas.
  • Table: a data table, with the same meaning as a table in a conventional database. A Schema contains multiple Tables.

Example query:

SELECT * FROM hive.dwd.table_a a
JOIN mysql.dim.user_type_dim b
ON a.id = b.id
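
The table names in the example query follow the Catalog, Schema, Table hierarchy. A minimal Python sketch of that three-level naming is shown here; `QualifiedTable` and `parse_table_name` are hypothetical helpers for illustration, not part of Presto, and the sketch assumes fully qualified, unquoted names.

```python
from typing import NamedTuple


class QualifiedTable(NamedTuple):
    catalog: str  # the data source, e.g. a Hive or MySQL instance
    schema: str   # similar to a database inside that source
    table: str    # the table inside that schema


def parse_table_name(name: str) -> QualifiedTable:
    """Split 'catalog.schema.table' into its three levels (no quoted identifiers)."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return QualifiedTable(*parts)


left = parse_table_name("hive.dwd.table_a")
right = parse_table_name("mysql.dim.user_type_dim")
print(left)
print(right)
```

Because the two tables resolve to different catalogs, the join in the example spans two data sources.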

5. Presto architecture

Presto is based on a master-slave architecture and consists of a Coordinator node, a Discovery node, and multiple Worker nodes, as shown below:

[Figure: Presto architecture]

The figure above is composed of different components. The table below describes each component in detail.

[Table: Presto components]

5.1 Execution process

1. Submit SQL: The user enters SQL in a SQL client (CLI/JDBC/HTTP), which submits the query to the Coordinator of the Presto cluster.

2. Generate the execution plan and schedule tasks: After the Coordinator receives the SQL, the SQL parser parses it into an abstract syntax tree (AST) and checks that it conforms to the SQL grammar. The logical query planner then looks up metadata through the Connector (schema, column names, column types, and so on) and resolves the AST against it to produce a logical query plan. The logical plan is handed to the distributed planner, which splits it into Stages and Tasks, and the Tasks are scheduled onto Presto Workers for distributed execution.

3. Execution: A Presto Worker executes the HttpRemoteTask it receives. Based on the execution plan, it determines which Operators to run and in what order, and then runs all of them through the TaskExecutor and Driver. If the first Operator to run is a SourceOperator, the Task first pulls data from the External Storage System and then performs the subsequent calculations. If the last Operator to run is a TaskOutputOperator, the Task writes its result to the OutputBuffer, where it waits for the Stage that depends on the current Stage to pull it. After all Tasks in all Stages of the Query have finished, the final result is returned to the SQL client.
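
The Operator chain in step 3 can be pictured with a small Python sketch. It is a simplified model, not Presto's implementation: the function names, the page size, and the filter predicate are assumptions made for illustration, and a plain `deque` stands in for the OutputBuffer.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List

Page = List[int]  # a page is a small batch of rows; here each row is just an int


def source_operator(external_rows: Iterable[int], page_size: int) -> Iterator[Page]:
    """Stand-in for a SourceOperator: pulls rows from an external system page by page."""
    page: Page = []
    for row in external_rows:
        page.append(row)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page


def filter_operator(pages: Iterator[Page], predicate: Callable[[int], bool]) -> Iterator[Page]:
    """An intermediate operator: keeps only the rows that satisfy the predicate."""
    for page in pages:
        kept = [row for row in page if predicate(row)]
        if kept:
            yield kept


def run_task(pages: Iterator[Page], output_buffer: deque) -> None:
    """Stand-in for the driver loop plus the output operator: push each page to the buffer."""
    for page in pages:
        output_buffer.append(page)


output_buffer: deque = deque()
pipeline = filter_operator(source_operator(range(10), page_size=4), lambda r: r % 2 == 0)
run_task(pipeline, output_buffer)

# A downstream stage would pull pages from the buffer; here we simply read them all.
result = [row for page in output_buffer for row in page]
print(result)
```

Each operator only ever holds one page at a time, which matches the idea that a Task streams data from the source through its Operators into the buffer for the next Stage.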

PS:
External Storage System: Presto does not store data itself, so the data and metadata involved in a computation come from external storage systems, such as the distributed systems HDFS and AWS S3. In practice, enterprises often use Hive Metastore to store metadata, HDFS to store data, and Presto to run the computation, which speeds up queries on Hive tables.

Execution plan: It describes the detailed steps of SQL execution. The SQL execution engine can complete the entire computation simply by following the execution plan, as shown in the figure below:

[Figure: example execution plan]


References:

  1. https://prestodb.io/
  2. https://www.tutorialspoint.com/apache_presto/apache_presto_architecture.htm

Origin blog.csdn.net/qq_31557939/article/details/129237382