The road to learning Presto, 01: Overall introduction

Introduction

Presto is a distributed SQL query engine for querying large data sets distributed across one or more different data sources. A complete installation includes one Coordinator and multiple Workers. A client, such as the Presto command-line interface (CLI), submits queries to the Coordinator. The Coordinator parses and analyzes each query, plans its execution, and then distributes the processing to the Workers.

Presto is a fully memory-based distributed big data query engine: all queries and computations are executed in memory.

The input of Presto is a SQL statement; the output is the result of executing that statement.

Presto can connect to different data sources, such as MySQL, Hive, etc.

Presto can optimize the SQL query process, including optimizing the execution plan of SQL itself, and using distributed queries to improve concurrency.

Presto is not a database and cannot handle online transactions.

Basic concepts

Process types

Coordinator

  • Role:
    • Externally, it is responsible for managing the connection between the cluster and the client, and receiving client query requests.
    • Perform SQL syntax analysis, query plan generation and optimization, and schedule query tasks.
    • The management node of the cluster. A built-in discovery server tracks the status of Worker nodes.
  • Deployment situation: Generally deployed on a separate node in the cluster; for testing, the Coordinator can be deployed on the same node as a Worker.
  • Communication method: Use RESTful interface to interact with clients and Workers.
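
To make the client-facing RESTful interface more concrete, here is a minimal Python sketch of a client submitting a statement and paging through the results. The `/v1/statement` endpoint, the `X-Presto-User` header and the `nextUri` / `data` response fields follow Presto's documented client REST API as I understand it. The coordinator address, the user name and the helper names `http_json` and `run_query` are assumptions made up for illustration, and error handling is left out.

```python
import json
from urllib.request import Request, urlopen


def http_json(url, method="GET", body=None, user="example-user"):
    """Send one HTTP request and decode the JSON response.

    "example-user" is a placeholder; Presto expects some user name
    in the X-Presto-User header.
    """
    request = Request(url, data=body, method=method,
                      headers={"X-Presto-User": user})
    with urlopen(request) as response:
        return json.load(response)


def run_query(coordinator_url, sql, fetch=http_json):
    """Submit a statement, then follow nextUri until no more pages remain."""
    page = fetch(coordinator_url + "/v1/statement",
                 method="POST", body=sql.encode("utf-8"))
    rows = []
    while True:
        # A page may or may not carry result rows.
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:
            return rows
        page = fetch(next_uri)
```

Against a live cluster, a call might look like `run_query("http://coordinator.example.com:8080", "SELECT 1")`, where the host name is hypothetical. The `fetch` parameter exists only so the transport can be swapped out, for example in tests.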

Worker

  • Role: The working node of the cluster. Used to execute decomposed query tasks and process data.
  • Deployment situation: Generally, multiple worker nodes are deployed in a cluster.
  • Communication method: Use RESTful interface to interact with Coordinator and other Workers.

Interaction

  • Status management:
    • Workers will send RESTful heartbeats to the Coordinator at regular intervals to inform the Coordinator's discovery server that they are still alive.
  • Data processing:
    • Workers: Responsible for pulling data from connectors and interacting with other Workers for intermediate data processing.
    • Coordinator: Responsible for pulling results from Workers and returning the final results to the client.
    • After receiving the client query, the Coordinator selects the appropriate Workers from the list of surviving Workers to run the Task.
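
The heartbeat and Worker-selection steps above can be sketched as a toy model in Python. This is not Presto's actual discovery implementation: the class name `DiscoveryServer`, the 30-second timeout and the "take the first N live Workers" policy are all assumptions chosen only to illustrate the idea.

```python
import time


class DiscoveryServer:
    """Toy stand-in for the Coordinator's built-in discovery server."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # hypothetical liveness window
        self.clock = clock
        self.last_seen = {}  # worker id -> time of its last heartbeat

    def heartbeat(self, worker_id):
        """Record that a Worker has just reported in."""
        self.last_seen[worker_id] = self.clock()

    def alive_workers(self):
        """Return the Workers whose last heartbeat is within the timeout."""
        now = self.clock()
        return sorted(worker for worker, seen in self.last_seen.items()
                      if now - seen <= self.timeout_seconds)


def pick_workers(discovery, wanted):
    """Choose up to `wanted` Workers from the surviving ones to run Tasks."""
    return discovery.alive_workers()[:wanted]
```

A Worker that stops sending heartbeats simply drops out of `alive_workers()` once the timeout has passed, so it is no longer considered when Tasks are placed.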

Data sources

Connector

Function: Presto can access a variety of different data sources through connectors. The connector is equivalent to the driver for database access.

  • Each connector implements standard access to data sources by implementing Presto's SPI interface.
  • How to access data source through connector?
    • Create a configuration file under $PRESTO_HOME/etc/catalog/, for example example.properties. (The file suffix must be .properties.)
    • Set the attribute connector.name, which is required. The catalog manager uses this attribute to create a connector that accesses the corresponding data source.
    • Multiple catalogs can use the same connector to access similar data sources. For example, you can configure two catalogs (both using the Hive connector) in one Presto cluster to access two Hive clusters.
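
As a rough illustration of the catalog file rules above, the following Python sketch reads the text of one catalog file and derives the catalog name from the file name. `connector.name=hive-hadoop2` and `hive.metastore.uri` are property names used by Presto's Hive connector as far as I know; the file name `hive_a.properties`, the metastore host and the helper `load_catalog` are hypothetical, and the parser handles only simple `key=value` lines.

```python
from pathlib import PurePosixPath

# Hypothetical contents of etc/catalog/hive_a.properties
HIVE_A = """
# catalog for the first Hive cluster (host name is made up)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-a.example.com:9083
"""


def load_catalog(path, text):
    """Return (catalog_name, properties) for one catalog file."""
    file_path = PurePosixPath(path)
    if file_path.suffix != ".properties":
        raise ValueError("catalog files must end in .properties")
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    if "connector.name" not in properties:
        raise ValueError("connector.name is required")
    # The catalog name is the file name without the suffix.
    return file_path.stem, properties
```

A second file such as `hive_b.properties` with the same `connector.name` but a different metastore address would give a second catalog backed by the same connector.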

Catalog

A catalog contains one or more schemas and accesses a specified data source through a specified connector. For example, you access a Hive data source by configuring a Hive catalog.

Schema

Function: Used to organize tables, similar to a database in MySQL. A catalog and a schema together uniquely determine a set of queryable tables.

Table

Similar to a table in a traditional relational database. The mapping from source data to tables is defined by the connector.

Interrelationship

When accessing a table, the full name of the table starts with the catalog name (that is, the prefix of the catalog configuration file, such as xxx in xxx.properties). For example, hive.test_data_schema.test refers to the table test, which is located in the schema test_data_schema, which in turn is located in the catalog hive.
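
The catalog.schema.table naming rule can be shown with a small Python sketch. The names `QualifiedName` and `parse_table_name` are made up for illustration, and the parser deliberately ignores quoted identifiers.

```python
from typing import NamedTuple


class QualifiedName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(full_name):
    """Split 'catalog.schema.table' into its three parts."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected catalog.schema.table, got %r" % full_name)
    return QualifiedName(*parts)
```

For `hive.test_data_schema.test`, the result has `catalog` set to `hive`, `schema` set to `test_data_schema` and `table` set to `test`.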

Query execution model

When Presto executes a SQL statement, it parses the statement into a corresponding query and then executes that query in the distributed cluster.

Statement

Presto supports ANSI standard SQL statements, which include clauses, expressions, and predicates.

Why does Presto distinguish between a statement and a query? In Presto, a statement is the textual representation of the SQL entered by the user. When the statement is executed, Presto creates a query execution and a query plan, and the query plan is executed in a distributed fashion on a series of Worker nodes. 【This is necessary because, in Presto, statements simply refer to the textual representation of a SQL statement. When a statement is executed, Presto creates a query along with a query plan that is then distributed across a series of Presto workers.】

Query

When Presto receives a SQL statement, it converts it into a query execution (Query) and creates a query plan. The query plan is realized as a series of interrelated stages running on Presto Workers. 【When Presto parses a statement, it converts it into a query and creates a distributed query plan which is then realized as a series of interconnected stages running on Presto workers. When you retrieve information about a query in Presto, you receive a snapshot of every component that is involved in producing a result set in response to a statement.】

The difference between Statement and Query: a Statement is the SQL text passed to Presto; a Query is the set of configurations and components instantiated to execute that Statement. A Query encompasses stages, tasks, splits, connectors and the corresponding data sources. 【A statement can be thought of as the SQL text that is passed to Presto, while a query refers to the configuration and components instantiated to execute that statement. A query encompasses stages, tasks, splits, connectors, and other components and data sources working in concert to produce a result.】

Stage

When Presto executes a Query, it splits the query into multiple stages with hierarchical relationships. For example, when Presto queries 100 million records from Hive and aggregates them, it creates a series of stages to execute the distributed query, plus a root stage that aggregates the output of those stages. The aggregated result is output to the Coordinator and then returned to the user.

There is a tree-shaped hierarchical structure between the stages of a Query. Each Query has a Root stage, which is used to aggregate the output data of all other Stages. Stage is just a logical concept used by the coordinator to model distributed query plans, and it will not be executed on Presto Workers itself.
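
The idea of lower stages feeding a root stage can be sketched in Python as a two-level aggregation. This is only an analogy: real stages are logical plan fragments, and the two functions below stand in for the work that the tasks of each stage would do. The function names and the "count rows per key" aggregation are assumptions for illustration.

```python
from collections import Counter


def leaf_stage(keys):
    """Partial aggregation over one slice of the data: count rows per key."""
    return Counter(keys)


def root_stage(partial_counts):
    """Final aggregation: merge the partial counts from the lower stage."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # adds counts key by key
    return dict(total)


# Each inner list stands for the part of the data one leaf task processed.
slices = [["a", "b", "a"], ["b", "c"]]
result = root_stage(leaf_stage(keys) for keys in slices)
```

Here `result` maps each key to its total count across all slices, which is what the root stage would hand to the Coordinator.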

Task

As mentioned earlier, stages do not run directly on Presto Workers. Instead, each stage is decomposed into a series of tasks, and the tasks run on Presto Workers.

Presto does its actual work through Tasks:

  • A distributed query plan is broken down into a series of stages.
  • A stage is broken down into a series of tasks that are executed in parallel.
  • Each task is decomposed into one or more parallel drivers, and each driver acts on a split. In this way, each task can process one or more splits in parallel. Each task has corresponding input and output.

Split

A split is a small slice of the entire larger data set. The lower-level stages in the distributed query plan obtain splits from the data source, and the intermediate stages located at the higher level obtain data from other stages.

When Presto executes a Query, the Coordinator will ask a connector to obtain a list of all splits in a table. Then, the Coordinator will select the appropriate node to run the corresponding task to process the split. 【When Presto is scheduling a query, the coordinator will query a connector for a list of all splits that are available for a table. The coordinator keeps track of which machines are running which tasks and what splits are being processed by which tasks.】
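
A minimal Python sketch of handing splits to Workers is shown below. The round-robin policy and the name `assign_splits` are assumptions chosen for simplicity; the text above only says that the Coordinator selects appropriate nodes, not how it chooses them.

```python
def assign_splits(splits, workers):
    """Distribute splits over workers in round-robin order.

    Returns a dict mapping each worker to the list of splits it should process.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = {worker: [] for worker in workers}
    for index, split in enumerate(splits):
        worker = workers[index % len(workers)]
        assignment[worker].append(split)
    return assignment
```

The returned mapping also reflects the bookkeeping described in the quoted passage: it records which Worker is processing which splits.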

Driver

A Task contains one or more Drivers. Drivers process data; their output is aggregated by the task and passed to a task in a downstream stage. A Driver is a sequence of operators that acts on a split.

The Driver is the lowest-level unit of parallelism in the Presto architecture.

Each driver has an input and an output.

【Tasks contain one or more parallel drivers. Drivers act upon data and combine operators to produce output that is then aggregated by a task and then delivered to another task in another stage. A driver is a sequence of operator instances, or you can think of a driver as a physical set of operators in memory. It is the lowest level of parallelism in the Presto architecture. A driver has one input and one output.】

Operator

An Operator represents one operation applied to a Split. It reads the data in the Split in sequence, applies its computation to that data, and generates output.
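
The relationship between a Driver, its Operators and a Split can be sketched in Python with generators. This is a simplified model, not Presto's operator interface: the names `filter_operator`, `project_operator` and `run_driver`, and the row format, are assumptions for illustration.

```python
def filter_operator(predicate):
    """Build an operator that keeps only the rows matching the predicate."""
    def apply(rows):
        return (row for row in rows if predicate(row))
    return apply


def project_operator(fn):
    """Build an operator that transforms each row with fn."""
    def apply(rows):
        return (fn(row) for row in rows)
    return apply


def run_driver(split, operators):
    """Feed one split through a sequence of operators: one input, one output."""
    rows = iter(split)
    for operator in operators:
        rows = operator(rows)  # each operator consumes the previous one's output
    return list(rows)


# A split is modeled as a small list of rows.
split = [{"amount": 5}, {"amount": 20}, {"amount": 15}]
output = run_driver(split, [
    filter_operator(lambda row: row["amount"] > 10),
    project_operator(lambda row: row["amount"] * 2),
])
```

Rows are pulled through the operators lazily, one at a time, so each operator reads its input in sequence as described above.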

Exchange

Exchange is used to transfer data between Presto nodes. A task places the data it produces in an output buffer, and consumes data from other tasks through an exchange client.
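
The producer and consumer sides of an exchange can be modeled in Python as below. This is an in-process toy: in a real cluster the data moves between nodes over the network, and the class names `OutputBuffer` and `ExchangeClient` as well as their methods are assumptions for illustration only.

```python
from collections import deque


class OutputBuffer:
    """Holds the pages of data an upstream task has produced."""

    def __init__(self):
        self._pages = deque()

    def put(self, page):
        self._pages.append(page)

    def poll(self):
        """Return the oldest page, or None when the buffer is empty."""
        return self._pages.popleft() if self._pages else None


class ExchangeClient:
    """Lets a downstream task consume pages from several upstream buffers."""

    def __init__(self, upstream_buffers):
        self._buffers = list(upstream_buffers)

    def fetch_all(self):
        pages = []
        for buffer in self._buffers:
            page = buffer.poll()
            while page is not None:
                pages.append(page)
                page = buffer.poll()
        return pages
```

Each upstream task would own one `OutputBuffer`, and a downstream task would create one `ExchangeClient` over all the buffers it reads from.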

Relationships

  • 1 Query <-> Multiple Stages
  • 1 Stage <-> Multiple parallel Tasks
  • 1 Task <-> Multiple Drivers
  • 1 Driver <-> 1 Split
  • 1 Driver <-> a list of Operators
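
The one-to-many relationships listed above can be written down as a small Python data model. The class and field names are assumptions for illustration, and splits and operators are represented by plain strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Driver:
    split: str            # 1 Driver <-> 1 Split
    operators: List[str]  # 1 Driver <-> a list of Operators


@dataclass
class Task:
    drivers: List[Driver] = field(default_factory=list)  # 1 Task <-> multiple Drivers


@dataclass
class Stage:
    tasks: List[Task] = field(default_factory=list)      # 1 Stage <-> multiple Tasks


@dataclass
class Query:
    statement: str
    stages: List[Stage] = field(default_factory=list)    # 1 Query <-> multiple Stages


def count_splits(query):
    """Count the splits in a query: one per driver."""
    return sum(len(task.drivers)
               for stage in query.stages
               for task in stage.tasks)
```

Because each Driver handles exactly one Split, counting Drivers across all Tasks and Stages gives the number of Splits the Query is working on.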

Overall structure

  • Coordinator: one. (Open question: how is its reliability ensured?)
  • Worker: multiple.
  • CLI: the machine on which the Presto command-line client is deployed.
  • Application Client: A program using the Presto JDBC driver.

Origin blog.csdn.net/TangYuG/article/details/132756315