A simple analysis of the Presto query engine

Hive query process analysis

[Figure: Hive architecture and query flow diagram]

The role of each component

UI (User Interface): the window through which users submit queries and other operations.
Driver (engine): receives the submitted queries, creates a session handle for each, and provides execute and fetch APIs modeled on JDBC/ODBC interfaces (see the JDBC sketch after this list).
Metastore (metadata): stores all of Hive's metadata, including table definitions and the HDFS directories where table files live; usually backed by MySQL or Derby.
Compiler: parses the SQL query and generates a staged execution plan (covering MapReduce jobs and metadata operations).
Execution Engine: executes the plan generated by the compiler; the plan is a DAG of stages.
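
The Driver's execute and fetch APIs are easiest to see from the client side. Below is a minimal JDBC sketch, assuming a HiveServer2 instance at localhost:10000 and a table named my_table (both placeholders): executeQuery goes through the compile-and-run path described in the query process below, and iterating the ResultSet is the fetch side.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (optional on JDBC 4+).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // execute: the Driver compiles the query and runs the stage DAG
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table")) {
            // fetch: rows are pulled back from the Driver one at a time
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}
```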


Query Process

Step 1: The UI calls the Driver's execute interface.

Step 2: The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.

Steps 3, 4: The compiler fetches the relevant metadata from the Metastore.

Step 5: The metadata is used to check the query's expressions and to prune partitions based on the query predicates; the SQL is parsed and the execution plan is produced.

Steps 6, 6.1, 6.2, 6.3: The plan generated by the compiler is a DAG of stages; each stage is a Map/Reduce job, a metadata operation, or an HDFS file operation.
For a Map/Reduce stage, the plan contains a map operator tree (executed on the mappers) and a reduce operator tree (executed on the reducers).
The Execution Engine submits each stage to the appropriate component.

Steps 7, 8, 9: In each task (mapper/reducer), a deserializer reads rows of the table or of intermediate output from HDFS and passes them through the associated operator tree.
Once output is produced, it is written to a temporary HDFS file through a serializer (this happens in the mapper when the job has no reduce phase); the temporary files feed the subsequent Map/Reduce stages of the plan.
For DML operations, the temporary file is moved to the table's final location. This scheme ensures that readers never see dirty data, because file rename operations are atomic in HDFS.
For queries, the Execution Engine reads the contents of the temporary files directly from HDFS, serving them as part of the fetch call from the Driver's API.
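
The stage DAG produced by the compiler can be inspected with Hive's EXPLAIN statement. A small sketch over the same kind of JDBC connection as above (the table and column names are made up); the printed plan starts with a STAGE DEPENDENCIES section (the DAG) followed by STAGE PLANS holding the map and reduce operator trees of each stage.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExplainExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement();
             // EXPLAIN prints the plan instead of running the query.
             ResultSet rs = stmt.executeQuery(
                 "EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // one plan line per row
            }
        }
    }
}
```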


Presto query process analysis


[Figure: Presto architecture diagram]


The role of each component

Client: the window through which users submit queries and other operations.

Discovery Server: maintains the list of available servers.
Coordinator: receives queries, parses the SQL statements, generates the query plan, and distributes tasks to the Worker machines (see the JDBC sketch after this list).
Connector Plugin: connects to storage and provides metadata; it supports Hive, Kafka, MySQL, MongoDB, Redis, JMX and other data sources, and custom connectors can be written.
Worker: executes the query plan.
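
From the client's point of view, all of this sits behind the Coordinator, which is reachable through the presto-jdbc driver. A minimal sketch, assuming a coordinator at localhost:8080 with the Hive connector mounted as catalog hive and a table named my_table (all placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoJdbcExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "test"); // Presto requires a user name
        // URL format: jdbc:presto://coordinator:port/catalog/schema
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:presto://localhost:8080/hive/default", props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}
```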


Query Process

1. The Client sends a query request over HTTP (see the protocol sketch after this list).
2. The available servers are found through the Discovery Server.
3. The Coordinator builds the query plan (it parses the SQL into an AST (Abstract Syntax Tree) with ANTLR3, obtains metadata from the Connector, and generates the distribution and execution plan).
4. The Coordinator dispatches tasks to the Workers.
5. The Workers read data through the Connector plugin.
6. The Workers execute their tasks in memory (the Workers form a pure in-memory compute engine).
7. The Workers return their data to the Coordinator, which aggregates the results and responds to the Client.
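
Step 1 is plain HTTP: the client POSTs the SQL text to the coordinator's /v1/statement endpoint, and each JSON response carries the query id, any ready rows, and a nextUri to poll until the query finishes. A minimal sketch of the first request, assuming a coordinator at localhost:8080 (the nextUri polling loop is omitted; in practice presto-cli and the JDBC driver wrap this protocol):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PrestoHttpExample {
    public static void main(String[] args) throws Exception {
        // POST the SQL to the coordinator (step 1 above).
        URL url = new URL("http://localhost:8080/v1/statement");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("X-Presto-User", "test");
        conn.setRequestProperty("X-Presto-Catalog", "hive");
        conn.setRequestProperty("X-Presto-Schema", "default");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write("SELECT 1".getBytes("UTF-8"));
        }
        // Print the JSON response; a real client would follow its nextUri
        // field repeatedly while steps 2-7 run inside the cluster.
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```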


Presto compared with Hive

[Figure: Presto vs. Hive execution model comparison]


Differences:
MapReduce writes to disk at every operation, and each stage must wait for the previous stage to finish before it can start.
Presto, by contrast, converts the SQL into stages; each stage is executed by multiple tasks, and each task is further divided into multiple splits.
All tasks execute in parallel, data flows between stages as a streaming pipeline,
and data moves between nodes memory-to-memory over the network, with no disk I/O.
This is the decisive reason Presto runs 5-10x faster than Hive.
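
A toy Java analogy (not Presto or Hive code) for that difference: the staged version materializes every intermediate result before the next operator starts, while the pipelined version pulls each element through all operators in one pass.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineAnalogy {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6);

        // MapReduce-style: each stage materializes its full output
        // (on disk, in real MapReduce) before the next stage begins.
        List<Integer> stage1 = input.stream()
                .map(x -> x * 10).collect(Collectors.toList());
        List<Integer> stage2 = stage1.stream()
                .filter(x -> x > 20).collect(Collectors.toList());

        // Presto-style: the operators form one lazy pipeline, and each
        // element streams through map and filter with no intermediate list.
        List<Integer> pipelined = input.stream()
                .map(x -> x * 10)
                .filter(x -> x > 20)
                .collect(Collectors.toList());

        System.out.println(stage2);    // [30, 40, 50, 60]
        System.out.println(pipelined); // [30, 40, 50, 60]
    }
}
```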


Presto shortcomings

1. No fault tolerance: when a query is distributed to multiple Workers and one Worker's part fails for any reason, the Master notices the failure and the entire query fails (a client-side retry sketch follows this list).

2. Memory limits: since Presto computes purely in memory, when memory runs out Presto will not dump intermediate results to disk, so the query fails (the latest Presto versions are said to support spilling to disk).

3. Parallel queries: because all tasks execute in parallel, if a single Worker's part is slow for any reason, the whole query becomes slow.

4. Limited concurrency: the all-in-memory execution model plus the memory limits cap the amount of data that can be processed at once, so concurrency suffers.
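
Because of point 1, client code usually compensates by retrying the whole query, since Presto cannot resume a partially executed one. A sketch of such a wrapper around the JDBC call (the retry policy and all names here are illustrative, not a Presto API):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class RetryingQuery {
    // Re-run the whole query on failure: a crashed Worker fails the query,
    // so retrying from scratch is the only recovery option.
    static long countWithRetry(String sql, int maxAttempts) throws SQLException {
        Properties props = new Properties();
        props.setProperty("user", "test");
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:presto://localhost:8080/hive/default", props);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                rs.next();
                return rs.getLong(1);
            } catch (SQLException e) {
                last = e; // e.g. a Worker died mid-query; try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countWithRetry("SELECT COUNT(*) FROM my_table", 3));
    }
}
```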


Application in the Mob project:
http://gitlab.code.mob.com/mobdata-plat/dbcloud-api



Origin: blog.51cto.com/14192352/2412947