Revisiting the fundamentals: what SQL goes through from submission to execution | JD Cloud technical team

1. What is SQL

SQL (Structured Query Language) is a high-level, non-procedural programming language that lets users work with high-level data structures. It is both a data query and programming language, and an ANSI standard computer language. That said, many different dialects of SQL exist; to remain compatible with the ANSI standard, they all support the major commands (such as SELECT, UPDATE, DELETE, INSERT, and WHERE) in a similar way.

In standard SQL, statements fall into four categories; a minimal JDBC sketch after the list shows one statement of each kind.

DML (Data Manipulation Language): data manipulation language, used to manipulate database records (data).

DCL (Data Control Language): data control language, used to define access privileges and security levels.

DQL (Data Query Language): data query language, used to query records (data).

DDL (Data Definition Language): data definition language, used to define database objects (databases, tables, columns, etc.).
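A minimal JDBC sketch with one statement per category; the connection URL, credentials, and table are placeholders, not part of the original article:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class SqlCategories {
    public static void main(String[] args) throws SQLException {
        // Placeholder URL/credentials; any JDBC-accessible database works
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/demo", "user", "pass");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE t (id INT, name VARCHAR(20))"); // DDL: define objects
            stmt.execute("INSERT INTO t VALUES (1, 'a')");             // DML: manipulate records
            stmt.execute("SELECT id, name FROM t");                    // DQL: query records
            stmt.execute("GRANT SELECT ON demo.t TO 'reader'@'%'");    // DCL: control access
        }
    }
}
```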

2. How to execute SQL

2.1 MySQL

Taking MySQL as an example, SQL execution passes through roughly the following stages (MySQL server-layer code only, excluding engine-layer operations such as transactions and logging):

mysqlLex: MySQL's own lexical analyzer, written in C++. It splits the input statement into tokens and resolves the meaning of each token; tokenization is essentially a process of regular-expression matching. The source code is in sql/sql_lex.cc.
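As a toy illustration of that idea (not MySQL's actual lexer), here is a hand-rolled tokenizer that classifies tokens by matching regular-expression alternatives:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyLexer {
    // One alternative per token class, tried left to right
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:(SELECT|FROM|WHERE)|([A-Za-z_][A-Za-z0-9_]*)|(\\d+)|([=,*]))",
            Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Matcher m = TOKEN.matcher("SELECT id FROM t WHERE id = 1");
        while (m.find()) {
            String kind = m.group(1) != null ? "KEYWORD"
                        : m.group(2) != null ? "IDENTIFIER"
                        : m.group(3) != null ? "NUMBER" : "SYMBOL";
            System.out.println(kind + " -> " + m.group().trim());
        }
    }
}
```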

Bison: performs grammar parsing according to the grammar rules defined by MySQL. Grammar parsing is the process of building a syntax tree; the core problem is designing appropriate storage structures and algorithms to store and traverse all of that information.

During syntax parsing, a syntax tree is generated.

MySQL analyzer: performs SQL parsing, extracting and resolving keywords and non-keywords and generating a parse tree. If there is a syntax error, an exception is thrown: ERROR: You have an error in your SQL syntax. Some validation is also done at this stage; for example, if a field does not exist, an exception is thrown: unknown column in field list.

Points for further study:

a. Syntax tree generation rules

b. MySQL optimization rules

2.2 Hive SQL

Hive is a data warehouse analysis system built on Hadoop. It provides rich SQL-based querying over data stored in the Hadoop distributed file system: it can map structured data files to database tables, offers complete SQL query capability, and converts SQL statements into MapReduce jobs for execution. This SQL dialect is referred to as Hive SQL, and it makes it easy for users unfamiliar with MapReduce to query, summarize, and analyze data using the SQL language.

Hive architecture diagram:

Driver:

The Driver takes the SQL string as input, parses it, converts it into an abstract syntax tree, and then into a logical plan. Optimization tools are then used to optimize the logical plan, and finally a physical plan is generated (covering serialization, deserialization, and UDF functions), which is handed to the execution engine and submitted to MapReduce for execution (input and output may be local or on HDFS/HBase). See the Hive architecture diagram.

The execution process of Hive SQL is as follows:

Once written, a SQL statement is just a pieced-together string; it must go through a series of parsing steps before it finally becomes a job executed on the cluster.

(1) Parser: parses the SQL into an AST (abstract syntax tree) and performs syntax validation; the AST is essentially still a representation of the string.

(2) Analyzer: semantic analysis, generating a QB (query block).

(3) Logical Plan: generates the logical execution plan, producing a set of operator trees.

(4) Logical Optimizer: optimizes the logical execution plan, producing a set of optimized operator trees.

(5) Physical Plan: generates the physical execution plan, producing a task tree.

(6) Physical Optimizer: optimizes the physical execution plan, producing an optimized task tree. These tasks are the jobs executed on the cluster.

Conclusion: after the six steps above, an ordinary SQL string has been parsed and mapped into execution tasks on the cluster. The two most important steps are logical execution plan optimization and physical execution plan optimization (circled in red in the figure).
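One way to observe the output of this pipeline without actually running the job is Hive's EXPLAIN statement, which prints the compiled stage/operator plan. Below is a minimal sketch over JDBC; the HiveServer2 URL and the emp table are placeholder assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExplainDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL; requires the hive-jdbc driver on the classpath
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement();
             // EXPLAIN prints the compiled plan instead of submitting the job
             ResultSet rs = stmt.executeQuery("EXPLAIN SELECT dept, COUNT(*) FROM emp GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```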

Antlr: ANTLR is a language recognition tool, built on Java, that can be used to construct domain-specific languages. It provides a framework for building recognizers, compilers, and interpreters from grammar descriptions containing Java, C++, or C# actions. ANTLR handles Hive's lexical analysis, syntax analysis, semantic analysis, and intermediate code generation.
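A minimal ANTLR 4 sketch of the lexing and parsing steps described above. It assumes a hypothetical grammar MiniSql.g4 has already been compiled into MiniSqlLexer and MiniSqlParser, with statement as its start rule; only the ANTLR runtime API calls are real:

```java
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class AstDemo {
    public static void main(String[] args) {
        // 1. Lexical analysis: split the raw string into a token stream
        MiniSqlLexer lexer = new MiniSqlLexer(CharStreams.fromString("SELECT id FROM t"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        // 2. Syntax analysis: build and print the parse tree (LISP-style)
        MiniSqlParser parser = new MiniSqlParser(tokens);
        System.out.println(parser.statement().toStringTree(parser));
    }
}
```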

Example AST (abstract syntax tree):

Extended learning:

a. As its execution mechanism shows, Hive is not suitable for online transaction processing and cannot provide real-time query capability; it is best suited to batch jobs over large volumes of immutable data.

b. The ANTLR parsing process

c. Hive optimization rules

2.3 Flink SQL

Flink SQL is the highest-level abstraction in Flink. From top to bottom, the abstraction layers are SQL --> Table API --> DataStream/DataSet API --> Stateful Stream Processing.

Flink SQL covers DML (data manipulation), DDL (data definition), and DQL (data query) statements, but not DCL.
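A minimal sketch of the Table API entry point, assuming the flink-table dependencies are available; the orders table and datagen connector are illustrative only:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        // DDL: register a source table (datagen produces random rows)
        tEnv.executeSql(
                "CREATE TABLE orders (id BIGINT, amount DOUBLE) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '1')");
        // DQL: the statement below goes through parsing, validation,
        // optimization, and translation to a JobGraph before it runs
        tEnv.executeSql("SELECT id, amount FROM orders").print();
    }
}
```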

(1) First, Flink SQL uses the Apache Calcite engine under the hood to process SQL statements. Calcite uses JavaCC for SQL parsing: JavaCC generates a series of Java classes from the Parser.jj file defined in Calcite, and the generated code converts the SQL into an AST (abstract syntax tree) of type SqlNode.

(2) The generated SqlNode abstract syntax tree is as yet unvalidated. The SQL Validator then fetches metadata from the Flink Catalog to validate the SQL: table names, field names, function names, data types, and so on are all checked, and a validated SqlNode is produced.

(3) At this point the SQL has only been parsed into fixed Java data-structure nodes; the relationships between the nodes and the type information of each node have not yet been established.

The SqlNode therefore needs to be converted into a logical plan (LogicalPlan). During this conversion, the SqlToOperationConverter class converts the SqlNode into an Operation, which performs operations such as creating or dropping tables according to the SQL statement. At the same time, the FlinkPlannerImpl.rel() method converts the SqlNode into a RelNode tree and returns a RelRoot.

(4) Step 4 executes the optimize operation, optimizing the logical plan according to the predefined optimization rules (RelOptRule).

Calcite has two kinds of optimizer (RelOptPlanner): HepPlanner, which performs rule-based optimization (RBO), and VolcanoPlanner, which performs cost-based optimization (CBO). The result is an optimized RelNode, and the optimized logical plan is then converted into a physical plan according to Flink's own rules.
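To make the RBO idea concrete, here is a small self-contained Calcite sketch (illustrative, not Flink's internal code): RelBuilder constructs a tiny logical plan, and HepPlanner applies the FILTER_PROJECT_TRANSPOSE rule to push the filter below the projection:

```java
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.RelBuilder;

public class RboDemo {
    public static void main(String[] args) {
        FrameworkConfig config = Frameworks.newConfigBuilder()
                .defaultSchema(Frameworks.createRootSchema(true))
                .build();
        RelBuilder builder = RelBuilder.create(config);
        // Logical plan: VALUES -> PROJECT(id) -> FILTER(id = 1)
        RelNode plan = builder
                .values(new String[]{"id", "amount"}, 1, 10, 2, 20)
                .project(builder.field("id"))
                .filter(builder.call(SqlStdOperatorTable.EQUALS,
                        builder.field("id"), builder.literal(1)))
                .build();
        System.out.println(RelOptUtil.toString(plan));
        // Rule-based (RBO) pass with HepPlanner
        HepProgram program = new HepProgramBuilder()
                .addRuleInstance(CoreRules.FILTER_PROJECT_TRANSPOSE)
                .build();
        HepPlanner planner = new HepPlanner(program);
        planner.setRoot(plan);
        System.out.println(RelOptUtil.toString(planner.findBestExp()));
    }
}
```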

(5) In step 5, the execute operation generates transformations through code generation and then recursively traverses each node, converting DataStreamRelNodes into DataStreams. During this process, the translateToPlan methods overridden in classes such as DataStreamUnion, DataStreamCalc, and DataStreamScan are called recursively. Recursively calling each node's translateToPlan in effect uses CodeGen to assemble Flink's various operators, which is equivalent to developing the program directly against Flink's DataSet or DataStream API.

(6) Finally, the result is further compiled into an executable JobGraph and submitted for execution.

Flink SQL uses Apache Calcite as its parser and optimizer.

Calcite: a dynamic data management framework that provides many functions of a typical database management system, such as SQL parsing, SQL validation, SQL query optimization, SQL generation, and federated data queries, but deliberately omits some key functions: for example, Calcite does not store metadata or base data itself, and does not fully include the algorithms for processing data.
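A minimal sketch of Calcite's parsing API (calcite-core), showing the JavaCC-generated parser producing a SqlNode AST; the query is arbitrary:

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParseException;
import org.apache.calcite.sql.parser.SqlParser;

public class CalciteParseDemo {
    public static void main(String[] args) throws SqlParseException {
        SqlParser parser = SqlParser.create(
                "SELECT id, SUM(amount) FROM orders GROUP BY id");
        // The JavaCC-generated parser turns the string into a SqlNode AST
        SqlNode ast = parser.parseQuery();
        System.out.println(ast.getKind()); // SELECT
        System.out.println(ast);           // unparses the AST back to SQL text
    }
}
```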

Extended learning:

a. Flink SQL optimization rules

3. Common SQL parsing engines

Parsing engine | Development language | Usage scenarios | Summary
ANTLR | Java | Presto | 1. Contains three main components: lexical analyzer, syntax analyzer, and tree parser. 2. Supports defining domain-specific languages.
Calcite | JavaCC | Flink | 1. Produces an abstract syntax tree. 2. Supports extending the syntax via the FreeMarker template engine. 3. Can execute queries together with the database.

To be continuously expanded...

4. Summary

In practice, SQL optimization comes up regularly: for example, automatically rewriting complex nested SQL written by non-developer business colleagues into non-nested form to improve query performance, or supporting Redis SQL by parsing SQL written in standard format into the Redis commands executed on the backend. We currently use the open-source jsqlparser framework to parse the syntax tree. Its advantage is simplicity: it merely splits the SQL statement and parses it into a hierarchy of Java classes, supports the visitor pattern, and is database-independent. Its disadvantage is that it only supports the common SQL syntax set; extending the syntax requires modifying the source code, which makes the integration intrusive and hurts maintainability. To do SQL parsing and optimization well, you still need a deep understanding of how SQL executes, know the characteristics, strengths, and weaknesses of each SQL engine, and think about the problem from an architectural perspective.
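A minimal jsqlparser sketch, assuming a 4.x version of the library: the SQL string is parsed into a hierarchy of Java objects, and the built-in TablesNamesFinder visitor walks the tree to collect table names:

```java
import net.sf.jsqlparser.JSQLParserException;
import net.sf.jsqlparser.parser.CCJSqlParserUtil;
import net.sf.jsqlparser.statement.Statement;
import net.sf.jsqlparser.util.TablesNamesFinder;

public class JsqlParserDemo {
    public static void main(String[] args) throws JSQLParserException {
        Statement stmt = CCJSqlParserUtil.parse(
                "SELECT o.id FROM orders o WHERE o.amount > 100");
        // The statement is now a tree of Java objects, independent of any database
        TablesNamesFinder finder = new TablesNamesFinder();
        System.out.println(finder.getTableList(stmt)); // [orders]
    }
}
```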

If a worker wants to do his job well, he must first sharpen his tools.

Author: JD Technology Li Danfeng

Source: JD Cloud Developer Community. Please indicate the source when reprinting.
