Lexical analysis
Overview
Lexical analysis and syntax analysis are implemented with Lex and Yacc, using the scan.l and gram.y files in the postgres source code. These two files are used to generate scan.c and gram.c respectively, which are combined with the other C files of the parser to form the complete lexical and syntax analysis module.
The file generation and calling relationships required for lexical and syntax analysis are shown in the figure below.
The lexical analyzer scan.l identifies identifiers, SQL keywords, and so on. For each keyword or identifier found, it generates a token and passes it to the parser.
The grammar analyzer gram.y contains a set of grammar rules and the actions to execute when a rule is matched.
The raw_parser function (in src/backend/parser/parser.c) performs lexical and syntax analysis by calling base_yyparse, the parser function generated in advance by Lex and Yacc.
Important source files and calling relationships
Source File | Description |
---|---|
parser.c | Entry point of lexical and syntax analysis. Its main function is raw_parser, which performs lexical and syntax analysis on the query statement and returns the parse tree |
gram.y | Defines the grammar; written in the Yacc language and compiled by Yacc into gram.c |
gram.h | Defines the numeric codes of the keywords |
scan.l | Defines the lexical structure; written in the Lex language and compiled by Lex into scan.c |
kwlist.h | Defines the keyword list: the reserved words and other keywords used in the PostgreSQL database system |
kwlookup.h | Declares the keyword lookup interface for SQL statements (in previous versions, this file also defined the ScanKeyword structure) |
kwlookup.c | Provides the ScanKeywordLookup function, which determines whether the input string is a keyword. If so, it returns the index of the matching entry in the keyword list; otherwise it returns -1. The lookup uses a hash function (previous versions used binary search) |
scansup.c | Provides several functions used in lexical analysis. The downcase_truncate_identifier function converts uppercase English characters to lowercase. The truncate_identifier function truncates identifiers that exceed the maximum identifier length. The scanner_isspace function determines whether the input character is a whitespace character. |
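The identifier handling described in the scansup.c row can be sketched in a few lines of C. This is a minimal illustration, not the PostgreSQL code: the names `sketch_isspace` and `sketch_downcase_truncate` are hypothetical, the limit `SKETCH_NAMEDATALEN` is assumed to be 64, the whitespace set is an assumption, and only ASCII input is handled.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NAMEDATALEN 64   /* assumption: buffer size including the terminator */

/* Returns 1 if ch is one of the characters this sketch treats as whitespace. */
static int sketch_isspace(char ch)
{
    return ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r' || ch == '\f';
}

/*
 * Copy ident into out (capacity SKETCH_NAMEDATALEN), folding ASCII upper
 * case to lower case and truncating to SKETCH_NAMEDATALEN - 1 bytes.
 * Returns the length of the result.
 */
static size_t sketch_downcase_truncate(const char *ident, char *out)
{
    size_t i = 0;

    assert(ident != NULL && out != NULL);
    while (ident[i] != '\0' && i < SKETCH_NAMEDATALEN - 1)
    {
        char ch = ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';        /* ASCII-only downcasing */
        out[i++] = ch;
    }
    out[i] = '\0';
    return i;
}
```

With this sketch, `MyTable` becomes `mytable`, and a 100-character identifier is cut to 63 characters.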
Lexical analyzer
Lexical analysis looks for patterns of characters in the input. It uses regular expressions, which describe patterns concisely and precisely, to match input strings and convert them into the corresponding tokens. Each rule pairs a regular expression with an action that is executed when the expression matches. In effect, lexical analysis extracts the reserved words, operators, special symbols, and other language elements from the input.
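To make the idea of turning characters into tokens concrete, here is a small hand-written tokenizer. It is only a sketch: it uses character tests in place of the regular expressions that Lex would compile, its keyword table has three entries, and every name in it (`SketchToken`, `sketch_next_token`, and so on) is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_EOF, TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_SYMBOL } SketchTokenType;

typedef struct
{
    SketchTokenType type;
    char            text[32];   /* token text, lower-cased for words */
} SketchToken;

/* Tiny keyword table, for illustration only. */
static const char *const sketch_keywords[] = { "from", "select", "where" };

static int sketch_is_keyword(const char *word)
{
    size_t i;

    for (i = 0; i < sizeof(sketch_keywords) / sizeof(sketch_keywords[0]); i++)
        if (strcmp(word, sketch_keywords[i]) == 0)
            return 1;
    return 0;
}

/* Read one token starting at *pos and advance *pos past it. */
static SketchToken sketch_next_token(const char *input, size_t *pos)
{
    SketchToken tok = { TOK_EOF, "" };
    size_t      len = 0;

    assert(input != NULL && pos != NULL);
    while (isspace((unsigned char) input[*pos]))
        (*pos)++;                       /* skip whitespace between tokens */
    if (input[*pos] == '\0')
        return tok;                     /* end of input */

    if (isalpha((unsigned char) input[*pos]) || input[*pos] == '_')
    {
        /* word: [A-Za-z_][A-Za-z0-9_]*, folded to lower case */
        while (isalnum((unsigned char) input[*pos]) || input[*pos] == '_')
        {
            if (len < sizeof(tok.text) - 1)
                tok.text[len++] = (char) tolower((unsigned char) input[*pos]);
            (*pos)++;
        }
        tok.text[len] = '\0';
        tok.type = sketch_is_keyword(tok.text) ? TOK_KEYWORD : TOK_IDENT;
    }
    else if (isdigit((unsigned char) input[*pos]))
    {
        /* integer literal: [0-9]+ */
        while (isdigit((unsigned char) input[*pos]))
        {
            if (len < sizeof(tok.text) - 1)
                tok.text[len++] = input[*pos];
            (*pos)++;
        }
        tok.text[len] = '\0';
        tok.type = TOK_NUMBER;
    }
    else
    {
        /* anything else is returned as a one-character symbol */
        tok.text[0] = input[(*pos)++];
        tok.text[1] = '\0';
        tok.type = TOK_SYMBOL;
    }
    return tok;
}
```

Calling `sketch_next_token` repeatedly on `SELECT id FROM t1 WHERE id > 42` yields a keyword, an identifier, a keyword, an identifier, a keyword, an identifier, a symbol, a number, and finally the end-of-input token.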
Parser
The task of the parser is to find the relationships among the input tokens. A common way to represent these relationships is a parse tree.
Execution of query statement
Definition of SELECT statement in gram.y
SelectStmt: select_no_parens %prec UMINUS
| select_with_parens %prec UMINUS
;
select_with_parens:
'(' select_no_parens ')' {
$$ = $2; }
| '(' select_with_parens ')' {
$$ = $2; }
;
select_no_parens:
simple_select {
$$ = $1; }
| select_clause sort_clause
{
insertSelectOptions((SelectStmt *) $1, $2, NIL,
NULL, NULL,
yyscanner);
$$ = $1;
}
……
| with_clause select_clause opt_sort_clause select_limit opt_for_locking_clause
{
insertSelectOptions((SelectStmt *) $2, $3, $5,
$4,
$1,
yyscanner);
$$ = $2;
}
;
The SELECT statement is represented by SelectStmt, which is defined as either a SELECT statement without parentheses (select_no_parens) or one with parentheses (select_with_parens).
A SELECT statement without parentheses can be a simple SELECT statement (simple_select) or a more complex form built from select_clause, for example one followed by a sort clause. Syntax analysis of the whole statement therefore amounts to splitting the statement into many small grammatical units and then analyzing those units.
We use the simple SELECT syntax as an example:
simple_select:
SELECT opt_all_clause opt_target_list
into_clause from_clause where_clause
group_clause having_clause window_clause
{
SelectStmt *n = makeNode(SelectStmt);
n->targetList = $3;
n->intoClause = $4;
n->fromClause = $5;
n->whereClause = $6;
n->groupClause = ($7)->list;
n->groupDistinct = ($7)->distinct;
n->havingClause = $8;
n->windowClause = $9;
$$ = (Node *) n;
}
……
;
simple_select is the core part of the SELECT statement. According to its syntax, it contains the following clauses:
Field | Description |
---|---|
targetList | target attribute |
intoClause | SELECT INTO |
fromClause | FROM clause |
whereClause | WHERE clause |
groupClause | GROUP BY clause |
havingClause | HAVING clause |
windowClause | window clause |
In previous versions, this rule also included a DISTINCT clause, which is used to remove duplicate rows.
After the simple_select rule is successfully matched, a SelectStmt structure is created:
typedef struct SelectStmt
{
    NodeTag     type;
    /*
     * The following fields are used only in "leaf" SelectStmts.
     */
    List       *distinctClause; /* NULL, list of DISTINCT ON expressions, or
                                 * lcons(NIL, NIL) for all (SELECT DISTINCT) */
    IntoClause *intoClause;     /* target of SELECT INTO */
    List       *targetList;     /* the target list (list of ResTarget) */
    List       *fromClause;     /* the FROM clause */
    Node       *whereClause;    /* WHERE condition */
    List       *groupClause;    /* GROUP BY clauses */
    bool        groupDistinct;  /* is this GROUP BY DISTINCT? */
    Node       *havingClause;   /* HAVING conditional expression */
    List       *windowClause;   /* WINDOW window_name AS (...), ... */
    /*
     * In a "leaf" node representing a VALUES list, the fields above are all
     * null and this field is set instead. Note that the elements of the
     * sublists are just expressions, without ResTarget decoration. Also,
     * a list element can be DEFAULT (represented as a SetToDefault node),
     * regardless of the context of the VALUES list. Parse analysis rejects
     * that where it is not valid.
     */
    List       *valuesLists;    /* untransformed list of expression lists */
    /*
     * The following fields are used in both "leaf" SelectStmts and
     * upper-level SelectStmts.
     */
    List       *sortClause;     /* sort clause (list of SortBy) */
    Node       *limitOffset;    /* number of result tuples to skip */
    Node       *limitCount;     /* number of result tuples to return */
    LimitOption limitOption;    /* limit type */
    List       *lockingClause;  /* FOR UPDATE (list of LockingClause) */
    WithClause *withClause;     /* WITH clause */
    /*
     * The following fields are used only in upper-level SelectStmts.
     */
    SetOperation op;            /* type of set operation */
    bool        all;            /* ALL specified? */
    struct SelectStmt *larg;    /* left child */
    struct SelectStmt *rarg;    /* right child */
    /* Eventually add fields for the CORRESPONDING spec here */
} SelectStmt;
It describes every aspect of a SELECT query, including the target list, FROM clause, WHERE condition, grouping, sorting, and more.
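The way a grammar action allocates a tagged node and then fills in one field per clause can be sketched as below. This is a heavily simplified stand-in, not the PostgreSQL definitions: `SketchSelectStmt`, `sketch_make_select`, and `sketch_attach_sort` are hypothetical names, clause contents are plain strings where the real fields are List and Node pointers, and `calloc` stands in for makeNode.

```c
#include <assert.h>
#include <stdlib.h>

/* Every node starts with a tag that identifies its concrete type. */
typedef enum { T_SketchInvalid = 0, T_SketchSelectStmt } SketchNodeTag;

typedef struct
{
    SketchNodeTag type;
    const char   *targetList;   /* stand-in for List * of ResTarget */
    const char   *fromClause;   /* stand-in for List * of table_ref */
    const char   *whereClause;  /* stand-in for Node *, NULL if absent */
    const char   *sortClause;   /* filled in later, NULL if absent */
} SketchSelectStmt;

/*
 * Roughly what the simple_select action does: allocate a zeroed node,
 * set its tag, and copy each matched clause into its field.
 * Returns NULL if allocation fails.
 */
static SketchSelectStmt *sketch_make_select(const char *targets,
                                            const char *from,
                                            const char *where)
{
    SketchSelectStmt *n = calloc(1, sizeof(SketchSelectStmt));

    if (n == NULL)
        return NULL;
    n->type = T_SketchSelectStmt;
    n->targetList = targets;
    n->fromClause = from;
    n->whereClause = where;
    return n;
}

/*
 * Loosely mirrors the role of insertSelectOptions in select_no_parens:
 * attach an option parsed outside simple_select to an existing node.
 */
static void sketch_attach_sort(SketchSelectStmt *stmt, const char *sort)
{
    assert(stmt != NULL);
    stmt->sortClause = sort;
}
```

Because the node is zero-initialized, any clause that does not appear in the query is simply left as NULL, which matches how absent clauses show up in the real structure.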
Target attribute
The target attribute is the list of attributes to be queried in the SELECT statement, corresponding to the nonterminal target_list in the grammar. target_list consists of one or more target_el, and each target_el is defined as an aliased expression, a plain expression, '*', and so on.
target_list:
target_el {
$$ = list_make1($1); }
| target_list ',' target_el {
$$ = lappend($1, $3); }
;
target_el: a_expr AS ColLabel
{
$$ = makeNode(ResTarget);
$$->name = $3;
$$->indirection = NIL;
$$->val = (Node *) $1;
$$->location = @1;
}
| a_expr BareColLabel
{
$$ = makeNode(ResTarget);
$$->name = $2;
$$->indirection = NIL;
$$->val = (Node *) $1;
$$->location = @1;
}
| a_expr
{
$$ = makeNode(ResTarget);
$$->name = NULL;
$$->indirection = NIL;
$$->val = (Node *) $1;
$$->location = @1;
}
| '*'
{
ColumnRef *n = makeNode(ColumnRef);
n->fields = list_make1(makeNode(A_Star));
n->location = @1;
$$ = makeNode(ResTarget);
$$->name = NULL;
$$->indirection = NIL;
$$->val = (Node *) n;
$$->location = @1;
}
;
When target_el is matched, a ResTarget structure is created. This structure stores all the information about the attribute:
typedef struct ResTarget
{
    NodeTag     type;
    /*
     * Column name, or NULL
     */
    char       *name;
    /*
     * Sublist of subscripts, field names, and '*', or NIL
     */
    List       *indirection;
    /*
     * The value expression to compute or assign
     */
    Node       *val;
    /*
     * Token location, or -1 if unknown
     */
    int         location;
} ResTarget;
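The target_list rule is left-recursive: the first target_el creates a one-element list, and each later target_el is appended to it. The sketch below imitates that pattern with a growable array. It is an illustration under assumptions: `SketchResTarget`, `SketchList`, `sketch_list_make1`, `sketch_lappend`, and `sketch_list_nth` are hypothetical stand-ins, and the value expression is kept as a string where the real field is a Node pointer.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
    const char *name;       /* alias after AS, or NULL */
    const char *val;        /* expression text (the real field is a Node *) */
    int         location;   /* byte offset in the query, or -1 */
} SketchResTarget;

typedef struct
{
    SketchResTarget *items;
    size_t           length;
    size_t           capacity;
} SketchList;

/*
 * Append item to list, creating the list if it is NULL.
 * Returns the list, or NULL on allocation failure (no cleanup in this sketch).
 */
static SketchList *sketch_lappend(SketchList *list, SketchResTarget item)
{
    if (list == NULL)
    {
        list = calloc(1, sizeof(SketchList));
        if (list == NULL)
            return NULL;
    }
    if (list->length == list->capacity)
    {
        size_t           newcap = list->capacity ? list->capacity * 2 : 4;
        SketchResTarget *grown = realloc(list->items, newcap * sizeof(SketchResTarget));

        if (grown == NULL)
            return NULL;
        list->items = grown;
        list->capacity = newcap;
    }
    list->items[list->length++] = item;
    return list;
}

/* Build a one-element list, as the first alternative of target_list does. */
static SketchList *sketch_list_make1(SketchResTarget item)
{
    return sketch_lappend(NULL, item);
}

/* Return a pointer to the n-th element (zero-based). */
static const SketchResTarget *sketch_list_nth(const SketchList *list, size_t n)
{
    assert(list != NULL && n < list->length);
    return &list->items[n];
}
```

For `SELECT a AS x, b`, the first reduction would call `sketch_list_make1` with name `x` and value `a`, and the second would call `sketch_lappend` with a NULL name and value `b`, so the list keeps the targets in source order.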
FROM clause
from_clause consists of the FROM keyword and from_list. from_list consists of one or more table_ref nonterminals. Each table_ref is one of the comma-separated items in the FROM clause and represents a table or a subquery appearing in FROM.
from_clause:
FROM from_list {
$$ = $2; }
| /*EMPTY*/ {
$$ = NIL; }
;
from_list:
table_ref {
$$ = list_make1($1); }
| from_list ',' table_ref {
$$ = lappend($1, $3); }
;
table_ref: relation_expr opt_alias_clause
{
$1->alias = $2;
$$ = (Node *) $1;
}
……
;
The simplest and most basic form of an item (table_ref) in a FROM clause is a relation expression (relation_expr):
relation_expr:
qualified_name
{
/* inheritance query, implicitly */
$$ = $1;
$$->inh = true;
$$->alias = NULL;
}
| extended_relation_expr
{
$$ = $1;
}
;
extended_relation_expr:
qualified_name '*'
{
/* inheritance query, explicitly */
$$ = $1;
$$->inh = true;
$$->alias = NULL;
}
| ONLY qualified_name
{
/* no inheritance */
$$ = $2;
$$->inh = false;
$$->alias = NULL;
}
| ONLY '(' qualified_name ')'
{
/* no inheritance, SQL99-style syntax */
$$ = $3;
$$->inh = false;
$$->alias = NULL;
}
;
The relation expression relation_expr is defined as a qualified_name, a qualified_name preceded by the ONLY keyword, and so on. Finally, qualified_name is defined as a ColId, optionally followed by indirection, which together give the relation name.
qualified_name:
ColId
{
$$ = makeRangeVar(NULL, $1, @1);
}
| ColId indirection
{
$$ = makeRangeVarFromQualifiedName($1, $2, @1, yyscanner);
}
;
After the relation name is matched, a RangeVar structure is created to store the information about the relation:
typedef struct RangeVar
{
    NodeTag     type;
    /*
     * Catalog (database) name, or NULL
     */
    char       *catalogname;
    /*
     * Schema name, or NULL
     */
    char       *schemaname;
    /*
     * Relation/sequence name
     */
    char       *relname;
    /*
     * Expand the relation by inheritance? Recursively act on children?
     */
    bool        inh;
    /*
     * See RELPERSISTENCE_* in pg_class.h
     */
    char        relpersistence;
    /*
     * Table alias and optional column aliases
     */
    Alias      *alias;
    /*
     * Token location, or -1 if unknown
     */
    int         location;
} RangeVar;
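How a dotted name and the ONLY keyword end up in these fields can be sketched as follows. This is an illustration, not the PostgreSQL implementation of makeRangeVar or makeRangeVarFromQualifiedName: `SketchRangeVar` and `sketch_make_range_var` are hypothetical, the name is assumed to arrive already split into parts, and the mapping of one, two, or three parts to relname, schemaname, and catalogname is an assumption based on the field comments above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    const char *catalogname;    /* NULL unless three parts were given */
    const char *schemaname;     /* NULL unless two or more parts were given */
    const char *relname;        /* always the last part */
    int         inh;            /* 1 = include child tables, 0 = ONLY */
    int         location;       /* byte offset in the query, or -1 */
} SketchRangeVar;

/*
 * Fill *out from a name split into nparts dotted parts.
 * only is nonzero when the ONLY keyword preceded the name.
 * Returns 0 on success, or -1 if nparts is not 1, 2, or 3.
 */
static int sketch_make_range_var(const char *const *parts, size_t nparts,
                                 int only, int location, SketchRangeVar *out)
{
    assert(parts != NULL && out != NULL);
    out->catalogname = NULL;
    out->schemaname = NULL;
    out->relname = NULL;
    out->inh = !only;           /* plain names and "name *" inherit, ONLY does not */
    out->location = location;

    switch (nparts)
    {
        case 1:                 /* relname */
            out->relname = parts[0];
            break;
        case 2:                 /* schemaname.relname */
            out->schemaname = parts[0];
            out->relname = parts[1];
            break;
        case 3:                 /* catalogname.schemaname.relname */
            out->catalogname = parts[0];
            out->schemaname = parts[1];
            out->relname = parts[2];
            break;
        default:                /* too few or too many dotted names */
            return -1;
    }
    return 0;
}
```

Under these assumptions, `FROM orders` gives a relname with inh set, while `FROM ONLY sales.orders` gives schemaname `sales`, relname `orders`, and inh cleared.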
Keyword lookup function
The function performs the lookup by computing a hash of the input string, which identifies the single candidate entry in the keyword list. The input is then compared with that keyword character by character to check for an exact match.
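int
ScanKeywordLookup(const char *str,
                  const ScanKeywordList *keywords)
{
    size_t      len;
    int         h;
    const char *kw;
    /*
     * Reject immediately if the string is too long to be any keyword. This
     * avoids useless hashing and downcasing work on long strings.
     */
    len = strlen(str);
    if (len > keywords->max_kw_len)
        return -1;
    /*
     * Compute the hash function. We assume it was generated to produce
     * case-insensitive results. Since it is a perfect hash, we only need
     * to match against the specific keyword it identifies.
     */
    h = keywords->hash(str, len);
    /* An out-of-range result means there is no match */
    if (h < 0 || h >= keywords->num_keywords)
        return -1;
    /*
     * Compare character by character to see whether we have a match,
     * applying ASCII-only downcasing to the input characters.
     */
    kw = GetScanKeyword(h, keywords);
    while (*str != '\0')
    {
        char        ch = *str++;
        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        if (ch != *kw++)
            return -1;
    }
    if (*kw != '\0')
        return -1;
    /* Success! */
    return h;
}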
In previous versions, this function used binary search, a technique commonly used to search a sorted list of keys quickly.
In most cases, hash lookup is better suited to searching a large number of keywords, especially when lookup speed is critical. A general hash lookup needs extra processing to handle hash collisions. Here the hash function is a perfect hash, so the only extra step is the final character comparison, which rejects strings that are not keywords. Binary search works on a sorted key list and may be an appropriate choice when keys are rarely inserted or deleted.
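For comparison, a binary search lookup over a sorted keyword list can be sketched as below. This is not the earlier PostgreSQL code: the five-entry table, the constants, and the name `sketch_keyword_bsearch` are assumptions made for illustration. Like ScanKeywordLookup, it returns the index of the keyword or -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Must be sorted in strcmp order for binary search to work. */
static const char *const sketch_sorted_keywords[] = {
    "from", "group", "having", "select", "where"
};

#define SKETCH_NUM_KEYWORDS \
    (sizeof(sketch_sorted_keywords) / sizeof(sketch_sorted_keywords[0]))
#define SKETCH_MAX_KW_LEN 6     /* length of the longest keyword above */

/* Returns the index of str in the keyword table, or -1 if it is not a keyword. */
static int sketch_keyword_bsearch(const char *str)
{
    char    folded[SKETCH_MAX_KW_LEN + 1];
    size_t  len;
    size_t  i;
    int     low = 0;
    int     high = (int) SKETCH_NUM_KEYWORDS - 1;

    assert(str != NULL);

    /* Reject strings that are too long to be any keyword. */
    len = strlen(str);
    if (len > SKETCH_MAX_KW_LEN)
        return -1;

    /* ASCII-only downcasing, so that "SELECT" and "select" both match. */
    for (i = 0; i < len; i++)
    {
        char ch = str[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        folded[i] = ch;
    }
    folded[len] = '\0';

    /* Halve the search range until the keyword is found or the range is empty. */
    while (low <= high)
    {
        int mid = low + (high - low) / 2;
        int cmp = strcmp(folded, sketch_sorted_keywords[mid]);

        if (cmp == 0)
            return mid;
        if (cmp < 0)
            high = mid - 1;
        else
            low = mid + 1;
    }
    return -1;
}
```

Each lookup here costs up to about log2(n) string comparisons, whereas the hash version above computes one hash and performs a single comparison.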