lexical analysis

overview

Lexical analysis and grammatical (syntax) analysis are handled by Lex and Yacc; in the postgres source code these are the scan.l and gram.y files. Lex pre-generates scan.c from scan.l, and Yacc pre-generates gram.c from gram.y; together with the supporting C files of the module, they make up the whole lexical-and-syntax-analysis module.
Among them, the file generation and calling relationship required for lexical analysis and syntax analysis is shown in the figure below.
[Figure: file generation and call relationships of the lexical and syntax analysis module]

The lexical analyzer scan.l is responsible for recognizing identifiers, SQL keywords, and so on. For each keyword or identifier it finds, it generates a token and passes it to the parser.

The grammar analyzer gram.y contains a set of grammar rules together with the actions to execute when a rule is matched.

The raw_parser function (in src/backend/parser/parser.c) performs lexical and syntax analysis mainly by calling base_yyparse, the parsing function pre-generated with Lex and Yacc.

Important source files and calling relationships

Source file: description

parser.c: Entry point of lexical and syntax analysis; its main function is raw_parser. After lexical and syntax analysis of the query string, it returns the parse tree.
gram.y: Defines the grammar structure, written in the Yacc language; compiled by Yacc into the gram.c file.
gram.h: Defines the numeric token values of the keywords.
scan.l: Defines the lexical structure, written in the Lex language; compiled by Lex into the scan.c file.
kwlist.h: Defines the keyword list, containing the reserved words and keywords used in the PostgreSQL database system.
kwlookup.h: Declares the keyword-lookup interface for SQL statements (in earlier versions, this file also defined the ScanKeyword structure).
kwlookup.c: Provides the ScanKeywordLookup function, which decides whether an input string is a keyword and, if so, returns its index in the keyword list; lookup uses hashing (earlier versions used binary search).
scansup.c: Provides several functions used in lexical analysis: downcase_truncate_identifier converts uppercase ASCII letters to lowercase, truncate_identifier truncates identifiers that exceed the maximum identifier length, and scanner_isspace tests whether an input character is whitespace.

lexical analyzer

What a lexical analyzer usually does is look for patterns of characters in the input: it matches pieces of the input string with regular expressions and converts them into the corresponding tokens. A regular expression is a concise, clear description of a pattern; when the input matches a rule's regular expression, the corresponding action is executed. In effect, this extracts the language elements a programming language is built from: reserved words, operators, special symbols, and so on.

parser

The task of the parser is to find the relationships between the input tokens. A common representation of those relationships is a parse tree.

Execution of a query statement

Definition of the SELECT statement in gram.y

SelectStmt: select_no_parens			%prec UMINUS
			| select_with_parens		%prec UMINUS
		;

select_with_parens:
			'(' select_no_parens ')'				{ $$ = $2; }
			| '(' select_with_parens ')'			{ $$ = $2; }
		;

select_no_parens:
			simple_select						{ $$ = $1; }
			| select_clause sort_clause
				{
					insertSelectOptions((SelectStmt *) $1, $2, NIL,
										NULL, NULL,
										yyscanner);
					$$ = $1;
				}
			……
			| with_clause select_clause opt_sort_clause select_limit opt_for_locking_clause
				{
					insertSelectOptions((SelectStmt *) $2, $3, $5,
										$4,
										$1,
										yyscanner);
					$$ = $2;
				}
		;

A SELECT statement is represented by SelectStmt and defined as either a SELECT without parentheses (select_no_parens) or a SELECT with parentheses (select_with_parens).
A SELECT statement without parentheses can be a simple SELECT statement (simple_select) or one of several other forms, such as a select_clause followed by a sort_clause. Syntax analysis of the whole statement thus splits it into many small grammatical units and then analyzes each of those units.

Take the simple SELECT form as an example:

simple_select:
			SELECT opt_all_clause opt_target_list
			into_clause from_clause where_clause
			group_clause having_clause window_clause
				{
					SelectStmt *n = makeNode(SelectStmt);

					n->targetList = $3;
					n->intoClause = $4;
					n->fromClause = $5;
					n->whereClause = $6;
					n->groupClause = ($7)->list;
					n->groupDistinct = ($7)->distinct;
					n->havingClause = $8;
					n->windowClause = $9;
					$$ = (Node *) n;
				}
			……
		;

simple_select is the core of the SELECT statement. Its syntax contains the following clauses:

Field: meaning
targetList: target attributes (the select list)
intoClause: SELECT INTO clause
fromClause: FROM clause
whereClause: WHERE clause
groupClause: GROUP BY clause
havingClause: HAVING clause
windowClause: WINDOW clause

In earlier versions, DISTINCT handling (removing duplicate rows) also appeared directly in this rule.

After the simple_select syntax structure is successfully matched, a SelectStmt structure is created:

typedef struct SelectStmt
{
	NodeTag		type;

	/*
	 * These fields are used only in "leaf" SelectStmts.
	 */
	List	   *distinctClause; /* NULL, list of DISTINCT ON expressions, or
								 * lcons(NIL, NIL) for all (SELECT DISTINCT) */
	IntoClause *intoClause;		/* target for SELECT INTO */
	List	   *targetList;		/* the target list (of ResTarget) */
	List	   *fromClause;		/* the FROM clause */
	Node	   *whereClause;	/* WHERE qualification */
	List	   *groupClause;	/* GROUP BY clauses */
	bool		groupDistinct;	/* Is this GROUP BY DISTINCT? */
	Node	   *havingClause;	/* HAVING conditional-expression */
	List	   *windowClause;	/* WINDOW window_name AS (...), ... */

	/*
	 * In a "leaf" node representing a VALUES list, the above fields are all
	 * null, and instead this field is set.  Note that the elements of the
	 * sublists are just expressions, without ResTarget decoration.  Also note
	 * that a list element can be DEFAULT (represented as a SetToDefault
	 * node), regardless of the context of the VALUES list.  It's up to parse
	 * analysis to reject that where not valid.
	 */
	List	   *valuesLists;	/* untransformed list of expression lists */

	/*
	 * These fields are used in both "leaf" SelectStmts and upper-level
	 * SelectStmts.
	 */
	List	   *sortClause;		/* sort clause (a list of SortBy's) */
	Node	   *limitOffset;	/* # of result tuples to skip */
	Node	   *limitCount;		/* # of result tuples to return */
	LimitOption limitOption;	/* limit type */
	List	   *lockingClause;	/* FOR UPDATE (list of LockingClause's) */
	WithClause *withClause;		/* WITH clause */

	/*
	 * These fields are used only in upper-level SelectStmts.
	 */
	SetOperation op;			/* type of set operation */
	bool		all;			/* ALL specified? */
	struct SelectStmt *larg;	/* left child */
	struct SelectStmt *rarg;	/* right child */
	/* Eventually add fields for CORRESPONDING spec here */
} SelectStmt;

It describes every aspect of a SELECT query: the target list, the FROM clause, the WHERE condition, sorting, grouping, and more.

target attribute

The target attributes are the attributes to be queried in the SELECT statement, corresponding to the identifier target_list in the grammar definition. target_list consists of one or more target_el items, and target_el is defined as an aliased expression, a bare expression, '*', and so on.

target_list:
			target_el								{ $$ = list_make1($1); }
			| target_list ',' target_el				{ $$ = lappend($1, $3); }
		;

target_el:	a_expr AS ColLabel
				{
					$$ = makeNode(ResTarget);
					$$->name = $3;
					$$->indirection = NIL;
					$$->val = (Node *) $1;
					$$->location = @1;
				}
			| a_expr BareColLabel
				{
					$$ = makeNode(ResTarget);
					$$->name = $2;
					$$->indirection = NIL;
					$$->val = (Node *) $1;
					$$->location = @1;
				}
			| a_expr
				{
					$$ = makeNode(ResTarget);
					$$->name = NULL;
					$$->indirection = NIL;
					$$->val = (Node *) $1;
					$$->location = @1;
				}
			| '*'
				{
					ColumnRef  *n = makeNode(ColumnRef);

					n->fields = list_make1(makeNode(A_Star));
					n->location = @1;

					$$ = makeNode(ResTarget);
					$$->name = NULL;
					$$->indirection = NIL;
					$$->val = (Node *) n;
					$$->location = @1;
				}
		;

When a target_el is matched, a ResTarget structure is created; it stores all the information about the attribute:

typedef struct ResTarget
{
	NodeTag		type;
	char	   *name;			/* column name, or NULL */
	List	   *indirection;	/* sublist of subscripts, field names,
								 * and '*', or NIL */
	Node	   *val;			/* the value expression to compute or assign */
	int			location;		/* token location, or -1 if unknown */
} ResTarget;

from clause

from_clause consists of the FROM keyword followed by a from_list. The from_list is made up of table_ref identifiers; each table_ref corresponds to one comma-separated item in the FROM clause and represents a table or subquery appearing in FROM.

from_clause:
			FROM from_list							{ $$ = $2; }
			| /*EMPTY*/								{ $$ = NIL; }
		;

from_list:
			table_ref								{ $$ = list_make1($1); }
			| from_list ',' table_ref				{ $$ = lappend($1, $3); }
		;

table_ref:	relation_expr opt_alias_clause
				{
					$1->alias = $2;
					$$ = (Node *) $1;
				}
        ……

The simplest, most basic form of a sub-item (table_ref) in a FROM clause is a relational expression (relation_expr):

relation_expr:
			qualified_name
				{
					/* inheritance query, implicitly */
					$$ = $1;
					$$->inh = true;
					$$->alias = NULL;
				}
			| extended_relation_expr
				{
					$$ = $1;
				}
		;

extended_relation_expr:
			qualified_name '*'
				{
					/* inheritance query, explicitly */
					$$ = $1;
					$$->inh = true;
					$$->alias = NULL;
				}
			| ONLY qualified_name
				{
					/* no inheritance */
					$$ = $2;
					$$->inh = false;
					$$->alias = NULL;
				}
			| ONLY '(' qualified_name ')'
				{
					/* no inheritance, SQL99-style syntax */
					$$ = $3;
					$$->inh = false;
					$$->alias = NULL;
				}
		;

The relational expression relation_expr is defined as a qualified_name, a qualified_name combined with the ONLY keyword, and so on; a qualified_name in turn is ultimately built from a column identifier (ColId), optionally with indirection.

qualified_name:
			ColId
				{
					$$ = makeRangeVar(NULL, $1, @1);
				}
			| ColId indirection
				{
					$$ = makeRangeVarFromQualifiedName($1, $2, @1, yyscanner);
				}
		;

After the qualified name is matched, a RangeVar structure is created to store the relation's information:

typedef struct RangeVar
{
	NodeTag		type;
	char	   *catalogname;	/* the catalog (database) name, or NULL */
	char	   *schemaname;		/* the schema name, or NULL */
	char	   *relname;		/* the relation/sequence name */
	bool		inh;			/* expand rel by inheritance? recursively
								 * act on children? */
	char		relpersistence; /* see RELPERSISTENCE_* in pg_class.h */
	Alias	   *alias;			/* table alias & optional column aliases */
	int			location;		/* token location, or -1 if unknown */
} RangeVar;

keyword lookup function

The code performs the lookup by computing a hash of the input string and using the result as an index into the keyword list. If the index is in range, the input is then compared character by character against that keyword to check for an exact match.

int
ScanKeywordLookup(const char *str,
				  const ScanKeywordList *keywords)
{
	size_t		len;
	int			h;
	const char *kw;

	/*
	 * Reject immediately if the string is too long to be any keyword.  This
	 * saves useless hashing and downcasing work on long strings.
	 */
	len = strlen(str);
	if (len > keywords->max_kw_len)
		return -1;

	/*
	 * Compute the hash function.  We assume it was generated to produce
	 * case-insensitive results.  Since it's a perfect hash, we need only
	 * match to the specific keyword it identifies.
	 */
	h = keywords->hash(str, len);

	/* An out-of-range result means no match */
	if (h < 0 || h >= keywords->num_keywords)
		return -1;

	/*
	 * Compare character-by-character to see if we have a match, applying an
	 * ASCII-only downcasing to the input characters.
	 */
	kw = GetScanKeyword(h, keywords);
	while (*str != '\0')
	{
		char		ch = *str++;

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		if (ch != *kw++)
			return -1;
	}
	if (*kw != '\0')
		return -1;

	/* Success! */
	return h;
}

In earlier versions, this function used a binary search, a technique commonly relied on in performance-critical contexts to search quickly among a large number of keys.

In most cases, hashing is the better fit for looking up among a large number of keywords, especially when lookup speed is critical. A general hash table would need extra logic to handle collisions, but PostgreSQL sidesteps this by generating a perfect hash function at build time, so each keyword maps to a unique slot. Binary search requires a sorted key list and remains an appropriate choice when the list is fixed and rarely changes.


Origin blog.csdn.net/weixin_47895938/article/details/132457599