```bash
make install -j

cd ~/postgresql-9.1.3
make -j
make install -j

# initialize a database cluster
$HOME/pgsql/bin/initdb -D $HOME/pgsql/data --locale=C

# start the server (runs in the foreground; use another terminal for the commands below)
$HOME/pgsql/bin/postgres -D $HOME/pgsql/data

# create the database and load the sample data
$HOME/pgsql/bin/psql postgres -c 'CREATE DATABASE similarity;'
$HOME/pgsql/bin/psql -d similarity -f ./similarity_data.sql

# run the three similarity queries and save the results
$HOME/pgsql/bin/psql similarity -c "SELECT ra.address, ap.address, ra.name, ap.phone FROM restaurantaddress ra, addressphone ap WHERE levenshtein_distance(ra.address, ap.address) < 4 AND (ap.address LIKE '%Berkeley%' OR ap.address LIKE '%Oakland%') ORDER BY 1, 2, 3, 4;" > ../levenshtein.txt

$HOME/pgsql/bin/psql similarity -c "SELECT rp.phone, ap.phone, rp.name, ap.address FROM restaurantphone rp, addressphone ap WHERE jaccard_index(rp.phone, ap.phone) > .6 AND (ap.address LIKE '%Berkeley%' OR ap.address LIKE '%Oakland%') ORDER BY 1, 2, 3, 4;" > ../jaccard.txt

$HOME/pgsql/bin/psql similarity -c "SELECT ra.name, rp.name, ra.address, ap.address, rp.phone, ap.phone FROM restaurantphone rp, restaurantaddress ra, addressphone ap WHERE jaccard_index(rp.phone, ap.phone) >= .55 AND levenshtein_distance(rp.name, ra.name) <= 5 AND jaccard_index(ra.address, ap.address) >= .6 AND (ap.address LIKE '%Berkeley%' OR ap.address LIKE '%Oakland%') ORDER BY 1, 2, 3, 4, 5, 6;" > ../combined.txt

# stop the server
$HOME/pgsql/bin/pg_ctl -D $HOME/pgsql/data stop
```
### 2.1 Query Processing

A query statement is processed in four stages by `exec_simple_query()` in `postgres.c`.

1. Convert the SQL query string entered by the user into the raw parse tree list `raw_parsetree_list` by calling `pg_parse_query()`:

```c
parsetree_list = pg_parse_query(query_string);
```

It returns a list of raw parse trees.
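
For reference, the caller then walks the returned list with PostgreSQL's usual `List` iteration idiom; this is a simplified sketch, not the literal `exec_simple_query()` code (which also manages per-statement memory contexts, snapshots, and logging):

```c
ListCell   *parsetree_item;

foreach(parsetree_item, parsetree_list)
{
    Node   *parsetree = (Node *) lfirst(parsetree_item);

    /* analyze/rewrite, plan, and execute this statement (steps 2-4 below) */
}
```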

2. Perform semantic analysis and query rewriting to generate `querytree_list`:

```c
List *
pg_analyze_and_rewrite(Node *parsetree, const char *query_string,
                       Oid *paramTypes, int numParams)

stmt_list = pg_analyze_and_rewrite(parsetree,
                                   sql,
                                   NULL,
                                   0);
```

This step converts the parse tree into a query tree; during rewriting, a single `select` statement may be split into multiple query trees.

3. Generate and optimize the query plan via `pg_plan_queries()`.

```c
List *
pg_plan_queries(List *querytrees, int cursorOptions, ParamListInfo boundParams)
    
stmt_list = pg_plan_queries(stmt_list, 0, NULL);
```

A query plan is generated for each query tree. Inside, the optimizer is invoked:

```c
/* call the optimizer */
plan = planner(querytree, cursorOptions, boundParams);
```

The planner estimates the cost of the possible paths from the statistics of the tables and indexes involved, and finally selects the optimal one.

4. Execute the query.

```c
if (IsA(stmt, PlannedStmt) &&
    ((PlannedStmt *) stmt)->utilityStmt == NULL)
{
    QueryDesc  *qdesc;

    qdesc = CreateQueryDesc((PlannedStmt *) stmt,
                            sql,
                            GetActiveSnapshot(), NULL,
                            dest, NULL, 0);

    ExecutorStart(qdesc, 0);
    ExecutorRun(qdesc, ForwardScanDirection, 0);
    ExecutorFinish(qdesc);
    ExecutorEnd(qdesc);

    FreeQueryDesc(qdesc);
}
else
{
    ProcessUtility(stmt,
                   sql,
                   NULL,
                   false,   /* not top level */
                   dest,
                   NULL);
}
```

### 2.2 Source Code Analysis

1. `src/backend/utils/fmgr/funcapi.c`

Contains the implementation of the target functions (here, `levenshtein_distance` and `jaccard_index`); a sketch of what such a function can look like appears after this list.

2. `src/include/catalog/pg_proc.h`

Registers the target functions in the system catalog.

3. `src/backend/executor/execMain.c`

Implements the executor entry points used in step 4 of Section 2.1:

```c
ExecutorStart(qdesc, 0);
ExecutorRun(qdesc, ForwardScanDirection, 0);
ExecutorFinish(qdesc);
ExecutorEnd(qdesc);
```

4. `src/backend/executor/execScan.c`

Scans the tuples of a relation and returns the qualifying tuples.

5. `src/backend/executor/execProcnode.c`

Dispatches by node type to initialize the node, fetch tuples, and clean up.

6. `src/backend/executor/execTuples.c`

Manages tuple slots and tuple-related resources, such as temporary tuple memory.
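
As a concrete illustration of items 1 and 2, here is a minimal sketch of what a backend built-in such as `levenshtein_distance` could look like, assuming two `text` arguments and an `int4` result. The dynamic-programming implementation and placement are my assumptions, not the reference solution, and the matching `DATA(insert ...)` registration line required in `pg_proc.h` is omitted.

```c
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"     /* text_to_cstring() */

/* Hypothetical built-in: edit distance between two text values.
 * Built-ins use the V1 calling convention and are registered through a
 * DATA(insert ...) line in pg_proc.h (not shown here). */
Datum
levenshtein_distance(PG_FUNCTION_ARGS)
{
    char   *s = text_to_cstring(PG_GETARG_TEXT_PP(0));
    char   *t = text_to_cstring(PG_GETARG_TEXT_PP(1));
    int     m = strlen(s);
    int     n = strlen(t);
    int    *prev = (int *) palloc((n + 1) * sizeof(int));
    int    *curr = (int *) palloc((n + 1) * sizeof(int));
    int     i, j;

    for (j = 0; j <= n; j++)
        prev[j] = j;                        /* distance from the empty string */

    for (i = 1; i <= m; i++)
    {
        curr[0] = i;
        for (j = 1; j <= n; j++)
        {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            int del  = prev[j] + 1;          /* deletion */
            int ins  = curr[j - 1] + 1;      /* insertion */
            int sub  = prev[j - 1] + cost;   /* substitution */
            int best = del < ins ? del : ins;

            curr[j] = best < sub ? best : sub;
        }
        memcpy(prev, curr, (n + 1) * sizeof(int));
    }

    PG_RETURN_INT32(prev[n]);
}
```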


### Overview

In this experiment, one connection corresponds to one backend process. The job of this process is to accept the SQL statement sent by the client and return the query result to the client. After the SQL enters the backend process:

1. The parser generates the parse tree and the query tree; syntax checking is performed here.
2. The rewriter then rewrites the tree according to certain rules, for example rewriting a query over a view into a query over its base tables.
3. The optimizer generates a plan tree; it is responsible for estimating the cost of the candidate strategies and recording the best execution method in the plan tree.
4. Finally, the executor accesses the tables (and indexes) in the way specified by the plan tree, executes the query (including evaluating constraints), and returns the result rows to the client.

It is worth mentioning that the planner examines the different possible join methods, such as nested loop join, merge join, and hash join, to find the one with the least cost.
### Specific Analysis and Example

We take `levenshtein_distance`, i.e., the table-less query, as the example. Path optimization for multi-table queries even involves a genetic algorithm, which is skipped this time due to space limitations.

#### PostgresMain

The backend calls `PostgresMain` first and, after completing initialization (time zone and so on), enters the command-reading state. The `ReadCommand` function reads the query, stored as a string, and returns a message-type identifier such as 'Q', 'P', or 'B'; for this example 'Q' is returned. Afterwards the query is dispatched through a switch-case: the length of the query is checked for validity, the encoding is converted (`pg_client_to_server`), and finally the data is passed to the `exec_simple_query` function for execution.
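
A simplified sketch of that dispatch loop in `PostgresMain()` (`src/backend/tcop/postgres.c`), with setup, error recovery, and the other message types omitted:

```c
StringInfoData input_message;
int            firstchar;

for (;;)
{
    initStringInfo(&input_message);

    firstchar = ReadCommand(&input_message);

    switch (firstchar)
    {
        case 'Q':               /* simple query */
            {
                const char *query_string = pq_getmsgstring(&input_message);

                exec_simple_query(query_string);
            }
            break;

            /* 'P' (parse), 'B' (bind), ... belong to the extended protocol */
    }
}
```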

#### exec_simple_query

The statement being processed is:

```sql
select levenshtein_distance('apply', 'apple');
```

#### pg_parse_query

The query first enters `pg_parse_query()` for processing; this function returns `parsetree_list`, the initial form of the parse tree. The main function inside it is `raw_parser()`, where the scanner together with the Bison rules/actions defined in `gram.y` performs the grammatical analysis. Note that the path of `gram.y` is `src/backend/parser/gram.y`.

```c
/* scanner, src/backend/parser/parser.c line 52 */
yyresult = base_yyparse(yyscanner);
```

The concrete tree is defined in `src/include/nodes/parsenodes.h`. The root node of the tree has a different data structure per statement type: a `SELECT` statement corresponds to `SelectStmt`, an `UPDATE` statement corresponds to `UpdateStmt`, and so on. Looking at the `SelectStmt` structure, part of it is quite self-explanatory:

```c
List       *distinctClause;   /* (SELECT DISTINCT) expression list or NULL */
IntoClause *intoClause;       /* SELECT INTO / CREATE TABLE AS's target table */
List       *targetList;       /* pointer to the query result list */
/* below are the pointers to the query clauses */
List       *fromClause;       /* FROM */
Node       *whereClause;      /* WHERE */
List       *groupClause;      /* GROUP BY */
Node       *havingClause;     /* HAVING */
List       *windowClause;     /* WINDOW */
WithClause *withClause;       /* WITH */
```

In addition, there are pointers to left and right nodes and other members, which are not described in detail here for space reasons. Another thing to note is that `pg_parse_query()` returns a list of parse trees; the caller then enters a for loop and takes each tree out of the list for analysis, rewriting, and execution.

#### pg_analyze_and_rewrite

This function accepts a parse tree and returns a list of `Query` structures. The `Query` structure is the data structure of the root node of a query tree, which helps explain why the return type of `parse_analyze()` is `Query *`. The root node of a query tree includes the command type `commandType` (select, insert, update, delete, utility), `hasSubLinks` (whether subqueries are present), `hasAggs` (whether there is an aggregate function), the head pointer of the result list `targetList`, and much other information. From my analysis, the `rtable` pointer points to the list of tables used in the query, and the `jointree` pointer points to the FROM-list and WHERE-clause information expressed as a tree.

Inside, `parse_analyze()` calls the appropriate `transform$Stmt` function according to the type of SQL statement, where `$` stands for statement identifiers such as Select or Update. The return value of these functions is also of type `Query *`; after some simple processing it becomes the return value of `parse_analyze()`, indicating that a query tree has been constructed.

`pg_rewrite_query()` takes a query tree and returns a list of query trees. The list is needed because one of the jobs of the rewriter is to convert a view into a query over the base tables; in this way, one `Query` may generate further `Query` nodes, so a list has to be used.
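
For reference, an abridged view of the `Query` node described above (only the fields mentioned here are shown; see `src/include/nodes/parsenodes.h` for the authoritative definition in this version):

```c
typedef struct Query
{
    NodeTag     type;
    CmdType     commandType;    /* select | insert | update | delete | utility */
    /* ... */
    bool        hasAggs;        /* has aggregates in tlist or havingQual */
    bool        hasSubLinks;    /* has subquery SubLink */
    /* ... */
    List       *rtable;         /* list of range table entries (tables used) */
    FromExpr   *jointree;       /* table join tree (FROM and WHERE clauses) */
    List       *targetList;     /* target list (result expressions) */
    /* ... many more fields ... */
} Query;
```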

The overall code framework of `pg_analyze_and_rewrite()` is:

```c
/* Analysis and rewriting code framework: pg_analyze_and_rewrite() */

/* (1) Perform parse analysis. */
if (log_parser_stats)
    ResetUsage();

query = parse_analyze(parsetree, query_string, paramTypes, numParams);

if (log_parser_stats)
    ShowUsage("PARSE ANALYSIS STATISTICS");

/* (2) Rewrite the queries, as necessary. */
querytree_list = pg_rewrite_query(query);
```

#### pg_plan_queries

This function accepts the list of rewritten query trees and returns `plantree_list` (also a list). Inside the function, `querytree_list` is looped over and its elements are processed one by one by `pg_plan_query()`. The return value of that function, `PlannedStmt *`, is a pointer to the root node of the plan tree; this `PlannedStmt` structure stores a great deal of information needed by the executor.

The processing is roughly as follows:

1. Some pointer setup.
2. Preprocessing: `preprocess_expression()` and `preprocess_qual_conditions()`, the `reduce_outer_joins()` function, the code block that converts HAVING conditions into WHERE, and so on. This stage is easy to spot: starting around line 422 of `planner.c`, `preprocess_expression()` is called repeatedly, the targetList is simplified and optimized in advance, outer joins are reduced to inner joins where possible, etc.
3. The main planning proper. When creating an access path for a table, the planner calculates the cost of sequential scans, index scans, and bitmap scans. The access path and cost of each scan method are saved and compared at the end, and the cheapest result is stored in a new `RelOptInfo` structure. The time cost of the ORDER BY clause is also estimated. Finally, the plan tree is generated from the saved minimum-cost path.

The above is based on analyzing the internals of the `subquery_planner()` function called by `standard_planner()` (which `planner()` invokes by default when no custom planner is specified). Since our example does not access any table, the target function is already evaluated while the targetList is optimized in the preprocessing stage. The specific calling path is as follows:

```
preprocess_expression() => eval_const_expressions() =>
    eval_const_expressions_mutator() =>
    /* enter the per-node iteration; note the function names are now singular */
    expression_tree_mutator() => simplify_function() => evaluate_function() =>
    evaluate_expr() => ExecEvalExprSwitchContext() => ExecEvalFunc() =>
    ExecMakeFunctionResult() =>
    /* FunctionCallInvoke(fcinfo) is a macro */
    FunctionCallInvoke(fcinfo) => ... finally reaching levenshtein_distance()
```

After this preprocessing, the return value of the function has already been computed; it is put back into the expression tree and passed on to the main planning stage.

As for cost estimation, it is performed by the `query_planner()` function called by `subquery_planner()`. Starting from the call to `make_one_rel()`, a formal `RelOptInfo` structure is created to store the cost and access-path data; this is the part of `query_planner()` that deserves the most attention. There are also functions that set specific cost constants; they make it possible to manually help PostgreSQL estimate the query cost more accurately, which I personally think is quite important.

Multi-table queries use dynamic programming, or a genetic algorithm when there are too many tables, to optimize the join search.
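
For context, the cost constants mentioned above are ordinary GUC-backed globals in `costsize.c`; the defaults shown below are the usual ones for this generation of PostgreSQL:

```c
/* Planner cost constants (src/backend/optimizer/path/costsize.c).
 * They are exposed as GUC parameters so the administrator can tune the
 * cost model to the actual hardware. */
double      seq_page_cost = DEFAULT_SEQ_PAGE_COST;                 /* 1.0 */
double      random_page_cost = DEFAULT_RANDOM_PAGE_COST;           /* 4.0 */
double      cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;               /* 0.01 */
double      cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;   /* 0.005 */
double      cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;         /* 0.0025 */
```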

#### Execution

The main functions are `PortalDefineQuery`, `PortalStart`, and `PortalRun`. After the portal is started, the executor processes the nodes of the plan tree bottom-up, calling the corresponding processing function for each node. An example of such a processing function is `ExecIndexScan`, which performs an index scan; its only input parameter is an `IndexScanState *` pointing to the node. Inside this function, `ExecScan()` is called to complete the scan.
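
For illustration, `ExecIndexScan` in this version looks roughly like the following (simplified from `src/backend/executor/nodeIndexscan.c`):

```c
TupleTableSlot *
ExecIndexScan(IndexScanState *node)
{
    /* If runtime keys exist and have not been computed yet, do it now. */
    if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
        ExecReScan((PlanState *) node);

    /* Delegate the generic scan loop to ExecScan(), passing the
     * index-specific "fetch next tuple" and "recheck" callbacks. */
    return ExecScan(&node->ss,
                    (ExecScanAccessMtd) IndexNext,
                    (ExecScanRecheckMtd) IndexRecheck);
}
```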
