An article explaining a SQL generation tool

The SQL generation tool can be used to test the Parser's compatibility with other database products. By parsing the productions in the YACC grammar file, it generates the corresponding SQL statements, then executes the SQL against a database and judges from the results whether each statement is compatible with the other database's syntax.

01 Tool usage

Grammar file preprocessing

The purpose of preprocessing is to remove irrelevant content from the grammar file and retain only the productions of each statement. You can obtain the grammar rules (without Actions) by running the command bison -v sql.y, and then remove the useless parts of the generated file, such as the terminal symbol list, the non-terminal symbol list, and the state transition table:

The content of the generated sql.output file is as follows; we retain only its "Grammar" section:

Note: in the retained "Grammar" section, the rule numbers also need to be removed.

 

We encapsulate the above process in the preprocessing script preprocess.sh, so that the processed file meets the tool's requirements. The generated file format is as follows; the output .output file is the preprocessed grammar file.

 

SQL statement generation

After generating a grammar file that meets the requirements, you can use the tool to generate SQL statements. The tool supports the following parameters:

• -b: specifies the grammar file, required. The grammar file is the file produced by the preprocess.sh script.

• -n: specifies the name of the production to generate, required.

• -R: random generation mode, optional; the default is enumeration mode.

• -o: specifies the file in which to save the generated SQL statements, optional; the default is report.csv.

• -N: limits the number of SQL statements generated, optional; no limit by default.
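The parameter handling above can be sketched with Go's standard flag package. This is a hedged illustration only: the flag names come from the list above, while parseArgs, the config struct, and its field names are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// config mirrors the tool's documented flags; the struct itself is a sketch.
type config struct {
	grammarFile string // -b: preprocessed grammar file (required)
	production  string // -n: name of the production to generate (required)
	random      bool   // -R: random mode instead of enumeration
	output      string // -o: output file, default report.csv
	limit       int    // -N: max number of SQL statements, 0 = no limit
}

// parseArgs parses the documented flags from an argument slice and checks
// that the two required flags were supplied.
func parseArgs(args []string) (*config, error) {
	fs := flag.NewFlagSet("sqlgen", flag.ContinueOnError)
	c := &config{}
	fs.StringVar(&c.grammarFile, "b", "", "preprocessed grammar file (required)")
	fs.StringVar(&c.production, "n", "", "production name to generate (required)")
	fs.BoolVar(&c.random, "R", false, "random mode (default: enumeration)")
	fs.StringVar(&c.output, "o", "report.csv", "output file for generated SQL")
	fs.IntVar(&c.limit, "N", 0, "limit on generated statements (0 = unlimited)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if c.grammarFile == "" || c.production == "" {
		return nil, fmt.Errorf("-b and -n are required")
	}
	return c, nil
}

func main() {
	c, err := parseArgs([]string{"-b", "sql.output", "-n", "show_tables_stmt", "-R"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("grammar=%s production=%s random=%v output=%s\n",
		c.grammarFile, c.production, c.random, c.output)
}
```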

 

02 Tool implementation

This tool consists of two packages, yacc_parser and sql_generator, which are responsible for Token parsing and SQL generation respectively.

Representation of productions

type SeqInfo struct {
    Items []string
}
type Production struct {
    Head  string    // production head
    Alter []SeqInfo // production bodies (alternatives)
}
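As a concrete illustration of this representation, the with_comment rule from the example grammar later in the article could be encoded as follows. Only SeqInfo and Production come from the article; the literal itself is a sketch.

```go
package main

import "fmt"

// SeqInfo holds one alternative (body) of a production.
type SeqInfo struct {
	Items []string
}

// Production is a grammar rule: a head and its list of alternatives.
type Production struct {
	Head  string    // production head
	Alter []SeqInfo // production bodies (alternatives)
}

func main() {
	// with_comment: WITH COMMENT | %empty
	p := Production{
		Head: "with_comment",
		Alter: []SeqInfo{
			{Items: []string{"WITH", "COMMENT"}},
			{Items: nil}, // the %empty alternative has no items
		},
	}
	fmt.Println(p.Head, len(p.Alter))
}
```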


Token analysis

The Tokenize function tokenizes the characters read from the grammar file; each call returns one Token. It handles only simple delimiters and quotation marks, and does not implement the regular-expression matching of a standard lexical analyzer.

The Parse function calls Tokenize to obtain one Token at a time, and assembles the sequence of Tokens into Productions based on the current state and the Token type.
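The behavior described above might look roughly like the following sketch. The tokenize function, its delimiter set, and its handling of quoted characters are assumptions for illustration, not the tool's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize is a rough sketch of a Tokenize-style function: it skips
// whitespace, emits ':' '|' and ';' as standalone delimiter tokens, keeps
// single-quoted characters such as '.' as one token, and treats everything
// else as a word.
func tokenize(src string) []string {
	var toks []string
	const stop = " \t\n:|;'"
	i := 0
	for i < len(src) {
		c := src[i]
		switch {
		case c == ' ' || c == '\t' || c == '\n':
			i++ // skip whitespace
		case c == ':' || c == '|' || c == ';':
			toks = append(toks, string(c))
			i++
		case c == '\'':
			j := i + 1
			for j < len(src) && src[j] != '\'' {
				j++
			}
			if j < len(src) {
				j++ // include the closing quote
			}
			toks = append(toks, src[i:j]) // e.g. '.'
			i = j
		default:
			j := i
			for j < len(src) && !strings.ContainsRune(stop, rune(src[j])) {
				j++
			}
			toks = append(toks, src[i:j])
			i = j
		}
	}
	return toks
}

func main() {
	fmt.Println(tokenize("name: IDENT | '.' ;"))
}
```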

 

SQL generation

There are two modes of SQL generation:

1. Traverse the body list of the specified production in Production and enumerate all combinations to generate SQL statements;

2. Randomly select from the body list of the specified production in Production to generate SQL statements.

 

1. Enumeration

Enumeration is implemented with a linked list that holds the Tokens still to be resolved. Each time, a Token is taken from the head of the list and its occurrence count is incremented; then, based on whether the Token's recorded occurrence count in each subexpression exceeds the specified limit, the subexpressions that can still be derived are filtered.

In addition, two arrays record the index of the currently selected subexpression (choice) and the maximum subexpression index (max), so that the next combination can be reached by incrementing choice.

After filtering, the subexpression of the production at position choice is selected and all of its Tokens are inserted at the head of the linked list. The head is then examined: if it is a literal or keyword, it is removed and appended to the SQL array; if not, the loop over the linked list continues.

When processing reaches the end of the current production (detected by choice > max), a "carry" is attempted: the last digit of the recorded position array is incremented.

For example, if the max array is 1 2 1 3 and the choice array is 0 0 0 3, then after the carry the choice array is 0 0 1 0: the last position has been fully traversed, so the second-to-last position is incremented, the last position is reset to zero, and the next permutation is read.
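The carry step works like incrementing the digits of an odometer. A minimal sketch (carry is a hypothetical name for illustration):

```go
package main

import "fmt"

// carry increments the last position of choice; on overflow past max, it
// resets that position to zero and carries into the position to its left.
// It returns false once every combination has been produced.
func carry(choice, max []int) bool {
	for i := len(choice) - 1; i >= 0; i-- {
		choice[i]++
		if choice[i] <= max[i] {
			return true
		}
		choice[i] = 0 // overflowed this position, carry left
	}
	return false
}

func main() {
	// The article's example: max = 1 2 1 3, choice = 0 0 0 3.
	choice := []int{0, 0, 0, 3}
	max := []int{1, 2, 1, 3}
	carry(choice, max)
	fmt.Println(choice)
}
```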

The generation process is implemented recursively. For example, for the following production, the processing logic is as shown in the figure:

show_tables_stmt: SHOW TABLES FROM name '.' name with_comment
                | SHOW TABLES FROM name with_comment
                | SHOW TABLES with_comment

with_comment: WITH COMMENT
            | %empty

name: IDENT

 

According to the recorded choice values, the choice-th subexpression of each production is selected until one SQL statement is generated; then the choice array is carried and the next round of selection begins.
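The recursive enumeration over the example grammar can be sketched as follows. This assumes the grammar is stored as a map from nonterminal to alternatives; expand is a hypothetical helper, and the article's actual implementation uses the linked list and choice/max arrays described above.

```go
package main

import "fmt"

// expand recursively enumerates every sentence derivable from sym.
// Symbols absent from the grammar map are treated as terminals.
func expand(g map[string][][]string, sym string) []string {
	alts, ok := g[sym]
	if !ok {
		return []string{sym}
	}
	var out []string
	for _, alt := range alts {
		parts := []string{""}
		for _, tok := range alt {
			var next []string
			for _, prefix := range parts {
				for _, exp := range expand(g, tok) {
					s := prefix
					if s != "" && exp != "" {
						s += " "
					}
					next = append(next, s+exp)
				}
			}
			parts = next
		}
		out = append(out, parts...)
	}
	return out
}

// exampleGrammar encodes the show_tables_stmt example from the article.
func exampleGrammar() map[string][][]string {
	return map[string][][]string{
		"show_tables_stmt": {
			{"SHOW", "TABLES", "FROM", "name", ".", "name", "with_comment"},
			{"SHOW", "TABLES", "FROM", "name", "with_comment"},
			{"SHOW", "TABLES", "with_comment"},
		},
		"with_comment": {{"WITH", "COMMENT"}, {}}, // second alternative is %empty
		"name":         {{"IDENT"}},
	}
}

func main() {
	for _, sql := range expand(exampleGrammar(), "show_tables_stmt") {
		fmt.Println(sql)
	}
}
```

Each of the three alternatives combines with the two expansions of with_comment, so this grammar yields six statements in total.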

2. Random

The random generation mode is similar to the enumeration mode, except that instead of traversing the production body list sequentially, it randomly selects an alternative each time as part of the SQL.

 


Origin my.oschina.net/u/5148943/blog/11054997