Aims
The second post in this series used the pushMsg output in HiveParser.g, but it did not yet produce an AST (abstract syntax tree), so the result was not practical. Besides obtaining the AST, the second post also left three practical issues open:
- Case sensitivity of tokens: Hive accepts both select and SELECT
- The semicolon problem: a single string containing multiple SQL statements must be parsed correctly
- Parsing rules: Hive accepts statements such as insert-select, but HiveParser.g did not appear to define that case
This post explains how to obtain the AST and then resolves the insert-select issue.
Getting the AST
The code from the previous post was already close to finished. The parser's entry point is a rule method such as parser.statement() (the code below calls parser.statements()). By default, its return value is an instance of an automatically generated class, HiveParser.statement_return in the case of the statement rule. The AST is hidden inside this object: calling getTree() on it returns an object of type CommonTree. The Python code below obtains this CommonTree:
# The classpath must be set before jnius itself is imported.
import jnius_config
jnius_config.set_classpath('./','./grammar/hive110/antlr-3.4-complete.jar')
import jnius
# Load the ANTLR runtime classes and the generated Hive lexer and parser.
StringStream = jnius.autoclass('org.antlr.runtime.ANTLRStringStream')
Lexer = jnius.autoclass('grammar.hive110.HiveLexer')
Parser = jnius.autoclass('grammar.hive110.HiveParser')
TokenStream = jnius.autoclass('org.antlr.runtime.CommonTokenStream')
sql_string = (
"SELECT DISTINCT a1.c1 AS c2,\n"
" a1.c3 AS c4,\n"
" '' c5\n"
" FROM db2.tb2 AS a1 ;\n"
)
# Pipeline: character stream -> lexer -> token stream -> parser.
sqlstream = StringStream(sql_string)
inst = Lexer(sqlstream)
ts = TokenStream(inst)
parser = Parser(ts)
# Run the entry rule, then pull the AST root out of the returned object.
ret = parser.statements()
treeroot = ret.getTree()
Configuration for AST generation
HiveParser.g must contain an options block so that the code ANTLR generates produces an AST as its output. The relevant options are:
options
{
tokenVocab=HiveLexer;
output=AST;
ASTLabelType=CommonTree;
backtrack=false;
k=3;
}
AST traversal and output structure
Traversing the AST requires a look at the API documentation of the CommonTree class.
Every AST node is an instance of CommonTree. The token a node represents is available through its token field, and methods such as getType and getText read the token's properties directly from the node. A node's children are available through the children field or the getChildren method, and its parent through getParent. With these, the whole AST can be walked freely.
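To make this navigation concrete without a JVM, here is a minimal sketch built on a hypothetical pure-Python stand-in, StubTree, which mimics only the four CommonTree methods named above. StubTree and path_to_root are illustrative names that are not part of ANTLR. With pyjnius the same calls would be made on the real Java objects.

```python
class StubTree:
    """Hypothetical stand-in for org.antlr.runtime.tree.CommonTree."""

    def __init__(self, text, type_, children=()):
        self._text = text
        self._type = type_
        self._parent = None
        self._children = list(children)
        for child in self._children:
            child._parent = self

    def getText(self):
        return self._text

    def getType(self):
        return self._type

    def getChildren(self):
        # Assumption: like the real class, a leaf reports no child list.
        return self._children or None

    def getParent(self):
        return self._parent


def path_to_root(node):
    """Collect node texts from `node` up to the root by following getParent()."""
    path = []
    while node is not None:
        path.append(node.getText())
        node = node.getParent()
    return path


leaf = StubTree("db2", 26)
root = StubTree(None, 0, [StubTree("TOK_TABNAME", 863, [leaf, StubTree("tb2", 26)])])
print(path_to_root(leaf))
```

Walking upward with getParent is handy when a leaf such as an identifier has been found and the enclosing clause needs to be identified.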
Below is a simple recursive function that walks the tree depth-first from the root and prints the text and numeric type of every node. It must be run after the setup code that produces treeroot.
def walktree(node, depth=0):
    # Print this node, indented by one space per level of depth.
    print("%s%s=%s" % (" " * depth, node.getText(), node.getType()))
    children = node.children
    if not children:
        return
    # children is a java.util.List, so walk it by index.
    ch_size = children.size()
    for i in range(ch_size):
        ch = children.get(i)
        walktree(ch, depth + 1)

walktree(treeroot, 0)
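The Java type of children is java.util.List, which could not be iterated directly from Python here, so the code loops over it by index. The output is shown below. Note that it contains a TOK_INSERT destination of db1.tb1, so it corresponds to the INSERT OVERWRITE TABLE statement discussed at the end of this post rather than to the plain SELECT in the setup code: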
None=0
 TOK_QUERY=777
  TOK_FROM=681
   TOK_TABREF=864
    TOK_TABNAME=863
     db2=26
     tb2=26
    a1=26
  TOK_INSERT=707
   TOK_DESTINATION=660
    TOK_TAB=835
     TOK_TABNAME=863
      db1=26
      tb1=26
   TOK_SELECTDI=792
    TOK_SELEXPR=793
     .=17
      TOK_TABLE_OR_COL=860
       a1=26
      c1=26
     c2=26
    TOK_SELEXPR=793
     .=17
      TOK_TABLE_OR_COL=860
       a1=26
      c3=26
     c4=26
    TOK_SELEXPR=793
     ''=302
     c5=26
 ;=299
 <EOF>=-1
As you can see, the table names and column names we care about appear in the output as leaf nodes at the bottom of the tree. Their numeric type is 26, Identifier.
The AST root always has numeric type 0 and a null token, which shows up as None in Python.
The node that marks a query has numeric type 777, TOK_QUERY.
From here, our final goal is clearly in sight.
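As a rough illustration of how such a tree could be mined for table names, the sketch below collects the identifiers under every TOK_TABNAME node. It runs without a JVM by using hypothetical stand-ins (StubList, StubNode) that expose only size()/get(i) and getText/getType/getChildren. The helper names, and the nesting of the sample tree, are assumptions based on the printed output. The numeric types 863 and 26 come from that output and are specific to this generated parser.

```python
TOK_TABNAME = 863  # numeric type taken from the printed output; parser-specific


class StubList:
    """Hypothetical stand-in for java.util.List: only size() and get(i)."""

    def __init__(self, items):
        self._items = list(items)

    def size(self):
        return len(self._items)

    def get(self, index):
        return self._items[index]


class StubNode:
    """Hypothetical stand-in for CommonTree: getText/getType/getChildren only."""

    def __init__(self, text, type_, children=()):
        self._text = text
        self._type = type_
        self._children = StubList(children) if children else None

    def getText(self):
        return self._text

    def getType(self):
        return self._type

    def getChildren(self):
        return self._children


def iter_children(node):
    """Yield children by index, the same way walktree loops over them."""
    children = node.getChildren()
    if children is None:
        return
    for i in range(children.size()):
        yield children.get(i)


def collect_table_names(node):
    """Return dotted names (db.table) for every TOK_TABNAME below `node`."""
    if node.getType() == TOK_TABNAME:
        return [".".join(child.getText() for child in iter_children(node))]
    names = []
    for child in iter_children(node):
        names.extend(collect_table_names(child))
    return names


def ident(text):
    return StubNode(text, 26)


# Assumed nesting, trimmed to the parts that contain table names.
sample = StubNode(None, 0, [
    StubNode("TOK_QUERY", 777, [
        StubNode("TOK_FROM", 681, [
            StubNode("TOK_TABREF", 864, [
                StubNode("TOK_TABNAME", 863, [ident("db2"), ident("tb2")]),
                ident("a1"),
            ]),
        ]),
        StubNode("TOK_INSERT", 707, [
            StubNode("TOK_DESTINATION", 660, [
                StubNode("TOK_TAB", 835, [
                    StubNode("TOK_TABNAME", 863, [ident("db1"), ident("tb1")]),
                ]),
            ]),
        ]),
    ]),
])

print(collect_table_names(sample))
```

With pyjnius, collect_table_names could presumably be called on the real tree root unchanged, since it relies only on getType, getText, getChildren, size and get.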
Solving the insert-select problem
The statement used to test insert-select was:
INSERT OVERWRITE db1.tb1 SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5 FROM db2.tb2 a1;
While writing this post I spent a long time trying to understand why insertClause in the .g file is not followed by selectClause, and did not notice that the statement itself was wrong: it was missing the TABLE keyword. The correct statement is:
INSERT OVERWRITE TABLE db1.tb1 SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5 FROM db2.tb2 a1;
Of course, the time spent reading the .g file was not wasted: it also solved the semicolon and case-sensitivity problems. The specific solutions are left for the next post.