Series: Parsing Hive SQL with Python + ANTLR to obtain data lineage (Part 3)

Goals

The second article in this series used HiveParser.g to produce pushMsg output, but it did not yet obtain an AST (abstract syntax tree), so the result was not practical. Besides obtaining the AST, the end of the second article also listed three practical issues that still need to be addressed:

  1. Token case sensitivity: Hive accepts both select and SELECT
  2. The semicolon problem: the parser must handle a string that contains multiple SQL statements
  3. Parsing rules: statements such as insert-select that Hive accepts but that HiveParser.g did not appear to define

This article first explains how to obtain the AST and then resolves the insert-select issue.

Getting the AST

The code from the previous article was actually almost complete. The parser's entry-point method, parser.statement() (the listing below calls parser.statements()), returns a value whose type is by default an automatically generated class, HiveParser.statement_return. The AST is hidden inside this class: calling getTree() on it yields an object of type CommonTree. The Python code below obtains this CommonTree.

import jnius_config
jnius_config.set_classpath('./','./grammar/hive110/antlr-3.4-complete.jar')
import jnius

StringStream = jnius.autoclass('org.antlr.runtime.ANTLRStringStream')
Lexer  = jnius.autoclass('grammar.hive110.HiveLexer')
Parser  = jnius.autoclass('grammar.hive110.HiveParser')
TokenStream  = jnius.autoclass('org.antlr.runtime.CommonTokenStream')

sql_string = (
    "SELECT DISTINCT a1.c1 AS c2,\n"
    " a1.c3 AS c4,\n"
    " '' c5\n"
    " FROM db2.tb2 AS a1 ;\n"
    )

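# Lex and parse the SQL text, then take the AST from the rule's return object.
sqlstream = StringStream(sql_string)
inst = Lexer(sqlstream)
ts = TokenStream(inst)
parser = Parser(ts)
ret = parser.statements()
treeroot = ret.getTree()  # root of the AST, a CommonTree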

Configuring AST generation

HiveParser.g must contain the following settings in its options block so that the code ANTLR generates outputs an AST. The options block looks like this:

options
{
tokenVocab=HiveLexer;
output=AST;
ASTLabelType=CommonTree;
backtrack=false;
k=3;
}
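
The tokenVocab option ties the parser to the lexer's token vocabulary. As far as I know, ANTLR 3 writes that vocabulary to a .tokens file made of NAME=number lines, which is handy for turning the numeric type codes of AST nodes back into names. The sketch below is only an illustration: load_token_names and the sample text are hypothetical, and the sample values are copied from the output shown later in this article rather than from a real .tokens file.

```python
def load_token_names(tokens_text):
    """Map numeric token types to names, given the text of a .tokens file."""
    names = {}
    for line in tokens_text.splitlines():
        line = line.strip()
        if not line or line.startswith("'"):
            # Skip blank lines and quoted-literal entries such as ';'=299.
            continue
        name, _, number = line.rpartition("=")
        names[int(number)] = name
    return names


# Illustrative sample only; a real vocabulary file has hundreds of entries.
SAMPLE_TOKENS = """\
Identifier=26
TOK_FROM=681
TOK_QUERY=777
TOK_TABNAME=863
"""

token_names = load_token_names(SAMPLE_TOKENS)
print(token_names[777])
```

In real use the text would be read from the generated vocabulary file next to the grammar; the exact file name and location depend on how the grammar was compiled.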

AST structure and traversal

To traverse the AST, consult the API documentation of the CommonTree class.

Every AST node is an instance of CommonTree. Its token field holds the token the node represents, which offers methods such as getType and getText; the same methods are also available directly on the node. The child nodes can be reached through the children field or the getChildren method, and the parent through getParent. With these, you can walk the entire tree freely.
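
As a minimal sketch of this navigation, the helpers below use getChildCount(), getChild(i) and getParent(), which the ANTLR 3 tree interface provides alongside the children field. To keep the example runnable without a JVM, FakeNode is a hypothetical stand-in that mimics only those methods; iter_children and path_to_root are illustrative names, not part of ANTLR or of the original code.

```python
class FakeNode:
    """Hypothetical stand-in for CommonTree, exposing the same method names."""

    def __init__(self, text, type_, children=()):
        self._text = text
        self._type = type_
        self._children = list(children)
        self._parent = None
        for child in self._children:
            child._parent = self

    def getText(self):
        return self._text

    def getType(self):
        return self._type

    def getChildCount(self):
        return len(self._children)

    def getChild(self, i):
        return self._children[i]

    def getParent(self):
        return self._parent


def iter_children(node):
    """Yield the direct children of node by index."""
    for i in range(node.getChildCount()):
        yield node.getChild(i)


def path_to_root(node):
    """Return the node texts from node up to the root, following getParent()."""
    path = []
    while node is not None:
        path.append(node.getText())
        node = node.getParent()
    return path


# A small fragment shaped like the TOK_TABREF subtree of a Hive AST.
leaf = FakeNode("tb2", 26)
tabname = FakeNode("TOK_TABNAME", 863, [FakeNode("db2", 26), leaf])
tabref = FakeNode("TOK_TABREF", 864, [tabname, FakeNode("a1", 26)])

print([child.getText() for child in iter_children(tabname)])
print(path_to_root(leaf))
```

With a real CommonTree obtained through jnius, the two helper functions should work unchanged, since they only rely on the method names listed above.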

Here is a simple recursive function that performs a depth-first traversal starting from the root and prints the text and numeric type code of each node. Append it to the code shown earlier to run it.

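def walktree(node, depth=0):
    # Print this node as text=type, indented by its depth in the tree.
    print("%s%s=%s" % ("  " * depth, node.getText(), node.getType()))
    children = node.children
    if not children:
        # No children: nothing further to visit below this node.
        return
    # Walk the java.util.List by index.
    for i in range(children.size()):
        walktree(children.get(i), depth + 1)

walktree(treeroot, 0)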

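The Java type of children is java.util.List, which could not be iterated directly from Python, so the code walks it by index. The output is shown below; it includes a TOK_TAB destination for db1.tb1, so it appears to correspond to the INSERT OVERWRITE TABLE statement from the last section of this article rather than to the plain SELECT in the listing above: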

None=0
  TOK_QUERY=777
    TOK_FROM=681
      TOK_TABREF=864
        TOK_TABNAME=863
          db2=26
          tb2=26
        a1=26
    TOK_INSERT=707
      TOK_DESTINATION=660
        TOK_TAB=835
          TOK_TABNAME=863
            db1=26
            tb1=26
      TOK_SELECTDI=792
        TOK_SELEXPR=793
          .=17
            TOK_TABLE_OR_COL=860
              a1=26
            c1=26
          c2=26
        TOK_SELEXPR=793
          .=17
            TOK_TABLE_OR_COL=860
              a1=26
            c3=26
          c4=26
        TOK_SELEXPR=793
          ''=302
          c5=26
  ;=299
  <EOF>=-1

As you can see, the table names and column names we are interested in appear as leaf nodes at the bottom of the tree; their numeric type is 26, Identifier.
The AST root always has numeric type 0 and a null token, which becomes None in Python.
The query itself is marked by a node of numeric type 777, TOK_QUERY.
From here, the remaining distance to our ultimate goal is clearly visible.
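
To make that remaining step concrete, here is a rough sketch of how table names might be collected from such a tree, treating TOK_TABNAME under TOK_TABREF as a source table and TOK_TABNAME under TOK_TAB as a target table, as in the output above. This is an illustration under assumptions, not the series' final implementation: FakeNode, build, qualified_name and collect_tables are hypothetical names, and FakeNode only mimics the getText/getChildCount/getChild methods of CommonTree so the example runs without a JVM.

```python
class FakeNode:
    """Hypothetical stand-in for CommonTree with the same method names."""

    def __init__(self, text, children=()):
        self._text = text
        self._children = list(children)

    def getText(self):
        return self._text

    def getChildCount(self):
        return len(self._children)

    def getChild(self, i):
        return self._children[i]


def build(spec):
    """Build a FakeNode tree from nested tuples: (text, child_spec, ...)."""
    text, *kids = spec
    return FakeNode(text, [build(kid) for kid in kids])


def qualified_name(tabname_node):
    """Join the identifier children of a TOK_TABNAME node with dots."""
    count = tabname_node.getChildCount()
    return ".".join(tabname_node.getChild(i).getText() for i in range(count))


def collect_tables(node, role=None, found=None):
    """Return {'source': [...], 'target': [...]} for the tree under node."""
    if found is None:
        found = {"source": [], "target": []}
    text = node.getText()
    if text == "TOK_TABREF":
        role = "source"   # tables read in the FROM clause
    elif text == "TOK_TAB":
        role = "target"   # table written by the INSERT clause
    if text == "TOK_TABNAME" and role is not None:
        found[role].append(qualified_name(node))
        return found
    for i in range(node.getChildCount()):
        collect_tables(node.getChild(i), role, found)
    return found


# A trimmed copy of the tree printed above (select expressions omitted).
root = build(
    (None,
     ("TOK_QUERY",
      ("TOK_FROM",
       ("TOK_TABREF", ("TOK_TABNAME", ("db2",), ("tb2",)), ("a1",))),
      ("TOK_INSERT",
       ("TOK_DESTINATION",
        ("TOK_TAB", ("TOK_TABNAME", ("db1",), ("tb1",)))))))
)

print(collect_tables(root))
```

The sketch matches on node text rather than on the numeric type codes, because those codes depend on the grammar version that generated the parser.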

Solving the insert-select problem

First, the insert-select statement used for testing:

INSERT OVERWRITE db1.tb1 SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5 FROM db2.tb2 a1;

While writing the previous article, I was mostly trying to understand why insertClause in the .g file is not followed by selectClause, and did not notice that the statement itself was wrong: it was missing the TABLE keyword. The correct statement is:

INSERT OVERWRITE TABLE db1.tb1 SELECT DISTINCT a1.c1 c2, a1.c3 c4, '' c5 FROM db2.tb2 a1;

Of course, the time spent reading the .g file was not wasted: it also led to solutions for the semicolon and case-sensitivity problems, which are covered in the next article.
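
For the semicolon problem mentioned above, one possible workaround is to split the input into single statements before handing each one to the parser. The sketch below is only an assumption about how that could be done in plain Python, and it may well differ from the approach the next article takes. split_statements is a hypothetical helper; it ignores semicolons inside single- or double-quoted strings but does not handle SQL comments or backtick-quoted identifiers.

```python
def split_statements(sql):
    """Split sql on semicolons that are outside quoted string literals."""
    statements, current, quote = [], [], None
    i = 0
    while i < len(sql):
        ch = sql[i]
        if quote:
            current.append(ch)
            if ch == "\\" and i + 1 < len(sql):
                # Keep a backslash-escaped character inside the literal.
                current.append(sql[i + 1])
                i += 1
            elif ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
            current.append(ch)
        elif ch == ";":
            # Statement boundary: store the text collected so far.
            stmt = "".join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


script = "SELECT ';' c5 FROM db2.tb2 a1;\nSELECT c1 FROM db2.tb3 ;"
for statement in split_statements(script):
    print(statement)
```

Each returned string could then be passed to the lexer and parser separately, at the cost of losing the original character offsets of the later statements.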

Origin blog.csdn.net/bigdataolddriver/article/details/103935723