Implementing a Regular Expression Engine in Python (Part 1)

Foreword

Project repository: Regex in Python

After slacking off at school for a few weeks, I spent the last few days reinventing the wheel and writing a regular expression engine in Python. This post records and shares the process.

Goals

It implements all the basic regex syntax, for example:

st = 'AS342abcdefg234aaaaabccccczczxczcasdzxc'
pattern = '([A-Z]+[0-9]*abcdefg)([0-9]*)(\*?|a+)(zx|bc*)([a-z]+|[0-9]*)(asd|fgh)(zxc)'

regex = Regex(st, pattern)
result = regex.match()
log(result)

More examples can be found on GitHub.

Prerequisites

In fact, a regular expression engine can be seen as a small compiler. You can approach it with the same thinking used earlier to write a C compiler, except it is nowhere near as complicated. The stages (sketched in code right after this list) are:

  1. First, lexical analysis
  2. Parsing (top-down here)
  3. Semantic analysis (regular expressions are semantically so weak that this part can be omitted; code generation can work directly off the AST)
  4. Code generation, which here means generating the NFA
  5. NFA-to-DFA conversion (the relevant background is regular languages and finite state machines)
  6. DFA minimization
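
Chained together, the whole engine is just these stages run in order. Here is a rough sketch with assumed names: pattern is this post's real NFA entry point shown later, while nfa_to_dfa and minimize are hypothetical stand-ins for the stages covered in later posts.

def compile_regex(pattern_string):
    nfa_start = pattern(pattern_string)  # stages 1-4: lex, parse, build the NFA
    dfa = nfa_to_dfa(nfa_start)          # stage 5: subset construction
    return minimize(dfa)                 # stage 6: DFA minimization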

NFA and DFA

A finite state machine can be viewed as a directed graph with a number of nodes; each node can jump to the next node according to the input character. The difference between an NFA (nondeterministic finite automaton) and a DFA (deterministic finite automaton) is that in a DFA the next state of every jump is uniquely determined.

A finite automaton starts in its initial state and reads the input string character by character, using its transition function to determine the next state from the current state and the current input character. In an NFA, however, the next state is not uniquely determined: a transition only guarantees a set of possible next states, and which state is actually taken can only be pinned down by later input. If the automaton is in an accepting state when it finishes reading, the input string is accepted by the NFA.
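
This "set of states" idea is easy to make concrete. Below is a minimal simulation sketch (not this project's code), assuming transitions is a dict mapping (state, character) to a set of next states, with None as the ε label:

EPSILON_EDGE = None  # label for epsilon transitions in this sketch

def epsilon_closure(states, transitions):
    # All states reachable from `states` through epsilon edges alone
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in transitions.get((s, EPSILON_EDGE), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(string, transitions, start, accepting):
    # Track the full set of possible states while reading the input
    current = epsilon_closure({start}, transitions)
    for ch in string:
        moved = set()
        for s in current:
            moved |= transitions.get((s, ch), set())
        current = epsilon_closure(moved, transitions)
    return bool(current & accepting)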

Finally, every NFA can be converted into an equivalent DFA.

NFA: construction O(n), matching O(nm)

DFA: construction O(2^n), minimization O(kn' log n') (where n' = O(2^n)), matching O(m)

Here n is the length of the regex, m the length of the input string, k the alphabet size, and n' the size of the original (unminimized) DFA.

The set of all strings accepted by an NFA is the language accepted by that NFA, and this language is a regular language.

Example

For the regular expression [0-9]*[A-Z]+, the corresponding NFA is built by connecting node 3 and node 4 of the two NFAs below:

[Figure: NFA for [0-9]*]
[Figure: NFA for [A-Z]+]
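
In terms of the simulation sketch above, the combined machine can be written out like this (the state numbers here are arbitrary, not the project's numbering):

# NFA for [0-9]*[A-Z]+: state 0 loops on digits, an epsilon edge enters
# state 1, and uppercase letters drive 1 -> 2 and then loop on the
# accepting state 2
transitions = {(0, EPSILON_EDGE): {1}}
for d in '0123456789':
    transitions[(0, d)] = {0}
for u in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
    transitions[(1, u)] = {2}
    transitions[(2, u)] = {2}

print(nfa_accepts('123AB', transitions, 0, {2}))  # True
print(nfa_accepts('123', transitions, 0, {2}))    # False: needs an uppercase letter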

Lexical analysis

That is actually all you need to know about NFAs and DFAs, plus a few related algorithms that will be covered later. With that, we can move on to the lexical analysis part.

This lexer is much simpler than the parser in the earlier C compiler; it only has to handle a few cases:

  1. Ordinary characters
  2. Characters that carry semantics (metacharacters)
  3. Escaped characters

token

There isn't much to say about the tokens; they correspond directly to the regex syntax:

Tokens = {
    '.': Token.ANY,
    '^': Token.AT_BOL,
    '$': Token.AT_EOL,
    ']': Token.CCL_END,
    '[': Token.CCL_START,
    '}': Token.CLOSE_CURLY,
    ')': Token.CLOSE_PAREN,
    '*': Token.CLOSURE,
    '-': Token.DASH,
    '{': Token.OPEN_CURLY,
    '(': Token.OPEN_PAREN,
    '?': Token.OPTIONAL,
    '|': Token.OR,
    '+': Token.PLUS_CLOSE,
}

advance

advance is the central function of the lexer: it returns the Token type of the current input character.

def advance(self):
    pos = self.pos
    pattern = self.pattern
    if pos > len(pattern) - 1:
        self.current_token = Token.EOS
        return Token.EOS

    text = self.lexeme = pattern[pos]
    if text == '\\':
        self.isescape = not self.isescape
        self.pos = self.pos + 1
        self.current_token = self.handle_escape()
    else:
        self.current_token = self.handle_semantic_l(text)

    return self.current_token

The main logic of advance is to read the current character and then decide whether it is an escape character or something else.
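
As a usage sketch (assuming advance consumes one character per call, which the parser below relies on):

lexer = Lexer('a*')
while lexer.advance() != Token.EOS:
    # prints Token.L with lexeme 'a', then Token.CLOSURE with lexeme '*'
    print(lexer.current_token, lexer.lexeme)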

handle_escape handles escape characters. An escaped character ultimately still comes back as the ordinary character type; the function's main job is to record the character after escaping and assign it to lexeme, for use later when constructing the automaton.

handle_semantic_l is only two lines: it looks the character up in the table that holds all the semantically meaningful characters, and if the lookup fails it simply returns the ordinary character type.
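
Neither helper appears in this excerpt, so here is a hedged sketch of both based on the descriptions above; the escape table and the position bookkeeping are assumptions, and the real project may differ. Token.L is the ordinary-character type used throughout.

def handle_semantic_l(self, text):
    # Look the character up in the Tokens table; anything not found
    # there is just an ordinary character
    self.pos = self.pos + 1
    return Tokens.get(text, Token.L)

def handle_escape(self):
    # advance() has already moved pos past the backslash; record the
    # unescaped character in lexeme for the NFA construction to use
    escapes = {'n': '\n', 't': '\t', 'r': '\r'}  # assumed escape table
    ch = self.pattern[self.pos]
    self.lexeme = escapes.get(ch, ch)
    self.pos = self.pos + 1
    return Token.L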

I won't paste the complete code here; it's all on GitHub.

Constructing the NFA

The main files for constructing the NFA are in the nfa package: nfa.py defines the NFA node, and construction.py implements the construction itself.

NFA node definition

The NFA node definition is also very simple. In fact, the complete implementation of this regex engine is only about 900 lines, and each part is quite simple when looked at on its own.

  • edge and input_set together describe a node's outgoing edges; an edge can have four possible attributes:

    • the node has two outgoing ε-edges
      edge = EPSILON = -1
    • the edge matches a character class
      edge = CCL = -2
      input_set = the corresponding character set
    • a single ε-edge
      edge = EMPTY = -3
    • the edge matches a single input character c
      edge = c
  • status_num gives every node a unique identifier
  • visited exists for debugging, to traverse the NFA

# Edge-type constants referenced above; ASCII_COUNT is the alphabet
# size (128 is an assumption, the project may use a different bound)
EPSILON = -1
CCL = -2
EMPTY = -3
ASCII_COUNT = 128

class Nfa(object):
    STATUS_NUM = 0

    def __init__(self):
        self.edge = EPSILON
        self.next_1 = None
        self.next_2 = None
        self.visited = False
        self.input_set = set()
        self.set_status_num()

    def set_status_num(self):
        # Assign each node a unique, monotonically increasing id
        self.status_num = Nfa.STATUS_NUM
        Nfa.STATUS_NUM = Nfa.STATUS_NUM + 1

    def set_input_set(self):
        # Fill input_set with the whole alphabet (used by '.')
        self.input_set = set()
        for i in range(ASCII_COUNT):
            self.input_set.add(chr(i))
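
A quick sanity check of this bookkeeping (a usage sketch, assuming the constants defined above):

n1, n2 = Nfa(), Nfa()
print(n1.status_num, n2.status_num)  # two consecutive, unique ids
n1.set_input_set()
print(len(n1.input_set))             # size of the whole alphabet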

Constructing simple nodes

Node construction lives under nfa.construction. To simplify the code, the Lexer is made a global variable shared by all functions.

The BNF grammar for regular expressions is given below; with it we can parse top-down, recursing downward from the topmost group:

group ::= ("(" expr ")")*
expr ::= factor_conn ("|" factor_conn)*
factor_conn ::= factor | factor factor*
factor ::= (term | term ("*" | "+" | "?"))*
term ::= char | "[" char "-" char "]" | .

BNF came up before, when writing the C compiler: 从零写一个编译器(二) (Writing a Compiler from Scratch, Part 2).

Main entry point

The main logic is very simple: initialize the global lexer, then pass in an NfaPair and start recursively creating nodes from the top:

def pattern(pattern_string):
    global lexer
    lexer = Lexer(pattern_string)
    lexer.advance()
    nfa_pair = NfaPair()
    group(nfa_pair)

    return nfa_pair.start_node
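
For example, compiling a pattern into an NFA and keeping its start node (using the grouped style from the Goals example):

start_node = pattern('([0-9]*)([A-Z]+)')
print(start_node.status_num)  # the start node's unique id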

term

Although the parsing is top-down, it is easier to understand bottom-up. term is the bottom-most construction: it builds the nodes for single characters, character classes, and the . metacharacter.

term ::= char | "[" char "-" char "]" | .

The main logic of term is to decide which kind of node to build from the current input character.

def term(pair_out):
    if lexer.match(Token.L):
        nfa_single_char(pair_out)
    elif lexer.match(Token.ANY):
        nfa_dot_char(pair_out)
    elif lexer.match(Token.CCL_START):
        nfa_set_nega_char(pair_out)

The constructors for the three node types are all simple. The diagrams below were sketched quickly with Markdown's mermaid.

  • nfa_single_char

The NFA for a single character is built by creating two nodes and setting the matched character as the edge:

[Figure: two nodes connected by an edge labeled a]

def nfa_single_char(pair_out):
    if not lexer.match(Token.L):
        return False

    start = pair_out.start_node = Nfa()
    pair_out.end_node = pair_out.start_node.next_1 = Nfa()
    start.edge = lexer.lexeme
    lexer.advance()
    return True
  • nfa_dot_char

The only difference between the NFA for . and the single-character one above is that its edge is set to CCL and its input_set is filled in:

[Figure: two nodes connected by a CCL edge matching any character]

# . matches any single character
def nfa_dot_char(pair_out):
    if not lexer.match(Token.ANY):
        return False

    start = pair_out.start_node = Nfa()
    pair_out.end_node = pair_out.start_node.next_1 = Nfa()
    start.edge = CCL
    start.set_input_set()

    lexer.advance()
    return True
  • nfa_set_nega_char

Logically, this function only adds the handling of input_set (and of negation) on top of the ones above:

[Figure: two nodes connected by a CCL edge for the character set]

def nfa_set_nega_char(pair_out):
    if not lexer.match(Token.CCL_START):
        return False

    negation = False
    lexer.advance()
    if lexer.match(Token.AT_BOL):
        negation = True
        lexer.advance()  # skip the '^'

    start = pair_out.start_node = Nfa()
    start.next_1 = pair_out.end_node = Nfa()
    start.edge = CCL
    dodash(start.input_set)

    if negation:
        char_set_inversion(start.input_set)

    lexer.advance()
    return True
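
dodash and char_set_inversion aren't shown in this excerpt. Here is a sketch consistent with how they are called above; both bodies are assumptions about the real project's code. dodash fills input_set with the characters up to the closing ], expanding a-z style ranges, and char_set_inversion replaces the set with its complement over the alphabet.

def dodash(input_set):
    # Consume characters until ']', expanding ranges written with '-'
    first = ''
    while not lexer.match(Token.CCL_END):
        if not lexer.match(Token.DASH):
            first = lexer.lexeme
            input_set.add(first)
        else:
            lexer.advance()  # step past '-' to the end of the range
            for c in range(ord(first), ord(lexer.lexeme) + 1):
                input_set.add(chr(c))
        lexer.advance()

def char_set_inversion(input_set):
    # Replace the set with its complement over the ASCII alphabet
    inverted = set(chr(i) for i in range(ASCII_COUNT)) - input_set
    input_set.clear()
    input_set.update(inverted)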

Summary

For reasons of length (this post is already over three hundred lines), I'm splitting the write-up into parts and plan to finish within three posts. The next one covers constructing more complex NFAs and using the constructed NFA to match input strings. The last covers converting the NFA to a DFA and finally matching the input string with the DFA.

Original post: www.cnblogs.com/secoding/p/11576864.html