Foreword
Project repository: Regex in Python
After slacking off at school for a few weeks, I spent the last few days reinventing the wheel in Python: a regular expression engine. This post is a record of that work.
Goals
It implements all of the basic regex syntax. For example:
```python
st = 'AS342abcdefg234aaaaabccccczczxczcasdzxc'
pattern = '([A-Z]+[0-9]*abcdefg)([0-9]*)(\*?|a+)(zx|bc*)([a-z]+|[0-9]*)(asd|fgh)(zxc)'
regex = Regex(st, pattern)
result = regex.match()
log(result)
```
More examples can be found on GitHub.
Prerequisites
A regular expression engine can in fact be seen as a small compiler, so it can be approached with the same mindset as the C compiler written earlier, except that it is nowhere near as complicated:
- Lexical analysis
- Parsing (top-down here)
- Semantic analysis (regular expressions are semantically so weak that this part is omitted; code can be generated directly from the AST)
- Code generation, which here means generating the NFA
- NFA-to-DFA conversion, which is where knowledge of regular languages and state machines comes in
- DFA minimization
NFA and DFA
A finite state machine can be viewed as a directed graph with a number of nodes; on each input character, a node can jump to a next node. The difference between an NFA (nondeterministic finite automaton) and a DFA (deterministic finite automaton) is that in a DFA the next state of every jump is uniquely determined.

A finite automaton starts in its initial state and reads the input string character by character, using its transition function to decide the next state from the current state and the current input character. In an NFA the next state is not uniquely determined: only the set of possible next states is known, and which state is actually taken depends on the input that follows. If the automaton is in an accepting state once the input has been fully read, the input string is accepted by the NFA.
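The set-of-states idea is easy to see in a toy simulation. The sketch below is purely illustrative (it is not part of the engine's code): a hand-written NFA over {a, b} that accepts strings ending in "ab", where the machine's current "position" is a set of states rather than a single state.

```python
# Toy NFA: accepts strings over {a, b} that end in "ab".
# Because the machine is nondeterministic, we track a *set* of
# current states instead of a single state.
NFA = {
    0: {'a': {0, 1}, 'b': {0}},  # state 0 loops; 'a' may also go to 1
    1: {'b': {2}},               # a 'b' right after that 'a' accepts
}
ACCEPT = {2}

def nfa_accepts(s):
    states = {0}
    for ch in s:
        # Union the successor sets of every current state.
        states = set().union(*(NFA.get(q, {}).get(ch, set()) for q in states))
    return bool(states & ACCEPT)

print(nfa_accepts('aab'))  # True
print(nfa_accepts('aba'))  # False
```

The engine built in this series constructs such state graphs automatically from the pattern instead of writing them by hand.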
Every NFA can be converted to an equivalent DFA.
- NFA: construction O(n), matching O(nm)
- DFA: construction O(2^n), minimization O(kn'·log n') (with n' = O(2^n)), matching O(m)
- where n = length of the regex, m = length of the input string, k = alphabet size, n' = size of the unminimized DFA
The set of all strings accepted by an NFA is the language accepted by that NFA; this language is a regular language.
Example
For the regular expression [0-9]*[A-Z]+, the corresponding NFA is formed by connecting node 3 and node 4 of the two NFAs below.
Lexical analysis

That is really all you need to know about NFAs and DFAs, apart from a few corresponding algorithms that will be covered later. With that out of the way, let's start with the lexical analysis part.
This lexer is much simpler than the parser of the earlier C compiler; it only has to handle a few cases:

- ordinary characters
- characters with semantic meaning (metacharacters)
- escape characters
Token

There is not much to say about the tokens: they correspond directly to the regex syntax.
```python
Tokens = {
    '.': Token.ANY,
    '^': Token.AT_BOL,
    '$': Token.AT_EOL,
    ']': Token.CCL_END,
    '[': Token.CCL_START,
    '}': Token.CLOSE_CURLY,
    ')': Token.CLOSE_PAREN,
    '*': Token.CLOSURE,
    '-': Token.DASH,
    '{': Token.OPEN_CURLY,
    '(': Token.OPEN_PAREN,
    '?': Token.OPTIONAL,
    '|': Token.OR,
    '+': Token.PLUS_CLOSE,
}
```
advance
advance is the most important function in the lexer; it returns the Token type of the current input character.
```python
def advance(self):
    pos = self.pos
    pattern = self.pattern
    if pos > len(pattern) - 1:
        self.current_token = Token.EOS
        return Token.EOS

    text = self.lexeme = pattern[pos]
    if text == '\\':
        self.isescape = not self.isescape
        self.pos = self.pos + 1
        self.current_token = self.handle_escape()
    else:
        self.current_token = self.handle_semantic_l(text)
    return self.current_token
```
The main logic of advance is to read the current character and decide whether it is an escape character or some other kind of character.
handle_escape handles escape characters. An escaped character ultimately still comes back as the ordinary character type; the function's main job is to record the character after escaping and store it in lexeme, for later use when building the automaton.
handle_semantic_l is only two lines: it looks the character up in a table holding all characters that carry semantic meaning, and if the lookup fails it simply returns the ordinary character type.
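As a sketch of that lookup (Token and the Tokens table are stubbed here with just a few entries for illustration; the real definitions are in the repository):

```python
from enum import Enum

# Minimal stand-ins for the engine's Token type and Tokens table,
# just enough to demonstrate the lookup.
class Token(Enum):
    L = 0        # ordinary literal character
    ANY = 1      # '.'
    CLOSURE = 2  # '*'

Tokens = {'.': Token.ANY, '*': Token.CLOSURE}

def handle_semantic_l(text):
    # Look the character up in the metacharacter table; anything
    # not found is an ordinary character.
    return Tokens.get(text, Token.L)
```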
The complete code is not reproduced here; it is all on GitHub.
Constructing the NFA

The files for NFA construction live in the nfa package: nfa.py defines the NFA node, and construction.py implements the construction of the NFA.
NFA node definition

The NFA node definition is also very simple. In fact the complete implementation of this regex engine is only about 900 lines, and each part is quite simple when examined on its own.
edge and input_set together describe a node's outgoing edge. An edge can have one of four possible attributes:

- the node has two outgoing ε edges: edge = EPSILON = -1
- the edge matches a character set: edge = CCL = -2, with input_set holding the corresponding set
- a single ε edge: edge = EMPTY = -3
- the edge matches a single input character c: edge = c

status_num gives every node a unique identifier, and visited exists for debugging, when traversing the NFA.
```python
class Nfa(object):
    STATUS_NUM = 0

    def __init__(self):
        self.edge = EPSILON
        self.next_1 = None
        self.next_2 = None
        self.visited = False
        self.input_set = set()
        self.set_status_num()

    def set_status_num(self):
        self.status_num = Nfa.STATUS_NUM
        Nfa.STATUS_NUM = Nfa.STATUS_NUM + 1

    def set_input_set(self):
        self.input_set = set()
        for i in range(ASCII_COUNT):
            self.input_set.add(chr(i))
```
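The class can be exercised on its own as a quick sanity check of the numbering. EPSILON and ASCII_COUNT are stubbed below with assumed values; in the engine they come from its constants:

```python
EPSILON = -1       # assumed value, matching the edge-attribute list above
ASCII_COUNT = 128  # assumed alphabet size

class Nfa(object):
    STATUS_NUM = 0

    def __init__(self):
        self.edge = EPSILON
        self.next_1 = None
        self.next_2 = None
        self.visited = False
        self.input_set = set()
        self.set_status_num()

    def set_status_num(self):
        # Hand out a unique, increasing id to every node created.
        self.status_num = Nfa.STATUS_NUM
        Nfa.STATUS_NUM = Nfa.STATUS_NUM + 1

    def set_input_set(self):
        # Fill input_set with the whole alphabet (used by '.').
        self.input_set = set()
        for i in range(ASCII_COUNT):
            self.input_set.add(chr(i))

a, b = Nfa(), Nfa()
print(a.status_num, b.status_num)  # 0 1
b.set_input_set()
print(len(b.input_set))            # 128
```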
Constructing simple nodes

Node construction lives under nfa.construction. To simplify the code, the Lexer is made a global variable shared by all functions.
The BNF grammar for regular expressions is shown below. With it we can parse top-down, recursing downward from the topmost group:
```
group ::= ("(" expr ")")*
expr ::= factor_conn ("|" factor_conn)*
factor_conn ::= factor | factor factor*
factor ::= (term | term ("*" | "+" | "?"))*
term ::= char | "[" char "-" char "]" | .
```
BNF was covered earlier, when writing the C compiler: 从零写一个编译器(二) (Writing a Compiler from Scratch, part 2).
Main entry point

Here, to simplify the code, the lexer is made a global variable shared by all functions.

The main logic is very simple: initialize the lexer, then pass in the NFA head node pair and start recursively creating nodes.
```python
def pattern(pattern_string):
    global lexer
    lexer = Lexer(pattern_string)
    lexer.advance()
    nfa_pair = NfaPair()
    group(nfa_pair)
    return nfa_pair.start_node
```
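NfaPair itself is not shown in the post; judging from how it is used, it presumably just bundles the entry and exit nodes of an NFA fragment, something like:

```python
class NfaPair(object):
    # Presumed shape: each construction function receives a pair
    # and fills in the start and end node of the fragment it builds.
    def __init__(self):
        self.start_node = None
        self.end_node = None
```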
term
Although the parsing is top-down, the construction is easier to understand bottom-up. term is the lowest-level construct: it builds the nodes for single characters, character classes, and the . metacharacter.
```
term ::= char | "[" char "-" char "]" | .
```
term's main logic is to decide, from the character just read, which kind of node to build:
```python
def term(pair_out):
    if lexer.match(Token.L):
        nfa_single_char(pair_out)
    elif lexer.match(Token.ANY):
        nfa_dot_char(pair_out)
    elif lexer.match(Token.CCL_START):
        nfa_set_nega_char(pair_out)
```
The three node constructors are all very simple; the diagrams below were just quick sketches made with markdown's mermaid.
- nfa_single_char

The NFA for a single character is built by creating two nodes and using the currently matched character as the edge.
```python
def nfa_single_char(pair_out):
    if not lexer.match(Token.L):
        return False

    start = pair_out.start_node = Nfa()
    pair_out.end_node = pair_out.start_node.next_1 = Nfa()
    start.edge = lexer.lexeme
    lexer.advance()
    return True
```
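To see the shape this builds, the function can be run against a stub lexer. Everything below is an illustrative stand-in (StubLexer and TOKEN_L are not the engine's real classes), kept just faithful enough to show the two-node result:

```python
EPSILON = -1  # assumed default edge value

class Nfa:
    def __init__(self):
        self.edge = EPSILON
        self.next_1 = None
        self.next_2 = None

class NfaPair:
    def __init__(self):
        self.start_node = None
        self.end_node = None

class StubLexer:
    """Illustrative stand-in for the real Lexer: it pretends the
    current character is always the literal 'a'."""
    lexeme = 'a'
    def match(self, token):
        return True  # always claims the current token matches
    def advance(self):
        pass         # nothing further to consume

lexer = StubLexer()
TOKEN_L = 'L'  # stand-in for Token.L

def nfa_single_char(pair_out):
    if not lexer.match(TOKEN_L):
        return False
    start = pair_out.start_node = Nfa()
    pair_out.end_node = pair_out.start_node.next_1 = Nfa()
    start.edge = lexer.lexeme
    lexer.advance()
    return True

pair = NfaPair()
nfa_single_char(pair)
print(pair.start_node.edge)                     # 'a'
print(pair.start_node.next_1 is pair.end_node)  # True
```

The result is exactly the two-node fragment described above: start --'a'--> end.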
- nfa_dot_char

The only difference between the NFA for . and the single-character one above is that its edge is set to CCL and its input_set is filled in.
```python
# . matches any single character
def nfa_dot_char(pair_out):
    if not lexer.match(Token.ANY):
        return False

    start = pair_out.start_node = Nfa()
    pair_out.end_node = pair_out.start_node.next_1 = Nfa()
    start.edge = CCL
    start.set_input_set()
    lexer.advance()
    return True
```
- nfa_set_nega_char

Logically this function only adds input_set handling on top of the ones above.
```python
def nfa_set_nega_char(pair_out):
    if not lexer.match(Token.CCL_START):
        return False

    negation = False
    lexer.advance()
    if lexer.match(Token.AT_BOL):
        negation = True

    start = pair_out.start_node = Nfa()
    start.next_1 = pair_out.end_node = Nfa()
    start.edge = CCL
    dodash(start.input_set)
    if negation:
        char_set_inversion(start.input_set)

    lexer.advance()
    return True
```
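dodash and char_set_inversion are not shown in the post (the full code is on GitHub). As a rough illustration of what they do: dodash expands a class body such as a-z into input_set, and char_set_inversion complements the set over the ASCII alphabet to implement ^ negation. Note the real dodash reads characters from the global lexer; this sketch takes the class body as a plain string instead:

```python
ASCII_COUNT = 128  # assumed alphabet size

def dodash(body, input_set):
    # Expand a character-class body such as 'a-z0' into input_set.
    # (Hypothetical signature: the real dodash takes only input_set
    # and pulls characters from the global lexer.)
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == '-':
            # a range like 'a-z': add every character in between
            for code in range(ord(body[i]), ord(body[i + 2]) + 1):
                input_set.add(chr(code))
            i += 3
        else:
            input_set.add(body[i])
            i += 1

def char_set_inversion(input_set):
    # Replace the set with its complement over the ASCII alphabet.
    for i in range(ASCII_COUNT):
        c = chr(i)
        if c in input_set:
            input_set.remove(c)
        else:
            input_set.add(c)

s = set()
dodash('a-c', s)
print(sorted(s))         # ['a', 'b', 'c']
char_set_inversion(s)
print(len(s), 'a' in s)  # 125 False
```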
Summary

For reasons of length (this post is already past three hundred lines) I am splitting the write-up, aiming to finish it in three posts. The next one will cover constructing the more complex NFAs and using the constructed NFA to match input strings. The last will cover converting the NFA to a DFA, and finally matching input with the DFA.