Natural Language Processing with Python (2nd Edition), Steven Bird et al. — Study Notes: Chapter 8, Analyzing Sentence Structure

Copyright notice: this is the author's original work; please do not repost without permission. Originally published at https://blog.csdn.net/weixin_43935926/article/details/86524698

import nltk
  1. How can we use a formal grammar to describe the structure of an unlimited set of sentences?
  2. How do we represent the structure of sentences using syntax trees?
  3. How do parsers analyze a sentence and automatically build a syntax tree?

8.1 Some Grammatical Dilemmas

Linguistic Data and Unlimited Possibilities

Ubiquitous Ambiguity

groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas'
    V -> 'shot'
    P -> 'in'
    """)
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))

8.2 What's the Use of Syntax?

Beyond n-grams

Coordinate structure: if v1 and v2 are both phrases of grammatical category X, then "v1 and v2" is also a phrase of category X.

Constituent structure: based on the observation that words combine with other words to form units.

8.3 Context-Free Grammar

A Simple Grammar

Context-free grammar (CFG): in NLTK, context-free grammars are defined in the nltk.grammar module.

Example 8-1. An example of a context-free grammar.

grammar1 = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate" | "walked"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "man" | "dog" | "cat" | "telescope" | "park"
    P -> "in" | "on" | "by" | "with"
    """)
sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)
(S (NP Mary) (VP (V saw) (NP Bob)))

Table 8-1. Syntactic categories

Symbol  Meaning                Example
S       sentence               the man walked
NP      noun phrase            a dog
VP      verb phrase            saw a park
PP      prepositional phrase   with a telescope
Det     determiner             the
N       noun                   dog
V       verb                   walked
P       preposition            in
grammar1 = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate" | "walked"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my" | "The"
    N -> "man" | "dog" | "cat" | "telescope" | "park"
    P -> "in" | "on" | "by" | "with"
    """)
sent = "The dog saw a man in the park".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)
(S
  (NP (Det The) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man) (PP (P in) (NP (Det the) (N park))))))
(S
  (NP (Det The) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man))
    (PP (P in) (NP (Det the) (N park)))))

Since both trees for this sentence are licensed by our grammar, the sentence is said to be structurally ambiguous. This particular ambiguity is known as prepositional phrase attachment ambiguity.

Writing Your Own Grammars

# grammar1 = nltk.data.load('file:mygrammar.cfg')  # load a grammar saved in mygrammar.cfg
sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)
(S (NP Mary) (VP (V saw) (NP Bob)))
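The commented-out nltk.data.load line above hints at how to keep a grammar in its own file. A minimal sketch of the round trip, writing productions to a temporary mygrammar.cfg and loading it back (the grammar contents here are a cut-down example, not the full grammar1):

```python
import os
import tempfile

import nltk

grammar_text = """
S -> NP VP
VP -> V NP
NP -> "Mary" | "Bob"
V -> "saw"
"""

# Write the productions to a .cfg file; the extension tells
# nltk.data.load to build a CFG from the file's contents.
path = os.path.join(tempfile.gettempdir(), 'mygrammar.cfg')
with open(path, 'w') as f:
    f.write(grammar_text)

grammar = nltk.data.load('file:' + path)
parser = nltk.RecursiveDescentParser(grammar)
trees = list(parser.parse("Mary saw Bob".split()))
for tree in trees:
    print(tree)
```

nltk.data.load infers the CFG format from the .cfg extension, so no explicit format argument is needed.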

Recursion in Syntactic Structure

Example 8-2. A recursive context-free grammar.

grammar2 = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det Nom | PropN
    Nom -> Adj Nom | N
    VP -> V Adj | V NP | V S | V NP PP
    PP -> P NP
    PropN -> 'Buster' | 'Chatterer' | 'Joe'
    Det -> 'the' | 'a'
    N -> 'bear' | 'squirrel' | 'tree' | 'fish' | 'log'
    Adj -> 'angry' | 'frightened' | 'little' | 'tall'
    V -> 'chased' | 'saw' | 'said' | 'thought' | 'was' | 'put'
    P -> 'on'
    """)
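Because grammar2 makes Nom recursive (Nom -> Adj Nom), noun phrases can stack any number of adjectives. A quick sketch exercising this recursion (the grammar is repeated so the snippet runs on its own; the example sentence is ours, built from grammar2's lexicon):

```python
import nltk

# grammar2 as defined above
grammar2 = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det Nom | PropN
    Nom -> Adj Nom | N
    VP -> V Adj | V NP | V S | V NP PP
    PP -> P NP
    PropN -> 'Buster' | 'Chatterer' | 'Joe'
    Det -> 'the' | 'a'
    N -> 'bear' | 'squirrel' | 'tree' | 'fish' | 'log'
    Adj -> 'angry' | 'frightened' | 'little' | 'tall'
    V -> 'chased' | 'saw' | 'said' | 'thought' | 'was' | 'put'
    P -> 'on'
    """)

# "frightened little squirrel" needs Nom -> Adj Nom applied twice.
sent = 'the angry bear chased the frightened little squirrel'.split()
parser = nltk.RecursiveDescentParser(grammar2)
trees = list(parser.parse(sent))
for tree in trees:
    print(tree)
```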

8.4 Parsing with Context-Free Grammar

A parser processes input sentences according to the productions of a grammar, and builds one or more constituent structures that conform to the grammar. A grammar is a declarative specification of well-formedness; it is actually just a string, not a program. A parser is a procedural interpretation of the grammar: it searches through the space of trees licensed by the grammar to find one that has the required sentence along its fringe.

Recursive Descent Parsing

A top-down approach.

grammar1 = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate" | "walked"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "man" | "dog" | "cat" | "telescope" | "park"
    P -> "in" | "on" | "by" | "with"
    """)
rd_parser = nltk.RecursiveDescentParser(grammar1)
sent = 'Mary saw a dog'.split()
for tree in rd_parser.parse(sent):
    print(tree)
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
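Recursive descent cannot handle left-recursive productions such as NP -> NP PP: expanding NP immediately yields another NP without consuming any input, so the expansion never terminates. A sketch with a hypothetical toy grammar (not from the book), where the failure surfaces as Python's recursion limit:

```python
import nltk

# A left-recursive grammar: NP -> NP PP makes the recursive
# descent parser expand NP forever before matching any token.
lr_grammar = nltk.CFG.fromstring("""
    S -> NP
    NP -> NP PP | 'dog'
    PP -> 'in' NP
    """)
parser = nltk.RecursiveDescentParser(lr_grammar)

try:
    list(parser.parse('dog in dog'.split()))
    outcome = 'parsed'
except RecursionError:
    # The unbounded NP expansion exhausts the Python call stack.
    outcome = 'infinite recursion'
print(outcome)
```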

Shift-Reduce Parsing

A bottom-up approach.

sr_parser = nltk.ShiftReduceParser(grammar1)
sent = 'Mary saw a dog'.split()
for tree in sr_parser.parse(sent):
    print(tree)
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
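A shift-reduce parser can fail to find a parse even when one exists, because an eager reduction may swallow material that a later production needs. A sketch using grammar1 (repeated so the snippet is self-contained):

```python
import nltk

# grammar1 as defined earlier in the chapter
grammar1 = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate" | "walked"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "man" | "dog" | "cat" | "telescope" | "park"
    P -> "in" | "on" | "by" | "with"
    """)

sent = 'Mary saw a dog in the park'.split()

# The shift-reduce parser reduces "a dog" to NP and "saw a dog"
# to VP as soon as it can, leaving "in the park" stranded, so it
# finds no complete parse.
sr_parser = nltk.ShiftReduceParser(grammar1)
sr_trees = list(sr_parser.parse(sent))

# A chart parser explores all analyses and finds both attachments.
chart_trees = list(nltk.ChartParser(grammar1).parse(sent))
print(len(sr_trees), len(chart_trees))
```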

The Left-Corner Parser

A top-down approach with bottom-up filtering.

Chart Parsing

A dynamic programming technique.

Example 8-3. An acceptor using a well-formed substring table (WFST).

def init_wfst(tokens, grammar):
    numtokens = len(tokens)
    wfst = [[None for i in range(numtokens+1)] for j in range(numtokens+1)]
    for i in range(numtokens):
        productions = grammar.productions(rhs=tokens[i])
        wfst[i][i+1] = productions[0].lhs()
    return wfst
def complete_wfst(wfst, tokens, grammar, trace=False):
    index = dict((p.rhs(), p.lhs()) for p in grammar.productions())
    numtokens = len(tokens)
    for span in range(2, numtokens+1):
        for start in range(numtokens+1-span):
            end = start + span
            for mid in range(start+1, end):
                nt1, nt2 = wfst[start][mid], wfst[mid][end]
                if nt1 and nt2 and (nt1,nt2) in index:
                    wfst[start][end] = index[(nt1,nt2)]
                    if trace:
                        print("[%s] %3s [%s] %3s [%s] ==> [%s] %3s [%s]" % \
                              (start, nt1, mid, nt2, end, start, index[(nt1,nt2)], end))
    return wfst  
def display(wfst, tokens):
    print('\nWFST ' + ' '.join(("%-4d" % i) for i in range(1, len(wfst))))
    for i in range(len(wfst)-1):
        print("%d " % i, end=" ")
        for j in range(1, len(wfst)):
            print("%-4s" % (wfst[i][j] or '.'), end=" ")
        print()
tokens = "I shot an elephant in my pajamas".split()
wfst0 = init_wfst(tokens, groucho_grammar)
display(wfst0, tokens)
WFST 1    2    3    4    5    6    7   
0  NP   .    .    .    .    .    .    
1  .    V    .    .    .    .    .    
2  .    .    Det  .    .    .    .    
3  .    .    .    N    .    .    .    
4  .    .    .    .    P    .    .    
5  .    .    .    .    .    Det  .    
6  .    .    .    .    .    .    N    
wfst1 = complete_wfst(wfst0, tokens, groucho_grammar)
display(wfst1, tokens)
WFST 1    2    3    4    5    6    7   
0  NP   .    .    S    .    .    S    
1  .    V    .    VP   .    .    VP   
2  .    .    Det  NP   .    .    .    
3  .    .    .    N    .    .    .    
4  .    .    .    .    P    .    PP   
5  .    .    .    .    .    Det  NP   
6  .    .    .    .    .    .    N    
wfst1 = complete_wfst(wfst0, tokens, groucho_grammar, trace=True)
[2] Det [3]   N [4] ==> [2]  NP [4]
[5] Det [6]   N [7] ==> [5]  NP [7]
[1]   V [2]  NP [4] ==> [1]  VP [4]
[4]   P [5]  NP [7] ==> [4]  PP [7]
[0]  NP [1]  VP [4] ==> [0]   S [4]
[1]  VP [4]  PP [7] ==> [1]  VP [7]
[0]  NP [1]  VP [7] ==> [0]   S [7]

The interactive chart parser application can be launched with nltk.app.chartparser().

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V
    NP -> NAME | ART N
    NAME -> 'John'
    V -> 'ate'
    ART -> 'the'
    N -> 'cat'
    """)
tokens = "John ate the cat".split()
parser = nltk.ChartParser(grammar, trace=1)
for tree in parser.parse(tokens):
    print(tree)
    tree.draw()
|.   John  .   ate   .   the   .   cat   .|
|[---------]         .         .         .| [0:1] 'John'
|.         [---------]         .         .| [1:2] 'ate'
|.         .         [---------]         .| [2:3] 'the'
|.         .         .         [---------]| [3:4] 'cat'
|[---------]         .         .         .| [0:1] NAME -> 'John' *
|[---------]         .         .         .| [0:1] NP -> NAME *
|[--------->         .         .         .| [0:1] S  -> NP * VP
|.         [---------]         .         .| [1:2] V  -> 'ate' *
|.         [--------->         .         .| [1:2] VP -> V * NP
|.         [---------]         .         .| [1:2] VP -> V *
|[-------------------]         .         .| [0:2] S  -> NP VP *
|.         .         [---------]         .| [2:3] ART -> 'the' *
|.         .         [--------->         .| [2:3] NP -> ART * N
|.         .         .         [---------]| [3:4] N  -> 'cat' *
|.         .         [-------------------]| [2:4] NP -> ART N *
|.         .         [------------------->| [2:4] S  -> NP * VP
|.         [-----------------------------]| [1:4] VP -> V NP *
|[=======================================]| [0:4] S  -> NP VP *
(S (NP (NAME John)) (VP (V ate) (NP (ART the) (N cat))))

8.5 Dependencies and Dependency Grammar

A dependency representation is a labeled directed graph in which the nodes are lexical items and the labeled arcs express dependency relations, from heads to dependents.
Figure 8-8 shows a dependency graph, with arrows pointing from heads to their dependents.

groucho_dep_grammar = nltk.DependencyGrammar.fromstring("""
     'shot' -> 'I' | 'elephant' | 'in'
     'elephant' -> 'an' | 'in'
     'in' -> 'pajamas'
     'pajamas' -> 'my'
     """)
print(groucho_dep_grammar)
Dependency grammar with 7 productions
  'shot' -> 'I'
  'shot' -> 'elephant'
  'shot' -> 'in'
  'elephant' -> 'an'
  'elephant' -> 'in'
  'in' -> 'pajamas'
  'pajamas' -> 'my'
pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammar)
sent = 'I shot an elephant in my pajamas'.split()
trees = pdp.parse(sent)
for tree in trees:
    print(tree)
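The loop above prints two dependency trees, mirroring the PP-attachment ambiguity seen in the phrase-structure version: one analysis attaches 'in' to 'elephant', the other to 'shot'. A self-contained sketch that counts them:

```python
import nltk

groucho_dep_grammar = nltk.DependencyGrammar.fromstring("""
    'shot' -> 'I' | 'elephant' | 'in'
    'elephant' -> 'an' | 'in'
    'in' -> 'pajamas'
    'pajamas' -> 'my'
    """)

pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammar)
sent = 'I shot an elephant in my pajamas'.split()
trees = list(pdp.parse(sent))

# One tree makes 'in' a dependent of 'elephant', the other of 'shot'.
for tree in trees:
    print(tree)
print(len(trees))
```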

Valency and the Lexicon

Scaling Up

8.6 Grammar Development

from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t)
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))

Example 8-4. Searching a treebank to find sentential complements.

def filter(tree):
    child_nodes = [child.label() for child in tree
        if isinstance(child, nltk.Tree)]
    return (tree.label() == 'VP') and ('S' in child_nodes)
[subtree for tree in treebank.parsed_sents()
     for subtree in tree.subtrees(filter)]
[Tree('VP', [Tree('VBN', ['named']), Tree('S', [Tree('NP-SBJ', [Tree('-NONE-', ['*-1'])]), Tree('NP-PRD', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['this']), Tree('JJ', ['British']), Tree('JJ', ['industrial']), Tree('NN', ['conglomerate'])])])])])]),
from collections import defaultdict
entries = nltk.corpus.ppattach.attachments('training')
table = defaultdict(lambda: defaultdict(set))
for entry in entries:
    key = entry.noun1 + '-' + entry.prep + '-' + entry.noun2
    table[key][entry.attachment].add(entry.verb)
for key in sorted(table):
    if len(table[key]) > 1:
        print(key, 'N:', sorted(table[key]['N']), 'V:', sorted(table[key]['V']))
%-below-level N: ['left'] V: ['be']
%-from-year N: ['was'] V: ['declined', 'dropped', 'fell', 'grew', 'increased', 'plunged', 'rose', 'was']
%-in-August N: ['was'] V: ['climbed', 'fell', 'leaping', 'rising', 'rose']
%-in-September N: ['increased'] V: ['climbed', 'declined', 'dropped', 'edged', 'fell', 'grew', 'plunged', 'rose', 'slipped']
%-in-week N: ['declined'] V: ['was']
nltk.corpus.sinica_treebank.parsed_sents()[3450].draw()

Pernicious Ambiguity

grammar = nltk.CFG.fromstring("""
    S -> NP V NP
    NP -> NP Sbar
    Sbar -> NP V
    NP -> 'fish'
    V -> 'fish'
    """)
tokens = ["fish"] * 5
cp = nltk.ChartParser(grammar)
for tree in cp.parse(tokens):
    print(tree)
(S (NP fish) (V fish) (NP (NP fish) (Sbar (NP fish) (V fish))))
(S (NP (NP fish) (Sbar (NP fish) (V fish))) (V fish) (NP fish))
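As the sentence grows, the number of parses explodes: for 3, 5, 7, ... tokens the counts follow the Catalan numbers 1, 2, 5, 14, .... A sketch verifying the first few (only odd lengths are grammatical under this grammar):

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP V NP
    NP -> NP Sbar
    Sbar -> NP V
    NP -> 'fish'
    V -> 'fish'
    """)
cp = nltk.ChartParser(grammar)

# Count the parses of "fish fish fish", "fish"*5, "fish"*7.
counts = [len(list(cp.parse(['fish'] * n))) for n in (3, 5, 7)]
print(counts)
```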

Weighted Grammar

def give(t):
    return t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP'\
        and (t[2].label() == 'PP-DTV' or t[2].label() == 'NP')\
        and ('give' in t[0].leaves() or 'gave' in t[0].leaves())
def sent(t):
    return ' '.join(token for token in t.leaves() if token[0] not in '*-0')
def print_node(t, width):
    output = "%s %s: %s / %s: %s" %\
        (sent(t[0]), t[1].label(), sent(t[1]), t[2].label(), sent(t[2]))
    if len(output) > width:
        output = output[:width] + "..."
    print(output)
for tree in nltk.corpus.treebank.parsed_sents():
    for t in tree.subtrees(give):
        print_node(t, 72)
gave NP: the chefs / NP: a standing ovation
give NP: advertisers / NP: discounts for maintaining or increasing ad sp...
give NP: it / PP-DTV: to the politicians
gave NP: them / NP: similar help
give NP: them / NP: 
give NP: only French history questions / PP-DTV: to students in a Europe...
give NP: federal judges / NP: a raise
give NP: consumers / NP: the straight scoop on the U.S. waste crisis
gave NP: Mitsui / NP: access to a high-tech medical product
give NP: Mitsubishi / NP: a window on the U.S. glass industry
give NP: much thought / PP-DTV: to the rates she was receiving , nor to ...
give NP: your Foster Savings Institution / NP: the gift of hope and free...
give NP: market operators / NP: the authority to suspend trading in futu...
gave NP: quick approval / PP-DTV: to $ 3.18 billion in supplemental appr...
give NP: the Transportation Department / NP: up to 50 days to review any...
give NP: the president / NP: such power
give NP: me / NP: the heebie-jeebies
give NP: holders / NP: the right , but not the obligation , to buy a cal...
gave NP: Mr. Thomas / NP: only a `` qualified '' rating , rather than ``...
give NP: the president / NP: line-item veto power

A probabilistic context-free grammar (PCFG) is a context-free grammar in which each production is annotated with a probability.

Example 8-6. Defining a probabilistic context-free grammar (PCFG).

grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    VP -> TV NP [0.4]
    VP -> IV [0.3]
    VP -> DatV NP NP [0.3]
    TV -> 'saw' [1.0]
    IV -> 'ate' [1.0]
    DatV -> 'gave' [1.0]
    NP -> 'telescopes' [0.8]
    NP -> 'Jack' [0.2]
    """)
print(grammar)
Grammar with 9 productions (start state = S)
    S -> NP VP [1.0]
    VP -> TV NP [0.4]
    VP -> IV [0.3]
    VP -> DatV NP NP [0.3]
    TV -> 'saw' [1.0]
    IV -> 'ate' [1.0]
    DatV -> 'gave' [1.0]
    NP -> 'telescopes' [0.8]
    NP -> 'Jack' [0.2]
viterbi_parser = nltk.ViterbiParser(grammar)
for tree in viterbi_parser.parse(['Jack', 'saw', 'telescopes']):
    print(tree)
(S (NP Jack) (VP (TV saw) (NP telescopes))) (p=0.064)
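The parse probability p=0.064 reported by the Viterbi parser is simply the product of the probabilities of the productions used in the tree; a quick arithmetic check:

```python
# Productions used in (S (NP Jack) (VP (TV saw) (NP telescopes))):
#   S  -> NP VP        [1.0]
#   NP -> 'Jack'       [0.2]
#   VP -> TV NP        [0.4]
#   TV -> 'saw'        [1.0]
#   NP -> 'telescopes' [0.8]
p = 1.0 * 0.2 * 0.4 * 1.0 * 0.8
print(round(p, 3))
```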

8.7 Summary

  • Sentences have internal organization that can be represented using a tree. Notable features of constituent structure are: recursion, heads, complements, and modifiers.
  • A grammar is a compact characterization of a potentially infinite set of sentences; we say that a tree is well-formed with respect to a grammar, or that a grammar licenses a tree.
  • A grammar is a formal model for describing whether a given phrase can be assigned a particular constituent or dependency structure.
  • Given a set of syntactic categories, a context-free grammar uses a set of productions to say how a phrase of some category A can be analyzed into a smaller sequence of parts α1 … αn.
  • A dependency grammar uses productions to specify the dependents of a given lexical head.
  • Syntactic ambiguity arises when one sentence has more than one syntactic analysis (e.g., prepositional phrase attachment ambiguity).
  • A parser is a procedure for finding one or more trees corresponding to a grammatically well-formed sentence.
  • A simple top-down parser is the recursive descent parser, which recursively expands the start symbol (usually S) with the help of the grammar productions, trying to match the input sentence. This parser cannot handle left-recursive productions (e.g., NP -> NP PP). It is inefficient in that it blindly expands categories without checking whether they are compatible with the input string, and in that it repeatedly expands the same non-terminals only to discard the results.
  • A simple bottom-up parser is the shift-reduce parser, which shifts input onto a stack and tries to match the items at the top of the stack against the right-hand sides of grammar productions. This parser is not guaranteed to find a valid parse for the input even when one exists, because it builds substructures without checking that they are consistent with the grammar as a whole.

Acknowledgments

Natural Language Processing with Python [1][2][3][4], by Steven Bird, Ewan Klein & Edward Loper, is a highly practical introduction; the first edition appeared in 2009 and the second in 2015. These study notes draw on both editions and extend and practice parts of the material. They are shared here in the hope that they help others; corrections and discussion are welcome.

References


  1. http://nltk.org/

  2. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2009

  3. Bird, Klein & Loper, Natural Language Processing with Python (Chinese edition), Southeast University Press, 2010

  4. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2015
