HanLP: debugging place-name recognition in detail


HanLP's dictionaries contain a large number of words, entities in particular, which makes misrecognition especially likely. Below are a few examples of misrecognized place names. Note that organization-name recognition, which runs later, builds on place-name recognition, so inaccurate place-name recognition also leads to inaccurate organization-name recognition.


Type 1: number + place name

[1] Undercover visit to Harbin ride-hailing: 10 orders placed, 7 "black cars" came, 1 with a cloned plate

 

[2] Fang.com daily transactions: May 12, Haining commercial housing sales filings, 43 units

 

[3] Guangxi myopia surgery expert, Dean Huang Minghan: September 9 Baise meet-and-greet

 

Type 2: the preceding word plus the first character of the place name forms a word, or the last character of the place name plus the following word forms a word

 

[1] A 4,000 RMB salary at a state-owned enterprise in Xi'an is equivalent to how much in the private sector?

 

[2] From Baotou to Shandong in July, about fifteen days, any recommendations for a self-drive route?

 

[3] The cities most popular with students: where is the university city you are applying to?

 

Type 3: the place name itself is a dictionary word

 

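[1] 滴滴司机接跨省天价订单 乘客半路改道至今未付款 (A Didi driver accepted a sky-high cross-province order; the passenger changed route midway and has still not paid)

 

[2] 上联:山水不曾随我老,如何对下联? (First line of a couplet: "The mountains and waters have never grown old along with me"; how would you write the matching second line?)

 

[3] 上联:柳着金妆闲钓水,如何对下联? (First line of a couplet: "The willow, dressed in gold, idly fishes the water"; how would you write the matching second line?)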

 

Bad case analysis and correction

 

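The following describes how to track down the cause of a misrecognition and how to correct it.

First, keep the following points in mind:

1. Entity recognition is affected by segmentation accuracy.

2. Entity recognition also involves disambiguation.

3. HanLP includes some uncommon entity words, which raises the error rate.

4. Unless you have special requirements for the recall of HanLP's HMM-based named entity recognition, there is no need to retrain it.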

 

Here we use the analysis of the following bad case as an example.

 

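[5] 上联:山水不曾随我老,如何对下联? (First line of a couplet: "The mountains and waters have never grown old along with me"; how would you write the matching second line?)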

 

Turn on debug mode: HanLP.Config.enableDebug()

 

Run the recognition code

 

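# HanLP named entity recognition
from pyhanlp import HanLP

# Assumed setup (the original post does not show how segment is created):
# a segmenter with place-name recognition switched on.
segment = HanLP.newSegment().enablePlaceRecognize(True)

def hanlp_ner(text, ner_type):
    """Return the words in text whose part-of-speech tag equals ner_type."""
    ner_li = []
    for term in segment.seg(text):
        if str(term.nature) == ner_type:
            ner_li.append(str(term.word))
    return ner_li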

 

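Here ner_type is the entity type you want to recognize: ner_type='nr' for person names, ner_type='ns' for place names, and ner_type='nt' for organization names. text is the text to extract entities from.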

 

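Recognition result; for clarity, only part of the output is shown. In the log, 粗分结果 is the coarse segmentation result, 地名角色观察 is the place-name role observation, 地名角色标注 is the place-name role tagging, and 识别出地名 is the recognized place name.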

 

粗分结果[上联/n, :/w, 山水/n, 不/d, 曾随/ns, 我/rr, 老/a, ,/w, 如何/ryv, 对/p, 下联/n, ?/w]

地名角色观察:[  S 1163565 ][上联 Z 20211628 ][: A 2701 B 439 X 11 ][山水 B 6 A 1 ][不 B 214 A 3 C 3 ][曾随 G 1 H 1 ][我 A 47 B 26 ][老 C 274 A 75 B 66 D 2 X 2 ][, A 40525 B 10497 X 418 ][如何 B 44 ][对 A 2896 B 454 X 215 ][下联 Z 20211628 ][? B 82 ][  B 1322 ]

地名角色标注:[ /S ,上联/Z ,:/B ,山水/A ,不/C ,曾随/H ,我/B ,老/B ,,/A ,如何/B ,对/A ,下联/Z ,?/B , /S]

识别出地名:不曾随 CH

hanlp_ns ['不曾随']
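To make the last two log lines easier to read, the sketch below replays the final step by hand: it joins the role tags into one string and searches it for the pattern CH reported in the log. The tokens and roles are copied from the role-tagging line, with the single characters taken from the input sentence. The function name find_places and the one-element pattern tuple are illustrative only, not HanLP's API, and HanLP may well use more patterns than the one visible here.

```python
# Tokens and place-name roles from the debug log (sentence boundaries left out).
tokens = ["上联", ":", "山水", "不", "曾随", "我", "老", ",", "如何", "对", "下联", "?"]
roles = ["Z", "B", "A", "C", "H", "B", "B", "A", "B", "A", "Z", "B"]


def find_places(tokens, roles, patterns=("CH",)):
    """Return the token spans whose role sequence matches one of the patterns."""
    tag_string = "".join(roles)  # one character per token, so indices line up
    places = []
    for pattern in patterns:
        start = tag_string.find(pattern)
        while start != -1:
            places.append("".join(tokens[start:start + len(pattern)]))
            start = tag_string.find(pattern, start + 1)
    return places


print(find_places(tokens, roles))  # ['不曾随']
```

The match covers 不 (role C) followed by 曾随 (role H), which is why the recognized string is 不曾随 and not just 曾随.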

 

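Clearly, 曾随 has been treated as a place name. The coarse segmentation result is the segmentation and part-of-speech tagging produced before the place-name recognition module runs, so the error evidently comes from the dictionary. Because the place-name recognition module is not involved at that point, there is no need to search the place-name emission dictionary ns.txt; it is enough to look in the core dictionary CoreNatureDictionary.txt.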
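Looking a word up in the core dictionary by hand is tedious in a file of that size, so a small helper can do it. This is a minimal sketch that assumes each dictionary line has the form "word nature frequency [nature frequency ...]" separated by whitespace; the function name lookup_natures and the path in the usage line are assumptions, so adjust them to your installation.

```python
from pathlib import Path


def lookup_natures(dict_path, word, encoding="utf-8"):
    """Return {nature: frequency} for word, or {} if the file has no such entry.

    Assumes whitespace-separated lines: word nature freq [nature freq ...].
    """
    with Path(dict_path).open(encoding=encoding) as fh:
        for line in fh:
            fields = line.split()
            if fields and fields[0] == word:
                pairs = fields[1:]
                return {pairs[i]: int(pairs[i + 1])
                        for i in range(0, len(pairs) - 1, 2)}
    return {}
```

A call such as lookup_natures("data/dictionary/CoreNatureDictionary.txt", "曾随") would then show which part-of-speech tags the word carries; an ns key in the result means the core dictionary itself marks it as a place name.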

[Figure 1: the 曾随 entry in CoreNatureDictionary.txt]


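In the core dictionary, "曾随" is indeed tagged as a place name. Delete "曾随" from the dictionary, delete the cached dictionary file CoreNatureDictionary.txt.bin as well, and then run the program again to get the following output: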
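The two manual steps (removing the entry and deleting the .bin cache) can be scripted so they are never done out of order. The helper below is a sketch, not part of HanLP: remove_word is a hypothetical name, the .bak backup is an extra precaution added here, and the whitespace-separated line format is an assumption.

```python
from pathlib import Path


def remove_word(dict_path, word, encoding="utf-8"):
    """Delete every line whose first field is word; return the number removed."""
    path = Path(dict_path)
    lines = path.read_text(encoding=encoding).splitlines(keepends=True)
    kept = [ln for ln in lines if ln.split()[:1] != [word]]
    removed = len(lines) - len(kept)
    if removed:
        # Keep a backup so the change can be reverted if accuracy drops.
        path.with_name(path.name + ".bak").write_text("".join(lines), encoding=encoding)
        path.write_text("".join(kept), encoding=encoding)
        # Drop the binary cache so the edited text file is loaded next time.
        cache = path.with_name(path.name + ".bin")
        if cache.exists():
            cache.unlink()
    return removed
```

Deleting the cache only when something was actually removed avoids an unnecessary dictionary rebuild on the next start.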

 

hanlp_ns []

 

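This example also shows that putting uncommon place names into a place-name dictionary can cause entities to be misrecognized. We should therefore keep an evaluation corpus, and every time the entity dictionaries are modified, run it to check the accuracy; if the accuracy drops too much, the addition is not viable. Newly added entity names can also cause segmentation errors.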
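One way to set up such a check is sketched below. The corpus format (a list of text and gold-entity pairs), the function names evaluate and is_regression, and the 0.01 threshold are all assumptions made for illustration; the post does not prescribe a metric, so entity-level precision, recall and F1 are used here as a common choice.

```python
def evaluate(recognize, corpus):
    """Compute (precision, recall, f1) of recognize over (text, gold_entities) pairs."""
    tp = fp = fn = 0
    for text, gold in corpus:
        predicted = set(recognize(text))
        gold = set(gold)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    # Each score is defined as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def is_regression(before, after, max_drop=0.01):
    """True if F1 fell by more than max_drop after a dictionary change."""
    return before[2] - after[2] > max_drop
```

With the function from this post, the recognizer argument could be lambda text: hanlp_ner(text, 'ns'). Run evaluate once before and once after editing a dictionary, then compare the two results with is_regression.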

 

The HanLP dictionary files related to entities are listed below.

 

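1. CoreNatureDictionary.mini.txt

2. CoreNatureDictionary.txt

3. CustomDictionary.txt

4. 机构名词典.txt (organization-name dictionary)

5. 全国地名大全.txt (nationwide place names)

6. 人名词典.txt (person-name dictionary)

7. 上海地名.txt (Shanghai place names)

8. 现代汉语补充词库.txt (modern Chinese supplementary lexicon)

9. ns.txt

10. nr.txt

11. nt.txt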

 

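These are only the dictionaries most likely to cause misrecognition. If the word is not found in any of them, you also need to search HanLP's other dictionary files.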
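Checking the files one by one can be replaced with a scan over the whole dictionary directory. The generator below is a sketch under two assumptions: the dictionaries are .txt files somewhere under one root directory, and the word is the first whitespace-separated field of its line. The name find_word_in_dictionaries is hypothetical.

```python
from pathlib import Path


def find_word_in_dictionaries(root, word, encoding="utf-8"):
    """Yield (path, line_number, line) for each .txt entry under root that starts with word."""
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding=encoding, errors="replace") as fh:
            for number, line in enumerate(fh, start=1):
                # Compare the first field only, so 不曾随 does not match 曾随.
                if line.split()[:1] == [word]:
                    yield path, number, line.rstrip("\n")
```

Iterating over find_word_in_dictionaries("data/dictionary", "曾随") (the root path is an assumption) lists every file that needs attention, including ones not in the list above.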

 

I hope today's content helps those of you who use HanLP and have a soft spot for hidden Markov models. My takeaway from the past two days is that word segmentation and entity recognition are inseparable: they share the same difficulties, such as disambiguation (deciding where the boundaries are). Lexical analysis is one of the genuinely NLP-specific tasks, and it actually has little to do with machine learning. The bad case fix above is not a fundamental solution, because deleting words outright means some uncommon entities will no longer be recognized. Could we instead use a measure such as information entropy to decide whether a word like this should be split so that its characters combine with the surrounding ones? For lexical analysis I recommend deep learning methods. Understanding them is a must: even if you do not use them in practice, that is no excuse for not learning them.
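The post leaves the entropy idea open, so the following is only one possible reading of it: the entropy of the characters that appear immediately to the left and right of a candidate word in a corpus, often called branching entropy. The function names and the corpus format (a list of plain strings) are assumptions, and the author may have had a different measure in mind.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy in bits of a Counter of neighbour characters."""
    total = sum(counts.values())
    if not total:
        return 0.0
    return sum(c / total * math.log2(total / c) for c in counts.values())


def neighbor_entropy(corpus, word):
    """Return (left_entropy, right_entropy) of the characters adjacent to word."""
    left, right = Counter(), Counter()
    for text in corpus:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                left[text[start - 1]] += 1
            if end < len(text):
                right[text[end]] += 1
            start = text.find(word, start + 1)
    return _entropy(left), _entropy(right)
```

A string that really behaves as a word tends to have varied neighbours on both sides and therefore high entropy on both. A low value on one side suggests the string usually attaches to the same neighbour, which is a hint that the boundary may belong elsewhere. Any threshold would have to be tuned on your own corpus.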


Origin blog.51cto.com/13636660/2424423