Person-name recognition code analysis in the NLP package HanLP

The HanLP dictionary nr.txt includes 393 surname characters among the words covered by the emission matrix. Yuan Yi's article "How China's Three Big Surnames Were Worked Out" points out that contemporary China's 100 most common surnames cover 87% of the population. Following this figure, we keep only the 100 common surnames in nr.txt and strip the surname-role state from all other words. After filtering, nr.txt contains 97 surname characters in total, listed (romanized) in the table below:
Ding Wan Qiao Ren Yu Hou Fu Feng Liu Lu Shi Ye Lü Wu Zhou Tang Xia Yao
Jiang Kong Sun Meng Song Yin Cui Chang Kang Liao Zhang Peng Xu Dai Fang Yi Cao Zeng Zhu Li
Du Yang Lin Liang Wu Duan Mao Jiang Tang Wang Shen Pan Xiong Wang Tian Bai Qin Cheng Luo
Hu Su Fan Xiao Dong Jiang Xue Yuan Xu Jia Xie Tan He Lai Zhao Deng Qiu Shao Zou Zheng
Hao Guo Qian Yan Lu Chen Lei Han Zhong Gu Gong Wei Huang Li Gao Ma
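The filtering step described above can be sketched as follows. This is a minimal illustration, not the HanLP code: it assumes a simplified nr.txt-style line format of "word role1 freq1 role2 freq2 ...", with a hypothetical role tag "B" marking the surname role; words outside the common-surname set lose that role, and entries left with no roles are dropped.

```python
# Hypothetical simplified nr.txt format: "word role1 freq1 role2 freq2 ...",
# where role "B" marks the surname role. Words outside the common-surname
# set have their "B" role stripped; entries with no roles left are dropped.
COMMON_SURNAMES = {"Wang", "Li", "Zhang", "Liu", "Chen"}  # illustrative subset

def filter_surname_role(lines, common=COMMON_SURNAMES):
    out = []
    for line in lines:
        parts = line.split()
        word = parts[0]
        pairs = list(zip(parts[1::2], parts[2::2]))  # (role, freq) pairs
        if word not in common:
            pairs = [(r, f) for r, f in pairs if r != "B"]
        if pairs:  # drop entries left with no roles at all
            out.append(" ".join([word] + [x for p in pairs for x in p]))
    return out

entries = ["Wang B 120 E 3", "Bridge B 5 M 9", "Ocean B 2"]
print(filter_surname_role(entries))
# → ['Wang B 120 E 3', 'Bridge M 9']  (Ocean had only the surname role)
```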

Experimental results
Before surname filtering, per-entity recognition accuracy:
nr (person names) 33%
ns (place names) 83%
nt (organization names) 43%
After surname filtering:
nr 36%
ns 83%
nt 81%
The cascaded hidden Markov prediction for person and place names was not enabled here. The rise in nt accuracy is presumably because many words that are not names are no longer labeled as person names, so the nt pattern-matching rules stop failing to match and organization-name accuracy comes up. As for the misrecognized person names, they no longer involve the 100 common surnames; many probably come from other HanLP dictionaries intervening.
HMMs generally work fairly well for word segmentation and POS tagging, so why is the usual sequence-labeling approach, i.e. tagging directly with BIEO, a poor fit for entity recognition? Take POS tagging as an example: the subset of POS tags each word can take is limited, and small relative to the full tag set of the corpus. Entity recognition is not necessarily like that. Take person names: apart from the surname, the characters that can fill the rest of a name are arbitrary, which is to say almost any character can appear in a name-internal position. An emission matrix over characters then carries little information, because a given character may emit each of the four BIEO tags with roughly equal probability, and relying only on the transition probabilities among four tags to determine the optimal tag sequence is bound to hurt the results. We therefore introduce role tags, which in effect introduces prior knowledge: certain characters can only be generated by the surname role, some characters typically act as the first character of a given name, some typically act as the last character of a given name, and so on. By defining a set of roles in terms of the contexts a character appears in and its grammatical function, we shrink the set of tags each character can emit, and correspondingly the size of the word set each tag can generate. The emission distribution is then no longer uniform, and prediction accuracy certainly improves.
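A toy illustration of this argument: with plain BIEO tagging every character may emit any of the 4 tags, while role observation restricts each character to a small candidate set, so the Viterbi search space (the product of per-position candidate counts) shrinks sharply. The candidate role sets below are invented for illustration; they only mimic the surname / given-name-first-char / given-name-last-char roles described above.

```python
# With BIEO, every character can emit any of the 4 tags.
BIEO = {"B", "I", "E", "O"}

# With role observation, each character is restricted to a small
# candidate set (hypothetical sets, for illustration only):
role_candidates = {
    "Wang": {"B", "A"},  # B: surname role, A: non-name role
    "Xiao": {"C", "A"},  # C: first char of a given name
    "Ming": {"D", "A"},  # D: last char of a given name
}

def search_space(tokens, candidates):
    """Number of tag sequences Viterbi must implicitly consider."""
    size = 1
    for t in tokens:
        size *= len(candidates(t))
    return size

sent = ["Wang", "Xiao", "Ming"]
print(search_space(sent, lambda t: BIEO))                # 4**3 = 64
print(search_space(sent, lambda t: role_candidates[t]))  # 2*2*2 = 8
```

The point is not the exact numbers but the shape: restricting emissions per character is equivalent to making the emission distribution non-uniform, which is what the role tags buy us.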
Below is the main flow of HanLP's person-name recognition:
1. Use the word lattice wordNetAll to store the candidate segmentation paths produced by dictionary matching.
2. Find the optimal segmentation path with the Viterbi algorithm, mainly using the user dictionary and the core dictionary; the resulting sequence is stored in vertexList.
3. Role observation: for each word in vertexList, list the candidate role tags according to the emission probability matrix. Implemented in the roleObserve(...) method.
4. Role tagging: determine the optimal role-tag sequence with the Viterbi algorithm. Implemented in the viterbiComputeSimply(...) method.
5. Match name patterns against the resulting role sequence. The patterns are defined in the NRPattern class.
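Steps 3 to 5 can be sketched end to end as follows. This is a hypothetical miniature, not the HanLP implementation: the role set (B = surname, C = first char of given name, D = last char, A = other), the log probabilities, and the single pattern "BCD" are all invented for illustration, standing in for roleObserve(...), viterbiComputeSimply(...), and NRPattern respectively.

```python
import re

# Invented log emission probabilities: candidate roles per token (step 3).
EMIT = {
    "Wang": {"B": -0.1, "A": -2.0},   # B: surname
    "Xiao": {"C": -0.3, "A": -1.5},   # C: first char of given name
    "Ming": {"D": -0.3, "A": -1.5},   # D: last char of given name
    "said": {"A": -0.1},              # A: outside any name
}
# Invented log transition probabilities between roles.
TRANS = {("B", "C"): -0.1, ("C", "D"): -0.1, ("D", "A"): -0.2,
         ("A", "A"): -0.2, ("A", "B"): -0.5}

def observe(tokens):          # step 3: roleObserve analogue
    return [EMIT[t] for t in tokens]

def decode(obs):              # step 4: viterbiComputeSimply analogue
    paths = {r: (p, r) for r, p in obs[0].items()}
    for cand in obs[1:]:
        paths = {
            r: max(((pp + TRANS.get((pr, r), -9.0) + lp, ps + r)
                    for pr, (pp, ps) in paths.items()), key=lambda t: t[0])
            for r, lp in cand.items()  # only roles this token can emit
        }
    return max(paths.values(), key=lambda t: t[0])[1]

def find_names(tokens):       # step 5: NRPattern-style matching
    roles = decode(observe(tokens))
    return [" ".join(tokens[m.start():m.end()])
            for m in re.finditer("BCD", roles)]

print(find_names(["Wang", "Xiao", "Ming", "said"]))
# → ['Wang Xiao Ming']  (role sequence decodes to "BCDA")
```

In the real code the role string is matched against a whole family of patterns (two-character names, names split across segmentation boundaries, etc.), but the pipeline shape is the same: observe, decode, pattern-match.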

When precision requirements are high and strict, the best way to improve accuracy is to keep only the common surnames and only the most probable role 2-gram patterns. One more point to note: if the corpus you predict on differs in style from the training corpus where person names are concerned, that is, if you train on one corpus and run an open test on another, the context information is basically useless and may even interfere with marking entity boundaries. I think any machine learning method, deep learning included, whether for text classification or entity recognition, has this generalization problem, and I am afraid it is not one that algorithms alone can solve; if it could be solved, a single model usable for any language in any domain would settle the matter once and for all.


Origin blog.51cto.com/13636660/2425823