Paper Reading | Effective Neural Solution for Multi-Criteria Word Segmentation

main idea

This paper tackles multi-criteria Chinese word segmentation (CWS). Compared with the earlier Fudan paper on the same problem, its method is simpler, with no complicated architecture, yet more effective than previous methods.

method

A stacked Bi-LSTM with a CRF layer on top.

The bottom layers are character-level Bi-LSTMs. Input: character embeddings; output: a context-dependent representation $h_t$ for each character.

After obtaining $h_t$, a CRF is used as the inference layer.

Scoring:

local score:

$$s(t, y_t) = \left(W_s a_t + b_s\right)_{y_t}, \qquad a_t = [h_t \,;\, e_t^{bi}]$$

where $a_t$ is the concatenation of the Bi-LSTM hidden state $h_t$ and the bigram feature embeddings; the affine projection gives one emission score per tag at position $t$.
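A minimal numpy sketch of this emission scoring, with toy dimensions; the names `W_s`, `b_s` and the sizes are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): hidden size 4, bigram-embedding size 3, 4 tags (B, M, E, S)
H, B, T = 4, 3, 4
W_s = rng.normal(size=(T, H + B))   # projection to per-tag scores
b_s = rng.normal(size=T)

def local_score(h_t, bigram_t):
    """Emission scores for every tag at one position: concatenate the
    Bi-LSTM state with the bigram feature embedding, then project."""
    a_t = np.concatenate([h_t, bigram_t])   # a_t = [h_t ; e_t^bi]
    return W_s @ a_t + b_s                  # one score per tag

scores = local_score(rng.normal(size=H), rng.normal(size=B))
print(scores.shape)  # (4,) — one score for each of the B/M/E/S tags
```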

global score:

$$s(X, Y) = \sum_{t=1}^{n} \left( A_{y_{t-1},\, y_t} + s(t, y_t) \right)$$

where $A$ is the transition matrix: $A_{ij}$ is the score of transitioning from tag $i$ to tag $j$ (with $y_0$ a special start tag).
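The path score is just transitions plus emissions summed along the sequence. A toy sketch (the extra start-of-sentence row in `A` and all sizes are assumptions of this example):

```python
import numpy as np

# Toy setup (hypothetical): 4 tags, sentence of length 3
T = 4
rng = np.random.default_rng(1)
emissions = rng.normal(size=(3, T))  # local scores s(t, y) from the previous step
A = rng.normal(size=(T + 1, T))      # A[i, j]: transition score from tag i to tag j
                                     # (extra row T acts as the start-of-sentence state)

def global_score(emissions, A, tags):
    """Score of one tag sequence: transition + emission at each position."""
    score, prev = 0.0, A.shape[0] - 1    # start from the start state
    for t, y in enumerate(tags):
        score += A[prev, y] + emissions[t, y]
        prev = y
    return score

path = [0, 1, 2]  # e.g. B -> M -> E
print(global_score(emissions, A, path))
```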

Multi-criteria CWS

Add an artificial token at the beginning and end of each sentence to indicate which segmentation criterion (dataset) it follows. These tokens are excluded when computing the score.
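A sketch of this wrapping step; the token format `<pku>`/`</pku>` is a hypothetical naming, since the text above only says that criterion-indicating tokens are added and later removed from scoring:

```python
def add_criterion_tokens(chars, dataset):
    """Wrap a character sequence with artificial tokens marking which
    segmentation criterion (source dataset) the sentence follows."""
    return [f"<{dataset}>"] + list(chars) + [f"</{dataset}>"]

sent = add_criterion_tokens("商品和服务", "pku")
print(sent)  # ['<pku>', '商', '品', '和', '服', '务', '</pku>']

# The wrapper tokens are fed through the model like normal characters,
# but stripped before computing the segmentation score.
inner = sent[1:-1]
```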

training

The model is trained by maximizing the log-likelihood of the gold tag sequence:

$$\mathcal{L} = -\log \frac{\exp s(X, Y)}{\sum_{\tilde{Y} \in \mathcal{Y}_X} \exp s(X, \tilde{Y})}$$

where $\mathcal{Y}_X$ denotes the set of all possible tag sequences for sentence $X$.
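For toy sizes the normalizer over $\mathcal{Y}_X$ can be computed by brute-force enumeration with log-sum-exp; real CRF implementations use the forward algorithm instead. All sizes and scores below are illustrative:

```python
import itertools
import numpy as np

T, n = 3, 3   # toy sizes: 3 tags, sentence length 3
rng = np.random.default_rng(2)
emissions = rng.normal(size=(n, T))
A = rng.normal(size=(T, T))  # transition scores (no explicit start state here)

def score(tags):
    """s(X, Y): emissions plus transitions along one tag sequence."""
    s = emissions[0, tags[0]]
    for t in range(1, n):
        s += A[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return s

# Partition function: sum over all T**n sequences in Y_X (feasible only for toys).
all_scores = np.array([score(y) for y in itertools.product(range(T), repeat=n)])
logZ = np.logaddexp.reduce(all_scores)

gold = (0, 1, 2)
nll = logZ - score(gold)   # -log p(Y | X); always > 0 since other paths exist
print(nll)
```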

experiment

Three research questions:

1. Can our multi-criteria solution learn from heterogeneous datasets?

2. Can our solution be applied to a large-scale corpus of short and informal texts?

3. Does more data lead to better performance?

Implemented with DyNet (Neubig et al., 2017), a dynamic neural network framework.

data set

Q1: SIGHAN2005

Q2 & Q3: SIGHAN2008

All datasets are preprocessed by replacing each consecutive run of English characters and digits with a unique token. For the training and development sets, lines are split at punctuation into sentences or shorter clauses, for faster batching.
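A sketch of this preprocessing, assuming `X` and `N` as the replacement tokens (the actual token strings are not given here) and a small set of clause-ending punctuation marks:

```python
import re

def preprocess(line):
    """Replace each run of English letters / digits with a single token,
    then split at major punctuation into shorter clauses for batching.
    Token names X/N and the punctuation set are assumptions."""
    line = re.sub(r"[A-Za-z]+", "X", line)
    line = re.sub(r"[0-9]+", "N", line)
    # keep the punctuation mark that ends each clause
    clauses = re.findall(r"[^。！？，]+[。！？，]?", line)
    return [c for c in clauses if c]

print(preprocess("今年GDP增长了7個百分點，创2010年以来新高。"))
# ['今年X增长了N個百分點，', '创N年以来新高。']
```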

In addition, the traditional Chinese corpora CityU, AS, and CKIP are converted to simplified Chinese using the popular Chinese NLP tool HanLP.

Origin www.cnblogs.com/shona/p/11540353.html