main idea
This article addresses multi-criteria Chinese word segmentation (CWS). Compared with the earlier Fudan paper, its method is simpler, with no complicated architecture, yet it is more effective than previous methods.
method
A stacked Bi-LSTM with a CRF layer on top.
The bottom layers are character-level Bi-LSTMs. Input: character embeddings; output: a contextual representation for each character.
After obtaining h_t, a CRF serves as the inference layer.
Scoring:
local score:
s(X, t) = W_s h_t + b_s
where h_t is the concatenation of the Bi-LSTM hidden state at position t and the bigram-feature embeddings.
global score:
s(X, y) = Σ_{t=1}^{n} ( A_{y_{t-1}, y_t} + s(X, t)_{y_t} )
where A is the transition matrix; A_{ij} is the score of transitioning from tag i to tag j.
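The global score can be sketched in plain Python. This is a hedged illustration, not the paper's code: `emissions[t][y]` stands in for the local score of tag y at position t, and `A` is the transition matrix.

```python
# Sketch of linear-chain CRF path scoring (names and shapes are assumptions):
# emissions[t][y] = local score of tag y at position t (W_s h_t + b_s in the note)
# A[i][j]         = transition score from tag i to tag j
def global_score(emissions, A, tags):
    """Sum of local scores plus transition scores along one tag path."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += A[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score
```

With B/M/E/S tagging the tag indices would range over 4 labels; the function works for any tag set size.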
multi-criteria CWS
Add an artificial token at the beginning and end of each sentence indicating which segmentation criterion it follows. These tokens are removed when computing the score.
training
Training minimizes the negative log-likelihood of the gold tag sequence:
L = log Σ_{y' ∈ Y_X} exp( s(X, y') ) − s(X, y)
where Y_X denotes all possible tag sequences for sentence X.
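The log-sum-exp over all sequences in Y_X is tractable via the forward algorithm, which this sketch computes in O(n·K²) with the same assumed `emissions`/`A` layout as above (not the paper's actual implementation):

```python
import math

# Forward algorithm: log of the partition sum over all tag sequences Y_X.
# emissions[t][y] = local score of tag y at position t; A[i][j] = transition score.
def log_partition(emissions, A):
    K = len(emissions[0])
    alpha = list(emissions[0])  # log-scores of all length-1 prefixes
    for t in range(1, len(emissions)):
        alpha = [
            math.log(sum(math.exp(alpha[i] + A[i][j]) for i in range(K)))
            + emissions[t][j]
            for j in range(K)
        ]
    return math.log(sum(math.exp(a) for a in alpha))
```

The training loss is then `log_partition(emissions, A)` minus the gold path's global score.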
experiment
1. Can our multi-criteria solution learn from heterogeneous datasets?
2. Can our solution be applied to a large-scale corpus group composed of small and informal corpora?
3. Does more data lead to better performance?
Implemented on DyNet (Neubig et al., 2017), a dynamic neural network framework.
data set
Q1: SIGHAN2005
Q2 & Q3: SIGHAN2008
All datasets are preprocessed by replacing each consecutive run of English characters, and each run of digits, with a unique placeholder token. In the training and development sets, long lines are split by punctuation into shorter sentences or clauses for faster batching.
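A hedged sketch of this preprocessing; the placeholder names `<ENG>`/`<NUM>` and the exact punctuation set are assumptions, not taken from the paper:

```python
import re

def normalize(text):
    """Collapse runs of Latin letters and of digits into placeholder tokens."""
    text = re.sub(r"[A-Za-z]+", "<ENG>", text)   # placeholder names are assumptions
    text = re.sub(r"[0-9０-９]+", "<NUM>", text)  # covers halfwidth and fullwidth digits
    return text

def split_clauses(line):
    """Split a line at common Chinese punctuation into shorter clauses."""
    return [c for c in re.split(r"[。！？；，]", line) if c]
```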