Ordered Neurons: Integrating Tree Structures Into Recurrent Neural Networks

This paper was published at ICLR 2019 and was one of the ICLR 2019 Best Papers. It proposes ON-LSTM, a model that can learn tree-structured information; the source code for the paper can be found on GitHub.

In natural language, smaller units (e.g. phrases) are naturally nested inside larger units (e.g. clauses), forming a hierarchy. When a large constituent ends, all of the smaller constituents inside it must end as well. However, a standard LSTM cannot model this hierarchy explicitly. This paper therefore adds an inductive bias by ordering the neurons (i.e., making the model learn hierarchical information); the proposed model is called ordered-neurons LSTM (ON-LSTM).

Introduction

Natural language is usually expressed in sequential form; for example, both speech and writing express language one unit after another. However, the underlying structure of language is not strictly sequential but tree-like (e.g. a syntax tree), which also matches human cognition. From a practical point of view, there are several reasons to integrate tree structures into neural language models:

  • To obtain semantic representations at different levels and raise the level of abstraction
  • To model the compositionality of language and handle long-distance dependencies
  • To improve generalization through a better inductive bias, while reducing the amount of training data required

A straightforward approach is to parse sentences into syntax trees with a supervised parser, but such supervised methods have several problems: 1) annotated data is scarce; 2) in some domains the grammar rules are not strict (e.g. internet language); 3) language keeps changing, so grammar rules may become invalid. On the other hand, learning grammatical structure directly in an unsupervised way (grammar induction) is still not well solved, and such methods are often very complex.

Recurrent neural networks have proven quite effective at modeling language, but they assume the data has a sequential structure. This assumption can be a problem when the language structure is not sequential, for example when capturing long-term dependencies or generating long sentences. At the same time, an LSTM may implicitly encode syntax trees through its gating mechanism.

In this paper, the authors propose ordered neurons: the information in each neuron has a different life cycle. High-order neurons store long-term information and keep it for many time steps, while low-order neurons store short-term information that may be quickly forgotten. To avoid a rigid division between high-order and low-order neurons, the paper also proposes the cumax() activation function. Finally, the model is evaluated on four tasks: language modeling, unsupervised constituency parsing, targeted syntactic evaluation, and logical inference. It outperforms previous models on unsupervised parsing, and it captures long-term dependencies and generates long sentences better than a standard LSTM.

Related work

There is already a lot of work applying tree structures to natural language processing tasks, and it has been shown that introducing structural information into LSTMs is very helpful for these tasks. However, inferring the structure efficiently remains a problem. Some work performs grammar induction directly, but these methods are too complex to use in practice. Other work improves recurrent networks by using recurrent mechanisms at different time scales to capture hierarchical information, but these approaches typically pre-define the depth of the hierarchy.

Ordered neurons

Given a sentence \(S = (x_1, \dots, x_T)\), figure (a) shows its corresponding constituency tree; the goal of the model is to infer the unobserved tree structure from the observed sequence data. As shown in figure (c), the hidden state at each time step has to contain information about the current input (the leaf node) as well as higher-level information. But the dimension of the hidden state \(h_t\) is fixed (3 in figure (c)), while across time steps and sentences the different levels of information may have different spans, so the nodes on the path from root to leaf have to be mapped dynamically to groups of neurons in the hidden state. For example, the hierarchy in (a) happens to match (c) exactly, but the tree could also have 4 levels while the hidden state only has 3 groups of neurons.

Therefore, with ordered neurons, the authors want the high-order neurons (corresponding to the upper layers of (c)) to contain long-term or global information, which may last for many time steps or even the whole sequence, while the low-order neurons (corresponding to the lower layers of (c)) encode short-term or local information, which lasts only a few time steps. In other words, low-order neurons are updated more frequently than high-order neurons.

ON-LSTM

The standard LSTM can be written as:
\[ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \\ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \\ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \\ \hat{c}_t = \text{tanh}(W_c x_t + U_c h_{t-1} + b_c) \\ c_t = f_t \circ c_{t-1} + i_t \circ \hat{c}_t \\ h_t = o_t \circ \text{tanh}(c_t) \]
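To make the notation concrete, here is a minimal PyTorch sketch of one step under these equations; the parameter containers `W`, `U`, `b` and the tensor shapes are assumptions made for illustration, not the paper's released code.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One standard LSTM step, following the equations above.
    W, U, b are dicts of weight matrices / bias vectors per gate
    (hypothetical containers, for illustration only)."""
    f_t = torch.sigmoid(x_t @ W['f'] + h_prev @ U['f'] + b['f'])   # forget gate
    i_t = torch.sigmoid(x_t @ W['i'] + h_prev @ U['i'] + b['i'])   # input gate
    o_t = torch.sigmoid(x_t @ W['o'] + h_prev @ U['o'] + b['o'])   # output gate
    c_hat = torch.tanh(x_t @ W['c'] + h_prev @ U['c'] + b['c'])    # candidate cell
    c_t = f_t * c_prev + i_t * c_hat                               # element-wise memory update
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t
```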

The difference between ON-LSTM and the standard LSTM lies in the update of \(c_t\), i.e. the last formula above. The forget gate \(f_t\) and the input gate \(i_t\) control the update of the memory cell \(c_t\), but these gates act independently on each neuron; what the paper actually improves is the forget gate and the input gate.

Activation function cumax ()

To distinguish high-order neurons from low-order neurons and update them in different ways, we first need a boundary between the two. The paper's approach is to generate an n-hot vector \(g = (0, \dots, 0, 1, \dots, 1)\), which is divided into two segments: a segment of all 0s followed by a segment of all 1s. The model can then apply different update rules to the two segments.

To obtain such a vector, the paper introduces the cumsum function, which computes a cumulative sum. Applied to a one-hot vector, cumsum turns it into a vector consisting of a segment of 0s followed by a segment of 1s, for example
\[ \text{cumsum}((0,0,1,0,0)) = (0,0,1,1,1) \]
So generating the n-hot vector above reduces to generating a one-hot vector, i.e. finding a split point (the position of the first 1). However, this split point is a discrete value, so gradients cannot flow through it, and the authors soften it by working with probabilities instead. Specifically, the probability that the split point falls at position \(d\) can be written as
\[ p(d) = \text{softmax}(\dots) \]
Since \(g\) is generated by cumsum, the probability that the \(k\)-th position of \(g\) equals 1 should be the cumulative sum of the probabilities of the first \(k\) positions, i.e.
\[ p(g_k = 1) = p(d \leq k) = \sum_{i \leq k} p(d = i) \]
Thus the resulting vector can be generated with the cumax() activation function proposed in the paper:
\[ \hat{g} = \text{cumax}(\dots) = \text{cumsum}(\text{softmax}(\dots)) = \text{cumsum}((p(1), p(2), \dots, p(k), \dots)) \]
The softmax probabilities can be predicted by a learnable network, so the paper turns the problem of finding the split point into a probability prediction problem.
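A minimal sketch of cumax() in PyTorch, assuming the logits come from a linear layer as in the formulas above; the example values only illustrate the near-binary step it produces.

```python
import torch
import torch.nn.functional as F

def cumax(logits, dim=-1):
    # cumax(x) = cumsum(softmax(x)): a monotonically increasing vector in [0, 1]
    # that softly approximates the hard (0, ..., 0, 1, ..., 1) pattern.
    return torch.cumsum(F.softmax(logits, dim=dim), dim=dim)

# A strongly peaked softmax gives a near-binary step at the split point:
logits = torch.tensor([-5., -5., 10., -5., -5.])
print(cumax(logits))  # approximately (0, 0, 1, 1, 1)
```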

Structured gating mechanism

Based on the cumax() activation function above, the paper introduces a master forget gate \(\tilde{f}_t\) and a master input gate \(\tilde{i}_t\):
\[ \tilde{f}_t = \text{cumax}(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}}) \\ \tilde{i}_t = 1 - \text{cumax}(W_{\tilde{i}} x_t + U_{\tilde{i}} h_{t-1} + b_{\tilde{i}}) \]
With these formulas, both master gates produce monotonic vectors, but the master forget gate increases from 0 to 1 while the master input gate decreases from 1 to 0. Using these two gates, the memory cell is updated as follows:
\[ w_t = \tilde{f}_t \circ \tilde{i}_t \\ \hat{f}_t = f_t \circ w_t + (\tilde{f}_t - w_t) = \tilde{f}_t \circ (f_t \circ \tilde{i}_t + 1 - \tilde{i}_t) \\ \hat{i}_t = i_t \circ w_t + (\tilde{i}_t - w_t) = \tilde{i}_t \circ (i_t \circ \tilde{f}_t + 1 - \tilde{f}_t) \\ c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t \]
Next let us look at how to understand these update rules. For simplicity, assume the master forget gate \(\tilde{f}_t\) is a hard vector of the form \((0, \dots, 0, 1, \dots, 1)\), and correspondingly the master input gate \(\tilde{i}_t\) is a vector of the form \((1, \dots, 1, 0, \dots, 0)\).

Here \(w_t\) is the intersection (overlap) of \(\tilde{f}_t\) and \(\tilde{i}_t\), so it should have the form \((0, \dots, 0, 1, \dots, 1, 0, \dots, 0)\) (it may also contain no 1s at all). We consider the two cases:

When \(w_t\) is all 0s, i.e. the two gates have no intersection, we have:
\[ \hat{f}_t = \tilde{f}_t \\ \hat{i}_t = \tilde{i}_t \\ c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t = \tilde{f}_t \circ c_{t-1} + \tilde{i}_t \circ \hat{c}_t \]
In this case the memory cell is updated as shown in the left part of the figure: \(\tilde{f}_t\) copies the high-order information of \(c_{t-1}\) into \(c_t\), \(\tilde{i}_t\) copies the low-order information of \(\hat{c}_t\) into \(c_t\), and the middle region, covered by neither gate, receives nothing.

When \(w_t\) is not all 0s, i.e. the two gates overlap, we have:
\[ c_t = (\tilde{f}_t - w_t) \circ c_{t-1} + (\tilde{i}_t - w_t) \circ \hat{c}_t + [f_t \circ w_t \circ c_{t-1} + i_t \circ w_t \circ \hat{c}_t] \]
In this case the memory cell is updated as shown in the right part of the figure, and the update splits into three segments. The master forget gate and the master input gate play the same roles as before, but in the overlapping region the two gates act together and the update degenerates into the standard LSTM form.

The master forget gate \(\tilde{f}_t\) controls the erasing of memory, and its split point is \(d_f\); a larger \(d_f\) means that more high-level information is erased in this update. The master input gate \(\tilde{i}_t\) controls the writing of memory, and its split point is \(d_i\); a larger \(d_i\) means that more of the current local information gets a longer life cycle. \(w_t\) lies in the overlap of the two gates, containing both information from the previous step and information from the current input, so this part is processed with the standard LSTM rule.
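Putting these rules together, here is a minimal sketch of the memory update (not the authors' released implementation), assuming all gates have already been computed and share the same shape.

```python
import torch

def on_lstm_memory_update(f_t, i_t, c_hat, c_prev, f_master, i_master):
    """ON-LSTM memory update from the equations above.
    f_master is increasing (0 -> 1), i_master is decreasing (1 -> 0)."""
    w_t = f_master * i_master                    # overlap of the two master gates
    f_hat = f_t * w_t + (f_master - w_t)         # keep old memory only above the forget split point
    i_hat = i_t * w_t + (i_master - w_t)         # write new input only below the input split point
    c_t = f_hat * c_prev + i_hat * c_hat         # inside the overlap this reduces to a standard LSTM
    return c_t
```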

Because the master gates only provide coarse-grained control over the memory, computing them at the full hidden-state dimension would add a large and unnecessary amount of computation. Therefore the paper actually defines the dimension of the master gates as \(D_m = \dfrac{D}{C}\), where \(D\) is the hidden-state dimension and \(C\) is a chunk size factor. Before the element-wise multiplication with \(f_t\) and \(i_t\), each neuron of the master gates is repeated \(C\) times to recover dimension \(D\). This down-sampling effectively reduces the number of parameters of ON-LSTM. With this scheme, instead of one master-gate value per neuron, every group of \(C\) consecutive neurons shares one master-gate value.
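A small sketch of this down-sampling, assuming PyTorch and illustrative values of \(D\) and \(C\); `repeat_interleave` expands each master-gate value to the \(C\) consecutive neurons that share it.

```python
import torch
import torch.nn.functional as F

D, C = 12, 4                        # hidden size and chunk size factor (illustrative values)
D_m = D // C                        # the master gates are computed at this reduced dimension

logits_f = torch.randn(1, D_m)      # would come from W x_t + U h_{t-1} + b in the model
f_master = torch.cumsum(F.softmax(logits_f, dim=-1), dim=-1)    # shape (1, D_m)
f_master_full = torch.repeat_interleave(f_master, C, dim=-1)    # shape (1, D): C neurons share one gate value
```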

Experiments

The paper conducts experiments on four tasks: language modeling, unsupervised constituency parsing, targeted syntactic evaluation, and logical inference. The performance on the first task is shown below:

It is worth highlighting the unsupervised constituency parsing task. Its evaluation compares the tree structure inferred by the model against human-annotated parse trees. To infer a tree structure with a pre-trained model, the paper first initializes the hidden state to all zeros and then feeds the sentence into the model. At each time step, the expectation of \(d_f\) is computed:
\[ \hat{d}_f = \mathbb{E}[d_f] = \sum_{k=1}^{D_m}k p_f (d_t=k) = \sum_{k=1}^{D_m} \sum_{i=1}^k p_f(d_t = k) = D_m - \sum_{k=1}^{D_m} \tilde{f}_{tk} \]
where \(p_f\) is the probability distribution over the split point of the master forget gate, and \(D_m\) is the size of the (down-sampled) hidden state. Given \(\hat{d}_f\), a top-down greedy algorithm can be used for parsing. First sort \(\{\hat{d}_f\}\); for the largest \(\hat{d}_f\) in the sequence, at position \(i\), split the sentence into \(((x_{<i}), (x_i, (x_{>i})))\), then apply the same procedure recursively to \((x_{<i})\) and \((x_{>i})\), until each part contains only one word.
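A minimal sketch of this top-down greedy procedure in plain Python, assuming the per-word expected depths \(\hat{d}_f\) have already been computed with the formula above; tie-breaking and bracketing details may differ from the authors' released code.

```python
def build_tree(depths, words):
    """Greedily split at the position with the largest expected
    forget depth, then recurse on the two remaining spans."""
    if len(words) <= 1:
        return words[0] if words else []
    i = max(range(len(depths)), key=lambda k: depths[k])   # position of the largest depth
    left = build_tree(depths[:i], words[:i])
    right = build_tree(depths[i + 1:], words[i + 1:])
    if not words[:i]:                       # no left span
        return [words[i], right]
    if not words[i + 1:]:                   # no right span
        return [left, words[i]]
    return [left, [words[i], right]]        # ((x_<i), (x_i, (x_>i)))

# Hypothetical depths for a three-word sentence:
print(build_tree([1.0, 3.0, 2.0], ['the', 'cat', 'sat']))   # ['the', ['cat', 'sat']]
```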
