Semi-supervised learning for Text Classification by Layer Partitioning

This is a short paper on arXiv. I picked it up because the title mentioned semi-supervised text classification, which caught my attention. After reading it, the contribution turns out to be fairly small, but the idea is actually quite nice.

Most semi-supervised methods apply a small perturbation to the input vector or to some representation of it. This works well in computer vision but is not well suited to discrete text. For text input, the paper splits the neural network \(M\) into two parts: \(M = U \circ F\), where \(F\) is frozen (freeze) and serves as a feature extractor that adds noise via dropout, and \(U\) can be any semi-supervised algorithm. The paper also gradually unfreezes (unfreeze) \(F\) to avoid catastrophic forgetting of the pre-trained model.

Introduction

Most semi-supervised algorithms rely on a consistency or smoothness constraint, forcing the model to make the same prediction on an input and on a slightly perturbed version of that input. In computer vision this works because an image can be represented as a dense continuous vector, but in text classification each word is represented in one-hot form, so the approach does not carry over directly. Even with word embeddings, the underlying representation of text is still discrete. Moreover, perturbing each word independently may leave the perturbed words meaningless.

To address this problem, the paper proposes splitting the neural network into two parts, \(M = U \circ F\), where \(F\) acts as the encoder and noisy feature function (a pre-trained language model can be used here), and \(U\) can be any semi-supervised algorithm. \(F\) is usually domain-independent, while \(U\) is domain-specific. This is also why the paper's title speaks of layer partitioning.
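A minimal sketch of the layer-partitioning idea, assuming a PyTorch-style setup; the class and attribute names (`PartitionedModel`, `head`) are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class PartitionedModel(nn.Module):
    """M = U . F : a frozen encoder F followed by a trainable head U."""
    def __init__(self, encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.F = encoder   # pre-trained, domain-independent (e.g. an LM encoder)
        self.U = head      # domain-specific part, trained semi-supervised
        for p in self.F.parameters():   # F is frozen at the start
            p.requires_grad = False

    def forward(self, x):
        z = self.F(x)      # features; dropout inside F supplies the noise
        return self.U(z)
```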

Method

The upper-left part of the figure is a schematic of the whole model. The paper uses the ULMFiT encoder as \(F\): it maps each input into a continuous vector space, and then \(U\) (the \(\Pi\)-model, Temporal Ensembling, etc.) learns on top of these features.

At the same time, \(F\) is also used to add noise to the input. However, instead of the common additive form \(\tilde{x} \leftarrow x + \epsilon\), the authors use dropout as the noise. Their reasoning is that since \(F\) is pre-trained on a general domain, noise added this way stays closer to the information in the text, whereas a direct perturbation could completely change the meaning of the text, for example turning "happy" into "sad".
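A minimal sketch of the dropout-as-noise idea, assuming the frozen encoder keeps its dropout layers active so that two passes over the same input give two different (noisy) feature vectors; the function name `noisy_features` is hypothetical:

```python
import torch
import torch.nn as nn

def noisy_features(encoder: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Two calls with the same x yield two stochastic views, because dropout
    inside the encoder (the paper's F) is left active instead of using x + eps."""
    encoder.train()            # keep dropout on, even for unlabeled data
    with torch.no_grad():      # encoder is frozen; no gradients flow through it
        return encoder(x)
```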

Next comes the training of \(U\). The paper uses two models, the \(\Pi\)-model and Temporal Ensembling. Both are semi-supervised learning algorithms; the right part of the figure above shows their schematics.
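A minimal sketch of the \(\Pi\)-model consistency loss, assuming cross-entropy on labeled examples plus a mean-squared consistency term between two noisy forward passes; the unlabeled-marking convention (`y = -1`) and variable names are illustrative:

```python
import torch
import torch.nn.functional as F_fn

def pi_model_loss(model, x, y, w_consistency: float):
    """Pi-model: two stochastic passes (different dropout masks), supervised CE
    on labeled examples plus an MSE consistency term on all examples."""
    logits_a = model(x)            # first noisy pass
    logits_b = model(x)            # second noisy pass, different dropout mask
    consistency = F_fn.mse_loss(F_fn.softmax(logits_a, dim=-1),
                                F_fn.softmax(logits_b, dim=-1))
    labeled = y >= 0               # convention: y = -1 marks unlabeled examples
    supervised = (F_fn.cross_entropy(logits_a[labeled], y[labeled])
                  if labeled.any() else logits_a.new_zeros(()))
    return supervised + w_consistency * consistency
```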

Once training has progressed far enough, the authors propose gradually unfreezing \(F\). The reason is that \(U\) has already saturated from training on \(\{F(x)\}\), and unfreezing allows \(F\) to also learn some task-specific features.
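A minimal sketch of gradual unfreezing, assuming the encoder exposes its layers as an ordered list and that one layer is unfrozen every few epochs starting from the top; the schedule is illustrative, not the paper's exact one:

```python
def gradual_unfreeze(encoder_layers, epoch: int, start_epoch: int, every: int):
    """Unfreeze encoder layers one at a time, top layer first, once the
    semi-supervised head has saturated (i.e. after start_epoch)."""
    if epoch < start_epoch:
        return
    n_unfrozen = (epoch - start_epoch) // every + 1
    for layer in encoder_layers[-n_unfrozen:]:   # highest layers first
        for p in layer.parameters():
            p.requires_grad = True
```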

Experiments

The paper evaluates on the Internet Movie Database (IMDb) and TREC-6 datasets, mainly sentiment classification.

Origin: www.cnblogs.com/weilonghu/p/11947122.html