Notes reposted from the GitHub project: https://github.com/NLP-LOVE/Introduction-NLP
13. Deep learning and natural language processing
13.1 Limitations of traditional methods
Earlier chapters covered traditional machine learning models such as the hidden Markov model, the perceptron, conditional random fields (CRFs), naive Bayes, and support vector machines. To apply these models to NLP, we relied on feature extraction methods such as feature templates, TF-IDF, and bag-of-words vectors. These methods are limited in the following ways:
Data sparsity
First, traditional machine learning methods handle data sparsity poorly, and the problem is especially prominent in natural language processing. Language is a discrete symbol system: every character and every word is a discrete random variable. We usually convert text to vectors with one-hot encoding, where a one-hot vector is a binary vector in which exactly one element is 1 and all others are 0. For example:
Country feature: ["China", "United States", "France"] (N = 3)
China => 100
United States => 010
France => 001
This country feature has only three values. With hundreds of thousands of values, almost every element of each vector would be 0, which is what data sparsity looks like in practice.
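The one-hot encoding above can be sketched in a few lines of Python. The helper name one_hot and the 100,000-word vocabulary are made up for illustration; the point is only to show how the share of zeros grows with the vocabulary.

```python
def one_hot(value, vocabulary):
    """Return a list with a single 1 at the index of `value` in `vocabulary`."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1
    return vector


countries = ["China", "United States", "France"]
for country in countries:
    print(country, "=>", one_hot(country, countries))

# Hypothetical large vocabulary: every vector still contains exactly one 1,
# so almost all of its elements are 0.
large_vocabulary = [f"word{i}" for i in range(100_000)]
vector = one_hot("word42", large_vocabulary)
zero_share = vector.count(0) / len(vector)
print(f"share of zeros: {zero_share:.5f}")
```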
Feature templates
Language is highly complex. In Chinese, radicals make up characters, characters form words, words form phrases, phrases form sentences, sentences form paragraphs, and paragraphs form documents. As the granularity grows level by level, the meaning expressed becomes more and more complex.
Feature templates run into the same data sparsity problem: a given word may be common, but a given combination of two words is much rarer, and a combination of three words rarer still. Many features occur only once in the training set, and a feature that occurs only once is statistically meaningless.
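A small sketch can make this effect visible. It uses a made-up three-sentence English corpus (not data from these notes) and measures the share of distinct n-grams that occur exactly once; the helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def singleton_ratio(sentences, n):
    """Share of distinct n-grams that occur exactly once in the corpus."""
    counts = Counter(gram for tokens in sentences for gram in ngrams(tokens, n))
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(counts)


# Toy corpus, invented for illustration only.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]

for n in (1, 2, 3):
    print(f"{n}-grams seen only once: {singleton_ratio(sentences, n):.0%}")
```

Even on this tiny corpus the share of one-off features rises as the template combines more words, which is the sparsity that feature templates run into.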
Error propagation
Real-world projects usually combine several natural language processing modules. In sentiment analysis, for example, the text first goes through word segmentation and then part-of-speech tagging; unimportant words are filtered out by part of speech, and the remainder is passed to a classifier such as naive Bayes or a support vector machine for prediction.
This pipeline approach suffers from serious error propagation: an error made by an upstream module is fed into the next module and produces a larger error there, which makes the whole system fragile.
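As a rough back-of-the-envelope illustration, suppose each stage is correct with some fixed probability and that the final answer is right only when every stage is right. Both the independence assumption and the accuracy figures below are hypothetical, not measurements from these notes.

```python
import math

# Hypothetical per-stage accuracies, assumed independent of each other.
stage_accuracy = {
    "word segmentation": 0.97,
    "part-of-speech tagging": 0.95,
    "classification": 0.90,
}

# Under this simplified model the pipeline is right only if all stages are right.
pipeline_accuracy = math.prod(stage_accuracy.values())
print(f"pipeline accuracy: {pipeline_accuracy:.3f}")
print(f"weakest single stage: {min(stage_accuracy.values()):.3f}")
```

Under these assumptions the pipeline as a whole is less accurate than its weakest stage, which is one way to see why stacking modules makes the system fragile.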
13.2 Deep learning and its advantages
To address the data sparsity, hand-crafted feature templates, and error propagation that trouble traditional machine learning in natural language processing, researchers turned to another branch of machine learning research: deep learning.
Deep learning
Deep learning (DL) is a form of representation learning: a learning paradigm that uses a model with a certain "depth" to learn vector representations of things automatically. At present, the models used in deep learning are mainly neural networks with more than one layer. In traditional machine learning, the vector representation of an object is a sparse binary vector extracted with hand-crafted feature templates; in deep learning, the feature template is replaced by a multilayer perceptron (MLP). Once the object is represented as a vector, the downstream classifier can be a single-layer perceptron or a similar model, and from that point on deep learning is no different from traditional methods. So there is nothing mysterious about deep learning: in essence, it uses an MLP to extract vectors.
For the principles of deep learning, see the detailed description in an earlier post on my blog:
Dense vectors solve data sparsity
The output of the neural network is a feature vector h for the sample x. Because we are free to choose the size of the hidden layer, we also control the length of h. Even if the input layer is a one-hot vector as large as the vocabulary, with up to hundreds of thousands of dimensions, the hidden-layer feature vector can still be kept small, for example 100 dimensions.
Such a 100-dimensional vector is an abstract representation of a word or other sample and carries highly condensed information. Because these low-dimensional vectors live in the same space, we can easily train classifiers to learn the similarity between words, between documents, or between pictures, and even between a picture and a document. All of this comes from representation learning and is hard to achieve with traditional machine learning methods.
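The dimensionality reduction described above can be sketched as follows. This is a simplified linear hidden layer with random, untrained weights and a hypothetical vocabulary of 10,000 words; it only shows that a one-hot input multiplied by the weight matrix selects one dense 100-dimensional row.

```python
import random

vocabulary_size = 10_000  # hypothetical; real vocabularies can be far larger
hidden_size = 100

random.seed(0)
# Untrained weight matrix W with one row per vocabulary entry.
W = [[random.gauss(0.0, 0.1) for _ in range(hidden_size)]
     for _ in range(vocabulary_size)]


def hidden_vector(one_hot_input, weights):
    """Compute x · W for a row vector x, skipping the zero entries of x."""
    h = [0.0] * len(weights[0])
    for row_index, x_value in enumerate(one_hot_input):
        if x_value:  # for a one-hot input only a single row contributes
            for j, weight in enumerate(weights[row_index]):
                h[j] += x_value * weight
    return h


word_index = 1234
x = [0] * vocabulary_size
x[word_index] = 1

h = hidden_vector(x, W)
print(len(x), "input dimensions ->", len(h), "hidden dimensions")
```

Because the product simply picks out row word_index of W, implementations usually store the rows as an embedding table and look them up directly instead of multiplying.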
Multilayer networks extract feature representations automatically
Adjacent layers of a neural network are usually fully connected (a fully connected layer), so the connections do not need to be designed for each specific problem. The weight matrix of the hidden layer is adjusted automatically according to the gradient of the loss function, so the MLP learns the hidden-layer feature representation on its own.
The process requires no human intervention at all, which means that, in theory, deep learning makes feature templates unnecessary.
End-to-end design
Because vectors are the common "language" passed between the layers of a neural network and between different neural networks, deep learning engineers can easily combine several networks into an end-to-end design. Take the sentiment analysis case mentioned earlier: one of the simplest solutions is to feed the one-hot vector of each character of a document into a neural network to obtain a feature vector for the whole document, and then feed that feature vector into a multinomial logistic regression classifier to classify the sentiment polarity of the document.
The whole process needs neither Chinese word segmentation nor stop-word filtering. The neural network reads the whole document character by character, in order, much as a person does, so it sees all of the input.
13.3 word2vec
As the bridge between traditional machine learning and deep learning, word vectors have long been the first stop when getting started with deep learning. There are many ways to train word vectors. word2vec is one of the best known; others include fastText, GloVe, BERT, and the recently popular XLNet.
The principles of word2vec are described in detail in a post on my blog, see:
Training word vectors
With the basic principles of word vectors understood, this section shows how to call the word vector module implemented in HanLP. The module accepts training data as plain text with words separated by spaces; the MSR corpus is used here as an example. The training code is as follows (the corpus is downloaded automatically):
from pyhanlp import *
import zipfile
import os
from pyhanlp.static import download, remove_file, HANLP_DATA_PATH


def test_data_path():
    """
    Return the test data path, located at $root/data/test.
    The root directory is specified by the configuration file.
    :return: path of the test data directory
    """
    data_path = os.path.join(HANLP_DATA_PATH, 'test')
    if not os.path.isdir(data_path):
        os.mkdir(data_path)
    return data_path


# Check whether the corpus exists; download it automatically if it does not
def ensure_data(data_name, data_url):
    root_path = test_data_path()
    dest_path = os.path.join(root_path, data_name)
    if os.path.exists(dest_path):
        return dest_path

    if data_url.endswith('.zip'):
        dest_path += '.zip'
    download(data_url, dest_path)
    if data_url.endswith('.zip'):
        # Extract the archive, delete it, and return the extracted directory
        with zipfile.ZipFile(dest_path, "r") as archive:
            archive.extractall(root_path)
        remove_file(dest_path)
        dest_path = dest_path[:-len('.zip')]
    return dest_path


sighan05 = ensure_data('icwb2-data', 'http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip')
msr_train = os.path.join(sighan05, 'training', 'msr_training.utf8')

# ===============================================
# word2vec starts here

IOUtil = JClass('com.hankcs.hanlp.corpus.io.IOUtil')
DocVectorModel = JClass('com.hankcs.hanlp.mining.word2vec.DocVectorModel')
Word2VecTrainer = JClass('com.hankcs.hanlp.mining.word2vec.Word2VecTrainer')
WordVectorModel = JClass('com.hankcs.hanlp.mining.word2vec.WordVectorModel')

# Demonstrates training and using word vectors
TRAIN_FILE_NAME = msr_train
MODEL_FILE_NAME = os.path.join(test_data_path(), "word2vec.txt")


def train_or_load_model():
    # Train only when no model file exists yet; otherwise load the saved model
    if not IOUtil.isFileExisted(MODEL_FILE_NAME):
        if not IOUtil.isFileExisted(TRAIN_FILE_NAME):
            raise RuntimeError("Corpus not found. See the documentation for how to obtain and format it: https://github.com/hankcs/HanLP/wiki/word2vec")
        trainerBuilder = Word2VecTrainer()
        return trainerBuilder.train(TRAIN_FILE_NAME, MODEL_FILE_NAME)
    return load_model()


def load_model():
    return WordVectorModel(MODEL_FILE_NAME)


wordVectorModel = train_or_load_model()  # call the function to train word2vec
Semantic similarity between words
Once you have word vectors, the most basic application is to find the N words closest in meaning to a given word.
# Print the words most semantically similar to a given word
def print_nearest(word, model):
    print(
        "\n Word "
        "Cosine\n------------------------------------------------------------------------")
    for entry in model.nearest(word):
        print("%50s\t\t%f" % (entry.getKey(), entry.getValue()))


print_nearest("上海", wordVectorModel)
print_nearest("美丽", wordVectorModel)
print_nearest("购买", wordVectorModel)
print(wordVectorModel.similarity("上海", "广州"))
The results are as follows:
Word        Cosine
------------------------------------------------------------------------
广州        0.616240
天津        0.564681
西安        0.500929
抚顺        0.456107
深圳        0.454190
浙江        0.446069
杭州        0.434974
江苏        0.429291
广东        0.407300
南京        0.404509

Word        Cosine
------------------------------------------------------------------------
装点        0.652887
迷人        0.648911
恬静        0.634712
绚丽        0.634530
憧憬        0.616118
葱翠        0.612149
宁静        0.599068
清新        0.592581
纯真        0.589360
景色        0.585169

Word        Cosine
------------------------------------------------------------------------
购          0.521070
购得        0.500480
选购        0.483097
购置        0.480335
采购        0.469803
出售        0.469185
低收入      0.461131
分期付款    0.458573
代销        0.456689
高价        0.456320

0.6162400245666504
The Cosine column is the cosine similarity between the two words, a value between -1 and 1.
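For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is the textbook definition on plain Python lists; whether HanLP computes it internally in exactly this way is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal, 0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction, close to -1
```

The value depends only on the angle between the two vectors, not on their lengths, which is why it always falls between -1 and 1.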
Word analogy
Subtracting the vectors of two words yields a new vector. Taking the dot product with this vector gives the similarity between a word and the difference between those two words. A common English example is king - man + woman = queen, which suggests that some dimensions of a word vector may store information about royal status while others store information about gender.
# param A: the word to add
# param B: the word to subtract
# param C: the word to add
# return: the words semantically closest to (A - B + C), with their similarities
print(wordVectorModel.analogy("日本", "自民党", "共和党"))
The results are as follows:
[美国=0.71801066, 德米雷尔=0.6803682, 美国国会=0.65392816, 布什=0.6503047, 华尔街日报=0.62903535, 国务卿=0.6280117, 舆论界=0.6277531, 白宫=0.6175594, 驳斥=0.6155998, 最惠国待遇=0.6062231]
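The A - B + C arithmetic behind the analogy call can be sketched with hand-made vectors. The two dimensions below (roughly "royalty" and "gender") and all values are invented for illustration; trained vectors are not this tidy. Excluding the three query words from the candidates is also an assumption of this sketch, not a statement about HanLP.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c, vectors, top_n=3):
    """Rank the remaining words by cosine similarity to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [
        (word, cosine_similarity(target, vector))
        for word, vector in vectors.items()
        if word not in (a, b, c)  # assumption: query words are not candidates
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hand-made toy vectors: [royalty, gender]
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "prince": [0.8, 0.9],
    "apple": [-0.6, 0.1],
}

ranking = analogy("king", "man", "woman", toy_vectors)
print(ranking[0][0])  # the nearest remaining word to king - man + woman
```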
Short text similarity
Averaging the word vectors of all the words in a short text gives a dense vector that represents the text. We can then measure the similarity between any two short texts.
# Document vectors
docVectorModel = DocVectorModel(wordVectorModel)
documents = ["山东苹果丰收",
             "农民在江苏种水稻",
             "奥运会女排夺冠",
             "世界锦标赛胜出",
             "中国足球失败", ]

print(docVectorModel.similarity("山东苹果丰收", "农民在江苏种水稻"))
print(docVectorModel.similarity("山东苹果丰收", "世界锦标赛胜出"))
print(docVectorModel.similarity(documents[0], documents[1]))
print(docVectorModel.similarity(documents[0], documents[4]))
The results are as follows:
0.6743720769882202
0.018603254109621048
0.6743720769882202
-0.11777809262275696
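The averaging idea can be sketched independently of HanLP. The toy vectors below are invented (two dimensions, roughly "agriculture" and "sport"), the texts are assumed to be already segmented into words, and skipping words without a vector is a choice made for this sketch.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def average_vector(words, word_vectors):
    """Average the vectors of the words that have one; unknown words are skipped."""
    known = [word_vectors[word] for word in words if word in word_vectors]
    if not known:
        raise ValueError("no word in the text has a vector")
    return [sum(values) / len(known) for values in zip(*known)]


# Invented toy vectors: [agriculture, sport]
toy_vectors = {
    "apple": [0.9, 0.1],
    "harvest": [0.8, 0.0],
    "farmer": [1.0, 0.1],
    "rice": [0.9, 0.0],
    "volleyball": [0.0, 0.9],
    "champion": [0.1, 1.0],
}

doc_apple = average_vector(["apple", "harvest"], toy_vectors)
doc_rice = average_vector(["farmer", "rice"], toy_vectors)
doc_sport = average_vector(["volleyball", "champion"], toy_vectors)

sim_related = cosine_similarity(doc_apple, doc_rice)
sim_unrelated = cosine_similarity(doc_apple, doc_sport)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```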
Similarly, you can call the nearest interface to query the documents most similar to a given word or short text:
def print_nearest_document(document, documents, model):
    print_header(document)
    # entry.getKey() is the document id passed to addDocument below
    for entry in model.nearest(document):
        print("%50s\t\t%f" % (documents[entry.getKey()], entry.getValue()))


def print_header(query):
    print(
        "\n%50s Cosine\n------------------------------------------------------------------------" % (query))


# Register each document under its list index
for i, d in enumerate(documents):
    docVectorModel.addDocument(i, d)

print_nearest_document("体育", documents, docVectorModel)
print_nearest_document("农业", documents, docVectorModel)
print_nearest_document("我要看比赛", documents, docVectorModel)
print_nearest_document("要不做饭吧", documents, docVectorModel)
The results are as follows:
体育                  Cosine
------------------------------------------------------------------------
世界锦标赛胜出        0.256444
奥运会女排夺冠        0.206812
中国足球失败          0.165934
山东苹果丰收          -0.037693
农民在江苏种水稻      -0.047260

农业                  Cosine
------------------------------------------------------------------------
农民在江苏种水稻      0.393115
山东苹果丰收          0.259620
中国足球失败          -0.008700
世界锦标赛胜出        -0.063113
奥运会女排夺冠        -0.137968

我要看比赛            Cosine
------------------------------------------------------------------------
奥运会女排夺冠        0.531833
世界锦标赛胜出        0.357246
中国足球失败          0.268507
山东苹果丰收          0.000207
农民在江苏种水稻      -0.022467

要不做饭吧            Cosine
------------------------------------------------------------------------
农民在江苏种水稻      0.232754
山东苹果丰收          0.199197
奥运会女排夺冠        -0.166378
世界锦标赛胜出        -0.179484
中国足球失败          -0.229308
13.4 A high-performance dependency parser based on neural networks
Arc-Standard transition system
Unlike the Arc-Eager system introduced earlier, this dependency parser is based on the Arc-Standard transition system, whose actions are as follows:
Action name | Condition | Explanation
Shift | Queue β is non-empty | Push the first word i of the queue onto the stack
LeftArc | The stack has a second word | Set the head of the second word i on the stack to the top word j, i.e. i becomes a child of j
RightArc | | Set the head of the top word j on the stack to the second word i, i.e. j becomes a child of i

The two transition systems follow different logic: Arc-Eager builds the tree top-down, whereas Arc-Standard requires right subtrees to be built bottom-up. Both have O(n) complexity, but Arc-Standard is the more popular of the two, possibly because of its simplicity (it has fewer transition actions).
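The three actions can be simulated on a small sentence. This is a minimal unlabeled sketch, not HanLP's implementation: it assumes the stack starts with a virtual ROOT, that the words of the sentence are unique, and that LeftArc may not attach ROOT as a child; the function name parse and the example sentence are made up.

```python
def parse(words, actions):
    """Apply a sequence of Arc-Standard actions and return a {dependent: head} map."""
    stack = ["ROOT"]      # assumption: the stack initially holds a virtual root
    queue = list(words)
    heads = {}
    for action in actions:
        if action == "Shift":
            if not queue:
                raise ValueError("Shift requires a non-empty queue")
            stack.append(queue.pop(0))
        elif action == "LeftArc":
            # Assumption: the second word on the stack must not be the virtual root.
            if len(stack) < 2 or stack[-2] == "ROOT":
                raise ValueError("LeftArc needs a non-root second word on the stack")
            dependent = stack.pop(-2)     # second word i becomes a child of top word j
            heads[dependent] = stack[-1]
        elif action == "RightArc":
            if len(stack) < 2:
                raise ValueError("RightArc needs two words on the stack")
            dependent = stack.pop()       # top word j becomes a child of second word i
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"unknown action: {action}")
    return heads


actions = ["Shift", "Shift", "LeftArc", "Shift", "RightArc", "RightArc"]
heads = parse(["He", "eats", "apples"], actions)
for dependent, head in heads.items():
    print(f"{dependent} <- {head}")
```

In this sketch every word is shifted once and attached once, so a sentence of n words takes 2n actions, which matches the O(n) complexity mentioned above.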
Feature extraction
Although neural networks can in theory extract features automatically, the pioneering paper behind this parser still did not break away from feature templates. All features fall into three categories:
- Word features.
- Part-of-speech features.
- Dependency label features from the subtrees that have already been determined.
Next, the parser extracts these three categories of features from the current state, denoted w, t and l. Unlike in traditional methods, each feature is assigned a vector, which yields three dense vectors x^w, x^t and x^l. These three vectors are concatenated and fed into a neural network with one hidden layer that uses the cube function as its activation, giving the hidden-layer feature vector:
\[ h = \left( W_{1} \left( x^{w} \oplus x^{t} \oplus x^{l} \right) \right)^{3} \]
Next, for k dependency labels, Arc-Standard has a total of 2k + 1 possible transition actions. Feeding the feature vector h into a multinomial logistic regression classifier (which can be viewed as the output layer of the neural network) gives the probability distribution over transition actions:
\[ p = \mathrm{softmax}\left( W_{2} h \right) \]
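The two formulas can be sketched as a forward pass in plain Python. The dimensions, the random untrained weights, and the helper names are all made up for illustration; like the formulas above, the sketch has no bias terms.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    largest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def forward(x_w, x_t, x_l, W1, W2):
    x = x_w + x_t + x_l                   # list concatenation plays the role of ⊕
    h = [z ** 3 for z in matvec(W1, x)]   # cube activation
    p = softmax(matvec(W2, h))            # distribution over transition actions
    return h, p


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
k = 2                                     # hypothetical number of dependency labels
num_actions = 2 * k + 1                   # Shift plus k LeftArc and k RightArc actions
dim_w, dim_t, dim_l, hidden = 4, 3, 3, 5  # hypothetical sizes

W1 = random_matrix(hidden, dim_w + dim_t + dim_l)
W2 = random_matrix(num_actions, hidden)
x_w = [random.uniform(-1, 1) for _ in range(dim_w)]
x_t = [random.uniform(-1, 1) for _ in range(dim_t)]
x_l = [random.uniform(-1, 1) for _ in range(dim_l)]

h, p = forward(x_w, x_t, x_l, W1, W2)
best_action = max(range(num_actions), key=lambda action: p[action])
print("probabilities:", [round(value, 3) for value in p])
print("chosen action index:", best_action)
```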
Finally, the transition action with the highest probability in p is selected and executed. Training uses the softmax cross-entropy loss function with stochastic gradient descent as the optimization method.
Implementation code
from pyhanlp import *

CoNLLSentence = JClass('com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence')
CoNLLWord = JClass('com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord')
IDependencyParser = JClass('com.hankcs.hanlp.dependency.IDependencyParser')
NeuralNetworkDependencyParser = JClass('com.hankcs.hanlp.dependency.nnparser.NeuralNetworkDependencyParser')

parser = NeuralNetworkDependencyParser()
sentence = parser.parse("徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标。")
print(sentence)

for word in sentence.iterator():  # use dir() to inspect the methods of sentence
    print("%s --(%s)--> %s" % (word.LEMMA, word.DEPREL, word.HEAD.LEMMA))
print()

# You can also get the array directly and traverse it in any order, including reverse
word_array = sentence.getWordArray()
for word in word_array:
    print("%s --(%s)--> %s" % (word.LEMMA, word.DEPREL, word.HEAD.LEMMA))
print()

# You can also walk a subtree directly, from one of its nodes all the way up to the virtual root
head = word_array[12]
while head.HEAD:
    head = head.HEAD
    if head == CoNLLWord.ROOT:
        print(head.LEMMA)
    else:
        print("%s --(%s)--> " % (head.LEMMA, head.DEPREL))
For the definitions of more dependency relations, see Chinese Dependency Treebank 1.0.
13.5 Conclusion
Natural language processing is a fast-changing discipline, especially in the era of deep learning. In academia, even the current state of the art may be surpassed within as little as two months. The knowledge in this series only brings readers to an entry level.
Two feature extractors are common in neural networks: the recurrent neural network (RNN) for sequential data and the convolutional neural network (CNN) for spatial data. Of the two, the RNN is more widely used in natural language processing. An RNN can handle variable-length input, which suits text directly. In particular, the LSTM network from the RNN family can remember roughly 200 words, which makes it possible to model long-distance dependencies between the words of a sentence. The drawback of the RNN is that it is hard to parallelize. If the goal is to capture the n-grams in a text, a CNN does better, and it has a natural advantage in parallelization. Since documents are generally long, many document classification models are built with CNNs. Sentences are relatively short, so sentence-level NLP tasks (Chinese word segmentation, part-of-speech tagging, named entity recognition, syntactic parsing, and so on) are often implemented with RNNs.
For the principles of RNN, see:
For the principles of CNN, see:
For the principles of LSTM, see:
In pre-trained word embeddings, word2vec is already a thing of the past. fastText, which Facebook obtained by introducing word-internal morphological information into the Skip-Gram model, can construct a word vector for any word, without requiring that the word has appeared in the corpus. However, neither word2vec nor fastText solves the problem of polysemy. Because a polysemous word can only be disambiguated from the context of the given sentence, this spawned a series of context-aware word representations.
Among them is ELMo, proposed by the University of Washington: a bidirectional LSTM language model trained on large-scale plain text. ELMo predicts the current word by reading in the preceding context word by word, which brings contextual information into the embedding. Researchers at Zalando Research then applied the same approach at the character level to obtain contextual string embeddings, and their tagger achieved state-of-the-art accuracy. Google's BERT model, in turn, models the left and right context at the same time with an efficient bidirectional Transformer network and has achieved remarkable results on many NLP tasks.
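The subword idea that lets fastText build a vector for an unseen word can be sketched as follows. The word is wrapped in boundary markers and split into character n-grams, and the word vector is then composed from the vectors of those n-grams. The sketch shows only the n-gram extraction; the range of 3 to 6 characters is used here as an assumed default, and the real library additionally hashes n-grams into buckets and keeps a vector for the whole word.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in the boundary markers < and >."""
    wrapped = f"<{word}>"
    return [
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("where", 3, 3))

# An unseen word shares many n-grams with a word that was in the training data,
# so a vector can still be composed for it from the n-gram vectors.
seen = set(char_ngrams("happiness"))
unseen = set(char_ngrams("unhappiness"))
shared = seen & unseen
print(len(shared), "of", len(unseen), "n-grams of the unseen word are already known")
```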
For the principles of fastText, see:
For the principles of ELMo, see:
For the principles of BERT, see:
Other NLP tasks once considered hard, such as automatic document summarization and question answering, turn out to be comparatively simple in the deep learning era. Many QA tasks reduce to measuring the similarity between a text and the candidate answers to a question, which is exactly what neural networks with attention mechanisms are good at. Document summarization involves text generation, which is exactly what RNN language models are good at. In machine translation, Google replaced phrase-based machine translation with neural machine translation long ago. The current academic trend is to use the Transformer and attention mechanisms to extract features.
For the principles of the Transformer, see:
For the principles of attention mechanisms, see:
In short, the future of natural language processing is grand and wide open. "Introduction to Natural Language Processing" serves as a stepping stone on this long road and will hopefully give readers some necessary introductory concepts. As for the practice that follows, the road ahead is long; let us encourage one another along the way.
13.6 GitHub
Notes on "Introduction to Natural Language Processing" by He Han, author of HanLP:
https://github.com/NLP-LOVE/Introduction-NLP