The statistical model regards a sentence as a sequence of words, i.e., a compound conditional probability over multiple words. The word is the atomic unit of a text. The basic idea of NLP is to vectorize words (making them computable), model documents, and then perform classification and correlation analysis.
1. BoW (bag of words)
A combination of words representing a document; the order and context of the words are not considered.
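A minimal BoW sketch (toy corpus and whitespace tokenization, both assumptions for illustration):

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Count occurrences of each vocabulary word in the document;
    word order and context are discarded."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})
vectors = [bow_vector(d, vocab) for d in docs]
```

Note that "the cat sat on the mat" and "the mat sat on the cat" produce the same vector, which is exactly the information BoW throws away.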
2. n-gram model
Considers local context in addition to BoW.
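Extracting n-grams is just sliding a window over the token sequence, which is what preserves local order (a sketch on a toy sentence):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, keeping local order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
```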
3. Vectorization of words (mathematical modeling):
1. One-hot representation: all dimensions are 0 except one, which is 1
2. Distributed representation
One-hot vectors are too sparse, so a neural network first learns a mapping of the vector space, from the sparse representation to a distributed representation (a hallmark of deep learning is automatic feature extraction). Each word is mapped to a small dense vector in which multiple components are non-zero (hence "distributed" representation).
3. Word embedding: a mapping of words into another space, where the mapping is injective (one-to-one) and structure-preserving (order and structure are maintained). Informally, word embedding maps each word in a space X to a multi-dimensional vector in a space Y; the vector is then "embedded" in Y, with a one-to-one correspondence between words and vectors. Word embedding means finding a mapping or function that generates an expression in the new space, and that expression is the word representation.
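The contrast between the two representations on a toy vocabulary (the dense vectors are hand-picked for illustration; in practice a neural network learns them):

```python
vocab = ["king", "queen", "man", "woman"]

# One-hot: dimension = vocabulary size, exactly one component is 1.
def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

# Distributed: a small dense vector with several non-zero components.
# Values are illustrative stand-ins for learned embeddings.
dense = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.2, 0.8, 0.1],
    "woman": [0.2, 0.1, 0.8],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)
```

Any two distinct one-hot vectors have cosine similarity 0, so one-hot encodes no notion of relatedness; dense vectors can place "king" and "queen" close together.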
4. Commonly used representations (models)
TF-IDF
LSA (latent semantic analysis): matrix-based information compression
LDA (latent Dirichlet allocation): introduces topics between documents and words, using a two-level probability distribution to describe the internal relationships of language.
word2vec (obtained as a by-product of training a language model): the output layer uses Huffman coding (hierarchical softmax) to speed up learning; the document-level analogue is called a thought vector.
The product of the branch probabilities along the full Huffman path gives the probability of the word in its context; this is effectively an encoder in the deep-learning sense.
FastText: supervised; uses only BoW features, no word order. Good for grammatical (syntactic) analogies
VarEmbed:
WordRank: performs better on semantic analogies
GloVe:
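A minimal sketch of the TF-IDF weighting from the list above (one common variant; the exact formulas differ across libraries):

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

def tfidf(term, doc, docs):
    tf = Counter(doc)[term] / len(doc)         # term frequency in this document
    df = sum(1 for d in docs if term in d)     # number of documents containing the term
    idf = math.log(len(docs) / df)             # inverse document frequency
    return tf * idf
```

A word like "the" that appears in every document gets idf = log(1) = 0, so TF-IDF automatically down-weights uninformative words.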
5. Two approaches in word2vec: CBOW and skip-gram
CBOW: the surrounding context is given; predict the probability of the current word.
skip-gram: the current word is given; predict the surrounding context words.
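The two setups differ only in which side is the input. Generating training pairs with window size 1 makes this concrete (a sketch on a toy sentence):

```python
def cbow_pairs(tokens, window=1):
    """(context words -> center word): predict the word from its surroundings."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), center))
    return pairs

def skipgram_pairs(tokens, window=1):
    """(center word -> each context word): predict surroundings from the word."""
    pairs = []
    for i, center in enumerate(tokens):
        for c in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((center, c))
    return pairs

tokens = ["I", "like", "deep", "learning"]
```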
4. Learning Methods
1. Frequency-based (directly using the mathematical expression of probability)
1. Each word's appearance depends only on a limited number of preceding words. The number of parameters to estimate is O(N^n); n = 3 is the usual balance between effect and complexity.
2. Calculation steps: (1) extract the vocabulary; (2) count word frequencies over the corpus and approximate probabilities by relative frequencies (the corpus must be large enough).
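The two steps above, sketched for a bigram model (probabilities approximated by relative frequencies; a real corpus would also need smoothing for unseen pairs):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Step 1: extract the vocabulary.
vocab = set(corpus)

# Step 2: count frequencies and approximate
# P(w2 | w1) = count(w1 w2) / count(w1).
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])

def prob(w2, w1):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]
```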
2. Neural network representation
1. The input is a sequence of vectors (the words are digitized first), one vector per word
2. The output is the probability of each word given the corresponding sequence
3. The network structure and parameters capture each word's predecessors and the joint probability distribution
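A forward-pass sketch of such a network (randomly initialized weights stand in for learned ones; training by gradient descent is omitted):

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "mat"]
V, D, H = len(vocab), 3, 4   # vocab size, embedding dim, hidden dim

# Randomly initialized parameters (stand-ins for learned ones).
emb = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(2 * D)] for _ in range(H)]
W_o = [[random.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(V)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def next_word_probs(w1, w2):
    """P(next word | previous two words): embed, hidden tanh layer, softmax."""
    x = emb[vocab.index(w1)] + emb[vocab.index(w2)]   # concatenated embeddings
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_h]
    o = [sum(w * hi for w, hi in zip(row, h)) for row in W_o]
    return softmax(o)

probs = next_word_probs("the", "cat")
```

The softmax output is a proper distribution over the vocabulary, which is exactly "the probability of each word given the corresponding sequence" from item 2.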
3. Deep-learning language models (various RNNs)
LSTM
TOWER CRANE
Five, participle
1. Maximum matching (and variants), based on dictionaries and rules
2. Full-segmentation path selection: list all possible segmentation combinations and select the best segmentation path among them. Common path-selection methods include the n-shortest-path method and word-based n-gram language model methods.
3. Sequence tagging (using machine learning to estimate word-formation probabilities), e.g. Conditional Random Fields (CRF), Hidden Markov Models (HMM), maximum entropy
4. Statistics combined with dictionaries
5. Deep learning (RNN, LSTM)
6. Related libraries
Principle of jieba ("stuttering") word segmentation: http://blog.csdn.net/john_xyz/article/details/54645527
https://github.com/fxsjy/jieba (CRF + HMM; currently the most reliable)
https://github.com/NLPchina/ansj_seg (n-gram + CRF + HMM)
IKAnalyzer: matching-based; average results
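The dictionary-and-rules approach from item 1 can be sketched as forward maximum matching (toy dictionary and the classic ambiguous sentence; a greedy longest-match from the left):

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedily take the longest dictionary word starting at each position;
    fall back to a single character when nothing matches."""
    result, i = [], 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + l]
            if l == 1 or word in dictionary:
                result.append(word)
                i += l
                break
    return result

dictionary = {"研究", "研究生", "生命", "命", "的", "起源"}
segments = forward_max_match("研究生命的起源", dictionary)
```

On this sentence the greedy method picks 研究生/命/的/起源, illustrating why pure maximum matching can mis-segment and why statistical path selection (item 2) was introduced.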