The statistical model regards a sentence as a sequence of words, i.e., a compound conditional probability over multiple words. The word is the atomic unit of a text. The basic idea of NLP is to vectorize words (making them computable), model documents, and then perform classification and correlation analysis.
1. BoW (bag of words)
A combination of words representing a document; the order and context of the words are not considered.
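A minimal BoW sketch (toy corpus and whitespace tokenization, both assumptions for illustration):

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Count occurrences of each vocabulary word in the document;
    word order and context are discarded."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})
vectors = [bow_vector(d, vocab) for d in docs]
```

Note that "the cat sat on the mat" and "the mat sat on the cat" produce the same vector, which is exactly the information BoW throws away.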
2. n-gram model
Considers local context in addition to BoW.
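Extracting n-grams is just sliding a window over the token sequence, which is what preserves local order (a sketch on a toy sentence):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, keeping local order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
```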
3. Vectorization of words (mathematical modeling):
1. One-hot representation: all dimensions are 0 except one, which is 1
2. Distributed representation
One-hot vectors are too sparse, so a neural network first learns a mapping of the vector space, from the sparse representation to a distributed representation (a hallmark of deep learning is automatic feature extraction). Each word is mapped to a small dense vector in which multiple components are non-zero (hence "distributed" representation).
3. Word embedding: a mapping of words into another space, where the mapping is injective (one-to-one) and structure-preserving (order and structure are maintained). Informally, word embedding maps each word in a space X to a multi-dimensional vector in a space Y; the vector is then "embedded" in Y, with a one-to-one correspondence between words and vectors. Word embedding means finding a mapping or function that generates an expression in the new space, and that expression is the word representation.
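The contrast between the two representations on a toy vocabulary (the dense vectors are hand-picked for illustration; in practice a neural network learns them):

```python
vocab = ["king", "queen", "man", "woman"]

# One-hot: dimension = vocabulary size, exactly one component is 1.
def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

# Distributed: a small dense vector with several non-zero components.
# Values are illustrative stand-ins for learned embeddings.
dense = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.2, 0.8, 0.1],
    "woman": [0.2, 0.1, 0.8],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)
```

Any two distinct one-hot vectors have cosine similarity 0, so one-hot encodes no notion of relatedness; dense vectors can place "king" and "queen" close together.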
4. Commonly used representations (models)
TF-IDF
LSA (latent semantic analysis): matrix-based information compression
LDA (latent Dirichlet allocation): introduces topics between documents and words, using a two-level probability distribution to describe the internal relationships of language.
word2vec (obtained as a by-product of training a language model): the output layer uses Huffman coding (hierarchical softmax) to speed up learning; the document-level analogue is called a thought vector.
The product of the branch probabilities along the full Huffman path gives the probability of the word in its context; this is effectively an encoder in the deep-learning sense.
FastText: supervised; uses only BoW features, no word order. Good for grammatical (syntactic) analogies
VarEmbed:
WordRank: performs better on semantic analogies
GloVe:
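A minimal sketch of the TF-IDF weighting from the list above (one common variant; the exact formulas differ across libraries):

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

def tfidf(term, doc, docs):
    tf = Counter(doc)[term] / len(doc)         # term frequency in this document
    df = sum(1 for d in docs if term in d)     # number of documents containing the term
    idf = math.log(len(docs) / df)             # inverse document frequency
    return tf * idf
```

A word like "the" that appears in every document gets idf = log(1) = 0, so TF-IDF automatically down-weights uninformative words.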
5. Two approaches in word2vec: CBOW and skip-gram
CBOW: the surrounding context is given; predict the probability of the current word.
skip-gram: the current word is given; predict the surrounding context words.
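The two setups differ only in which side is the input. Generating training pairs with window size 1 makes this concrete (a sketch on a toy sentence):

```python
def cbow_pairs(tokens, window=1):
    """(context words -> center word): predict the word from its surroundings."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), center))
    return pairs

def skipgram_pairs(tokens, window=1):
    """(center word -> each context word): predict surroundings from the word."""
    pairs = []
    for i, center in enumerate(tokens):
        for c in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((center, c))
    return pairs

tokens = ["I", "like", "deep", "learning"]
```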
4. Learning Methods
1. Frequency-based (directly using the mathematical expression of probability)
1. Each word's appearance depends only on a limited number of preceding words. The number of parameters to estimate is O(N^n); n = 3 is the usual balance between effect and complexity.
2. Calculation steps: (1) extract the vocabulary; (2) count word frequencies over the corpus and approximate probabilities by relative frequencies (the corpus must be large enough).
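The two steps above, sketched for a bigram model (probabilities approximated by relative frequencies; a real corpus would also need smoothing for unseen pairs):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Step 1: extract the vocabulary.
vocab = set(corpus)

# Step 2: count frequencies and approximate
# P(w2 | w1) = count(w1 w2) / count(w1).
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])

def prob(w2, w1):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]
```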
2. Neural network representation
1. The input is a sequence of vectors (the words are digitized first), one vector per word
2. The output is the probability of each word given the corresponding sequence
3. The network structure and parameters capture each word's predecessors and the joint probability distribution
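A forward-pass sketch of such a network (randomly initialized weights stand in for learned ones; training by gradient descent is omitted):

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "mat"]
V, D, H = len(vocab), 3, 4   # vocab size, embedding dim, hidden dim

# Randomly initialized parameters (stand-ins for learned ones).
emb = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(2 * D)] for _ in range(H)]
W_o = [[random.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(V)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def next_word_probs(w1, w2):
    """P(next word | previous two words): embed, hidden tanh layer, softmax."""
    x = emb[vocab.index(w1)] + emb[vocab.index(w2)]   # concatenated embeddings
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_h]
    o = [sum(w * hi for w, hi in zip(row, h)) for row in W_o]
    return softmax(o)

probs = next_word_probs("the", "cat")
```

The softmax output is a proper distribution over the vocabulary, which is exactly "the probability of each word given the corresponding sequence" from item 2.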
3. Deep-learning language models (various RNNs)
LSTM
TOWER CRANE
Five, participle
1. Maximum matching (and variants), based on dictionaries and rules
2. Full-segmentation path selection: list all possible segmentation combinations and select the best segmentation path among them. Common path-selection methods include the n-shortest-path method and word-based n-gram language model methods.
3. Sequence tagging (using machine learning to estimate word-formation probabilities), e.g. Conditional Random Fields (CRF), Hidden Markov Models (HMM), maximum entropy
4. Statistics combined with dictionaries
5. Deep learning (RNN, LSTM)
6. Related libraries
Principle of jieba ("stuttering") word segmentation: http://blog.csdn.net/john_xyz/article/details/54645527
https://github.com/fxsjy/jieba (CRF + HMM; currently the most reliable)
https://github.com/NLPchina/ansj_seg (n-gram + CRF + HMM)
IKAnalyzer: matching-based; average results
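The dictionary-and-rules approach from item 1 can be sketched as forward maximum matching (toy dictionary and the classic ambiguous sentence; a greedy longest-match from the left):

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedily take the longest dictionary word starting at each position;
    fall back to a single character when nothing matches."""
    result, i = [], 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + l]
            if l == 1 or word in dictionary:
                result.append(word)
                i += l
                break
    return result

dictionary = {"研究", "研究生", "生命", "命", "的", "起源"}
segments = forward_max_match("研究生命的起源", dictionary)
```

On this sentence the greedy method picks 研究生/命/的/起源, illustrating why pure maximum matching can mis-segment and why statistical path selection (item 2) was introduced.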