Study Notes CB008: Word Sense Disambiguation (Supervised and Unsupervised), Semantic Role Labeling, Information Retrieval, TF-IDF, Latent Semantic Indexing Model

Word sense disambiguation is the basis for semantic understanding of sentences and longer texts, so it must be solved. Every language has a large number of polysemous words, and word sense disambiguation can be tackled with machine learning. Supervised methods treat it as classification: a classifier decides which sense a word occurrence belongs to. Unsupervised methods treat it as clustering: occurrences are clustered into groups, and each group corresponds to one sense.

Supervised word sense disambiguation methods. The mutual-information-based method compares two languages: a model trained on a large Chinese-English parallel corpus can disambiguate senses. The idea comes from information theory: one random variable contains information about another (the English side carries information about the Chinese side). Assume two random variables X and Y have marginal probabilities p(x) and p(y) and joint probability p(x, y); the mutual information is I(X; Y) = ∑_x ∑_y p(x, y) log(p(x, y) / (p(x) p(y))). Mutual information measures how much knowing one random variable reduces the uncertainty (entropy) of the other (when reading the Chinese, the known English meaning makes the Chinese meaning more certain): I(X; Y) = H(X) − H(X|Y). The model is trained iteratively on the corpus to increase I(X; Y), and the algorithm terminates when I(X; Y) no longer increases. Mutual-information-based disambiguation works best in machine translation systems. Its drawbacks are that bilingual corpora are limited, and its ability to resolve ambiguity that exists in both languages is also limited (for example, when the same word is ambiguous in both Chinese and English).
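
As a quick illustration of the formula above, here is a small, self-contained sketch (not from the original notes) that computes I(X; Y) from an assumed joint distribution over a word's senses and its translations; the table values are invented for demonstration.

```python
import math

# Hypothetical joint distribution p(x, y) between a Chinese sense (X)
# and an English translation (Y); values are illustrative only.
joint = {
    ("sense_bank_river", "bank"):  0.05,
    ("sense_bank_river", "shore"): 0.35,
    ("sense_bank_money", "bank"):  0.50,
    ("sense_bank_money", "shore"): 0.10,
}

def mutual_information(p_xy):
    """I(X; Y) = sum_x sum_y p(x, y) * log(p(x, y) / (p(x) * p(y)))."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

print(mutual_information(joint))  # higher values mean X and Y share more information
```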

Disambiguation based on a Bayesian classifier. The key is conditional probability on context: every sense of a polysemous word is related to its context. Let c denote the context, s a sense, and w the polysemous word. The probability that w takes sense s in context c is p(s|c) = p(c|s)p(s)/p(c). We want the sense s that maximizes p(s|c); since p(c) is fixed for a given context, only the numerator matters, so ŝ = argmax_s p(c|s)p(s). In natural language processing the context c is represented by the words v it contains, and with a naive independence assumption this becomes ŝ = argmax_s p(s) ∏_v p(v|s).

p(s) is the probability that the polysemous word w has sense s; its maximum likelihood estimate from a large annotated corpus is p(s) = N(s)/N(w). p(v|s) is the probability of seeing the context word v given that w has sense s, estimated as p(v|s) = N(v, s)/N(s). After training p(s) and p(v|s), the word w is disambiguated by picking the sense that maximizes p(c|s)p(s).
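
A minimal sketch of the naive Bayes disambiguator described above, assuming a tiny hand-labeled training set (sense-tagged contexts) invented for illustration; the counts N(s), N(w) and N(v, s) are estimated exactly as in the formulas, with add-one smoothing added so that unseen context words do not zero out a sense.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sense-tagged contexts for the ambiguous word "bank".
training = [
    ("FINANCE", ["deposit", "money", "loan"]),
    ("FINANCE", ["account", "money", "interest"]),
    ("RIVER",   ["river", "water", "fishing"]),
]

sense_count = Counter(s for s, _ in training)     # N(s)
word_given_sense = defaultdict(Counter)           # N(v, s)
for s, ctx in training:
    word_given_sense[s].update(ctx)

vocab = {v for _, ctx in training for v in ctx}
n_total = sum(sense_count.values())               # N(w)

def disambiguate(context):
    """Return argmax_s log p(s) + sum_v log p(v|s), with add-one smoothing."""
    best_sense, best_score = None, float("-inf")
    for s in sense_count:
        score = math.log(sense_count[s] / n_total)                       # log p(s)
        denom = sum(word_given_sense[s].values()) + len(vocab)
        for v in context:
            score += math.log((word_given_sense[s][v] + 1) / denom)      # log p(v|s)
        if score > best_score:
            best_sense, best_score = s, score
    return best_sense

print(disambiguate(["money", "loan"]))   # expected: FINANCE
print(disambiguate(["river", "water"]))  # expected: RIVER
```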

Unsupervised word sense disambiguation methods. Fully unsupervised sense disambiguation is impossible, because senses cannot be defined without labels; what can be done unsupervised is word sense discrimination. Unsupervised sense discrimination still uses a Bayesian classifier, but its parameters are not estimated from a labeled corpus. Instead, p(v|s) is initialized randomly and then re-estimated with the EM algorithm: for each context c of w, compute p(c|s), obtain the likelihood of the observed data, re-estimate p(v|s), recompute the likelihood, and keep iterating to update the model parameters. The result is a classification model that assigns occurrences of an ambiguous word in different contexts to different clusters. Another approach is based on monolingual context vectors: contexts are compared by vector similarity, the cosine of the angle between two vectors, cos(a, b) = ∑ a_i b_i / sqrt(∑ a_i² ∑ b_i²).
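
A minimal sketch of the cosine similarity used to compare two context vectors; the two bag-of-words count vectors below are invented for illustration.

```python
import math

def cosine(a, b):
    """cos(a, b) = sum(a_i * b_i) / sqrt(sum(a_i^2) * sum(b_i^2))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical context vectors over the vocabulary [money, loan, river, water].
ctx1 = [2, 1, 0, 0]   # "...deposited money, took a loan..."
ctx2 = [0, 0, 3, 1]   # "...walked along the river, near the water..."
print(cosine(ctx1, ctx2))  # near 0: the contexts suggest different senses
```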

Shallow semantic annotation is an effective language analysis method: shallow semantic role analysis describes the relationships among the semantic roles in a sentence. Typical semantic roles are the predicate, the agent, the patient, the time at which something happens, and quantities. By analyzing role information, the computer extracts the important structural information needed to understand the meaning of language.

Semantic role labeling depends on the results of syntactic analysis. Syntactic analysis includes phrase structure parsing, shallow (chunk) parsing, and dependency parsing, and each kind of result gives rise to a corresponding semantic role labeling method. The process is: syntactic analysis -> candidate argument pruning -> argument identification -> argument labeling -> semantic role labeling result. Argument pruning removes, from the many candidates, the parts that are definitely not arguments. Argument identification is binary classification: argument or not an argument. Argument labeling is multi-class classification.

A semantic role labeling method based on the phrase structure tree. The phrase structure tree expresses structural relationships, and the labeling process relies on these relationships through hand-designed strategies, which grow more complicated as the language structure gets more complex. The argument pruning strategy: semantic roles are centered on the predicate, so start from the predicate node in the phrase structure tree and move upward level by level; if a sibling of the current node is not in a coordination relation with the current node, it is taken as a candidate argument (a small sketch of this upward walk follows this paragraph). Argument identification is binary classification (argument vs. non-argument) learned from an annotated corpus; typical features are the predicate itself, the path in the phrase structure tree, the phrase type, the position of the argument relative to the predicate, the predicate's voice, the argument's head word, the subcategorization, the first and last words of the argument, and combinations of these features. Argument labeling uses a machine-learned multi-class classifier.
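
A minimal sketch of the pruning strategy just described, under the simplifying assumption that the tree is given as nested nodes and that coordination is approximated by a label check; the tree and labels are invented for illustration.

```python
class Node:
    def __init__(self, label, children=None, parent=None):
        self.label = label
        self.children = children or []
        self.parent = parent
        for c in self.children:
            c.parent = self

def is_coordinated(a, b):
    # Simplified stand-in: treat siblings with identical labels as coordinated;
    # a real system would use the parser's coordination analysis.
    return a.label == b.label

def prune_candidates(predicate_node):
    """Walk from the predicate node up to the root; at each level, collect
    siblings not coordinated with the current node as candidate arguments."""
    candidates = []
    current = predicate_node
    while current.parent is not None:
        for sibling in current.parent.children:
            if sibling is not current and not is_coordinated(sibling, current):
                candidates.append(sibling)
        current = current.parent
    return candidates

# Hypothetical tree for "The police arrested the suspect yesterday":
vp = Node("VP", [Node("V(arrested)"), Node("NP(the suspect)"), Node("ADVP(yesterday)")])
root = Node("S", [Node("NP(the police)"), vp])
pred = vp.children[0]
print([c.label for c in prune_candidates(pred)])
# ['NP(the suspect)', 'ADVP(yesterday)', 'NP(the police)']
```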

Semantic role labeling methods based on dependency parsing results and on chunking results differ mainly in how argument pruning works on the different syntactic structures. The dependency-based method extracts predicate-argument relations directly from the dependency tree. Its pruning strategy: take the predicate as the current node, add all children of the current node as candidate arguments, then take the parent of the current node as the new current node, and repeat until the root is reached (see the sketch after this paragraph). For argument identification, the feature design adds more features about the parent and child nodes.
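
A minimal sketch of that dependency-tree pruning loop, assuming the tree is given as a token-to-head map; the sentence and indices are invented for illustration, and the node we climbed from is skipped so it is not re-added as its own candidate.

```python
# Hypothetical dependency tree for "The police arrested the suspect yesterday":
# token index -> head index (0 is the artificial root).
heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3, 6: 3}
tokens = {1: "The", 2: "police", 3: "arrested", 4: "the", 5: "suspect", 6: "yesterday"}

def children_of(node):
    return [c for c, h in heads.items() if h == node]

def prune_candidates(predicate):
    """From the predicate upward to the root, collect all children of each
    node on the path (excluding the node we just came from) as candidates."""
    candidates = []
    current, came_from = predicate, None
    while current != 0:
        for child in children_of(current):
            if child != came_from:
                candidates.append(child)
        came_from, current = current, heads[current]
    return candidates

print([tokens[i] for i in prune_candidates(3)])
# ['police', 'suspect', 'yesterday'] -- candidate arguments of "arrested"
```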

Fusion methods combine the results of the different labeling methods, for example by weighted summation or interpolation of their scores.
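
A minimal sketch of score fusion by linear interpolation, assuming two labelers each return a probability per role label and an interpolation weight lam chosen on held-out data; the names and numbers are illustrative.

```python
def fuse(scores_a, scores_b, lam=0.6):
    """Linear interpolation: fused(r) = lam * a(r) + (1 - lam) * b(r)."""
    labels = set(scores_a) | set(scores_b)
    return {r: lam * scores_a.get(r, 0.0) + (1 - lam) * scores_b.get(r, 0.0)
            for r in labels}

# Role scores for one argument from a constituent-based and a dependency-based labeler.
constituent = {"Agent": 0.7, "Patient": 0.2, "Time": 0.1}
dependency  = {"Agent": 0.5, "Patient": 0.4, "Time": 0.1}
fused = fuse(constituent, dependency)
print(max(fused, key=fused.get), fused)  # "Agent" wins after fusion
```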

Semantic role labeling is currently not very effective: it depends on the accuracy of syntactic analysis and adapts poorly to new domains. Newer methods use bilingual parallel corpora to compensate for the accuracy problem, but at a much higher cost.

Information retrieval, whether it is Google or Baidu, is inseparable from the TF-IDF algorithm, which is simple and effective but lacks semantic features.

TF-IDF. TF (term frequency) is the frequency of a word within a document. IDF (inverse document frequency) reflects how many documents a word appears in: the fewer, the higher the IDF. If the same word appears the same number of times in a short document and in a long one, it is more valuable to the short document. When a word that rarely appears across the collection does show up in a document, it is worth more than commonly occurring words. In information retrieval, this vector model is very effective for similarity calculation and was once Google's trump card. Its weakness for chatbots is that it only considers independent words and carries no semantic information.
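
A minimal TF-IDF sketch over a toy corpus (the documents are invented for illustration), using the common tf * log(N / df) weighting; real systems vary in smoothing and normalization.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks rallied on the market".split(),
]

n_docs = len(docs)
df = Counter()                       # document frequency of each word
for d in docs:
    df.update(set(d))

def tf_idf(doc):
    """Return {word: tf * idf} for one document, with idf = log(N / df)."""
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}

weights = tf_idf(docs[0])
print(sorted(weights.items(), key=lambda kv: -kv[1])[:3])
# rare words like "mat" and "sat" outrank "the", which appears in every document (idf = 0)
```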

Latent Semantic Indexing model. In the TF-IDF model all words form a high-dimensional space and each document is mapped to a point; the dimensionality is generally very high, and because each word is its own dimension the relationships between words are severed. Latent semantic indexing treats words and documents alike: it constructs a low-dimensional semantic space and maps every word and every document to a point in it. Mathematically it looks at document probabilities, word probabilities, and their joint probability. A hypothetical latent class z is placed between documents and words: choose a document with probability p(d), choose a latent class given the document with probability p(z|d), and generate a word with probability p(w|z). The joint probability p(d, w) is estimated from the observed data, with z a latent variable expressing a semantic feature. From p(d, w) we estimate p(d), p(z|d), and p(w|z), and from those we compute a more accurate p(d, w), the correlation between words and documents. The optimization objective is the log-likelihood function L = ∑_d ∑_w n(d, w) log p(d, w), where p(d, w) = p(d) p(w|d), p(w|d) = ∑_z p(w|z) p(z|d), and p(z|d) = p(z) p(d|z) / ∑_z' p(z') p(d|z'), so p(d, w) = p(d) ∑_z p(w|z) p(z) p(d|z) / ∑_z' p(z') p(d|z') = ∑_z p(z) p(w|z) p(d|z).
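
A minimal sketch of the generative story just described (choose d, then z given d, then w given z) and of the joint probability p(d, w) = p(d) ∑_z p(z|d) p(w|z); the parameter tables are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
docs, topics, words = ["d1", "d2"], ["z1", "z2"], ["w1", "w2", "w3"]

# Illustrative model parameters (each row is a probability distribution).
p_d = np.array([0.5, 0.5])                                   # p(d)
p_z_given_d = np.array([[0.9, 0.1], [0.2, 0.8]])             # p(z|d), rows indexed by d
p_w_given_z = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])   # p(w|z), rows indexed by z

def generate_pair():
    """Sample one (document, word) pair: d ~ p(d), z ~ p(z|d), w ~ p(w|z)."""
    d = rng.choice(len(docs), p=p_d)
    z = rng.choice(len(topics), p=p_z_given_d[d])
    w = rng.choice(len(words), p=p_w_given_z[z])
    return docs[d], words[w]

# The model's joint probability: p(d, w) = p(d) * sum_z p(z|d) p(w|z).
p_dw = p_d[:, None] * (p_z_given_d @ p_w_given_z)
print(generate_pair(), p_dw.sum())   # one sampled pair; the total probability is 1.0
```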

The EM algorithm follows the maximum likelihood principle: start with random parameter values, assign the data according to the current distribution, re-count according to that assignment, re-estimate the parameters by maximum likelihood, then reassign, adjust, and estimate again until the optimum is reached. Each training pair (d, w) is assigned to latent classes via p(z|d, w). First pick initial p(z), p(d|z), p(w|z); then p(z|d, w) = p(z) p(d|z) p(w|z) / ∑_z' p(z') p(d|z') p(w|z'), where the numerator is for one z and the denominator sums over all z. Computing p(z|d, w) for every training pair is the E step. Using the counts n(d, w) and these posteriors, the M step updates the parameters: p(z) = (1/R) ∑_{d,w} n(d, w) p(z|d, w), where R = ∑_{d,w} n(d, w); p(d|z) = ∑_w n(d, w) p(z|d, w) / ∑_{d',w} n(d', w) p(z|d', w), the numerator summed over w for a single d and the denominator over all d, which gives the maximum likelihood estimate of p(d|z); and p(w|z) = ∑_d n(d, w) p(z|d, w) / ∑_{d,w'} n(d, w') p(z|d, w'), the numerator summed over d for a single w and the denominator over all w, giving the maximum likelihood estimate of p(w|z). Then p(z|d, w) is recomputed with the same formula, and the EM steps are repeated to maximize the log-likelihood L = ∑_d ∑_w n(d, w) log p(d, w). After the iterations converge we obtain the final p(d, w), the correlation between words and documents, which is used for retrieval.
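
A minimal EM sketch over an invented count matrix n(d, w), following the E and M steps above; numpy broadcasting does the sums, and the log-likelihood should increase over the iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(4, 6)).astype(float)   # toy count matrix n(d, w)
n_topics = 2

# Random initialization of p(z), p(d|z), p(w|z).
p_z = np.full(n_topics, 1.0 / n_topics)
p_d_z = rng.random((n_topics, 4)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
p_w_z = rng.random((n_topics, 6)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

for step in range(50):
    # E step: p(z|d, w) = p(z) p(d|z) p(w|z) / sum_z' p(z') p(d|z') p(w|z').
    post = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]   # shape (z, d, w)
    post /= post.sum(axis=0, keepdims=True)
    # M step: re-estimate the parameters from the expected counts n(d, w) p(z|d, w).
    expected = n_dw[None, :, :] * post                                   # shape (z, d, w)
    totals = expected.sum(axis=(1, 2))
    p_z = totals / n_dw.sum()                        # p(z) = sum_{d,w} n p(z|d,w) / R
    p_d_z = expected.sum(axis=2) / totals[:, None]   # p(d|z)
    p_w_z = expected.sum(axis=1) / totals[:, None]   # p(w|z)

# Final p(d, w) = sum_z p(z) p(d|z) p(w|z); log-likelihood L = sum n(d, w) log p(d, w).
p_dw = np.einsum("z,zd,zw->dw", p_z, p_d_z, p_w_z)
print((n_dw * np.log(p_dw + 1e-12)).sum())
```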

The correlation between words is obtained by multiplying p(w, d) with its transpose: p(w, w) = p(w, d) × p(w, d)ᵀ. The user's query keywords form a word vector Wq, and document d is represented as a word vector Wd; the relevance of the query to document d is R(query, d) = Wq × p(w, w) × Wd. Computing this relevance for all documents and sorting it in descending order gives the search ranking.
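
A minimal sketch of that ranking step, assuming a small word-document probability matrix p(w, d) already produced by the model; the query and document vectors are simple 0/1 indicator vectors over the vocabulary, and all values are invented for illustration.

```python
import numpy as np

vocab = ["money", "loan", "river", "water"]
# Hypothetical p(w, d) for 4 words x 3 documents, e.g. output of the trained model.
p_wd = np.array([[0.10, 0.02, 0.01],
                 [0.08, 0.01, 0.01],
                 [0.01, 0.09, 0.03],
                 [0.01, 0.07, 0.05]])

p_ww = p_wd @ p_wd.T                       # word-word correlation: p(w, d) x p(w, d)^T

def relevance(query_words, doc_vector):
    """R(query, d) = Wq x p(w, w) x Wd."""
    wq = np.array([1.0 if w in query_words else 0.0 for w in vocab])
    return wq @ p_ww @ doc_vector

# Indicator word vectors for the three documents (illustrative).
doc_vectors = np.array([[1, 1, 0, 0],     # d0 mentions money and loans
                        [0, 0, 1, 1],     # d1 mentions rivers and water
                        [1, 0, 0, 1]], dtype=float)

scores = [relevance({"money"}, dv) for dv in doc_vectors]
print(np.argsort(scores)[::-1], scores)   # d0 (the money/loan document) ranks first
```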

Compared with TF-IDF, the latent semantic indexing model adds semantic information, considers word-word relationships, and retrieves by semantics, which makes it better suited to corpus training and analysis when developing chatbots. TF-IDF is better suited to retrieval over independent words, that is, to plain-text search engines.


