Text representation

Text representation is a fundamental task in natural language processing. The quality of a text representation has a decisive effect on downstream tasks, the most common of which are text clustering and text classification.

  • Introduction
    How to represent a sentence can be considered from two aspects. First, a sentence is composed of a sequence of words, so it can naturally be represented in terms of the word sequence that constitutes it. Second, a sentence is itself a constituent element of a larger text, so its meaning can be reflected by the surrounding sentences, i.e. its context. The common-sense idea applied in both cases is that of local information.

  • Traditional representation method
    "Traditional" here means in contrast to the currently popular wave of deep learning methods.
    (1) Symbolic representation of the text: the text is represented directly by its words or by word n-grams. The disadvantage is that symbols are not numbers, so they cannot be computed with mathematically and can only be matched exactly; word similarity can only be expressed through an external dictionary.
    (2) Vector representation of the text, such as one-hot or tf-idf. These carry some statistical information and improve on symbolic matching: as numeric vectors they can be compared with a distance formula. The disadvantage is that the sparse vectors take up a lot of space, and word similarity is still poorly captured (a scikit-learn sketch follows this list).
    (3) Locality-sensitive hashing: I feel it suits medium-length sentences; texts that are too long or too short do not work well, and it is sensitive to syntax and word order (a SimHash-style sketch follows this list).
    (4) doc2vec: the gensim package offers two approaches, LSI and doc2vec. LSI mainly relies on dimensionality reduction of the term-document matrix; its interpretability is not very good and the model is not small. doc2vec does not handle data outside the training set well (a gensim sketch follows this list).
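
    As a concrete illustration of item (2), here is a minimal sketch using scikit-learn; the toy corpus is invented purely for illustration:

```python
# Sketch of one-hot-style / tf-idf text vectors with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "text representation is a basic NLP task",
]

# One-hot-style bag of words: binary=True only marks word presence.
onehot = CountVectorizer(binary=True)
X_onehot = onehot.fit_transform(corpus)

# tf-idf down-weights words that appear in many documents.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# Numeric vectors admit distance/similarity computations ...
print(cosine_similarity(X_tfidf[0], X_tfidf[1]))
# ... but the vectors are sparse and high-dimensional:
print(X_tfidf.shape, f"{X_tfidf.nnz} non-zeros")
```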
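
    For item (3), a minimal pure-Python sketch of a SimHash-style locality-sensitive hash; this is a generic textbook version, not a production implementation:

```python
# Minimal SimHash-style locality-sensitive hash (illustrative only).
import hashlib

def simhash(text, bits=64):
    """Hash each token, then combine bit-wise by majority vote."""
    counts = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    return bin(a ^ b).count("1")

# Similar sentences should get nearby fingerprints; note how
# sensitive the scheme is to small wording changes.
a = simhash("the cat sat on the mat")
b = simhash("the cat sat on the big mat")
print(hamming(a, b))
```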
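
    For item (4), a sketch of the two gensim approaches on the same kind of toy corpus; the hyperparameters here are arbitrary illustrative choices:

```python
# Sketch of the two gensim approaches mentioned above (toy corpus).
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [doc.split() for doc in [
    "the cat sat on the mat",
    "the dog sat on the log",
    "text representation is a basic nlp task",
]]

# LSI: build a bag-of-words corpus, then reduce its dimensionality.
dictionary = Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lsi = LsiModel(bow, id2word=dictionary, num_topics=2)
print(lsi[bow[0]])  # the first document as a dense 2-d topic vector

# doc2vec: learn document vectors jointly with word vectors.
tagged = [TaggedDocument(t, [i]) for i, t in enumerate(texts)]
d2v = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=50)
# infer_vector accepts unseen documents, but (as noted above)
# the result can be unstable for data far from the training set.
print(d2v.infer_vector("the cat sat".split()))
```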

  • Sentence representation based on deep learning
    Deep-learning sentence vector representations are generally tied to a task, such as classification or generation.
    (1) Skip-thought: with the help of a generative model, a sentence's representation is used to generate its surrounding sentences, on the common-sense assumption that sentences with similar contexts also have similar semantics; the dependence on the internal structure of the text is relatively weak. However, experiments found that the final words of a sentence have a disproportionately large influence on its representation.
    (2) Quick-thought: with the help of a classification model, on the common-sense assumption that adjacent sentences are more semantically related than non-adjacent ones. Experiments found that the results are not very stable, and some sentence representations are far off, which may be related to the training data. Speaking of classification, it feels similar in principle to the semantic-matching model DSSM. That is, classification is essentially a space-mapping problem: when the classification label stands in for sentence semantics, removing the downstream classification head leaves an encoder whose output can serve as a semantic representation. The drawback is that the categories are too coarse. A sketch of the Quick-thought objective follows below.
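
    A minimal sketch of the Quick-thought objective in PyTorch, simplified to score only the next sentence as the positive candidate; the encoder, names, and sizes are all invented for illustration:

```python
# Sketch of the Quick-thought objective: in a batch of consecutive
# sentences, classify which candidate is the true neighbour.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, ids):            # ids: (batch, seq_len)
        _, h = self.gru(self.emb(ids))
        return h.squeeze(0)            # (batch, dim)

def quick_thought_loss(enc_f, enc_g, batch_ids):
    """batch_ids holds consecutive sentences: row i+1 follows row i."""
    u = enc_f(batch_ids)               # context sentences
    v = enc_g(batch_ids)               # candidate sentences
    n = u.size(0)
    mask = torch.eye(n, dtype=torch.bool, device=u.device)
    # a sentence cannot be its own neighbour, so mask the diagonal
    scores = (u @ v.t()).masked_fill(mask, float("-inf"))
    # the gold "class" of sentence i is its successor i + 1
    target = torch.arange(1, n, device=batch_ids.device)
    return F.cross_entropy(scores[:-1], target)

# usage with random toy data: 8 consecutive sentences of 12 tokens
enc_f, enc_g = Encoder(1000), Encoder(1000)
ids = torch.randint(0, 1000, (8, 12))
loss = quick_thought_loss(enc_f, enc_g, ids)
loss.backward()
```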

  • Sentence representation based on language model
    (1) BERT: this one is truly on fire; it exploits both the correlation between words (masked language modelling) and the correlation between sentences (next-sentence prediction). Other examples: ELMo, GPT, etc. They will not be expanded on here. Language models require a large training corpus and high-performance computing resources, which ordinary practitioners cannot afford. A sketch of extracting sentence vectors from a pre-trained BERT follows below.
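
    A minimal sketch of getting a sentence vector from a pre-trained BERT with the HuggingFace transformers package; mean pooling is one common choice (taking the [CLS] token is another), and the model name is just the standard public checkpoint:

```python
# Sketch: sentence embeddings from a pre-trained BERT
# via mean pooling over the last hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["text representation is a basic NLP task",
             "a good representation helps downstream tasks"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq, 768)

# mean-pool over real tokens only, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1)    # (batch, seq, 1)
emb = (hidden * mask).sum(1) / mask.sum(1)      # (batch, 768)

sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(sim.item())
```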

Origin: blog.csdn.net/cyinfi/article/details/81989821