N-grams and bag-of-words: a shallow-model NLP method

Algorithm introduction:
An n-gram is a group of N (or fewer) consecutive words extracted from a sentence. The "words" in this definition can also be replaced with characters.
Let's look at a simple example. Consider the sentence "The cat sat on the mat." It can be broken down into the following set of 2-grams.

{"The", "The cat", "cat", "cat sat", "sat",
  "sat on", "on", "on the", "the", "the mat", "mat"}

This sentence can also be broken down into the following set of 3-grams.

{"The", "The cat", "cat", "cat sat", "The cat sat",
  "sat", "sat on", "on", "cat sat on", "on the", "the",
  "sat on the", "the mat", "mat", "on the mat"}

  Such sets are called a bag-of-2-grams and a bag-of-3-grams, respectively. The term "bag" here means that we are dealing with a set of tokens rather than a list or sequence: the tokens have no specific order. This family of tokenization methods is called bag-of-words.
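
As a concrete illustration, here is a minimal sketch of bag-of-n-gram extraction in plain Python. The helper name bag_of_ngrams is hypothetical (not from the book), and stripping the trailing period is a simplification standing in for real punctuation handling:

def bag_of_ngrams(text, max_n):
    """Return the set of all n-grams of length 1 to max_n in a text."""
    words = text.rstrip(".").split()  # naive tokenization: drop the final period, split on spaces
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.add(" ".join(words[i:i + n]))
    return grams

print(bag_of_ngrams("The cat sat on the mat.", 2))  # reproduces the bag-of-2-grams above
print(bag_of_ngrams("The cat sat on the mat.", 3))  # reproduces the bag-of-3-grams above

Because the result is a Python set, the order in which the n-grams were extracted is discarded, which is exactly the "bag" property described above.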
Analysis:
  Because bag-of-words is a tokenization method that does not preserve order (the generated tokens form a set rather than a sequence, discarding the overall structure of the sentence), it tends to be used in shallow language-processing models rather than in deep learning models. When working with lightweight shallow text-processing models (such as logistic regression and random forests), n-grams are a powerful and indispensable feature-engineering tool.
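
To make the shallow-model claim concrete, here is a minimal sketch of n-grams used as features for a logistic regression classifier, assuming scikit-learn is available; the toy texts and labels are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["the cat sat on the mat", "the dog ate my homework",
         "the cat ate the fish", "my dog sat on my homework"]
labels = [0, 1, 0, 1]  # made-up labels: 0 = cat sentence, 1 = dog sentence

# ngram_range=(1, 2) represents each text as counts of its 1-grams and 2-grams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

clf = LogisticRegression()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["the cat sat"])))  # expected: [0]

When raw counts over-weight frequent words, scikit-learn's TfidfVectorizer is a drop-in alternative with the same interface.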
  Extracting n-grams is a form of feature engineering. Deep learning does not need this rigid and brittle approach, replacing it with hierarchical feature learning.

Excerpted from "Deep Learning with Python"

Source: blog.csdn.net/ManWZD/article/details/108769833