Natural Language Processing Project Flow

Step 1: Acquire the Corpus

A corpus, i.e. language material, is the basic unit that makes up a corpus collection. In practice, text is simply used as a stand-in for language, and the context within a text as a stand-in for real-world linguistic context. A collection of texts is called a corpus; when there are several such collections, we call them corpora. (Definition source: Baidu Encyclopedia.) By source, corpora can be divided into the following two kinds:

1. Existing corpus

Paper or electronic text materials ==> digitization ==> corpus.

2. Downloaded or crawled corpus

Open standard datasets at home and abroad (such as the Chinese Sogou corpus or the People's Daily corpus), or data gathered with web crawlers (a minimal fetch sketch follows).
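
To make the crawler option concrete, below is a minimal sketch for downloading one page of raw text with the requests library; the URL is hypothetical, and a real crawler would also need politeness controls (rate limiting, robots.txt).

```python
# Minimal corpus-collection sketch with requests; EXAMPLE_URL is hypothetical.
import requests

EXAMPLE_URL = "https://example.com/article"

def fetch_page(url: str) -> str:
    """Download one page and return its raw HTML."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()                 # fail loudly on HTTP errors
    resp.encoding = resp.apparent_encoding  # guess encoding, useful for Chinese pages
    return resp.text

if __name__ == "__main__":
    html = fetch_page(EXAMPLE_URL)
    print(html[:200])  # preview the first 200 characters
```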

Step 2: Corpus Preprocessing

Corpus preprocessing will probably account for 50%-70% of the entire workload.

The basic pipeline: data cleaning ==> word segmentation ==> part-of-speech tagging ==> stop-word removal.

1. Corpus cleaning

Corpus cleaning: keep the content of interest in the corpus and delete everything else as noise. This includes extracting titles, abstracts, and body text from raw documents and, for crawled data, removing advertisements, tags, HTML, JS code, and comments.

Common data cleaning methods: manual de-duplication, alignment, deletion, and annotation; extracting content by rules; regular-expression matching; extraction based on parts of speech and named entities; and writing scripts or code for batch processing (see the regex sketch below).
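
As one illustration of rule-based cleaning, here is a small sketch that strips script/style blocks, HTML comments, and tags from crawled HTML with regular expressions; it is a toy, assuming a full HTML parser is not needed.

```python
# Regex-based cleaning sketch for crawled HTML (toy; a parser such as
# BeautifulSoup is more robust for messy real-world pages).
import re

def clean_html(raw: str) -> str:
    text = re.sub(r"<script.*?</script>", " ", raw, flags=re.S | re.I)  # drop JS
    text = re.sub(r"<style.*?</style>", " ", text, flags=re.S | re.I)   # drop CSS
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.S)                 # drop comments
    text = re.sub(r"<[^>]+>", " ", text)                                # drop remaining tags
    return re.sub(r"\s+", " ", text).strip()                            # normalize whitespace

print(clean_html("<p>Hello <b>NLP</b><!-- ad --></p>"))  # -> Hello NLP
```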

2. Word segmentation

Word segmentation: the process of turning short and long texts into sequences whose minimum unit of granularity is the word (or phrase).

Common methods: string-matching-based segmentation, understanding-based segmentation, statistics-based segmentation, and rule-based segmentation; each of these in turn covers a number of specific techniques.

Difficulties: ambiguity resolution and new-word recognition. E.g., the Chinese sentence 羽毛球拍卖完了 can be cut as 羽毛球拍 / 卖完了 ("the badminton rackets are sold out") or as 羽毛球 / 拍卖 / 完了 ("the badminton auction is over") ==> context information is needed to disambiguate (see the sketch below).
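
A quick way to see segmentation in action is the jieba library (a dictionary-plus-statistics segmenter, used here only as one example of the methods above):

```python
# Segmentation sketch with jieba on the ambiguous sentence from the text.
import jieba

sentence = "羽毛球拍卖完了"
print("/".join(jieba.cut(sentence)))                 # precise mode: one segmentation
print("/".join(jieba.cut(sentence, cut_all=True)))   # full mode: all candidate words
```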

3. Part-of-speech tagging

Part-of-speech tagging: attach a POS tag (e.g., adjective, verb, noun) to each word; this is a classic sequence-labeling problem. It helps bring more useful linguistic information into subsequent processing.

POS tagging is not always required. For example, ordinary text classification does not care about parts of speech, but tasks such as sentiment analysis and knowledge reasoning do need it; the figure below summarizes common Chinese POS tags.
[Figure: common Chinese part-of-speech tags]
Common methods: rule-based and statistics-based methods.

  • Statistics-based methods: maximum-entropy POS tagging, tagging by the statistically most probable output tag, and HMM-based POS tagging (a small tagging sketch follows).
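
For instance, jieba ships a built-in tagger (dictionary lookup plus an HMM for unknown words), which gives a quick taste of statistical POS tagging:

```python
# POS-tagging sketch with jieba.posseg; the tags shown are illustrative.
import jieba.posseg as pseg

for word, flag in pseg.cut("我爱自然语言处理"):
    print(word, flag)   # e.g. 我/r  爱/v  ... (actual tags may vary)
```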

4. Stop-word removal

Stop words: words that contribute nothing to the text's features, e.g., punctuation, modal particles, personal pronouns, and so on.

Note: decide according to the specific scenario. E.g., in sentiment analysis, modal particles and exclamation marks should be retained, because they express degree of tone and carry some emotional contribution and meaning (see the filtering sketch below).
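
A minimal stop-word filter, assuming a small inline stop list; in practice you would load a standard stop-word file chosen for your scenario:

```python
# Stop-word removal sketch; the stop list here is a tiny hypothetical example.
stop_words = {"的", "了", "是", ",", "。"}

tokens = ["自然语言", "处理", "是", "有趣", "的", "。"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['自然语言', '处理', '有趣']
```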

Step 3: Feature Engineering

After word segmentation, the question is how to represent the resulting characters and words as something a computer can compute with.
Idea: segmented Chinese word strings ==> vectors.

Two common representation models:

  • Bag-of-words model (BoW)
  • Word vectors

1. Bag-of-words model (BoW)

Bag-of-words model (Bag of Words, BoW): ignores the original word order of the sentence; every word or token is simply placed into one collection (e.g., a list) and its occurrences are counted. Such raw term-frequency statistics are only the most basic approach; TF-IDF is a classic weighting model in the bag-of-words family (see the sketch below).
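
A BoW/TF-IDF sketch with scikit-learn; the two-document corpus is toy data and is assumed to be pre-segmented (space-joined):

```python
# Bag-of-words counts vs. TF-IDF weights on a toy pre-segmented corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["我 爱 自然语言 处理", "自然语言 处理 很 有趣"]

bow = CountVectorizer()                       # plain term-frequency counts
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

tfidf = TfidfVectorizer()                     # TF-IDF weighted counts
print(tfidf.fit_transform(corpus).toarray())
```

Note that CountVectorizer's default tokenizer keeps only tokens of two or more characters, so single-character Chinese words are dropped unless you pass a custom token_pattern.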

2. Word vectors

Word vectors: computational models that convert characters and words into vectors or matrices.

Commonly used word representations:

  • One-Hot: represents each word as a very long vector whose dimensionality is the vocabulary size; almost all elements are 0, and only one dimension has value 1, which identifies the current word. E.g.: [0 0 0 0 0 0 0 0 1 0 0 0 0 ... 0]
  • Word2Vec: mainly comprises two models, the skip-gram model (Skip-Gram) and the continuous bag-of-words model (Continuous Bag of Words, CBOW), plus two efficient training methods: negative sampling (Negative Sampling) and hierarchical softmax (Hierarchical Softmax). Notably, Word2Vec word vectors capture similarity and analogy relations between words fairly well (see the training sketch after this list).
  • Doc2Vec
  • WordRank
  • FastText
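
Below is a Word2Vec training sketch with gensim (4.x API); the two-sentence corpus is toy data, so the resulting vectors are illustrative only:

```python
# Skip-gram Word2Vec sketch with gensim on a tiny pre-segmented corpus.
from gensim.models import Word2Vec

sentences = [["我", "爱", "自然语言", "处理"],
             ["自然语言", "处理", "很", "有趣"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv["自然语言"].shape)                 # a 50-dimensional word vector
print(model.wv.most_similar("处理", topn=2))      # nearest neighbours in vector space
```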

Step 4: Feature Selection

Key question: how do we construct good feature vectors?
==> Select suitable features with strong expressive power.

Six common feature selection methods: DF (document frequency), MI (mutual information), IG (information gain), CHI (chi-square test), WLLR (weighted log-likelihood ratio), and WFO (weighted frequency and odds); a CHI sketch follows.
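
As one concrete example, here is a CHI (chi-square) selection sketch on bag-of-words counts with scikit-learn; the documents and labels are hypothetical:

```python
# Chi-square feature selection sketch: keep the k highest-scoring features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["限时 优惠 点击 领取", "明天 上午 开会", "点击 链接 中奖", "会议 纪要 查收"]
y = [1, 0, 1, 0]  # hypothetical labels: 1 = spam, 0 = normal

X = CountVectorizer().fit_transform(corpus)
X_new = SelectKBest(chi2, k=3).fit_transform(X, y)
print(X_new.shape)  # (4, 3): three selected features remain
```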

Step 5: Model Training

1. Models

We use different models for different application requirements:

  • Traditional supervised and unsupervised machine learning models: KNN, SVM, Naive Bayes, decision trees, GBDT, K-means, etc.;
  • Deep learning models: CNN, RNN, LSTM, Seq2Seq, FastText, TextCNN, etc. (a small training sketch follows this list).
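
As a minimal end-to-end example, the sketch below trains a Naive Bayes text classifier on TF-IDF features; the pre-segmented training texts and labels are made up for illustration:

```python
# Model-training sketch: TF-IDF features + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["限时 优惠 点击 领取", "明天 上午 开会", "点击 链接 中奖", "会议 纪要 查收"]
train_labels = [1, 0, 1, 0]  # hypothetical: 1 = spam, 0 = normal

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict(["点击 领取 优惠"]))  # expected: [1]
```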

2. Caveats

(1) Overfitting

Overfitting: the model's learning capacity is so strong that it also learns the characteristics of the noise in the data, which hurts generalization; it performs well on the training set but poorly on the test set.

Common remedies include:

  • Increase the amount of training data;
  • Add regularization terms, such as L1 and L2 regularization;
  • If the chosen features are unreasonable, filter them manually or use a feature selection algorithm;
  • Apply Dropout and similar methods (see the sketch after this list).
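
Two of these remedies combined in a PyTorch sketch: Dropout layers in the network plus L2 regularization via the optimizer's weight_decay (layer sizes are arbitrary):

```python
# Overfitting mitigation sketch: Dropout + L2 penalty (weight_decay).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 term
```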

(2) Underfitting

Underfitting: the model cannot fit the data well, usually because it is too simple.

Common remedies include:

  • Add more feature terms;
  • Increase model complexity, e.g., add more layers to a neural network, or add polynomial terms to a linear model to strengthen its expressive power;
  • Reduce the regularization coefficient: regularization exists to prevent overfitting, so when the model underfits, the coefficient should be lowered.

(3) For neural networks, watch out for vanishing and exploding gradients; a gradient-clipping sketch follows.
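
One standard defense against exploding gradients is to clip the global gradient norm before each optimizer step, sketched here in PyTorch with dummy data:

```python
# Gradient-clipping sketch: cap the global gradient norm at 1.0.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 10), torch.randn(4, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```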

Source: www.cnblogs.com/zrmw/p/11248625.html