- Step 1: Obtain a corpus
- Step 2: Preprocess the corpus
- Step 3: Feature engineering
- Step 4: Feature selection
- Step 5: Model training
Step 1: Obtain a corpus
A corpus is the raw material of language study: text serves as a stand-in for language, and the real-world context of that text as a stand-in for linguistic context. A collection of texts is called a corpus (Corpus); several such collections together are called corpora (Corpora). (Definition source: Baidu Encyclopedia.) By source, corpora can be divided into two kinds:
1. Existing corpus
Paper or electronic text data ==》digitization ==》corpus.
2. Downloaded or crawled corpus
Public standard datasets at home and abroad (e.g., the Chinese Sogou corpus or the People's Daily corpus), or data collected with a web crawler.
Step 2: Preprocess the corpus
Corpus preprocessing typically accounts for 50%–70% of the total workload.
Basic pipeline: data cleaning ==》word segmentation ==》part-of-speech tagging ==》stop-word removal
1. Corpus cleaning
Corpus cleaning: keep the content of interest in the corpus and delete the uninteresting parts as noise. This includes extracting the title, abstract, body, and other information from raw text, and, for crawled data, removing ads, tags, HTML, JS, and other code and comments.
Common cleaning methods: manual de-duplication, alignment, deletion, and annotation; rule-based content extraction; regular-expression matching; extraction by part of speech or named entities; and batch processing with scripts or code.
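As a minimal sketch of the regular-expression cleaning mentioned above (the tag patterns and the sample page are illustrative assumptions, not a production-grade HTML parser):

```python
import re

def clean_html(raw: str) -> str:
    """Strip script/style blocks, HTML comments, and tags, then collapse whitespace."""
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", raw, flags=re.S | re.I)
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.S)  # HTML comments
    text = re.sub(r"<[^>]+>", " ", text)                 # remaining tags
    return re.sub(r"\s+", " ", text).strip()

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello <b>corpus</b></p><!-- ad --></body></html>")
print(clean_html(page))  # → Hello corpus
```

For real pages an HTML parser is more robust than regexes, but for batch cleaning of roughly uniform crawled text, a few patterns like these go a long way.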
2. Word segmentation
Word segmentation: the process of splitting short or long text into its minimum-granularity units, words or phrases.
Common methods: string-matching (dictionary-based) segmentation, understanding-based segmentation, statistics-based segmentation, and rule-based segmentation, each of which covers a number of concrete algorithms.
Difficulties: ambiguity resolution and new-word recognition. E.g., "羽毛球拍卖完了" can be segmented as "羽毛球拍 / 卖完了" (the badminton rackets are sold out) or as "羽毛球 / 拍卖 / 完了" (the badminton auction is over) ==》context information is needed.
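The string-matching family above can be sketched with forward maximum matching (FMM); the toy vocabulary here is an assumption for illustration:

```python
def fmm_segment(sentence, vocab, max_len=4):
    """Forward maximum matching: at each position, greedily take the
    longest dictionary word; unknown single characters pass through."""
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

vocab = {"羽毛球", "羽毛球拍", "拍卖", "卖完", "完了"}
print(fmm_segment("羽毛球拍卖完了", vocab))  # → ['羽毛球拍', '卖完', '了']
```

Note that the greedy longest match commits to 羽毛球拍 and yields only one of the two readings, which is precisely why ambiguity resolution needs context beyond the dictionary.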
3. Part-of-speech tagging
Part-of-speech tagging: assigning a POS tag (e.g., adjective, verb, noun) to each word, a classic sequence-labeling problem. It helps incorporate more useful linguistic information in subsequent processing.
POS tagging is not always required. For example, ordinary text classification does not care about part of speech, but tasks such as sentiment analysis and knowledge reasoning do.
Common methods: rule-based and statistics-based methods.
- Statistics-based methods: maximum-entropy POS tagging, POS tagging based on the maximum-probability output, and HMM-based POS tagging.
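A toy sketch of the HMM-based tagging mentioned above, decoded with the Viterbi algorithm; the two-tag state set and all probabilities are made-up illustrative values, not trained ones:

```python
states = ["N", "V"]  # noun, verb
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {
    "N": {"dogs": 0.4, "cats": 0.4, "bark": 0.1, "sleep": 0.1},
    "V": {"dogs": 0.1, "cats": 0.1, "bark": 0.4, "sleep": 0.4},
}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # Each layer maps state -> (best probability so far, best path so far).
    layer = {s: (start[s] * emit[s][words[0]], [s]) for s in states}
    for w in words[1:]:
        layer = {
            s: max((layer[p][0] * trans[p][s] * emit[s][w], layer[p][1] + [s])
                   for p in states)
            for s in states
        }
    return max(layer.values())[1]

print(viterbi(["dogs", "bark"]))  # → ['N', 'V']
```

A real tagger estimates `start`, `trans`, and `emit` from a hand-annotated corpus; the decoding step stays exactly this.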
4. Stop-word removal
Stop words: words that contribute nothing to the text's features, e.g., punctuation, modal particles, and personal pronouns.
Note: the stop-word list should depend on the specific scenario. E.g., in sentiment analysis, modal particles and exclamation marks should be retained, because they express degrees of tone and carry some emotional contribution and meaning.
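A minimal sketch of scene-dependent stop-word removal; the tiny stop-word list and the `keep` parameter are illustrative assumptions:

```python
STOPWORDS = {"the", "a", "is", "!", ",", "."}  # toy list; real lists are larger

def remove_stopwords(tokens, keep=()):
    """Drop stop words, except those explicitly kept for the task at hand."""
    return [t for t in tokens if t not in STOPWORDS or t in keep]

tokens = ["the", "movie", "is", "great", "!"]
print(remove_stopwords(tokens))              # → ['movie', 'great']
print(remove_stopwords(tokens, keep={"!"}))  # → ['movie', 'great', '!']
```

The `keep` set is how the scenario note above plays out in code: for sentiment analysis you would keep "!" and modal particles rather than filter them.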
Step 3: Feature engineering
The question: how can the words and phrases produced by segmentation be represented as something a computer can compute on?
Idea: Chinese word strings ==》vectors
Two common representation models:
- Bag-of-words model (BoW)
- Word vectors
1. Bag-of-words model (BoW)
Bag-of-words model (Bag of Words, BoW): ignores the original word order of the sentence and simply puts every word or symbol into one collection (e.g., a list), counting the number of occurrences of each. Plain term-frequency counting is only the most basic approach; TF-IDF is the classic use of the bag-of-words model.
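The counting and the TF-IDF weighting described above fit in a few lines. This uses one common TF-IDF variant, tf × log(N/df); real libraries usually smooth the IDF term, and the two-document corpus is an assumption for illustration:

```python
import math
from collections import Counter

docs = [["apple", "banana", "apple"], ["banana", "cherry"]]

# Bag of words: unordered term counts per document.
bows = [Counter(d) for d in docs]

# Document frequency: in how many documents each term appears.
N = len(docs)
df = Counter(t for d in docs for t in set(d))

def tfidf(bow):
    """Weight each term's frequency by its inverse document frequency."""
    total = sum(bow.values())
    return {t: (c / total) * math.log(N / df[t]) for t, c in bow.items()}

print(bows[0])         # counts: apple 2, banana 1
print(tfidf(bows[0]))  # 'banana' scores 0: it occurs in every document
```

Note how "banana", which appears in every document, gets weight zero: TF-IDF discounts terms that do not discriminate between documents.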
2. Word vectors
Word vectors: a computational model that converts characters and words into vectors (matrices of numbers).
Commonly used word representations:
- One-Hot: represents each word as a very long vector whose dimension equals the vocabulary size. Almost all elements are 0; a single dimension has the value 1, and that dimension identifies the word. E.g.:
[0 0 0 0 0 0 0 0 1 0 0 0 0 ... 0]
- Word2Vec: comprises two models, the skip-gram model (Skip-Gram) and the continuous bag-of-words model (Continuous Bag of Words, CBOW), plus two efficient training methods: negative sampling (Negative Sampling) and hierarchical softmax (Hierarchical Softmax). Notably, Word2Vec word vectors capture similarity and analogy relations between words quite well.
- Doc2Vec
- WordRank
- FastText
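The One-Hot scheme at the top of the list above can be sketched directly; the five-word vocabulary is an assumption:

```python
vocab = ["我", "爱", "自然", "语言", "处理"]

def one_hot(word, vocab):
    """Vector of |V| zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("自然", vocab))  # → [0, 0, 1, 0, 0]
```

The sparsity and the lack of any similarity structure (every pair of one-hot vectors is equally distant) are exactly what dense representations like Word2Vec and FastText improve on.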
Step 4: Feature selection
Key question: how to construct good feature vectors?
==》Choose appropriate features with strong expressive power.
Common feature-selection methods, six in all: DF, MI, IG, CHI, WLLR, and WFO.
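Of the six criteria listed, DF (document frequency) is the simplest: keep only terms that occur in at least a minimum number of documents. A minimal sketch, with a toy corpus and the `min_df` threshold as assumptions:

```python
from collections import Counter

def df_select(docs, min_df=2):
    """Keep terms whose document frequency is at least min_df."""
    df = Counter(t for d in docs for t in set(d))
    return {t for t, c in df.items() if c >= min_df}

docs = [["good", "movie"], ["good", "plot"], ["bad", "movie"]]
print(df_select(docs))  # → {'good', 'movie'}
```

The other criteria (MI, IG, CHI, WLLR, WFO) score each term against the class labels instead of just counting documents, but they plug into the pipeline in the same way: score every term, keep the top-scoring ones.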
Step 5: Model training
1. Models
Different applications call for different models:
- Traditional supervised and unsupervised machine-learning models: KNN, SVM, Naive Bayes, decision trees, GBDT, K-means, etc.;
- Deep-learning models: CNN, RNN, LSTM, Seq2Seq, FastText, TextCNN, etc.
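As one concrete example from the traditional-model list, here is a self-contained multinomial Naive Bayes text classifier with add-one smoothing; the tiny spam/ham corpus is an illustrative assumption:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.n = len(labels)
        self.prior = Counter(labels)                    # class -> doc count
        self.counts = {c: Counter() for c in self.prior}
        for d, y in zip(docs, labels):
            self.counts[y].update(d)                    # class -> term counts
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict(self, doc):
        def logp(c):
            # log P(c) + sum over tokens of log P(t | c), smoothed.
            denom = sum(self.counts[c].values()) + len(self.vocab)
            s = math.log(self.prior[c] / self.n)
            for t in doc:
                s += math.log((self.counts[c][t] + 1) / denom)
            return s
        return max(self.prior, key=logp)

train = [["buy", "cheap", "pills"], ["cheap", "offer"],
         ["meeting", "at", "noon"], ["lunch", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
clf = NaiveBayes().fit(train, labels)
print(clf.predict(["cheap", "pills"]))  # → spam
```

In practice the input `docs` would be the output of the preceding steps: cleaned, segmented, stop-word-filtered token lists, optionally reweighted by TF-IDF.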
2. Caveats
(1) Overfitting
Overfitting: the model's learning capacity is so high that it also learns the features of the noise, hurting generalization; it performs very well on the training set but poorly on the test set.
Common remedies:
- Increase the amount of training data;
- Add regularization terms, such as L1 and L2 regularization;
- Re-examine unreasonable feature choices, screening features manually or with feature-selection algorithms;
- Apply Dropout, etc.
(2) Underfitting
Underfitting: the model fails to fit the data well, typically because it is too simple.
Common remedies:
- Add more features;
- Increase model complexity, e.g., add more layers to a neural network, or add polynomial terms to a linear model to make it more expressive;
- Reduce the regularization coefficient: regularization exists to prevent overfitting, so when the model underfits, the regularization strength should be reduced.