The complete process of Chinese natural language processing: Lesson 01

Step 1: Obtain a corpus

A corpus is, simply put, language material, and corpora are the object of study in corpus linguistics. Text is the basic unit that makes up a corpus. In practice, we simply use text as a substitute for language, and the real-world context of the text as a substitute for linguistic context. A single collection of texts is called a corpus; when there are several such collections, we call them corpora. (Definition source: Baidu Encyclopedia.) By source, corpora can be divided into the following two kinds:

1. Existing corpora

Many businesses, companies, and other organizations accumulate large amounts of paper or electronic text data as their operations grow. So, where conditions permit, a little consolidation of this material, with the paper documents fully digitized, gives us a usable corpus.

2. Downloaded or crawled corpora

What if you have no data of your own at the moment? In that case, you can obtain standard open datasets from home or abroad, such as the Chinese Sogou corpus or the People's Daily corpus. Foreign datasets are mostly in English or other foreign languages, so they are of little use here. You can also write a crawler to scrape some data yourself and then process it in the steps that follow; a minimal crawling sketch is shown below.
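Below is a minimal crawling sketch using requests and BeautifulSoup. The URL and the article-body class name are placeholders, not a real site layout; always check robots.txt and the site's terms before scraping.

```python
# A minimal corpus-scraping sketch; the URL and CSS class below are
# placeholders (hypothetical), not a real site layout.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/news/12345"   # hypothetical article page
resp = requests.get(url, timeout=10)
resp.encoding = resp.apparent_encoding   # helps with GBK-encoded Chinese pages
soup = BeautifulSoup(resp.text, "html.parser")

# Keep only the article body; the class name is an assumption.
article = soup.find("div", class_="article-content")
text = article.get_text(separator="\n", strip=True) if article else ""
print(text[:200])
```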

Step 2: Corpus preprocessing

This step focuses on corpus preprocessing. In a complete Chinese NLP engineering application, corpus preprocessing accounts for roughly 50%-70% of the entire workload, so developers spend most of their time on it. Preprocessing is completed through the following four major steps: data cleaning, word segmentation, part-of-speech tagging, and stop-word removal.

1. Corpus cleaning

Data cleaning, as the name suggests, means keeping the content of the corpus we are interested in and deleting uninteresting content as noise. This includes extracting the title, abstract, and body from raw text, and, for crawled web content, removing ads, tags, HTML, JavaScript, code, and comments. Common data-cleaning methods include: manual deduplication, alignment, and deletion; rule-based content extraction; regular-expression matching; extraction based on part of speech and named entities; and writing scripts or batch-processing code.
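As a small illustration, here is a minimal cleaning sketch using regular expressions, assuming the raw text still carries HTML tags and inline scripts from crawling:

```python
# A minimal cleaning sketch with regular expressions; real pipelines
# often use an HTML parser instead for robustness.
import re

def clean_text(raw: str) -> str:
    text = re.sub(r"<script.*?</script>", "", raw, flags=re.S)  # drop inline JS
    text = re.sub(r"<[^>]+>", "", text)                         # drop HTML tags
    text = re.sub(r"\s+", " ", text)                            # collapse whitespace
    return text.strip()

print(clean_text("<div>你好<script>var a=1;</script>，世界！</div>"))
# -> 你好，世界！
```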

2. Word segmentation

Chinese corpus data consists of short or long texts, for example: collections of sentences, article abstracts, paragraphs, or entire articles. In general, the characters within a sentence or paragraph run continuously and carry meaning together. For text mining and analysis, we want the minimal processing unit to be the word, so at this point we need to segment all of the text into words.

Common segmentation algorithms include: string-matching-based methods, understanding-based methods, statistics-based methods, and rule-based methods; each category corresponds to a number of specific algorithms.

The main difficulties for current Chinese word segmentation algorithms are ambiguity resolution and new-word recognition. For example, "羽毛球拍卖完了" can be segmented as "羽毛球拍 / 卖 / 完 / 了" ("the badminton rackets are sold out") or as "羽毛球 / 拍卖 / 完 / 了" ("the badminton auction is over"); without context, it is hard to know which reading is intended.
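A minimal sketch with the popular jieba segmenter on this very sentence (the output depends on the jieba version and dictionary):

```python
# A minimal word-segmentation sketch with jieba; results may vary
# across jieba versions and dictionaries.
import jieba

sentence = "羽毛球拍卖完了"
print("/".join(jieba.cut(sentence)))                # accurate (default) mode
print("/".join(jieba.cut(sentence, cut_all=True)))  # full mode lists all candidates
```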

3. Part-of-speech tagging

Part-of-speech (POS) tagging means attaching a POS tag, such as adjective, verb, or noun, to each word. This injects more useful linguistic information into the text for subsequent processing. POS tagging is a classic sequence-labeling problem, although it is not required for every Chinese NLP task: ordinary text classification, for example, does not care about part of speech, whereas tasks like sentiment analysis and knowledge inference do. The figure below shows a summary of common Chinese part-of-speech tags.

[Figure: a summary of common Chinese part-of-speech tags]

Common POS tagging methods can be divided into rule-based and statistics-based approaches. Statistics-based methods include maximum-entropy POS tagging, POS tagging based on the statistically most probable output, and HMM-based POS tagging.
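As a small illustration, here is a minimal POS-tagging sketch with jieba.posseg, whose tagger is one of the HMM-based statistical methods mentioned above:

```python
# A minimal POS-tagging sketch; jieba.posseg uses an HMM-based tagger.
import jieba.posseg as pseg

for word, flag in pseg.cut("我爱北京天安门"):
    print(word, flag)   # e.g. 我 r, 爱 v, 北京 ns, 天安门 ns
```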

4. Stop-word removal

Stop words generally refer to words that contribute nothing to the text's features, such as punctuation marks, modal particles, and the like. So in ordinary text processing, the step after segmentation is stop-word removal. For Chinese, however, stop-word removal is not a fixed operation: the stop-word dictionary should be determined by the specific scenario. In sentiment analysis, for example, modal particles and exclamation marks should be retained, because they carry a certain emotional color and contribute to the meaning.
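A minimal filtering sketch; the inline stop-word set is a tiny stand-in for a real, scenario-specific stop-word dictionary file:

```python
# A minimal stop-word filtering sketch; real projects load a
# scenario-specific stop-word dictionary from a file.
import jieba

stopwords = {"的", "了", "是", "，", "。"}   # tiny stand-in list
tokens = jieba.lcut("这只是一个简单的例子，说明了停用词的过滤。")
filtered = [t for t in tokens if t not in stopwords]
print(filtered)
```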

Step 3: Feature engineering

After corpus preprocessing, we next need to consider how to represent the segmented words in a form a computer can compute with. Obviously, to compute anything, we must at least convert each Chinese word string into a number, or more precisely, into a mathematical vector. Two common representation models are the bag-of-words model and the word vector model.

The bag-of-words model (Bag of Words, BOW) ignores the original order of the words in a sentence: every word or symbol is simply placed into one collection (such as a list), and the number of occurrences is counted. Raw term frequency is only the most basic approach; TF-IDF is a classic application of the bag-of-words model.
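A minimal TF-IDF sketch with scikit-learn; the documents are segmented with jieba and joined by spaces first, and the token_pattern is widened so single-character Chinese words are not dropped:

```python
# A minimal bag-of-words / TF-IDF sketch on a toy two-document corpus.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["我喜欢自然语言处理", "自然语言处理很有趣"]
segmented = [" ".join(jieba.cut(d)) for d in docs]  # space-separated tokens

# The default token_pattern discards single-character tokens; widen it.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(segmented)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```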

A word vector is a computational model that converts characters and words into vector matrices. The most common word representation so far is one-hot, which represents each word as a very long vector: its dimension is the vocabulary size, almost all elements are 0, and only one dimension has the value 1, which stands for the current word. There is also Word2Vec from the Google team, which mainly comprises two models, the Skip-Gram model and the Continuous Bag of Words (CBOW) model, plus two efficient training methods, Negative Sampling and Hierarchical Softmax. Notably, Word2Vec word vectors express similarity and analogy relations between words quite well. Beyond these, there are other word-vector representations, such as Doc2Vec, WordRank, and FastText.
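A minimal gensim Word2Vec sketch on a tiny pre-segmented toy corpus; real training needs far more text before the vectors become meaningful:

```python
# A minimal Word2Vec training sketch (gensim 4.x API) on toy data.
from gensim.models import Word2Vec

sentences = [["我", "喜欢", "自然语言", "处理"],
             ["自然语言", "处理", "很", "有趣"],
             ["我", "喜欢", "机器", "学习"]]

# sg=1 selects Skip-Gram; sg=0 (the default) is CBOW.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
print(model.wv["自然语言"])                       # the 50-dim vector for one word
print(model.wv.most_similar("自然语言", topn=3))  # nearest words in the toy space
```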

Step 4: Feature selection

As in data mining, feature engineering is indispensable in text mining problems. In a real problem, constructing a good feature vector means selecting appropriate, expressive features. Text features are usually words, which carry semantic information; feature selection can find a feature subset that still retains that semantic information, whereas the feature subspace found by feature extraction loses part of it. Feature selection is therefore a challenging process that relies heavily on experience and domain knowledge, although many ready-made selection algorithms exist. At present, the six common feature selection methods are DF, MI, IG, CHI, WLLR, and WFO.
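As one concrete example of the six, here is a minimal chi-square (CHI) feature selection sketch with scikit-learn on toy, pre-segmented sentiment data:

```python
# A minimal chi-square (CHI) feature selection sketch on toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["好 喜欢 推荐", "差 失望 退货", "喜欢 满意", "失望 难用"]
labels = [1, 0, 1, 0]                        # 1 = positive, 0 = negative

X = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)
selector = SelectKBest(chi2, k=2)            # keep the 2 most discriminative words
X_new = selector.fit_transform(X, labels)
print(X_new.shape)                           # (4, 2)
```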

Step 5: Model training

Once the feature vectors have been chosen, the next thing to do is, of course, train a model. Different application requirements call for different models: traditional supervised and unsupervised machine learning models such as KNN, SVM, Naive Bayes, decision trees, GBDT, and K-means; and deep learning models such as CNN, RNN, LSTM, Seq2Seq, FastText, and TextCNN. All of these models will be used in later examples on classification, clustering, neural sequence modeling, and sentiment analysis, so I will not elaborate on them here. Below are several points to pay attention to during model training.
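To make the pipeline concrete, here is a minimal sketch that trains one of the traditional models above (Naive Bayes) on a toy, pre-segmented sentiment dataset:

```python
# A minimal text-classification training sketch: TF-IDF + Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["非常 好用 推荐", "质量 差 失望", "很 满意 喜欢", "太 难用 了 退货"]
train_labels = [1, 0, 1, 0]                  # 1 = positive, 0 = negative

clf = make_pipeline(
    TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"),
    MultinomialNB(),
)
clf.fit(train_docs, train_labels)
print(clf.predict(["做工 很 好 推荐"]))       # likely [1] on this toy data
```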

1. Watch out for overfitting and underfitting, and keep improving the model's generalization ability.

Overfitting: the model's learning capacity is so strong that it learns the features of the noise in the data as well, which hurts generalization; it performs very well on the training set but poorly on the test set.

Common solutions include:

  • Increase the amount of training data;
  • Add regularization terms, such as L1 and L2 regularization;
  • If the feature selection was unreasonable, filter features manually or use a feature selection algorithm;
  • Apply Dropout, and so on.

Underfitting: the model cannot fit the data well, which shows up as the model being too simple.

Common solutions include:

  • Add more feature terms;
  • Increase model complexity, for example by adding more layers to a neural network, or adding polynomial terms to a linear model to make it more expressive;
  • Reduce the regularization parameter: regularization exists to prevent overfitting, so when the model underfits, the regularization parameter should be reduced (see the sketch after this list).
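A minimal scikit-learn sketch of this regularization knob: in LogisticRegression, C is the inverse regularization strength, so a small C means a strong L2 penalty (guards against overfitting) and a large C means a weak one (helps when underfitting):

```python
# A minimal sketch of tuning regularization strength on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

for C in (0.01, 1.0, 100.0):   # strong -> weak L2 regularization
    model = LogisticRegression(C=C, penalty="l2", max_iter=1000)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"C={C}: mean CV accuracy = {score:.3f}")
```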

2. For neural networks, watch out for vanishing and exploding gradients.
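As an aside, here is a minimal sketch (assuming PyTorch) of gradient clipping, a common mitigation for exploding gradients; vanishing gradients are usually addressed architecturally, e.g. with LSTM gates or residual connections:

```python
# A minimal gradient-clipping sketch with a dummy LSTM and loss.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 20, 50)        # batch of 8 sequences of length 20
output, _ = model(x)
loss = output.pow(2).mean()       # dummy loss, for illustration only

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```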

Origin: blog.csdn.net/dongdouzin/article/details/80814037