Classification and Processing of Natural Language

The two most widely used classification models are the Decision Tree Model and the Naive Bayes Classifier (NBC).

Naive Bayes is a classification method based on Bayes' theorem and the assumption of conditional independence between features.
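Bayes' theorem can be made concrete with a small numeric sketch. The numbers below are hypothetical (a toy spam-filtering setting with a single feature), chosen only to show how the posterior is computed from the prior and the likelihoods.

```python
# Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X).
# Toy, hypothetical numbers: one feature, "the message contains 'free'".
p_spam = 0.3               # prior P(spam)
p_free_given_spam = 0.8    # likelihood P('free' | spam)
p_free_given_ham = 0.1     # likelihood P('free' | ham)

# Total probability of the evidence, P('free')
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(spam | 'free')
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 4))  # 0.7742
```

Observing the word raises the spam probability from the 0.3 prior to about 0.77; the denominator `p_free` is what normalizes the two competing hypotheses.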

 

In fact, no one is unfamiliar with classification. It is no exaggeration to say that each of us performs classification every day without realizing it. For example, when you see a stranger, your mind subconsciously judges whether that person is a man or a woman; walking down the street, you might say to the friend beside you, "That person looks rich at first glance," or "There is a non-mainstream person over there." These are all classification operations.

 

Classification is the process of assigning an unknown sample to one of several predefined classes. Solving a data classification problem is a two-step process. In the first step, a model is built that describes a given dataset or set of concepts. The model is constructed by analyzing samples (also called instances or objects) described by attributes. Each sample is assumed to belong to a predefined class, identified by an attribute called the class label. The data tuples analyzed to build the model form the training dataset; because the class labels are known in advance, this step is also known as supervised learning.

Among the many classification models, the two most widely used are the Decision Tree Model and the Naive Bayes Classifier (NBC). Decision tree models solve classification problems by constructing trees.

 

The training dataset is first used to construct a decision tree; once the tree is built, it can classify unknown samples. Decision tree models have many advantages in classification problems. They are easy to use and efficient; rules can be read directly off the tree, and those rules are usually easy to interpret and understand; decision trees scale well to large databases, and the size of the tree is independent of the size of the database; another great advantage is that trees can be built for datasets with many attributes. Decision tree models also have disadvantages, such as difficulty handling missing data, susceptibility to overfitting, and ignoring correlations between attributes in the dataset.
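The claim that "rules can be read directly off the tree" can be illustrated with a minimal sketch: a one-level decision tree (a decision stump) learned from a toy, entirely hypothetical dataset, whose split threshold translates directly into a human-readable IF/THEN rule.

```python
# Minimal decision stump (one-level decision tree) on a hypothetical dataset,
# illustrating how a tree yields a human-readable rule.
samples = [  # (word_count, label)
    (120, "short"), (300, "short"), (900, "long"), (1500, "long"),
]

def best_threshold(data):
    """Try midpoints between consecutive values; keep the split with fewest errors."""
    data = sorted(data)
    best, best_err = None, len(data) + 1
    for (a, _), (b, _) in zip(data, data[1:]):
        t = (a + b) / 2
        err = sum((x <= t) != (y == "short") for x, y in data)
        if err < best_err:
            best, best_err = t, err
    return best

t = best_threshold(samples)
rule = f"IF word_count <= {t} THEN short ELSE long"
print(rule)  # IF word_count <= 600.0 THEN short ELSE long
```

A full decision tree applies this splitting recursively and with a better purity criterion (e.g. information gain), but the rule-extraction property is the same.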

 

Compared with the decision tree model, the Naive Bayes Classifier (NBC) originated in classical mathematical theory, has a solid mathematical foundation, and offers stable classification efficiency. The NBC model requires few parameters to be estimated, is not sensitive to missing data, and is algorithmically simple. In theory, the NBC model has the smallest error rate among classification methods. In practice this is not always the case, because the NBC model assumes that the attributes are mutually independent, an assumption that often fails in real applications and degrades the model's classification accuracy.
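The independence assumption means the joint likelihood of all attributes given a class factors into a product of per-attribute likelihoods. A minimal sketch, with hypothetical per-word probabilities:

```python
from math import prod

# Independence assumption: P(x1, ..., xn | C) = product of P(xi | C).
# Hypothetical per-word likelihoods for a class "sports".
likelihoods = {"ball": 0.20, "team": 0.15, "score": 0.10}
prior_sports = 0.4

# Joint likelihood of a document containing all three words, given "sports"
joint = prod(likelihoods[w] for w in ["ball", "team", "score"])

# Unnormalized posterior: P(C | X) is proportional to P(C) * product P(xi | C)
score = prior_sports * joint
print(score)  # 0.0012
```

This factorization is exactly what breaks when attributes are correlated: the true joint likelihood is no longer the product of the marginals, which is why correlated attributes hurt NBC accuracy.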

A common remedy is to build an attribute model and handle mutually dependent attributes separately. For example, when classifying Chinese text, we can build a dictionary so that certain phrases are treated as units. If a particular pattern attribute is found in a particular problem, it is handled separately.

This is consistent with the Bayesian probability principle, because we treat a phrase as a single pattern. English text processing does the same with words of different lengths, treating each as a separate, independent pattern. This is one way natural language classification differs from other recognition problems.

When the prior probabilities are actually computed, these patterns are handled by the program as probabilities rather than understood as natural language, so the result is the same either way.

When the number of attributes is large or the correlations between attributes are strong, the classification performance of the NBC model falls behind that of the decision tree model. This needs to be verified case by case: different problems yield different results, and the same algorithm can show different recognition performance on the same problem once the patterns change. This has been recognized in many papers. The book Machine Learning also notes that an algorithm's performance depends on many factors, such as the ratio of training samples to test samples.

Whether decision trees are suitable for text classification and recognition depends on the specific situation. The NBC model performs slightly better when attribute correlations are small; but when correlations are small, other algorithms also perform well, as information entropy theory predicts.

 

 

Naive Bayes classification is divided into three stages:

Stage 1: preparation. The task of this stage is to make the necessary preparations for naive Bayes classification. The main work is to determine the feature attributes according to the specific problem, partition each feature attribute appropriately, and then manually classify a subset of the items to be classified, forming the training sample set. The input of this stage is all the data to be classified; the output is the feature attributes and the training samples. This is the only stage of the entire naive Bayes process that must be done manually, and its quality has a major influence on the whole process: the quality of the classifier is largely determined by the feature attributes, their partitioning, and the quality of the training samples.
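The preparation stage above can be sketched as turning hand-labeled documents into (features, label) pairs plus a vocabulary. The documents and labels below are hypothetical, and tokenization by whitespace is a deliberate simplification.

```python
# Preparation stage sketch: hand-labeled documents become (feature-set, label)
# training pairs. All documents and labels are hypothetical toy data.
labeled_docs = [
    ("win cash prize now", "spam"),
    ("meeting agenda attached", "ham"),
    ("cash bonus win", "spam"),
    ("lunch meeting tomorrow", "ham"),
]

# Feature attributes here: "which words appear in the document"
training_set = [(set(text.split()), label) for text, label in labeled_docs]
vocabulary = sorted({w for feats, _ in training_set for w in feats})
print(len(training_set), len(vocabulary))
```

The manual work is the labeling and the choice of features; everything after this point can be automated.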

 

Stage 2: classifier training. The task of this stage is to generate the classifier. The main work is to compute the frequency of each class in the training samples and the conditional probability estimate of each feature-attribute partition for each class, and to record the results. The input is the feature attributes and the training samples; the output is the classifier. This stage is mechanical: based on the formulas discussed earlier, it can be computed automatically by a program.
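A minimal sketch of this training stage, assuming the toy labeled data from the preparation stage: estimate class priors from class frequencies and per-word conditional probabilities with add-one (Laplace) smoothing, a common choice to avoid zero probabilities.

```python
from collections import Counter, defaultdict

# Training stage sketch: estimate P(C) and P(word | C) from a tiny
# hand-labeled corpus. All data hypothetical.
training_set = [
    ({"win", "cash", "prize", "now"}, "spam"),
    ({"meeting", "agenda", "attached"}, "ham"),
    ({"cash", "bonus", "win"}, "spam"),
    ({"lunch", "meeting", "tomorrow"}, "ham"),
]

class_counts = Counter(label for _, label in training_set)
priors = {c: n / len(training_set) for c, n in class_counts.items()}

word_counts = defaultdict(Counter)          # word_counts[label][word] = doc count
for feats, label in training_set:
    word_counts[label].update(feats)

def cond_prob(word, label):
    """P(word | label): fraction of label's documents containing word,
    with add-one smoothing (Bernoulli-style: +1 numerator, +2 denominator)."""
    return (word_counts[label][word] + 1) / (class_counts[label] + 2)

print(priors["spam"], cond_prob("cash", "spam"))  # 0.5 0.75
```

The recorded priors and conditional probabilities together constitute the classifier.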

 

Stage 3: application. The task of this stage is to classify the items to be classified using the classifier. The input is the classifier and the items to be classified; the output is the mapping from each item to a class. This stage is also mechanical and is carried out by a program.
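The application stage reduces to scoring each class by log P(C) + the sum of log P(word | C) and picking the argmax. A minimal sketch, using hypothetical probabilities of the kind the training stage would produce (the 0.5 fallback for unseen words is also an assumption of this sketch):

```python
from math import log

# Application stage sketch: argmax over classes of
#   log P(C) + sum of log P(word | C).
# Probabilities are hypothetical, e.g. estimated in the training stage.
priors = {"spam": 0.5, "ham": 0.5}
cond = {
    "spam": {"cash": 0.75, "meeting": 0.25},
    "ham":  {"cash": 0.25, "meeting": 0.75},
}

def classify(words):
    scores = {
        c: log(priors[c]) + sum(log(cond[c].get(w, 0.5)) for w in words)
        for c in priors
    }
    return max(scores, key=scores.get)

print(classify({"cash"}))     # spam
print(classify({"meeting"}))  # ham
```

Working in log space avoids numeric underflow when many conditional probabilities are multiplied.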

 

 
