Text Classification Learning (6) AdaBoost and SVM

I have jumped straight from feature extraction to AdaBoost + SVM because lately I have been writing code, analyzing spam text, and thinking about the shortcomings of text classification for spam detection. Spam detection is the whole reason I am teaching myself text classification.

The posts that should come in between will be filled in after I have researched those topics thoroughly.


When I started collecting spam text, I found that it is not simple at all; it has several characteristics:

1. It is highly varied, so it is hard to find common features. It spans advertising for every industry, politically sensitive content, and pornographic material. This is unlike ordinary text classification, where all the documents of a category belong to one domain and feature extraction is straightforward.

2. It is disguised to a certain degree. On the surface 80% of the content is normal, and only 20% actually pitches the advertisement.

3. It takes diverse forms. A lot of spam is written in "Martian script" (deliberately obfuscated characters), and a lot of it is stuffed with links; neither kind can be word-segmented properly.


One reason for considering AdaBoost is that the combination of AdaBoost and SVM has indeed been studied: SVMs are trained as the weak classifiers, which are then combined into a strong classifier.

The second reason is that I suspect an SVM on its own will classify the spam described above poorly. An SVM normally yields a strong classifier, but if no matter how its parameters are tuned the accuracy hovers around 50%, it is effectively a weak classifier. Can AdaBoost fix that? Everything has to be verified in practice.


The basic idea of the AdaBoost algorithm:

At the start we are given a training set of N samples, and each sample is assigned a weight w; initially every sample's weight is w = 1/N.

Then we run the learning algorithm on this set to obtain a classifier (a weak classifier). Some training samples will be misclassified by this weak classifier; we increase the weights of those misclassified samples and compute a weight for the classifier itself from its weighted error rate ε (the standard formula is α = ½ · ln((1 − ε)/ε)).

In the third step, using the reweighted training set, we focus on the samples with large weights (the ones the previous classifier misclassified) and train on them again to obtain the second classifier, then repeat step two.

After T iterations we have T classifiers, each with its own weight. The weighted combination of these classifiers is the strong classifier.
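These steps map almost line for line onto code. Below is a minimal sketch of the training loop for binary labels in {-1, +1}, using a depth-1 decision tree as the weak learner; the function names and parameter values are my own illustrative choices, not part of the original experiment.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=10):
    """Train T weak classifiers on (X, y), labels in {-1, +1}."""
    y = np.asarray(y)
    N = len(y)
    w = np.full(N, 1.0 / N)              # step 1: every weight starts at 1/N
    classifiers, alphas = [], []
    for _ in range(T):
        # step 2: fit a weak classifier on the weighted training set
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # weighted error rate, clipped to avoid log(0) below
        eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        if eps >= 0.5:                   # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)   # this classifier's weight
        # step 3: raise the weights of misclassified samples, renormalize
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(X, classifiers, alphas):
    """The strong classifier: sign of the weighted vote of the weak ones."""
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(votes)
```

The multiplicative update w · exp(−α·y·pred) shrinks the weights of correctly classified samples and grows the weights of misclassified ones, which is exactly the "increase the weights of the wrongly classified samples" step described above.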


AdaBoost and SVM combined:
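As a first experiment, one way to try this combination is scikit-learn's AdaBoostClassifier with a deliberately weakened SVC as the base learner. This is only a sketch under my own assumptions: the synthetic data and all parameter values (gamma=0.01, n_estimators=20) are illustrative, not results from the spam dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for the extracted text feature vectors.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deliberately weakened SVM: a very smooth RBF kernel (small gamma,
# i.e. large sigma) keeps each round only modestly better than chance.
weak_svm = SVC(kernel="rbf", gamma=0.01, C=1.0)

# "SAMME" works from hard predictions, which a plain SVC provides
# (SVC has no predict_proba unless probability=True is set).
# Note: on scikit-learn < 1.2 the keyword is base_estimator, not estimator.
booster = AdaBoostClassifier(estimator=weak_svm, n_estimators=20,
                             algorithm="SAMME")
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))
```

Weakening the SVM matters here: AdaBoost needs base learners that are only slightly better than chance, and an already-strong SVM leaves it little to combine.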
