As the earlier pictures of SVM make clear, an SVM is fundamentally a two-class classifier: it only answers whether a sample belongs to the positive class or the negative class. Real-world problems, however, are often multi-class, such as text classification or digit recognition (with a few exceptions, like spam filtering, which only needs to decide "spam" or "not spam"). How to build a multi-class classifier out of two-class classifiers is therefore a question worth studying.

Taking text classification as the example again, there are several ready-made approaches. One is an "all at once" method: consider all the samples at the same time, solve a single optimization problem with multiple objective terms, and obtain all the separating surfaces in one shot, as the following picture shows:

[Figure: multiple hyperplanes partitioning the space into one region per class]

Multiple hyperplanes divide the space into multiple regions, each corresponding to one category. Given an article, you simply check which region it falls into to determine its class.

Looks elegant, right? Unfortunately this algorithm remains largely on paper, because solving everything at once requires far too much computation to be practical.

Taking a step back, we arrive at the so-called "one versus the rest" method, which solves one two-class problem at a time. Suppose we have 5 categories. The first time, the samples of category 1 are taken as positive samples and the samples of categories 2, 3, 4, and 5 are lumped together as negative samples; this yields a two-class classifier that can tell whether an article belongs to category 1 or not. The second time, the samples of category 2 become the positive samples and those of 1, 3, 4, and 5 the negative samples, giving another classifier, and so on, until we have five such two-class classifiers (always as many as there are classes). When an article needs to be classified, we take it to the classifiers one by one and ask: does it belong to you? Does it belong to you? Whichever classifier nods and says yes determines the article's category. The advantages are that each optimization problem is fairly small and classification is fast (only 5 classifiers need to be called to get the result). But two rather awkward situations can arise: we carry an article around, and either every classifier claims it as its own category, or every classifier says it is not its category. The former is called classification overlap, the latter the unclassifiable case. Classification overlap is easy enough to handle: picking any one of the answers is not too outrageous, or we can look at the distance from the article to each hyperplane and award it to the farthest one. The unclassifiable case is genuinely hard to deal with; we can only assign the article to a sixth, catch-all category... Worse still, while the number of samples in each original category may be similar, the "rest" class always contains several times as many samples as the positive class (since it is the union of all the other classes), which artificially creates exactly the "dataset skew" problem discussed in the previous section.
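To make the procedure concrete, here is a minimal sketch of one-versus-rest with linear SVMs, assuming scikit-learn's `LinearSVC`; the function names and the tie-breaking by signed distance to the hyperplane (via `decision_function`) follow the description above and are illustrative, not a definitive implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, classes):
    """Train one binary SVM per class: that class is positive, all the rest negative."""
    models = {}
    for c in classes:
        binary_labels = (y == c).astype(int)          # 1 for class c, 0 for "the rest"
        models[c] = LinearSVC().fit(X, binary_labels)
    return models

def predict_one_vs_rest(models, x):
    """Ask every classifier; resolve overlap by the largest signed distance to the hyperplane."""
    scores = {c: m.decision_function(x.reshape(1, -1))[0] for c, m in models.items()}
    return max(scores, key=scores.get)                # farthest on the positive side wins
```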

So we take another step back and still solve two-class problems, but this time the positive and negative sides each consist of a single class (a "one-on-one" match-up, so to speak), which avoids the skew. The procedure is to train a set of classifiers: the first one only answers "is it class 1 or class 2", the second only "is it class 1 or class 3", the third only "is it class 1 or class 4", and so on. You can immediately work out that there will be 5 × 4 / 2 = 10 such classifiers (in general, with k classes the number of two-class classifiers is k(k-1)/2). Although there are more classifiers, the total time spent in the training phase (i.e., computing the separating surfaces of all these classifiers) is much less than with the "one versus the rest" method. At classification time, we hand the article to all the classifiers: the first votes either "1" or "2", the second votes "1" or "3", and so on; each casts its vote, and the votes are tallied at the end. If category "1" receives the most votes, the article is judged to belong to the first class. Obviously this method still suffers from classification overlap, but the unclassifiable case cannot occur, since it is impossible for every category to receive 0 votes. Good enough? Not quite. Think about how many classifiers we must call to classify a single article: 10, and that is with only 5 categories. With 1000 categories, the number of classifiers to call rises to roughly 500,000 (on the order of the square of the number of categories). What then?
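Here is a minimal sketch of the pairwise voting scheme, again assuming scikit-learn's `LinearSVC` and using `itertools.combinations` to enumerate the k(k-1)/2 class pairs; the helper names are illustrative.

```python
from collections import Counter
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_one(X, y, classes):
    """Train one binary SVM per unordered pair of classes: k(k-1)/2 classifiers in total."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)                    # use only samples from these two classes
        models[(a, b)] = LinearSVC().fit(X[mask], (y[mask] == a).astype(int))
    return models

def predict_one_vs_one(models, x):
    """Every pairwise classifier votes for one of its two classes; the class with most votes wins."""
    votes = Counter()
    for (a, b), m in models.items():
        votes[a if m.predict(x.reshape(1, -1))[0] == 1 else b] += 1
    return votes.most_common(1)[0][0]
```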

It seems we must step back once more, this time on the classification side. We still train exactly as in the one-versus-one method, but before classifying an article we first organize the classifiers as shown in the figure below (as you can see, this structure is a directed acyclic graph, which is why the method is called DAG SVM).

[Figure: the pairwise classifiers arranged as a directed acyclic graph, with the "1 vs 5" classifier at the root]

When classifying, we first ask the root classifier "1 vs 5" (meaning it answers "is it class 1 or class 5"). If it answers 5, we go left and ask the "2 vs 5" classifier; if that one also says "5", we continue down to the left, and after a few more questions we obtain the classification result. What is the benefit? We actually call only 4 classifiers (with k categories, only k-1 calls are needed), so classification is fast, and there is neither classification overlap nor the unclassifiable case! What is the drawback? If the first classifier answers incorrectly (the article clearly belongs to class 1, but it says 5), then none of the later classifiers can ever correct the mistake (the label "1" simply no longer appears below that branch); in fact, every level further down suffers from this downward accumulation of errors.
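A minimal sketch of the DAG-style decision procedure, reusing the pairwise models trained above: the idea is a candidate list from which one class is eliminated at each step, so exactly k-1 classifiers are consulted. This elimination-list formulation is one common way to implement the DAG and is an assumption here, not necessarily the exact layout of the original figure.

```python
def predict_dag(models, x, classes):
    """Eliminate one candidate class per pairwise decision until one class remains (k-1 calls)."""
    candidates = list(classes)                        # e.g. [1, 2, 3, 4, 5]
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]          # the root step compares "1 vs 5", etc.
        key = (a, b) if (a, b) in models else (b, a)
        winner = key[0] if models[key].predict(x.reshape(1, -1))[0] == 1 else key[1]
        # the loser of this comparison can no longer be the final answer
        candidates.remove(a if winner == b else b)
    return candidates[0]
```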

But do not be scared off by the error accumulation in the DAG method; error accumulation exists in the one-versus-rest and one-versus-one methods as well. The advantage of the DAG method is that its accumulated error, however large or small, always has a known upper bound, and this has been proven theoretically. For one-versus-rest and one-versus-one, although the generalization error bound of each individual two-class classifier is known, nobody knows what the error bound is once they are combined for multi-class classification, which means the accuracy could in principle drop all the way to 0, a rather depressing prospect.

As for choosing the root node of the DAG (i.e., which classifier participates in the very first decision), there are also techniques that can improve the overall result. We always want the root to make as few mistakes as possible, so the two classes involved in the first decision should preferably be very different from each other, so different that misclassifying them is nearly impossible. Alternatively, we can always take the pairwise classifier with the highest accuracy as the root; or we can have each pairwise classifier output not only a class label but also something like a "confidence", and when it is not very confident in its own answer we follow not only its suggested branch but also the neighboring one, and so on.
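As an illustration of the accuracy-based heuristic, here is a small sketch that scores every pairwise classifier on a held-out validation set and picks the most accurate one as the root; the validation split and scoring are assumptions made for the example, not part of the original DAG SVM formulation.

```python
from sklearn.metrics import accuracy_score

def choose_root(models, X_val, y_val):
    """Pick the pairwise classifier with the highest validation accuracy as the DAG root."""
    best_pair, best_acc = None, -1.0
    for (a, b), m in models.items():
        mask = (y_val == a) | (y_val == b)            # evaluate only on the two relevant classes
        if not mask.any():
            continue
        acc = accuracy_score((y_val[mask] == a).astype(int), m.predict(X_val[mask]))
        if acc > best_acc:
            best_pair, best_acc = (a, b), acc
    return best_pair
```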

Tips: the computational complexity of SVM

Using an SVM actually involves two completely different processes, training and classification, so their complexities cannot be discussed as one. What we discuss here is mainly the complexity of the training phase, that is, the cost of solving the quadratic programming problem. Approaches to this problem fall broadly into two kinds: analytical solutions and numerical solutions.

An analytical solution is a theoretical solution: it takes the form of an expression, so it is exact. As long as a problem has a solution at all (why bother with a problem that has none, haha), an analytical solution exists. Of course, existing is one thing; being able to compute it, or to compute it within an acceptable time, is quite another. For SVM, the worst-case time complexity of obtaining the analytical solution can reach O(N_sv^3), where N_sv is the number of support vectors; although there is no fixed ratio, the number of support vectors grows with the size of the training set.

A numerical solution is a solution you can actually use: it is just a set of numbers, and it is usually approximate. The process of finding a numerical solution resembles exhaustive search: start from some value, check whether it satisfies a certain condition (called the stopping condition, i.e., once it is satisfied the solution is considered accurate enough and computation stops), and if not, try the next value. Of course, the next value is not chosen at random; there are rules to follow. Some algorithms try only one value at a time, others try several; the rules for picking the next value (or group of values) differ, the stopping conditions differ, and the accuracy of the final solution differs as well. Clearly, the complexity of finding a numerical solution cannot be discussed apart from the specific algorithm.

For one specific algorithm, the Bunch-Kaufman training algorithm, the typical time complexity lies between O(N_sv^3 + L·N_sv^2 + d·L·N_sv) and O(d·L^2), where N_sv is the number of support vectors, L is the number of training samples, and d is the dimensionality of each sample (the original dimensionality, before mapping into the high-dimensional space). The complexity varies because it depends not only on the size of the input problem (the number of samples and the dimensionality) but also on the final solution (that is, on the support vectors): if there are few support vectors the process is much faster, whereas if there are many, approaching the number of samples, the result is the very bad O(d·L^2), and an input of that scale is perfectly normal for text classification.
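To get a feel for the gap between the two regimes, here is a small illustrative calculation; the sample count and dimensionality below are made-up values chosen only to show orders of magnitude, not measurements from any real system.

```python
# Illustrative only: plug made-up values into the two complexity expressions above.
L = 10_000        # number of training samples (assumed)
d = 5_000         # original feature dimensionality (assumed)

for n_sv in (100, 1_000, 10_000):                     # few vs. many support vectors
    best_case = n_sv**3 + L * n_sv**2 + d * L * n_sv  # O(N_sv^3 + L*N_sv^2 + d*L*N_sv)
    worst_case = d * L**2                              # O(d*L^2)
    print(f"N_sv={n_sv:>6}: ~{best_case:.1e} ops vs worst case ~{worst_case:.1e} ops")
```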

Looking back now, it becomes clear why the one-versus-one method, despite requiring more two-class classifiers to be trained, can still take less total training time than the one-versus-rest method: the latter considers all the samples in every training run (only which part counts as positive or negative changes each time), which is naturally much slower.