The SVM algorithm for structured mapping: an interpretation of the core ideas

Recently I have been planning a project related to NER and anaphora resolution, and my classmates and I have been reading related papers. Today a classmate found a very interesting article that uses SVM to do NER. The idea struck me as odd: how does an SVM process text and sequences? So I took a look and found that the article he had found was essentially a shell: its core algorithm cites a paper published many years ago on how to use SVM for structured output mapping, namely

Tsochantaridis I., Joachims T., Hofmann T., Altun Y. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research, 2005, 6: 1453-1484.

This paper has more than 3,000 citations, so it should be quite famous, but I had never heard of it (embarrassing). So I spent some time reading it. The paper is fairly long and heavy on formulas; I have distilled the core issues we care about and share them here.

How to handle text sequence input with SVM

When I first heard my classmates talk about using SVM, I assumed it was a straightforward "word2vec features plus SVM" article. After reading it, I realized it is mainly a theoretical analysis. The original paper uses the seq2seq problem as an example and analyzes it step by step:

For a sequence \{x_1, x_2, \dots\}, the simplest method is to predict from a single element directly (without any encoding), but that is obviously a poor method. The next natural step is to extract window features, i.e., to encode x_i together with a series of nearby elements. How exactly to encode depends on the problem (since the algorithm targets a whole class of problems, here we only discuss the general approach; later, following Section 4.5 of the paper, we will look at how to do it specifically for the natural language parsing problem).
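As a concrete illustration (my own minimal sketch, not code from the paper), here is one way such a sliding-window encoding could look, using one-hot token features and a hypothetical vocab mapping:

```python
import numpy as np

def window_features(tokens, i, vocab, radius=2):
    """Encode position i by one-hot blocks for the tokens in a window around it.
    `vocab` maps token -> index; positions outside the sequence give an all-zero block."""
    blocks = []
    for j in range(i - radius, i + radius + 1):
        block = np.zeros(len(vocab))
        if 0 <= j < len(tokens) and tokens[j] in vocab:
            block[vocab[tokens[j]]] = 1.0
        blocks.append(block)
    return np.concatenate(blocks)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = {w: k for k, w in enumerate(sorted(set(tokens)))}
x_2 = window_features(tokens, 2, vocab)   # encoding for "sat", with its +/-2 neighbours as context
```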

How to infer a sequence with SVM

The SVM we usually use has no internal state, so it cannot do traditional seq2seq on its own. The article first considers each position in isolation: the SVM is responsible for computing y_i from x_i (here x_i refers to the encoding formed by the sliding window around position i). Once x_i has been encoded, this becomes an ordinary mapping problem in feature space and can be solved with a traditional SVM. However, each position of the sequence can take many values: in part-of-speech tagging there are many tags, and for problems that reorganize the input elements (such as converting a sequence into a tree structure) the number of candidate outputs at each position is even larger. With so many classes, a traditional multiclass SVM may fail to converge. The main contribution of the paper lies here: it proposes a cutting plane method.
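To get a sense of the scale (my own back-of-the-envelope numbers, not from the paper): if each of the T positions can take one of K labels, there are K^T complete output sequences; with K = 45 part-of-speech tags and a 10-word sentence that is already 45^{10} \approx 3.4 \times 10^{16} candidates, so anything that treats each complete output as its own class, or enumerates all constraints up front, is hopeless.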

The method is formulated as an SVM optimization with four different loss variants (slack re-scaling and margin re-scaling, each with a linear or quadratic slack penalty):

The algorithm works by iteratively adding constraints (defined via a cost metric on outputs) to an optimization problem over the joint feature map \psi. Because the number of possible outputs is huge, an enormous number of constraints is generated, and the cutting plane method is used to solve the problem when the constraint set is that large. Simply put, when it is impossible to directly construct a solution that satisfies all constraints at once, you first construct a solution that satisfies a subset of the constraints, then repeatedly add the most violated remaining constraints and re-solve, until the solution satisfies all constraints to within a chosen precision.
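To make the loop structure concrete, here is a minimal, self-contained sketch of the working-set idea for a plain multiclass case (margin re-scaling with a 0/1 cost). It is my own illustration, not the paper's Algorithm 1: in particular, the exact QP re-solve over the working set is replaced by a few subgradient steps, and joint_feature is a hypothetical feature map.

```python
import numpy as np

def joint_feature(x, y, n_labels):
    """Hypothetical joint feature map psi(x, y): copy x into the block owned by label y."""
    phi = np.zeros(x.size * n_labels)
    phi[y * x.size:(y + 1) * x.size] = x
    return phi

def cost(y_true, y):
    """0/1 cost Delta(y_true, y); the paper allows arbitrary structured losses here."""
    return float(y != y_true)

def most_violated(w, x, y_true, n_labels):
    """Separation oracle (margin re-scaling): argmax_y  Delta(y_true, y) + <w, psi(x, y)>."""
    scores = [cost(y_true, y) + w @ joint_feature(x, y, n_labels) for y in range(n_labels)]
    return int(np.argmax(scores))

def fit(X, Y, n_labels, C=1.0, eps=1e-3, lr=0.01, outer=50, inner=200):
    w = np.zeros(X.shape[1] * n_labels)
    work = [set() for _ in range(len(X))]              # working set of constraints per example
    for _ in range(outer):
        added = False
        for i, (x, y) in enumerate(zip(X, Y)):
            y_hat = most_violated(w, x, y, n_labels)
            dpsi = joint_feature(x, y, n_labels) - joint_feature(x, y_hat, n_labels)
            # slack currently implied by the constraints already in this example's working set
            xi = max([cost(y, yb) - w @ (joint_feature(x, y, n_labels)
                                         - joint_feature(x, yb, n_labels))
                      for yb in work[i]] + [0.0])
            if cost(y, y_hat) - w @ dpsi > xi + eps:   # violated by more than eps -> keep it
                work[i].add(y_hat)
                added = True
        if not added:                                  # no new constraints: done
            break
        # Stand-in for the paper's QP re-solve: subgradient steps on the hinge losses
        # of the constraints currently in the working sets.
        for _ in range(inner):
            grad = w.copy()                            # gradient of (1/2)||w||^2
            for i, (x, y) in enumerate(zip(X, Y)):
                if not work[i]:
                    continue
                yb = max(work[i], key=lambda c: cost(y, c) + w @ joint_feature(x, c, n_labels))
                dpsi = joint_feature(x, y, n_labels) - joint_feature(x, yb, n_labels)
                if cost(y, yb) - w @ dpsi > 0:         # hinge is active
                    grad -= C * dpsi
            w -= lr * grad
    return w
```

The essential point is the shape of the loop: a separation oracle proposes the most violated constraint for each example, only those constraints are ever materialized, and the paper's analysis bounds the number of constraints that need to be added to reach precision eps, independently of how large the output space is.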

Note: I am not sure about this part. In the earlier problem analysis, the author gives the example that "with n possibilities at each position of the sequence, the number of classes expands exponentially, and this algorithm can handle such a huge number of classes", which seems to imply that the algorithm outputs the sequence directly. However, the algorithm introduced in Chapter 3 contains nothing sequence-specific, and the feature-map form of its input and output is no different from that of a traditional SVM. Chapter 4 introduces several concrete tasks, including ones where the algorithm is used for non-structural mapping problems, and it would clearly be unwise to use a sequence model for non-sequential problems. So I think the earlier hint that "the algorithm directly outputs the sequence" is misleading: this method does not process the sequence directly, and the actual handling of sequences requires other techniques (see below).

How to use sequence information

For seq2seq problems, the improved SVM above can infer a y_i for each x_i. But what we want to predict is a whole sequence Y. It might seem enough to simply infer y_i for each x_i, n times in total, but doing so uses no sequence information beyond the input encoding: each inferred y_i is isolated from the others. What we want is a globally optimal sequence. The article therefore runs dynamic programming over the candidate inference results at each position:

The method has one prerequisite: there must be an indicator that can measure the quality of the final sequence Y (such as the F1 score for syntax trees mentioned in the paper). Under such a global indicator, each inferred y_i affects which subsequent y_{i+1} is optimal (for example, after "eat" the next word is more likely to be "full" than "hungry"). This means we cannot mechanically take the SVM's best result at every position, because the local optimum does not equal the global optimum. For this kind of sequence optimization problem, dynamic programming can obviously be used to optimize the external indicator so that the sequence as a whole is optimal. This is the same idea as using a conventional language model plus CRF to solve natural-language seq2seq problems.
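As an illustration of that dynamic program (my own sketch of the standard Viterbi recursion, not code from the paper), suppose the per-position SVM gives an emission score for every candidate label and some transition score models how the choice of y_i influences y_{i+1}:

```python
import numpy as np

def viterbi(emission, transition):
    """emission:   (T, K) per-position scores, e.g. the SVM's score for each candidate label y_i.
    transition: (K, K) scores capturing how the choice of y_i affects the best y_{i+1}.
    Returns the label sequence maximizing the total emission + transition score."""
    T, K = emission.shape
    score = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition + emission[t][None, :]   # (K_prev, K_cur)
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy usage: 6 positions, 3 candidate labels per position
labels = viterbi(np.random.randn(6, 3), np.random.randn(3, 3))
```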

Other weird stuff

In addition, this article also has an interesting modeling idea:

This part points out that the elements of a structure share common characteristics. For example, for the partial order a<b<c, the mapping from a to b and the mapping from b to c must share some features. You can therefore model these shared features of the elements once and learn different parameters for different elements (an example I imagined myself: predicting the next element of a partially ordered sequence from the previous element, where different input elements use different internal parameters for inference). This idea guides the concrete modeling in the second part of the paper.
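To make my imagined example concrete (entirely my own illustration, not from the paper): one shared feature description of the previous element, with a separate parameter block for each candidate next element, so the features are shared while the parameters are element-specific.

```python
import numpy as np

def shared_features(prev):
    # One shared, hand-crafted description of the previous element (hypothetical features).
    return np.array([1.0, float(prev), float(prev) ** 2])

def score_next(w, prev, nxt, dim=3):
    # Each candidate next element owns its own parameter block, but all blocks
    # read the same shared features of the previous element.
    block = w[nxt * dim:(nxt + 1) * dim]
    return float(block @ shared_features(prev))

# pick the next element with the highest score for a given previous element
w = np.random.randn(4 * 3)                      # 4 candidate elements, 3 shared features
best = max(range(4), key=lambda nxt: score_next(w, prev=2, nxt=nxt))
```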
