Popular understanding of Platt scaling

1. Introduction

        I recently came across Platt scaling while reading a paper and did not understand the concept at all. That made me curious, so I read some introductory material, traced it back to the original source, and finally understood it; this blog post records what I learned. The term also has a Chinese translation, but I prefer to call it Platt scaling. Platt is a researcher in the field of machine learning, and Platt scaling is named after him.

        Platt scaling comes from the paper [2]. This paper has been highly influential: as of January 7, 2023, it had reached 7,337 citations on Google Scholar.

2. Platt scaling

        In machine learning, Platt scaling is a method for transforming the outputs of a classification model into a probability distribution over classes. The method was invented by John Platt in the context of support vector machines, replacing an earlier method by Vapnik, but it can be applied to other classification models as well. Platt scaling works by fitting a logistic regression model to the classifier's scores.

        Simply put, in some classification models, for an input sample x the output y given by the model is not a probability (in the range [0,1]) but a score or a distance value (whose range goes beyond [0,1]). Such a score or distance can be inconvenient to use and may hinder the application of the classification model in other tasks. So researchers came up with a clever trick: convert the model's output value (the score or distance) into a probability, and then use that probability in other tasks.
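        To see the problem concretely, here is a minimal sketch (my own illustration, not from the original paper, using scikit-learn's LinearSVC on synthetic data) showing that raw SVM decision scores are not probabilities:

```python
# Minimal sketch: SVM decision scores are arbitrary signed distances, not probabilities.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
svm = LinearSVC(max_iter=5000).fit(X, y)

# decision_function returns signed distances to the separating hyperplane,
# e.g. values like -2.3 or 1.7, which do not lie in [0, 1].
print(svm.decision_function(X[:5]))
```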

        The specific method is as follows:

        Suppose $x$ is a sample, with $x \in \mathbb{R}^d$. Let $f$ denote a binary classification model whose output for the input $x$ is $f(x)$, where the range of $f(x)$ is not restricted to the interval $[0,1]$. In this binary classification task, the label $y$ of $x$ takes the value $+1$ or $-1$.

        Platt scaling is a two-parameter optimization problem, and the optimization goal is

$$\min_{A,B}\; -\sum_i \left[\, t_i \log(p_i) + (1-t_i)\log(1-p_i) \,\right]$$

where $t_i = \frac{y_i + 1}{2}$ and $y_i$ is the label of the sample $x_i$; in addition, $p_i = \frac{1}{1 + e^{A f(x_i) + B}}$, and $A$ and $B$ are the parameters to be learned by Platt scaling (there are only two). In this optimization problem, the training set is $(f(x_i), t_i)$.

        Since the objective is a two-parameter minimization problem, any optimization algorithm can be used. The paper [2] uses a model-trust minimization algorithm [3]. I will not go into that specific optimization algorithm here; if the goal is only to understand Platt scaling, we may as well imagine optimizing this objective with gradient descent, as in the sketch below.
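        The following is a minimal NumPy sketch (my own illustration, not the optimization routine used in [2]) that fits the two parameters $A$ and $B$ by plain gradient descent on the objective above:

```python
import numpy as np

def fit_platt(scores, labels, lr=1e-3, n_iter=10000):
    """Fit the Platt scaling parameters A and B by gradient descent.

    scores: classifier outputs f(x_i) (arbitrary real values).
    labels: y_i in {-1, +1}.
    Returns (A, B) such that p_i = 1 / (1 + exp(A * f(x_i) + B)).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    t = (labels + 1.0) / 2.0                   # t_i = (y_i + 1) / 2
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        # Gradient of the negative log-likelihood:
        #   d/dA = sum_i (t_i - p_i) * f(x_i),   d/dB = sum_i (t_i - p_i)
        grad_A = np.sum((t - p) * scores)
        grad_B = np.sum(t - p)
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B
```

        For instance, with the SVM from the earlier sketch, `fit_platt(svm.decision_function(X), 2 * y - 1)` would return the two parameters of the calibration mapping.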

        In the end, given a sample $x$ and the model $f$, we obtain the output $f(x)$, then pass $f(x)$ through Platt scaling to obtain a probability value $p$, whose range is guaranteed to lie in the interval $[0,1]$. The scaling of $f(x)$ is thus achieved.
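        In practice you rarely need to implement this yourself. For example, scikit-learn exposes Platt scaling as the "sigmoid" method of CalibratedClassifierCV; a minimal sketch (reusing the synthetic data from above, my own example) might look like this:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# method="sigmoid" fits a logistic (Platt) mapping on the SVM scores via cross-validation.
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=5000), method="sigmoid", cv=3)
calibrated.fit(X, y)

print(calibrated.predict_proba(X[:5]))   # probabilities in [0, 1] for each class
```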

3. Application example

        I did not give any concrete applications above, so here is a practical one based on a problem I am currently facing. My field is knowledge graph embedding. For some knowledge graph embedding models the output for a sample is a norm distance (range [0,+∞)), for others it is a probability (range [0,1]), and for still others it is a real value (range (-∞,+∞)). I then came across a paper on ensemble learning [4], which combines various knowledge graph embedding models as individual learners. However, the ranges of the output values of these individual learners differ widely, so combining them directly would certainly give poor results. The authors therefore apply Platt scaling to convert the output values of all models into probabilities, and then use the arithmetic mean of those probabilities as the final predicted probability.
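        As a rough sketch of that idea (my own simplification of [4]; the model names, scores, and fitted parameters below are all hypothetical), calibration followed by averaging could look like this:

```python
import numpy as np

def platt_transform(scores, A, B):
    """Map raw model scores to probabilities using previously fitted A, B."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(scores, dtype=float) + B))

# Hypothetical raw outputs of three KGE models for the same three candidate triples:
# a norm distance, a probability, and an unbounded real-valued score.
model_scores = {
    "distance_model": np.array([0.4, 2.1, 5.3]),
    "prob_model":     np.array([0.9, 0.6, 0.1]),
    "real_valued":    np.array([3.2, -0.5, -4.0]),
}

# Hypothetical (A, B) fitted separately for each model on validation data,
# e.g. with fit_platt from the sketch above.
platt_params = {
    "distance_model": (1.5, -1.0),
    "prob_model":     (-4.0, 2.0),
    "real_valued":    (-0.8, 0.1),
}

# Calibrate each model, then take the arithmetic mean as the ensemble prediction.
calibrated = [platt_transform(model_scores[m], *platt_params[m]) for m in model_scores]
ensemble_prob = np.mean(calibrated, axis=0)
print(ensemble_prob)
```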

4. References

        [1] Platt scaling - Wikipedia.

        [2] Platt, John. "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods." Advances in large margin classifiers 10.3 (1999): 61-74.

        [3] Gill, Philip E., Walter Murray, and Margaret H. Wright. Practical optimization. Society for Industrial and Applied Mathematics, 2019.

        [4] Krompaß, Denis, and Volker Tresp. "Ensemble solutions for link-prediction in knowledge graphs." Proceedings of the 2nd Workshop on Linked Data for Knowledge Discovery, Porto, Portugal. 2015.
