[Machine Learning] Logistic Regression (Rearranging)

Logistic regression learning roadmap

Prerequisites: it is recommended to first learn the concepts of information content, entropy, KL divergence, and cross entropy on Bilibili.

Recommended Bilibili video: How does "cross entropy" work as a loss function? Covers "information content", "bit", "entropy", "KL divergence", and "cross entropy".

Information content refers to the amount of information carried by an event, usually expressed with a base-2 logarithm. For example, if an event occurs with probability 1/8, its information content is $-\log_2(1/8)=3$ bits, because three bits are needed to encode it.

Entropy is a measure of the uncertainty of a system or source, and can also be understood as the average information content. In information theory, greater entropy means the system or source is less predictable and therefore carries more information. For example, a sequence of coin tosses whose outcomes are unknown has many possibilities, so it has higher entropy than a pile of coins that are known to all come up heads.

KL divergence (Kullback-Leibler divergence), also known as relative entropy, measures the difference between two probability distributions. It is non-negative and equals 0 if and only if the two distributions are identical.

Cross entropy is another way to compare two probability distributions, and it is often used to evaluate the performance of classification models. Unlike KL divergence, cross entropy is not zero when the two distributions are equal; by Gibbs' inequality it is bounded below by the entropy of the true distribution, and it reaches that minimum if and only if the two distributions are equal.
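To make these four concepts concrete, here is a small NumPy sketch (the distributions p and q below are made-up examples, not from the recommended video):

import numpy as np

def information(prob):
	return -np.log2(prob)                 # information content of an event, in bits

def entropy(p):
	return -np.sum(p * np.log2(p))        # average information content of distribution p

def cross_entropy(p, q):
	return -np.sum(p * np.log2(q))        # H(p, q)

def kl_divergence(p, q):
	return np.sum(p * np.log2(p / q))     # D_KL(p || q) = H(p, q) - H(p)

p = np.array([0.5, 0.25, 0.125, 0.125])   # "true" distribution (made up)
q = np.array([0.25, 0.25, 0.25, 0.25])    # model distribution (made up)

print(information(1/8))                   # 3.0 bits, matching the example above
print(entropy(p))                         # 1.75 bits
print(cross_entropy(p, q))                # 2.0 bits, >= entropy(p) by Gibbs' inequality
print(np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q)))  # True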

Knowledge map for this section

Functional Model of Logistic Regression

 Logistic regression is a classification model

 It can be used to predict whether something will happen. Classification problems are among the most common problems in everyday life:

  • In life: such as predicting whether the Shanghai Composite Index will rise tomorrow, whether it will rain in a certain area tomorrow, whether the watermelon is ripe

  • Financial field: Whether a certain transaction is suspected of violating regulations, whether a certain company is currently violating regulations, and whether there will be violations in the future

  • Internet: Will users buy something, will they click something

 For known outcomes, the answers to the questions above can only be 0 or 1.

 Let's take the following binary classification as an example. For a given data set, there is a straight line that can divide the entire data set into two parts:

 At this point the decision boundary is $w_1x_1+w_2x_2+b=0$, and we can simply set a sample with $h(x)=w_1x_1+w_2x_2+b>0$ to 1 and otherwise to 0. But this is really just the decision process of a perceptron.
 On this basis, logistic regression adds one more layer: it finds the relationship between the classification probability and the input variables, and decides the category through that probability.
Recall the linear regression model: $h(x)=w^Tx+b$. Adding a function $g$ on top of the linear model gives $h(x)=g(w^Tx+b)$. This function $g$ is the sigmoid function, also called the logistic function: $g(z)=\frac{1}{1+e^{-z}}$

It converts the result of the linear regression into a probability value. At this point $h(x)$ represents the probability of the event happening, which we can also write as $p(Y=1|x)$.
You can look at the graph of the sigmoid function (an S-shaped curve rising from 0 to 1):

Summary: in this way we obtain the expression of the logistic model $h(x)$: $h(x)=\frac{1}{1+e^{-(w^Tx+b)}}$
Note: $h(x_i)$ denotes the probability that the label of sample $x_i$ is 1.
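As a quick sketch (my own, not from the original post), the sigmoid and the resulting probability model can be written directly in NumPy:

import numpy as np

def sigmoid(z):
	return 1 / (1 + np.exp(-z))          # S-shaped, maps any real number into (0, 1)

def h(x, w, b):
	return sigmoid(np.dot(w, x) + b)     # probability that the label of x is 1, i.e. p(Y=1|x)

# made-up weights and sample, just to show the call
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([0.3, 0.1])
print(h(x, w, b))                        # a value strictly between 0 and 1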

Loss function, loss minimization architecture

log loss as loss function

Log loss: $l(y,\hat{y})=-\sum_{i=1}^m y_i\log_2\hat{y}_i$

When $l$ attains its minimum value, the model distribution $\hat{y}$ is closest to the true (theoretical) distribution of the $y_i$.

Loss Minimization Architecture

Because this is a binary classification problem, the probability that sample $x_i$ has label 1 is $\hat{y}_i=h(x_i)$ and the probability that it has label 0 is $1-h(x_i)$.

Feeding these into the log loss function $l$, the loss minimization architecture for binary classification is:

$\min\; -\sum_i\,[\,y_i\log(h(x_i))+(1-y_i)\log(1-h(x_i))\,]$
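A minimal sketch of this loss as code (vectorised over the m samples; the small eps clipping is my own addition, only to avoid log(0)):

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
	# y: true labels in {0, 1}; y_hat: predicted probabilities h(x_i)
	y_hat = np.clip(y_hat, eps, 1 - eps)
	return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y, y_hat))    # the quantity being minimised over w, b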

Why doesn't the logistic regression loss function use the least squares method? Answer


This is the cross entropy:
$-\sum_i\,[\,y_i\log(h(x_i))+(1-y_i)\log(1-h(x_i))\,]$


From the perspective of probability theory and statistics, the loss minimization architecture (cross entropy):
 In statistics, suppose we already have a set of samples (X, Y) and want to estimate the parameters that could have generated these samples. We usually use maximum likelihood estimation (a common method of parameter estimation). To apply maximum likelihood estimation we also need a distributional assumption; here we assume that $Y$ follows a Bernoulli distribution:
$P(Y=1|x)=p(x)$, $P(Y=0|x)=1-p(x)$
Since $Y$ follows a Bernoulli distribution, we can easily write down the likelihood function:
$L=\prod_i p(x_i)^{y_i}[1-p(x_i)]^{1-y_i}$
To solve it, we take the logarithm of both sides:
$\log L = \sum_i\,[\,y_i\log(p(x_i))+(1-y_i)\log(1-p(x_i))\,]$

Maximum likelihood estimation actually looks for the probability model under which the observed data is most probable. Put another way, given the samples, the larger the value of $\prod_i p(h(x_i)|\theta)$ computed by the candidate probability model $h(x)$ with parameters $\theta$, the closer $h(x)$ is to the theoretical probability model.

We generally prefer to take a minimum, so the formula is converted to
$\min(-\log L)=\min\;-\sum_i\,[\,y_i\log(p(x_i))+(1-y_i)\log(1-p(x_i))\,]$

Here $p(x_i)$ corresponds to $h(x_i)$.
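As a small numerical check of this equivalence (toy numbers, my own), the negative log-likelihood equals the binary cross-entropy sum above:

import numpy as np

y = np.array([1, 0, 1, 1])
p = np.array([0.8, 0.3, 0.6, 0.9])                       # p(x_i), i.e. h(x_i)

L = np.prod(p**y * (1 - p)**(1 - y))                     # likelihood L
neg_log_L = -np.log(L)                                   # -log L
ce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))    # binary cross entropy

print(np.isclose(neg_log_L, ce))                         # True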

Looking at the loss minimization architecture from the perspective of information content and entropy:

KL divergence: when $D_{KL}=0$, the P model equals the Q model. What we pursue is for the Q model that we build (that is, $h(x)$) to be close to the real P model (here, the theoretical model behind the samples, with $p_i$ denoting the sample distribution).

According to Gibbs' inequality, the cross entropy is $\geq$ the entropy of the P system, so when we minimize the cross entropy, the Q model gets as close as possible to the real theoretical P model. We also know that the definition of information content is $f:=-\log_2 x$, so substituting it in, the cross entropy is:
$-\sum_{i=1}^m p_i\log_2 q_i$

This is where the log loss comes from: $l(y,\hat{y})=-\sum_{i=1}^m y_i\log_2\hat{y}_i$. When I watched the teacher's class, he simply threw the log loss at us and then went straight on to cross entropy, without explaining why log loss can serve as a loss function. With the background above, readers should now understand where this log loss comes from. Keep reading and you will gain more.

Recommended video: How does "cross entropy" work as a loss function? Covers "information content", "bit", "entropy", "KL divergence", and "cross entropy".

Because this is a binary classification problem, the true distribution of sample $x_i$ is $(y_i,\,1-y_i)$ and the model (Q) distribution is $(p(x_i),\,1-p(x_i))$.

So the cross entropy for binary classification is:

$-\sum_i\,[\,y_i\log(p(x_i))+(1-y_i)\log(1-p(x_i))\,]$

Here $p(x_i)$ corresponds to $h(x_i)$.

Thinking from these two perspectives:

  • The formulas derived from the two perspectives are identical; can they both be called cross entropy?

    • No, they merely happen to be the same formula. When deriving the maximum likelihood estimate, the log appears because it is customary to turn products into sums; its base can be any positive number and it carries no unit. In the definition of information content, $\log_2$ is part of the definition and it does have a unit, the bit. (A short numerical check of this remark follows below.)
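A short numerical check of the remark about the base (toy numbers, my own): losses in base 2 and base e differ only by the constant factor log 2, so minimising one minimises the other.

import numpy as np

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.7])

loss_e = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))     # natural log (MLE convention)
loss_2 = -np.sum(y * np.log2(p) + (1 - y) * np.log2(1 - p))   # base-2 log (bits)
print(np.isclose(loss_2, loss_e / np.log(2)))                 # True: same minimiser, different scale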

classification function

Maximum Probability Classification Function

In a $k$-class classification problem, given the prediction probability model $h$, i.e.
$h(x)=(h_1(x),h_2(x),...,h_k(x))$
where $h_i(x)$ is the probability that the sample belongs to the $i$-th class, the maximum probability classification function for model $h$ is:
$MaxProb_h(x)=\arg\max_i h_i(x)$

Note: argmax takes the index (subscript) of the maximum value.
$h_i(x)$ can be understood as the probability that the sample's label is $i$; the $i$ with the largest $h_i(x)$ is taken as the label of the sample.
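A minimal sketch with NumPy's argmax (the probabilities below are made up):

import numpy as np

def max_prob_classify(h_x):
	return np.argmax(h_x)             # index of the largest class probability

h_x = np.array([0.1, 0.7, 0.2])       # made-up (h_1(x), h_2(x), h_3(x)) for k = 3 classes
print(max_prob_classify(h_x))         # 1, i.e. the second class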

threshold classification function

In a binary classification problem, let the labels take values in {0, 1}.
Given a prediction probability model $h$, i.e. $h(x)$ represents the probability that the label of feature vector $x$ is 1, the threshold classification function with threshold $t$ for model $h$ is:

$Thresh_{h,t}(x)=\begin{cases}1, & h(x)\geq t\\ 0, & h(x)<t\end{cases}$

That is, when the predicted probability that the label is 1 is at least the threshold $t$, the predicted label is 1; otherwise it is 0.
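A minimal sketch, with the threshold t as a parameter (0.5 is a common default, but the value is a modelling choice):

import numpy as np

def threshold_classify(h_x, t=0.5):
	return (np.asarray(h_x) >= t).astype(int)   # 1 where h(x) >= t, else 0

print(threshold_classify([0.3, 0.9, 0.5], t=0.5))   # [0 1 1]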

Optimization Algorithm for Logistic Regression

gradient descent

For the theory, see another article of mine: Search Algorithm - Study Notes

# machine_learning.logistic_regression.lib.logistic_regression_gd
import numpy as np

def sigmoid(scores):
	return 1 / (1 + np.exp(-scores))

class LogisticRegression:
	def fit(self, X, y, eta=0.1, N=1000):  # eta: learning rate η; N: number of iterations
		m, n = X.shape
		w = np.zeros((n, 1))
		for t in range(N):
			h = sigmoid(X.dot(w))           # h_w(X): predicted probabilities
			g = 1.0 / m * X.T.dot(h - y)    # gradient g of the cross-entropy loss
			w = w - eta * g                 # update w
		self.w = w

	def predict_proba(self, X):
		return sigmoid(X.dot(self.w))       # predicted probabilities

	def predict(self, X):
		proba = self.predict_proba(X)       # predicted probabilities
		return (proba >= 0.5).astype(int)   # threshold at 0.5
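A usage sketch on made-up data (my own, not from the original post). Note that the class above does not add the bias term b itself, so a column of ones is appended to X, and y is passed as an m×1 column vector.

import numpy as np

np.random.seed(0)
X_raw = np.random.randn(100, 2)
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(int).reshape(-1, 1)   # made-up separable labels
X = np.hstack([X_raw, np.ones((100, 1))])                        # last weight plays the role of b

clf = LogisticRegression()
clf.fit(X, y, eta=0.1, N=1000)
print((clf.predict(X) == y).mean())                              # training accuracy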

Is the cross-entropy loss a convex function of w? Yes; this can be shown by computing its second derivative.
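A sketch of why the answer is yes (standard derivation, not spelled out in the original post): with $h(x_i)=\frac{1}{1+e^{-w^Tx_i}}$, the gradient and Hessian of the loss $l(w)=-\sum_i[y_i\log h(x_i)+(1-y_i)\log(1-h(x_i))]$ are

$\nabla l(w)=\sum_i\bigl(h(x_i)-y_i\bigr)x_i, \qquad \nabla^2 l(w)=\sum_i h(x_i)\bigl(1-h(x_i)\bigr)x_ix_i^T \succeq 0,$

and since the Hessian is positive semi-definite, the loss is convex in $w$, so gradient descent can reach a global minimum.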

stochastic gradient descent

For the theory, see another article of mine: Search Algorithm - Study Notes

import numpy as np

def sigmoid(scores):
	return 1.0 / (1 + np.exp(-scores))

class LogisticRegression:
	def fit(self, X, y, eta_0=10, eta_1=50, N=1000):
		m, n = X.shape
		w = np.zeros((n, 1))
		self.w = w                              # accumulator for averaging the iterates
		for t in range(N):
			i = np.random.randint(m)            # pick a random sample x(i)
			x = X[i].reshape(1, -1)             # 1*n row vector
			pred = sigmoid(x.dot(w))            # h_w(x)
			g = x.T * (pred - y[i])             # stochastic gradient
			w = w - eta_0 / (t + eta_1) * g     # update w with a decaying step size
			self.w += w                         # accumulate the iterate
		self.w /= N                             # average of the iterates

	def predict_proba(self, X):
		return sigmoid(X.dot(self.w))

	def predict(self, X):
		proba = self.predict_proba(X)
		return (proba >= 0.5).astype(int)

Mini-batch gradient descent
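This subsection is empty in the source; below is a minimal sketch of what a mini-batch version might look like, following the same interface as the classes above (the batch_size parameter and its default are my own choices):

import numpy as np

def sigmoid(scores):
	return 1.0 / (1 + np.exp(-scores))

class LogisticRegressionMBGD:
	def fit(self, X, y, eta=0.1, N=1000, batch_size=32):
		m, n = X.shape
		w = np.zeros((n, 1))
		for t in range(N):
			idx = np.random.choice(m, size=min(batch_size, m), replace=False)  # sample a mini-batch
			Xb, yb = X[idx], y[idx]
			h = sigmoid(Xb.dot(w))                    # predictions on the batch
			g = 1.0 / len(idx) * Xb.T.dot(h - yb)     # batch gradient
			w = w - eta * g                           # update w
		self.w = w

	def predict_proba(self, X):
		return sigmoid(X.dot(self.w))

	def predict(self, X):
		return (self.predict_proba(X) >= 0.5).astype(int)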

coordinate descent
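This subsection is also empty in the source; one possible sketch is cyclic coordinate descent, updating one weight at a time with its partial derivative (this particular variant is my own illustration, not necessarily what the author intended):

import numpy as np

def sigmoid(scores):
	return 1.0 / (1 + np.exp(-scores))

class LogisticRegressionCD:
	def fit(self, X, y, eta=0.1, N=100):
		m, n = X.shape
		w = np.zeros((n, 1))
		for t in range(N):
			for j in range(n):                              # cycle through the coordinates
				h = sigmoid(X.dot(w))                       # current predictions
				g_j = X[:, j].dot((h - y).ravel()) / m      # partial derivative w.r.t. w_j
				w[j, 0] -= eta * g_j                        # update only coordinate j
		self.w = w

	def predict_proba(self, X):
		return sigmoid(X.dot(self.w))

	def predict(self, X):
		return (self.predict_proba(X) >= 0.5).astype(int)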

Origin blog.csdn.net/qq_25218219/article/details/129894609