Logistic regression study notes
 Logistic regression learning roadmap

 Prerequisites: it is recommended to first learn the concepts of information content, entropy, KL divergence, and cross-entropy (e.g. on Bilibili).
 Functional Model of Logistic Regression
 Loss function, loss minimization architecture
 classification function
 Optimization Algorithm for Logistic Regression
Logistic regression learning roadmap
Prerequisites: it is recommended to first learn the concepts of information content, entropy, KL divergence, and cross-entropy (e.g. on Bilibili).
Recommended Bilibili video: How does "cross-entropy" work as a loss function? A one-stop explanation of "information content", "bit", "entropy", "KL divergence", and "cross-entropy"
Information content refers to the amount of information carried by an event, usually measured with a base-2 logarithm. For example, if an event occurs with probability 1/8, its information content is $-\log_2(1/8)=3$, because three bits are needed to encode it.
Entropy is a measure of the uncertainty of a system or source, and can also be understood as the average information content. In information theory, greater entropy means the system or source is less predictable and therefore carries more information. For example, a pile of coins whose faces are unknown has many possible configurations, so it has higher entropy than a pile of coins known to all show heads.
KL divergence (Kullback-Leibler divergence), also known as relative entropy, measures the difference between two probability distributions. It is non-negative and equals 0 if and only if the two distributions are identical.
Cross-entropy is another way to compare two probability distributions, often used to evaluate the performance of classification models. Unlike KL divergence, it is not zero when the distributions are equal: by Gibbs' inequality it is bounded below by the entropy of the true distribution, and attains that minimum if and only if the two distributions are equal.
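The four concepts above can be checked numerically. The following is a minimal sketch (function names and the example distributions `p`, `q` are made up for illustration) that verifies the identity $H(P,Q)=H(P)+D_{KL}(P\|Q)$ and Gibbs' inequality:

```python
import numpy as np

def entropy(p):
    """Entropy H(P) in bits: the average information content of P."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) in bits; 0 iff P == Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits; equals H(P) + D_KL(P || Q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])   # "true" distribution P
q = np.array([0.25, 0.25, 0.5])   # model distribution Q

print(entropy(p))            # 1.5 bits
print(kl_divergence(p, p))   # 0.0 -- identical distributions
print(cross_entropy(p, q))   # 1.75 = H(P) + D_KL(P || Q)
```

Note that cross-entropy with `q = p` gives back the entropy of `p`, which is exactly the minimum Gibbs' inequality allows.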
Knowledge map for this section
Functional Model of Logistic Regression
Logistic regression is a classification model
It can be used to predict whether something will happen or not. Classification problems are the most common problems in life:

In daily life: e.g. whether the Shanghai Composite Index will rise tomorrow, whether it will rain in a certain area tomorrow, whether a watermelon is ripe

In finance: whether a transaction is suspected of violating regulations, whether a company is currently in violation, and whether it will violate regulations in the future

On the Internet: whether a user will buy something, or click on something
For known outcomes, the answer to each question above takes only two values: 0 or 1.
Let's take the following binary classification as an example. For a given data set, there is a straight line that can divide the entire data set into two parts:
At this point the decision boundary is $w_{1}x_{1}+w_{2}x_{2}+b=0$, and we can simply assign label 1 to any sample with $h(x)=w_{1}x_{1}+w_{2}x_{2}+b>0$ and label 0 otherwise. But this is really the decision process of a perceptron.
On this basis, logistic regression adds one more layer: it finds the relationship between the classification probability and the input variables, and decides the category through that probability.
Recall the linear regression model $h(x)=w^{T}x+b$. On top of the linear model, add a function $g$, i.e. $h(x)=g(w^{T}x+b)$. This function is the sigmoid function, also called the logistic function: $g(z)=\dfrac{1}{1+e^{-z}}$
It converts the output of the linear regression into a probability value. Now $h(x)$ represents the probability that the event happens, which we can also write as $p(Y=1\mid x)$.
You can look at the image of the sigmoid function:
Summary: in this way we obtain the expression of the logistic model: $h(x)=\dfrac{1}{1+e^{-(w^{T}x+b)}}$
Note: $h(x_{i})$ denotes the probability that the label of sample $x_{i}$ is 1.
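The model above can be sketched in a few lines. This is a minimal illustration (the weights `w`, bias `b`, and input `x` are made-up numbers, not fitted values):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}); maps all of R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# h(x) = g(w^T x + b): the probability that the label of x is 1
w = np.array([1.0, -2.0])   # example weights, for illustration only
b = 0.5
x = np.array([2.0, 1.0])

score = w.dot(x) + b        # the linear part, as in the perceptron
prob = sigmoid(score)       # squashed into a probability

print(sigmoid(0.0))         # 0.5 -- on the decision boundary w^T x + b = 0
print(prob)                 # sigmoid(0.5), about 0.62
```

Note how a point exactly on the decision boundary ($w^{T}x+b=0$) gets probability 0.5, matching the threshold used later for classification.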
Loss function, loss minimization architecture
log loss as loss function
Log loss: $l(y,\hat{y})=-\sum_{i=1}^{m} y_{i}\log_{2}\hat{y}_{i}$
When $l$ attains its minimum, the distribution of the model $\hat{y}$ is closest to the distribution of the true (theoretical) model $y$.
Loss Minimization Architecture
Because this is a binary classification problem, substituting $h(x_{i})$ into the log loss function $l$,
the loss minimization architecture for binary classification is:
$\min\;-\sum\left[y_{i}\log(h(x_{i}))+(1-y_{i})\log(1-h(x_{i}))\right]$
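The objective above can be evaluated directly. A minimal sketch (the helper name and the example predictions are made up; it averages over samples rather than summing, a common convention that does not change the minimizer):

```python
import numpy as np

def binary_cross_entropy(y, h):
    """Mean of -[y log h + (1-y) log(1-h)] over the samples."""
    y, h = np.asarray(y, dtype=float), np.asarray(h, dtype=float)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y_true = np.array([1, 0, 1, 1])
good = np.array([0.9, 0.1, 0.8, 0.7])   # confident, mostly correct model
bad = np.array([0.4, 0.6, 0.5, 0.3])    # poorly calibrated model

print(binary_cross_entropy(y_true, good))  # small loss
print(binary_cross_entropy(y_true, bad))   # noticeably larger loss
```

A better-calibrated $h(x)$ gives a strictly smaller loss, which is what minimizing this objective exploits.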
Why doesn't logistic regression use the least squares loss? Answer
This is exactly the cross-entropy:
$-\sum\left[y_{i}\log(h(x_{i}))+(1-y_{i})\log(1-h(x_{i}))\right]$
From the perspective of probability theory and statistics, loss minimization architecture (cross entropy):
In statistics, suppose we already have a set of samples $(X, Y)$ and want to estimate the parameters that could have generated them. Usually we use maximum likelihood estimation (a common method of parameter estimation). Maximum likelihood estimation also requires a distributional assumption; here we assume $Y$ follows a Bernoulli distribution: $P(Y=1\mid x)=p(x)$, $P(Y=0\mid x)=1-p(x)$. Since $Y$ follows a Bernoulli distribution, we easily obtain the likelihood function $L=\prod\left[p(x_{i})\right]^{y_{i}}\left[1-p(x_{i})\right]^{1-y_{i}}$. To solve it, we take the logarithm of both sides:
$\log L=\sum\left[y_{i}\log(p(x_{i}))+(1-y_{i})\log(1-p(x_{i}))\right]$
Maximum likelihood estimation actually picks the probability model most likely to have produced the data. Put another way: in $\prod p(h(x_{i})\mid\theta)$, the larger the value the existing probability model $h(x)$ assigns to the existing samples under the parameter $\theta$, the closer $h(x)$ is to the theoretical probability model.
We generally prefer minimization problems, so the original formula is converted to
$\min(-\log L)=\min\;-\sum\left[y_{i}\log(p(x_{i}))+(1-y_{i})\log(1-p(x_{i}))\right]$
Here $p(x_{i})$ corresponds to $h(x_{i})$.
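The equivalence between maximizing the likelihood and minimizing the loss above can be checked numerically. A small sketch with made-up labels and predicted probabilities (natural log, since the base does not affect the minimizer):

```python
import numpy as np

y = np.array([1, 0, 1])
p = np.array([0.8, 0.3, 0.6])   # p(x_i) = predicted P(Y = 1 | x_i)

# Likelihood of the sample under the Bernoulli model
L = np.prod(p**y * (1 - p)**(1 - y))

# Negative log-likelihood, i.e. the binary cross-entropy loss (summed)
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(np.log(L))   # equals -nll: maximizing L == minimizing the loss
print(nll)
```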
Looking at the loss minimization architecture from the perspective of information content and entropy:
KL divergence: when $D_{KL}=0$, the P model equals the Q model. What we pursue is that the Q model we build (that is, $h(x)$) gets close to the real P model (here, the theoretical model of the sample, with $p_{i}$ for sample $i$).
According to Gibbs' inequality, the cross-entropy $\ge$ the entropy of the P system, so when we minimize the cross-entropy, the Q model gets closer to the real, theoretical P model. We also know the definition of information content is $f:=-\log_{2}x$, so substituting it into the formula, the cross-entropy is: $-\sum_{i=1}^{m} p_{i}\log_{2}q_{i}$
This is where the log loss comes from: $l(y,\hat{y})=-\sum_{i=1}^{m} y_{i}\log_{2}\hat{y}_{i}$. When I watched the teacher's class, he threw the log loss at me directly and then went straight to cross-entropy, without explaining why log loss can serve as a loss function. With the knowledge above, readers should now understand where this log loss comes from. Keep reading; there is more to gain.
Because it is a binary classification problem, the cross-entropy for binary classification is:
$-\sum\left[y_{i}\log(p(x_{i}))+(1-y_{i})\log(1-p(x_{i}))\right]$
Here $p(x_{i})$ corresponds to $h(x_{i})$.
Thinking about these two perspectives:
 The formulas derived from them are identical; can both be called cross-entropy?
 No, they merely share the same formula. In the derivation of the maximum likelihood estimate, the base of $\log$ can be any positive number and the result has no unit, whereas information content is defined with $\log_{2}$ and has a unit: the bit.
classification function
Maximum Probability Classification Function
In a $k$-ary classification problem, given the prediction probability model $h$, i.e.
$h(x)=(h_{1}(x),h_{2}(x),...,h_{k}(x))$, where $h_{i}(x)$ is the probability that the sample belongs to the $i$-th class, the maximum probability classification function for model $h$ is:
$MaxProb_{h}(x)=\arg\max_{i} h_{i}(x)$
Note: $\arg\max$ takes the index of the maximum value.
Here $h_{i}(x)$ can be understood as the probability that the sample's label is $i$; we take the $i$ with the largest $h_{i}(x)$ as the label of the sample.
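The maximum probability rule above is one call to argmax. A minimal sketch (the function name `max_prob_classify` and the probability vector are made up for illustration):

```python
import numpy as np

def max_prob_classify(h_x):
    """MaxProb_h(x): return the index i with the largest h_i(x)."""
    return int(np.argmax(h_x))

# h_1(x), h_2(x), h_3(x) for one sample (0-based indices in code)
h_x = np.array([0.2, 0.5, 0.3])
print(max_prob_classify(h_x))   # 1 -- the index of the 0.5 entry
```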
threshold classification function
In a binary classification problem, let the labels take values in $\{0,1\}$.
Given a prediction probability model $h$, i.e. $h(x)$ represents the probability that the label of feature vector $x$ is 1, the threshold classification function with threshold $t$ for model $h$ is:
$Th_{h,t}(x)=\begin{cases}1, & h(x)\ge t\\ 0, & h(x)<t\end{cases}$
That is, when the predicted probability of label 1 is at least the threshold, the predicted label is 1; otherwise it is 0.
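The threshold rule can be sketched directly (the function name `threshold_classify` and the example probabilities are made up for illustration):

```python
import numpy as np

def threshold_classify(h_x, t=0.5):
    """Th_{h,t}(x): predict 1 when h(x) >= t, else 0."""
    return (np.asarray(h_x) >= t).astype(int)

probs = np.array([0.9, 0.4, 0.5, 0.1])
print(threshold_classify(probs))         # [1 0 1 0] with the default t = 0.5
print(threshold_classify(probs, t=0.8))  # [1 0 0 0] -- a stricter threshold
```

Raising $t$ trades recall for precision: fewer samples get label 1, but those that do have higher predicted probability.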
Optimization Algorithm for Logistic Regression
gradient descent
Theory: see my other article, Search Algorithm - Study Notes
machine_learning.logistic_regression.lib.logistic_regression_gd

```python
import numpy as np

def sigmoid(scores):
    return 1.0 / (1 + np.exp(-scores))

class LogisticRegression:
    def fit(self, X, y, eta=0.1, N=1000):      # eta: learning rate
        m, n = X.shape
        w = np.zeros((n, 1))
        for t in range(N):
            h = sigmoid(X.dot(w))              # h_w(X)
            g = 1.0 / m * X.T.dot(h - y)       # gradient g
            w = w - eta * g                    # update w
        self.w = w

    def predict_proba(self, X):
        return sigmoid(X.dot(self.w))          # probabilities

    def predict(self, X):
        proba = self.predict_proba(X)          # probabilities
        return (proba >= 0.5).astype(int)      # np.int was removed in NumPy 1.24
```
Is the cross-entropy loss a convex function of $w$? Yes; this can be shown by deriving its Hessian, which is positive semidefinite.
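A numerical spot check of that convexity claim (not a proof): a convex function must satisfy the midpoint inequality $f(\frac{a+b}{2}) \le \frac{f(a)+f(b)}{2}$ along any segment in weight space. The data and all names here are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Binary cross-entropy of the logistic model h(x) = sigmoid(Xw)."""
    h = sigmoid(X.dot(w))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy linearly separable labels

# Midpoint (Jensen) check along a random segment between weights a and b
a, b = rng.normal(size=2), rng.normal(size=2)
mid = loss((a + b) / 2, X, y)
avg = (loss(a, X, y) + loss(b, X, y)) / 2
print(mid <= avg + 1e-12)   # True: consistent with convexity
```

Convexity is what guarantees gradient descent converges to the global minimum here, rather than getting stuck in a local one.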
stochastic gradient descent
Theory: see my other article, Search Algorithm - Study Notes
```python
import numpy as np

def sigmoid(scores):
    return 1.0 / (1 + np.exp(-scores))

class LogisticRegression:
    def fit(self, X, y, eta_0=10, eta_1=50, N=1000):
        m, n = X.shape
        w = np.zeros((n, 1))
        self.w = np.zeros((n, 1))
        for t in range(N):
            i = np.random.randint(m)           # pick one sample x(i) at random
            x = X[i].reshape(1, -1)            # 1*n row vector
            pred = sigmoid(x.dot(w))           # h_w(x)
            g = x.T * (pred - y[i])            # gradient on this one sample
            w = w - eta_0 / (t + eta_1) * g    # update w with a decaying step size
            self.w += w                        # accumulate the iterates
        self.w /= N                            # average of the iterates

    def predict_proba(self, X):
        return sigmoid(X.dot(self.w))

    def predict(self, X):
        proba = self.predict_proba(X)
        return (proba >= 0.5).astype(int)
```