Machine Learning - Training a Model

linear regression

$$
\widehat{y} = h_{\Theta}(x) = \Theta \cdot x
$$

where
Θ is the parameter vector of the model, which includes the bias term Θ_0 and the feature weights Θ_1 to Θ_n
x is the feature vector of the instance, containing x_0 to x_n, with x_0 always equal to 1
Θ · x is the dot product of the vectors Θ and x
h_Θ is the hypothesis function, using the model parameters Θ

MSE cost function for linear regression model

$$
MSE(X, h_{\Theta}) = \frac{1}{m}\sum_{i=1}^{m}\left(\Theta^T x^{(i)} - y^{(i)}\right)^2
$$

Normal equation

$$
\widehat{\Theta} = (X^TX)^{-1}X^Ty
$$

In this equation
Θ̂ is the value of Θ that minimizes the cost function
y is the vector of target values, containing y^(1) to y^(m)
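As a minimal sketch of the normal equation, the snippet below solves a tiny noise-free problem with NumPy. The data (y = 4 + 3x without noise) and the variable names are assumptions chosen for illustration, so the recovered parameters should be 4 and 3 up to floating-point error.

```python
import numpy as np

# Noise-free toy data (an assumption for illustration): y = 4 + 3x
x = np.linspace(0, 2, 50).reshape(-1, 1)
y = 4 + 3 * x

# Prepend x_0 = 1 to every instance so that Theta_0 acts as the bias term
x_b = np.c_[np.ones((len(x), 1)), x]

# Theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(x_b.T @ x_b) @ x_b.T @ y
print(theta_hat.ravel())
```

Inverting X^T X directly is fine for a small, well-conditioned example like this one; for larger or ill-conditioned problems a least-squares solver such as `np.linalg.lstsq` is usually the safer choice.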

gradient descent

batch gradient descent


Partial derivative of the cost function:

$$
\frac{\partial}{\partial\Theta_j}MSE(\Theta) = \frac{2}{m}\sum_{i=1}^{m}\left(\Theta^T x^{(i)} - y^{(i)}\right)x_j^{(i)}
$$

Gradient vector of the cost function:

$$
\nabla_{\Theta}MSE(\Theta) = \frac{2}{m}X^T(X\Theta - y)
$$

Gradient descent step (η is the learning rate):

$$
\Theta^{(next\ step)} = \Theta - \eta\nabla_{\Theta}MSE(\Theta)
$$

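eta = 0.1  # learning rate
n_iteration = 1000  # number of iterations
m = 100  # number of training instances

theta = np.random.randn(2,1)  # random initialization
for n in range(n_iteration):
    gradienters = 2/m*x_b.T.dot(x_b.dot(theta)-y)  # gradient of the MSE
    theta = theta-eta*gradienters  # update theta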

stochastic gradient descent

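'''Stochastic gradient descent'''
n_epochs = 50  # number of epochs

t0,t1 = 5,50  # learning schedule hyperparameters

def learning_schedule(t):  # compute the learning rate
    return t0/(t+t1)  # the learning rate gradually decreases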

theta = np.random.randn(2,1)
for n in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = x_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradienters = 2*xi.T.dot(xi.dot(theta)-yi)
        eta = learning_schedule(n*m+i)
        theta = theta-eta*gradienters

sklearn implements stochastic gradient descent

sgd_reg = SGDRegressor(max_iter=1000,tol = 1e-3,penalty=None,eta0=0.1)
sgd_reg.fit(x,y.ravel())
print(sgd_reg.intercept_,sgd_reg.coef_)

Mini-batch gradient descent

'''Mini-batch gradient descent'''
theta_path_mgd = []

n_iterations = 50
minibatch_size = 20

np.random.seed(42)
theta = np.random.randn(2,1)  # random initialization

t0, t1 = 200, 1000
def learning_schedule(t):
    return t0 / (t + t1)

t = 0
for epoch in range(n_iterations):
    shuffled_indices = np.random.permutation(m)
    x_b_shuffled = x_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = x_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients
        theta_path_mgd.append(theta)

polynomial regression

'''Polynomial regression'''
m = 100
x = 6*np.random.rand(m,1)-3
y = 0.5*x**2+x+2+np.random.randn(m,1)

poly_features = PolynomialFeatures(degree=2,include_bias=False)  # adds the square of each feature
x_poly = poly_features.fit_transform(x)

lin_reg = LinearRegression()
lin_reg.fit(x_poly,y)
# Plot the predictions
X_new=np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(x, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])

plt.show()

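When there are several features, PolynomialFeatures also adds all combinations of features up to the given degree. For example, with two features a and b and degree=3, it adds not only a^2, a^3, b^2 and b^3 but also the combinations ab, a^2b and ab^2.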
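To make the feature combinations concrete, here is a small sketch that expands a single instance with two features to degree 3. The values a = 2 and b = 3 and the names `X_demo` and `poly` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One instance with two features, a = 2 and b = 3 (illustrative values)
X_demo = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=3, include_bias=False)
X_demo_poly = poly.fit_transform(X_demo)

# Nine columns: the two original features plus every product of up to three factors
print(X_demo_poly.shape)
print(X_demo_poly)
```

The number of generated features grows quickly with the degree and the number of input features, which is worth keeping in mind before choosing a high degree.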

regularized linear model

Ridge regression

Cost function:

$$
J(\Theta) = MSE(\Theta) + \alpha\frac{1}{2}\sum_{i=1}^{n}\Theta_i^2
$$

The hyperparameter α controls how much the model is regularized. If α = 0, ridge regression is just linear regression. If α is very large, all weights end up very close to zero, and the result is a flat line through the mean of the data.

Closed-form solution of ridge regression:

$$
\widehat{\Theta} = (X^TX + \alpha A)^{-1}X^Ty
$$

where A is the (n+1)×(n+1) identity matrix, except for a 0 in the top-left cell, which corresponds to the bias term.

from sklearn.linear_model import Ridge
np.random.seed(42)
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
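The closed-form solution can be checked against scikit-learn with a short sketch. It assumes that A is the identity matrix with a 0 in the top-left cell (so the bias is not regularized) and that the penalty is added to the sum of squared errors, which is the scaling scikit-learn's `Ridge` uses. The data and the names `X_demo`, `y_demo` and `theta_closed` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
m = 20
X_demo = 3 * rng.random((m, 1))
y_demo = 1 + 0.5 * X_demo[:, 0] + rng.normal(size=m) / 1.5

alpha = 1.0
X_b = np.c_[np.ones((m, 1)), X_demo]   # add x_0 = 1 for the bias term
A = np.eye(X_b.shape[1])
A[0, 0] = 0.0                          # assumption: the bias term is not regularized

# Theta_hat = (X^T X + alpha * A)^{-1} X^T y, solved without an explicit inverse
theta_closed = np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y_demo)

ridge = Ridge(alpha=alpha, solver="cholesky").fit(X_demo, y_demo)
print(theta_closed)
print(ridge.intercept_, ridge.coef_)
```

Both approaches should give the same intercept and weight up to numerical precision.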

Lasso regression

Least Absolute Shrinkage and Selection Operator Regression

Cost function:

$$
J(\Theta) = MSE(\Theta) + \alpha\sum_{i=1}^{n}|\Theta_i|
$$

Lasso regression subgradient vector:
$$
g(\Theta, J) = \nabla_{\Theta}MSE(\Theta) + \alpha
\begin{pmatrix}
sign(\Theta_1)\\
sign(\Theta_2)\\
\vdots\\
sign(\Theta_n)
\end{pmatrix}
\quad where \quad sign(\Theta_i) =
\begin{cases}
-1 & if\ \Theta_i < 0\\
0 & if\ \Theta_i = 0\\
+1 & if\ \Theta_i > 0
\end{cases}
$$
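A minimal NumPy sketch of this subgradient follows. The function name `lasso_subgradient` and the toy data are hypothetical, and the sketch assumes that the bias term Θ_0 is left unpenalized, in line with the sum starting at i = 1 in the cost function.

```python
import numpy as np

def lasso_subgradient(theta, X_b, y, alpha):
    """Subgradient of MSE(theta) + alpha * sum(|theta_i|) for i >= 1."""
    m = len(y)
    mse_grad = 2 / m * X_b.T @ (X_b @ theta - y)
    penalty = alpha * np.sign(theta)   # np.sign returns -1, 0 or +1
    penalty[0] = 0.0                   # assumption: the bias theta_0 is not penalized
    return mse_grad + penalty

# Toy check: the targets fit theta exactly, so only the penalty term remains
theta = np.array([1.0, -2.0, 0.0])
X_b = np.c_[np.ones(4), np.arange(8.0).reshape(4, 2)]
y = X_b @ theta
g = lasso_subgradient(theta, X_b, y, alpha=0.1)
print(g)
```

Because the MSE gradient vanishes in this toy case, the result shows the sign vector directly: the negative weight is pushed up, and the weight that is already zero receives no push.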

from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

Elastic net

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])

Early stopping


np.random.seed(42)
m = 100
x = 6 * np.random.rand(m, 1) - 3
y = 2 + x + 0.5 * x**2 + np.random.randn(m, 1)

x_train, x_val, y_train, y_val = train_test_split(x[:50], y[:50].ravel(), test_size=0.5, random_state=10)
poly_scaler = Pipeline([('poly_features',PolynomialFeatures(degree=90,include_bias=False)),('std_scaler',StandardScaler())])
x_train_poly_scaled = poly_scaler.fit_transform(x_train)
x_val_poly_scaled = poly_scaler.transform(x_val)
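# warm_start=True makes each fit() call continue from the previous weights;
# tol=None disables the tolerance-based stopping criterion
sgd_reg = SGDRegressor(max_iter=1,tol=None,warm_start=True,penalty=None,learning_rate='constant',eta0=0.0005)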
minimum_val_error = float('inf')
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(x_train_poly_scaled,y_train)
    y_val_predict = sgd_reg.predict(x_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)

# Plot the learning curves
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)

n_epochs = 500
train_errors, val_errors = [], []
for epoch in range(n_epochs):
    sgd_reg.fit(x_train_poly_scaled, y_train)
    y_train_predict = sgd_reg.predict(x_train_poly_scaled)
    y_val_predict = sgd_reg.predict(x_val_poly_scaled)
    train_errors.append(mean_squared_error(y_train, y_train_predict))
    val_errors.append(mean_squared_error(y_val, y_val_predict))

best_epoch = np.argmin(val_errors)
best_val_rmse = np.sqrt(val_errors[best_epoch])

plt.annotate('Best model',
             xy=(best_epoch, best_val_rmse),
             xytext=(best_epoch, best_val_rmse + 1),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.05),
             fontsize=16,
            )

best_val_rmse -= 0.03  # just to make the graph look better
plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")
plt.plot(np.sqrt(train_errors), "r--", linewidth=2, label="Training set")
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("RMSE", fontsize=14)

plt.show()

Answers to questions 1, 2, and 4
The purpose of early stopping is to prevent overfitting. By returning the parameters that gave the lowest validation error, rather than the most recent ones, we obtain a model with a lower error on the validation set.
Figure 1. Learning curve (horizontal axis is training rounds, vertical axis is negative log-likelihood)
Notes to Figure 1:
The blue curve indicates how the loss on the training set changes with the training rounds.
The red curve shows how the loss on the test set changes with the training rounds.
Note: The red curve is the curve obtained by testing once after each training epoch.
It can be seen from the figure that the test error gradually decreases in the first few epochs, but after training to a certain epoch, the test error increases slightly again. This shows that overfitting has occurred at this time.

Overfitting is undesirable, and early stopping is one way to prevent it.

Early stopping means that training is halted before the test error starts to rise, even if training has not yet converged (that is, the training error has not yet reached its minimum).

First, we save the current model (network structure and weights), train for num_batch iterations (i.e. one epoch) to get a new model, and evaluate the new model on the test set. If the test error is larger than the one obtained last time, we do not stop immediately but continue to train and test for several more epochs. If the test error still does not decrease, we stop and return the model saved when the lowest test error was last reached. The specific algorithm can be found in "Deep Learning".
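The procedure described above can be sketched as a loop with a "patience" counter. This is only an illustration under stated assumptions: the function name `train_with_patience`, the synthetic data, and the use of `SGDRegressor.partial_fit` for one epoch of training are choices made here, and the held-out data is called a validation set rather than a test set.

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

def train_with_patience(model, X_train, y_train, X_val, y_val,
                        max_epochs=500, patience=10):
    """Train one epoch at a time; stop after `patience` epochs without improvement."""
    best_error, best_model, best_epoch = float("inf"), None, -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        model.partial_fit(X_train, y_train)          # one pass over the training set
        val_error = mean_squared_error(y_val, model.predict(X_val))
        if val_error < best_error:
            best_error, best_epoch = val_error, epoch
            best_model = deepcopy(model)             # save the best model so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                # no improvement for a while: stop
    return best_model, best_epoch, best_error

# Synthetic data, made up for this example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2 + X[:, 0] + rng.normal(size=200)
X_train, X_val, y_train, y_val = X[:100], X[100:], y[:100], y[100:]

model = SGDRegressor(penalty=None, learning_rate="constant", eta0=0.001, random_state=0)
best_model, best_epoch, best_error = train_with_patience(
    model, X_train, y_train, X_val, y_val)
print(best_epoch, best_error)
```

The returned model is the saved copy from the best epoch, not the model as it stood when training stopped.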

Answer to Question 3
This question asks why early stopping works as a regularizer.
First, we approximate the loss function in the neighborhood of the optimum ω* with a Taylor expansion (up to the second order):

$$
\hat{J}(\omega) = J(\omega^*) + \frac{1}{2}(\omega-\omega^*)^T H(\omega-\omega^*)
$$

where H is the Hessian matrix. There is no first-order term because ω* is the optimal solution, so the gradient can be considered to be 0 in its neighborhood.
Taking the gradient of Ĵ(ω), we get

$$
\nabla_{\omega}\hat{J}(\omega) = H(\omega-\omega^*)
$$

We initialize the parameter vector ω^(0) at the origin. Gradient descent with learning rate α then gives:

$$
\omega^{(\tau)} = \omega^{(\tau-1)} - \alpha\nabla_{\omega}\hat{J}(\omega^{(\tau-1)})
$$

$$
\omega^{(\tau)} = \omega^{(\tau-1)} - \alpha H(\omega^{(\tau-1)}-\omega^*)
$$

$$
\omega^{(\tau)}-\omega^* = (I-\alpha H)(\omega^{(\tau-1)}-\omega^*)
$$

Take the eigendecomposition of H: H = Q^T Λ Q, where Q is an orthonormal matrix and Λ is a diagonal matrix with the eigenvalues λ_i on its diagonal.
So

$$
\omega^{(\tau)}-\omega^* = Q^T(I-\alpha\Lambda)Q(\omega^{(\tau-1)}-\omega^*)
$$

$$
Q(\omega^{(\tau)}-\omega^*) = (I-\alpha\Lambda)Q(\omega^{(\tau-1)}-\omega^*)
$$

where α is small enough to ensure that |1 − αλ_i| < 1. Applying this recursion τ times, starting from ω^(0) = 0, gives

$$
Q\omega^{(\tau)} = \left(I-(I-\alpha\Lambda)^{\tau}\right)Q\omega^*
$$

The analysis of L2 regularization with coefficient ε gives, for the regularized solution ω̃,

$$
Q\tilde{\omega} = \left(I-(\Lambda+\varepsilon I)^{-1}\varepsilon\right)Q\omega^*
$$

Comparing the two formulas, we can see that if the following holds:

$$
(\Lambda+\varepsilon I)^{-1}\varepsilon = (I-\alpha\Lambda)^{\tau}
$$

then L2 regularization and early stopping can be considered equivalent. For each eigenvalue this means

$$
\frac{\varepsilon}{\lambda_i+\varepsilon} = (1-\alpha\lambda_i)^{\tau}
$$

Taking the logarithm of both sides:

$$
\log\frac{\varepsilon}{\lambda_i+\varepsilon} = \tau\log(1-\alpha\lambda_i)
$$

that is,

$$
-\log\left(1+\frac{\lambda_i}{\varepsilon}\right) = \tau\log(1-\alpha\lambda_i)
$$

As a simple approximation, using log(1+x) ≈ x for small x (the first non-constant term of the Taylor expansion):

$$
-\frac{\lambda_i}{\varepsilon} \approx -\alpha\tau\lambda_i
$$

Thus

$$
\tau \approx \frac{1}{\alpha\varepsilon}
$$

The above derivation shows that early termination can play the role of regularization.
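As a numerical sanity check of the derivation, the sketch below runs gradient descent on a quadratic loss from the origin and compares the result with the closed-form expression for Qω^(τ). The Hessian, the optimum and all variable names are made up for the example; Q is taken as the transpose of the eigenvector matrix returned by NumPy so that H = Q^T Λ Q.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
H = B @ B.T + np.eye(3)            # a symmetric positive definite Hessian
w_star = rng.normal(size=3)        # the optimum of the quadratic loss

lam, V = np.linalg.eigh(H)         # H = V diag(lam) V^T
Q = V.T                            # so that H = Q^T diag(lam) Q

alpha = 0.5 / lam.max()            # small enough that |1 - alpha * lam_i| < 1
tau = 20

# Gradient descent from the origin on J(w) = 1/2 (w - w*)^T H (w - w*)
w = np.zeros(3)
for _ in range(tau):
    w = w - alpha * H @ (w - w_star)

# Closed form: Q w^(tau) = (I - (I - alpha * Lambda)^tau) Q w*
closed_form = (1 - (1 - alpha * lam) ** tau) * (Q @ w_star)
print(Q @ w)
print(closed_form)
```

Directions with large eigenvalues reach their optimal values within a few steps, while directions with small eigenvalues stay close to zero when training stops early, which mirrors the shrinkage that an L2 penalty applies.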

logistic regression

estimated probability

The estimated probability of the logistic regression model:

$$
\widehat{p} = h_{\Theta}(x) = \sigma(x^T\Theta)
$$

The logistic function, denoted σ(·), is a sigmoid function (i.e. an S-shaped function) that outputs a number between 0 and 1.

Logistic function:

$$
\sigma(t) = \frac{1}{1+\exp(-t)}
$$

Logistic regression model prediction:

$$
\widehat{y} = \begin{cases} 0, & if\ \widehat{p} < 0.5\\ 1, & if\ \widehat{p} \geq 0.5 \end{cases}
$$

Note that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0. So the logistic regression model predicts 1 if x^TΘ is positive, and 0 if it is negative.
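The prediction rule can be sketched in a few lines of NumPy. The parameter vector and the three instances below are hypothetical values, picked so that the scores x^TΘ are -1, 0 and 1.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict(X_b, theta):
    """Predict class 1 when the estimated probability is at least 0.5."""
    p_hat = sigmoid(X_b @ theta)
    return (p_hat >= 0.5).astype(int)

theta = np.array([-1.0, 2.0])                      # hypothetical parameters
X_b = np.array([[1.0, 0.0],                        # score -1
                [1.0, 0.5],                        # score  0
                [1.0, 1.0]])                       # score +1
print(sigmoid(X_b @ theta))
print(predict(X_b, theta))
```

The instance with a score of exactly 0 has an estimated probability of 0.5 and is therefore assigned to the positive class.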

Training and Cost Function

The cost function for a single training instance:

$$
c(\Theta) = \begin{cases} -\log(\widehat{p}), & if\ y = 1\\ -\log(1-\widehat{p}), & if\ y = 0 \end{cases}
$$

Logistic regression cost function (log loss):

$$
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\widehat{p}^{(i)}) + (1-y^{(i)})\log(1-\widehat{p}^{(i)})\right]
$$

Partial derivatives of the logistic cost function:

$$
\frac{\partial}{\partial\Theta_j}J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma(\Theta^Tx^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$
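One way to convince yourself that the partial derivatives match the log loss is to compare them with a finite-difference approximation. The sketch below does this on random data; the function names and the data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(theta, X_b, y):
    """Average log loss over all instances."""
    p_hat = sigmoid(X_b @ theta)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def log_loss_gradient(theta, X_b, y):
    """Vector of partial derivatives: (1/m) * X^T (sigma(X theta) - y)."""
    return X_b.T @ (sigmoid(X_b @ theta) - y) / len(y)

rng = np.random.default_rng(0)
X_b = np.c_[np.ones(30), rng.normal(size=(30, 2))]   # bias column plus two features
y = (rng.random(30) < 0.5).astype(float)             # random 0/1 labels
theta = rng.normal(size=3)

# Central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (log_loss(theta + eps * e, X_b, y) - log_loss(theta - eps * e, X_b, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = log_loss_gradient(theta, X_b, y)
print(numeric)
print(analytic)
```

The two vectors should agree to several decimal places, which is a quick way to catch sign or indexing mistakes in a hand-written gradient.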

decision boundary

'''Decision boundary'''
iris = datasets.load_iris()
x = iris['data'][:,3:]  # petal width
y = (iris['target']==2).astype(np.int_)  # 1 if Iris virginica, else 0
log_reg = LogisticRegression()
log_reg.fit(x,y)
# Plot the probabilities estimated by the model
x_new = np.linspace(0,3,1000).reshape(-1,1)
y_proba = log_reg.predict_proba(x_new)
plt.plot(x_new,y_proba[:,1],'g-',label = 'Iris virginica')
plt.plot(x_new,y_proba[:,0],'b--',label = 'Not Iris virginica')
plt.show()


Softmax regression

Softmax score for class k:

$$
s_k(x) = x^T\Theta^{(k)}
$$

Softmax function:

$$
\widehat{p}_k = \sigma(s(x))_k = \frac{\exp(s_k(x))}{\sum_{j=1}^{K}\exp(s_j(x))}
$$

In this equation
K is the number of classes
s(x) is a vector containing the scores of each class for the instance x
σ(s(x))_k is the estimated probability that the instance x belongs to class k, given the scores of each class for that instance

Softmax regression classifier prediction:

$$
\widehat{y} = \underset{k}{argmax}\ \sigma(s(x))_k = \underset{k}{argmax}\ s_k(x) = \underset{k}{argmax}\left((\Theta^{(k)})^Tx\right)
$$

Cross entropy cost function:

$$
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}y_k^{(i)}\log(\widehat{p}_k^{(i)})
$$

In this equation, y_k^(i) is the target probability that the i-th instance belongs to class k. It is generally equal to 1 or 0, depending on whether the instance belongs to the class or not.
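The softmax function and the cross entropy cost can be sketched in NumPy as follows. The scores and one-hot targets are made-up values. Subtracting the row maximum before exponentiating is a common numerical-stability trick that is not part of the text above; it leaves the resulting probabilities unchanged.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: turns class scores into probabilities that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # stability trick
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cross_entropy(p_hat, y_onehot):
    """Average cross entropy between one-hot targets and predicted probabilities."""
    return -np.mean(np.sum(y_onehot * np.log(p_hat), axis=1))

# Two instances, three classes (illustrative scores)
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])

p_hat = softmax(scores)
print(p_hat)
print(p_hat.argmax(axis=1))          # predicted classes
print(cross_entropy(p_hat, y_onehot))
```

Because the exponential is increasing, the class with the highest score is also the class with the highest probability, which is why the prediction can be taken directly from the scores.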

Introduction

Cross entropy is an important concept in Shannon's information theory, used mainly to measure the difference between two probability distributions. The performance of a language model is usually measured by cross entropy and perplexity. Cross entropy reflects how difficult it is for the model to recognize the text or, from a compression point of view, how many bits are needed on average to encode each word. Perplexity is the average number of branches the model assigns to the text, and its reciprocal can be regarded as the average probability of each word. Smoothing assigns a probability to unobserved N-gram combinations, so that the language model can always give a word sequence a probability. Commonly used smoothing techniques are Good-Turing estimation, deleted interpolation smoothing, Katz smoothing and Kneser-Ney smoothing.

Formula

Cross entropy has been introduced into disambiguation in computational linguistics: the true semantics of a sentence are used as the prior information of the training set, and the semantics of the machine translation are used as the posterior information of the test set. The cross entropy of the two is calculated and used to guide the identification and elimination of ambiguity. Examples show that the method is simple, effective and easy to implement adaptively on a computer. Cross entropy is an effective tool for disambiguation in computational linguistics.

In information theory, cross entropy is defined for two probability distributions p and q, where p is the true distribution and q is the non-true distribution. Over the same set of events, it is the average number of bits required to identify an event when the non-true distribution q is used for encoding. This definition is hard to grasp on its own, so consider an example:

Suppose a sample set has two probability distributions p and q, where p is the true distribution and q is the non-true distribution. According to the true distribution p, the expected code length required to identify a sample is:

$$
H(p) = \sum_{i}p(i)\log\frac{1}{p(i)}
$$

However, if the wrong distribution q is used to encode samples drawn from the true distribution p, the average code length is:

$$
H(p,q) = \sum_{i}p(i)\log\frac{1}{q(i)}
$$

H(p,q) is called the cross entropy. It is calculated as follows.

For discrete variables:

$$
H(p,q) = \sum_{x}p(x)\log\frac{1}{q(x)}
$$

For continuous variables:

$$
H(p,q) = -\int_{X}p(x)\log q(x)\,dx
$$
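A short numeric example may help with the code-length interpretation. The two distributions below are made up, and the logarithm is taken in base 2 so that the results are in bits; both choices are assumptions for illustration.

```python
import numpy as np

def entropy(p):
    """Expected code length (in bits) when encoding with the true distribution p."""
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Expected code length (in bits) when samples from p are encoded using q."""
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])   # true distribution
q = np.array([0.25, 0.25, 0.5])   # non-true distribution

print(entropy(p))            # 1.5
print(cross_entropy(p, q))   # 1.75
```

Encoding with the wrong distribution costs 0.25 bits more per sample on average in this example; the cross entropy is never smaller than the entropy of the true distribution.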

Cross entropy gradient vector for class k:

$$
\nabla_{\Theta^{(k)}}J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\widehat{p}_k^{(i)} - y_k^{(i)}\right)x^{(i)}
$$

'''Softmax regression'''
iris = datasets.load_iris()
X= iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs", C=10, random_state=42)
softmax_reg.fit(X, y)
x0, x1 = np.meshgrid(
        np.linspace(0, 8, 500).reshape(-1, 1),
        np.linspace(0, 3.5, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]


y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)
# Plot the decision regions and the probability contours
plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])

plt.show()

code summary

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression,SGDRegressor,LogisticRegression
from sklearn.preprocessing import PolynomialFeatures,StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from copy import  deepcopy
from sklearn import datasets


x = 2*np.random.rand(100,1)
y = 4+3*x+np.random.randn(100,1)

x_b = np.c_[np.ones((100,1)),x]
theta_best = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y)

x_new = np.array([[0],[2]])
x_new_b = np.c_[np.ones((2,1)),x_new]
y_predict = x_new_b.dot(theta_best)

# Using sklearn
lin_reg = LinearRegression()
lin_reg.fit(x,y)

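'''Batch gradient descent'''
eta = 0.1  # learning rate
n_iteration = 1000  # number of iterations
m = 100  # number of training instances

theta = np.random.randn(2,1)  # random initialization
for n in range(n_iteration):
    gradienters = 2/m*x_b.T.dot(x_b.dot(theta)-y)  # gradient of the MSE
    theta = theta-eta*gradienters  # update theta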

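'''Stochastic gradient descent'''
n_epochs = 50  # number of epochs

t0,t1 = 5,50  # learning schedule hyperparameters

def learning_schedule(t):  # compute the learning rate
    return t0/(t+t1)  # the learning rate gradually decreases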

theta = np.random.randn(2,1)
for n in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = x_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradienters = 2*xi.T.dot(xi.dot(theta)-yi)
        eta = learning_schedule(n*m+i)
        theta = theta-eta*gradienters


sgd_reg = SGDRegressor(max_iter=1000,tol = 1e-3,penalty=None,eta0=0.1)
sgd_reg.fit(x,y.ravel())
print(sgd_reg.intercept_,sgd_reg.coef_)

'''Mini-batch gradient descent'''
theta_path_mgd = []

n_iterations = 50
minibatch_size = 20

np.random.seed(42)
theta = np.random.randn(2,1)  # random initialization

t0, t1 = 200, 1000
def learning_schedule(t):
    return t0 / (t + t1)

t = 0
for epoch in range(n_iterations):
    shuffled_indices = np.random.permutation(m)
    x_b_shuffled = x_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = x_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients
        theta_path_mgd.append(theta)

'''Polynomial regression'''
m = 100
x = 6*np.random.rand(m,1)-3
y = 0.5*x**2+x+2+np.random.randn(m,1)

poly_features = PolynomialFeatures(degree=2,include_bias=False)  # adds the square of each feature
x_poly = poly_features.fit_transform(x)

lin_reg = LinearRegression()
lin_reg.fit(x_poly,y)
# Plot the predictions
X_new=np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(x, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])

plt.show()

'''Learning curves'''
def plot_learning_curves(model,x,y):
    x_train,x_val,y_train,y_val = train_test_split(x,y,test_size=0.2)
    train_erros,val_erros = [],[]
    for m in range(1,len(x_train)):
        model.fit(x_train[:m],y_train[:m])
        y_train_pred = model.predict(x_train[:m])
        y_val_pred = model.predict(x_val)
        train_erros.append(mean_squared_error(y_train[:m],y_train_pred))
        val_erros.append(mean_squared_error(y_val,y_val_pred))
    plt.plot(np.sqrt(train_erros),'r-+',linewidth = 2,label = 'train')
    plt.plot(np.sqrt(val_erros),'b--',linewidth = 3,label = 'val')
    plt.show()


lin_reg = LinearRegression()
plot_learning_curves(lin_reg,x,y)

'''Ridge regression'''
np.random.seed(42)
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
'''lasso'''
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

'''Early stopping'''
np.random.seed(42)
m = 100
x = 6 * np.random.rand(m, 1) - 3
y = 2 + x + 0.5 * x**2 + np.random.randn(m, 1)

x_train, x_val, y_train, y_val = train_test_split(x[:50], y[:50].ravel(), test_size=0.5, random_state=10)
poly_scaler = Pipeline([('poly_features',PolynomialFeatures(degree=90,include_bias=False)),('std_scaler',StandardScaler())])
x_train_poly_scaled = poly_scaler.fit_transform(x_train)
x_val_poly_scaled = poly_scaler.transform(x_val)
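# warm_start=True makes each fit() call continue from the previous weights;
# tol=None disables the tolerance-based stopping criterion
sgd_reg = SGDRegressor(max_iter=1,tol=None,warm_start=True,penalty=None,learning_rate='constant',eta0=0.0005)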
minimum_val_error = float('inf')
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(x_train_poly_scaled,y_train)
    y_val_predict = sgd_reg.predict(x_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)

# Plot the learning curves
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)

n_epochs = 500
train_errors, val_errors = [], []
for epoch in range(n_epochs):
    sgd_reg.fit(x_train_poly_scaled, y_train)
    y_train_predict = sgd_reg.predict(x_train_poly_scaled)
    y_val_predict = sgd_reg.predict(x_val_poly_scaled)
    train_errors.append(mean_squared_error(y_train, y_train_predict))
    val_errors.append(mean_squared_error(y_val, y_val_predict))

best_epoch = np.argmin(val_errors)
best_val_rmse = np.sqrt(val_errors[best_epoch])

plt.annotate('Best model',
             xy=(best_epoch, best_val_rmse),
             xytext=(best_epoch, best_val_rmse + 1),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.05),
             fontsize=16,
            )

best_val_rmse -= 0.03  # just to make the graph look better
plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")
plt.plot(np.sqrt(train_errors), "r--", linewidth=2, label="Training set")
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("RMSE", fontsize=14)

plt.show()

'''Decision boundary'''
iris = datasets.load_iris()
x = iris['data'][:,3:]  # petal width
y = (iris['target']==2).astype(np.int_)  # 1 if Iris virginica, else 0
log_reg = LogisticRegression()
log_reg.fit(x,y)
# Plot the probabilities estimated by the model
x_new = np.linspace(0,3,1000).reshape(-1,1)
y_proba = log_reg.predict_proba(x_new)
plt.plot(x_new,y_proba[:,1],'g-',label = 'Iris virginica')
plt.plot(x_new,y_proba[:,0],'b--',label = 'Not Iris virginica')
plt.show()
'''Softmax regression'''
iris = datasets.load_iris()
X= iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs", C=10, random_state=42)
softmax_reg.fit(X, y)
x0, x1 = np.meshgrid(
        np.linspace(0, 8, 500).reshape(-1, 1),
        np.linspace(0, 3.5, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]


y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])

plt.show()


Origin blog.csdn.net/m0_63953970/article/details/126009543