Machine Learning

Introduction

Notes from Andrew Ng's Machine Learning course (bilibili)

It's the science of getting computers to learn without being explicitly programmed.
Arthur Samuel (1959): The field of study that gives computers the ability to learn without being explicitly programmed.

Machine learning is a subfield of artificial intelligence. It studies how computers can simulate or implement human learning behaviors to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve performance. It is the core of artificial intelligence and the fundamental way to make computers intelligent.

Machine learning has the following definitions:
(1) Machine learning is a science of artificial intelligence. Its main research object is how to improve the performance of specific algorithms through learning from experience.
(2) Machine learning is the study of computer algorithms that improve automatically through experience.
(3) Machine learning is the use of data or past experience to optimize the performance criteria of computer programs.


Machine Learning Algorithms

The main types of machine learning algorithms:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

Supervised Learning

Supervised learning learns a mapping from input x to output label y; the algorithm learns from being given the "correct answer" for each input.

Types of supervised learning algorithms:

  • Regression
    • Regression tries to predict a number from infinitely many possible numbers.
  • Classification
    • Classification predicts one of a small number of possible outputs or classes.

Regression

Regression tries to predict a number, such as the house price in the example, from infinitely many possible numbers.

Classification

Taking breast cancer detection as an example, there are only two possible outputs (two possible classes), which is why it is called classification.

The predicted category does not have to be numeric.


Unsupervised Learning

The data only contains the input x, not the output label y.
Algorithms have to find structure, patterns, or interesting things in the data.

We don't supervise the algorithm by giving the "correct answer" for every input; instead, we want the algorithm to find some structure or pattern on its own, or just find something interesting in the data.
An unsupervised learning algorithm can decide that data can be assigned to two different groups or two different clusters.

Types of unsupervised learning:

  • Clustering
    • Group similar data points together.
  • Anomaly detection
    • Spot anomalous data points (e.g., fraud detection in financial systems).
  • Dimensionality reduction
    • Compress data using fewer numbers while losing as little information as possible.

Clustering

A clustering algorithm is an unsupervised learning algorithm: it takes unlabeled data and tries to automatically group it into clusters.



Linear Regression Model

A linear regression model fits a straight line to your data, and it is probably the most widely used learning algorithm in the world today. It is a supervised learning model, and it is called a regression model because it predicts a number as output.

In a regression model, the output can be any of infinitely many possible numbers; a classification model has only a small, discrete, finite set of possible outputs.

Example: Predicting the price of a house based on its size.


Terminology

  • The data set used to train the model is called the training set.

  • Input variables are also called features or input features.

  • The output variable is also called the target variable (output target).

Training set (input features, output targets) → learning algorithm → function f


The function f is called the model. y is the target, the actual ground truth value in the training set. y-hat is an estimate or prediction of y, which may or may not be the actual true value.

A linear function is just a fancy term for a straight line, not a nonlinear function like a curve or a parabola.

Linear regression with one variable (a linear model with a single input feature) is also known as univariate linear regression.

Cost Function

Model: f_{w,b}(x) = wx + b
In machine learning, w and b are called the parameters of the model: variables that can be adjusted during training to improve the model, in this case so that the line fits the training data better. Sometimes w and b are also called coefficients or weights.

The squared error cost function divides by 2m rather than m; the extra division by 2 just makes some of the later calculations a little tidier:

J(w, b) = \frac{1}{2m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})^2

This cost function is called the squared error cost function; it is the most commonly used cost function for linear regression.
To measure how well a choice of the parameters w and b fits (aligns with) the training data, we use the cost function J(w, b).
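As a sketch under those definitions, the cost can be computed with a few lines of NumPy (the function name and example data are illustrative, not from the course):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) for univariate linear regression."""
    m = x.shape[0]
    f = w * x + b                      # predictions f_{w,b}(x^(i))
    return np.sum((f - y) ** 2) / (2 * m)

# Illustrative example: sizes (1000 sqft) and prices (1000s of dollars)
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0, a perfect fit
```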


Intuitive understanding of the cost function

The cost function measures the difference between the model's predictions and the true values of y.
The goal of linear regression is to find the parameters w and b that make the cost function J as small as possible.


Comparing plots of the simplified model (with b = 0) against the resulting cost function J(w):
w = 1 ===> J(w) = 0


w = 0.5 ===> J(w) = 0.58


w = 0 ===> J(w) = 2.3

w = -0.5 ===> J(w) = 5.25


Visualizing the cost function

For the housing-price model, first ignore the parameter b and consider only the cost function curve J(w) produced by the parameter w.

Taking both parameters w and b into account, the cost function becomes a bowl-shaped surface over w and b, and a contour plot of that surface (viewed from above) can also be used to visualize J(w, b).

Gradient Descent

We want an efficient algorithm, expressible in code, that automatically finds the values of the parameters w and b giving the best-fit line, i.e. minimizing the cost function; the gradient descent algorithm does exactly that.
Gradient descent is ubiquitous in machine learning, not only for linear regression, but also for some larger and more complex models in artificial intelligence, such as training some of the most advanced neural network models, also known as deep learning models.
Gradient descent is an algorithm that can be used to minimize any function, not just the cost function of linear regression.

  • Generally, start by initializing the parameters (for example, w = 0 and b = 0)
  • Then gradually change the parameter values to reduce the value of the cost function
  • Until we hit or get close to a minimum (there may be more than one minimum)


Local minima

Implementing gradient descent

α: the learning rate, usually a small number between 0 and 1 (for example 0.1 or 0.2); it controls the step size when "going downhill", i.e. how much the model parameters w and b change on each update.
∂J(w, b)/∂w: the (partial) derivative term of the cost function; it gives the downhill direction.

Update the parameters w and b, then repeat both steps until the algorithm converges to a local minimum, where w and b no longer change much with each additional step.
Gradient descent must be implemented with a simultaneous update: compute tmp_w and tmp_b using the old values of w and b, then assign both. In particular, the value of w used when computing tmp_b is the old value, not tmp_w.
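A minimal sketch of one gradient-descent step with a simultaneous update, assuming the squared error cost of linear regression (function and variable names are illustrative):

```python
import numpy as np

def gradient_descent_step(x, y, w, b, alpha):
    """One step of gradient descent with a simultaneous update of w and b."""
    m = x.shape[0]
    err = (w * x + b) - y          # f_{w,b}(x^(i)) - y^(i) for every example
    dj_dw = np.dot(err, x) / m     # ∂J/∂w
    dj_db = np.sum(err) / m        # ∂J/∂b
    tmp_w = w - alpha * dj_dw      # both tmp values use the OLD w and b
    tmp_b = b - alpha * dj_db
    return tmp_w, tmp_b            # assign only after both are computed
```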


An Intuitive Understanding of Gradient Descent


Starting from a point to the left of the minimum, the slope (derivative) is negative, so the update increases w, moving it to the right; conversely, at a point to the right of the minimum the slope is positive, so the update decreases w, moving it to the left.


Learning Rate α

  • If the learning rate is small, gradient descent will be slow, which can take a lot of time because it takes many steps before it approaches the minimum, and each step is small.
  • If the learning rate is too large, then gradient descent may exceed the desired goal and may never reach the minimum (gradient descent may not converge, and may even diverge).

Gradient descent can reach a local minimum


Gradient descent will automatically take smaller steps when approaching the local minimum, because the derivative will automatically become smaller when approaching the local minimum, which means that the update step will also automatically become smaller, even if the learning rate α is kept at some fixed value.


Gradient Descent in Linear Regression

Combining the linear regression model, the squared error cost function, and gradient descent lets us train a linear regression model to fit a straight line to our training data.


Derivation of the formulas for the two partial derivatives:

\frac{∂}{∂w} J(w, b) = \frac{∂}{∂w} \frac{1}{2m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})^2 = \frac{∂}{∂w} \frac{1}{2m} \sum_{i=1}^m (wx^{(i)} + b - y^{(i)})^2 = \frac{1}{2m} \sum_{i=1}^m 2(wx^{(i)} + b - y^{(i)}) x^{(i)} = \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)}

The same steps, without the factor x^{(i)}, give:

\frac{∂}{∂b} J(w, b) = \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})

The final gradient descent algorithm for linear regression repeats the following updates (simultaneously) until convergence:

w = w - α \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)}
b = b - α \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})

In general, a cost function can have multiple local minima, and gradient descent can end up at different local minima depending on where it starts.


But the squared error cost function of a linear regression model does not have multiple local minima: it is a convex function, so it has a single global minimum.


Running gradient descent

Batch Gradient Descent
At each step of gradient descent, the update uses all the training examples, not just a subset of the training data.
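Putting the pieces together, a sketch of batch gradient descent for univariate linear regression (the learning rate, iteration count, and data are illustrative choices):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent: every update uses all m training examples."""
    w, b = 0.0, 0.0                               # common initialization
    m = x.shape[0]
    for _ in range(num_iters):
        err = (w * x + b) - y
        w, b = (w - alpha * np.dot(err, x) / m,   # simultaneous update:
                b - alpha * np.sum(err) / m)      # both use the old w and b
    return w, b

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
print(gradient_descent(x_train, y_train, alpha=0.1, num_iters=10000))
# approaches w ≈ 200, b ≈ 100, the line that fits this data exactly
```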

Multiple Features

Multiple Linear Regression

Multiple features means the model takes more than one input variable. Taking house-price prediction as an example: the earlier linear regression model had a single feature x, the size of the house, but one feature alone may not predict the price very accurately. So besides the size of the house, we also use the number of bedrooms, the number of floors, and the age of the house as features.

x_j: the j-th feature (the j-th column)
n: the number of features
\vec{x}^{(i)}: the i-th training example (the i-th row of data)
x_j^{(i)}: the value of feature j in the i-th training example


Definition of a model with multiple features:

Model:
Previously: f_{w,b}(x) = wx + b
With four features: f_{w,b}(x) = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + b

Example: f_{w,b}(x) = 0.1x_1 + 4x_2 + 10x_3 + (-2)x_4 + 80
x_1: size, x_2: bedrooms, x_3: floors, x_4: years, 80: base price

In general: f_{w,b}(x) = w_1x_1 + w_2x_2 + ··· + w_nx_n + b


A linear model with multiple input features is called multiple linear regression.

Vectorization: the Dot Product

Written with a dot product, the model becomes f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b.

Vectorization

NumPy: a numerical linear algebra library for Python

Advantages of using vectorized calculations:

  1. More concise code
  2. Faster execution

The np.dot(w, x) function can use parallel hardware on the computer, which is why it runs faster than an explicit loop (see the sketch below).
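A sketch contrasting the explicit loop with the vectorized dot product (the arrays are illustrative; absolute timings vary by machine):

```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])
b = 4.0
x = np.array([10.0, 20.0, 30.0])

# Without vectorization: loop over the n features one at a time
f = 0.0
for j in range(w.shape[0]):
    f += w[j] * x[j]
f += b

# Vectorized: one dot product, which NumPy can run on parallel hardware
f_vec = np.dot(w, x) + b

print(f, f_vec)   # both: 1*10 + 2.5*20 + (-3.3)*30 + 4 = -35.0
```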


Gradient Descent Method for Multiple Linear Regression

With n features, gradient descent updates every parameter simultaneously:

w_j = w_j - α \frac{1}{m} \sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) x_j^{(i)} \quad (j = 1, …, n)
b = b - α \frac{1}{m} \sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})
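A vectorized sketch of one such update, with X stored as an (m, n) matrix (the function name and shapes are illustrative):

```python
import numpy as np

def gradient_step_multi(X, y, w, b, alpha):
    """One gradient-descent step for multiple linear regression.

    X: (m, n) training examples, y: (m,) targets,
    w: (n,) parameter vector, b: scalar bias.
    """
    m = X.shape[0]
    err = X @ w + b - y           # predictions minus targets, shape (m,)
    dj_dw = X.T @ err / m         # all n partial derivatives ∂J/∂w_j at once
    dj_db = np.sum(err) / m
    return w - alpha * dj_dw, b - alpha * dj_db
```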

Feature Scaling

When a feature's range of possible values is large, a good model tends to learn a relatively small parameter value for it; conversely, when a feature's possible values are small, a reasonable parameter value for it will be relatively large.

The role of feature scaling:
When faced with many features, ensuring that the features have similar scales (dimensionless ranges) helps gradient descent converge faster.


1. Max normalization

This method simply divides each feature by its maximum value:
Scaling range of x1: 300/2000 to 2000/2000, i.e. 0.15 to 1
Scaling range of x2: 0/5 to 5/5, i.e. 0 to 1


2. Mean Normalization

Mean normalization produces features centered around zero.
Steps:

  1. Find the mean μ1 of x1 over the training set (in this example, μ1 = 600).
  2. x_1 := \frac{x_1 - \mu_1}{max - min}, so x_{1,min} = \frac{300 - 600}{2000 - 300} = -0.18 and x_{1,max} = \frac{2000 - 600}{2000 - 300} = 0.82, giving -0.18 ≤ x_1 ≤ 0.82; x1 is now mean-normalized.
  3. Find the mean μ2 of x2 over the training set (in this example, μ2 = 2.3).
  4. x_2 := \frac{x_2 - \mu_2}{max - min}, so x_{2,min} = \frac{0 - 2.3}{5 - 0} = -0.46 and x_{2,max} = \frac{5 - 2.3}{5 - 0} = 0.54, giving -0.46 ≤ x_2 ≤ 0.54; x2 is now mean-normalized.

3. Z-score Normalization

This method uses the standard deviation σ of each feature: first compute the mean and the standard deviation.
Steps:

  1. Suppose the standard deviation of x1 is σ1 = 450 and its mean is μ1 = 600.
  2. x_1 := \frac{x_1 - \mu_1}{\sigma_1}, so x_{1,min} = \frac{300 - 600}{450} ≈ -0.67 and x_{1,max} = \frac{2000 - 600}{450} ≈ 3.1, giving -0.67 ≤ x_1 ≤ 3.1.
  3. Suppose the standard deviation of x2 is σ2 = 1.4 and its mean is μ2 = 2.3.
  4. x_2 := \frac{x_2 - \mu_2}{\sigma_2}, so x_{2,min} = \frac{0 - 2.3}{1.4} ≈ -1.6 and x_{2,max} = \frac{5 - 2.3}{1.4} ≈ 1.9, giving -1.6 ≤ x_2 ≤ 1.9.

Here σ is the standard deviation in the sense of a Gaussian (normal) distribution.
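A sketch of the three scaling methods applied per feature with NumPy broadcasting (the data matrix is illustrative, built from the example ranges above):

```python
import numpy as np

X = np.array([[2000.0, 5.0],     # columns: x1 (size), x2 (bedrooms)
              [1200.0, 3.0],
              [ 300.0, 0.0]])

# 1. Max normalization: divide each feature by its maximum value
X_max = X / X.max(axis=0)

# 2. Mean normalization: (x - mean) / (max - min), centered around zero
mu = X.mean(axis=0)
X_mean = (X - mu) / (X.max(axis=0) - X.min(axis=0))

# 3. Z-score normalization: (x - mean) / standard deviation
sigma = X.std(axis=0)
X_z = (X - mu) / sigma
```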

Check if gradient descent has converged

Make sure gradient descent is working

1. Plot a learning curve to see whether gradient descent is converging and to help decide when to stop training a particular model.

The learning curve shows the value of the cost function after each iteration of gradient descent. After each iteration, the value of the cost function should decrease; if it increases after some iteration, the learning rate α was chosen poorly (usually too large) or there is a bug in the code.

2. Automatic convergence test

  1. Let ϵ (epsilon) denote a very small number, such as 0.001.
  2. If the value of the cost function drops by less than ϵ in one iteration, the curve is likely in the flat part of the learning curve; convergence can be declared, indicating that the parameters w and b are close to minimizing the cost function.
  3. But choosing a good threshold ϵ is usually quite hard, so in practice the learning curve above is preferred over the automatic convergence test; a sketch combining both appears below.
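A sketch that records the learning curve and applies the ϵ test (epsilon, alpha, and the stopping logic are illustrative):

```python
import numpy as np

def train_with_curve(X, y, alpha=0.01, num_iters=1000, epsilon=1e-3):
    """Track J per iteration; stop early once the drop falls below epsilon."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = []
    for _ in range(num_iters):
        err = X @ w + b - y
        w -= alpha * (X.T @ err) / m
        b -= alpha * np.sum(err) / m
        cost = np.sum((X @ w + b - y) ** 2) / (2 * m)
        # Automatic convergence test: the cost barely decreased
        if history and history[-1] - cost < epsilon:
            history.append(cost)
            break
        history.append(cost)
    return w, b, history   # plot history against iteration number
```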

Learning Rate Selection

With a sufficiently small learning rate, the cost function should decrease after every iteration.

  • If the learning rate is too small, gradient descent takes many iterations to converge.
  • If the learning rate is too large, an update may overshoot the minimum, and the cost can increase.

Feature Engineering

Use intuition to design new features by transforming or combining the original features.

Depending on the insight you have into the problem, defining new features, rather than relying only on the features you started with, can sometimes give you a better model.
f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + b
If area = frontage × depth, define a new feature x_3 = x_1 \cdot x_2:
f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + w_3x_3 + b
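A sketch of engineering that feature from illustrative frontage and depth columns:

```python
import numpy as np

frontage = np.array([40.0, 25.0, 60.0])   # x1
depth    = np.array([30.0, 50.0, 20.0])   # x2
area     = frontage * depth               # new feature x3 = x1 * x2

X = np.column_stack([frontage, depth, area])   # (m, 3) design matrix
```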

Polynomial Regression

Using the ideas of multiple linear regression and feature engineering, polynomial regression fits a curve to the data.


  • Regression analysis that models a dependent variable as a polynomial in one or more independent variables is called polynomial regression (Polynomial Regression).
  • With a single independent variable it is called univariate polynomial regression; with several independent variables, multivariate polynomial regression.
  • In univariate regression analysis, if the relationship between y and x is nonlinear and no appropriate function curve can be found to fit it, univariate polynomial regression can be used (see the sketch below).
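A sketch of polynomial features for a single input x (degree 3 is an illustrative choice); because x, x², and x³ have very different ranges, feature scaling matters here:

```python
import numpy as np

x = np.arange(1.0, 21.0)                 # 20 illustrative input values
X = np.column_stack([x, x**2, x**3])     # features: x, x^2, x^3

# Z-score scale each column so gradient descent converges well
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```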



Classification

Motivation

Why classification is needed:
Using linear regression to predict whether a tumor is malignant leads to obvious errors.

Linear regression predicts a number from an infinite range; classification predicts one of a small set of possible values or categories.

This kind of classification problem with only two possible outputs is called binary classification (Binary Classification). Binary here means that there are only two possible classes or two possible categories in these problems.

| Question | Answer "y" |
| --- | --- |
| Is this email spam? | no / yes |
| Is the transaction fraudulent? | no / yes |
| Is the tumor malignant? | no / yes |

The answer "y" can only be one of two values: no or yes, false or true, 0 or 1.

class = category

Logistic regression avoids the failure mode above: although "regression" appears in its name, logistic regression is used for classification.

Logistic Regression

Sigmoid Function (Logistic Function)

Logistic regression can fit fairly complex data.
The output range is 0 to 1:
g(z) = \frac{1}{1 + e^{-z}}, \quad 0 < g(z) < 1


Logistic regression:
f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}
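A sketch of the sigmoid and the logistic model (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), always strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """f_{w,b}(x) = g(w·x + b): the estimated probability that y = 1."""
    return sigmoid(X @ w + b)
```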


Decision Boundary

A threshold can be set (0.5 is a common choice): when the model's output is above the threshold, predict ŷ = 1; when it is below, predict ŷ = 0.


Cost Function for Logistic Regression

A cost function measures how well a particular set of parameters fits the training data, and so gives us a way to choose better parameters.
The squared error cost function, the most common choice for linear regression, is not ideal for logistic regression: with the sigmoid model it yields a non-convex cost with many local minima.

The loss L measures how well the learning algorithm's prediction does on a single training example:

Loss function for logistic regression

When y^{(i)} = 1, the loss is -\log(f_{\vec{w},b}(\vec{x}^{(i)})): the loss is lowest when the model predicts a value close to 1, which pushes the algorithm toward accurate predictions.

When y^{(i)} = 0, the loss is -\log(1 - f_{\vec{w},b}(\vec{x}^{(i)})): the farther f_{\vec{w},b}(\vec{x}^{(i)}) is from y^{(i)}, the greater the loss.

In summary, the loss for logistic regression is -\log(f_{\vec{w},b}(\vec{x}^{(i)})) when y^{(i)} = 1 and -\log(1 - f_{\vec{w},b}(\vec{x}^{(i)})) when y^{(i)} = 0. With this choice of loss, the total cost function is convex, so gradient descent can reliably reach the global minimum.

A simplified version of the cost function for logistic regression

The two cases can be combined into a single expression that reduces to the original one whether y^{(i)} is 1 or 0:

L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = -y^{(i)} \log(f_{\vec{w},b}(\vec{x}^{(i)})) - (1 - y^{(i)}) \log(1 - f_{\vec{w},b}(\vec{x}^{(i)}))

The cost function J is the average loss over the entire training set of m examples:

J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^m L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)})

This particular cost function is derived in statistics from the principle of maximum likelihood estimation.
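A sketch of that cost in NumPy (the small clip that keeps log() away from 0 is an illustrative numerical safeguard, not part of the course formula):

```python
import numpy as np

def logistic_cost(X, y, w, b):
    """Average logistic loss J(w, b) over all m training examples."""
    m = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
    f = np.clip(f, 1e-12, 1 - 1e-12)         # avoid log(0)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)
    return np.sum(loss) / m
```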

Gradient Descent Implementation


Because the model f is different for linear regression and logistic regression (a linear function vs. the sigmoid), the gradient descent update rules look the same when written out, but they are in fact two entirely different algorithms.

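A sketch of one logistic-regression gradient step, showing that only the model line differs from the linear regression version (names are illustrative):

```python
import numpy as np

def logistic_gradient_step(X, y, w, b, alpha):
    """One gradient-descent step for logistic regression."""
    m = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid model, not w·x + b itself
    err = f - y
    w = w - alpha * (X.T @ err) / m
    b = b - alpha * np.sum(err) / m
    return w, b
```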

The Problem of Overfitting

An overfit model does well on the training set but does not generalize well to new examples.

Regression example:
The model on the right overfits (high variance), the one on the left underfits (high bias), and the one in the middle is just right (it generalizes well).

Classification example:


Addressing Overfitting

Ways to address overfitting:

  1. Collect more training examples (more data)
  2. Feature selection: choose which features to include or exclude
  3. Regularization: reduce the size of the parameters
    • Regularization is a gentler way to reduce the influence of some features.
    • It encourages the learning algorithm to shrink parameter values without necessarily forcing them to exactly zero.
    • Regularization keeps all the features but prevents any single feature from having an outsized influence.


Cost Function with Regularization


Regularization parameter: λ (the regularization parameter), with λ > 0

Cost function for regularized linear regression:

J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^n w_j^2

The regularization parameter λ in two extreme cases:

  1. If λ = 0, no regularization term is used, and the model can end up overfitting.
  2. If λ is huge, for example λ = 10^{10}, the only way to minimize the cost is to make every w_j very close to 0; then f(x) ≈ b, which leads to underfitting.
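A sketch of the regularized cost (lam is an illustrative value of λ):

```python
import numpy as np

def regularized_cost(X, y, w, b, lam=1.0):
    """Squared error cost plus an L2 penalty on w (b is not regularized)."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * m) + lam * np.sum(w ** 2) / (2 * m)
```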

Regularized Linear Regression

Implementing Gradient Descent for Regularized Linear Regression

With the regularization term, the gradient descent updates become:

w_j = w_j - α \left[ \frac{1}{m} \sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} w_j \right]
b = b - α \frac{1}{m} \sum_{i=1}^m (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})

(b is not regularized.)
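A sketch of one regularized gradient step following those update rules:

```python
import numpy as np

def regularized_gradient_step(X, y, w, b, alpha, lam):
    """One gradient step for regularized linear regression (b unregularized)."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = X.T @ err / m + (lam / m) * w   # extra (λ/m)·w_j term from the penalty
    dj_db = np.sum(err) / m
    return w - alpha * dj_dw, b - alpha * dj_db
```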

Regularized Logistic Regression

The same idea applies to logistic regression: add \frac{\lambda}{2m} \sum_{j=1}^n w_j^2 to the logistic cost function. The gradient updates gain the same \frac{\lambda}{m} w_j term, with f_{\vec{w},b} now being the sigmoid model.

Source: blog.csdn.net/qq_41286942/article/details/125688584