Notes - Machine Learning Foundations and Linear Regression

Source: https://www.dazhuanlan.com/2019/08/26/5d62fe3eea881/



Octave for Microsoft Windows

  • Software mainly used for numerical analysis; a good choice for machine learning beginners
  • Download: Octave official website

  • download link

  • After downloading, run the installer and keep clicking Next to complete the installation

PS: any version can be downloaded, but do not download Octave 4.0.0, since this version has a major bug

Supervised Learning

  • Definition: the machine learns, from pre-labeled training examples ( inputs and their expected outputs ), a function that can predict the output for any new input
  • The output of the learned function falls into one of two categories:

    • Regression analysis (Regression): outputs a continuous value, for example: a price
    • Classification (Classification): outputs a class label, for example: yes or no

Unsupervised Learning

  • Definition: without pre-labeled training examples, the input data is automatically classified or grouped
  • Commonly used for clustering; there are two types of applications:

    • Clustering (Clustering): partitions the data set into several subsets of samples that are usually disjoint, for example: grouping news articles into different categories
    • Non-clustering (Non-clustering): for example the cocktail party algorithm, which separates the useful signal from the noise in the data, and can be used in speech recognition

Linear regression algorithm (Linear Regression)

  • hθ(x) = θ₀ + θ₁x₁ : the linear regression hypothesis
  • m : the number of training examples
  • x⁽ⁱ⁾ : the i-th training example

The cost function (Cost Function)

  • Computing the cost function ( the formula and an Octave implementation are shown below )
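The quantity being computed is the squared-error cost over all m training examples (the 1/2 factor simplifies the derivative used later in gradient descent):

J(θ₀, θ₁) = 1 / (2m) · Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²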
function J = costFunctionJ(X, y, theta)
% Compute the linear regression cost J for data (X, y) and parameters theta

m = length(y);                        % number of training examples

predictions = X * theta;              % hypothesis h_theta(x) for every example
sqrErrors = (predictions - y) .^ 2;   % squared errors

J = 1 / (2 * m) * sum(sqrErrors);

end
  • The cost function, implemented in Octave
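A quick usage sketch (the data values are made up for illustration; the first column of X is the x₀ = 1 intercept term):

X = [1 1; 1 2; 1 3];             % three training examples with a leading column of ones
y = [1; 2; 3];
theta = [0; 1];                  % theta0 = 0, theta1 = 1 fits this data exactly
J = costFunctionJ(X, y, theta)   % prints 0, since every prediction is exact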

  • Find the minimum of the cost function in order to determine θ₀ and θ₁

Using Gradient Descent to Minimize the Function J

  1. Initialize θ₀, θ₁ ( θ₀ = 0, θ₁ = 0, or any other values )
  2. Keep changing θ₀, θ₁ until a minimum is found, which may only be a local minimum

  • Gradient descent update rule ( written out below ): repeat the update until convergence; θ₀ and θ₁ must be updated simultaneously
  • The expression after α is simply the derivative ( the slope of the tangent line at a point )
  • α is the learning rate
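The update rule referred to above, repeated until convergence and applied simultaneously for j = 0 and j = 1:

θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)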

  • Correct algorithm: compute both new values first, then assign them ( simultaneous update )

  • Incorrect algorithm: the parameters are not updated simultaneously ( see the sketch below )
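A minimal sketch of the simultaneous update for the two-parameter case; x, y, m, alpha, theta0 and theta1 are assumed to be defined already:

% Correct: both temporary values are computed from the OLD theta0 and theta1
h = theta0 + theta1 .* x;                              % current hypothesis for every example
temp0 = theta0 - alpha * (1 / m) * sum(h - y);
temp1 = theta1 - alpha * (1 / m) * sum((h - y) .* x);
theta0 = temp0;
theta1 = temp1;

% Incorrect: overwriting theta0 before computing theta1
% would make theta1 use the new theta0 instead of the old one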
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
% Run num_iters steps of gradient descent, recording the cost at every step

m = length(y);                        % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % Vectorized gradient: (1/m) * X' * (X*theta - y)
    delta = 1 / m * (X' * X * theta - X' * y);
    theta = theta - alpha .* delta;   % simultaneous update of all parameters

    J_history(iter) = computeCost(X, y, theta);   % track convergence

end

end
  • The gradient descent function, implemented in Octave; computeCost is a cost function like the costFunctionJ shown earlier
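A usage sketch with made-up data (it assumes a computeCost function with the same body as costFunctionJ above):

X = [1 1; 1 2; 1 3];          % leading column of ones for theta0
y = [1; 2; 3];
theta = zeros(2, 1);
[theta, J_history] = gradientDescent(X, y, theta, 0.1, 1500);
% theta approaches [0; 1], and J_history should be non-increasing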

Learning Rate α

  • α is the learning rate; it controls how large a step is taken when updating θ₀, θ₁
  • The best way to decide α is to adjust it according to the absolute value of the derivative: the larger the absolute value of the derivative, the larger α can be
  • α can be tried starting from 0.001, multiplying by roughly 3 each time ( see the sketch after this list )
  • If α is too small: convergence is very slow
  • If α is too large: the cost function may fail to decrease, or may even fail to converge
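A small sketch of scanning learning rates as suggested above; X, y and the gradientDescent function from earlier are assumed to be available:

% Compare cost curves for learning rates that grow by roughly 3x each time
alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];
num_iters = 100;
for k = 1:length(alphas)
    theta_init = zeros(size(X, 2), 1);
    [theta_k, J_history] = gradientDescent(X, y, theta_init, alphas(k), num_iters);
    plot(1:num_iters, J_history);   % a curve that drops quickly and levels off marks a good alpha
    hold on;
end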

Combining Gradient Descent with the Cost Function

  • Substitute the cost function into the gradient descent update rule ( written out below )
  • Using every training sample in each update while searching for θ₀, θ₁ is called batch gradient descent in machine learning
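Substituting J into the update rule gives the batch gradient descent updates, both applied simultaneously on every iteration:

θ₀ := θ₀ − α · (1/m) · Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )
θ₁ := θ₁ − α · (1/m) · Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · x⁽ⁱ⁾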

Linear Regression with Multiple Variables

  • hθ(x) = θ₀x₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ : the multivariate linear regression hypothesis, with x₀ = 1
  • n : the number of features
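With x₀ = 1, the same hypothesis can be written in vectorized form, which is what the Octave code above relies on:

hθ(x) = θᵀx   ( in Octave: predictions = X * theta )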

Solving Multivariate Linear Regression with Gradient Descent

  • Compared with single-variable linear regression, the only difference is the extra xⱼ factor at the end ( see the update rule below )

  • Written out separately for each θⱼ
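The multivariate update rule the bullets above refer to, repeated until convergence and applied simultaneously for j = 0, 1, ..., n:

θⱼ := θⱼ − α · (1/m) · Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · xⱼ⁽ⁱ⁾   ( with x₀⁽ⁱ⁾ = 1 )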

Feature Scaling and Mean Normalization

  • Purpose: speed up gradient descent, because features with very different value ranges make gradient descent converge slowly

  • sᵢ : the scaling factor, usually the range of the feature's values ( the Octave function below uses the standard deviation instead )
  • μᵢ : the value subtracted for mean normalization, usually the average of the feature's values
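The normalization applied to each feature i, and implemented by the function below:

xᵢ := ( xᵢ − μᵢ ) / sᵢ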
function [X_norm, mu, sigma] = featureNormalize(X)
% Normalize every column of X to zero mean and unit standard deviation

mu = mean(X);        % column-wise means (mu_i)
sigma = std(X);      % column-wise standard deviations, used as the scaling factor s_i

X_norm = X;
for i = 1:size(X, 2)
    X_norm(:, i) = (X(:, i) - mu(i)) ./ sigma(i);
end

end
  • The feature scaling and mean normalization function, implemented in Octave
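A usage sketch with made-up house data ( size and number of bedrooms ); the column of ones for x₀ should be added after normalizing:

X = [2104 3; 1600 3; 2400 4];                    % made-up examples: size, bedrooms
[X_norm, mu, sigma] = featureNormalize(X);
X_norm = [ones(size(X_norm, 1), 1), X_norm];     % add the x0 = 1 column after scaling
% When predicting on new data, reuse the same mu and sigma returned here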

Polynomial Regression

  • Several related features can be combined into a new feature, for example: combining the length and width of a house into its area
  • If a linear ( straight-line ) function does not fit the data well, a quadratic, cubic or square-root function ( or any other form ) can be used instead ( see the sketch below )
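A minimal sketch of turning a single feature x ( assumed to be an existing column vector ) into polynomial features, after which everything proceeds as multivariate linear regression; feature scaling matters here because x, x² and x³ have very different ranges:

m = length(x);
X_poly = [ones(m, 1), x, x.^2, x.^3];               % hypothesis: theta0 + theta1*x + theta2*x^2 + theta3*x^3
[X_scaled, mu, sigma] = featureNormalize(X_poly(:, 2:end));
X_poly = [ones(m, 1), X_scaled];                    % re-attach the x0 = 1 column after scaling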

The Normal Equation

X = the matrix of feature values
y = the vector of results

  • Formula : θ = (XᵀX)⁻¹Xᵀy
  • Octave : pinv(X'*X)*X'*y
function [theta] = normalEqn(X, y)
% Solve for theta in closed form using the normal equation

theta = pinv(X' * X) * X' * y;

end
  • The normal equation function, implemented in Octave
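A usage sketch with made-up housing data; no feature scaling and no learning rate are needed with the normal equation:

X = [1 2104 3; 1 1600 3; 1 2400 4];   % leading column of ones, then made-up feature values
y = [400; 330; 369];                  % made-up target values
theta = normalEqn(X, y);
price = [1 1650 3] * theta;           % prediction for a new example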

In Octave we usually use pinv instead of inv, because pinv still returns a value for θ even when XᵀX is not invertible

  • Reasons why XᵀX may not be invertible:

    • Linearly dependent ( redundant ) features
    • Too many features ( m <= n ); delete some features or use regularization

Gradient Descent vs. the Normal Equation

  • Gradient descent

    • Advantages:

      • Works well even when the number of features is large
      • O(kn²)
    • Disadvantages:

      • α must be chosen
      • Requires many iterations
  • The normal equation

    • Advantages:

      • No need to choose α
      • No iteration
    • Disadvantages:

      • Computing (XᵀX)⁻¹ takes a long time when the number of features is large ( n > 10000 )
      • O(n³)

Origin: www.cnblogs.com/petewell/p/11410461.html