Notes on Coursera's Machine Learning by Andrew Ng, Week 1

What is Machine Learning?

Arthur Samuel described it as: “the field of study that gives computers the ability to learn without being explicitly programmed.” This is an older, informal definition.

Tom Mitchell provides a more modern definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Example: playing checkers.

E = the experience of playing many games of checkers.

T = the task of playing checkers.

P = the probability that the program will win the next game.

Machine learning algorithms:

  • Supervised learning

In supervised learning, every example in the data set comes with the correct answer (the training set), and the algorithm makes its predictions based on these labeled examples.

Supervised problems fall into two main categories: regression (回归) and classification (分类).

Regression: the inputs are mapped to a continuous function, so the output is a continuous value.

Classification: the inputs are mapped into discrete categories.

  • Unsupervised learning

Unsupervised learning lets us tackle problems even when we have little or no idea what the results should look like: we can derive structure from the data, for example by clustering it based on relationships among the variables.

Gradient Descent

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

Use simultaneous updates: $\theta_0$ and $\theta_1$ must be updated at the same time, not $\theta_0$ first and then $\theta_1$. A sequential update might happen to produce the right answer, but gradient descent is defined in terms of simultaneous updates.

$\alpha$ is the learning rate; it controls how large a step we take when updating $\theta_j$.
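Below is a minimal NumPy sketch of these simultaneous updates for univariate linear regression; the function name and the toy data are illustrative, not from the course.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=5000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x  # predictions for all m examples
        # Compute both gradients from the current parameters first...
        grad0 = (1.0 / m) * np.sum(h - y)
        grad1 = (1.0 / m) * np.sum((h - y) * x)
        # ...then update theta0 and theta1 simultaneously.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data: y is roughly 2x + 1, so the fit should approach (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.0, 4.9, 7.2, 9.1])
print(gradient_descent(x, y))
```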

Multivariate Linear Regression

Feature Scaling:

Get every feature into approximately a $-1 \le x_i \le 1$ range.

Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.

Mean normalization:

involves subtracting the average value for an input variable from each of its values, resulting in a new average value for the input variable of just zero.

To implement both of these techniques, adjust your input values as shown in this formula:

$$x_i := \frac{x_i - \mu_i}{s_i}$$

where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values (max − min); alternatively, $s_i$ can be the standard deviation.
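A small NumPy sketch of this formula (the function name mean_normalize and the data are illustrative):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature (column) to roughly [-1, 1]: x := (x - mu) / s."""
    mu = X.mean(axis=0)                # per-feature average
    s = X.max(axis=0) - X.min(axis=0)  # per-feature range; X.std(axis=0) also works
    return (X - mu) / s, mu, s         # keep mu and s to scale future inputs

# Example: house sizes and bedroom counts on very different scales.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_norm, mu, s = mean_normalize(X)
print(X_norm)  # each column now has mean 0 and range 1
```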

Polynomial Regression

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
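For instance, to fit $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$, we can simply add new features equal to the powers of $x$; a quick sketch with made-up data:

```python
import numpy as np

# Turn a single feature x into polynomial features [x, x^2, x^3].
x = np.array([1.0, 2.0, 3.0, 4.0])
X_poly = np.column_stack([x, x**2, x**3])
print(X_poly)
```

One caveat from the course: after this transformation, feature scaling becomes very important, since $x^3$ spans a much larger range than $x$.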

Normal Equation

The normal equation solves for $\theta$ analytically in one step:

$$\theta = (X^T X)^{-1} X^T y$$

There is no need to do feature scaling with the normal equation.

The following is a comparison of gradient descent and the normal equation:

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate inverse of $X^T X$ |
| Works well when n is large | Slow if n is very large |

With the normal equation, computing the inversion has complexity O(n3). So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.
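A NumPy sketch of the normal equation (the toy data is made up; np.linalg.solve is used rather than explicitly inverting $X^T X$, which is cheaper and numerically more stable):

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^(-1) X^T y in closed form."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Design matrix with a leading column of ones for the intercept term x0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(normal_equation(X, y))  # y = 1 + 2x exactly, so this prints ~[1, 2]
```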

Classification

For binary classification, where $y \in \{0, 1\}$, we use logistic regression. The hypothesis is

$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}},$$

where $g$ is the sigmoid (logistic) function, so $0 \le h_\theta(x) \le 1$ and $h_\theta(x)$ can be read as the estimated probability that $y = 1$ on input $x$.

Decision Boundary

We predict $y = 1$ whenever $h_\theta(x) \ge 0.5$, i.e. whenever $\theta^T x \ge 0$. The set of points where $\theta^T x = 0$ is the decision boundary; it is a property of the hypothesis and its parameters, not of the training set.

Logistic regression cost function

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

This cost is 0 when the prediction matches the label and grows without bound as the prediction approaches the wrong label; unlike the squared-error cost, it keeps $J(\theta)$ convex for logistic regression.

Simplified Cost Function

The two cases combine into a single expression:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right]$$

Gradient Descent

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) x_j^{(i)}$$

This update rule is identical in form to the one for linear regression; only the definition of $h_\theta(x)$ has changed.
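A self-contained NumPy sketch of this update, vectorized over all $\theta_j$ (names and defaults are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, num_iters=5000):
    """Batch gradient descent for logistic regression.

    X is m x (n+1) with a leading column of ones; y holds 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                  # only h differs from linear regression
        theta -= (alpha / m) * (X.T @ (h - y))  # simultaneous update of all theta_j
    return theta
```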

Advanced Optimization

Instead of gradient descent, more sophisticated algorithms such as conjugate gradient, BFGS, and L-BFGS can be used to minimize $J(\theta)$. They choose the learning rate automatically and often converge faster, at the cost of being more complex; in the course they are used through Octave's fminunc, which only needs a function that returns $J(\theta)$ and its gradient.
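A rough Python equivalent of the fminunc workflow, using scipy.optimize.minimize (the data is a made-up toy example):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y):
    """Return J(theta) and its gradient, as the optimizer expects."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    return J, (1.0 / m) * (X.T @ (h - y))

# Toy, non-separable data: a column of ones for the intercept, one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])

# jac=True tells scipy the function returns both the cost and its gradient,
# so BFGS can use the analytic gradient instead of finite differences.
result = minimize(cost_and_gradient, np.zeros(X.shape[1]),
                  args=(X, y), method='BFGS', jac=True)
print(result.x)  # the learned theta
```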

Multiclass Classification: One-vs-all

To handle more than two classes, $y \in \{0, 1, \ldots, n\}$, train a separate logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$, treating class $i$ as positive and all other classes as negative. To classify a new input $x$, pick the class whose classifier gives the highest $h_\theta^{(i)}(x)$.
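A sketch of the prediction step, assuming each row of all_theta was already fitted separately (e.g. with the optimizer above on relabeled targets y == i); the names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all_predict(all_theta, X):
    """Predict a class for each row of X.

    all_theta is K x (n+1): row i holds the parameters of the logistic
    regression classifier trained to separate class i from the rest.
    """
    probs = sigmoid(X @ all_theta.T)  # m x K matrix; column i is h^(i)(x)
    return np.argmax(probs, axis=1)   # most confident classifier per example
```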

The problem of overfitting

Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. At the other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

There are two main options to address overfitting:

  1. Reduce the number of features:
  • Manually select which features to keep.
  • Use a model selection algorithm (studied later in the course).
  2. Regularization:
  • Keep all the features, but reduce the magnitude of parameters $\theta_j$.
  • Regularization works well when we have a lot of slightly useful features.

Cost function for regularization

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

The regularization parameter $\lambda$ controls the trade-off: if it is too large, all $\theta_j$ shrink toward zero and the model underfits. By convention the penalty does not include $\theta_0$.

Regularized Linear Regression

Gradient Descent

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) x_0^{(i)}$$

$$\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) x_j^{(i)} \qquad (j \ge 1)$$

The factor $1 - \alpha \lambda / m$ is slightly less than 1, so every update shrinks $\theta_j$ a little before taking the usual gradient step.

Normal Equation

$$\theta = \bigl( X^T X + \lambda L \bigr)^{-1} X^T y, \qquad L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$

When $\lambda > 0$, the matrix $X^T X + \lambda L$ is always invertible, even if $X^T X$ itself is not (for example when $m \le n$).
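A NumPy sketch of this formula (the function name is illustrative; np.linalg.solve avoids forming the explicit inverse):

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * L)^(-1) X^T y, with L the identity
    matrix except L[0, 0] = 0, so the intercept theta0 is not penalized."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```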

Regularized Logistic Regression

The cost function is the logistic regression cost plus the same penalty term (again excluding $\theta_0$):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
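A sketch of the regularized cost and gradient, highlighting that $\theta_0$ is skipped (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost_and_gradient(theta, X, y, lam):
    """Regularized logistic regression; the penalty starts at j = 1."""
    m = len(y)
    h = sigmoid(X @ theta)
    penalty = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    J = -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) + penalty
    grad = (1.0 / m) * (X.T @ (h - y))
    grad[1:] += (lam / m) * theta[1:]  # no regularization term for theta_0
    return J, grad
```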

Reposted from blog.csdn.net/qq_35564813/article/details/104226610