A Machine Learning Tutorial Even Liberal Arts Students Can Understand: Gradient Descent, Linear Regression, Logistic Regression

Source: Xinzhiyuan (新智元)

This article is about 4,200 words; recommended reading time: 10+ minutes.

This article explains machine learning in plain language, striving to be understandable even to readers without a science background.


[Introduction] Although there are plenty of machine learning courses out there, and the offerings from MIT and UC Berkeley, along with those of experts like Andrew Ng on Coursera, are absolute classics, they are aimed at professionals with a science background. This article tries to present this esoteric subject in a more accessible way, so that readers without a science background can understand it.

Making complex things simple, so that a layperson can understand them in a short time and come away with a sudden flash of insight, is a very powerful skill.

For example: you're applying for a machine learning engineer position, facing an HR interviewer with a liberal arts background. If you can get her to understand your expertise in the shortest possible time, you can greatly improve your chances in the interview.

Machine learning is so hot right now that more and more people want to get into it, yet more and more people are also left confused. Why is machine learning so hard for the public to understand? What on earth are those mysterious, jawbreaking concepts, such as logistic regression and gradient descent?

A 23-year-old pharmacology major said that when he attended a training course on machine learning, he felt like the grandmother at home who doesn't understand modern technology.

Then a graduate named Audrey Lorberfeld set out to bridge the divide between the general public and machine learning herself. Hence this series of articles.

This first installment in the series covers:

  • Gradient descent
  • Linear regression
  • Logistic regression

Algorithms vs. models

Before we begin to understand machine learning, we need to get to know two basic concepts: algorithms and models.

We can think of a model as a vending machine: input (money), output (a Coke). The algorithm is what's used to train the model.

Based on a given input, the model makes the corresponding decisions to achieve the expected output. For example, an algorithm takes the amount of money inserted and the unit price of a Coke, determines whether the money is enough, and works out how much change to return if there is more than enough.

In short, the algorithm is the life force behind the model. Without a model, an algorithm is just a mathematical equation. Different models use different algorithms, depending on the purpose.

Gradient descent / best-fit line

(Although gradient descent is not considered a machine learning algorithm in the traditional sense, understanding it is critical to knowing how many machine learning algorithms work and how they are optimized.) Gradient descent helps us make the most accurate predictions possible based on some data.

For example: you have a big list recording the height and weight of everyone you know. You then turn this data into the chart below:

[Figure: scatter plot of everyone's height and weight]

Do the numbers in the figure above look a bit odd? Don't worry about those details.

Now suppose your neighborhood is holding a guessing game based on height and weight, and the winner gets a red envelope (a cash prize). All you have is this chart. What would you do?

You'd probably want to draw a line on the chart, one that captures the perfect correspondence between height and weight.

For example, according to this perfect line, a person 1.5 meters tall weighs roughly 60 kg. So how on earth do we find this perfect line? Answer: gradient descent.

[Figure: a candidate line drawn through the height-weight data]

Let's first introduce a concept called RSS (the residual sum of squares). RSS is the sum of the squared differences between the points and the line; this value measures how far the points are from the line. The goal of gradient descent is to find the minimum value of RSS.
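To make RSS concrete, here is a minimal Python sketch (the height and weight numbers below are invented for illustration, not read off the chart):

```python
# A minimal sketch of RSS; heights/weights are invented for illustration.
heights = [1.4, 1.5, 1.6, 1.7, 1.8]   # meters (hypothetical)
weights = [50, 60, 62, 70, 78]        # kilograms (hypothetical)

def rss(slope, intercept):
    """Sum of the squared gaps between each point and the line."""
    return sum((w - (slope * h + intercept)) ** 2
               for h, w in zip(heights, weights))

# Compare two candidate lines: the one with the smaller RSS fits better.
print(rss(65.0, -40.0))
print(rss(10.0, 40.0))
```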

If we visualize the RSS we get for each candidate line's parameters, we obtain something called a cost curve. The lowest point on that curve is where our RSS is at its minimum.

[Figure: gradient descent visualized (using Matplotlib); credit: the incredible data scientist Bhavesh Bhatt]

Gradient descent involves other details too, such as the "step size" and the "learning rate" (that is, how fast, and in what direction, we want to move toward the bottom of the curve).

In short, gradient descent finds the smallest possible gap between the data points and the line, giving us the best-fit line; and that best-fit line is the direct basis for our predictions.
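To make this concrete, here is a minimal sketch of gradient descent in Python, using the same invented numbers as in the RSS sketch above: at each step we compute the gradient of RSS with respect to the slope and intercept, and nudge both downhill.

```python
# Minimal gradient descent on RSS for the line: weight = slope * height + intercept.
# Same invented data as the RSS sketch above.
heights = [1.4, 1.5, 1.6, 1.7, 1.8]
weights = [50, 60, 62, 70, 78]

slope, intercept = 0.0, 0.0
learning_rate = 0.01  # how big a step we take downhill each time

for _ in range(10000):
    # Partial derivatives of RSS with respect to slope and intercept.
    d_slope = sum(-2 * h * (w - (slope * h + intercept))
                  for h, w in zip(heights, weights))
    d_intercept = sum(-2 * (w - (slope * h + intercept))
                      for h, w in zip(heights, weights))
    # Move against the gradient, i.e. downhill on the cost curve.
    slope -= learning_rate * d_slope
    intercept -= learning_rate * d_intercept

print(slope, intercept)  # parameters of the (approximately) best-fit line
```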

Linear Regression

Linear regression is a method for analyzing the strength of the relationship between one variable (the outcome variable) and one or more other variables (the independent variables).

The hallmark of linear regression, as the name implies, is that the relationship between the independent variables and the outcome variable is linear; in other words, the relationship between the variables can be connected by a straight line.

[Figure: a linear regression line drawn through data points]

This looks just like what we did above! That's because the best-fit line from before is exactly our linear "regression" line. The best-fit line shows the best linear relationship between our points, which in turn lets us make predictions.

Another key point about linear regression: the outcome variable, the variable that "varies depending on the other variables" (a bit of a mouthful, I know), is always continuous. But what does that mean?

Suppose we want to study the factors that influence rainfall in New York: the outcome variable is the rainfall, which is the relationship we care about most, and the independent variable affecting precipitation is elevation above sea level.

If the outcome variable were not continuous, it could happen that at a certain altitude there is no outcome value at all, and we couldn't make a prediction there.

Conversely, with a continuous outcome, we can make a prediction for any given altitude. That's the coolest thing about linear regression!
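If you'd like to see this in code, here is a sketch using scikit-learn (the elevation and rainfall figures are invented for illustration):

```python
# A sketch of linear regression with scikit-learn.
# The elevation/rainfall figures are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

elevation_m = np.array([[10], [50], [120], [300], [550]])  # independent variable
rainfall_mm = np.array([1200, 1150, 1040, 900, 760])       # continuous outcome

model = LinearRegression().fit(elevation_m, rainfall_mm)

# Because the outcome is continuous, we can predict at ANY elevation:
print(model.predict([[200]]))  # predicted rainfall at 200 m
```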

Ridge Regression and LASSO Regression

Now that we know what linear regression is, things get even cooler, for example, ridge regression. Before we can make sense of ridge regression, we first need to understand regularization.

Simply put, data scientists use regularization to ensure that the model pays attention only to the independent variables that have a significant impact on the outcome variable.

But will the independent variables that don't significantly affect the outcome simply be ignored by regularization? Of course not! We'll go into the reasons in detail later.

In principle, we build these models, feed them data, and then test whether the models are good enough.

If we feed in all the independent variables, relevant or not, we end up finding that the model does a fantastic job on the training data but a terrible one on the test data.

That's because the model isn't flexible enough and gets somewhat lost when confronted with new data. We call this "overfitting."

Next, let's get a feel for overfitting through a rather long-winded example.

Say you're a new mom whose baby loves noodles. Over the months, you've developed the habit of feeding him in the kitchen with the window open, because you like fresh air.

 

Then your nephew gives the baby a bib so he won't get food all over himself when he eats, and you pick up a new habit: whenever you feed the baby noodles, the bib goes on.

 

Later still, you adopt a stray dog. Every time the baby eats, the dog squats next to the high chair, waiting for the noodles the baby drops.

 

As a new mom, you naturally conclude that an open window + a bib + a dog under the high chair are the prerequisites for your baby to happily eat his noodles.

 

Until one day you go back to your parents' home for the weekend. You panic a little when you discover the kitchen has no window; then you suddenly remember that in the rush to leave you forgot the bib; and worst of all, the dog has been left with a neighbor. Oh no!

 

You're so flustered that you forget to feed the baby and just put him straight to bed. See: you perform terribly when faced with a completely new scenario, whereas at home things look entirely different.

 

After redesigning your model and filtering out all the noise (the irrelevant data), you discover that the baby simply loves the noodles you make by hand.

 

The next day, you can calmly and happily feed your baby noodles in a windowless kitchen, with no bib on him and no dog at his side.

This is exactly what regularization does in machine learning: it makes your model focus only on the useful data and ignore the distractions.

[Figure: coefficient shrinkage in LASSO vs. ridge regression]

Left: LASSO regression (you can see that the coefficients, shown as red steps, can equal zero as they cross the y-axis).

Right: ridge regression (you can see that the coefficients get close to zero but never equal it, because they never cross the y-axis).

Image credit: Prashant Gupta, "Regularization in Machine Learning"

In every kind of regularization, there is a so-called penalty factor (the Greek letter lambda: λ). The penalty factor's role in the math is to shrink the noise in the data.

In ridge regression, sometimes called "L2 regression," the penalty factor is the sum of the squared values of the variables' coefficients. The penalty factor shrinks the coefficients of the independent variables, but it never eliminates them entirely. This means that with ridge regression, the noise in your data will always be taken into account by your model to some degree.

The other kind of regularization is LASSO, or "L1," regularization. With LASSO regularization, you penalize only the high-coefficient features, rather than penalizing every feature in the data.

What's more, LASSO can shrink coefficients all the way down to zero. This essentially removes those features from the dataset, because their "weight" is now zero (i.e., they are effectively multiplied by zero).

With LASSO regression, then, the model can eliminate most of the noise in the dataset. That's extremely useful in certain situations!
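Here is a sketch of the contrast between the two, using scikit-learn's Ridge and Lasso (the data are randomly generated, with only the first feature actually driving the outcome; alpha plays the role of the penalty factor λ):

```python
# A sketch contrasting ridge (L2) and LASSO (L1) regularization.
# Randomly generated data: only the first of 5 features drives the outcome.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 5 features, but...
y = 3.0 * X[:, 0] + rng.normal(size=100)  # ...only feature 0 matters

ridge = Ridge(alpha=1.0).fit(X, y)  # alpha plays the role of lambda
lasso = Lasso(alpha=0.5).fit(X, y)

print(ridge.coef_)  # noise coefficients shrunk toward zero, never exactly zero
print(lasso.coef_)  # noise coefficients driven all the way to zero
```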

Logistic Regression

We now know that linear regression = the influence of certain variables on another variable, under two assumptions:

  • The outcome variable is continuous;
  • The relationship between the variables and the outcome variable is linear.

But what if the outcome variable is categorical rather than continuous? That's where logistic regression comes in.

A categorical variable is simply a variable that belongs to a single category. For example, every week consists of the same seven days, Monday through Sunday, so you can't make predictions by treating the days as a running count.

The first day of every week is a Monday, and what happens on Monday happens on Monday. No argument there.

A logistic regression model outputs only the probability that a data point belongs to one category or another, rather than an ordinary numeric value. This is why logistic regression models are mainly used for classification.

In the world of logistic regression, the outcome variable has a linear relationship with the log-odds of the independent variables.

  • Odds

Odds are the heart of logistic regression. An example:

A class has 19 students: 6 women and 13 men. Suppose the odds of a woman passing the exam are 5:1, while the odds of a man passing are 3:10. That means 5 of the 6 women are likely to pass the test, while only 3 of the 13 men are.

Wait, aren't odds and probability the same thing? Nope.

Probability measures the number of times an event occurs out of the total number of trials: for example, if you flip a coin 40 times and get heads 10 times, the probability of heads is 25%. Odds measure the number of times an event occurs relative to the number of times it doesn't: with the same 40 flips and 10 heads, the odds of heads are 10 heads : 30 tails.

This means that while probability is always confined to the range 0 to 1, odds can grow continuously from 0 all the way to positive infinity!
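In code, the difference between the two is one line (using the coin-flip numbers from the text):

```python
# Probability vs. odds, using the coin-flip numbers from the text.
heads, tails = 10, 30

probability = heads / (heads + tails)  # 10 / 40 = 0.25, always within 0-1
odds = heads / tails                   # 10 / 30 ~ 0.33, unbounded above

print(probability, odds)
```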

That poses a problem for our logistic regression model, because we know our expected output is a probability (i.e., a number between 0 and 1).

So how do we get from odds to probability?

Let's think about a classification problem. Say your favorite football team has played another team and won 6 games. You might say the odds of your team losing are 1:6, or 0.17.

And the odds of your team winning, because they're a great team, are 6:1, or 6. As shown below:

[Figure: the odds of your team winning (6:1) vs. losing (1:6)]

Image source: https://www.youtube.com/watch?v=ARfXDSkQf1Y

Now, you don't want your model to predict that your team will win future matches simply because their past odds of winning far exceed their past odds of losing, right?

There are more factors the model needs to consider (maybe the weather, maybe the starting lineup, and so on)! So, to make the magnitudes of the odds evenly distributed, or symmetric, we compute something called the log-odds.

  • Log-odds

[Figure: a bell curve]

What we call the "normal distribution": the classic bell curve!

Log-odds is shorthand for the natural logarithm of the odds. When you take the natural log of something, you essentially make it more normally distributed, and when we make something more normally distributed, we are essentially putting it on a scale that is very easy to work with.

When we take the log-odds, we transform the range of the odds from (0, positive infinity) to (negative infinity, positive infinity). You can see this in the bell curve above.

Even though we still need our output to fall between 0 and 1, the symmetry we gain by taking the log-odds brings us closer to the output we want than before!

  • The logit function

The "logit function" is simply the mathematical operation we perform to get the log-odds!

[Figure: the logit function formula]

Terrifying, indescribable math. Er, I mean, the logit function.

[Figure: the logit function, plotted]

As you can see above, by taking the natural log of the odds, the logit function maps our odds onto the range from negative infinity to positive infinity.
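Here is a tiny Python sketch of the logit function, checked against the football-team odds from earlier:

```python
# A tiny sketch of the logit function: the natural log of the odds.
import math

def logit(p):
    """Map a probability in (0, 1) to log-odds in (-inf, +inf)."""
    return math.log(p / (1 - p))

print(logit(6 / 7))  # win probability from 6:1 odds  ->  ln(6)  ~  1.79
print(logit(1 / 7))  # loss probability from 1:6 odds -> ln(1/6) ~ -1.79
print(logit(0.5))    # even odds -> 0.0, dead center of the symmetric scale
```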

  • The sigmoid function

OK, but we still haven't reached the point where the model gives us a probability; right now, all our numbers run from negative infinity to positive infinity. Enter: the sigmoid function.

The sigmoid function, named after the s shape it traces when plotted, is simply the inverse of the log-odds. By taking the inverse of the log-odds, we map our values from (negative infinity, positive infinity) to 0 to 1, which in turn gives us probabilities, exactly what we wanted!

In contrast to the graph of the logit function, whose y-values range from negative infinity to positive infinity, the graph of the sigmoid function has y-values between 0 and 1. Hooray!

[Figure: the sigmoid function, with y-values between 0 and 1]

With this, we can now plug in any x value and trace it back to a predicted y value: that y value is the probability that the x value belongs to one category or the other.
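A matching sketch of the sigmoid function shows how it maps the log-odds from the earlier example back to probabilities:

```python
# A matching sketch of the sigmoid function: the inverse of the logit.
import math

def sigmoid(x):
    """Map log-odds in (-inf, +inf) back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(math.log(6)))      # ~0.857: the "team wins" probability again
print(sigmoid(math.log(1 / 6)))  # ~0.143: the "team loses" probability again
print(sigmoid(0.0))              # 0.5: even odds
```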

  • Maximum likelihood estimation

Remember how, in linear regression, we found the best-fit line by minimizing the RSS (an approach sometimes called "ordinary least squares," or OLS)?

Here, we use something called maximum likelihood estimation (MLE) to obtain the most accurate predictions.

MLE gives us the most accurate predictions by determining the parameters of the probability distribution that best describe our data.

Why should we care about pinning down the distribution of our data? Because it's cool! (Not really.)

It simply makes our data easier to work with, and it makes our model generalizable to many different kinds of data.

[Figure: data points placed along an s-curve]

Generally speaking, to obtain the MLE for our data, we place the data points on an s-curve and add up their log-likelihoods.

Essentially, we want to find the s-curve that maximizes the log-likelihood of our data. We simply keep computing the log-likelihood of each log-odds line (much like what we did with the RSS of each candidate best-fit line in linear regression) until we reach the maximum.
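Here is a minimal sketch of that idea: for a handful of made-up labeled points, we compute the log-likelihood of two candidate s-curves and keep the one that scores higher.

```python
# A minimal sketch of picking an s-curve by log-likelihood.
# The labeled points below are invented for illustration.
import math

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]  # the category each point actually belongs to

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def log_likelihood(slope, intercept):
    """Sum of the log-probabilities the curve assigns to the observed labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(slope * x + intercept)
        total += math.log(p if y == 1 else 1 - p)
    return total

# MLE keeps whichever candidate curve scores higher (closer to zero).
print(log_likelihood(1.0, -0.5))  # ~ -1.51
print(log_likelihood(2.0, -1.0))  # ~ -0.74  <- the better fit
```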

All right, we now know what gradient descent, linear regression, and logistic regression are. In the next installment, Audrey will walk us through decision trees, random forests, and SVMs.

Reference link:

https://towardsdatascience.com/machine-learning-algorithms-in-laymans-terms-part-1-d0368d769a7b

Editor: Huang Jiyan

Proofreader: Lin Yilin

— End —
