Python Bayesian regression analysis of housing affordability data set

 I wanted to learn how to use pymc3 to do linear regression within a Bayesian framework, where the goal is to draw inferences from data.

 

 What is Bayes' rule?

  In essence, we combine what we already know about the world with the evidence we have, and the result tells us about the state of the world.

Here is an example. Suppose there is a rare disease that strikes one in every 10,000 people at random. In other words, you have a 0.01% chance of having this disease. Fortunately, there is a test that correctly identifies 99% of the people who have the disease, and if you do not have the disease, it correctly says so 99% of the time. You take the test and the result is positive. How worried should you be?

Well, let's think about it logically. We know that one person in 10,000 contracts this disease. Suppose there are 10,000 people. 9,999 of them do not have the disease, but 1% of them will get a positive result anyway. Therefore, even though only one person actually has the disease, about 101 people get a positive result. This means that even with a positive result, your chance of actually having the disease is only about 1 in 101 (roughly 1%).
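The arithmetic above can be checked directly with Bayes' rule. The following is a minimal sketch; the variable names are illustrative, and the numbers are the ones given in the example.

```python
# Bayes' rule applied to the rare-disease example above.
prior = 1 / 10_000          # P(disease): one person in 10,000
sensitivity = 0.99          # P(positive | disease)
specificity = 0.99          # P(negative | no disease)

false_positive_rate = 1 - specificity   # P(positive | no disease)

# Denominator of Bayes' rule: total probability of a positive test.
p_positive = sensitivity * prior + false_positive_rate * (1 - prior)

# Posterior: P(disease | positive).
posterior = sensitivity * prior / p_positive

print(f"P(positive) = {p_positive:.6f}")
print(f"P(disease | positive) = {posterior:.4f}")
```

The posterior comes out just under 1%, in line with the head count of roughly one true case among about 101 positive results.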

 Mathematical description:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

It looks very simple, and in fact it is. The formula requires only some knowledge of probability distributions. In practice, however, the denominator on the right-hand side usually means computing a lot of heavy integrals. For that reason, Bayesian statistics was set aside for many years, even though, in a sense, it falls naturally out of probability theory. If only we had something good at crunching a lot of numbers, this problem could be solved.

Computers calculate very quickly indeed. In fact, the clunky old laptop I am writing this article on can do some decent Bayesian statistics, such as the Bayesian regression we are about to run.

 

Code

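Here is what you need to know for Bayesian regression. Usually, we think of a regression like this:

$$Y = \beta X + e$$

where e is a normally distributed error.

 Therefore, we assume that:

$$Y \sim \mathcal{N}(\beta X, \sigma^2)$$

together with priors on the coefficient β and the standard deviation σ.

So, if we have data X and Y, we can perform Bayesian linear regression.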
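To make the idea concrete, here is a minimal NumPy sketch of a simplified special case: the error standard deviation sigma is treated as known and the slope gets a Normal(0, tau^2) prior, so the posterior for the slope has a closed form. The synthetic data, the value of tau, and all variable names are assumptions for illustration. The model fitted later in this post also puts a prior on the standard deviation and is fitted by sampling rather than by this formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from Y = beta * X + e, e ~ Normal(0, sigma^2).
# All numbers here are made up for illustration.
true_beta, sigma = 0.5, 2.0
X = rng.uniform(20, 80, size=500)
Y = true_beta * X + rng.normal(0.0, sigma, size=X.size)

# Assumed prior: beta ~ Normal(0, tau^2), with sigma treated as known.
tau = 10.0

# Conjugate update: Normal prior + Normal likelihood gives a Normal posterior.
post_var = 1.0 / (1.0 / tau**2 + np.sum(X**2) / sigma**2)
post_mean = post_var * np.sum(X * Y) / sigma**2

print(f"posterior mean of beta: {post_mean:.4f}")
print(f"posterior sd of beta:   {np.sqrt(post_var):.4f}")
```

With this much data the posterior concentrates tightly around the slope used to generate it, and the prior has very little influence.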

 Code 

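 The data set we want to use is the "American Housing Survey: 2013 Housing Affordability Data" data set.

What we are interested in is how housing burden changes with age. AGE1 contains the age of the head of household. BURDEN is a variable that tells us how large housing costs are relative to income. For simplicity, we focus on just these two variables. What we want to know is: does the housing burden get easier with age? In particular, we want to know whether the slope coefficient is negative and, since we are in a Bayesian framework, what the probability is that it is negative.

So, let's start with some prerequisites: we import the libraries we need and load the data. We also do some data cleaning.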

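import numpy as np
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt

# Load the survey data (the path is specific to the author's machine)
df = pd.read_csv('/home/ryan/Documents/thads2013n.txt', sep=',')
# Keep only rows where both BURDEN and AGE1 are positive
df = df[df['BURDEN'] > 0]
df = df[df['AGE1'] > 0]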

OK, that was simple enough. Now, let's build the model discussed above. First, let's make a scatter plot to see what the data look like.

plt.scatter(df['AGE1'],df['BURDEN'])
plt.show()

In the resulting plot, some housing burdens look astronomically high, easily exceeding 10 times income.

For now, we won't worry too much about that. Here is the code to build and run the model:
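A minimal sketch of such a model in pymc3 is given here, under stated assumptions rather than as the author's exact code. The helper name fit_burden_model, the prior widths, and the variable names 'Intercept', 'sigma' and 'y' are all hypothetical. Only the slope name 'x' is taken from the post, which later reads trace['x']. The structure follows the description in the text: a normally distributed intercept, a slope on AGE1, and a standard deviation for the regression.

```python
def fit_burden_model(df, draws=3000):
    """Fit BURDEN ~ Intercept + x * AGE1 with pymc3 (illustrative sketch)."""
    import pymc3 as pm  # imported here so the definition loads without pymc3

    with pm.Model():
        # Assumed priors; the widths are illustrative choices, not the author's.
        sigma = pm.HalfCauchy('sigma', beta=10)
        intercept = pm.Normal('Intercept', mu=0, sd=20)
        slope = pm.Normal('x', mu=0, sd=20)  # named 'x' to match trace['x']

        # Likelihood: BURDEN is normal around the regression line.
        pm.Normal('y', mu=intercept + slope * df['AGE1'].values,
                  sd=sigma, observed=df['BURDEN'].values)

        # Draw posterior samples (NUTS is pymc3's default for this model).
        trace = pm.sample(draws)
    return trace
```

Calling trace = fit_burden_model(df) on the cleaned data frame would supply the trace object that the plotting and probability code in this post expects.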


pm.traceplot(trace)
plt.show()

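It looks exactly like our model above, except that we also have an extra, normally distributed beta for the intercept. The last line is what actually runs the model for us. Once the model has been fitted, we can move on to some inference. Go ahead and run it, then do something else while it runs. On an older laptop (such as mine), this can take a long time. Normally you would want to do these computations in the cloud on a GPU. On my laptop the run took 47 minutes. When it finishes, you will see something like the following: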

 

 

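As you can see, we have posterior distributions for the slope and the intercept, as well as for the standard deviation of the regression.

But what I wanted to know from the start is this: does the housing burden decrease with age? My guess was maybe yes. As people get established, their housing costs should fall relative to income. That would correspond to a negative slope coefficient on the age variable. Running the following code gives the posterior probability that the slope coefficient is negative.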

 
# Fraction of posterior draws of the slope ('x') that are below zero
print((trace['x'] < 0).mean())

The probability that this coefficient is negative is about 13.8%.
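The same kind of summary can be tried without running the full model. The sketch below uses synthetic normal draws as a stand-in for trace['x']; the draws, their location and scale, and the variable names are assumptions for illustration and are not the results reported in this post. It computes the share of draws below zero and, as an additional summary, a central 95% interval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for trace['x']: synthetic slope draws, NOT the post's results.
slope_draws = rng.normal(loc=0.01, scale=0.01, size=20_000)

# Posterior probability that the slope is negative.
prob_negative = np.mean(slope_draws < 0)

# Central 95% credible interval from the same draws.
lower, upper = np.percentile(slope_draws, [2.5, 97.5])

print(f"P(slope < 0) = {prob_negative:.3f}")
print(f"95% interval: [{lower:.4f}, {upper:.4f}]")
```

Because the posterior is represented by draws, any question about the slope reduces to counting or sorting those draws, which is why the probability above is a one-line calculation.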

 


Origin blog.csdn.net/qq_19600291/article/details/104040027