Chapter 2: California Housing Price Prediction Model

 Obtain the required data sets:

import os
import pandas as pd
import tarfile
import urllib.request
DOWNLOAD_ROOT="https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH="datasets/housing"
HOUSING_URL=DOWNLOAD_ROOT+HOUSING_PATH+"/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL,housing_path=HOUSING_PATH): # download the dataset
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path=os.path.join(housing_path,"housing.tgz") # build the local file path
    urllib.request.urlretrieve(housing_url,tgz_path) # download the housing.tgz archive
    housing_tgz=tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path) # extract the archive
    housing_tgz.close()

def load_housing_data(housing_path=HOUSING_PATH): # load the dataset
    csv_path=os.path.join(housing_path,"housing.csv")
    return pd.read_csv(csv_path) # read the CSV file into a DataFrame

Take a quick look at the data structure (attributes and feature information):

fetch_housing_data()
housing = load_housing_data()
# housing.head()  # view the first 5 rows of the dataset
# housing.info()  # brief description of the dataset
# housing["ocean_proximity"].value_counts()  # count the values of the categorical attribute
# housing.describe()  # summary statistics of the numeric attributes; categorical attributes are not included
# housing.hist(bins=20, figsize=(20,15))  # histogram of each numeric attribute; bins is the number of bars

Purely random sampling: generate a test set containing 20% of the complete dataset:

from sklearn.model_selection import train_test_split
# train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)  # purely random sampling; no attribute is used for stratification
# print(len(train_set), len(test_set))
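For intuition about what a purely random split does, here is a minimal hand-rolled sketch. It is not how train_test_split is implemented; the helper name split_train_test, its seed parameter, and the toy DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio, seed=42):
    """Hypothetical helper: shuffle the row positions, then cut off a test slice."""
    rng = np.random.default_rng(seed)        # fixed seed so the split is repeatable
    shuffled = rng.permutation(len(data))    # random ordering of the row positions
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]

# Toy data standing in for the housing DataFrame (an assumption).
toy = pd.DataFrame({"value": range(100)})
train_part, test_part = split_train_test(toy, 0.2)
print(len(train_part), len(test_part))
```

Every row has the same chance of landing in the test part, regardless of its feature values, which is exactly why the category proportions of a small test set can drift away from those of the full dataset.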

Purely random sampling can produce a biased test set, because it does not take the distribution of feature values into account. In this example, the median house value to be predicted is strongly related to the median income feature, so the sample should follow the distribution of median income; that is, use stratified sampling based on median income.

Because median income is a continuous numeric attribute, first create an income category attribute, then treat each category as a stratum for stratified sampling.

In this example, most median income values lie between 2 and 5. Each stratum must contain enough instances, so the data should not be split into too many strata.

Create the income category attribute: divide the median income by 1.5 (to limit the number of income categories), round up with the ceil function to get discrete categories, and finally merge every category greater than 5 into category 5:

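import numpy as np
housing['income_type']=np.ceil(housing['median_income']/1.5) # discretize the median income
# keep categories below 5 and merge everything else into category 5; assign the result back instead of using inplace=True on a column selection
housing['income_type']=housing['income_type'].where(housing['income_type']<5,5.0)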
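To make the divide, round up, and cap steps concrete, here is a small sketch on a handful of made-up income values. The function name make_income_category, its scale and cap parameters, and the sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def make_income_category(median_income, scale=1.5, cap=5.0):
    """Hypothetical helper: divide by `scale`, round up, then merge large categories into `cap`."""
    cat = np.ceil(median_income / scale)
    return cat.where(cat < cap, cap)  # values failing `cat < cap` are replaced by cap

# Made-up median income values (not taken from the dataset).
sample = pd.Series([0.9, 2.4, 3.1, 4.6, 7.2, 15.0])
categories = make_income_category(sample)
print(categories.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 5.0]
```

The last two values would fall into categories 5 and 10 after rounding up; the cap merges both into category 5, so no stratum ends up with only a few instances.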

Perform stratified sampling by income category using sklearn's StratifiedShuffleSplit class:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_type']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
housing['income_type'].value_counts() / len(housing)  # proportion of each category in the complete dataset

 The output is:

3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_type, dtype: float64
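To check how closely a test set follows these proportions, the category shares of a random sample and a stratified sample can be compared side by side. The sketch below is an illustration under assumptions: it uses synthetic categories drawn with roughly the proportions shown above, and it swaps in pandas' groupby(...).sample as a stand-in for StratifiedShuffleSplit so that it runs without scikit-learn or the downloaded data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for housing['income_type'] (an assumption, not the real data).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "income_type": rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], size=10000,
                              p=[0.04, 0.32, 0.35, 0.18, 0.11])
})

def proportions(data):
    """Share of each income category, ordered by category."""
    return data["income_type"].value_counts(normalize=True).sort_index()

random_test = df.sample(frac=0.2, random_state=42)                       # purely random 20%
strat_test = df.groupby("income_type").sample(frac=0.2, random_state=42)  # 20% of each category

comparison = pd.DataFrame({
    "overall": proportions(df),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
comparison["random_abs_err"] = (comparison["random"] - comparison["overall"]).abs()
comparison["strat_abs_err"] = (comparison["stratified"] - comparison["overall"]).abs()
print(comparison)
```

Because the stratified sample takes the same fraction from every category, its proportions can differ from the overall ones only through rounding, whereas the random sample carries ordinary sampling noise.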

After stratified sampling, the added income category attribute is no longer needed, so delete it:

for subset in (strat_train_set,strat_test_set): # renamed from "set" to avoid shadowing the built-in
    subset.drop(['income_type'],axis=1,inplace=True)

This completes the data preprocessing.


 

Gain insights through data exploration and visualization

Create a copy of the training set to work on, so that the training set itself is not modified.

 

housing=strat_train_set.copy()

 

Visualize the geographic data. With the alpha parameter set to 0.1, the locations with a high density of data points stand out clearly.

housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.1)

 

The following code presents the information more clearly. The radius of each circle represents the district's population (option s), and the color represents the price (option c). A predefined color map named jet (option cmap) spans the color range from blue (low prices) to red (high prices).

import matplotlib.pyplot as plt
housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.4,s=housing['population']/100,
             label="population",c="median_house_value",cmap=plt.get_cmap("jet"),colorbar=True) # alpha sets the transparency of the points

 

 


 

Look for correlations between attributes

Method 1: Use the corr() method to compute the standard correlation coefficient (Pearson correlation coefficient) between each pair of attributes:

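# restrict to numeric columns: recent pandas versions raise an error if the text column ocean_proximity is included
corr_mat=housing.select_dtypes(include='number').corr()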

View the correlation between each attribute and the median house value:

corr_mat['median_house_value'].sort_values(ascending=False)

 Output:

median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

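As can be seen, median income has the highest correlation with median house value.

Note:

1. The correlation coefficient only captures linear correlation (if x goes up, y goes up or down), so it may completely miss nonlinear relationships (for example, a sine curve);

2. The magnitude of the correlation has nothing to do with the magnitude of the slope.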
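Both caveats can be checked numerically. The sketch below is illustrative only: the pearson helper and the synthetic x values are assumptions, and a parabola stands in for the nonlinear case.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

x = np.linspace(-1, 1, 201)  # symmetric around zero

r_steep = pearson(x, 10 * x)    # steep straight line
r_shallow = pearson(x, 0.1 * x) # shallow straight line
r_parabola = pearson(x, x ** 2) # y is fully determined by x, but not linearly

print(round(r_steep, 6), round(r_shallow, 6), round(abs(r_parabola), 6))
```

Both straight lines give a coefficient of 1 even though their slopes differ by a factor of 100, while the parabola gives a coefficient of essentially 0 although y depends entirely on x.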

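Method 2: Use pandas' scatter_matrix function to plot every numeric attribute against every other numeric attribute.

This example has 9 numeric attributes, which would produce 9*9=81 plots, so we focus only on the attributes most correlated with median house value.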

 

from pandas.plotting import scatter_matrix
attr=['median_house_value','median_income','total_rooms','housing_median_age']
scatter_matrix(housing[attr],figsize=(12,10),color='green',alpha=0.1) 

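Output:

As the figure above shows, the attribute most correlated with median house value is median income. Zoom in on the correlation between these two attributes: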

housing.plot(kind='scatter',x='median_income',y='median_house_value',alpha=0.1)

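Output:

The plot shows three horizontal lines at $500,000, $450,000, and $350,000. These data points may hurt what the algorithm learns and should be removed.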
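One way to act on this, sketched under assumptions: treat any price that repeats unusually often as a suspect value and drop the rows that carry it. The helper name drop_repeated_values, the min_count threshold, and the toy prices are hypothetical; on the real data the threshold would have to be chosen by inspecting the value counts.

```python
import pandas as pd

def drop_repeated_values(df, column, min_count):
    """Hypothetical helper: remove rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].value_counts()
    suspicious = counts[counts >= min_count].index  # values forming a horizontal line in the plot
    return df[~df[column].isin(suspicious)]

# Toy prices (made up): 500000 and 350000 each repeat three times.
toy = pd.DataFrame({
    "median_house_value": [120000, 500000, 500000, 500000, 230000,
                           350000, 350000, 350000, 410000]
})
cleaned = drop_repeated_values(toy, "median_house_value", min_count=3)
print(cleaned["median_house_value"].tolist())  # [120000, 230000, 410000]
```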


 

Experiment with different attribute combinations
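The source gives no code for this step. As one possible illustration, and purely an assumption about which combinations might be worth trying, ratio attributes can be built from the column names that appear in the correlation output (total_rooms, total_bedrooms, population, households). The new column names and the toy numbers below are made up.

```python
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "total_rooms": [1000.0, 6000.0, 1500.0],
    "total_bedrooms": [200.0, 1500.0, 300.0],
    "population": [500.0, 3000.0, 1000.0],
    "households": [200.0, 1200.0, 250.0],
})

# Hypothetical combined attributes: per-household and per-room ratios.
toy["rooms_per_household"] = toy["total_rooms"] / toy["households"]
toy["bedrooms_per_room"] = toy["total_bedrooms"] / toy["total_rooms"]
toy["population_per_household"] = toy["population"] / toy["households"]

print(toy[["rooms_per_household", "bedrooms_per_room", "population_per_household"]])
```

On the real training set, the same correlation check as before (sorting the median_house_value column of the correlation matrix) would show whether any combined attribute is more informative than the raw totals.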

 

Origin www.cnblogs.com/zhhy236400/p/11111180.html