Data Analysis with Machine Learning (Part 1)

Writing original content takes effort; if you repost this, please credit the source.

Common data analysis steps

1. Import the basic libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import types
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

2. Load the training and test data:

train_data = pd.read_csv("D://ML//Data//train.csv")
test_data = pd.read_csv("D://ML//Data//test.csv")

3. Take a quick look at the data format and types:

train_data.head(13)

train_data.info()

    Country      Happiness.Rank  Happiness.Score  Whisker.high  Whisker.low  Economy..GDP.per.Capita.  Family    Health..Life.Expectancy.  Freedom   Generosity  Trust..Government.Corruption.  Dystopia.Residual
0   Norway       1               7.537            7.594445      7.479556     1.616463                  1.533524  0.796667                  0.635423  0.362012    0.315964                       2.277027
1   Denmark      2               7.522            7.581728      7.462272     1.482383                  1.551122  0.792566                  0.626007  0.355280    0.400770                       2.313707
2   Iceland      3               7.504            7.622030      7.385970     1.480633                  1.610574  0.833552                  0.627163  0.475540    0.153527                       2.322715
3   Switzerland  4               7.494            7.561772      7.426227     1.564980                  1.516912  0.858131                  0.620071  0.290549    0.367007                       2.276716
4   Finland      5               7.469            7.527542      7.410458     1.443572                  1.540247  0.809158                  0.617951  0.245483    0.382612                       2.430182
5   Netherlands  6               7.377            7.427426      7.326574     1.503945                  1.428939  0.810696                  0.585384  0.470490    0.282662                       2.294804
6   Canada       7               7.316            7.384403      7.247597     1.479204                  1.481349  0.834558                  0.611101  0.435540    0.287372                       2.187264
7   New Zealand  8               7.314            7.379510      7.248490     1.405706                  1.548195  0.816760                  0.614062  0.500005    0.382817                       2.046456
8   Sweden       9               7.284            7.344095      7.223905     1.494387                  1.478162  0.830875                  0.612924  0.385399    0.384399                       2.097538
9   Australia    10              7.284            7.356651      7.211349     1.484415                  1.510042  0.843887                  0.601607  0.477699    0.301184                       2.065211
10  Israel       11              7.213            7.279853      7.146146     1.375382                  1.376290  0.838404                  0.405989  0.330083    0.085242                       2.801757
11  Costa Rica   12              7.079            7.168112      6.989888     1.109706                  1.416404  0.759509                  0.580132  0.214613    0.100107                       2.898639
12  Austria      13              7.006            7.070670      6.941330     1.487097                  1.459945  0.815328                  0.567766  0.316472    0.221060                       2.138506

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
Country                          155 non-null object
Happiness.Rank                   155 non-null int64
Happiness.Score                  155 non-null float64
Whisker.high                     155 non-null float64
Whisker.low                      155 non-null float64
Economy..GDP.per.Capita.         155 non-null float64
Family                           155 non-null float64
Health..Life.Expectancy.         155 non-null float64
Freedom                          155 non-null float64
Generosity                       155 non-null float64
Trust..Government.Corruption.    155 non-null float64
Dystopia.Residual                155 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 14.6+ KB

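From the output above we can see how many features the training set contains and the data type of each one (text or numeric). For text features, we can also quickly tell whether any of them can be binarized (gender, for example).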
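
As a minimal sketch of that check, the snippet below lists the non-numeric columns of a small made-up table and binarizes the ones with exactly two distinct values. The toy DataFrame and its Gender column are assumptions for illustration only; they are not part of the dataset shown above.

```python
import pandas as pd

# Hypothetical toy data; the Gender column is an assumption for illustration only.
toy = pd.DataFrame({
    "Country": ["Norway", "Denmark", "Iceland", "Norway"],
    "Gender": ["male", "female", "female", "male"],
    "Score": [7.5, 7.5, 7.5, 7.3],
})

# Non-numeric columns are the candidates for encoding.
text_columns = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# A text column with exactly two distinct values can be mapped to 0/1.
binary_columns = [c for c in text_columns if toy[c].nunique() == 2]

# Build the 0/1 mapping from the sorted distinct values so it is reproducible.
for col in binary_columns:
    first, second = sorted(toy[col].unique())
    toy[col + "_bin"] = toy[col].map({first: 0, second: 1})

print(text_columns)
print(binary_columns)
print(toy)
```

Country has three distinct values here, so it is left alone; only Gender is mapped to 0/1.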

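4. For text features that cannot be binarized but have only a few distinct values (such as country or province), use the following to visualize their effect on the target to be predicted: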

# 'Feature' is a placeholder for the actual column name
feature_length = len(train_data['Feature'].unique())
print('There are %s distinct values of Feature in this table' % feature_length)

This is a helper for checking how many distinct values the feature has.

plt.figure(figsize=(18,18))
plt.title('Feature Correlation with Result ', y=1.05, size=15)
g=sns.stripplot(x='Feature',y='Result',data=train_data,jitter=True)
plt.xticks(rotation=45)
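
The strip plot can be complemented with a numeric summary per category. The sketch below is one possible way to do that, using made-up data; Feature and Result are the same placeholder names as above, and the values are invented.

```python
import pandas as pd

# Hypothetical toy data; 'Feature' and 'Result' are placeholder column names.
toy = pd.DataFrame({
    "Feature": ["A", "A", "B", "B", "C"],
    "Result": [1.0, 3.0, 4.0, 6.0, 10.0],
})

# Count and mean of the target for each category, highest mean first.
summary = (
    toy.groupby("Feature")["Result"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Categories with only a handful of rows (a small count) deserve less trust than their mean alone would suggest.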

5. Check the correlations between features; this step is important:

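colormap = plt.cm.RdBu
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
# Correlate the numeric columns only; text columns such as Country have no Pearson correlation
sns.heatmap(train_data.select_dtypes(include='number').corr(), cmap=colormap, linecolor='white', linewidths=0.1, vmax=1.0, square=True, annot=True)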

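[Figure: heatmap titled 'Pearson Correlation of Features']
The visualization makes the correlations between features easy to see.
In the figure above, each cell where a row and a column intersect shows how strongly that pair of features is related: the deeper the color, the stronger the relationship, and the number printed in the cell (the Pearson correlation coefficient) tells the same story.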
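
To make the number in each cell concrete, here is a small sketch that computes the Pearson coefficient by hand and compares it with `DataFrame.corr()`. The three toy columns and the helper name `pearson` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b rises with a, c mostly falls as a rises.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.8],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_ab = pearson(toy["a"], toy["b"])  # close to +1
r_ac = pearson(toy["a"], toy["c"])  # -0.8
corr_matrix = toy.corr()            # same values, for all pairs at once
print(r_ab, r_ac)
print(corr_matrix)
```

A strongly negative value such as the one for a and c is as informative as a strongly positive one, so it is the absolute value that indicates how closely two features move together.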

6. Take a quick look at the distribution of the important features:

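# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(train_data['Important Features'], kde=True)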
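
Alongside the plot, a few summary statistics describe the same distribution numerically. This is a minimal sketch on invented values; the series name is a placeholder.

```python
import pandas as pd

# Hypothetical values with one large outlier on the right.
values = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 10.0], name="Important Feature")

stats = values.describe()   # count, mean, std, min, quartiles, max
skewness = values.skew()    # positive when the right tail is longer

print(stats)
print("skewness:", skewness)
```

A mean well above the median together with positive skewness points to a long right tail, which the histogram should show as well.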


7. If the data contains countries or regions, you can visualize it on a map as follows:

# 'Result' is a placeholder for the column to plot
data = dict(type='choropleth', locations=train_data['Country'], locationmode='country names',
            z=train_data['Result'], text=train_data['Country'], colorbar={'title': 'Result'})
layout = dict(title='Global Result', geo=dict(showframe=False, projection={'type': 'mercator'}))
choromap3 = go.Figure(data=[data], layout=layout)
iplot(choromap3)

8. For supervised learning, you can train a model as follows:

y = train_data['Result']
# Exclude the target itself along with the features judged unhelpful
X = train_data.drop(['Result', 'Useless Features'], axis=1)

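Here, Useless Features stands for the features that the steps above showed to have little influence on the result; they can be dropped to make prediction simpler. (Since these features exist for a reason, dropping them is not without problems; more on that later.)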
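
One possible way to pick the candidates, sketched under assumptions: rank the features by their absolute correlation with the target and treat those under a cutoff as droppable. The toy columns and the 0.3 threshold are invented for illustration and would need tuning on real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'strong' drives the target, 'noise' is unrelated to it.
rng = np.random.default_rng(1)
n = 300
strong = rng.normal(size=n)
noise = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": strong,
    "noise": noise,
    "Result": 3.0 * strong + 0.5 * rng.normal(size=n),
})

THRESHOLD = 0.3  # assumed cutoff, not a recommended value

# Absolute Pearson correlation of every feature with the target.
abs_corr = toy.corr()["Result"].drop("Result").abs()
useless = abs_corr[abs_corr < THRESHOLD].index.tolist()

y = toy["Result"]
X = toy.drop(columns=["Result"] + useless)
print(abs_corr.sort_values(ascending=False))
print("dropped:", useless)
```

Pearson correlation only captures linear relationships, so a low value is a hint rather than proof that a feature carries no information, which fits the caveat above.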

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Note: this rebinds train_data and test_data to the feature splits
train_data, test_data, train_target, test_target = train_test_split(X, y, test_size=0.3, random_state=101)
lr = LinearRegression()
lr.fit(train_data, train_target)

predict=lr.predict(test_data)
print('predict data',predict)
print('-'*60)
print(test_target)

predict data [ 4.875  6.168  5.856  4.907  6.379  4.643  5.314  3.856  6.952  5.458
  4.36   5.129  6.269  5.658  5.919  5.185  5.291  5.615  7.413  7.509
  6.474  5.177  6.218  5.976  3.303  6.478  3.695  4.876  3.36   2.905
  3.832  5.057  5.045  5.987  5.538  4.324  4.201  5.546  6.65   4.217
  3.763  5.033  5.163  4.219  4.508  5.121  5.145  6.084]
------------------------------------------------------------
102    4.875
42     6.168
55     5.856
100    4.907
33     6.379
109    4.643
78     5.314
141    3.856
16     6.952
74     5.458
120    4.360
92     5.129
39     6.269
64     5.658
53     5.919
84     5.185
80     5.291
66     5.615
4      7.413
1      7.509
32     6.474
85     5.177
41     6.218
50     5.976
154    3.303
31     6.478
147    3.695
101    4.876
153    3.360
156    2.905
142    3.832
96     5.057
97     5.045
48     5.987
69     5.538
122    4.324
129    4.201
68     5.546
25     6.650
128    4.217
143    3.763
98     5.033
86     5.163
127    4.219
114    4.508
94     5.121
90     5.145
43     6.084
Name: Happiness Score, dtype: float64

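The predicted values are almost identical to the actual ones; to the three decimals printed above, they coincide.

9. Visualize the test results: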

plt.scatter(predict,test_target)
plt.xlabel('Predict')
plt.ylabel('Test_data')
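
Besides the scatter plot, the agreement between predictions and actual values can be summarized with standard regression metrics. The sketch below runs on synthetic data, since the original CSV is not available here; the coefficients, noise level, and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, nearly linear data standing in for the real features and target.
rng = np.random.default_rng(101)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, 0.5, -0.3]) + 5.0 + 0.01 * rng.normal(size=150)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_test, pred))   # penalizes large errors more
r2 = r2_score(y_test, pred)                        # 1.0 means a perfect fit
print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R^2={r2:.4f}")
```

An R^2 extremely close to 1 on held-out data is worth a second look: it can mean the target is almost an exact function of the features, or that the target leaked into them.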
