Python Data Processing Practice 1: Getting Started with Pandas and Data Visualization

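Collecting, cleaning, organizing, and visualizing data are the first steps of data processing in Python. This exercise uses a real dataset (Canadian immigration data) to introduce the basics of the Pandas and Matplotlib libraries. Data visualization in Python relies mainly on matplotlib and pandas. If you have installed the Anaconda distribution, both libraries are already included; if you have only installed Jupyter Notebook, you can install them with the shell commands below (the leading ! tells the notebook to run the line as a shell command).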

!pip install pandas
!pip install matplotlib

1. Loading the Dataset

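The sample dataset comes from official Canadian immigration data and only covers the years up to 2013. First, import numpy and pandas for basic analysis. Because the data is in Excel format (one of the most widely used data formats), the xlrd library also needs to be installed. The two commands below install it under a standard Python distribution (pip) and under Anaconda (conda), respectively. Note that xlrd 2.0 and later reads only legacy .xls files; for an .xlsx file such as the one used here, recent pandas versions rely on the openpyxl package instead, which can be installed the same way.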

!pip install xlrd
!conda install -c anaconda xlrd --yes

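Note that after installing a package from within Jupyter Notebook, the kernel may not be able to import it right away. Choosing Restart from the Kernel menu restarts the kernel and resolves this.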
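To confirm what the current kernel can actually import, a quick check like the following may help. It is only a sketch: the helper name check_packages is hypothetical, and openpyxl is included on the assumption that the .xlsx file will be read with a recent pandas version.

```python
import importlib.util

# Packages used in this exercise; openpyxl is an assumption for reading .xlsx files
packages = ["numpy", "pandas", "matplotlib", "xlrd", "openpyxl"]


def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(packages)
for name, ok in status.items():
    print(f"{name}: {'available' if ok else 'missing'}")
```

If a package that was just installed still shows as missing, restarting the kernel and running the check again is the first thing to try.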

import numpy as np  # useful for scientific computing in Python
import pandas as pd # primary data structure library

Read the sample data and display the first and last rows.

df_can = pd.read_excel('https://ibm.box.com/shared/static/lw190pt9zpy5bd1ptyg2aw15awomz9pu.xlsx',
                       sheet_name='Canada by Citizenship',
                       skiprows=range(20),
                       skipfooter=2)

print ('Data read into a pandas dataframe!')
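The skiprows and skipfooter arguments drop the descriptive lines that surround the actual table in the spreadsheet. The sketch below illustrates the same idea on a small in-memory text file, using pd.read_csv as a stand-in for pd.read_excel so that it runs without downloading anything. The preamble and footnote lines are made up for illustration; the four numbers are taken from the outputs shown in this exercise.

```python
import io

import pandas as pd

# A tiny file with two title lines before the table and two footnotes after it
lines = [
    "Illustrative report title",
    "Illustrative subtitle",
    "Country,1980,1981",
    "Japan,701,756",
    "Haiti,1666,3692",
    "Footnote 1",
    "Footnote 2",
]
raw = "\n".join(lines)

# Skip the first two lines and the last two lines, keeping only the table.
# skipfooter requires the slower pure-Python parser in read_csv.
df_demo = pd.read_csv(
    io.StringIO(raw),
    skiprows=range(2),
    skipfooter=2,
    engine="python",
)
print(df_demo)
```

Column names read from a CSV file are strings, whereas the year columns read from the Excel sheet in this exercise are integers.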

df_can.head()
# tip: to display more rows, pass the number of rows, e.g. df_can.head(10)

5 rows × 43 columns

df_can.tail()

5 rows × 43 columns

For other basic inspection commands, see the pandas API documentation.

df_can.info()

df_can.columns.values

df_can.index.values

print(type(df_can.columns))
print(type(df_can.index))

<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.range.RangeIndex'>

df_can.columns.tolist()
df_can.index.tolist()

print (type(df_can.columns.tolist()))
print (type(df_can.index.tolist()))

<class 'list'>
<class 'list'>

# size of dataframe (rows, columns)
df_can.shape

(195, 43)
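The same inspection calls can be tried on a small hand-built DataFrame, which makes it easier to see what each one returns. This is only a sketch; df_small is a hypothetical name and the values are the 1980 and 1981 figures for China and India shown later in this exercise.

```python
import pandas as pd

df_small = pd.DataFrame(
    {"Country": ["China", "India"], 1980: [5123, 8880], 1981: [6682, 8670]}
)

print(df_small.shape)                  # (number of rows, number of columns)
print(type(df_small.index).__name__)   # default index type when none is given
print(df_small.columns.tolist())       # column labels as a plain Python list
print(df_small.dtypes)                 # one dtype per column
```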

Data Cleaning and Preparation

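The dataset needs some basic cleaning. The steps below drop the columns that are not needed, rename several columns to make them more readable, and add a column holding the row totals.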

# in pandas axis=0 represents rows (default) and axis=1 represents columns.
df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
df_can.head(2)

2 rows × 38 columns

df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)
df_can.columns

Index([  'Country', 'Continent',    'Region',   'DevName',        1980,
              1981,        1982,        1983,        1984,        1985,
              1986,        1987,        1988,        1989,        1990,
              1991,        1992,        1993,        1994,        1995,
              1996,        1997,        1998,        1999,        2000,
              2001,        2002,        2003,        2004,        2005,
              2006,        2007,        2008,        2009,        2010,
              2011,        2012,        2013],
      dtype='object')

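# numeric_only=True sums only the year columns; without it, recent pandas versions
# raise a TypeError because the row also contains text columns
df_can['Total'] = df_can.sum(axis=1, numeric_only=True)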

df_can.isnull().sum()
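The cleaning steps above can also be written without inplace=True, as a chain that returns a new DataFrame. The following is a minimal sketch on a toy frame: the column names mirror the real dataset, the "Immigrants" values in the Type column are placeholders, and the numbers are the 1980 and 1981 figures for China and India shown later in this exercise.

```python
import pandas as pd

df_raw = pd.DataFrame(
    {
        "OdName": ["China", "India"],
        "AreaName": ["Asia", "Asia"],
        "Type": ["Immigrants", "Immigrants"],  # placeholder values
        1980: [5123, 8880],
        1981: [6682, 8670],
    }
)

# Drop the unneeded column and rename the rest in one chain
cleaned = (
    df_raw.drop(columns=["Type"])
    .rename(columns={"OdName": "Country", "AreaName": "Continent"})
)

# Sum only the year columns to build the total
cleaned["Total"] = cleaned[[1980, 1981]].sum(axis=1)

print(cleaned)
print(cleaned.isnull().sum())  # count of missing values per column
```

Selecting the year columns explicitly before summing avoids any dependence on how a given pandas version treats text columns.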

2. Intermediate Pandas: Indexing and Selection

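First take a look at the current state of the dataset, then work through the exercises to get familiar with the different ways of indexing.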

df_can.describe()

8 rows × 35 columns

df_can.Country  # returns the Country column as a Series

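df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]] # returns the values for specific years
# note that the country column name is a string while the year column names are integers
# to keep the labels consistent, all column names are converted to strings later on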

195 rows × 7 columns

df_can.set_index('Country', inplace=True)
# sets Country as the index; the opposite operation is df_can.reset_index()

df_can.head(3)

3 rows × 38 columns

# optionally, remove the name of the index
df_can.index.name = None

# 1. Show the full row for Japan (all columns)
print(df_can.loc['Japan'])

# alternative ways to get the same result
print(df_can.iloc[87])
print(df_can[df_can.index == 'Japan'].T.squeeze())

# 2. Data for the year 2013
print(df_can.loc['Japan', 2013])

# alternative way
print(df_can.iloc[87, 36]) # 2013 is at column position 36; the last of the 38 columns (position 37) is Total

982
982

# 3. Data for 1980 through 1985
print(df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1985]])
print(df_can.iloc[87, [3, 4, 5, 6, 7, 8]])

1980    701
1981    756
1982    598
1983    309
1984    246
1985    198
Name: Japan, dtype: object
1980    701
1981    756
1982    598
1983    309
1984    246
1985    198
Name: Japan, dtype: object
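Hard-coded positions such as 87 and 36 are easy to get wrong. One possible way to avoid them is to look the positions up with Index.get_loc, as in the sketch below. The toy frame df_sel is hypothetical and holds only the 1980 to 1984 values for Japan and Haiti that appear in this exercise.

```python
import pandas as pd

df_sel = pd.DataFrame(
    [[701, 756, 598, 309, 246], [1666, 3692, 3498, 2860, 1418]],
    index=["Japan", "Haiti"],
    columns=[1980, 1981, 1982, 1983, 1984],
)

# Label-based selection
by_label = df_sel.loc["Japan", 1984]

# Position-based selection, with the positions looked up instead of hard-coded
row_pos = df_sel.index.get_loc("Japan")
col_pos = df_sel.columns.get_loc(1984)
by_position = df_sel.iloc[row_pos, col_pos]

# .loc slices by label and includes both endpoints
first_three = df_sel.loc["Japan", 1980:1982]

print(by_label, by_position)
print(first_three)
```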

df_can.columns = list(map(str, df_can.columns))
# [print (type(x)) for x in df_can.columns.values] #<-- uncomment this line to check the result

# build the list of years to display as strings; this is handy later when plotting
years = list(map(str, range(1980, 2014)))
years
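The effect of converting the column labels can be seen on a small frame with mixed label types. This is a sketch only; df_mixed and years_demo are hypothetical names, and the values are the 1980 and 1981 figures for China and India shown later in this exercise.

```python
import pandas as pd

df_mixed = pd.DataFrame(
    {"Continent": ["Asia", "Asia"], 1980: [5123, 8880], 1981: [6682, 8670]},
    index=["China", "India"],
)

# Convert every column label to a string
df_mixed.columns = df_mixed.columns.map(str)

# After the conversion, the year columns are selected with string labels
years_demo = [str(y) for y in range(1980, 1982)]
subset = df_mixed[years_demo]

print(df_mixed.columns.tolist())
print(subset)
```

Once the labels are strings, selecting with the integer 1980 no longer works, so the conversion should be applied before any code that relies on the new labels.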

# 1. Create a boolean Series from a condition
condition = df_can['Continent'] == 'Asia'
print (condition)

# 2. Pass the boolean Series to the DataFrame
df_can[condition]

49 rows × 38 columns

# several conditions can be combined in one filter
# e.g. select rows where Continent == 'Asia' and Region == 'Southern Asia'

df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]

# when combining conditions, use '&' and '|' instead of 'and' and 'or'
# each condition must be wrapped in its own parentheses

9 rows × 38 columns
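The filtering rules above can be sketched on a four-row frame. The frame df_regions is hypothetical, and its Continent and Region labels are illustrative values in the style of the dataset rather than rows copied from it.

```python
import pandas as pd

df_regions = pd.DataFrame(
    {
        "Continent": ["Asia", "Asia", "Asia", "Latin America and the Caribbean"],
        "Region": ["Eastern Asia", "Southern Asia", "Eastern Asia", "Caribbean"],
    },
    index=["China", "India", "Japan", "Haiti"],
)

is_asia = df_regions["Continent"] == "Asia"
is_southern = df_regions["Region"] == "Southern Asia"

both = df_regions[is_asia & is_southern]  # both conditions hold
either = df_regions[is_southern | (df_regions["Region"] == "Caribbean")]  # at least one holds
not_asia = df_regions[~is_asia]  # '~' negates a boolean Series

# isin() is a compact alternative to chaining several '|' comparisons on one column
same_as_either = df_regions[df_regions["Region"].isin(["Southern Asia", "Caribbean"])]

print(both.index.tolist())
print(either.index.tolist())
print(not_asia.index.tolist())
```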

print ('data dimensions:', df_can.shape)
print(df_can.columns)
df_can.head(2)

data dimensions: (195, 38)
Index(['Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', 'Total'],
      dtype='object')

2 rows × 38 columns

3. Data Visualization with Matplotlib

# a leading % marks a Jupyter Notebook magic command; inline renders plots inside the notebook
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

print ('Matplotlib version: ', mpl.__version__) # version 2.0.0 or later is required

Matplotlib version:  3.0.2

print(plt.style.available)
mpl.style.use(['ggplot']) # optional: for ggplot-like style

['bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn', 'Solarize_Light2', 'tableau-colorblind10', '_classic_test']
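mpl.style.use changes the style for every later plot in the session. If a style should apply to a single figure only, plt.style.context can be used as a context manager, as in this sketch. The Agg backend is selected here only so that the snippet also runs outside a notebook, and the three data points are the 1980 to 1982 values for Haiti shown in this exercise.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Fall back to the default style if ggplot is not available
style_name = "ggplot" if "ggplot" in plt.style.available else "default"

with plt.style.context(style_name):
    fig, ax = plt.subplots()
    ax.plot([1980, 1981, 1982], [1666, 3692, 3498])
    ax.set_title("Styled only inside the with-block")

plt.close(fig)  # release the figure once it is no longer needed
```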

haiti = df_can.loc['Haiti', years] # years was defined in the previous section; passing 1980 - 2013 excludes the 'Total' column
haiti.head()

1980    1666
1981    3692
1982    3498
1983    2860
1984    1418
Name: Haiti, dtype: object

haiti.plot()

haiti.index = haiti.index.map(int) # convert the index to integers so the x-axis shows the years as numbers
haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.show() # this line displays the figure

[Figure: output_41_0.png, line plot "Immigration from Haiti" (x-axis: Years, y-axis: Number of immigrants)]

haiti.plot(kind='line')

# labels can be added to the figure
plt.title('Immigration from Haiti')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

# text can also be placed at a given position in the plot
# syntax: plt.text(x, y, label)
plt.text(2000, 6000, '2010 Earthquake') # the label starts at x=2000 (year), y=6000 (immigrants)

plt.show()

[Figure: output_42_0.png, line plot "Immigration from Haiti" with the text label "2010 Earthquake"]
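plt.text places a label at fixed coordinates that have to be chosen by hand. As an alternative, the coordinates can be derived from the data and the point marked with an arrow using Axes.annotate. The sketch below does this for the peak of a short series; haiti_demo is a hypothetical name holding only the 1980 to 1984 values for Haiti shown above, and the Agg backend is chosen only so the snippet runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

haiti_demo = pd.Series(
    [1666, 3692, 3498, 2860, 1418],
    index=[1980, 1981, 1982, 1983, 1984],
    name="Haiti",
)

# Find the year with the highest value and the value itself
peak_year = haiti_demo.idxmax()
peak_value = haiti_demo.max()

fig, ax = plt.subplots()
haiti_demo.plot(kind="line", ax=ax)
ax.set_title("Immigration from Haiti")
ax.set_xlabel("Years")
ax.set_ylabel("Number of immigrants")

# Draw an arrow from the label position (xytext) to the data point (xy)
ax.annotate(
    "peak",
    xy=(peak_year, peak_value),
    xytext=(peak_year + 1, peak_value),
    arrowprops={"arrowstyle": "->"},
)

plt.close(fig)
```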

The following code plots immigration from China and India.

df_CI=df_can.loc[['China','India'],years]
df_CI.head()

1980	1981	1982	1983	1984	1985	1986	1987	1988	1989	…	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013
China	5123	6682	3308	1863	1527	1816	1960	2643	2758	4323	…	36619	42584	33518	27642	30037	29622	30391	28502	33024	34129
India	8880	8670	8147	7338	5704	4211	7150	10189	11522	10343	…	28235	36210	33848	28742	28261	29456	34235	27509	30933	33087

2 rows × 34 columns

df_CI.plot(kind='line')

<matplotlib.axes._subplots.AxesSubplot at 0x56c6d90>

[Figure: output_45_1.png, line plot of df_CI before transposing]

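The plot above is clearly wrong. pandas puts the index on the x-axis and draws one line per column, so here the two countries end up on the x-axis and each year becomes a separate line. This is a common problem, and the fix is to transpose the DataFrame first with the transpose method.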

df_CI = df_CI.transpose()
df_CI.head()

China	India
1980	5123	8880
1981	6682	8670
1982	3308	8147
1983	1863	7338
1984	1527	5704
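One way to see what transposing changes is to count the lines that pandas draws before and after. The sketch below uses the 1980 to 1984 values from the table above; df_wide and df_long are hypothetical names, and the Agg backend is selected only so the snippet runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

years_demo = ["1980", "1981", "1982", "1983", "1984"]
df_wide = pd.DataFrame(
    [[5123, 6682, 3308, 1863, 1527], [8880, 8670, 8147, 7338, 5704]],
    index=["China", "India"],
    columns=years_demo,
)

# Countries as rows: one line per year column, countries on the x-axis
ax_wide = df_wide.plot(kind="line")
n_lines_wide = len(ax_wide.get_lines())

# Years as rows: one line per country, years on the x-axis
df_long = df_wide.transpose()
df_long.index = df_long.index.map(int)
ax_long = df_long.plot(kind="line")
n_lines_long = len(ax_long.get_lines())

print(n_lines_wide, n_lines_long)
plt.close("all")
```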

df_CI.index = df_CI.index.map(int)
df_CI.plot(kind='line')

plt.title('Immigration from China and India')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()

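[Figure: output_48_0.png, line plot "Immigration from China and India" (x-axis: Years, y-axis: Number of Immigrants)]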

df_can.sort_values(by='Total', ascending=False, axis=0, inplace=True)
df_top5 = df_can.head(5)
df_top5 = df_top5[years].transpose() 
print(df_top5)

df_top5.index = df_top5.index.map(int) # let's change the index values of df_top5 to type integer for plotting
df_top5.plot(kind='line', figsize=(14, 8)) # pass a tuple (x, y) size
plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
plt.show()

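[Figure: output_50_0.png, line plot "Immigration Trend of Top 5 Countries" (x-axis: Years, y-axis: Number of Immigrants)]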
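Sorting the whole DataFrame and then taking head(5) modifies df_can in place. If the original row order should be kept, DataFrame.nlargest is one possible alternative. The sketch below uses a hypothetical four-country frame built from the 1980 to 1984 values shown in this exercise, so its Total column covers only those five years and differs from the totals in the full dataset.

```python
import pandas as pd

years_demo = [str(y) for y in range(1980, 1985)]
df_demo = pd.DataFrame(
    [
        [5123, 6682, 3308, 1863, 1527],
        [8880, 8670, 8147, 7338, 5704],
        [1666, 3692, 3498, 2860, 1418],
        [701, 756, 598, 309, 246],
    ],
    index=["China", "India", "Haiti", "Japan"],
    columns=years_demo,
)
df_demo["Total"] = df_demo[years_demo].sum(axis=1)

# Pick the two rows with the largest totals without reordering df_demo itself
top2 = df_demo.nlargest(2, "Total")

# Reshape for plotting: years as rows, countries as columns
top2_by_year = top2[years_demo].transpose()
top2_by_year.index = top2_by_year.index.map(int)

print(top2["Total"])
print(top2_by_year)
```

The resulting top2_by_year frame has the same layout as df_top5 above and can be passed to plot(kind='line') in the same way.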