Python Machine Learning Basics: Using the Pandas Library

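Note: the code runs on Python 3. Python 3 and Python 2 differ in some details, so please keep that in mind. This post is code-centric, and the code carries detailed comments. Related articles will be published in my personal blog column 《Python从入门到深度学习》 (Python from Beginner to Deep Learning); feel free to follow it.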

4. Python Machine Learning Basics: Using the Pandas Library

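Pandas is a data analysis package for Python. Development began at AQR Capital Management in April 2008, and the project was open-sourced at the end of 2009. It is now developed and maintained by the PyData development team, which focuses on Python data packages, and it is part of the PyData project. Pandas was originally built as a tool for financial data analysis, which is why it offers strong support for time-series analysis.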

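The name Pandas comes from panel data and Python data analysis. Panel data is an economics term for multidimensional data sets. Early versions of Pandas also provided a Panel data type; it was later deprecated and removed in pandas 0.25, and multidimensional data is now usually handled with hierarchical indexing, which is covered near the end of this post. Now let's begin Lecture 4: Python Machine Learning Basics: Using the Pandas Library.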
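As a small illustration of the time-series support that Pandas grew out of, here is a minimal sketch, assuming a recent pandas release. The dates and prices are made-up values used only for demonstration.

```python
import pandas as pd

# Hypothetical daily prices indexed by calendar dates
dates = pd.date_range("2024-01-01", periods=6, freq="D")
prices = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 10.9], index=dates)

# Select a date range by label; both endpoints are included
first_three = prices.loc["2024-01-01":"2024-01-03"]

# Downsample into 2-day buckets and average each bucket
two_day_mean = prices.resample("2D").mean()

# 3-day rolling mean; the first two entries are NaN because the window is incomplete
rolling_mean = prices.rolling(window=3).mean()

print(first_three)
print(two_day_mean)
print(rolling_mean)
```

The date index is what makes label-based slicing, resampling, and rolling windows work without any manual bookkeeping.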

[Code]

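'''
Machine learning basics: a simple introduction to the Pandas module
'''

# Import the required packages
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

# Pandas data structures:
# Series: a one-dimensional, array-like object made up of a set of data (of any NumPy dtype)
# and an associated set of data labels (the index). A simple Series can be built from data alone.
# DataFrame: a tabular data structure with an ordered set of columns, where each column can hold
# a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and
# a column index, and can be viewed as a dictionary of Series.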

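# Create a Series from a one-dimensional array
arr = np.array([1, 2, 3, 4, 5])
series01 = Series(arr)
print("Series created from a one-dimensional array:", "\n", series01)
series02 = Series([2, 3, 4])
print("Series created from a list:", "\n", series02)
# If no index is specified, an integer index from 0 to N-1 (N is the length of the data) is created automatically; the default index can be replaced by assignment
series02.index = ['num1', 'num2', 'num3']
print("Series after setting the index:", "\n", series02)
# When creating a Series, explicit index labels can be passed through the index parameter
series03 = Series([3, 4, 5], index=['num1', 'num2', 'num3'])
print("Series with the index set at creation:", "\n", series03)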

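# Create a Series from a dictionary: a Series can be viewed as a fixed-length, ordered dictionary that maps index values to data values,
# so it can be built directly from a dictionary. The dictionary's keys become the index and its values become the Series values.
data_dict = {'num1': 1, 'num2': 2, 'num3': 3}  # named data_dict so the built-in dict is not shadowed
series04 = Series(data_dict)
print("Series created from a dictionary:", "\n", series04)

# NumPy array operations on a Series: NumPy array operations still work on a Series, and the mapping between index and values is preserved
print("Value by label:", series04['num1'])
# Use iloc for positional access; series04[0] on a string index is deprecated in recent pandas
print("Value by position:", series04.iloc[0])
print("series04[series04 > 1]:", series04[series04 > 1])
print("series04 / 100:", series04 / 100)
print("np.exp(series04):", np.exp(series04))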

# Missing values in a Series: Pandas uses NaN (not a number) to represent a missing or NA value
scores = Series({'Eric': 99, 'xzw': 98})
new_index = ['Eric', 'xzw', 'Tom']
scores = Series(scores, index=new_index)  # 'Tom' has no score, so its value becomes NaN
print(scores)
# Detecting missing values: the isnull and notnull functions in pandas both return a boolean Series
print("isnull:", pd.isnull(scores))
print("notnull:", pd.notnull(scores))
print("Entries that are missing:", scores[pd.isnull(scores)])
print("Entries that are not missing:", scores[pd.notnull(scores)])

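# Automatic alignment: arithmetic between Series aligns the data by index label;
# labels that appear in only one of the two Series produce NaN in the result
series01 = Series([1, 2, 3, 4], index=['s1', 's2', 's4', 's7'])
series02 = Series([2, 3, 4], index=['s2', 's4', 's5'])
series_num = series01 + series02
print("Series automatic alignment:", series_num)

# The name attribute: a Series object and its index each have a name attribute that can be set by assignment
series_num.name = 'sum'
series_num.index.name = 'index'
print("Using the name attribute:", '\n', series_num)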

# Create a DataFrame from a two-dimensional array
df01 = DataFrame([['Eric', 'xzw'], [99, 98]])
print("DataFrame created from a two-dimensional list:", '\n', df01)
df02 = DataFrame([['Eric', 99], ['xzw', 98]])
print("DataFrame created from a two-dimensional list:", '\n', df02)
arr = np.array([['Eric', 99], ['xzw', 98]])
df03 = DataFrame(arr, columns=['name', 'score'], index=['one', 'two'])
print("DataFrame with custom row and column indexes:", '\n', df03)

# Create a DataFrame from a dictionary
data = {'Eric': [99, 98, 97], 'xzw': [98, 97, 96]}
df = DataFrame(data)
print("DataFrame created from a dictionary:", '\n', df)
print("Row index of df:", df.index)
print("Column index of df:", df.columns)
print("Values of df:", df.values)
df = DataFrame(data, index=['Chinese', 'Math', 'English'])
print("With the row index specified:", '\n', df)

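# Index objects: 1. Both Series and DataFrame objects have index objects;
# 2. Index objects manage the axis labels and other metadata (such as axis names);
# 3. Through the index, you can read values from a Series or DataFrame, or reassign the value at a given position;
# 4. The automatic alignment of Series and DataFrame is carried out through the index.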

# Slice a Series by index label: the right endpoint is included, unlike positional slicing of a Python list
series01 = Series([1, 2, 3, 4, 5, 6, 7], index=['01', '02', '03', '04', '05', '06', '07'])
print("Slice of the Series by label:", '\n', series01['02':'05'])
print("Slice of the Series by label:", '\n', series01['02':])
print("Slice of the Series by label:", '\n', series01[:'05'])

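# Select values from a DataFrame by index
data = {"apart": ['1001', '1002', '1003', '1001'],
        "profits": [5645, 5665, 5645, 5648],
        "year": [2001, 2001, 2001, 2000]}
df = DataFrame(data, index=['one', 'two', 'three', 'four'])
# A column can be selected directly by its column label
print(df['year'])
# Rows are selected with iloc (by position) or loc (by label); the older ix indexer was removed in pandas 1.0
print(df.iloc[0])
# Add a column of missing values
df['pdn'] = np.nan
print(df)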

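# Common mathematical and statistical methods in Pandas:
#############################################################
#     Method      # Description
# count           # Number of non-NA values
# describe        # Summary statistics for a Series or for each DataFrame column
# min/max         # Minimum and maximum values
# argmin/argmax   # Integer positions of the minimum and maximum values
# idxmin/idxmax   # Index labels of the minimum and maximum values
# quantile        # Sample quantile (from 0 to 1)
# sum             # Sum of the values
# mean            # Mean of the values
# median          # Arithmetic median of the values (50% quantile)
# mad             # Mean absolute deviation from the mean (removed in pandas 2.0)
# var             # Sample variance of the values
# std             # Sample standard deviation of the values
# cumsum          # Cumulative sum of the values
# cummin/cummax   # Cumulative minimum and maximum of the values
# cumprod         # Cumulative product of the values
# pct_change      # Percentage change
##############################################################
print(df.describe())
# For a DataFrame, these methods operate on each column by default; pass axis=1 to apply them to each row instead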
df = DataFrame([
    [1, 2, 3],
    [2, 3, 4]
])
print("Computed per column (default):", '\n', df.count())
print("Computed per row:", '\n', df.count(axis=1))

# Correlation coefficient and covariance
df = DataFrame({"GDP": [12, 23, 34, 45, 56],
                "air_temperature": [23, 25, 26, 27, 30]},
               index=['2001', '2002', '2003', '2004', '2005'])
print(df)
print("Correlation matrix:", '\n', df.corr())
print("Covariance matrix:", '\n', df.cov())
print("Correlation coefficient:", '\n', df['GDP'].corr(df['air_temperature']))
print("Covariance:", '\n', df['GDP'].cov(df['air_temperature']))

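# Unique values, value counts, and membership:
# 1. The unique method returns an array of the unique values in a Series;
ser = Series([0, 1, 1, 1, 0])
print("Original data:", ser)
print("Unique values:", ser.unique())
df = DataFrame({"GDP": [12, 23, 34, 45, 56],
                "air_temperature": [22, 22, 21, 22, 31]},
               index=['2001', '2002', '2003', '2004', '2005'])
print("Original data:", df['air_temperature'])
print("Unique values:", df['air_temperature'].unique())
# 2. The value_counts method counts how many times each value occurs in a Series;
# by default the result is sorted by count in descending order
print(ser.value_counts())
print(df['air_temperature'].value_counts())
# 3. The isin method performs a vectorized membership test and can be used to select a subset of a Series or of a DataFrame column.
mask = ser.isin([0])
print(mask)
# Select the entries whose value is 0
print(ser[mask])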

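# Handling missing data:
####################################################################################
# Method   # Description
# dropna   # Filters (drops) axis labels depending on whether their values contain missing data; the thresh parameter adjusts how many missing values are tolerated
# fillna   # Fills missing data with a given value or a fill method (such as ffill or bfill)
# isnull   # Returns an object of boolean values indicating which values are missing
# notnull  # The negation of isnull
####################################################################################
# Detect missing values
df = DataFrame([
    ['xzw', np.nan, 456, 'M'],
    ['Eric', 34, 132, np.nan],
    ['Tom', 23, np.nan, 'F']
], columns=['name', 'age', 'salary', 'gender'])
print(df)
print(df.isnull())
print(df.notnull())
# Filter out missing data
series = Series([1, 2, np.nan, 4])
print(series.dropna())
# By default, any row containing at least one missing value is dropped
# (every row of this df has one, so the result here is empty)
print(df.dropna())
# Drop only the rows in which every value is missing
print(df.dropna(how='all'))
# Drop only the columns in which every value is missing
print(df.dropna(axis=1, how='all'))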

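# Fill missing data
df = DataFrame(np.random.randn(7, 3))
# loc slices by label and includes the right endpoint, so rows 0 through 4 are set here
# (the ix indexer used in older tutorials was removed in pandas 1.0)
df.loc[:4, 1] = np.nan
df.loc[:2, 2] = np.nan
print(df)
print(df.fillna(0))
# A dictionary fills each column with its own value; keys that match no column (3 here) are ignored
print(df.fillna({1: 0.5, 2: -1, 3: -2}))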

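# Hierarchical indexing:
# 1. An axis carries multiple (two or more) index levels;
# 2. With hierarchical indexing, Pandas can handle high-dimensional data in a low-dimensional form;
# 3. With hierarchical indexing, data can be aggregated level by level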

# Hierarchical index on a Series
data = Series([98, 99, 97, 96, 95],
              index=[['2001', '2001', '2001', '2002', '2002'],
                     ['apple', 'banana', 'watermelon', 'apple', 'watermelon']])
data.index.names = ['year', 'fruit']
print(data)

# Hierarchical index on a DataFrame
df = DataFrame({'year': [2001, 2001, 2001, 2002, 2002],
                'fruit': ['apple', 'banana', 'apple', 'banana', 'apple'],
                'production': [2345, 4556, 1223, 4589, 8978],
                'profits': [1223, 2345, 5645, 7889, 4556]})
df = df.set_index(['year', 'fruit'])
print(df)

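# Aggregate by index level
print(df.index)
# sum(level=...) was removed in pandas 2.0; group by the index level instead
print(df.groupby(level='year').sum())
print(df.groupby(level='fruit').sum())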
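
Row selection and per-level aggregation are the two parts of this walkthrough most sensitive to the pandas version: the `.ix` indexer was removed in pandas 1.0, and the `level` argument of aggregations such as `sum` was removed in pandas 2.0. The following is a minimal sketch, assuming a recent pandas release, of label-based selection with `loc`, position-based selection with `iloc`, and per-level totals with `groupby(level=...)`. The small fruit table is made up for illustration.

```python
import pandas as pd

# Illustrative data (hypothetical values), indexed by year and fruit
df = pd.DataFrame({
    "year": [2001, 2001, 2002, 2002],
    "fruit": ["apple", "banana", "apple", "banana"],
    "production": [10, 20, 30, 40],
}).set_index(["year", "fruit"])

# loc selects by label: all rows whose first index level equals 2001
rows_2001 = df.loc[2001]

# iloc selects by integer position: the first row, whatever its labels are
first_row = df.iloc[0]

# Per-level totals: group on one index level, then aggregate
by_year = df.groupby(level="year").sum()
by_fruit = df.groupby(level="fruit").sum()

print(rows_2001)
print(first_row)
print(by_year)
print(by_fruit)
```

Keeping `loc` for labels and `iloc` for positions removes the ambiguity that `.ix` had when the index itself was made of integers.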

If you ran into any problems along the way, please leave a comment and let me know what they were.
