Pandas Study Notes (1): Pandas Basics

For more articles and code, see the author's personal site: https://www.iwtmbtly.com/

Introduction to Pandas

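Pandas is a core data analysis library for Python. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. It aims to be the essential high-level building block for practical, real-world data analysis in Python, and its longer-term goal is to become the most powerful and flexible open-source data analysis tool available in any language. After years of sustained effort, Pandas is getting steadily closer to that goal.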

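Pandas is well suited to the following kinds of data:

Tabular data with heterogeneously typed columns, as in an SQL table or an Excel sheet;
Ordered and unordered (not necessarily fixed-frequency) time series data;
Matrix data with row and column labels, whether homogeneously or heterogeneously typed;
Any other form of observational or statistical data set; the data does not need to be labeled before it is placed into a Pandas data structure.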

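The two primary data structures of Pandas are Series (one-dimensional) and DataFrame (two-dimensional). Together they handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R's data.frame provides and more. Pandas is built on top of NumPy and integrates well with other third-party scientific computing libraries.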

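Pandas is something like a Swiss Army knife. Here are just some of its strengths:

Handling of missing data, represented as NaN, in both floating point and non-floating point data;
Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects;
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or you can ignore the labels and let Series and DataFrame align the data automatically during computations;
Powerful, flexible group by functionality: split-apply-combine operations on data sets, for both aggregating and transforming data;
Easy conversion of ragged, differently indexed data in Python and NumPy data structures into DataFrame objects;
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
Intuitive merging and joining of data sets;
Flexible reshaping and pivoting of data sets;
Hierarchical labeling of axes: multiple labels per tick;
Mature IO tools: reading data from text files (CSV and other delimited files), Excel files, and databases, and saving / loading data in the very fast HDF5 format;
Time series functionality: date range generation, frequency conversion, moving window statistics, and date shifting.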

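Many of these features exist to address pain points found in other languages and scientific research environments. Working with data usually falls into several stages: munging and cleaning the data, analyzing and modeling it, and then visualizing or tabulating the results. Pandas is a good fit for all of these stages.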

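Other notes:

Pandas is fast. Many of its low-level algorithms have been optimized in Cython. Generality always costs some performance, though, so a specialized tool that focuses on a single feature can be made faster than Pandas.
Pandas is a dependency of statsmodels, which makes it an important part of the statistical computing ecosystem in Python.
Pandas is widely used in finance.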

Importing Pandas

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__  # check the Pandas version
'1.3.4'

I. Reading and Writing Files

Reading

1. CSV format

>>> df = pd.read_csv("data/table.csv", encoding="utf_8_sig")
>>> df.head()  # show the first few rows
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

2. TXT format

>>> df_txt = pd.read_table("data/table.txt")
>>> df_txt
   col1 col2  col3    col4
0     2    a   1.4   apple
1     3    b   3.4  banana
2     6    c   2.5  orange
3     5    d   3.2   lemon
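
As a rough sketch of how the two readers used so far relate: read_table behaves like read_csv with a tab as the default separator. The text below is made-up sample data (not the article's data/table.txt), wrapped in io.StringIO so the snippet runs without any files on disk.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample, standing in for a .txt file on disk.
raw = "col1\tcol2\tcol3\n2\ta\t1.4\n3\tb\t3.4\n"

# read_table splits on tabs by default.
df_a = pd.read_table(io.StringIO(raw))

# read_csv gives the same result once the separator is set to a tab.
df_b = pd.read_csv(io.StringIO(raw), sep="\t")

print(df_a)
print(df_a.equals(df_b))
```

For files that use another delimiter, passing sep to either function is usually enough.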

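3. XLS or XLSX format

# Reading .xlsx requires the openpyxl package; xlrd 2.0 and later reads only legacy .xls files
>>> df_excel = pd.read_excel("data/table.xlsx")
>>> df_excel.head()  # show the first few rows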
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

Writing

1. CSV format

# df.to_csv("data/new_table.csv", index=False)  # leave the index out when saving
>>> df.to_csv("data/new_table.csv", encoding="utf_8_sig")  # Chinese text is easily garbled on write; encoding="utf_8_sig" (UTF-8 with a BOM) avoids this
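
To see why index=False is often worth setting, here is a small sketch that writes a DataFrame both ways and reads it back. The frame and its column names are invented for illustration, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical two-row frame used only for this round trip.
df_small = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})

# Default: the index is written as an unnamed first column.
buf_with_index = io.StringIO()
df_small.to_csv(buf_with_index)
buf_with_index.seek(0)
back_with_index = pd.read_csv(buf_with_index)

# index=False: only the data columns are written.
buf_no_index = io.StringIO()
df_small.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
back_no_index = pd.read_csv(buf_no_index)

print(list(back_with_index.columns))  # contains an extra 'Unnamed: 0' column
print(list(back_no_index.columns))
```

If the index does carry meaning, an alternative is to keep it in the file and pass index_col=0 when reading it back.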

2. XLS or XLSX format

# Requires the openpyxl package
>>> df.to_excel("data/new_table.xlsx", sheet_name="Sheet1")

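II. Basic Data Structures

Series

A Pandas Series is like a single column of a table: a one-dimensional labeled array that can hold any data type. A Series consists of an index and a column of values. Common operations are shown below.

1. Creating a Series

The most commonly used attributes of a Series are its values (values), index (index), name (name), and data type (dtype).

# np.random.randn(6) draws 6 random numbers from the standard normal distribution; the name string means "this is a Series"
>>> s = pd.Series(np.random.randn(6), index=['a', 'b', 'c', 'd', 'e', 'f'], name="这是一个Series", dtype='float64')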
>>> s
a    0.945151
b    2.071048
c    0.983989
d    0.032149
e    1.465196
f    0.416414
Name: 这是一个Series, dtype: float64

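2. Accessing Series attributes

>>> s.values  # the values (this output comes from a separate random draw, so it differs from the Series printed above)
array([ 0.07553884, -2.52708429,  0.90744636,  0.95449944,  1.2871629 ,
       -0.03133632])
>>> s.name  # the name
'这是一个Series'
>>> s.index  # the index
Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
>>> s.dtype  # the dtype
dtype('float64')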

3. Selecting a single element

>>> s["a"]
0.9451514803304423
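
Square brackets with a label are the shortest way to pull out one element. As a small sketch on a made-up Series (the values are not from the article), the same lookup can also be spelled out explicitly by label with loc or by integer position with iloc:

```python
import pandas as pd

# Hypothetical Series with string labels.
s_demo = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

by_label = s_demo.loc["b"]        # select by index label
by_position = s_demo.iloc[1]      # select by integer position
subset = s_demo.loc[["a", "c"]]   # a list of labels returns a Series

print(by_label, by_position)
print(subset)
```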

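4. Calling methods

>>> s.mean()
0.9856579470429936

# A Series exposes many attributes and methods (the index labels 'a' to 'f' are listed too, because labels can be accessed as attributes)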
>>> print([attr for attr in dir(s) if not attr.startswith('_')])
['T', 'a', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'argmax', 'argmin', 'argsort', 'array', 'asfreq', 'asof', 'astype', 'at', 'at_time', 'attrs', 'autocorr', 'axes', 'b', 'backfill', 'between', 'between_time', 'bfill', 'bool', 'c', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'd', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtype', 'dtypes', 'duplicated', 'e', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'f', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'interpolate', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'isin', 'isna', 'isnull', 'item', 'items', 'iteritems', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lt', 'mad', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'name', 'nbytes', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'ravel', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'sem', 'set_axis', 'set_flags', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 
'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'values', 'var', 'view', 'where', 'xs']

DataFrame

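A DataFrame is a tabular data structure that holds an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share one index.

The DataFrame constructor is:

pandas.DataFrame(data, index, columns, dtype, copy)

Parameters:

data: the input data (ndarray, Series, list, dict or other mapping, and similar types).
index: the index values, also called row labels.
columns: the column labels; defaults to RangeIndex (0, 1, 2, …, n).
dtype: the data type.
copy: whether to copy the input data. In pandas 1.3 the default is None, which behaves like copy=True for dict input and like copy=False for DataFrame or 2-D ndarray input.

A Pandas DataFrame is a two-dimensional labeled structure, similar to a 2-D array.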
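
The constructor parameters listed above can be sketched as follows. The array contents and the labels r1, r2, r3, x, y are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 3 rows and 2 columns of integers.
data = np.arange(6).reshape(3, 2)

frame = pd.DataFrame(
    data,
    index=["r1", "r2", "r3"],   # row labels
    columns=["x", "y"],         # column labels
    dtype="float64",            # force one dtype for all columns
)

# Without index/columns, both default to a RangeIndex starting at 0.
default_frame = pd.DataFrame(data)

print(frame)
print(default_frame)
```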

1. Creating a DataFrame

>>> df = pd.DataFrame({'col1': list('abcde'), 'col2': range(5, 10), 'col3': [1.3, 2.5, 3.6, 4.6, 5.8]},
...                   index=list('一二三四五'))
>>> df
  col1  col2  col3
一    a     5   1.3
二    b     6   2.5
三    c     7   3.6
四    d     8   4.6
五    e     9   5.8

2. Extracting a Series from a DataFrame

>>> df['col1']
一    a
二    b
三    c
四    d
五    e
Name: col1, dtype: object
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> type(df['col1'])
<class 'pandas.core.series.Series'>

3. Renaming rows or columns

>>> df.rename(index={'一': 'one'}, columns={'col1': 'new_col1'})
    new_col1  col2  col3
one        a     5   1.3
二          b     6   2.5
三          c     7   3.6
四          d     8   4.6
五          e     9   5.8

4. Accessing attributes and calling methods

>>> df.index
Index(['一', '二', '三', '四', '五'], dtype='object')
>>> df.columns
Index(['col1', 'col2', 'col3'], dtype='object')
>>> df.values
array([['a', 5, 1.3],
       ['b', 6, 2.5],
       ['c', 7, 3.6],
       ['d', 8, 4.6],
       ['e', 9, 5.8]], dtype=object)
>>> df.shape
(5, 3)

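5. Index alignment

This is one of the most powerful features of Pandas, and not understanding it can occasionally cause trouble.

>>> df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
>>> df2 = pd.DataFrame({'A': [1, 2, 3]}, index=[3, 1, 2])
>>> df1 - df2  # rows are paired by index label, not by position, so the result is not all zeros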
   A
1 -1
2 -1
3  2
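
A minimal sketch contrasting label-based and position-based arithmetic, using the same two frames as the example above (renamed left and right here). The third frame, with a single row, is an added illustration of what happens when a label exists on only one side.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=[1, 2, 3])
right = pd.DataFrame({"A": [1, 2, 3]}, index=[3, 1, 2])

# Label-based: each row of `left` is paired with the row of `right`
# that carries the same index label.
by_label = left - right

# Position-based: drop to the underlying NumPy arrays, which have no labels.
by_position = left["A"].to_numpy() - right["A"].to_numpy()

# Labels present on only one side produce NaN in the result.
partial = left - pd.DataFrame({"A": [10]}, index=[1])

print(by_label)
print(by_position)
print(partial)
```

When positional behaviour is what you want, converting to arrays (or resetting the index on both sides first) makes that intent explicit.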

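6. Deleting and adding columns

To delete, you can use the drop method, the del statement, or the pop method.

>>> # drop the row labeled '五' and the column 'col1'
>>> df.drop(index='五', columns='col1')  # drop returns a new DataFrame; with inplace=True it modifies the original instead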
   col2  col3
一     5   1.3
二     6   2.5
三     7   3.6
四     8   4.6

>>> df['col1'] = [1, 2, 3, 4, 5]
>>> del df['col2']
>>> df
   col1  col3
一     1   1.3
二     2   2.5
三     3   3.6
四     4   4.6
五     5   5.8

The pop method modifies the original DataFrame in place and returns the removed column, much like pop on Python's built-in containers.

>>> df['col2'] = [6, 7, 8, 9, 10]
>>> df.pop('col1')
一    1
二    2
三    3
四    4
五    5
Name: col1, dtype: int64

>>> df
   col3  col2
一   1.3     6
二   2.5     7
三   3.6     8
四   4.6     9
五   5.8    10
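
The three ways of deleting differ in whether they touch the original object. The following sketch, on a small invented frame, records the column list after each step so the difference is visible:

```python
import pandas as pd

# Hypothetical frame with three columns.
frame = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

dropped = frame.drop(columns="col1")    # returns a new DataFrame
cols_after_drop = list(frame.columns)   # the original still has all three columns

popped = frame.pop("col2")              # removes col2 in place and returns it
cols_after_pop = list(frame.columns)

del frame["col3"]                       # removes col3 in place, returns nothing
cols_after_del = list(frame.columns)

print(cols_after_drop, cols_after_pop, cols_after_del)
```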

A new column can be added by direct assignment or with the assign method.

>>> df1['B'] = list('abc')
>>> df1
   A  B
1  1  a
2  2  b
3  3  c

>>> df1.assign(C=pd.Series(list('def')))  # Why does NaN appear? (Hint: index alignment.) The two sides of assign have different indexes; which one decides the index of the result?
   A  B    C
1  1  a    e
2  2  b    f
3  3  c  NaN

Note that assign does not modify the original DataFrame.

>>> df1
   A  B
1  1  a
2  2  b
3  3  c
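
One way to read the assign example: the DataFrame's index decides the result, and a Series passed in is aligned to it. The sketch below rebuilds the same frame under the name frame and compares three kinds of input. Passing a plain list and passing a callable are additions for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": list("abc")}, index=[1, 2, 3])
new_col = pd.Series(list("def"))   # default index 0, 1, 2

# A Series is aligned on the DataFrame's index, so label 3 gets NaN.
aligned = frame.assign(C=new_col)

# A plain list has no labels and is assigned by position.
positional = frame.assign(C=new_col.tolist())

# A callable receives the DataFrame and computes the new column from it.
computed = frame.assign(D=lambda d: d["A"] * 2)

print(aligned)
print(positional)
print(computed)
```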

7. Selecting columns by dtype

>>> df.select_dtypes(include=['number']).head()
   col3  col2
一   1.3     6
二   2.5     7
三   3.6     8
四   4.6     9
五   5.8    10

>>> df.select_dtypes(include=['float']).head()
   col3
一   1.3
二   2.5
三   3.6
四   4.6
五   5.8
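
select_dtypes also takes an exclude argument and accepts several dtypes at once. A small sketch on an invented frame with one string, one integer, one float, and one boolean column:

```python
import pandas as pd

# Hypothetical frame with four differently typed columns.
frame = pd.DataFrame({
    "name": ["a", "b"],
    "count": [1, 2],
    "ratio": [0.5, 1.5],
    "flag": [True, False],
})

numeric_cols = list(frame.select_dtypes(include="number").columns)
non_numeric_cols = list(frame.select_dtypes(exclude="number").columns)
int_or_bool_cols = list(frame.select_dtypes(include=["int64", "bool"]).columns)

print(numeric_cols)
print(non_numeric_cols)
print(int_or_bool_cols)
```

Note that boolean columns are not treated as "number" here.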

8. Converting a Series to a DataFrame

>>> s = df.mean()
>>> s.name = 'to_DataFrame'
>>> s
col3    3.56
col2    8.00
Name: to_DataFrame, dtype: float64

>>> s.to_frame()
      to_DataFrame
col3          3.56
col2          8.00

Use the T attribute to transpose.

>>> s.to_frame().T
              col3  col2
to_DataFrame  3.56   8.0

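III. Commonly Used Basic Functions

From here on, including in all later chapters, we will use this data set.

>>> df = pd.read_csv("data/table.csv")

1. head and tail

>>> df.head()  # show the first rows of the data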
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

>>> df.tail()  # show the last rows of the data
   School Class    ID Gender   Address  Height  Weight  Math Physics
30    S_2   C_4  2401      F  street_2     192      62  45.3       A
31    S_2   C_4  2402      M  street_7     166      82  48.7       B
32    S_2   C_4  2403      F  street_6     158      60  59.7      B+
33    S_2   C_4  2404      F  street_2     160      84  67.7       B
34    S_2   C_4  2405      F  street_6     193      54  47.6       B

The n argument sets how many rows are shown.

>>> df.head(3)
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+

2. unique and nunique

nunique returns the number of distinct values; unique returns the distinct values themselves.

>>> df['Physics'].nunique()
7
>>> df['Physics'].unique()
array(['A+', 'B+', 'B-', 'A-', 'B', 'A', 'C'], dtype=object)

3. count and value_counts

count returns the number of non-missing elements; value_counts returns how many times each value occurs.

>>> df['Physics'].count()
35
>>> df['Physics'].value_counts()
B+    9
B     8
B-    6
A     4
A+    3
A-    3
C     2
Name: Physics, dtype: int64
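
The Physics column has no missing values, so the effect of missing data is not visible there. As a sketch on a made-up Series that does contain one NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical grades with one missing entry.
grades = pd.Series(["A", "B", "B", np.nan, "A", "B"])

n_total = len(grades)                                 # every element, including NaN
n_valid = grades.count()                              # skips NaN
counts = grades.value_counts()                        # NaN is left out by default
counts_with_nan = grades.value_counts(dropna=False)   # NaN gets its own row
shares = grades.value_counts(normalize=True)          # fractions of the non-missing values

print(n_total, n_valid)
print(counts)
print(counts_with_nan)
print(shares)
```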

4. describe and info

info reports which columns exist, how many non-missing values each has, and the dtype of each column.

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   School   35 non-null     object
 1   Class    35 non-null     object
 2   ID       35 non-null     int64
 3   Gender   35 non-null     object
 4   Address  35 non-null     object
 5   Height   35 non-null     int64
 6   Weight   35 non-null     int64
 7   Math     35 non-null     float64
 8   Physics  35 non-null     object
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB

By default, describe computes summary statistics for the numeric columns.

>>> df.describe()
               ID      Height      Weight       Math
count    35.00000   35.000000   35.000000  35.000000
mean   1803.00000  174.142857   74.657143  61.351429
std     536.87741   13.541098   12.895377  19.915164
min    1101.00000  155.000000   53.000000  31.500000
25%    1204.50000  161.000000   63.000000  47.400000
50%    2103.00000  173.000000   74.000000  61.700000
75%    2301.50000  187.500000   82.000000  77.100000
max    2405.00000  195.000000  100.000000  97.000000

You can choose the percentiles yourself.

>>> df.describe(percentiles=[.05, .25, .75, .95])
               ID      Height      Weight       Math
count    35.00000   35.000000   35.000000  35.000000
mean   1803.00000  174.142857   74.657143  61.351429
std     536.87741   13.541098   12.895377  19.915164
min    1101.00000  155.000000   53.000000  31.500000
5%     1102.70000  157.000000   56.100000  32.640000
25%    1204.50000  161.000000   63.000000  47.400000
50%    2103.00000  173.000000   74.000000  61.700000
75%    2301.50000  187.500000   82.000000  77.100000
95%    2403.30000  193.300000   97.600000  90.040000
max    2405.00000  195.000000  100.000000  97.000000

describe also works on non-numeric data.

>>> df['Physics'].describe()
count     35
unique     7
top       B+
freq       9
Name: Physics, dtype: object

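5. idxmax and nlargest

idxmax returns the index label of the maximum value, which is especially handy in some situations; idxmin does the same for the minimum. nlargest returns the n largest values, and nsmallest is its counterpart for the smallest.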

>>> df['Math'].idxmax()
5
>>> df['Math'].nlargest(3)
5     97.0
28    95.5
11    87.7
Name: Math, dtype: float64
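
Because idxmax returns a label rather than a position, it pairs naturally with loc to fetch the whole row. A sketch on a small invented frame (string labels w to z are used to make the label/position distinction obvious):

```python
import pandas as pd

# Hypothetical scores with a non-numeric index.
scores = pd.DataFrame(
    {"ID": [1101, 1102, 1103, 1104], "Math": [34.0, 32.5, 87.2, 80.4]},
    index=["w", "x", "y", "z"],
)

top_label = scores["Math"].idxmax()    # index label of the maximum, not its position
top_row = scores.loc[top_label]        # the full row for that label
top_two = scores.nlargest(2, "Math")   # DataFrame form: rows with the 2 largest Math values

print(top_label)
print(top_row)
print(top_two)
```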

6. clip and replace

clip and replace are two kinds of substitution functions.

clip truncates values that fall above or below given bounds.

>>> df['Math'].head()
0    34.0
1    32.5
2    87.2
3    80.4
4    84.8
Name: Math, dtype: float64

>>> df['Math'].clip(33, 80).head()
0    34.0
1    33.0
2    80.0
3    80.0
4    80.0
Name: Math, dtype: float64

replace substitutes specific values.

>>> df['Address'].head()
0    street_1
1    street_2
2    street_2
3    street_2
4    street_4
Name: Address, dtype: object

>>> df['Address'].replace(['street_1', 'street_2'], ['one', 'two']).head()
0         one
1         two
2         two
3         two
4    street_4
Name: Address, dtype: object

>>> df.replace({'Address': {'street_1': 'one', 'street_2': 'two'}}).head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M       one     173      63  34.0      A+
1    S_1   C_1  1102      F       two     192      73  32.5      B+
2    S_1   C_1  1103      M       two     186      82  87.2      B+
3    S_1   C_1  1104      F       two     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+
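
Two variations that may be useful, sketched on made-up data: clip with only one bound, and replace on a Series with a dict that maps old values to new ones.

```python
import pandas as pd

# Hypothetical values, similar in spirit to the Math and Address columns.
math = pd.Series([34.0, 32.5, 87.2, 80.4])

floor_only = math.clip(lower=33)   # raise values below 33, leave the rest
cap_only = math.clip(upper=80)     # lower values above 80, leave the rest

address = pd.Series(["street_1", "street_2", "street_4"])
renamed = address.replace({"street_1": "one", "street_2": "two"})  # dict form: old -> new

print(floor_only.tolist())
print(cap_only.tolist())
print(renamed.tolist())
```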

7. The apply function

apply is a very flexible function. On a Series, it applies a function to each value in turn:

>>> df['Math'].apply(lambda x: str(x) + '!').head()  # a lambda works here, and so does a named function
0    34.0!
1    32.5!
2    87.2!
3    80.4!
4    84.8!
Name: Math, dtype: object

On a DataFrame, with the default axis=0, it applies the function to each column in turn:

>>> df.apply(lambda x: x.apply(lambda y: str(y) + '!')).head()  # a slightly more involved example that helps show what apply does
  School Class     ID Gender    Address Height Weight   Math Physics
0   S_1!  C_1!  1101!     M!  street_1!   173!    63!  34.0!     A+!
1   S_1!  C_1!  1102!     F!  street_2!   192!    73!  32.5!     B+!
2   S_1!  C_1!  1103!     M!  street_2!   186!    82!  87.2!     B+!
3   S_1!  C_1!  1104!     F!  street_2!   167!    81!  80.4!     B-!
4   S_1!  C_1!  1105!     F!  street_4!   159!    64!  84.8!     B+!

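In Pandas, the axis argument always names the direction along which the operation runs, not the axis that is kept. With axis='index' (equivalently axis=0), the function is applied to each column in turn and aggregates down the rows; axis=1 (equivalently axis='columns') works the same way with the roles swapped, applying the function to each row across the columns.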
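
A minimal sketch of the two axis settings on an invented two-column frame. With axis=0 the function sees one column at a time; with axis=1 it sees one row at a time.

```python
import pandas as pd

# Hypothetical frame: 3 rows, 2 numeric columns.
frame = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function receives each column as a Series,
# so the result has one entry per column.
per_column = frame.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series,
# so the result has one entry per row.
per_row = frame.apply(lambda row: row["a"] + row["b"], axis=1)

print(per_column)
print(per_row)
```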

IV. Sorting

1. Sorting by index

>>> df.set_index('Math').head()  # set_index turns a column into the index
     School Class    ID Gender   Address  Height  Weight Physics
Math
34.0    S_1   C_1  1101      M  street_1     173      63      A+
32.5    S_1   C_1  1102      F  street_2     192      73      B+
87.2    S_1   C_1  1103      M  street_2     186      82      B+
80.4    S_1   C_1  1104      F  street_2     167      81      B-
84.8    S_1   C_1  1105      F  street_4     159      64      B+

>>> df.set_index('Math').sort_index().head()  # the ascending parameter can be set; the default is True (ascending)
     School Class    ID Gender   Address  Height  Weight Physics
Math
31.5    S_1   C_3  1301      M  street_4     161      68      B+
32.5    S_1   C_1  1102      F  street_2     192      73      B+
32.7    S_2   C_3  2302      M  street_5     171      88       A
33.8    S_1   C_2  1204      F  street_5     162      63       B
34.0    S_1   C_1  1101      M  street_1     173      63      A+

2. Sorting by values

>>> df.sort_values(by='Class').head()  # ascending by default
   School Class    ID Gender   Address  Height  Weight  Math Physics
0     S_1   C_1  1101      M  street_1     173      63  34.0      A+
19    S_2   C_1  2105      M  street_4     170      81  34.2       A
18    S_2   C_1  2104      F  street_5     159      97  72.2      B+
16    S_2   C_1  2102      F  street_6     161      61  50.6      B+
15    S_2   C_1  2101      M  street_7     174      84  83.3       C

Sorting by multiple columns: rows are ordered by the first column, and ties in the first column are broken by the second.

>>> df.sort_values(by=['Address', 'Height']).head()
   School Class    ID Gender   Address  Height  Weight  Math Physics
0     S_1   C_1  1101      M  street_1     173      63  34.0      A+
11    S_1   C_3  1302      F  street_1     175      57  87.7      A-
23    S_2   C_2  2204      M  street_1     175      74  47.2      B-
33    S_2   C_4  2404      F  street_2     160      84  67.7       B
3     S_1   C_1  1104      F  street_2     167      81  80.4      B-
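
When sorting by several columns, ascending can also be a list with one flag per column. A sketch on made-up data that sorts Address ascending and Height descending within each Address:

```python
import pandas as pd

# Hypothetical rows, loosely modeled on the Address and Height columns.
frame = pd.DataFrame({
    "Address": ["street_2", "street_1", "street_1", "street_2"],
    "Height": [160, 175, 173, 167],
})

# Address ascending, then Height descending within each Address.
ordered = frame.sort_values(by=["Address", "Height"], ascending=[True, False])

print(ordered)
```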