Pandas DataFrame 是一个包含二维数据及其对应索引的结构。DataFrame 广泛用于数据科学、机器学习、科学计算和许多其他数据密集型领域。

DataFrame 类似于SQL 表或在 Excel 中使用的电子表格。在许多情况下DataFrame 比表格或电子表格更快、更易于使用且功能更强大。

在这里插入图片描述

文章目录

Pandas DataFrame
创建 DataFrame
- 使用 Dict 创建
- 使用 List 创建
- 使用 NumPy 数组创建
- 文件读取创建
检索索引和数据
- 索引作为序列
- 数据转为 NumPy 数组
- 数据类型
- DataFrame 大小
访问和修改数据
- 使用访问器获取数据
- 使用访问器设置数据
插入和删除数据
- 插入和删除行
- 插入和删除列
应用算术运算
应用 NumPy 和 SciPy 函数
DataFrame 进行排序
DataFrame 过滤数据
DataFrame 数据统计
DataFrame 处理缺失数据
- 缺失数据进行计算
- 填充缺失数据
- 删除缺少数据的行和列
遍历 DataFrame
DataFrame 时间序列
- 使用时间序列创建index
- 索引和切片
- 重采样
- 窗口滚动
DataFrame 绘图

Pandas DataFrame

Pandas DataFrame 是包含以二维、行和列组织的数据、对应于行和列的索引的数据结构。

例如构建一个面试求职者的数据框。其中列数据包含姓名、城市、年龄和笔试分数。

-	name	city	age	py-score
101	Xavier	Mexico City	41	88.0
102	Ann	Toronto	28	79.0
103	Jana	Prague	33	81.0
104	Yi	Shanghai	34	80.0
105	Robin	Manchester	38	68.0
106	Amal	Cairo	31	61.0
107	Nori	Osaka	37	84.0

使用字典的方式创建DataFrame。

import pandas as pd

data = {
    
    
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

row_labels = [101, 102, 103, 104, 105, 106, 107]

df = pd.DataFrame(data=data, index=row_labels)

在这里插入图片描述
设定条件查询数据的前 N 行或者后 N 行内容。

df.head(2)
       name         city  age  py-score
101  Xavier  Mexico City   41      88.0
102     Ann      Toronto   28      79.0

df.tail(2)
     name   city  age  py-score
106  Amal  Cairo   31      61.0
107  Nori  Osaka   37      84.0

查看某列数据的话直接使用字典取值的方式获取即可。

cities = df['city']
cities 

101    Mexico City
102        Toronto
103         Prague
104       Shanghai
105     Manchester
106          Cairo
107          Osaka
Name: city, dtype: object

也可以像获取类实例的属性一样访问该列数据。

df.city
101    Mexico City
102        Toronto
103         Prague
104       Shanghai
105     Manchester
106          Cairo
107          Osaka
Name: city, dtype: object

上面两种方法对比一下获取的数据是一样的。
在这里插入图片描述
Pandas DataFrame 的每一列都是一个 pandas.Series 实例，保存一维数据及其索引的结构。可以像使用字典一样获取对象的单个项目，Series 方法是使用其索引作为键。

cities[102]
'Toronto'

可以使用 .loc[] 访问器访问整行数据。

df.loc[103]
name          Jana
city        Prague
age             33
py-score        81
Name: 103, dtype: object

type(df.loc[103])
pandas.core.series.Series

label 对应的行103，其中包含对应行数据之外，还提取了相应列的索引，返回的行也是一个 pandas.Series 实例。
在这里插入图片描述

创建 DataFrame

分别使用不同的方式创建DataFrame，创建之前先要导入对应的三方库。

import numpy as np
import pandas as pd

使用 Dict 创建

data = {
    
    'x': [1, 2, 3], 'y': np.array([2, 4, 8]), 'z': 100}
pd.DataFrame(data)

   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100

可以用 columns参数控制列的顺序，用index控制行索引的顺序。

pd.DataFrame(d, index=[100, 200, 300], columns=['z', 'y', 'x'])
       z  y  x
100  100  2  1
200  100  4  2
300  100  8  3

使用 List 创建

字典键是列索引，字典值是 DataFrame 中的数据值。

l = [{
    
    'x': 1, 'y': 2, 'z': 100},
     {
    
    'x': 2, 'y': 4, 'z': 100},
     {
    
    'x': 3, 'y': 8, 'z': 100}]

pd.DataFrame(l)
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100

还可以使用嵌套列表或列表列表作为数据值，并且创建时需要指明行、列索引。元组和列表创建的方式相同。

l = [[1, 2, 100],
     [2, 4, 100],
     [3, 8, 100]]

pd.DataFrame(l, columns=['x', 'y', 'z'])
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100

使用 NumPy 数组创建

arr = np.array([[1, 2, 100],
                [2, 4, 100],
                [3, 8, 100]])

df_ = pd.DataFrame(arr, columns=['x', 'y', 'z'])
df_
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100

文件读取创建

可以在多种文件类型（包括 CSV、Excel、SQL、JSON 等）中保存和加载Pandas DataFrame 中的数据和索引。

先将生成的数据保存到不同的文件中。

import pandas as pd

data = {
    
    
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

row_labels = [101, 102, 103, 104, 105, 106, 107]

df = pd.DataFrame(data=data, index=row_labels)

df.to_csv('data.csv')
df.to_excel('data.xlsx')

检索索引和数据

创建 DataFrame 后可以进行一些检索、修改操作。

索引作为序列

df.index
Int64Index([1, 2, 3, 4, 5, 6, 7], dtype='int64')

df.columns
Index(['name', 'city', 'age', 'py-score'], dtype='object')

df.columns[1]
'city'

用序列修改索引。

df.index = np.arange(10, 17)

df.index
Int64Index([10, 11, 12, 13, 14, 15, 16], dtype='int64')

df

在这里插入图片描述

数据转为 NumPy 数组

转化之后取值方式同List操作。

df.to_numpy()
array([['Xavier', 'Mexico City', 41, 88.0],
       ['Ann', 'Toronto', 28, 79.0],
       ['Jana', 'Prague', 33, 81.0],
       ['Yi', 'Shanghai', 34, 80.0],
       ['Robin', 'Manchester', 38, 68.0],
       ['Amal', 'Cairo', 31, 61.0],
       ['Nori', 'Osaka', 37, 84.0]], dtype=object)

在这里插入图片描述

数据类型

数据值的类型，也称为数据类型或 dtypes，决定了 DataFrame 使用的内存量，以及计算速度和精度水平。

查看数据类型。

df.dtypes
name         object
city         object
age           int64
py-score    float64
dtype: object

使用.astype() 更改数据类型。

df_ = df.astype(dtype={
    
    'age': np.int32, 'py-score': np.float32})
df_.dtypes
name         object
city         object
age           int32
py-score    float32
dtype: object

DataFrame 大小

.ndim、.size和.shape分别返回维度数、每个维度上的数据值数和数据值总数。

df_.ndim
2

df_.shape
(7, 4)

df_.size
28

访问和修改数据

使用访问器获取数据

除了.loc[] 可以使用通过索引获取行或列的访问器之外，Pandas 还提供了访问器.iloc[]，它通过整数索引检索行或列。

df.loc[10]
name             Xavier
city        Mexico City
age                  41
py-score             88
Name: 10, dtype: object

df.iloc[0]
name             Xavier
city        Mexico City
age                  41
py-score             88
Name: 10, dtype: object

.loc[] 和 .iloc[] 支持切片和 NumPy 的索引操作。

df.loc[:, 'city']
10    Mexico City
11        Toronto
12         Prague
13       Shanghai
14     Manchester
15          Cairo
16          Osaka
Name: city, dtype: object

df.iloc[:, 1]
10    Mexico City
11        Toronto
12         Prague
13       Shanghai
14     Manchester
15          Cairo
16          Osaka
Name: city, dtype: object

提供切片以及列表或数组而不是索引来获取多行或多列。

df.loc[11:15, ['name', 'city']]
     name        city
11    Ann     Toronto
12   Jana      Prague
13     Yi    Shanghai
14  Robin  Manchester
15   Amal       Cairo

df.iloc[1:6, [0, 1]]
     name        city
11    Ann     Toronto
12   Jana      Prague
13     Yi    Shanghai
14  Robin  Manchester
15   Amal       Cairo

.iloc[] 使用与切片元组、列表和 NumPy 数组相同的方式跳过行和列。

df.iloc[1:6:2, 0]
11     Ann
13      Yi
15    Amal
Name: name, dtype: object

使用 Python 内置的 **slice()**类，numpy.s_[] 或者 pd.IndexSlice[]。

df.iloc[slice(1, 6, 2), 0]
11     Ann
13      Yi
15    Amal
Name: name, dtype: object

df.iloc[np.s_[1:6:2], 0]
11     Ann
13      Yi
15    Amal
Name: name, dtype: object

df.iloc[pd.IndexSlice[1:6:2], 0]
11     Ann
13      Yi
15    Amal
Name: name, dtype: object

使用 .loc[] 和 .iloc[] 获取特定的数据值。但只需要一个值时建议使用专门的访问器 .at[] 和 .iat[]。

df.at[12, 'name']
'Jana'

df.iat[2, 0]
'Jana'

使用访问器设置数据

可以使用访问器通过传递 Python 序列、NumPy 数组或单个值来修改数据。

df.loc[:, 'py-score']
10    88.0
11    79.0
12    81.0
13    80.0
14    68.0
15    61.0
16    84.0
Name: py-score, dtype: float64

df.loc[:13, 'py-score'] = [40, 50, 60, 70]
df.loc[14:, 'py-score'] = 0

df['py-score']
10    40.0
11    50.0
12    60.0
13    70.0
14     0.0
15     0.0
16     0.0
Name: py-score, dtype: float64

使用负索引 .iloc[] 来访问或修改数据。

df.iloc[:, -1] = np.array([88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0])

>>> df['py-score']
10    88.0
11    79.0
12    81.0
13    80.0
14    68.0
15    61.0
16    84.0
Name: py-score, dtype: float64

插入和删除数据

Pandas 提供了几种方便的方法来插入和删除行或列。

插入和删除行

创建一个插入的新数据。

john = pd.Series(data=['John', 'Boston', 34, 79],index=df.columns, name=17)
john
name          John
city        Boston
age             34
py-score        79
Name: 17, dtype: object

使用 df.append() 加入新的数据。

df = df.append(john)
df
      name         city  age  py-score
10  Xavier  Mexico City   41      88.0
11     Ann      Toronto   28      79.0
12    Jana       Prague   33      81.0
13      Yi     Shanghai   34      80.0
14   Robin   Manchester   38      68.0
15    Amal        Cairo   31      61.0
16    Nori        Osaka   37      84.0
17    John       Boston   34      79.0

使用 df.drop() 删除新的数据。

df = df.drop(labels=[17])
df
      name         city  age  py-score
10  Xavier  Mexico City   41      88.0
11     Ann      Toronto   28      79.0
12    Jana       Prague   33      81.0
13      Yi     Shanghai   34      80.0
14   Robin   Manchester   38      68.0
15    Amal        Cairo   31      61.0
16    Nori        Osaka   37      84.0

插入和删除列

直接赋值定义列名和定义数据。

df['js-score'] = np.array([71.0, 95.0, 88.0, 79.0, 91.0, 91.0, 80.0])
df
      name         city  age  py-score  js-score
10  Xavier  Mexico City   41      88.0      71.0
11     Ann      Toronto   28      79.0      95.0
12    Jana       Prague   33      81.0      88.0
13      Yi     Shanghai   34      80.0      79.0
14   Robin   Manchester   38      68.0      91.0
15    Amal        Cairo   31      61.0      91.0
16    Nori        Osaka   37      84.0      80.0

df['total-score'] = 0.0
df
      name         city  age  py-score  js-score  total-score
10  Xavier  Mexico City   41      88.0      71.0          0.0
11     Ann      Toronto   28      79.0      95.0          0.0
12    Jana       Prague   33      81.0      88.0          0.0
13      Yi     Shanghai   34      80.0      79.0          0.0
14   Robin   Manchester   38      68.0      91.0          0.0
15    Amal        Cairo   31      61.0      91.0          0.0
16    Nori        Osaka   37      84.0      80.0          0.0

.insert() 在列的指定位置插入列数据。

df.insert(loc=4, column='django-score',value=np.array([86.0, 81.0, 78.0, 88.0, 74.0, 70.0, 81.0]))
df
      name         city  age  py-score  django-score  js-score  total-score
10  Xavier  Mexico City   41      88.0          86.0      71.0          0.0
11     Ann      Toronto   28      79.0          81.0      95.0          0.0
12    Jana       Prague   33      81.0          78.0      88.0          0.0
13      Yi     Shanghai   34      80.0          88.0      79.0          0.0
14   Robin   Manchester   38      68.0          74.0      91.0          0.0
15    Amal        Cairo   31      61.0          70.0      91.0          0.0
16    Nori        Osaka   37      84.0          81.0      80.0          0.0

del 删除一列或者多列。

del df['total-score']
df
      name         city  age  py-score  django-score  js-score
10  Xavier  Mexico City   41      88.0          86.0      71.0
11     Ann      Toronto   28      79.0          81.0      95.0
12    Jana       Prague   33      81.0          78.0      88.0
13      Yi     Shanghai   34      80.0          88.0      79.0
14   Robin   Manchester   38      68.0          74.0      91.0
15    Amal        Cairo   31      61.0          70.0      91.0
16    Nori        Osaka   37      84.0          81.0      80.0

使用 df.drop() 删除列。

df = df.drop(labels='age', axis=1)
df
      name         city  py-score  django-score  js-score
10  Xavier  Mexico City      88.0          86.0      71.0
11     Ann      Toronto      79.0          81.0      95.0
12    Jana       Prague      81.0          78.0      88.0
13      Yi     Shanghai      80.0          88.0      79.0
14   Robin   Manchester      68.0          74.0      91.0
15    Amal        Cairo      61.0          70.0      91.0
16    Nori        Osaka      84.0          81.0      80.0

应用算术运算

应用基本的算术运算。

df['py-score'] + df['js-score']
10    159.0
11    174.0
12    169.0
13    159.0
14    159.0
15    152.0
16    164.0
dtype: float64

df['py-score'] / 100
10    0.88
11    0.79
12    0.81
13    0.80
14    0.68
15    0.61
16    0.84
Name: py-score, dtype: float64

线性组合公式计算汇总数据。

df['total'] =0.4 * df['py-score'] + 0.3 * df['django-score'] + 0.3 * df['js-score']
df
      name         city  py-score  django-score  js-score  total
10  Xavier  Mexico City      88.0          86.0      71.0   82.3
11     Ann      Toronto      79.0          81.0      95.0   84.4
12    Jana       Prague      81.0          78.0      88.0   82.2
13      Yi     Shanghai      80.0          88.0      79.0   82.1
14   Robin   Manchester      68.0          74.0      91.0   76.7
15    Amal        Cairo      61.0          70.0      91.0   72.7
16    Nori        Osaka      84.0          81.0      80.0   81.9

应用 NumPy 和 SciPy 函数

大多数 NumPy 和 SciPy 都可以作为参数而不是 NumPy 数组应用于 Pandas Series 或 DataFrame 对象。可以使用 NumPy 的 numpy.average() 计算考生的总考试成绩。

import numpy as np

score = df.iloc[:, 2:5]
score
    py-score  django-score  js-score
10      88.0          86.0      71.0
11      79.0          81.0      95.0
12      81.0          78.0      88.0
13      80.0          88.0      79.0
14      68.0          74.0      91.0
15      61.0          70.0      91.0
16      84.0          81.0      80.0

np.average(score, axis=1,weights=[0.4, 0.3, 0.3])
array([82.3, 84.4, 82.2, 82.1, 76.7, 72.7, 81.9])

del df['total']
df['total'] = np.average(df.iloc[:, 2:5], axis=1,weights=[0.4, 0.3, 0.3])
df
      name         city  py-score  django-score  js-score  total
10  Xavier  Mexico City      88.0          86.0      71.0   82.3
11     Ann      Toronto      79.0          81.0      95.0   84.4
12    Jana       Prague      81.0          78.0      88.0   82.2
13      Yi     Shanghai      80.0          88.0      79.0   82.1
14   Robin   Manchester      68.0          74.0      91.0   76.7
15    Amal        Cairo      61.0          70.0      91.0   72.7
16    Nori        Osaka      84.0          81.0      80.0   81.9

DataFrame 进行排序

.sort_values() 进行数据排序，需要指定排序的列。

df.sort_values(by='js-score', ascending=False)
      name         city  py-score  django-score  js-score  total
11     Ann      Toronto      79.0          81.0      95.0   84.4
14   Robin   Manchester      68.0          74.0      91.0   76.7
15    Amal        Cairo      61.0          70.0      91.0   72.7
12    Jana       Prague      81.0          78.0      88.0   82.2
16    Nori        Osaka      84.0          81.0      80.0   81.9
13      Yi     Shanghai      80.0          88.0      79.0   82.1
10  Xavier  Mexico City      88.0          86.0      71.0   82.3

也可以指定多个列和多个列的排序方式。

df.sort_values(by=['total', 'py-score'], ascending=[False, False])
      name         city  py-score  django-score  js-score  total
11     Ann      Toronto      79.0          81.0      95.0   84.4
10  Xavier  Mexico City      88.0          86.0      71.0   82.3
12    Jana       Prague      81.0          78.0      88.0   82.2
13      Yi     Shanghai      80.0          88.0      79.0   82.1
16    Nori        Osaka      84.0          81.0      80.0   81.9
14   Robin   Manchester      68.0          74.0      91.0   76.7
15    Amal        Cairo      61.0          70.0      91.0   72.7

DataFrame 过滤数据

Pandas 的过滤功能工作方式类似于在 NumPy 中使用布尔数组进行索引。

filter_ = df['django-score'] >= 80
filter_
10     True
11     True
12    False
13     True
14    False
15    False
16     True
Name: django-score, dtype: bool

使用表达式 df[filter_] 返回一个 DataFrame 中 True 的行数据。

df[filter_]
      name         city  py-score  django-score  js-score  total
10  Xavier  Mexico City      88.0          86.0      71.0   82.3
11     Ann      Toronto      79.0          81.0      95.0   84.4
13      Yi     Shanghai      80.0          88.0      79.0   82.1
16    Nori        Osaka      84.0          81.0      80.0   81.9

可是使用逻辑运算进行多条件的筛选。

df[(df['py-score'] >= 80) & (df['js-score'] >= 80)]
    name    city  py-score  django-score  js-score  total
12  Jana  Prague      81.0          78.0      88.0   82.2
16  Nori   Osaka      84.0          81.0      80.0   81.9

使用.where() 可以替换不满足所提供条件的位置的值。

df['django-score'].where(cond=df['django-score'] >= 80, other=0.0)
10    86.0
11    81.0
12     0.0
13    88.0
14     0.0
15     0.0
16    81.0
Name: django-score, dtype: float64

DataFrame 数据统计

基本统计信息使用 .describe()。

df.describe()
        py-score  django-score   js-score      total
count   7.000000      7.000000   7.000000   7.000000
mean   77.285714     79.714286  85.000000  80.328571
std     9.446592      6.343350   8.544004   4.101510
min    61.000000     70.000000  71.000000  72.700000
25%    73.500000     76.000000  79.500000  79.300000
50%    80.000000     81.000000  88.000000  82.100000
75%    82.500000     83.500000  91.000000  82.250000
max    88.000000     88.000000  95.000000  84.400000

特定统计信息可以直接进行索引方法调用。

df.mean()
py-score        77.285714
django-score    79.714286
js-score        85.000000
total           80.328571
dtype: float64

df['py-score'].mean()
77.28571428571429

df.std()
py-score        9.446592
django-score    6.343350
js-score        8.544004
total           4.101510
dtype: float64

df['py-score'].std()
9.446591726019244

DataFrame 处理缺失数据

缺失数据在数据科学和机器学习中非常常见。Pandas 具有非常强大的处理缺失数据的功能。

Pandas 通常用 NaN（不是数字）值表示缺失数据。在 Python 中可以使用 float(‘nan’)、math.nan 或 numpy.nan 获取 NaN。从 Pandas 1.0 开始BooleanDtype、Int8Dtype、Int16Dtype、Int32Dtype 和 Int64Dtype 等新类型使用 pandas.NA 作为缺失值。

df_ = pd.DataFrame({
    
    'x': [1, 2, np.nan, 4]})
df_
     x
0  1.0
1  2.0
2  NaN
3  4.0

缺失数据进行计算

许多 Pandas 方法在执行计算时会忽略 nan 值，除非明确指示。

df_.mean()
x    2.333333
dtype: float64

df_.mean(skipna=False)
x   NaN
dtype: float64

填充缺失数据

.fillna() 进行缺失数据填充。

# 指定填充数据
df_.fillna(value=0)
     x
0  1.0
1  2.0
2  0.0
3  4.0

# 向前填充
df_.fillna(method='ffill')
     x
0  1.0
1  2.0
2  2.0
3  4.0

# 向后填充
df_.fillna(method='bfill')
     x
0  1.0
1  2.0
2  4.0
3  4.0

删除缺少数据的行和列

使用 .dropna() 直接进行处理。

df_.dropna()
     x
0  1.0
1  2.0
3  4.0

遍历 DataFrame

使用.items()and .iteritems() 遍历 Pandas DataFrame 的列。每次迭代都会产生一个以列名和列数据作为Series对象的元组。

for col_label, col in df.iteritems():
	print(col_label, col, sep='\n', end='\n\n')

name
10    Xavier
11       Ann
12      Jana
13        Yi
14     Robin
15      Amal
16      Nori
Name: name, dtype: object

city
10    Mexico City
11        Toronto
12         Prague
13       Shanghai
14     Manchester
15          Cairo
16          Osaka
Name: city, dtype: object

py-score
10    88.0
11    79.0
12    81.0
13    80.0
14    68.0
15    61.0
16    84.0
Name: py-score, dtype: float64

django-score
10    86.0
11    81.0
12    78.0
13    88.0
14    74.0
15    70.0
16    81.0
Name: django-score, dtype: float64

js-score
10    71.0
11    95.0
12    88.0
13    79.0
14    91.0
15    91.0
16    80.0
Name: js-score, dtype: float64

total
10    82.3
11    84.4
12    82.2
13    82.1
14    76.7
15    72.7
16    81.9
Name: total, dtype: float64

使用 .iterrows() 遍历 DataFrame 的行。

for row_label, row in df.iterrows():
	print(row_label, row, sep='\n', end='\n\n')

10
name                 Xavier
city            Mexico City
py-score                 88
django-score             86
js-score                 71
total                  82.3
Name: 10, dtype: object

11
name                Ann
city            Toronto
py-score             79
django-score         81
js-score             95
total              84.4
Name: 11, dtype: object

12
name              Jana
city            Prague
py-score            81
django-score        78
js-score            88
total             82.2
Name: 12, dtype: object

13
name                  Yi
city            Shanghai
py-score              80
django-score          88
js-score              79
total               82.1
Name: 13, dtype: object

14
name                 Robin
city            Manchester
py-score                68
django-score            74
js-score                91
total                 76.7
Name: 14, dtype: object

15
name             Amal
city            Cairo
py-score           61
django-score       70
js-score           91
total            72.7
Name: 15, dtype: object

16
name             Nori
city            Osaka
py-score           84
django-score       81
js-score           80
total            81.9
Name: 16, dtype: object

DataFrame 时间序列

使用时间序列创建index

创建一个一天中的每小时温度数据 DataFrame。

temp_c = [ 8.0,  7.1,  6.8,  6.4,  6.0,  5.4,  4.8,  5.0,
           9.1, 12.8, 15.3, 19.1, 21.2, 22.1, 22.4, 23.1,
           21.0, 17.9, 15.5, 14.4, 11.9, 11.0, 10.2,  9.1]

使用 date_range() 构建时间索引。

dt = pd.date_range(start='2022-04-16 00:00:00.0', periods=24,freq='H')
df
DatetimeIndex(['2022-04-16 00:00:00', '2022-04-16 01:00:00',
               '2022-04-16 02:00:00', '2022-04-16 03:00:00',
               '2022-04-16 04:00:00', '2022-04-16 05:00:00',
               '2022-04-16 06:00:00', '2022-04-16 07:00:00',
               '2022-04-16 08:00:00', '2022-04-16 09:00:00',
               '2022-04-16 10:00:00', '2022-04-16 11:00:00',
               '2022-04-16 12:00:00', '2022-04-16 13:00:00',
               '2022-04-16 14:00:00', '2022-04-16 15:00:00',
               '2022-04-16 16:00:00', '2022-04-16 17:00:00',
               '2022-04-16 18:00:00', '2022-04-16 19:00:00',
               '2022-04-16 20:00:00', '2022-04-16 21:00:00',
               '2022-04-16 22:00:00', '2022-04-16 23:00:00'],
              dtype='datetime64[ns]', freq='H')

使用日期时间值作为行索引很方便。

                    temp_c
2022-04-16 00:00:00	8.0
2022-04-16 01:00:00	7.1
2022-04-16 02:00:00	6.8
2022-04-16 03:00:00	6.4
2022-04-16 04:00:00	6.0
2022-04-16 05:00:00	5.4
2022-04-16 06:00:00	4.8
2022-04-16 07:00:00	5.0
2022-04-16 08:00:00	9.1
2022-04-16 09:00:00	12.8
2022-04-16 10:00:00	15.3
2022-04-16 11:00:00	19.1
2022-04-16 12:00:00	21.2
2022-04-16 13:00:00	22.1
2022-04-16 14:00:00	22.4
2022-04-16 15:00:00	23.1
2022-04-16 16:00:00	21.0
2022-04-16 17:00:00	17.9
2022-04-16 18:00:00	15.5
2022-04-16 19:00:00	14.4
2022-04-16 20:00:00	11.9
2022-04-16 21:00:00	11.0
2022-04-16 22:00:00	10.2
2022-04-16 23:00:00	9.1

索引和切片

用切片来获取部分信息。

temp["2022-04-16 02:00:00":"2022-04-16 08:00:00"]
	                temp_c
2022-04-16 02:00:00	6.8
2022-04-16 03:00:00	6.4
2022-04-16 04:00:00	6.0
2022-04-16 05:00:00	5.4
2022-04-16 06:00:00	4.8
2022-04-16 07:00:00	5.0
2022-04-16 08:00:00	9.1

重采样

使用 .resample() 进行重采样选取数据。

temp.resample(rule='6h').mean()
	                temp_c
2022-04-16 00:00:00	6.616667
2022-04-16 06:00:00	11.016667
2022-04-16 12:00:00	21.283333
2022-04-16 18:00:00	12.016667

窗口滚动

使用 .rolling() 进行固定长度滚动窗口分析，指定数量的相邻行计算统计数据。

temp.rolling(window=3).mean()
temp_c
2022-04-16 00:00:00	NaN
2022-04-16 01:00:00	NaN
2022-04-16 02:00:00	7.300000
2022-04-16 03:00:00	6.766667
2022-04-16 04:00:00	6.400000
2022-04-16 05:00:00	5.933333
2022-04-16 06:00:00	5.400000
2022-04-16 07:00:00	5.066667
2022-04-16 08:00:00	6.300000
2022-04-16 09:00:00	8.966667
2022-04-16 10:00:00	12.400000
2022-04-16 11:00:00	15.733333
2022-04-16 12:00:00	18.533333
2022-04-16 13:00:00	20.800000
2022-04-16 14:00:00	21.900000
2022-04-16 15:00:00	22.533333
2022-04-16 16:00:00	22.166667
2022-04-16 17:00:00	20.666667
2022-04-16 18:00:00	18.133333
2022-04-16 19:00:00	15.933333
2022-04-16 20:00:00	13.933333
2022-04-16 21:00:00	12.433333
2022-04-16 22:00:00	11.033333
2022-04-16 23:00:00	10.100000

DataFrame 绘图

Pandas 允许基于 DataFrames 可视化数据或创建绘图。它在后台使用Matplotlib ，因此利用 Pandas 的绘图功能与使用 Matplotlib 非常相似。

import matplotlib.pyplot as plt
temp.plot()
plt.show()

在这里插入图片描述
图像的保存。

temp.plot().get_figure().savefig('tmp.png')

其他图表，比如直方图。

df.loc[:, ['py-score', 'total']].plot.hist(bins=5, alpha=0.4)
plt.show()

在这里插入图片描述

数据科学必备Pandas DataFrame：让数据处理变得更简单

文章目录

Pandas DataFrame

创建 DataFrame

使用 Dict 创建

使用 List 创建

使用 NumPy 数组创建

文件读取创建

检索索引和数据

索引作为序列

数据转为 NumPy 数组

数据类型

DataFrame 大小

访问和修改数据

使用访问器获取数据

使用访问器设置数据

插入和删除数据

插入和删除行

插入和删除列

应用算术运算

应用 NumPy 和 SciPy 函数

DataFrame 进行排序

DataFrame 过滤数据

DataFrame 数据统计

DataFrame 处理缺失数据

缺失数据进行计算

填充缺失数据

删除缺少数据的行和列

遍历 DataFrame

DataFrame 时间序列

使用时间序列创建index

索引和切片

重采样

窗口滚动

DataFrame 绘图

猜你喜欢