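1. Introduction

Pandas is arguably one of the best Python libraries for wrangling and processing tabular data. If you are new to it, though, starting from the official Pandas documentation can feel daunting and overwhelming.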

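2. Overview of pandas topics

The list of topics is shown below:

List of topics in the official Pandas API documentation (image by the author) (source: here)

You can find the code for this article here.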

3. Methods

3.1 Importing the library

To use the pandas library, you first have to import it. The widely adopted convention is to alias pandas as pd.

import pandas as pd

3.2 Reading a CSV

CSV is typically the most popular file format for reading data into a Pandas DataFrame. You can create a DataFrame with the pd.read_csv() method:

file = "file.csv"

df = pd.read_csv(file)
print(df)

>>
   col1  col2 col3
0     1     2    A
1     3     4    B

We can verify the type of the object that was created with the built-in type() function.

type(df)
>>
pandas.core.frame.DataFrame

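3.3 Saving a DataFrame to CSV

Just as CSV is widely used for reading DataFrames, it is also widely used for dumping DataFrames to disk.

Use the df.to_csv() method as shown below: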

df.to_csv("file.csv", sep = "|", index = False)

The sep argument specifies the column delimiter, and index=False tells Pandas not to write the DataFrame's index to the CSV file.

!cat file.csv
>>
col1|col2|col3
1|2|A
3|4|B
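
A file written with a custom delimiter has to be read back with the same delimiter. The sketch below illustrates the round trip; it writes to an in-memory io.StringIO buffer instead of file.csv, which is a choice made here only to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [3, 4, "B"]],
                  columns=["col1", "col2", "col3"])

# Write the DataFrame as pipe-separated text, without the index.
buffer = io.StringIO()
df.to_csv(buffer, sep="|", index=False)
csv_text = buffer.getvalue()

# Read it back, passing the same separator that was used for writing.
restored = pd.read_csv(io.StringIO(csv_text), sep="|")
print(restored)
```

If sep="|" were omitted in pd.read_csv(), Pandas would fall back to the default comma and parse each line as a single column.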

3.4 Creating a DataFrame

To create a Pandas DataFrame, use the pd.DataFrame() method:

data = [[1, 2, "A"], 
        [3, 4, "B"]]

df = pd.DataFrame(data, 
                  columns = ["col1", "col2", "col3"])
print(df)
>>
   col1  col2 col3
0     1     2    A
1     3     4    B

3.4.1 From a list of lists

A popular approach is to convert a given list of lists into a DataFrame:

data = [[1, 2, "A"], 
        [3, 4, "B"]]

df = pd.DataFrame(data, 
                  columns = ["col1", "col2", "col3"])
print(df)
>>
   col1  col2 col3
0     1     2    A
1     3     4    B

3.4.2 From a dictionary

Another popular approach is to convert a Python dictionary into a DataFrame:

data = {'col1': [1, 2], 
        'col2': [3, 4], 
        'col3': ["A", "B"]}

df = pd.DataFrame(data=data)
print(df)
>>
   col1  col2 col3
0     1     3    A
1     2     4    B

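You can read more about creating DataFrames here.

3.5 Shape of a DataFrame

A DataFrame is essentially a matrix with column headers. It therefore has a specific number of rows and columns.

You can print its dimensions with the shape attribute, as shown below: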

print(df)
print("Shape:", df.shape)
>>
   col1  col2 col3
0     1     3    A
1     2     4    B
Shape: (2, 3)

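Here, the first element of the tuple (2) is the number of rows, and the second element (3) is the number of columns.

3.6 Viewing the first N rows

Real-world datasets typically contain many rows.

In such cases, one is usually interested in viewing only the first n rows of the DataFrame.

You can print the first n rows with the df.head(n) method: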

print(df.head(5))
>>
   col1  col2 col3
0     1     2    A
1     3     4    B
2     5     6    C
3     7     8    D
4     9    10    E
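
The text does not show how the larger DataFrame printed above is built. The sketch below assumes a 10-row frame whose first five rows match that output; the letters F to J in col3 are placeholders invented for illustration.

```python
import pandas as pd

# Assumed 10-row DataFrame; only the first five rows appear in the article.
df = pd.DataFrame({"col1": range(1, 20, 2),
                   "col2": range(2, 21, 2),
                   "col3": list("ABCDEFGHIJ")})

first_five = df.head(5)   # first 5 rows
default_head = df.head()  # n defaults to 5
last_two = df.tail(2)     # counterpart of head() for the last rows

print(first_five)
print(last_two)
```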

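3.7 Printing the data types of columns

Pandas assigns an appropriate data type to each column of a DataFrame.

You can print the data types of all columns with the dtypes attribute: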

df.dtypes
>>
col1     int64
col2     int64
col3    object
dtype: object

3.8 Changing the data type of a column

To change the data type of a column, use the astype() method as shown below:

import numpy as np

df["col1"] = df["col1"].astype(np.int8)

print(df.dtypes)
>>
col1      int8
col2     int64
col3    object
dtype: object
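
One common reason to downcast a column is memory. The sketch below assumes a 10-row frame like the one summarized in the next section and compares the bytes used by col1 before and after conversion. It passes a dictionary to astype(), which converts only the listed columns and returns a new DataFrame. Note that int8 can only hold values from -128 to 127, so this is safe only when the data fits in that range.

```python
import pandas as pd

# Assumed sample data: ten small integers per column.
df = pd.DataFrame({"col1": range(1, 20, 2),
                   "col2": range(2, 21, 2)})

bytes_before = df["col1"].memory_usage(index=False)

# Convert only col1; col2 keeps its original dtype.
converted = df.astype({"col1": "int8"})
bytes_after = converted["col1"].memory_usage(index=False)

print(converted.dtypes)
print(bytes_before, bytes_after)
```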

3.9 Printing descriptive information about a DataFrame

3.9.1 Method 1

The first method, df.info(), prints missing-value statistics and data types.

df.info()
>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   col1    10 non-null     int8  
 1   col2    10 non-null     int64 
 2   col3    10 non-null     object
dtypes: int64(1), int8(1), object(1)
memory usage: 298.0+ bytes

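3.9.2 Method 2

The second method is comparatively more descriptive and prints standard statistics such as the mean, standard deviation, and maximum for every numeric column.

The method is df.describe().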

print(df.describe())
>>
        col1   col2
count  10.00  10.00
mean   10.00  11.00
std     6.06   6.06
min     1.00   2.00
25%     5.50   6.50
50%    10.00  11.00
75%    14.50  15.50
max    19.00  20.00

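3.10 Filling NaN values

Missing data is almost inevitable in real-world datasets. You can use the df.fillna() method to replace missing values with a specific value.

import numpy as np

df = pd.DataFrame([[1, 2, "A"], [np.nan, 4, "B"]],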
                  columns = ["col1", "col2", "col3"])
print(df)
>>
   col1  col2 col3
0   1.0     2    A
1   NaN     4    B

My earlier blog covers handling missing data in more detail. Here, the NaN is replaced with 0:

df.fillna(0, inplace = True)
print(df)
>>
   col1  col2 col3
0   1.0     2    A
1   0.0     4    B
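
A single constant is not always the right replacement. As a minimal sketch, fillna() also accepts a dictionary mapping column names to fill values, and without inplace=True it returns a new DataFrame and leaves the original untouched. Filling with the column mean is just one illustrative choice, not a recommendation from the article.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [np.nan, 4, "B"]],
                  columns=["col1", "col2", "col3"])

# Fill only col1, using the mean of its non-missing values.
filled = df.fillna({"col1": df["col1"].mean()})

print(filled)
print(df["col1"].isna().sum())  # the original still has its NaN
```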

3.11 Joining DataFrames

To merge two DataFrames on a join key, use the pd.merge() method:

df1 = ...
df2 = ...

print(df1)
print(df2)
>>
   col1  col2 col3
0     1     2    A
1     3     4    A
2     5     6    B
  col3 col4
0    A    X
1    B    Y

pd.merge(df1, df2, on = "col3")
>>
   col1  col2 col3 col4
0     1     2    A    X
1     3     4    A    X
2     5     6    B    Y
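
The listing above elides how df1 and df2 are built. The sketch below assumes two frames that match the printed output, then adds a hypothetical row with key "C" to show how the how argument changes the result. By default pd.merge() performs an inner join and drops keys that appear in only one frame, whereas how="left" keeps every row of the left frame.

```python
import pandas as pd

# Frames reconstructed from the printed output (an assumption).
df1 = pd.DataFrame([[1, 2, "A"], [3, 4, "A"], [5, 6, "B"]],
                   columns=["col1", "col2", "col3"])
df2 = pd.DataFrame([["A", "X"], ["B", "Y"]],
                   columns=["col3", "col4"])

inner = pd.merge(df1, df2, on="col3")

# Hypothetical extra row whose key "C" has no match in df2.
extra = pd.DataFrame([[7, 8, "C"]], columns=df1.columns)
df1_extra = pd.concat([df1, extra], ignore_index=True)

inner_extra = pd.merge(df1_extra, df2, on="col3")             # drops "C"
left_extra = pd.merge(df1_extra, df2, on="col3", how="left")  # keeps "C"

print(inner_extra)
print(left_extra)
```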

3.12 Sorting a DataFrame

Sorting is another typical operation that data scientists use to order a DataFrame. You can sort a DataFrame with the df.sort_values() method.

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

print(df.sort_values("col1"))
>>
   col1  col2 col3
0     1     2    A
2     3    10    B
1     5     8    B
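
sort_values() sorts in ascending order by default. The following sketch reuses the same DataFrame to show a descending sort and a sort on two columns with a different direction for each.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"],
                   [5, 8, "B"],
                   [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])

# Largest col1 first.
by_col1_desc = df.sort_values("col1", ascending=False)

# col3 ascending, and within each col3 value, col2 descending.
by_two_cols = df.sort_values(["col3", "col2"], ascending=[True, False])

print(by_col1_desc)
print(by_two_cols)
```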

3.13 Grouping a DataFrame

To group a DataFrame and perform aggregations, use the groupby() method in Pandas, as shown below:

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

df.groupby("col3").agg({"col1": "sum", "col2": "max"})
>>
      col1  col2
col3            
A        1     2
B        8    10
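
When the aggregated columns should get descriptive names, named aggregation is an alternative to the dictionary form. This sketch uses the same data; the output names col1_sum and col2_max are chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"],
                   [5, 8, "B"],
                   [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])

# Each keyword is an output column: (input column, aggregation name).
summary = df.groupby("col3").agg(col1_sum=("col1", "sum"),
                                 col2_max=("col2", "max"))
print(summary)
```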

3.14 Renaming columns

To rename column headers, use the df.rename() method, as shown below:

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col_A", "col2", "col3"])

df.rename(columns = {"col_A":"col1"})
>>
   col1  col2 col3
0     1     2    A
1     5     8    B
2     3    10    B
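
One detail that is easy to miss: df.rename() returns a new DataFrame and does not change df itself, so the result has to be assigned to a variable to be kept. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"],
                   [5, 8, "B"],
                   [3, 10, "B"]],
                  columns=["col_A", "col2", "col3"])

# rename() returns a renamed copy; df keeps its original headers.
renamed = df.rename(columns={"col_A": "col1"})

print(list(df.columns))
print(list(renamed.columns))
```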

3.15 Deleting columns

To delete a column, use the df.drop() method:

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

print(df.drop(columns = ["col1"]))
>>
   col2 col3
0     2    A
1     8    B
2    10    B

3.16 Adding new columns

Two widely used ways to add a new column are:

3.16.1 Method 1

You can add a new column using the assignment operator:

df = pd.DataFrame([[1, 2], [3, 4]], 
                  columns = ["col1", "col2"])

df["col3"] = df["col1"] + df["col2"]
print(df)
>>
   col1  col2  col3
0     1     2     3
1     3     4     7

3.16.2 Method 2

Alternatively, you can use the df.assign() method as follows:

df = pd.DataFrame([[1, 2], [3, 4]], 
                  columns = ["col1", "col2"])

df = df.assign(col3 = df["col1"] + df["col2"])

print(df)
>>
   col1  col2  col3
0     1     2     3
1     3     4     7

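3.17 Filtering a DataFrame

There are several ways to filter a DataFrame based on a condition.

Method 1: Boolean filtering

Here, a row is selected if the condition evaluates to True for that row.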

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

print(df[df["col2"] > 5])
>>
   col1  col2 col3
1     5     8    B
2     3    10    B

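For a row to be selected, its value in col2 must be greater than 5.

The isin() method is used to select rows whose values belong to a list of values.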

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "C"]], 
                  columns = ["col1", "col2", "col3"])

filter_list = ["A", "C"]
print(df[df["col3"].isin(filter_list)])
>>
   col1  col2 col3
0     1     2    A
2     3    10    C
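
Boolean conditions can also be combined. The sketch below reuses the DataFrame above and uses & for "and" and ~ for "not"; each condition needs its own parentheses because these operators bind more tightly than comparisons such as >.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"],
                   [5, 8, "B"],
                   [3, 10, "C"]],
                  columns=["col1", "col2", "col3"])

# Rows where col2 > 5 AND col3 equals "C".
both = df[(df["col2"] > 5) & (df["col3"] == "C")]

# Rows whose col3 value is NOT in the list.
not_in_list = df[~df["col3"].isin(["A", "C"])]

print(both)
print(not_in_list)
```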

Method 2: Getting a column

You can also select an entire column as follows:

df["col1"] ## or df.col1
>>
0    1
1    5
2    3
Name: col1, dtype: int64

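Method 3: Selecting by label

In label-based selection, every requested label must be present in the DataFrame's index.

Integers are also valid labels, but they refer to the label, not the position.

Consider the following DataFrame.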

df = pd.DataFrame([[6, 5,  10], 
                   [5, 8,  6], 
                   [3, 10, 4]], 
                  columns = ["Maths", "Science", "English"],
                  index = ["John", "Mark", "Peter"])

print(df)
>>
       Maths  Science  English
John       6        5       10
Mark       5        8        6
Peter      3       10        4

We use df.loc for label-based selection.

df.loc["John"]
>>
Maths       6
Science     5
English    10
Name: John, dtype: int64

df.loc["Mark", ["Maths", "English"]]
>>
Maths      5
English    6
Name: Mark, dtype: int64

However, df.loc[] does not accept a position for filtering the DataFrame, as shown below:

df.loc[0]
>>
Execution Error

KeyError: 0

To achieve the above, use position-based selection with df.iloc[].

Method 4: Selecting by position

df.iloc[0]
>>
Maths       6
Science     5
English    10
Name: John, dtype: int64
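
The earlier remark that integers in df.loc refer to labels rather than positions is easiest to see on a DataFrame whose integer index is not in order. The frame below is an illustrative assumption with index labels 2, 3 and 1.

```python
import pandas as pd

df = pd.DataFrame([[6, 5, 10],
                   [5, 8, 6],
                   [3, 10, 4]],
                  columns=["col1", "col2", "col3"],
                  index=[2, 3, 1])

by_label = df.loc[1]      # the row whose index label is 1 (the third row)
by_position = df.iloc[1]  # the row at position 1 (the second row)

print(by_label.tolist())
print(by_position.tolist())
```

The same integer therefore selects different rows depending on whether loc or iloc is used.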

3.18 Finding unique values in a DataFrame

To print all distinct values in a column, use the unique() method.

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "A"]], 
                  columns = ["col1", "col2", "col3"])

df["col3"].unique()
>>
array(['A', 'B'], dtype=object)

To print the number of unique values instead, use nunique().

df["col3"].nunique()
>>
2

3.19 Applying a function to a DataFrame

To apply a function to each row of a DataFrame, use the apply() method with axis=1, as shown below:

def add_cols(row):
    return row.col1 + row.col2

df = pd.DataFrame([[1, 2], 
                   [5, 8], 
                   [3, 9]], 
                  columns = ["col1", "col2"])
                  
df["col3"] = df.apply(add_cols, axis=1)
print(df)
>>
   col1  col2  col3
0     1     2     3
1     5     8    13
2     3     9    12

You can also apply a function to a single column, as shown below:

def square_col(num):
    return num**2

df = pd.DataFrame([[1, 2], 
                   [5, 8], 
                   [3, 9]], 
                  columns = ["col1", "col2"])
                  
df["col3"] = df.col1.apply(square_col)
print(df)
>>
   col1  col2  col3
0     1     2     1
1     5     8    25
2     3     9     9
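
For simple arithmetic like the two examples above, the same results can be obtained with column-wise (vectorized) expressions, which avoid calling a Python function once per row. The sketch below is a comparison under that assumption; apply() remains useful when the logic cannot be written as column operations.

```python
import pandas as pd

df = pd.DataFrame([[1, 2],
                   [5, 8],
                   [3, 9]],
                  columns=["col1", "col2"])

# Row-wise apply, as in the article.
row_wise = df.apply(lambda row: row.col1 + row.col2, axis=1)

# Equivalent vectorized expressions on whole columns.
vectorized_sum = df["col1"] + df["col2"]
vectorized_square = df["col1"] ** 2

print(vectorized_sum.tolist())
print(vectorized_square.tolist())
```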

3.20 Handling duplicate data

You can flag all duplicate rows with the df.duplicated() method:

df = pd.DataFrame([[1, "A"], 
                   [2, "B"], 
                   [1, "A"]], 
                  columns = ["col1", "col2"])
                  
df.duplicated(keep=False)

With keep=False, all duplicated rows are marked True.
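
Since the result of the call is not printed above, here is a small sketch comparing keep=False with the default keep="first", which leaves the first occurrence unmarked and flags only the later repeats.

```python
import pandas as pd

df = pd.DataFrame([[1, "A"],
                   [2, "B"],
                   [1, "A"]],
                  columns=["col1", "col2"])

all_marked = df.duplicated(keep=False)  # every member of a duplicate group
later_only = df.duplicated()            # default keep="first"

print(all_marked.tolist())
print(later_only.tolist())
```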

In addition, you can remove duplicate rows with the df.drop_duplicates() method, as shown below:

df = pd.DataFrame([[1, "A"], 
                   [2, "B"], 
                   [1, "A"]], 
                  columns = ["col1", "col2"])
                  
print(df.drop_duplicates())
>>
   col1 col2
0     1    A
1     2    B

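One copy of each duplicated row is retained.

3.21 Finding the distribution of values

To find the frequency of each unique value in a column, use the value_counts() method: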

df = pd.DataFrame([[1, "A"], 
                   [2, "B"], 
                   [1, "A"]], 
                  columns = ["col1", "col2"])
                  
print(df.value_counts("col2"))
>>
col2
A    2
B    1
dtype: int64
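
value_counts() can also be called on a single column, and with normalize=True it returns relative frequencies instead of raw counts. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame([[1, "A"],
                   [2, "B"],
                   [1, "A"]],
                  columns=["col1", "col2"])

counts = df["col2"].value_counts()                # absolute frequencies
shares = df["col2"].value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```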

3.22 Resetting the index of a DataFrame

To reset the index of a DataFrame, use the df.reset_index() method:

df = pd.DataFrame([[6, 5,  10], 
                   [5, 8,  6], 
                   [3, 10, 4]], 
                  columns = ["col1", "col2", "col3"],
                  index = [2, 3, 1])

print(df.reset_index())
>>
   index  col1  col2  col3
0      2     6     5    10
1      3     5     8     6
2      1     3    10     4

To discard the old index, pass drop=True to the method above:

df.reset_index(drop=True)
>>
   col1  col2  col3
0     6     5    10
1     5     8     6
2     3    10     4

3.23 Computing a cross-tabulation

To return the frequency of each combination of values from two columns, use the pd.crosstab() method:

df = pd.DataFrame([["A", "X"], 
                   ["B", "Y"], 
                   ["C", "X"],
                   ["A", "X"]], 
                  columns = ["col1", "col2"])

print(pd.crosstab(df.col1, df.col2))
>>
col2  X  Y
col1      
A     2  0
B     0  1
C     1  0

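3.24 Pivoting a DataFrame

Pivot tables are a commonly used data-analysis tool in Excel. Similar to the crosstab discussed above, pivot tables in Pandas provide a way to cross-tabulate data.

Consider the following DataFrame: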

df = ...

print(df)
>>
    Name  Subject  Marks
0   John    Maths      6
1   Mark    Maths      5
2  Peter    Maths      3
3   John  Science      5
4   Mark  Science      8
5  Peter  Science     10
6   John  English     10
7   Mark  English      6
8  Peter  English      4

With the pd.pivot_table() method, you can turn column entries into column headers:

pd.pivot_table(df, 
               index = ["Name"],
               columns=["Subject"], 
               values='Marks',
               fill_value=0)
>>
Subject  English  Maths  Science
Name                            
John          10      6        5
Mark           6      5        8
Peter          4      3       10
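
The construction of the long-format DataFrame is elided above. The sketch below rebuilds it from the printed rows (an assumption) so the pivot can be run end to end. pd.pivot_table() aggregates with the mean unless aggfunc is given; because each Name and Subject pair occurs exactly once here, the aggregation leaves the values unchanged, and aggfunc="sum" is passed only to make that step explicit.

```python
import pandas as pd

# Long-format data reconstructed from the printed output (an assumption).
df = pd.DataFrame({
    "Name": ["John", "Mark", "Peter"] * 3,
    "Subject": ["Maths"] * 3 + ["Science"] * 3 + ["English"] * 3,
    "Marks": [6, 5, 3, 5, 8, 10, 10, 6, 4],
})

wide = pd.pivot_table(df,
                      index="Name",
                      columns="Subject",
                      values="Marks",
                      aggfunc="sum",
                      fill_value=0)
print(wide)
```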

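4. Afterword

The operations listed above are only some of the basic pandas data operations. In practice, pandas has an extensive manual of operations that cannot be covered in full here; the rest is best learned along the way.

30 Methods You Should Master to Become a Pandas Pro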