【Pandas】①Pandas Data Processing Basics

Introduction

Pandas is a very well-known open-source data processing library. With it, we can quickly read, transform, filter, and analyze datasets. In addition, Pandas has powerful facilities for handling missing data and for pivoting data, making it an essential tool for data preprocessing.

Knowledge Points

  • Data types
  • Data reading
  • Data selection
  • Data deletion
  • Data filling

Pandas is a well-known open-source data processing library built on top of NumPy and designed within the SciPy ecosystem to tackle data analysis tasks. It incorporates a large number of libraries and standard data models, and provides the functions and methods needed to manipulate large datasets efficiently.

Its distinctive data structures are the strength and the core of Pandas. Simply put, we can convert data in almost any format into a Pandas data type, transform and operate on it with the methods Pandas provides, and finally arrive at the result we expect.

So, we first need to understand and become familiar with the data types Pandas supports.

Data Types

Pandas has historically offered the following data types: Series (one-dimensional), DataFrame (two-dimensional), Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (N-dimensional). Among them, Series and DataFrame are by far the most widely used, accounting for almost 90% of everyday usage. (Note that Panel and its higher-dimensional variants have been deprecated and removed in recent versions of Pandas.)

Series

Series is the most basic one-dimensional array structure in Pandas. It can hold integers, floating-point numbers, strings, and other data types. The basic signature of Series is as follows:

pandas.Series(data=None, index=None)

Here, data can be a dictionary, a NumPy ndarray object, and so on, while index is the data index. The index is a major feature of Pandas data structures; its main purpose is to help us locate data more quickly.

        Below, we create an example Series based on a Python dictionary.

%matplotlib inline
import pandas as pd

s = pd.Series({'a': 10, 'b': 20, 'c': 30})
s

As shown above, the values of the Series are 10, 20, 30, the indexes are a, b, c, and the dtype of the values defaults to int64. You can confirm the type of s with type().

type(s)
# pandas.core.series.Series
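
The index feature mentioned above is easy to inspect directly. As a quick check on the s we just created:

s.index   # Index(['a', 'b', 'c'], dtype='object')
s.values  # array([10, 20, 30])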

Since Pandas is built on NumPy, a NumPy ndarray can naturally be converted into Pandas data; in particular, a Series can be built from a one-dimensional NumPy array.

import numpy as np

s = pd.Series(np.random.randn(5))
s

As shown above, we pass in a one-dimensional random array generated by NumPy; the resulting Series index starts from 0 by default, and the dtype is float64.
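
If the default 0-based index is not what you want, you can pass an explicit index at construction time. A small sketch with arbitrary labels:

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s  # same random values, now labeled 'a' through 'e'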

DataFrame

DataFrame is the most common, most important, and most frequently used data structure in Pandas. A DataFrame is structured like an ordinary spreadsheet or SQL table. You can think of a DataFrame as an extension of Series, as if it were composed of multiple Series. The intuitive difference from Series is that the data has not only a row index but also a column index.


        The basic structure of DataFrame is as follows:

pandas.DataFrame(data=None, index=None, columns=None)

Compared with Series, it adds a columns parameter for the column index. A DataFrame can be constructed from several kinds of data:

  • A dictionary of one-dimensional arrays, lists, dictionaries, or Series.
  • A two-dimensional or structured numpy.ndarray.
  • A Series or another DataFrame.

For example, we first build the DataFrame using a dictionary of Series.

df = pd.DataFrame({'one': pd.Series([1, 2, 3]),
                   'two': pd.Series([4, 5, 6])})
df
   one  two
0    1    4
1    2    5
2    3    6

        When no index is specified, the index of the DataFrame also starts from 0. We can also generate a DataFrame directly from a dictionary of lists.

df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [4, 5, 6]})
df
   one  two
0    1    4
1    2    5
2    3    6

Or, conversely, generate a DataFrame from a list of dictionaries.

df = pd.DataFrame([{'one': 1, 'two': 4},
                   {'one': 2, 'two': 5},
                   {'one': 3, 'two': 6}])
df
   one  two
0    1    4
1    2    5
2    3    6

        NumPy's multidimensional arrays are very commonly used, and it is also possible to build a DataFrame based on two-dimensional values.

pd.DataFrame(np.random.randint(5, size=(2, 4)))
   0  1  2  3
0  3  2  1  4
1  4  1  1  2
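
The index and columns parameters from the signature above can of course be supplied here too; the labels below are arbitrary examples:

pd.DataFrame(np.random.randint(5, size=(2, 4)),
             index=['r1', 'r2'],
             columns=['A', 'B', 'C', 'D'])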

At this point, you should have a clear picture of Series and DataFrame, the two most commonly used Pandas data types. At first glance, a Series may look like a DataFrame with only one column. That statement is not rigorous, of course; the core difference is that a Series has no column index. You can compare the Series and the DataFrame generated from the same NumPy one-dimensional random array below.

pd.Series(np.random.randint(5, size=(5,)))

  

pd.DataFrame(np.random.randint(5, size=(5,)))

We will not cover data types such as Panel here. First, they are rarely used (and have been removed from recent Pandas versions); second, even if you did need them, the skills you learn with DataFrame transfer directly.

Data Reading

To analyze data with Pandas, we first need to read the data in. In most cases, data comes from external files or databases. Pandas provides a comprehensive suite of methods for reading external data. Below we take the most common case, a CSV data file, as an example.

CSV files can be read with pandas.read_csv(), to which you can pass a local (relative) path or a network URL directly.

df = pd.read_csv("https://labfile.oss.aliyuncs.com/courses/906/los_census.csv")
df

Since a CSV stores a two-dimensional table, Pandas automatically reads it as a DataFrame.

You should now see that DataFrame sits at the core of Pandas. All data, whether read from external sources or generated in-process, needs to end up as a Pandas DataFrame or Series. In practice this usually happens by design, with no extra conversion work required.

Methods starting with the pd.read_ prefix can read many other kinds of data files and also support connecting to databases. We will not go through them one by one here; you can read the corresponding chapters of the official documentation to get familiar with these methods and the parameters they take.
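
As a rough sketch of that family (the file names here are placeholders, and read_sql additionally needs a live database connection):

df_excel = pd.read_excel("data.xlsx")  # needs an Excel engine such as openpyxl
df_json = pd.read_json("data.json")
# df_sql = pd.read_sql("SELECT * FROM some_table", con=connection)  # connection: a SQLAlchemy or DB-API connection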

You may have another question: why convert data to a Series or DataFrame structure at all?

Actually, we can answer that right away: all of Pandas' data manipulation methods are designed around the data structures Pandas supports. That is, only a Series or DataFrame can be processed with the methods and functions Pandas provides. So before learning the actual data processing methods, we need to convert our data to a Series or DataFrame.

Basic Operations

From the content above, we already know that a DataFrame roughly consists of three parts: the column names, the index, and the data.


Next, we learn basic operations on a DataFrame. In this course we will not single out Series, because most of the methods and techniques you learn for DataFrame apply to Series as well; the two share the same roots.

Above, we read in an external dataset: the Los Angeles census data. Sometimes the files we read are very large; printing and previewing them in full is both unwieldy and time-consuming. Fortunately, Pandas provides the head() and tail() methods, which preview just a small slice of the data.

df.head()  # shows the first 5 rows by default
   Zip Code  Total Population  Median Age  Total Males  Total Females  Total Households  Average Household Size
0     91371                 1        73.5            0              1                 1                    1.00
1     90001             57110        26.6        28468          28642             12971                    4.40
2     90002             51223        25.5        24876          26347             11731                    4.36
3     90003             66266        26.3        32631          33635             15642                    4.22
4     90004             62180        34.8        31302          30878             22547                    2.73
df.tail(7)  # shows the last 7 rows
     Zip Code  Total Population  Median Age  Total Males  Total Females  Total Households  Average Household Size
312     93550             74929        27.5        36414          38515             20864                    3.58
313     93551             50798        37.0        25056          25742             15963                    3.18
314     93552             38158        28.4        18711          19447              9690                    3.93
315     93553              2138        43.3         1121           1017               816                    2.62
316     93560             18910        32.4         9491           9419              6469                    2.92
317     93563               388        44.5          263            125               103                    2.53
318     93591              7285        30.9         3653           3632              1982                    3.67

Pandas also provides statistical and descriptive methods that let you understand the dataset from a macro perspective. describe() is equivalent to an overview of the dataset, reporting the count, mean, max, min, and so on for each column.

df.describe()
           Zip Code  Total Population  Median Age   Total Males  Total Females  Total Households  Average Household Size
count    319.000000        319.000000  319.000000    319.000000     319.000000        319.000000              319.000000
mean   91000.673981      33241.341693   36.527586  16391.564263   16849.777429      10964.570533                2.828119
std      908.360203      21644.417455    8.692999  10747.495566   10934.986468       6270.646400                0.835658
min    90001.000000          0.000000    0.000000      0.000000       0.000000          0.000000                0.000000
25%    90243.500000      19318.500000   32.400000   9763.500000    9633.500000       6765.500000                2.435000
50%    90807.000000      31481.000000   37.100000  15283.000000   16202.000000      10968.000000                2.830000
75%    91417.000000      44978.000000   41.000000  22219.500000   22690.500000      14889.500000                3.320000
max    93591.000000     105549.000000   74.000000  52794.000000   53185.000000      31087.000000                4.670000

Because Pandas is built on NumPy, you can convert a DataFrame to a NumPy array at any time via .values.

df.values

This also means you can operate on the same data with both the Pandas and NumPy APIs at once, converting freely between the two. That makes for a very flexible tool ecosystem.

Besides .values, the common attributes a DataFrame supports are listed in the corresponding chapter of the official documentation. Frequently used ones include:

df.index  # view the index
# RangeIndex(start=0, stop=319, step=1)
df.columns  # view the column names

df.shape  # view the shape (rows, columns)
# (319, 7)
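
Two more attributes that come up constantly; the comments indicate what they return:

df.dtypes  # the data type of each column
df.size    # total number of values, rows × columns: 319 * 7 = 2233 here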

Data Selection

During data preprocessing, we often slice a dataset, keeping only the rows, columns, or blocks we need and passing them on to the next step. This is what we call data selection, or data indexing. Because Pandas data structures carry indexes and labels, we can select data along multiple axes.

Selecting by Integer Position

When we create a new DataFrame without specifying the row index or column labels, Pandas defaults to numeric row indexes starting from 0 (when reading a file, the first row typically supplies the column labels). In fact, the columns also have numeric positions, likewise starting from 0; they are just not displayed.

So, first we can select from the dataset by numeric position, using the df.iloc[] method in Pandas. It accepts the following types:

  1. An integer, e.g. 5.
  2. A list or array of integers, e.g. [1, 2, 3].
  3. A boolean array (see the sketch at the end of this section).
  4. A callable that returns valid position indices (see the sketch at the end of this section).

Below, we demonstrate with the sample data above.

First, we select the first 3 rows. This is very similar to slicing in Python or NumPy.

df.iloc[:3]
   Zip Code  Total Population  Median Age  Total Males  Total Females  Total Households  Average Household Size
0     91371                 1        73.5            0              1                 1                    1.00
1     90001             57110        26.6        28468          28642             12971                    4.40
2     90002             51223        25.5        24876          26347             11731                    4.36

We can also select a specific single row:

df.iloc[5]

Then, to select multiple rows, would df.iloc[1, 3, 5] work? No. df.iloc[[rows], [columns]] accepts row and column positions at the same time; typing df.iloc[1, 3, 5] is interpreted as three separate axes and raises an error.

So it's simple: to select rows 2, 4, and 6 (positions 1, 3, 5), do this.

df.iloc[[1, 3, 5]]
   Zip Code  Total Population  Median Age  Total Males  Total Females  Total Households  Average Household Size
1     90001             57110        26.6        28468          28642             12971                    4.40
3     90003             66266        26.3        32631          33635             15642                    4.22
5     90005             37681        33.9        19299          18382             15044                    2.50

Once you can select rows, selecting columns should follow naturally. For example, to select columns 2 through 4:

df.iloc[:, 1:4]

To select columns 2-4, we type 1:4. This is very similar to slicing in Python or NumPy. Since we can locate both rows and columns, combining the two lets us select any data in the dataset.
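
The two remaining accepted types from the list above, boolean arrays and callables, look like this. A minimal sketch (the population threshold is arbitrary):

# a boolean array must match the length of the axis (319 rows here);
# .values turns the boolean Series into a plain array, which iloc expects
df.iloc[(df['Total Population'] > 100000).values]

# a callable receives the DataFrame and must return a valid indexer
df.iloc[lambda d: [0, 1, 2]]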

Selecting by Label Name (Important)

Besides selecting by numeric position, we can select directly by label name. The method is very similar to iloc above, just without the i: df.loc[].

df.loc[] accepts the following types:

  1. A single label, e.g. 2 or 'a'. Note that 2 here is interpreted as a label, not as an integer position.
  2. A list or array of labels, e.g. ['A', 'B', 'C'].
  3. A slice object, e.g. 'A':'E'. Note the difference from positional slicing above: both endpoints are included.
  4. A boolean array (see the sketch at the end of this section).
  5. A callable that returns labels.

Below, we demonstrate df.loc[]. First, select the first 3 rows:

df.loc[0:2]
   Zip Code  Total Population  Median Age  Total Males  Total Females  Total Households  Average Household Size
0     91371                 1        73.5            0              1                 1                    1.00
1     90001             57110        26.6        28468          28642             12971                    4.40
2     90002             51223        25.5        24876          26347             11731                    4.36

Then select rows 1, 3, and 5 (labels 0, 2, 4):

df.loc[[0, 2, 4]]
   Zip Code  Total Population  Median Age  Total Males  Total Females  Total Households  Average Household Size
0     91371                 1        73.5            0              1                 1                    1.00
2     90002             51223        25.5        24876          26347             11731                    4.36
4     90004             62180        34.8        31302          30878             22547                    2.73

Next, select columns 2-4:

df.loc[:, 'Total Population':'Total Males']

Finally, select rows 1 and 3, together with the columns from Median Age onward:

df.loc[[0, 2], 'Median Age':]
   Median Age  Total Males  Total Females  Total Households  Average Household Size
0        73.5            0              1                 1                    1.00
2        25.5        24876          26347             11731                    4.36
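
loc also accepts boolean arrays, which is the everyday way to filter rows by a condition. A short sketch on the census data:

# keep only ZIP codes with more than 50,000 residents
df.loc[df['Total Population'] > 50000]

# conditions combine with & and |, each side wrapped in parentheses
df.loc[(df['Total Population'] > 50000) & (df['Median Age'] < 30)]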

Data Deletion

Although data selection lets us pull just what we need out of a complete dataset, sometimes it is simpler and more direct to delete the data we don't need. In Pandas, the methods whose names start with .drop all relate to data deletion.

DataFrame.drop can remove specified columns and rows from a dataset. Typically we pass the labels parameter and then use axis to indicate deletion by column (axis=1) or by row (axis=0). You can also delete via the index parameter; see the official documentation for details.

df.drop(labels=['Median Age', 'Total Males'], axis=1)
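
drop works the same way on rows; the index parameter mentioned above is a shortcut that skips axis entirely. A sketch:

df.drop(labels=[0, 1], axis=0)  # drop the first two rows by index label
df.drop(index=[0, 1])           # equivalent, using the index parameter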

DataFrame.drop_duplicates is usually used for deduplication, i.e. removing duplicate values from the dataset. It is very simple to use: by default, it drops rows that are duplicated across all columns. You can use subset to consider duplicates only on specific columns; to drop duplicates while keeping the last occurrence, pass keep='last'.

df.drop_duplicates()

Note: the output here still has 319 rows, which shows this dataset contains no duplicates.
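
The subset and keep parameters mentioned above would be used like this (a sketch; Zip Code is one of the census columns):

# deduplicate considering only the Zip Code column, keeping the last occurrence
df.drop_duplicates(subset=['Zip Code'], keep='last')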

Beyond these, another very common deletion method is DataFrame.dropna, whose main purpose is to delete missing values, i.e. rows or columns of the dataset where data is absent.

df.dropna()
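
By default, dropna() drops every row containing at least one missing value; the axis and how parameters adjust that behavior. A sketch:

df.dropna(axis=1)     # drop columns (instead of rows) that contain missing values
df.dropna(how='all')  # drop only rows in which every value is missing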

For the three common deletion methods mentioned here, make sure to read the official documentation. There is not much to watch out for with them; learning their usage from the docs is enough, so we will not overcomplicate the introduction.

Data Filling

Having covered deletion, the opposite situation is data filling. For a given dataset, we generally don't fill in arbitrary data; mostly, we fill in missing values.

In real production environments, the data files we must handle are rarely as clean as we would like. One situation you are very likely to meet is missing values. A missing value mainly means absent data, i.e. some piece of the dataset simply does not exist. Beyond that, data that is present but clearly wrong is also classed as missing. For example, if a segment of a time-series dataset suddenly has its timestamps scrambled, that small block of data is meaningless and can be treated as missing.

Detecting Missing Values

To make missing values easier to detect, Pandas marks missing data of any type with NaN. NaN stands for Not a Number and serves purely as a marker. The exception is time series, where a missing timestamp is marked NaT.

Pandas mainly uses two methods to detect missing values: isna() and notna(), which, as the names suggest, mean "is missing" and "is not missing". Both return booleans by default.

Next, we artificially generate a sample dataset containing missing values.

df = pd.DataFrame(np.random.rand(9, 5), columns=list('ABCDE'))
# insert a Time column and stamp it with a timestamp
df.insert(value=pd.Timestamp('2017-10-1'), loc=0, column='Time')
# set rows 2, 4, 6, 8 of columns 1, 3, 5 to missing values
df.iloc[[1, 3, 5, 7], [0, 2, 4]] = np.nan
# set rows 3, 5, 7, 9 of columns 2, 4, 6 to missing values
df.iloc[[2, 4, 6, 8], [1, 3, 5]] = np.nan
df
        Time         A         B         C         D         E
0 2017-10-01  0.740266  0.673770  0.688963  0.484102  0.262929
1        NaT  0.357889       NaN  0.857515       NaN  0.533836
2 2017-10-01       NaN  0.714951       NaN  0.296258       NaN
3        NaT  0.873991       NaN  0.732998       NaN  0.710549
4 2017-10-01       NaN  0.967554       NaN  0.470329       NaN
5        NaT  0.551162       NaN  0.787744       NaN  0.090273
6 2017-10-01       NaN  0.205467       NaN  0.962078       NaN
7        NaT  0.312581       NaN  0.053198       NaN  0.433269
8 2017-10-01       NaN  0.071785       NaN  0.111216       NaN

Then either isna() or notna() is enough to identify the missing values in the dataset.

df.isna()
    Time      A      B      C      D      E
0  False  False  False  False  False  False
1   True  False   True  False   True  False
2  False   True  False   True  False   True
3   True  False   True  False   True  False
4  False   True  False   True  False   True
5   True  False   True  False   True  False
6  False   True  False   True  False   True
7   True  False   True  False   True  False
8  False   True  False   True  False   True
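
Because isna() returns booleans, chaining .sum() onto it counts the missing values per column, which is often more practical than scanning the full boolean table:

df.isna().sum()  # number of missing values in each column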

We have now covered how missing values arise and how to detect them. In practice, dealing with missing values comes down to two operations at opposite extremes: filling and dropping. If you feel it is worth keeping the rows or columns containing missing values, fill them in; if not, drop them.

The dropping method dropna() was introduced above. Now let's look at fillna(), the method for filling missing values.

First, we can replace every NaN with the same scalar value, for example 0.

df.fillna(0)
                 Time         A         B         C         D         E
0 2017-10-01 00:00:00  0.740266  0.673770  0.688963  0.484102  0.262929
1                   0  0.357889  0.000000  0.857515  0.000000  0.533836
2 2017-10-01 00:00:00  0.000000  0.714951  0.000000  0.296258  0.000000
3                   0  0.873991  0.000000  0.732998  0.000000  0.710549
4 2017-10-01 00:00:00  0.000000  0.967554  0.000000  0.470329  0.000000
5                   0  0.551162  0.000000  0.787744  0.000000  0.090273
6 2017-10-01 00:00:00  0.000000  0.205467  0.000000  0.962078  0.000000
7                   0  0.312581  0.000000  0.053198  0.000000  0.433269
8 2017-10-01 00:00:00  0.000000  0.071785  0.000000  0.111216  0.000000

Besides filling in a literal value, parameters let us propagate the value before or after each missing value into it. (1) Fill with the value preceding the missing one:

df.fillna(method='pad')  # forward fill; newer Pandas versions prefer df.ffill()
        Time         A         B         C         D         E
0 2017-10-01  0.740266  0.673770  0.688963  0.484102  0.262929
1 2017-10-01  0.357889  0.673770  0.857515  0.484102  0.533836
2 2017-10-01  0.357889  0.714951  0.857515  0.296258  0.533836
3 2017-10-01  0.873991  0.714951  0.732998  0.296258  0.710549
4 2017-10-01  0.873991  0.967554  0.732998  0.470329  0.710549
5 2017-10-01  0.551162  0.967554  0.787744  0.470329  0.090273
6 2017-10-01  0.551162  0.205467  0.787744  0.962078  0.090273
7 2017-10-01  0.312581  0.205467  0.053198  0.962078  0.433269
8 2017-10-01  0.312581  0.071785  0.053198  0.111216  0.433269

(2) Fill with the value following the missing one:

df.fillna(method='bfill')  # backward fill; df.bfill() in newer versions
        Time         A         B         C         D         E
0 2017-10-01  0.740266  0.673770  0.688963  0.484102  0.262929
1 2017-10-01  0.357889  0.714951  0.857515  0.296258  0.533836
2 2017-10-01  0.873991  0.714951  0.732998  0.296258  0.710549
3 2017-10-01  0.873991  0.967554  0.732998  0.470329  0.710549
4 2017-10-01  0.551162  0.967554  0.787744  0.470329  0.090273
5 2017-10-01  0.551162  0.205467  0.787744  0.962078  0.090273
6 2017-10-01  0.312581  0.205467  0.053198  0.962078  0.433269
7 2017-10-01  0.312581  0.071785  0.053198  0.111216  0.433269
8 2017-10-01       NaN  0.071785       NaN  0.111216       NaN

The last row naturally keeps its missing values, since there is no following value to borrow from.

In the example above, the missing values alternated with real values. So what happens when missing values are consecutive? Let's try it. First, we additionally set the rows at positions 3 and 5 of columns 2, 4, and 6 (positions 1, 3, 5) to missing.

df.iloc[[3, 5], [1, 3, 5]] = np.nan

Then forward-fill:

df.fillna(method='pad')
        Time         A         B         C         D         E
0 2017-10-01  0.740266  0.673770  0.688963  0.484102  0.262929
1 2017-10-01  0.357889  0.673770  0.857515  0.484102  0.533836
2 2017-10-01  0.357889  0.714951  0.857515  0.296258  0.533836
3 2017-10-01  0.357889  0.714951  0.857515  0.296258  0.533836
4 2017-10-01  0.357889  0.967554  0.857515  0.470329  0.533836
5 2017-10-01  0.357889  0.967554  0.857515  0.470329  0.533836
6 2017-10-01  0.357889  0.205467  0.857515  0.962078  0.533836
7 2017-10-01  0.312581  0.205467  0.053198  0.962078  0.433269
8 2017-10-01  0.312581  0.071785  0.053198  0.111216  0.433269

As you can see, consecutive missing values are also filled from the preceding value, and filled completely. We can cap how many consecutive values get filled with the limit= parameter.

df.fillna(method='pad', limit=1)  # fill at most one consecutive missing value
        Time         A         B         C         D         E
0 2017-10-01  0.740266  0.673770  0.688963  0.484102  0.262929
1 2017-10-01  0.357889  0.673770  0.857515  0.484102  0.533836
2 2017-10-01  0.357889  0.714951  0.857515  0.296258  0.533836
3 2017-10-01       NaN  0.714951       NaN  0.296258       NaN
4 2017-10-01       NaN  0.967554       NaN  0.470329       NaN
5 2017-10-01       NaN  0.967554       NaN  0.470329       NaN
6 2017-10-01       NaN  0.205467       NaN  0.962078       NaN
7 2017-10-01  0.312581  0.205467  0.053198  0.962078  0.433269
8 2017-10-01  0.312581  0.071785  0.053198  0.111216  0.433269

Besides the filling styles above, we can fill specific columns or rows using Pandas' built-in statistics, such as the mean. For example:

df.fillna(df.mean()['C':'E'])
        Time         A         B         C         D         E
0 2017-10-01  0.740266  0.673770  0.688963  0.484102  0.262929
1        NaT  0.357889       NaN  0.857515  0.464797  0.533836
2 2017-10-01       NaN  0.714951  0.533225  0.296258  0.410011
3        NaT       NaN       NaN  0.533225  0.464797  0.410011
4 2017-10-01       NaN  0.967554  0.533225  0.470329  0.410011
5        NaT       NaN       NaN  0.533225  0.464797  0.410011
6 2017-10-01       NaN  0.205467  0.533225  0.962078  0.410011
7        NaT  0.312581       NaN  0.053198  0.464797  0.433269
8 2017-10-01       NaN  0.071785  0.533225  0.111216  0.410011

Columns C through E are filled with their respective column means.

Interpolation Filling

Interpolation is a technique from numerical analysis. In short, it uses a function (linear or non-linear) to estimate unknown values from known data. Interpolation is very common in data work; its advantage is that it tries to restore the data to what it plausibly looked like.

We can perform linear interpolation with the interpolate() method. Other interpolation algorithms are described in the official documentation.

# build a sample DataFrame
df = pd.DataFrame({'A': [1.1, 2.2, np.nan, 4.5, 5.7, 6.9],
                   'B': [.21, np.nan, np.nan, 3.1, 11.7, 13.2]})
df

For the missing values above, filling with neighboring values or the mean would not really reflect the trend; this is where interpolation works best. Let's try the default linear interpolation.

df_interpolate = df.interpolate()
df_interpolate
      A          B
0  1.10   0.210000
1  2.20   1.173333
2  3.35   2.136667
3  4.50   3.100000
4  5.70  11.700000
5  6.90  13.200000

If you plot the interpolated data, you can clearly see that the result follows the trend of the data; filling with preceding or following values could not achieve this.

interpolate() supports several interpolation algorithms, selected via the method= parameter. A few suggestions for choosing one:

  1. If your data grows faster and faster, consider method='quadratic', quadratic interpolation.
  2. If the dataset resembles a cumulative distribution, method='pchip' is recommended.
  3. If you are filling missing values with smooth plotting as the goal, method='akima' is recommended.

Note that the last option, method='akima', requires the SciPy library in your environment. Likewise, method='barycentric' and method='pchip' also require SciPy.
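
Switching algorithms is just a matter of the method argument. A quick sketch on the df above (as noted, the non-linear methods rely on SciPy):

df.interpolate(method='quadratic')  # second-order interpolation, for accelerating growth
df.interpolate(method='pchip')      # piecewise cubic Hermite, for cumulative-looking data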

Data Visualization

NumPy, Pandas, and Matplotlib form a well-rounded data analysis ecosystem, so the three tools are highly compatible and even share a lot of interface. When our data is in DataFrame form, we can draw common charts directly with Pandas' DataFrame.plot method, which calls the Matplotlib interface.

For example, let's draw a line chart from the interpolated data df_interpolate above.

df_interpolate.plot()

Other chart styles are just as simple: specify the kind= parameter.

df_interpolate.plot(kind='bar')

For more chart styles and parameters, read the detailed descriptions in the official documentation. Although Pandas plotting does not match Matplotlib's flexibility, it is easy to use and well suited to quickly presenting and previewing data.

Other Usage

Pandas contains so much that, short of reading the complete official documentation, it is hard to cover it comprehensively in a single experiment or course. The goal of this course is to familiarize you with Pandas' common basic methods, so that you at least have a general idea of what Pandas is and what it can do.

Beyond the methods and techniques covered above, Pandas offers many other commonly used features.

For more about Pandas, you can follow my subsequent articles. Thank you for your support!

Experiment Summary

In this experiment, we focused on the data structures of Pandas. You need a solid understanding of Series and DataFrame in order to go deeper later into using Pandas for data preprocessing. We also covered the methods and techniques of reading, selecting, deleting, and filling data in Pandas. I hope you will deepen your understanding further alongside the official documentation.
