One of the Three Musketeers of Data Analysis: A Detailed Guide to Pandas

Table of contents

1 Introduction to Pandas

2 Installation and import of Pandas

2.1 Pandas module installation

2.2 Pandas module import

3 pandas data structures and functions

3.1 Series structure

3.1.1 ndarray creates Series object

3.1.2 dict creates Series object

3.1.3 Scalar creation of Series objects

3.1.4 Position index access Series data

3.1.5 Tag index access Series data

3.1.6 Series common attributes axes 

3.1.7 Series common attribute index

3.1.8 Series common attributes values

3.2 DataFrame structure

3.2.1 Create DataFrame object from list

3.2.2 Create DataFrame object from a dictionary of lists

3.2.3 Create DataFrame object from a list of dictionaries

3.2.4 Series creates DataFrame object

3.2.5 Select DataFrame data by column index

3.2.6 Add DataFrame data by column index

3.2.7 Delete DataFrame data by column index

3.2.8 Row label index selects DataFrame data

3.2.9 Select DataFrame data using integer index

3.2.10 Slicing operation to select DataFrame data from multiple rows

3.2.11 Add DataFrame data rows

3.2.12 Delete DataFrame data rows

3.2.13 DataFrame attribute methods info(), index, columns, values, axes

3.2.14 head()&tail() to view DataFrame data

3.2.15 dtypes View DataFrame data type

3.2.16 empty determines whether the DataFrame is empty

3.2.17 ndim&shape to view DataFrame dimensionality and shape

3.2.18 size Check the number of elements of DataFrame

3.2.19 T (Transpose) transposes DataFrame

3.3 pandas descriptive statistics

3.3.1 Sum of all values in vertical and horizontal directions

3.3.2 mean() finds the average value

3.3.3 std() finds the standard deviation

3.3.4 Custom function: operate the entire data table pipe()

3.3.5 Custom function: operate row or column apply()

3.4 pandas iteration traversal

3.4.1 Traverse rows iterrows() in the form of (row_index, row):

3.4.2 Traverse rows using named tuples itertuples()

3.5 pandas sorting sorting

3.5.1 axis=0, ascending=True defaults to ascending order by "row label"

3.5.2 axis=1 Sort by "column label" in ascending order

3.6 pandas deduplication function

3.6.1 Keep first occurrence of row duplicates

3.6.2 keep=False removes all row duplicates

3.6.3 subset deletes the specified single column to remove duplicates

3.6.4 subset specifies multiple columns to remove duplicates at the same time

3.7 Pandas missing value processing

3.7.1 Checking for missing values

3.7.2 Missing data calculation

3.7.3 Clean and fill missing values

3.7.4 Use replace to replace common values

3.7.5 Delete missing values

3.8 pandas csv operation

3.8.1 read_csv() reads files

3.8.2 names change file header name

3.8.3 skiprows skips the specified number of rows

3.8.4 to_csv() converts data

3.9 pandas operating Excel

3.9.1 to_excel() data conversion

3.9.2 Insert multiple sheet data at one time

3.9.3 Additional sheet table contents

3.9.4 read_excel() reads data

3.10 File formats supported by pandas


1 Introduction to Pandas

Pandas is an open-source, third-party Python library built on top of NumPy and Matplotlib. Together with NumPy and Matplotlib it enjoys the reputation of being one of the "Three Musketeers" of data analysis, and it has become an essential tool for Python data analysis. Its goal is to be a powerful and flexible data analysis tool that can support any programming language. Pandas has the following characteristics:

  • DataFrame is an efficient and fast data structure; Pandas supports the DataFrame format and lets you customize its indexes.
  • Data files in different formats can be loaded into memory
  • Unaligned data with differing indexes can be automatically aligned by axis
  • Both time-series and non-time-series data can be handled
  • Indexes can be sliced by label to obtain subsets of large data sets
  • High-performance data grouping, aggregation, insertion and deletion
  • Flexible handling of missing data, reshaping, and gaps
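The automatic-alignment bullet above can be seen in a minimal sketch (the variable names and sample data are illustrative): adding two Series with different indexes aligns them by label, and positions with no counterpart are filled with NaN.

```python
import pandas as pd

# Two Series with partially overlapping index labels
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Arithmetic aligns on the index labels automatically;
# 'a' and 'd' have no counterpart in the other Series, so the result there is NaN
result = s1 + s2
print(result)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64
```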

Pandas is widely used in business fields such as finance, economics, data analysis, statistics, etc., providing convenience for data practitioners in various fields.

Official website address: https://pandas.pydata.org/

2 Installation and import of Pandas

2.1 Pandas module installation

Pandas can be installed with pip, the package management tool that comes with Python:

pip install pandas

2.2 Pandas module import

Import of Pandas

import pandas as pd
import numpy as np   # pandas和numpy常常结合在一起使用

3 pandas data structures and functions

Constructing and processing two-dimensional and multi-dimensional arrays is a tedious task. To solve this problem, Pandas builds two different data structures based on ndarray arrays (arrays in NumPy), namely Series (one-dimensional data structure) and DataFrame (two-dimensional data structure):

  • Series is a one-dimensional array with labels. The label here can be understood as an index, but this index is not limited to integers. It can also be a character type, such as a, b, c, etc.;
  • DataFrame is a tabular data structure that has both row and column labels.
Data structure Dimensions Description
Series 1 Stores data of any type, such as strings, integers, floating-point numbers and Python objects. A Series is described by its name and index attributes. Being a one-dimensional structure, its dimensionality cannot be changed.
DataFrame 2 A two-dimensional tabular data structure with both a row index (index) and a column index (columns); the corresponding index values can be specified when the structure is created.

3.1 Series structure

The Series structure, also known as a Series sequence, is one of the most commonly used data structures in Pandas. It is similar to a one-dimensional array and consists of a set of data values (values) and a set of labels, with a one-to-one correspondence between the labels and the data values.
A Series can hold any data type, such as integers, strings, floating-point numbers and Python objects. Its labels default to integers, starting from 0 and increasing sequentially. The structure diagram of Series is as follows:

Through labels, we can more intuitively view the index location of the data.

Function prototype:

pandas.Series( data, index, dtype, copy)

Parameter Description:

#data The input data: a list, scalar, ndarray array, dict, etc. 
#index Index values must be unique; if no index is passed, it defaults to np.arange(n). 
#dtype The data type; if not provided, it is inferred automatically. 
#copy Whether to copy the data; defaults to False.

Series objects can be created from arrays, dictionaries, scalar values, or Python objects

There are two ways to access data in Series, one is position index access; the other is label index access.

The following table lists the commonly used properties of Series objects.

name Description
axes Returns all row index labels as a list
dtype Returns the data type of the object
empty Returns True if the Series object is empty
ndim Returns the dimensionality of the data (always 1 for a Series)
size Returns the number of elements
values Returns the Series data as an ndarray
index Returns the index object (a RangeIndex for default indexes), describing the range of the index
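The attributes that are not demonstrated in the sections below (dtype, empty, ndim, size) can be checked with a short sketch (the sample Series is illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

print(s.dtype)   # int64 — data type of the values
print(s.empty)   # False — the Series contains data
print(s.ndim)    # 1 — a Series is always one-dimensional
print(s.size)    # 5 — number of elements
```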

3.1.1 ndarray creates Series object

import pandas as pd
import numpy as np
data = np.array(['a', 'b', 'c', 'd'])

# 使用默认索引,创建 Series 序列对象
s1 = pd.Series(data)
print(f'默认索引:\n{s1}')

# 使用“显式索引”的方法自定义索引标签
s2 = pd.Series(data, index=[100, 101, 102, 103])
print(f'自定义索引\n{s2}')

The running results are shown as follows:

默认索引:
0    a
1    b
2    c
3    d
dtype: object

自定义索引
100    a
101    b
102    c
103    d
dtype: object

In the example, s1 is not given an index, so an index is allocated starting from 0 by default; its range is 0 to len(data)-1.

3.1.2 dict creates Series object

Use a dict as the input data. If no index is passed, the index is built from the dictionary's keys; if an index is passed, its labels are matched one-to-one against the dictionary keys, and any label without a matching key is filled with NaN.

import pandas as pd
import numpy as np

data = {'a': 0, 'b': 1, 'c': 2}

# 没有传递索引时 会按照字典的键来构造索引
s1_dict = pd.Series(data)
print(f'没有传递索引\n{s1_dict}')

# 字典类型传递索引时 索引时需要将索引标签与字典中的值一一对应
# 当传递的索引值无法找到与其对应的值时,使用 NaN(非数字)填充
s2_dict = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(f'传递索引\n{s2_dict}')

The running results are shown as follows:

没有传递索引
a    0
b    1
c    2
dtype: int64

传递索引
a    0.0
b    1.0
c    2.0
d    NaN
dtype: float64

3.1.3 Scalar creation of Series objects

import pandas as pd
import numpy as np

# 如果data是标量值,则必须提供索引: 标量值按照 index 的数量进行重复,并与其一一对应
s3 = pd.Series(6, index=[0,1,2,3])
print(f'标量值,则必须提供索引\n{s3}')

The running results are shown as follows:

标量值,则必须提供索引
0    6
1    6
2    6
3    6
dtype: int64

3.1.4 Position index access Series data

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(f'Series数据\n{s}')

# 位置索引 第一个位置索引:0
print(f'位置索引={s[0]}')

# 标签索引 第一个标签索引:a
print(f'标签索引={s["a"]}')#

# 通过切片的方式访问 Series 序列中的数据
print(f'前两个元素\n{s[:2]}')

print(f'最后三个元素\n{s[-3:]}')

The running results are shown as follows:

Series数据
a    1
b    2
c    3
d    4
e    5
dtype: int64
位置索引=1
标签索引=1
前两个元素
a    1
b    2
dtype: int64
最后三个元素
c    3
d    4
e    5
dtype: int64

3.1.5 Tag index access Series data

A Series is similar to a fixed-size dict: the index labels act as keys and the Series elements as values, so element values can be accessed or modified through their index labels.

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],index=['a', 'b', 'c', 'd', 'e'])
print(f'Series数据\n{s}')

# 标签索引访问单个元素
print(f'标签索引访问单个元素={s["a"]}') 

# 标签索引访问多个元素
print(f'标签索引访问多个元素\n{s[["a","b","c"]]}')

The running results are shown as follows:

Series数据
a    1
b    2
c    3
d    4
e    5
dtype: int64
标签索引访问单个元素=1
标签索引访问多个元素
a    1
b    2
c    3
dtype: int64

Accessing a label that does not exist raises a KeyError.
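A minimal sketch of this behaviour; Series.get() is shown as the exception-free alternative (the labels and default value used are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Accessing a label that does not exist raises KeyError
try:
    value = s['f']
except KeyError:
    value = None

# get() returns None (or a supplied default) for a missing label
# instead of raising an exception
safe = s.get('f', 'not found')
print(value, safe)
```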

3.1.6 Series common attributes axes 

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5))
print(f'默认索引\n{s}')

s1 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(f'自定义索引\n{s1}')

# axes以列表的形式返回所有行索引标签
print(f'默认索引:{s.axes}')
print(f'自定义索引:{s1.axes}')

The running results are shown as follows:

默认索引
0    0.327024
1    0.679870
2    0.714354
3   -0.215886
4   -1.857184
dtype: float64
自定义索引
a   -0.375701
b   -1.400197
c   -0.187348
d   -0.853269
e    0.129702
dtype: float64
默认索引:[RangeIndex(start=0, stop=5, step=1)]
自定义索引:[Index(['a', 'b', 'c', 'd', 'e'], dtype='object')]

3.1.7 Series common attribute index

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5))
print(f'默认索引\n{s}')

s1 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(f'自定义索引\n{s1}')

# index返回一个RangeIndex对象,用来描述索引的取值范围
print(f'默认索引:{s.index}')

#
print(f'自定义索引:{s1.index}')

# 通过.index.values 获取索引列表
print(s.index.values)
print(s1.index.values)

The running results are shown as follows:

默认索引
0    0.200998
1    0.469934
2    0.096422
3   -0.399627
4    0.783720
dtype: float64
自定义索引
a   -1.639293
b   -0.128694
c   -0.940741
d   -1.547780
e    0.670969
dtype: float64
默认索引:RangeIndex(start=0, stop=5, step=1)
自定义索引:Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
[0 1 2 3 4]
['a' 'b' 'c' 'd' 'e']

3.1.8 Series common attributes values

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5))
print(f'默认索引\n{s}')

# values以数组的形式返回Series对象中的数据。
print(s.values)

 The running results are shown as follows:

默认索引
0   -0.772736
1   -0.473425
2   -0.588307
3    0.723052
4    0.601033
dtype: float64
[-0.77273598 -0.47342456 -0.5883065   0.72305156  0.60103283]

3.2 DataFrame structure

DataFrame is a tabular data structure with both row labels (index) and column labels (columns). It is also called a heterogeneous data table: each column of the table can have a different data type, such as string, integer or floating point. Its structure diagram is as follows:

The function prototype is as follows:

pandas.DataFrame( data, index, columns, dtype, copy)

Parameter Description:

data input data can be ndarray, series, list, dict, scalar and a DataFrame. 
index row label, if no index value is passed, the default row label is np.arange(n), n represents the number of elements of data. 
columns column label, if no columns value is passed, the default column label is np.arange(n). 
dtype dtype represents the data type of each column. 
copy defaults to False, which means copying data data.

The properties and methods of DataFrame are as follows:

name Property & method description
index Returns the row index
columns Returns the column index
values Returns the element values in the DataFrame as a NumPy array
head() Returns the first n rows of data
tail() Returns the last n rows of data
axes Returns a list whose members are the row-axis and column-axis labels
dtypes Returns the data type of each column
empty Returns True if the DataFrame contains no data or any axis has length 0
ndim The number of axes, i.e. the dimensionality of the array
shape Returns a tuple (rows, columns) describing the dimensions of the DataFrame
size Returns the number of elements in the DataFrame
shift() Shifts rows or columns by a specified stride
T Transposes rows and columns
info() Prints summary information: number of rows and columns, the non-null count and dtype of each column
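Most of these attributes are demonstrated in the sections that follow; shift(), which is not, can be sketched as follows (the sample data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4]})

# shift(1) moves the values down by one row;
# the vacated first position is filled with NaN
df['shifted'] = df['one'].shift(1)
print(df)
#    one  shifted
# 0    1      NaN
# 1    2      1.0
# 2    3      2.0
# 3    4      3.0
```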


3.2.1 Create DataFrame object from list

import pandas as pd

# 单一列表创建 DataFrame
data = [1, 2, 3]
df1 = pd.DataFrame(data)
print(f'单一列表\n{df1}')

# 使用嵌套列表创建 DataFrame 对象
data = [['java', '10'], ['python', '20'], ['C++', '30']]
df2 = pd.DataFrame(data)
print(f'嵌套列表创建\n{df2}')

# 指定columns列标签;dtype只能是单一类型(如str),不能传元组
df3 = pd.DataFrame(data, columns=['name', 'age'], dtype=str)
print(f'指定数据类型和columns\n{df3}')

The running results are shown as follows:

单一列表
   0
0  1
1  2
2  3
嵌套列表创建
        0   1
0    java  10
1  python  20
2     C++  30
指定数据类型和columns
     name age
0    java  10
1  python  20
2     C++  30

3.2.2 Create DataFrame object from a dictionary of lists

In the data dictionary, the lists corresponding to the keys must be of equal length. If an index is passed, its length must equal the length of the lists; if no index is passed, the index defaults to range(n), where n is the length of the lists.

import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df1 = pd.DataFrame(data)
print(f'默认索引\n{df1}')

df2 = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(f'自定义索引\n{df2}')

The running results are shown as follows:

默认索引
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42
自定义索引
    Name  Age
a    Tom   28
b   Jack   34
c  Steve   29
d  Ricky   42

3.2.3 Create DataFrame object from a list of dictionaries

When a list of dictionaries is passed in as data, the dictionary keys are used as column labels (columns) by default.

Note: If an element value is missing, that is, a dictionary key has no corresponding value for a row, NaN is used instead.

import pandas as pd

# 字典的键被用作列名 如果其中某个元素值缺失,也就是字典的 key 无法找到对应的 value,将使用 NaN 代替。
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(data)
print(df1)

# 自定义行标签索引
df2 = pd.DataFrame(data, index=['first', 'second'])
print(df2)

# 如果列名 在字典键中不存在,所以对应值为 NaN。
df3 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
df4 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(f'df3的列名在字典键中存在\n{df3}')
print(f'df4的列名b1在字典键不中存在\n{df4}')

The running results are shown as follows:

   a   b     c
0  1   2   NaN
1  5  10  20.0

        a   b     c
first   1   2   NaN
second  5  10  20.0

df3的列名在字典键中存在
        a   b
first   1   2
second  5  10

df4的列名b1在字典键不中存在
        a  b1
first   1 NaN
second  5 NaN

3.2.4 Series creates DataFrame object

Passing a dictionary of Series creates a DataFrame object whose row index is the union of all the Series indexes.

import pandas as pd

# Series创建DataFrame对象 其输出结果的行索引是所有index的合集
data = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
        'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df)

The running results are shown as follows:

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

3.2.5 Select DataFrame data by column index

DataFrame can use column index to complete data selection, addition and deletion operations

import pandas as pd

data = [['java', 10, 9], ['python', 20, 100], ['C++', 30, 50]]
df1 = pd.DataFrame(data, columns=['name', 'age', 'number'])
print(f'数据df1\n{df1}')

# 获取数据方式一:使用列索引,实现数据获取某一行数据 df[列名]等于df.列名
print(f'通过df1.name方式获取\n{df1.name}')
print(f'通过df1["name"]方式获取\n{df1["name"]}')

# 获取数据方式二:使用列索引,实现数据获取某多行数据 df[list]
print(f'通过df[list]方式获取多列数据\n{df1[["name","number"]]}')

# 获取数据方式三:使用布尔值筛选获取某行数据
# 不同的条件用()包裹起来,并或非分别使用&,|,~而非and,or,not
print(f'获取name=python的数据\n{df1[df1["name"]=="python"]}')

print(f'获取age大于等于20的数据\n{df1[df1["age"]>=20]}')

print(f'获取name=python的数据或者是age等于30\n{df1[(df1["name"]=="python") | (df1["age"]==30)]}')

The running results are shown as follows:

数据df1
     name  age  number
0    java   10       9
1  python   20     100
2     C++   30      50

通过df1.name方式获取
0      java
1    python
2       C++
Name: name, dtype: object

通过df1["name"]方式获取
0      java
1    python
2       C++
Name: name, dtype: object

通过df[list]方式获取多列数据
     name  number
0    java       9
1  python     100
2     C++      50

获取name=python的数据
     name  age  number
1  python   20     100

获取age大于等于20的数据
     name  age  number
1  python   20     100
2     C++   30      50

获取name=python的数据或者是age等于30
     name  age  number
1  python   20     100
2     C++   30      50

3.2.6 Add DataFrame data by column index

Use the columns column index table label to add new data columns

import pandas as pd

# 列索引添加数据列
data = {'one': [1, 2, 3], 'two': [2, 3, 4]}
df1 = pd.DataFrame(data, index=['a', 'b', 'c'])
print(f'原数据\n{df1}')

# 方式一:使用df['列']=值,插入新的数据列
df1['three'] = pd.Series([10, 20, 30], index=list('abc'))
print(f'使用df["列"]=值,插入新的数据\n{df1}')

# 方式二:#将已经存在的数据列做相加运算
df1['four'] = df1['one']+df1['three']
print(f'将已经存在的数据列做相加运算\n{df1}')

# 方式三:使用 insert() 方法插入新的列
# 数值4代表插入到columns列表的索引位置
df1.insert(4, column='score', value=[50, 60, 70])
print(f'使用insert()方法插入\n{df1}')

The running results are shown as follows:

原数据
   one  two
a    1    2
b    2    3
c    3    4
使用df["列"]=值,插入新的数据
   one  two  three
a    1    2     10
b    2    3     20
c    3    4     30
将已经存在的数据列做相加运算
   one  two  three  four
a    1    2     10    11
b    2    3     20    22
c    3    4     30    33
使用insert()方法插入
   one  two  three  four  score
a    1    2     10    11     50
b    2    3     20    22     60
c    3    4     30    33     70

3.2.7 Delete DataFrame data by column index

Data columns in DataFrame can be deleted through both del and pop()

import pandas as pd

data = {'one': [1, 2, 3], 'two': [20, 30, 40], 'three': [20, 30, 40]}
df1 = pd.DataFrame(data, index=['a', 'b', 'c'])
print(f'原数据\n{df1}')

# 方式一 del 删除某一列
del df1["one"]
print(f'通过del df["列名"]删除\n{df1}')

# 方式二 pop() 删除某一列
df1.pop("two")
print(f'通过pop("列名")删除\n{df1}')

The running results are shown as follows:

原数据
   one  two  three
a    1   20     20
b    2   30     30
c    3   40     40

通过del df["列名"]删除
   two  three
a   20     20
b   30     30
c   40     40

通过pop("列名")删除
   three
a     20
b     30
c     40

3.2.8 Row label index selects DataFrame data

You can pass row labels to loc to select data. loc accepts two parameters, a row and a column, separated by a comma, but it only accepts label indexes.

import pandas as pd

data = {'one': [1, 2, 3, 4], 'two': [20, 30, 40, 50], 'three': [60, 70, 80, 90]}
df1 = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(f'原数据\n{df1}')

# 取某一行数据
print(f'取某一行数据\n{df1.loc["a"]}')

# loc允许接两个参数分别是行和列,参数之间需要使用“逗号”隔开,但该函数只能接收标签索引
# 获取某一个单元格的数据
print(f"取某一个单元格的数据\n{df1.loc['a','two']}")

# 更改某一个单元格的数据
df1.loc['a', 'two'] = 'abc'
print(f"更改后的数据\n{df1}")

The running results are shown as follows:

原数据
   one  two  three
a    1   20     60
b    2   30     70
c    3   40     80
d    4   50     90

取某一行数据
one       1
two      20
three    60
Name: a, dtype: int64

取某一个单元格的数据
20

更改后的数据
   one  two  three
a    1  abc     60
b    2   30     70
c    3   40     80
d    4   50     90

3.2.9 Select DataFrame data using integer index

Data row selection can also be achieved by passing the index position of the data row to the iloc function. iloc allows to accept two parameters, row and column, separated by "comma", but this function can only accept integer indexes.

import pandas as pd

data = {'one': [1, 2, 3, 4], 'two': [20, 30, 40, 50],'three': [60, 70, 80, 90]}
df1 = pd.DataFrame(data,index=['a', 'b', 'c', 'd'])
print(f'原数据\n{df1}')

# 取某一行的数据 索引是从0开始
print(f'取某一行的数据\n{df1.iloc[0]}')

The running results are shown as follows:

原数据
   one  two  three
a    1   20     60
b    2   30     70
c    3   40     80
d    4   50     90
取某一行的数据
one       1
two      20
three    60
Name: a, dtype: int64

3.2.10 Slicing operation to select DataFrame data from multiple rows

loc allows two parameters: rows and columns. The parameters need to be separated by "comma", but this function can only receive label indexes.

iloc allows to accept two parameters, row and column, separated by "comma", but this function can only accept integer indexes.

import pandas as pd

data = {'one': [1, 2, 3, 4], 'two': [20, 30, 40, 50], 'three': [60, 70, 80, 90]}
df1 = pd.DataFrame(data,index=['a', 'b', 'c', 'd'])
print(f'原数据\n{df1}')

# loc[] 允许接两个参数分别是行和列,参数之间需要使用“逗号”隔开,但该函数只能接收标签索引
print(f"#loc[]方式获取第三行最后两列数据\n{df1.loc['c','two':'three']}")

# iloc[] 允许接受两个参数分别是行和列,参数之间使用“逗号”隔开,但该函数只能接收整数索引。
print(f"#iloc[]方式获取第三行最后两列数据\n{df1.iloc[2,1:3]}")

The running results are shown as follows:

原数据
   one  two  three
a    1   20     60
b    2   30     70
c    3   40     80
d    4   50     90

#loc[]方式获取第三行最后两列数据
two      40
three    80
Name: c, dtype: int64

#iloc[]方式获取第三行最后两列数据
two      40
three    80
Name: c, dtype: int64

3.2.11 Add DataFrame data rows

New data rows can be appended at the end of a DataFrame. The append() method traditionally used for this was removed in pandas 2.0, so pd.concat() is used instead; it likewise returns a new DataFrame with the rows appended at the end.

import pandas as pd

data = {'one': [1, 2, 3, 4], 'two': [20, 30, 40, 50], 'three': [60, 70, 80, 90]}
df1 = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(f'#原数据\n{df1}')

df2 = pd.DataFrame({'one': 'Q', 'two': 'W'}, index=['e'])

# append() 在 pandas 2.0 中已被移除,改用 pd.concat() 返回一个新的 DataFrame 对象
df = pd.concat([df1, df2])
print(f'#在行末追加新数据行\n{df}')

The running results are shown as follows:

#原数据
   one  two  three
a    1   20     60
b    2   30     70
c    3   40     80
d    4   50     90

#在行末追加新数据行
  one two  three
a   1  20   60.0
b   2  30   70.0
c   3  40   80.0
d   4  50   90.0
e   Q   W    NaN

3.2.12 Delete DataFrame data rows

You can use row index labels to delete rows from a DataFrame with drop(); if an index label is duplicated, all rows carrying it are deleted together.

drop(row label) deletes a row (pop() only works on columns)

pop(column name) deletes a column

Note: If there are duplicate row index labels, drop() deletes all of them at once.
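A small sketch of the duplicate-label behaviour (the sample data is illustrative): drop() removes every row carrying the given label in one call.

```python
import pandas as pd

# Two rows deliberately share the row label 'a'
df = pd.DataFrame({'one': [1, 2, 3]}, index=['a', 'a', 'b'])

# drop('a') removes both rows labelled 'a' at once
df2 = df.drop('a')
print(df2)
#    one
# b    3
```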

import pandas as pd

data = {'one': [1, 2, 3, 4], 'two': [20, 30, 40, 50], 'three': [60, 70, 80, 90]}
df1 = pd.DataFrame(data,index=['a', 'b', 'c', 'd'])
print(f'原数据\n{df1}')

# drop(行索引)  删除某一行
df = df1.drop('a')
print(f'drop(行索引)  删除某一行\n{df}')

# pop(列名)    删除某一列
df1.pop("one")
print(f'#pop(列名)    删除某一列\n{df1}')

The running results are shown as follows:

原数据
   one  two  three
a    1   20     60
b    2   30     70
c    3   40     80
d    4   50     90
drop(行索引)  删除某一行
   one  two  three
b    2   30     70
c    3   40     80
d    4   50     90
#pop(列名)    删除某一列
   two  three
a   20     60
b   30     70
c   40     80
d   50     90

3.2.13 DataFrame attribute methods info(), index, columns, values, axes

  • info(): Returns summary information about the DataFrame object
  • index: Returns the row index
  • columns: Returns the column index
  • values: Returns the element values in the DataFrame as a NumPy array
  • axes: Returns a list of the row labels and column labels
import pandas as pd

data = {
    'name:': pd.Series(['c语言中文网', "百度", '360搜索', '谷歌', 'Bing搜索', 'CSDN', '华为云']),
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(data)
print(f'#原数据\n{df}')

# info() 获取相关信息
print(f'#df.info()获取DataFrame相关信息\n{df.info()}')

# index 获取行索引
print(f'#df.index 获取行索引\n{df.index}')

# columns 获取列索引
print(f'#df.columns 获取列索引\n{df.columns}')

# axes 获取行标签、列标签组成的列表
print(f'#df.axes 获取行标签、列标签组成的列表\n{df.axes}')

# values 使用numpy数组表示Dataframe中的元素值
print(f'#df.values获取Dataframe中的元素值\n{df.values}')

The running results are shown as follows:

#原数据
    name:  year  Rating
0  c语言中文网     5    4.23
1      百度     6    3.24
2   360搜索    15    3.98
3      谷歌    28    2.56
4  Bing搜索     3    3.20
5    CSDN    19    4.60
6     华为云    23    3.80
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name:   7 non-null      object 
 1   year    7 non-null      int64  
 2   Rating  7 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes
#df.info()获取DataFrame相关信息
None
#df.index 获取行索引
RangeIndex(start=0, stop=7, step=1)
#df.columns 获取列索引
Index(['name:', 'year', 'Rating'], dtype='object')
#df.axes 获取行标签、列标签组成的列表
[RangeIndex(start=0, stop=7, step=1), Index(['name:', 'year', 'Rating'], dtype='object')]
#df.values获取Dataframe中的元素值
[['c语言中文网' 5 4.23]
 ['百度' 6 3.24]
 ['360搜索' 15 3.98]
 ['谷歌' 28 2.56]
 ['Bing搜索' 3 3.2]
 ['CSDN' 19 4.6]
 ['华为云' 23 3.8]]

3.2.14 head()&tail() to view DataFrame data

If you want to view only part of a DataFrame, use the head() or tail() method: head() returns the first n rows of data (the first 5 by default), and tail() returns the last n rows.

import pandas as pd

data = {
    'name:': pd.Series(['c语言中文网', "百度", '360搜索', '谷歌', 'Bing搜索', 'CSDN', '华为云']),
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}

df = pd.DataFrame(data)
print(f'#原数据\n{df}')

# head(n) 返回前n行数据 默认是前5行
print(f'#df.head(n) 返回前n行数据\n{df.head(2)}')

# tail(n) 返回后n行数据
print(f'#df.tail(n) 返回后n行数据\n{df.tail(2)}')

The running results are shown as follows:

#原数据
    name:  year  Rating
0  c语言中文网     5    4.23
1      百度     6    3.24
2   360搜索    15    3.98
3      谷歌    28    2.56
4  Bing搜索     3    3.20
5    CSDN    19    4.60
6     华为云    23    3.80

#df.head(n) 返回前n行数据
    name:  year  Rating
0  c语言中文网     5    4.23
1      百度     6    3.24

#df.tail(n) 返回后n行数据
  name:  year  Rating
5  CSDN    19     4.6
6   华为云    23     3.8

3.2.15 dtypes View DataFrame data type

Returns the type of data in each column

import pandas as pd

data = {
    'name:': pd.Series(['c语言中文网', "百度", '360搜索', '谷歌', 'Bing搜索', 'CSDN', '华为云']),
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}

df = pd.DataFrame(data)
print(f'#原数据\n{df}')

# dtypes 获取每一列数据的数据类型
print(f'#df.dtypes返回每一列的数据类型\n{df.dtypes}')

The running results are shown as follows:

#原数据
    name:  year  Rating
0  c语言中文网     5    4.23
1      百度     6    3.24
2   360搜索    15    3.98
3      谷歌    28    2.56
4  Bing搜索     3    3.20
5    CSDN    19    4.60
6     华为云    23    3.80

#df.dtypes返回每一列的数据类型
name:      object
year        int64
Rating    float64
dtype: object

3.2.16 empty determines whether the DataFrame is empty

Returns a Boolean value to determine whether the output data object is empty. If True, the object is empty.

import pandas as pd

data = {
    'name:': pd.Series(['c语言中文网', "百度", '360搜索', '谷歌', 'Bing搜索', 'CSDN', '华为云']),
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}

df = pd.DataFrame(data)
print(f'#原数据\n{df}')

# empty 判断输出的数据对象是否为空,若为 True 表示对象为空
print(f'#df.empty 对象是否为空,若为 True 表示对象为空\n{df.empty}')

The running results are shown as follows:

#原数据
    name:  year  Rating
0  c语言中文网     5    4.23
1      百度     6    3.24
2   360搜索    15    3.98
3      谷歌    28    2.56
4  Bing搜索     3    3.20
5    CSDN    19    4.60
6     华为云    23    3.80
#df.empty 对象是否为空,若为 True 表示对象为空
False

3.2.17 ndim&shape to view DataFrame dimensionality and shape

ndim: returns the number of dimensions of the data object

shape: returns a tuple (a, b) describing the dimensions of the DataFrame, where a is the number of rows and b is the number of columns

import pandas as pd

data = {
    'name:': pd.Series(['c语言中文网', "百度", '360搜索', '谷歌', 'Bing搜索', 'CSDN', '华为云']),
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(data)
print(f'#原数据\n{df}')

# ndim 查看DataFrame的维数 同时也适合Series
print(f"#df.ndim 查看DataFrame的维数\n{df.ndim}")

# shape 维度的元组。返回值元组 (a,b),其中 a 表示行数,b 表示列数 同时也适合Series
print(f"#df.shape 维度的元组。返回值元组 (a,b),其中 a 表示行数,b 表示列数\n{df.shape}")

The running results are shown as follows:

#原数据
    name:  year  Rating
0  c语言中文网     5    4.23
1      百度     6    3.24
2   360搜索    15    3.98
3      谷歌    28    2.56
4  Bing搜索     3    3.20
5    CSDN    19    4.60
6     华为云    23    3.80
#df.ndim 查看DataFrame的维数
2
#df.shape 维度的元组。返回值元组 (a,b),其中 a 表示行数,b 表示列数
(7, 3)

3.2.18 size Check the number of elements of DataFrame

Returns the number of elements of the DataFrame object

import pandas as pd

data = {
    'name:': pd.Series(['c语言中文网', "百度", '360搜索', '谷歌', 'Bing搜索', 'CSDN', '华为云']),
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(data)
print(f'#原数据\n{df}')

# size查看DataFrame对象元素的数量
print(f'#df.size 查看DataFrame对象元素的数量\n{df.size}')

The running results are shown as follows:

#原数据
    name:  year  Rating
0  c语言中文网     5    4.23
1      百度     6    3.24
2   360搜索    15    3.98
3      谷歌    28    2.56
4  Bing搜索     3    3.20
5    CSDN    19    4.60
6     华为云    23    3.80
#df.size 查看DataFrame对象元素的数量
21

3.2.19 T (Transpose) transposes DataFrame

import pandas as pd

data = {
    'name:': pd.Series(['c语言中文网', "百度", '360搜索', '谷歌', 'Bing搜索', 'CSDN', '华为云']),
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(data)
print(f'#原数据\n{df}')

#  T(Transpose)转置  把行和列进行交换
print(f'#df.T把行和列进行交换\n{df.T}')

The running results are shown as follows:

#原数据
    name:  year  Rating
0  c语言中文网     5    4.23
1      百度     6    3.24
2   360搜索    15    3.98
3      谷歌    28    2.56
4  Bing搜索     3    3.20
5    CSDN    19    4.60
6     华为云    23    3.80
#df.T把行和列进行交换
             0     1      2     3       4     5    6
name:   c语言中文网    百度  360搜索    谷歌  Bing搜索  CSDN  华为云
year         5     6     15    28       3    19   23
Rating    4.23  3.24   3.98  2.56     3.2   4.6  3.8

3.3 pandas descriptive statistics

Descriptive statistics is a branch of statistics that studies how to collect data reflecting objective phenomena, process and present the collected data in the form of tables and charts, and finally summarize the patterns and characteristics of the data. The Pandas library is a thorough, practical application of descriptive statistics; it is fair to say that without descriptive statistics as a theoretical foundation, Pandas as we know it would not exist. The following table briefly summarizes the commonly used statistical functions in Pandas:

function name Description
count()  Count the number of non-null values.
sum() Sum
mean() Find the mean
median() Find the median
mode() Find the mode (the most frequently occurring value)
std() Find the standard deviation
min() Find the minimum value
max() Find the maximum value
abs() Find the absolute value
prod() Find the product of all values.
cumsum() Calculate the cumulative sum; axis=0 (default) accumulates vertically down each column, axis=1 accumulates horizontally across each row.
cumprod() Calculate the cumulative product; same axis convention as cumsum().
corr() Calculate the correlation coefficient between series or variables, ranging from -1 to 1; the larger the absolute value, the stronger the correlation.
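
As a quick sketch of how a few of these functions are called (a toy table with made-up values; the same calls work on real data):

```python
import pandas as pd

df = pd.DataFrame({'year': [5, 6, 15], 'Rating': [4.0, 3.0, 2.0]})

count_year = df['year'].count()       # number of non-null values
median_year = df['year'].median()     # middle value of the sorted column
cum = df['year'].cumsum().tolist()    # running total down the column
corr = df['year'].corr(df['Rating'])  # correlation coefficient in [-1, 1]
print(count_year, median_year, cum, corr)
```

Each of these methods can be called on a single column (a Series), as here, or on the whole DataFrame to get one result per numeric column.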
In a DataFrame, aggregation methods take an axis parameter that determines the direction of the calculation. It can be passed in two equivalent forms:

  • axis=0 or "index" (the default): compute in the vertical direction, down each column;
  • axis=1 or "columns": compute in the horizontal direction, across each row.

If you want to apply custom functions or apply functions from other libraries to Pandas objects, there are three methods:

  • Function that operates the entire DataFrame: pipe()
  • Functions that operate on rows or columns: apply()
  • Function that operates on a single element: applymap()

3.3.1 Sum of all values in vertical and horizontal directions

import pandas as pd

data = {
    'name:': pd.Series(['c语言中文网', "百度", '360搜索', '谷歌', 'Bing搜索', 'CSDN', '华为云']),
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(data)

print(f'#原数据\n{df}')

# sum() 默认返回axis=0 (垂直方向)的所有值的和
print(f'#df.sum() 默认返回axis=0(垂直方向)的所有值的和\n{df.sum()}')

df2 = pd.DataFrame(data, columns=['year', 'Rating'])
# sum() 当axis=1 (水平方向)的所有值的和
print(f'#df2.sum(axis=1) 默认返回axis=1 (水平方向)的所有值的和\n{df2.sum(axis=1)}')

The running results are shown as follows:

#原数据
    name:  year  Rating
0  c语言中文网     5    4.23
1      百度     6    3.24
2   360搜索    15    3.98
3      谷歌    28    2.56
4  Bing搜索     3    3.20
5    CSDN    19    4.60
6     华为云    23    3.80

#df.sum() 默认返回axis=0(垂直方向)的所有值的和
name:     c语言中文网百度360搜索谷歌Bing搜索CSDN华为云
year                                99
Rating                           25.61
dtype: object

#df2.sum(axis=1) 默认返回axis=1 (水平方向)的所有值的和
0     9.23
1     9.24
2    18.98
3    30.56
4     6.20
5    23.60
6    26.80
dtype: float64

 Note: The sum() and cumsum() functions can handle both numeric and string data. Although aggregating strings is rarely useful, calling these two functions on string data does not raise an exception; the abs() and cumprod() functions, however, do raise exceptions because they cannot operate on string data.
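
To make the note concrete, cumsum() accumulates down each column by default (a minimal sketch with made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({'year': [5, 6, 15], 'Rating': [4.0, 3.0, 2.0]})

# Running totals down each column (axis=0, the default)
result = df.cumsum()
print(result)
```

Each cell holds the sum of itself and all cells above it, so the last row equals df.sum().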

3.3.2 mean() finds the average value

import pandas as pd

data = {
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(data)

print(f'#原数据\n{df}')
# mean() 求平均值
print(f'#mean() 平均值\n{df.mean()}')

The running results are shown as follows:

#原数据
   year  Rating
0     5    4.23
1     6    3.24
2    15    3.98
3    28    2.56
4     3    3.20
5    19    4.60
6    23    3.80
#mean() 平均值
year      14.142857
Rating     3.658571
dtype: float64

3.3.3 std() finds the standard deviation

Returns the standard deviation of a numeric column. The standard deviation is the arithmetic square root of the variance, which reflects the dispersion of a data set. Note that two sets of data with the same mean may not have the same standard deviation.

import pandas as pd

data = {
    'year': pd.Series([5, 6, 15, 28, 3, 19, 23]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(data)
print(f'#原数据\n{df}')

print(f'#df.std()求标准差\n{df.std()}')

The running results are shown as follows:

#原数据
   year  Rating
0     5    4.23
1     6    3.24
2    15    3.98
3    28    2.56
4     3    3.20
5    19    4.60
6    23    3.80
#df.std()求标准差
year      9.737018
Rating    0.698628
dtype: float64

3.3.4 Custom function: operate the entire data table pipe()

All elements of a DataFrame can be manipulated by passing a custom function, together with the appropriate number of arguments, to the pipe() method. The following example adds 3 to every element value in the data table.

The first positional parameter of the function passed to pipe() must be the target Series or DataFrame; any other related parameters can be passed in as regular positional or keyword arguments.

import pandas as pd
import numpy as np

# 自定义函数
def adder(ele1, ele2):
    return ele1+ele2

# 操作DataFrame
df = pd.DataFrame(np.random.randn(4, 3), columns=['c1', 'c2', 'c3'])
# 相加前
print(f'#原数据\n{df}')
# 相加后
print(f'#df.pipe()相加后的数据\n{df.pipe(adder, 3)}')

The running results are shown as follows:

#原数据
         c1        c2        c3
0 -0.374634  0.290875  0.021671
1  0.757403  0.218652  0.160206
2 -0.177390 -0.891544 -1.550597
3 -0.118167 -0.921873  0.890214
#df.pipe()相加后的数据
         c1        c2        c3
0  2.625366  3.290875  3.021671
1  3.757403  3.218652  3.160206
2  2.822610  2.108456  1.449403
3  2.881833  2.078127  3.890214
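
Because pipe() returns whatever the passed function returns, calls can be chained, which keeps multi-step transformations readable. A sketch with hypothetical helper functions add_n and times_n:

```python
import pandas as pd

def add_n(df, n):
    # First positional parameter receives the DataFrame from pipe()
    return df + n

def times_n(df, n):
    return df * n

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Equivalent to times_n(add_n(df, 3), n=2), but reads left to right
result = df.pipe(add_n, 3).pipe(times_n, n=2)
print(result)
```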

3.3.5 Custom function: operate row or column apply()

If you want to operate a certain row or column of the DataFrame, you can use the apply() method. This method is similar to the descriptive statistics method and has the optional parameter axis.

import pandas as pd
import numpy as np

# 自定义函数
def adder(df, data):
    data_list =[]
    columns = df.index.values
    for i in columns:
        value = df[i]
        data_list.append(value+data)
    return np.sum(data_list, axis=0)

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
print(f'#原始数据\n{df}')
# axis=0默认按列操作,计算每一列均值
print(f'#df.apply(函数)计算每一列均值\n{df.apply(np.mean)}')

df = pd.DataFrame(np.random.randn(5,3),columns=['col1', 'col2', 'col3'])
print(f'#原始数据\n{df}')

# axis=1操作行,对指定行执行自定义函数
df['col4'] = df.apply(adder, args=(3,), axis=1)
print(f'#调用自定义函数\n{df}')

The running results are shown as follows:

#原始数据
       col1      col2      col3
0  1.407879 -1.057357 -0.847865
1  0.389119 -1.620390 -1.269465
2 -0.740838 -0.699992  0.429402
3 -1.431036  1.091103 -0.757014
4  1.264738 -0.162598  0.253011
#df.apply(函数)计算每一列均值
col1    0.177973
col2   -0.489847
col3   -0.438386
dtype: float64
#原始数据
       col1      col2      col3
0  1.056548 -0.064314  1.306463
1  0.485457 -0.067215 -1.634539
2  0.120638 -1.214249  0.135860
3 -1.293730  0.477338 -0.925762
4  0.053357 -1.766716  0.050723
#调用自定义函数
       col1      col2      col3       col4
0  1.056548 -0.064314  1.306463  11.298697
1  0.485457 -0.067215 -1.634539   7.783704
2  0.120638 -1.214249  0.135860   8.042250
3 -1.293730  0.477338 -0.925762   7.257846
4  0.053357 -1.766716  0.050723   7.337365
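
apply() also accepts a lambda, and the same axis convention applies (a minimal sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# axis=0 (default): the lambda receives one column (a Series) at a time
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives one row (a Series) at a time
row_sum = df.apply(lambda row: row.sum(), axis=1)

print(col_range)
print(row_sum)
```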

3.3.6 Custom function: operate single element applymap()

The applymap() function of DataFrame processes each value in the DataFrame and returns a new DataFrame. (In pandas 2.1 and later the same method is also available under the name DataFrame.map().)

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [10, 20, 30],
    'c': [5, 10, 15]
})
print(f'#原始数据\n{df}')

def add_one(x, data):
    print(f'x的值 = {x}')
    print(f'data的值={data}')
    return x + 1

df1 = df.applymap(add_one, data=3)
print(f'#applymap()对每个元素操作后\n{df1}')

The running results are shown as follows:

#原始数据
   a   b   c
0  1  10   5
1  2  20  10
2  3  30  15
x的值 = 1
data的值=3
x的值 = 2
data的值=3
x的值 = 3
data的值=3
x的值 = 10
data的值=3
x的值 = 20
data的值=3
x的值 = 30
data的值=3
x的值 = 5
data的值=3
x的值 = 10
data的值=3
x的值 = 15
data的值=3
#applymap()对每个元素操作后
   a   b   c
0  2  11   6
1  3  21  11
2  4  31  16
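
A lambda works with applymap() as well. The sketch below uses map() when the installed pandas provides it (the newer name for this method) and falls back to applymap() on older versions:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# DataFrame.map() exists from pandas 2.1; older versions only have applymap()
mapper = getattr(df, 'map', df.applymap)
result = mapper(lambda x: x * 10)
print(result)
```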

3.4 pandas iteration traversal

If we want to iterate over each row of a DataFrame, we can use the following functions:

  • iterrows(): Traverse rows in the form of (row_index, row);
  • itertuples(): Iterates over rows using named tuples.

3.4.1 Traverse rows iterrows() in the form of (row_index, row):

This method traverses the DataFrame row by row, returning an iterator that yields (row index label, row data) pairs, where the row data is a Series.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
print(f'#原始数据\n{df}')

# iterrows():逐行遍历,以行索引标签为键,以对应行的数据(Series)为值
for key, row in df.iterrows():
    print(f'#key以行索引标签为键:{key}')
    print(f'#row以对应行的元素为值\n{row}')

The running results are shown as follows:

#原始数据
       col1      col2      col3
0 -0.968361 -0.980524  0.645811
1 -1.742061 -0.034852  1.625160
2 -0.152453 -0.186645  0.330469
3  0.837739  0.687838 -0.991223
#key以行索引标签为键:0
#row以对应行的元素为值
col1   -0.968361
col2   -0.980524
col3    0.645811
Name: 0, dtype: float64
#key以行索引标签为键:1
#row以对应行的元素为值
col1   -1.742061
col2   -0.034852
col3    1.625160
Name: 1, dtype: float64
#key以行索引标签为键:2
#row以对应行的元素为值
col1   -0.152453
col2   -0.186645
col3    0.330469
Name: 2, dtype: float64
#key以行索引标签为键:3
#row以对应行的元素为值
col1    0.837739
col2    0.687838
col3   -0.991223
Name: 3, dtype: float64

3.4.2 Traverse rows using named tuples itertuples()

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
print(f'#原始数据\n{df}')

for row in df.itertuples():
    print(f'#每一行生成一个元组\n{row}')

The running results are shown as follows:

#原始数据
       col1      col2      col3
0 -1.050943  1.098056 -0.858725
1 -0.348473  0.604341  0.249866
2  0.709212 -0.807796 -1.241162
3 -2.333712 -0.830910 -0.952756

#每一行生成一个元组
Pandas(Index=0, col1=-1.0509429373784085, col2=1.098055755892262, col3=-0.8587250615671127)
#每一行生成一个元组
Pandas(Index=1, col1=-0.34847318195598975, col2=0.604340877173634, col3=0.24986633604748865)
#每一行生成一个元组
Pandas(Index=2, col1=0.7092120669600998, col2=-0.8077962199969602, col3=-1.241162396630433)
#每一行生成一个元组
Pandas(Index=3, col1=-2.3337119180323316, col2=-0.8309096657807309, col3=-0.9527559438251861)
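
itertuples() takes index and name parameters that control the shape of the tuples (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# index=False drops the Index field; name=None yields plain tuples
rows = list(df.itertuples(index=False, name=None))
print(rows)

# Fields of the default named tuple can be accessed by attribute
first = next(df.itertuples())
print(first.col1, first.col2)
```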

3.5 pandas sorting sorting

By default, sort_index() sorts all rows by their row labels or all columns by their column labels; to sort rows by the values of one or more specified columns, use sort_values() instead.

Function prototype:

sort_index(axis=0, level=None, ascending=True, 
           inplace=False, kind='quicksort', 
           na_position='last', sort_remaining=True, by=None)

Parameter Description:

axis: 0 sorts by row labels (the index); 1 sorts by column labels. 
level: default None; for a MultiIndex, sort on the given level(s). 
ascending: default True for ascending order; False for descending order. 
inplace: default False; if True, the sorted data replaces the original DataFrame directly. 
kind: sorting algorithm, one of {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'; it rarely needs to be changed. 
na_position: where missing values are placed, {'first', 'last'}, default 'last'. 
by: sort by the values of a column or several columns; this parameter is deprecated, use sort_values() instead.
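
Since the by parameter of sort_index() is deprecated, sorting rows by column values is done with sort_values() instead (a minimal sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'b': [3, 1, 2, 2], 'a': [1, 4, 2, 3]})

# Sort by column 'b' ascending, breaking ties with 'a' descending
result = df.sort_values(by=['b', 'a'], ascending=[True, False])
print(result)
```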

3.5.1 axis=0, ascending=True: sort by row label in ascending order (default)

import pandas as pd

df = pd.DataFrame({'b': [1, 2, 2, 3], 'a': [4, 3, 2, 1], 'c': [1, 3, 8, 2]}, index=[2, 0, 1, 3])
print(f'#原始数据\n{df}')

print(f'#默认按“行标签”升序排序,或df.sort_index(axis=0, ascending=True)\n{df.sort_index()}')

The running results are shown as follows:

#原始数据
   b  a  c
2  1  4  1
0  2  3  3
1  2  2  8
3  3  1  2
#默认按“行标签”升序排序,或df.sort_index(axis=0, ascending=True)
   b  a  c
0  2  3  3
1  2  2  8
2  1  4  1
3  3  1  2

3.5.2 axis=1 Sort by "column label" in ascending order

import pandas as pd

df = pd.DataFrame({'b': [1, 2, 2, 3], 'a': [4, 3, 2, 1], 'c': [1, 3, 8, 2]}, index=[2, 0, 1, 3])
print(f'#原始数据\n{df}')

print(f'#按“列标签”升序排序,或df.sort_index(axis=1, ascending=True)\n{df.sort_index(axis=1)}')

The running results are shown as follows:

#原始数据
   b  a  c
2  1  4  1
0  2  3  3
1  2  2  8
3  3  1  2
#按“列标签”升序排序,或df.sort_index(axis=1, ascending=True)
   a  b  c
2  4  1  1
0  3  2  3
1  2  2  8
3  1  3  2
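
Setting ascending=False sorts the labels in descending order (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2, 2, 3]}, index=[2, 0, 1, 3])

# Row labels in descending order
result = df.sort_index(ascending=False)
print(result)
```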

3.6 pandas deduplication function

Function prototype:

df.drop_duplicates(subset=['A','B','C'],keep='first',inplace=True)

Parameter Description:

subset: column label or sequence of labels to consider when identifying duplicates; default None, meaning all columns are used. 
 keep: one of 'first', 'last', or False, default 'first'. 'first' keeps only the first occurrence of each set of duplicates and drops the rest; 'last' keeps only the last occurrence; False drops all duplicates. 
 inplace: boolean, default False, meaning a copy with duplicates removed is returned; if True, duplicates are dropped directly on the original data.

3.6.1 Keep first occurrence of row duplicates

import pandas as pd
data = {
    'A': [1, 0, 1, 1],
    'B': [0, 2, 5, 0],
    'C': [4, 0, 4, 4],
    'D': [1, 0, 1, 1]
}
df = pd.DataFrame(data)
print(f'#原始数据\n{df}')

# 默认是keep=first 保留第一次出现的重复项  inplace=False 删除后返回一个副本
df_drop = df.drop_duplicates()
print(f'#去重后的数据\n{df_drop}')

# 也可以使用以下参数
df_drop = df.drop_duplicates(keep='first', inplace=False)
print(f'#去重后的数据2\n{df_drop}')

The running results are shown as follows:

#原始数据
   A  B  C  D
0  1  0  4  1
1  0  2  0  0
2  1  5  4  1
3  1  0  4  1
#去重后的数据
   A  B  C  D
0  1  0  4  1
1  0  2  0  0
2  1  5  4  1
#去重后的数据2
   A  B  C  D
0  1  0  4  1
1  0  2  0  0
2  1  5  4  1

3.6.2 keep=False removes all row duplicates

import pandas as pd

data = {
    'A': [1, 0, 1, 1],
    'B': [0, 2, 5, 0],
    'C': [4, 0, 4, 4],
    'D': [1, 0, 1, 1]
}
df = pd.DataFrame(data)
print(f'#原始数据\n{df}')

# keep=False 删除所有重复项(行)  inplace=True 在原始的数据进行删除重复项(行)
df.drop_duplicates(keep=False, inplace=True)
print(f'#去重后的数据\n{df}')

The running results are shown as follows:

#原始数据
   A  B  C  D
0  1  0  4  1
1  0  2  0  0
2  1  5  4  1
3  1  0  4  1
#去重后的数据
   A  B  C  D
1  0  2  0  0
2  1  5  4  1

3.6.3 subset: remove duplicates based on a single specified column

import pandas as pd

data = {
    'A': [1, 0, 1, 1],
    'B': [0, 2, 5, 0],
    'C': [4, 0, 4, 4],
    'D': [1, 0, 1, 1]
}
df = pd.DataFrame(data)
print(f'#原始数据\n{df}')

# subset:表示要进去重的列名,默认为 None。
# 去除所有重复项,对于B列来说两个0是重复项
df_drop = df.drop_duplicates(subset=['B'], inplace=False, keep=False)

# 简写,省去subset参数
# df.drop_duplicates(['B'],keep=False)

print(f'#删除指定的列\n{df_drop}')

# reset_index() 函数会直接使用重置后的索引,索引从0开始
df_reset = df_drop.reset_index(drop=True)

print(f'重新设置行索引后的数据\n{df_reset}')

The running results are shown as follows:

#原始数据
   A  B  C  D
0  1  0  4  1
1  0  2  0  0
2  1  5  4  1
3  1  0  4  1
#删除指定的列
   A  B  C  D
1  0  2  0  0
2  1  5  4  1
重新设置行索引后的数据
   A  B  C  D
0  0  2  0  0
1  1  5  4  1

After dropping duplicates, the remaining rows keep their original labels rather than being renumbered from 0. The reset_index() function provided by Pandas rebuilds a fresh index starting from 0.

3.6.4 subset specifies multiple columns to remove duplicates at the same time

import pandas as pd

df = pd.DataFrame({'C_ID': [1, 1, 2, 12, 34, 23, 45, 34, 23, 12, 2, 3, 4, 1],
                    'Age': [12, 12, 15, 18, 12, 25, 21, 25, 25, 18, 25,12,32,18],
                   'G_ID': ['a', 'a', 'c', 'a', 'b', 's', 'd', 'a', 'b', 's', 'a', 'd', 'a', 'a']})

print(f'#原始数据\n{df}')

# last只保留最后一个重复项  去除重复项后并不更改行索引
df_drop = df.drop_duplicates(['Age', 'G_ID'], keep='last')
print(f'#去除指定多列的数据\n{df_drop}')

The running results are shown as follows:

#原始数据
    C_ID  Age G_ID
0      1   12    a
1      1   12    a
2      2   15    c
3     12   18    a
4     34   12    b
5     23   25    s
6     45   21    d
7     34   25    a
8     23   25    b
9     12   18    s
10     2   25    a
11     3   12    d
12     4   32    a
13     1   18    a
#去除指定多列的数据
    C_ID  Age G_ID
1      1   12    a
2      2   15    c
4     34   12    b
5     23   25    s
6     45   21    d
8     23   25    b
9     12   18    s
10     2   25    a
11     3   12    d
12     4   32    a
13     1   18    a

3.7 Pandas missing value processing

3.7.1 Checking for missing values

To make detecting missing values easier, Pandas provides two functions, isnull() and notnull(), which work on both Series and DataFrame objects.

isnull() returns True if it is judged to be a missing value, otherwise it returns False

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=list("ace"), columns=['one', 'two', 'three'])
print(f'原始数据\n{df}')

# 通过使用reindex(重构索引),创建了一个存在缺少值的 DataFrame对象
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(f'#使用 reindex(重构索引)后的数据\n{df}')

# isnull() 检查是否是缺失值,若是则返回True 反之返回False
print(f'#isnull()判断第one列的每个元素是否是缺失值\n{df["one"].isnull()}')

The running results are shown as follows:

原始数据
        one       two     three
a -0.946582  0.054540  0.586515
c  1.756336  0.082180  0.174922
e -2.136985  0.247677 -1.501012
#使用 reindex(重构索引)后的数据
        one       two     three
a -0.946582  0.054540  0.586515
b       NaN       NaN       NaN
c  1.756336  0.082180  0.174922
d       NaN       NaN       NaN
e -2.136985  0.247677 -1.501012
f       NaN       NaN       NaN
#isnull()判断第one列的每个元素是否是缺失值
a    False
b     True
c    False
d     True
e    False
f     True
Name: one, dtype: bool
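
Because True counts as 1 in arithmetic, chaining isnull() with sum() is a common idiom for counting missing values per column (a minimal sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, np.nan, 6.0]})

# Missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Total missing values in the whole DataFrame
total_missing = df.isnull().sum().sum()
print(total_missing)
```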

notnull() returns True if it is not a missing value, otherwise it returns False

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=list("ace"), columns=['one', 'two', 'three'])
print(f'原始数据\n{df}')

# 通过使用 reindex(重构索引),创建了一个存在缺少值的 DataFrame对象
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(f'#使用 reindex(重构索引)后的数据\n{df}')

# notnull() 检查是否不是缺失值,若不是则返回True 反之返回False
print(f'判断是第one列的每个元素是否不是缺失值\n{df["one"].notnull()}')

The running results are shown as follows:

原始数据
        one       two     three
a -0.998457  1.810817  0.348848
c  1.831015  0.319635  0.903095
e -0.572937  1.237014 -0.093289
#使用 reindex(重构索引)后的数据
        one       two     three
a -0.998457  1.810817  0.348848
b       NaN       NaN       NaN
c  1.831015  0.319635  0.903095
d       NaN       NaN       NaN
e -0.572937  1.237014 -0.093289
f       NaN       NaN       NaN
判断是第one列的每个元素是否不是缺失值
a     True
b    False
c     True
d    False
e     True
f    False
Name: one, dtype: bool

3.7.2 Missing data calculation

When calculating with missing data, note two points: first, when summing data, NA values are treated as 0 (they are skipped); second, element-wise arithmetic involving an NA value produces NA.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=list("ace"), columns=['one', 'two', 'three'])
print(f'#原始数据\n{df}')

# 通过使用 reindex(重构索引),创建了一个存在缺少值的 DataFrame对象
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(f'#使用 reindex(重构索引)后的数据\n{df}')

# 计算缺失数据时,需要注意两点:首先数据求和时,将 NA 值视为 0 ,其次,如果要计算的数据为 NA,那么结果就是 NA
print(df['one'].sum())

The running results are shown as follows:

#原始数据
        one       two     three
a  0.274570 -0.007715 -0.138648
c  0.428160 -0.878011  0.165583
e -0.338313  0.643098 -0.715703
#使用 reindex(重构索引)后的数据
        one       two     three
a  0.274570 -0.007715 -0.138648
b       NaN       NaN       NaN
c  0.428160 -0.878011  0.165583
d       NaN       NaN       NaN
e -0.338313  0.643098 -0.715703
f       NaN       NaN       NaN
0.3644171755923789
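
The default NA handling can be turned off with the skipna parameter; with skipna=False, any NA in the data makes the result NA (a minimal sketch):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

# Default: NaN is skipped
print(s.sum())

# skipna=False: any NaN makes the whole result NaN
print(s.sum(skipna=False))
```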

3.7.3 Clean and fill missing values

fillna(): replace NaN with a scalar value

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=list("ace"), columns=['one', 'two', 'three'])
print(f'#原始数据\n{df}')

# 通过使用 reindex(重构索引),创建了一个存在缺少值的 DataFrame对象
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(f'#使用 reindex(重构索引)后的数据\n{df}')

# 用fillna(6)标量替换NaN
print(f'用fillna(6)标量替换NaN后的数据\n{df.fillna(6)}')

The running results are shown as follows:

#原始数据
        one       two     three
a  0.577051  1.152249  0.614189
c -1.957000  1.306602 -0.463318
e  0.103491  0.280445 -2.530827

#使用 reindex(重构索引)后的数据
        one       two     three
a  0.577051  1.152249  0.614189
b       NaN       NaN       NaN
c -1.957000  1.306602 -0.463318
d       NaN       NaN       NaN
e  0.103491  0.280445 -2.530827
f       NaN       NaN       NaN

用fillna(6)标量替换NaN后的数据
        one       two     three
a  0.577051  1.152249  0.614189
b  6.000000  6.000000  6.000000
c -1.957000  1.306602 -0.463318
d  6.000000  6.000000  6.000000
e  0.103491  0.280445 -2.530827
f  6.000000  6.000000  6.000000
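
fillna() also accepts a dict mapping column names to fill values, or a computed value such as each column's mean (a minimal sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 5.0, 7.0]})

# Different fill value per column
filled = df.fillna({'one': 0, 'two': -1})
print(filled)

# Fill each column with its own mean (NaN is skipped when computing the mean)
filled_mean = df.fillna(df.mean())
print(filled_mean)
```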

ffill() fills NA values forward; bfill() fills them backward

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=list("ace"), columns=['one', 'two', 'three'])
print(f'#原始数据\n{df}')

# 通过使用 reindex(重构索引),创建了一个存在缺少值的 DataFrame对象
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(f'#使用 reindex(重构索引)后的数据\n{df}')
print(f"#.fillna(method='ffill')向前填充后的数据\n{df.fillna(method='ffill')}")
print(f"#df.bfill()向后填充后的数据\n{df.bfill()}")

The running results are shown as follows:

#原始数据
        one       two     three
a -0.480378  0.730596 -1.192572
c  0.651002  1.834280  1.179207
e  0.146290 -0.618078  2.782963
#使用 reindex(重构索引)后的数据
        one       two     three
a -0.480378  0.730596 -1.192572
b       NaN       NaN       NaN
c  0.651002  1.834280  1.179207
d       NaN       NaN       NaN
e  0.146290 -0.618078  2.782963
f       NaN       NaN       NaN
#.fillna(method='ffill')向前填充后的数据
        one       two     three
a -0.480378  0.730596 -1.192572
b -0.480378  0.730596 -1.192572
c  0.651002  1.834280  1.179207
d  0.651002  1.834280  1.179207
e  0.146290 -0.618078  2.782963
f  0.146290 -0.618078  2.782963
#df.bfill()向后填充后的数据
        one       two     three
a -0.480378  0.730596 -1.192572
b  0.651002  1.834280  1.179207
c  0.651002  1.834280  1.179207
d  0.146290 -0.618078  2.782963
e  0.146290 -0.618078  2.782963
f       NaN       NaN       NaN

3.7.4 Use replace to replace common values

In some cases, you need to use replace() to substitute ordinary (non-NaN) values in a DataFrame with specific new values. This is similar to using the fillna() function to replace NaN values.

import pandas as pd

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 10], 'two': [99, 0, 30, 40, 50, 60]})
print(f'#原始数据\n{df}')

df = df.replace({10: 100, 30: 333, 99: 9})
print(f'#replace替换后的数据\n{df}')

The running results are shown as follows:

#原始数据
   one  two
0   10   99
1   20    0
2   30   30
3   40   40
4   50   50
5   10   60
#replace替换后的数据
   one  two
0  100    9
1   20    0
2  333  333
3   40   40
4   50   50
5  100   60
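
replace() can also be restricted to particular columns by passing a nested dict of the form {column: {old: new}} (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({'one': [10, 20, 30], 'two': [10, 0, 30]})

# Replace 10 with 100 only in column 'one'; column 'two' is untouched
result = df.replace({'one': {10: 100}})
print(result)
```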

3.7.5 Delete missing values

If you want to delete missing values, you can use the dropna() function with the parameter axis. By default, rows are processed according to axis=0, which means that if there is a NaN value in a row, the entire row of data will be deleted.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=list("ace"), columns=['one', 'two', 'three'])
print(f'#原始数据\n{df}')

# 通过使用 reindex(重构索引),创建了一个存在缺少值的 DataFrame对象
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
print(f'#使用 reindex(重构索引)后的数据\n{df}')

# dropna() axis=0如果某一行中存在 NaN 值将会删除整行数据
print(f'#dropna()删除后的数据\n{df.dropna()}')

The running results are shown as follows:

#原始数据
        one       two     three
a -0.822900  0.025019  0.934275
c  0.215935 -0.634852 -1.236928
e -0.044390  0.464661  0.367780
#使用 reindex(重构索引)后的数据
        one       two     three
a -0.822900  0.025019  0.934275
b       NaN       NaN       NaN
c  0.215935 -0.634852 -1.236928
d       NaN       NaN       NaN
e -0.044390  0.464661  0.367780
f       NaN       NaN       NaN
#dropna()删除后的数据
        one       two     three
a -0.822900  0.025019  0.934275
c  0.215935 -0.634852 -1.236928
e -0.044390  0.464661  0.367780
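
dropna() also supports the how and thresh parameters to relax the default row-dropping behavior (a minimal sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [1.0, np.nan, np.nan],
                   'two': [4.0, 5.0, np.nan],
                   'three': [7.0, 8.0, np.nan]})

# how='all': drop only rows where every value is NaN
print(df.dropna(how='all'))

# thresh=2: keep rows that have at least 2 non-NaN values
kept = df.dropna(thresh=2)
print(kept)
```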

3.8 pandas csv operation

The first step in data processing with pandas is reading data. Data can come from many sources, and CSV files are one of them. pandas provides very strong support for reading CSV files, with dozens of parameters; some of them are easily overlooked, yet they are very useful in real work.

3.8.1 read_csv() reads files

Function prototype:

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, 
                header='infer',names=None, index_col=None, 
                usecols=None)

index_col custom index: specify one or more columns of the CSV file as the row index via the index_col parameter.

The file content is as follows (test.csv):

ID,Name,Age,City,Salary
1,Jack,28,Beijing,22000
2,Lida,32,Shanghai,19000
3,John,43,Shenzhen,12000
4,Helen,38,Hengshui,3500

import pandas as pd

# 读取csv文件数据 sep :指定分隔符。如果不指定参数,则会尝试使用逗号分隔
df = pd.read_csv('test.csv', sep=',')
print(f'#读取csv文件数据\n{df}')

# 使用index_col可以实现自定义索引
df = pd.read_csv('test.csv', index_col=['ID'])
print(f'使用index_col可以实现自定义索引\n{df}')

print(f'获取自定义的索引={df.index}')

The running results are shown as follows:

#读取csv文件数据
   ID   Name  Age      City  Salary
0   1   Jack   28   Beijing   22000
1   2   Lida   32  Shanghai   19000
2   3   John   43  Shenzhen   12000
3   4  Helen   38  Hengshui    3500
使用index_col可以实现自定义索引
     Name  Age      City  Salary
ID                              
1    Jack   28   Beijing   22000
2    Lida   32  Shanghai   19000
3    John   43  Shenzhen   12000
4   Helen   38  Hengshui    3500
获取自定义的索引=Index([1, 2, 3, 4], dtype='int64', name='ID')
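
read_csv() can also limit which columns are loaded via usecols. The sketch below reads the same sample data from an in-memory buffer (io.StringIO) instead of a file, so it runs without test.csv:

```python
import io
import pandas as pd

csv_text = """ID,Name,Age,City,Salary
1,Jack,28,Beijing,22000
2,Lida,32,Shanghai,19000
"""

# usecols keeps only the listed columns; index_col may name one of them
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=['ID', 'Name', 'Salary'],
                 index_col='ID')
print(df)
```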

3.8.2 names change file header name

Use the names parameter to specify the name of the header file

  • When names is not given, header defaults to 0, i.e. the first row of the data file is used as the column names.
  • When names is given and header is not, header becomes None and every row is treated as data. If both are given, the two parameters combine: header selects the row to discard and names supplies the new column labels.

import pandas as pd

df = pd.read_csv('test.csv', sep=',')
print(f'#读取csv文件数据\n{df}')

# names更改文件标头名 header 没有赋值
df = pd.read_csv('test.csv', names=['a', 'b', 'c', 'd', 'e'])
print(f'#names 更改表头名\n{df}')

The running results are shown as follows:

#读取csv文件数据
   ID   Name  Age      City  Salary
0   1   Jack   28   Beijing   22000
1   2   Lida   32  Shanghai   19000
2   3   John   43  Shenzhen   12000
3   4  Helen   38  Hengshui    3500
#names 更改表头名
    a      b    c         d       e
0  ID   Name  Age      City  Salary
1   1   Jack   28   Beijing   22000
2   2   Lida   32  Shanghai   19000
3   3   John   43  Shenzhen   12000
4   4  Helen   38  Hengshui    3500

Note: names adds new, custom column names; the original header row (the old column labels) is not deleted and becomes a data row. Use the header parameter to drop it.

import pandas as pd

# names更改文件标头名 header为变成0,即选取文件的第一行作为表头
df = pd.read_csv("test.csv", names=['a', 'b', 'c', 'd', 'e'],header=0)
print(f'#names 更改表头名且header=0\n{df}')

df = pd.read_csv('test.csv',header=1)
# 不指定names,指定header为1,则选取第二行当做表头,第二行下面的是数据

print(f'#不指定names,指定header=1则选取第二行当做表头\n{df}')

The running results are shown as follows:

#names 更改表头名且header=0
   a      b   c         d      e
0  1   Jack  28   Beijing  22000
1  2   Lida  32  Shanghai  19000
2  3   John  43  Shenzhen  12000
3  4  Helen  38  Hengshui   3500
#不指定names,指定header=1则选取第二行当做表头
   1   Jack  28   Beijing  22000
0  2   Lida  32  Shanghai  19000
1  3   John  43  Shenzhen  12000
2  4  Helen  38  Hengshui   3500

3.8.3 skiprows skips the specified number of rows

The skiprows parameter skips the specified number of lines at the start of the file; the header line counts toward this number, so after skipping, the first remaining line is used as the header.

import pandas as pd

df = pd.read_csv('test.csv', names=['a', 'b', 'c', 'd', 'e'], header=0)
print(f'#names 更改表头名且header=0\n{df}')

# skiprows指定跳过行数
df = pd.read_csv('test.csv', skiprows=2)
print(f'#skiprows指定跳过行数\n{df}')

The running results are shown as follows:

#names 更改表头名且header=0
   a      b   c         d      e
0  1   Jack  28   Beijing  22000
1  2   Lida  32  Shanghai  19000
2  3   John  43  Shenzhen  12000
3  4  Helen  38  Hengshui   3500
#skiprows指定跳过行数
   2   Lida  32  Shanghai  19000
0  3   John  43  Shenzhen  12000
1  4  Helen  38  Hengshui   3500
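
skiprows also accepts a list of specific line numbers to skip (0-based, counting the header line as line 0). Sketched here with an in-memory buffer holding the same sample data, so it runs without test.csv:

```python
import io
import pandas as pd

csv_text = """ID,Name,Age,City,Salary
1,Jack,28,Beijing,22000
2,Lida,32,Shanghai,19000
3,John,43,Shenzhen,12000
"""

# Skip only physical line 2 (the Lida row); the header line is line 0
df = pd.read_csv(io.StringIO(csv_text), skiprows=[2])
print(df)
```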

3.8.4 to_csv() converts data

The to_csv() function provided by Pandas is used to convert DataFrame to CSV data. If you want to write CSV data to a file, just pass a file object to the function. Otherwise, CSV data will be returned in string format.

import pandas as pd

data = {'Name': ['Smith', 'Parker'], 'ID': [101, 102], 'Language': ['Python', 'JavaScript']}
df_data = pd.DataFrame(data)
print(f'#DataFrame原始数据\n{df_data}')

# 通过to_csv()转成csv文件数据
df_csv = df_data.to_csv()
print(f'#通过to_csv()转成csv文件数据后的数据\n{df_csv}')

# 指定 CSV 文件输出时的分隔符,并将其保存在 person.csv 文件中;index=False 表示不写入索引
df_data.to_csv("person.csv", sep='|', index=False)

The running results are shown as follows:

#DataFrame原始数据
     Name   ID    Language
0   Smith  101      Python
1  Parker  102  JavaScript
#通过to_csv()转成csv文件数据后的数据
,Name,ID,Language
0,Smith,101,Python
1,Parker,102,JavaScript

Stored person.csv file:

Name|ID|Language
Smith|101|Python
Parker|102|JavaScript

3.9 pandas operating Excel

3.9.1 to_excel() data conversion

The data in the Dataframe can be written to an Excel file through the to_excel() function.
If you want to write a single object to an Excel file, you only need to specify the target file name; if you want to write to multiple worksheets, create an ExcelWriter object with the target file name and specify each worksheet's name in turn via the sheet_name parameter.

Function prototype:

DataFrame.to_excel(excel_writer, sheet_name='Sheet1', 
                   na_rep='', float_format=None, 
                   columns=None, header=True, 
                   index=True, index_label=None, 
                   startrow=0, startcol=0, engine=None, 
                   merge_cells=True, encoding=None, 
                   inf_rep='inf', verbose=True, freeze_panes=None) 

Description of common parameters:

parameter name Description
excel_writer File path or ExcelWriter object.
sheet_name Specify the name of the worksheet to which data will be written.
na_rep Representation of missing values.
float_format Optional parameter used to format floating point numbers as strings.
columns The columns to be written.
header Write out the column names; if a list of strings is given, it is treated as aliases for the column names.
index Whether to write the row index.
index_label Column label(s) for the index column(s). If not specified and both header and index are True, the index name is used. If the DataFrame uses a MultiIndex, a sequence should be given.
startrow Starting row position, default 0; the upper-left row cell at which the DataFrame is placed.
startcol Starting column position, default 0; the upper-left column cell at which the DataFrame is placed.
engine Optional parameter specifying the engine to use, such as openpyxl or xlsxwriter.

Create a table and write data

import pandas as pd

# 创建DataFrame数据
info_website = pd.DataFrame({'name': ['博客中国', 'c语言中文网', 'CSDN', '92python'],
     'rank': [1, 2, 3, 4],
     'language': ['PHP', 'C', 'PHP', 'Python']})
print(f'#DataFrame数据\n{info_website}')

# 创建ExcelWrite对象
to_excle_file_path = 'test_excel.xlsx'
writer = pd.ExcelWriter(to_excle_file_path)
info_website.to_excel(writer)
writer.close()

 The running results are shown as follows:

#DataFrame数据
       name  rank language
0      博客中国     1      PHP
1    c语言中文网     2        C
2      CSDN     3      PHP
3  92python     4   Python

The content of test_excel.xlsx is as follows:

name rank language
0 博客中国 1 PHP
1 c语言中文网 2 C
2 CSDN 3 PHP
3 92python 4 Python

pd.ExcelWriter creates a writer object through which data is written to the Excel file, but writer.close() must be called after writing; otherwise the data remains in the buffer and is never saved to the file.

3.9.2 Insert multiple sheet data at one time

Note: This operation will overwrite the original file content

import pandas as pd

to_excel_file_path = 'test_excel2.xlsx'

# Create DataFrame data from a dict of lists
info_website = pd.DataFrame({'name': ['博客中国', 'c语言中文网', 'CSDN', '92python'],
     'rank': [1, 2, 3, 4],
     'language': ['PHP', 'C', 'PHP', 'Python']})
print(f'#DataFrame data\n{info_website}')

# Create DataFrame data from a list of dicts
data = [{'a': 1, 'b': 2, 'c': 3},
        {'a': 5, 'b': 10, 'c': 20},
        {'a': "王者", 'b': '黄金', 'c': '白银'}]
df = pd.DataFrame(data)
print(f'#DataFrame data\n{df}')

writer = pd.ExcelWriter(to_excel_file_path)
df.to_excel(writer)

info_website.to_excel(writer, sheet_name="first sheet", index=False)
info_website.to_excel(writer, sheet_name="second sheet", index=False)
writer.close()

The running results are shown as follows:

#DataFrame data
       name  rank language
0      博客中国     1      PHP
1    c语言中文网     2        C
2      CSDN     3      PHP
3  92python     4   Python
#DataFrame data
    a   b   c
0   1   2   3
1   5  10  20
2  王者  黄金  白银

The saved file contains three sheets: the default sheet holding df, followed by the two named sheets written from info_website.

3.9.3 Additional sheet table contents

import pandas as pd

to_excel_file_path = 'test_excel2.xlsx'

# Create DataFrame data from a dict of lists
info_website = pd.DataFrame({'name': ['博客中国', 'c语言中文网', 'CSDN', '92python'],
     'rank': [1, 2, 3, 4],
     'language': ['PHP', 'C', 'PHP', 'Python']})
print(f'#DataFrame data\n{info_website}')

# Create DataFrame data from a list of dicts
data = [{'a': 1, 'b': 2, 'c': 3},
        {'a': 5, 'b': 10, 'c': 20},
        {'a': "王者", 'b': '黄金', 'c': '白银'}]
df = pd.DataFrame(data)
print(f'#DataFrame data\n{df}')

# mode='a' appends new sheets to the existing workbook
writer = pd.ExcelWriter(to_excel_file_path, mode='a', engine='openpyxl')
df.to_excel(writer, sheet_name="appended sheet 1", index=False)

info_website.to_excel(writer, sheet_name="appended sheet 2", index=False)
info_website.to_excel(writer, sheet_name="appended sheet 3", index=False)
writer.close()

The running results are shown as follows:

#DataFrame data
       name  rank language
0      博客中国     1      PHP
1    c语言中文网     2        C
2      CSDN     3      PHP
3  92python     4   Python
#DataFrame data
    a   b   c
0   1   2   3
1   5  10  20
2  王者  黄金  白银

After appending, the workbook contains the sheets written in the previous example plus the three newly appended sheets.
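Note that in append mode, writing a sheet whose name already exists raises a ValueError by default. Newer pandas versions (1.3+) accept an if_sheet_exists argument that controls this; a minimal sketch, assuming openpyxl is installed and using a throwaway temporary file:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

path = os.path.join(tempfile.mkdtemp(), "append_demo.xlsx")

# The first write creates the file with a single sheet named "data".
df.to_excel(path, sheet_name="data", index=False)

# if_sheet_exists="replace" rewrites just the clashing sheet and
# leaves every other sheet in the workbook untouched.
with pd.ExcelWriter(path, mode="a", engine="openpyxl",
                    if_sheet_exists="replace") as writer:
    (df * 10).to_excel(writer, sheet_name="data", index=False)
    df.to_excel(writer, sheet_name="extra", index=False)

sheets = pd.read_excel(path, sheet_name=None)
print(list(sheets))                  # sheet names now in the file
print(sheets["data"]["a"].tolist())  # the replaced values
```

The other accepted values are "error" (the default) and "overlay" (write on top of the existing sheet's cells).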

3.9.4 read_excel() reads data

You can use the read_excel() method to read data in an Excel table. The syntax format is as follows:

pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None,
              usecols=None, squeeze=False,dtype=None, engine=None,
              converters=None, true_values=None, false_values=None,
              skiprows=None, nrows=None, na_values=None, parse_dates=False,
              date_parser=None, thousands=None, comment=None, skipfooter=0,
              convert_float=True, **kwds)

Commonly used parameters are as follows:

io Path or file-like object of the Excel file to read.
sheet_name Sheet to read: a sheet name, a 0-based position, a list of either, or None to read every sheet into a dict. Default 0 (the first sheet).
header Row number to use as the column names, default 0.
index_col Column (or list of columns) to use as the row index.
usecols Columns to read, given as positions, labels, or an Excel-style letter range such as "A:C". Default None (all columns).
skiprows Rows to skip at the start of the file, as a count or a list of row numbers.
nrows Number of rows to read.
dtype Data type (or mapping of column name to type) to apply to the data.

Handling unnamed columns and redefining indexes

import pandas as pd

# Read the Excel data
file_path = 'test_excel.xlsx'

df = pd.read_excel(file_path, engine='openpyxl')
print(f'#Original data\n{df}')

# Use the name column as the index and skip file row 2
df = pd.read_excel(file_path, index_col='name', skiprows=[2], engine='openpyxl')
print(f'#Use the name column as the index and skip file row 2\n{df}')

# Rename the unnamed column (regex=True treats the pattern as a regular expression)
df.columns = df.columns.str.replace('Unnamed.*', 'col_label', regex=True)
print(f'#Rename the unnamed column\n{df}')

The running results are shown as follows:

#Original data
   Unnamed: 0      name  rank language
0           0      博客中国     1      PHP
1           1    c语言中文网     2        C
2           2      CSDN     3      PHP
3           3  92python     4   Python
#Use the name column as the index and skip file row 2
          Unnamed: 0  rank language
name                               
博客中国               0     1      PHP
CSDN               2     3      PHP
92python           3     4   Python
#Rename the unnamed column
          col_label  rank language
name                              
博客中国            0     1      PHP
CSDN              2     3      PHP
92python          3     4   Python
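The Unnamed: 0 column only appears because the example file was written with its row index. A common alternative, sketched here with an in-memory buffer instead of a file, is to write with index=False, or to read with index_col=0 so the first column becomes the row index rather than a data column:

```python
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"name": ["CSDN", "92python"], "rank": [3, 4]})

# Option 1: do not write the row index at all.
buf1 = BytesIO()
df.to_excel(buf1, index=False, engine="openpyxl")
buf1.seek(0)
no_index = pd.read_excel(buf1, engine="openpyxl")
print(no_index.columns.tolist())    # ['name', 'rank']

# Option 2: write the index, but tell read_excel to treat the
# first column as the row index instead of as unnamed data.
buf2 = BytesIO()
df.to_excel(buf2, engine="openpyxl")
buf2.seek(0)
with_index = pd.read_excel(buf2, index_col=0, engine="openpyxl")
print(with_index.columns.tolist())  # ['name', 'rank']
```

Either way, no Unnamed: 0 column appears and no str.replace cleanup is needed.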

Passing a list to index_col uses several columns as a multi-level index, while usecols restricts which columns are read.

import pandas as pd

# Read the Excel data
file_path = 'test_excel.xlsx'

df = pd.read_excel(file_path, engine='openpyxl')
print(f'#Original data\n{df}')

# index_col=[0, 1] uses the first two columns as a multi-level index;
# usecols=[0, 1, 2] reads only the first three columns
df = pd.read_excel(file_path, index_col=[0, 1], usecols=[0, 1, 2], engine='openpyxl')
print(f'#index_col uses the first two columns as the index, usecols reads the first three columns\n{df}')

The running results are shown as follows:

#Original data
   Unnamed: 0      name  rank language
0           0      博客中国     1      PHP
1           1    c语言中文网     2        C
2           2      CSDN     3      PHP
3           3  92python     4   Python
#index_col uses the first two columns as the index, usecols reads the first three columns
                    rank
  name                  
0 博客中国               1
1 c语言中文网             2
2 CSDN                 3
3 92python             4
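Besides a single position or name, the sheet_name parameter of read_excel() also accepts None, which reads every sheet of a workbook at once into a dict. A minimal sketch, assuming openpyxl is installed (the sheet names and frames here are illustrative):

```python
from io import BytesIO

import pandas as pd

scores = pd.DataFrame({"a": [1, 2]})
ranks = pd.DataFrame({"b": [3, 4]})

# Build a two-sheet workbook in memory.
buf = BytesIO()
with pd.ExcelWriter(buf, engine="openpyxl") as writer:
    scores.to_excel(writer, sheet_name="scores", index=False)
    ranks.to_excel(writer, sheet_name="ranks", index=False)
buf.seek(0)

# sheet_name=None returns a dict mapping each sheet name
# to its DataFrame, in workbook order.
all_sheets = pd.read_excel(buf, sheet_name=None, engine="openpyxl")
print(list(all_sheets))                   # ['scores', 'ranks']
print(all_sheets["ranks"]["b"].tolist())  # [3, 4]
```

A list such as sheet_name=["scores", "ranks"] reads just those sheets, also returning a dict.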

3.10 File formats supported by pandas

Pandas supports almost all mainstream data storage formats, from Excel and CSV to JSON and various databases.

Pandas provides many functions to load data, mainly the following functions:

  • read_csv(): Load data from a CSV file
  • read_excel(): Load data from an Excel file
  • read_sql(): Load data from a SQL database
  • read_json(): Load data from a JSON file
  • read_html(): Load data from an HTML file

 

Pandas provides a variety of functions to save data into different file formats. The main functions are as follows:

  • to_csv(): Save data to CSV file
  • to_excel(): Save data to Excel file
  • to_sql(): Save data to SQL database
  • to_json(): Save data to JSON file
  • to_html(): Save data to HTML file

Origin blog.csdn.net/lsb2002/article/details/132997199