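1 Data operations and arithmetic alignment in pandas
2 Slicing and indexing with iloc and loc
3 Operations between a DataFrame and a Series
4 Function application and mapping
- 4.1 Using apply to apply a rule to the rows or columns of a DataFrame
- 4.2 Using applymap to apply a rule to every element of a DataFrame
5 Sorting Series and DataFrames
6 Handling duplicate indexes in a Series
7 Summarizing and computing descriptive statistics
8 Unique values, value counts, and membership
9 Handling missing data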

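1 Data operations and arithmetic alignment in pandas

pandas can perform arithmetic on objects whose indexes differ. When two objects are added, the index of the result is the union of the two indexes. If an axis label in one object has no match in the other, the result at that label is NaN by default; you can instead supply a fill value (for example 0) to stand in for the missing operand.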

from pandas import Series,DataFrame
import pandas as pd
import numpy as np
from numpy import nan

df1 = DataFrame(np.arange(12).reshape((3,4)),columns=list("abcd"))
df2 = DataFrame(np.arange(20).reshape((4,5)),columns=list("abcde"))
df1

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11

df2

	a	b	c	d	e
0	0	1	2	3	4
1	5	6	7	8	9
2	10	11	12	13	14
3	15	16	17	18	19

df1.add(df2,fill_value=0)  # where only one frame has a value (row 3 and column e), the missing operand is treated as 0 before adding

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	11.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0

df1.add(df2).fillna(0)    # add df1 and df2 the usual way, then replace the resulting NaN values with 0

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	0.0
1	9.0	11.0	13.0	15.0	0.0
2	18.0	20.0	22.0	24.0	0.0
3	0.0	0.0	0.0	0.0	0.0

'''
Note: df1.add(df2),
df1.add(df2,fill_value=0), and
df1.add(df2).fillna(0)
are fundamentally different operations
'''
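To make the note above concrete, here is a minimal sketch that looks at a single cell under each of the three calls. It rebuilds df1 and df2 as defined earlier; the variable names plain, filled_before and filled_after are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)), columns=list("abcde"))

plain = df1.add(df2)                        # same as df1 + df2: unmatched labels give NaN
filled_before = df1.add(df2, fill_value=0)  # a value missing from one frame counts as 0
filled_after = df1.add(df2).fillna(0)       # NaN results are overwritten with 0 afterwards

# Cell (3, "a") exists only in df2, where it holds 15.
print(plain.loc[3, "a"])          # nan
print(filled_before.loc[3, "a"])  # 15.0
print(filled_after.loc[3, "a"])   # 0.0
```

With fill_value the data from df2 survives; with fillna after the fact it is lost, because the NaN has already replaced it.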

2 Slicing and indexing with iloc and loc

loc: label-based indexing
iloc: purely position-based indexing

frame = DataFrame(np.arange(12).reshape((4,3)),
                  columns=list("bde"),
                 index=["Utah","Ohio","Texas","Oregon"])
frame

	b	d	e
Utah	0	1	2
Ohio	3	4	5
Texas	6	7	8
Oregon	9	10	11

frame.iloc[1]  # select a single row by position; use iloc[] in place of the old ix[] indexer

    b    3
    d    4
    e    5
    Name: Ohio, dtype: int32

frame.index

    Index(['Utah', 'Ohio', 'Texas', 'Oregon'], dtype='object')

# Select a row by its index label
frame.loc["Oregon"]

    b     9
    d    10
    e    11
    Name: Oregon, dtype: int32
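The examples above pick single rows. Since this section is also about slicing, the following sketch shows the one difference that tends to surprise people: a loc slice includes its end label, while an iloc slice excludes its end position. The frame is rebuilt here so the block runs on its own; by_label and by_position are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Label slice: both "Ohio" and "Texas" are included.
by_label = frame.loc["Ohio":"Texas", ["b", "e"]]

# Positional slice: position 1 up to, but not including, position 3.
by_position = frame.iloc[1:3, [0, 2]]

print(by_label.equals(by_position))  # True

# A single cell, addressed by row label and column label.
print(frame.loc["Texas", "d"])       # 7
```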

# Arithmetic between a DataFrame and a Series
frame

	b	d	e
Utah	0	1	2
Ohio	3	4	5
Texas	6	7	8
Oregon	9	10	11

series = frame.iloc[0]   # frame.loc["Utah"]
series

    b    0
    d    1
    e    2
    Name: Utah, dtype: int32

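3 Operations between a DataFrame and a Series

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts the operation down the rows.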

frame - series

	b	d	e
Utah	0	0	0
Ohio	3	3	3
Texas	6	6	6
Oregon	9	9	9
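The subtraction above matches on column labels. To match on the row index instead and broadcast across the columns, the arithmetic methods accept an axis argument. A small sketch, assuming the same frame as above (by_row and by_col are illustrative names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

row = frame.iloc[0]   # indexed by b, d, e
col = frame["d"]      # indexed by Utah, Ohio, Texas, Oregon

# Default: the Series index is matched against the columns, broadcast down the rows.
by_row = frame - row

# axis=0: the Series index is matched against the row index, broadcast across the columns.
by_col = frame.sub(col, axis=0)

print(by_col)
```

Every row of by_col comes out as -1, 0, 1, because each row's own d value is subtracted from that row.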

frame

	b	d	e
Utah	0	1	2
Ohio	3	4	5
Texas	6	7	8
Oregon	9	10	11

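4 Function application and mapping

4.1 Using apply to apply a rule to the rows or columns of a DataFrame

f = lambda x : x.max() - x.min()  # anonymous function: the range (max minus min)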

arr = np.array([1,2,3,4,5])


# Named equivalent of f; despite its name it returns max minus min
def getMax(x):
    return x.max() - x.min()

# getMax(arr)
f(arr)

frame.apply(f)  # the second parameter of apply defaults to axis=0, which applies f to each column; axis=1 applies it to each row

    b    9
    d    9
    e    9
    dtype: int64

frame

	b	d	e
Utah	0	1	2
Ohio	3	4	5
Texas	6	7	8
Oregon	9	10	11

frame.apply(f,axis=1)

    Utah      2
    Ohio      2
    Texas     2
    Oregon    2
    dtype: int64
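The function passed to apply does not have to return a scalar. As a sketch, the hypothetical helper min_max below returns a Series per column, so apply assembles a DataFrame. The last lines compare apply with the equivalent built-in reductions, which are usually the simpler choice for a computation like this one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def min_max(col):
    """Return the smallest and largest value of one column as a labeled Series."""
    return pd.Series([col.min(), col.max()], index=["min", "max"])

# One Series per column, stacked into a DataFrame with rows "min" and "max".
summary = frame.apply(min_max)
print(summary)

# The range computed with apply and with built-in reductions gives the same numbers.
via_apply = frame.apply(lambda x: x.max() - x.min())
via_builtin = frame.max() - frame.min()
print((via_apply == via_builtin).all())  # True
```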

4.2 Using applymap to apply a rule to every element of a DataFrame

frame = DataFrame(np.random.randn(12).reshape((4,3)),
                  columns=list("bde"),
                 index=["Utah","Ohio","Texas","Oregon"])
frame

	b	d	e
Utah	-0.033554	-0.179060	-0.169456
Ohio	0.397475	-1.661291	0.611291
Texas	0.114703	-0.467590	-0.424874
Oregon	-1.497851	1.239364	2.076009

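# Define a function that formats a float with two decimal places
def twoFixed(num):
    return "%.2f"%num

# The same function written as a lambda
f = lambda num : "%.2f"%num

# Apply f to every element of frame (pandas 2.1+ deprecates applymap in favor of DataFrame.map)
strFrame = frame.applymap(f)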
strFrame

	b	d	e
Utah	-0.03	-0.18	-0.17
Ohio	0.40	-1.66	0.61
Texas	0.11	-0.47	-0.42
Oregon	-1.50	1.24	2.08

frame.dtypes  # data type of each column in the DataFrame

    b    float64
    d    float64
    e    float64
    dtype: object

strFrame.dtypes

    b    object
    d    object
    e    object
    dtype: object

# Apply a rule to a single column
frame["d"].map(lambda x :x+10)

    Utah       9.820940
    Ohio       8.338709
    Texas      9.532410
    Oregon    11.239364
    Name: d, dtype: float64
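As the dtypes output above shows, formatting with "%.2f" turns every value into a string. If the goal is only to limit precision while keeping numbers, rounding may be the better fit. The sketch below contrasts the two on a small made-up frame; it also picks DataFrame.map when available (pandas 2.1 and later) and falls back to applymap otherwise, which is an assumption about the installed version rather than something the original notebook does.

```python
import pandas as pd

frame = pd.DataFrame({"b": [-0.033554, 0.397475],
                      "d": [-0.179060, -1.661291]},
                     index=["Utah", "Ohio"])

# Element-wise application: DataFrame.map on newer pandas, applymap on older versions.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap

as_text = elementwise(lambda num: "%.2f" % num)  # values become strings
as_number = frame.round(2)                       # values stay float64

print(as_text.loc["Ohio", "b"])   # 0.40
print(as_number.dtypes)
```

The string version is fine for display, but arithmetic on it no longer works as numeric arithmetic.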

5 Sorting Series and DataFrames

series = Series(range(4),
                index=list("dabc"))
series

    d    0
    a    1
    b    2
    c    3
    dtype: int64

series.sort_index()

    a    1
    b    2
    c    3
    d    0
    dtype: int64

frame = DataFrame(np.arange(8).reshape((2,4)),
                  index=["three","one"],
                 columns=list("dabc"))
frame

	d	a	b	c
three	0	1	2	3
one	4	5	6	7

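'''
DataFrame.sort_index(axis,ascending,by)
axis = 0   sort by the row index (index)

axis = 1   sort by the column index (columns)

ascending = False   descending order

ascending = True    ascending order

by      sort by the named column; deprecated and absent from current pandas, use sort_values(by=...) instead
'''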
frame.sort_index()

	d	a	b	c
one	4	5	6	7
three	0	1	2	3

frame.sort_index(axis=1,ascending=False)

	d	c	b	a
three	0	3	2	1
one	4	7	6	5

# Sort a DataFrame by the values of one of its columns
df = DataFrame({"a":[4,7,-3,2],
               "b":[0,1,0,1]})
df

	a	b
0	4	0
1	7	1
2	-3	0
3	2	1

# Sort by the values in column a
df.sort_values(by="a")   # prefer df.sort_values(by) over the old sort_index(by) usage

	a	b
2	-3	0
3	2	1
0	4	0
1	7	1
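sort_values also accepts a list of columns, with a separate direction for each. A short sketch on the same df, sorting by b ascending and breaking ties by a descending (the name ordered is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 7, -3, 2],
                   "b": [0, 1, 0, 1]})

# Primary key b (ascending), secondary key a (descending).
ordered = df.sort_values(by=["b", "a"], ascending=[True, False])
print(ordered)

# The original row labels travel with the rows.
print(list(ordered.index))  # [0, 2, 1, 3]
```

If the old labels are not needed afterwards, ordered.reset_index(drop=True) renumbers the rows from 0.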

6 Handling duplicate indexes in a Series

series = Series(range(5),index=list("aabbc"))
series

    a    0
    a    1
    b    2
    b    3
    c    4
    dtype: int64

# Check whether the index of the Series contains duplicates
series.index.is_unique     # False: the index has duplicates; True: it has none

False

series["a"]

    a    0
    a    1
    dtype: int64
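Selecting a duplicated label returns a Series, while a label that occurs once returns a scalar, so code that indexes such a Series has to cope with both. The sketch below shows that difference and two common ways of resolving duplicates: keeping the first entry per label, or aggregating per label. The names dup_mask, first_only and totals are illustrative.

```python
import pandas as pd

series = pd.Series(range(5), index=list("aabbc"))

print(type(series["a"]).__name__)  # Series, because "a" occurs twice
print(series["c"])                 # 4, because "c" occurs once

# Mark every repeat of a label after its first occurrence.
dup_mask = series.index.duplicated(keep="first")

# Option 1: keep only the first value per label.
first_only = series[~dup_mask]

# Option 2: combine the values that share a label.
totals = series.groupby(level=0).sum()

print(first_only)
print(totals)
```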

7 Summarizing and computing descriptive statistics

'''
df.sum(axis)

axis=0   sum down each column (default)
axis=1   sum across each row

'''

df = DataFrame([[1.4,nan],
                [7.1,-4.5],
                [nan,nan],
                [0.75,-1.3]],
               index=list("abcd"),
              columns=["one","two"])
df

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

# Sum each column
df.sum()

    one    9.25
    two   -5.80
    dtype: float64

# Sum each row
df.sum(axis=1)

    a    1.40
    b    2.60
    c    0.00
    d   -0.55
    dtype: float64
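Row c above sums to 0.00 even though it contains no data, because NaN values are skipped by default. If an all-NaN row should stay NaN, the min_count parameter of sum can require a minimum number of valid values. A sketch on the same df, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan],
                   [7.1, -4.5],
                   [np.nan, np.nan],
                   [0.75, -1.3]],
                  index=list("abcd"),
                  columns=["one", "two"])

default_sum = df.sum(axis=1)               # NaN skipped; an all-NaN row sums to 0.0
strict_sum = df.sum(axis=1, min_count=1)   # needs at least one valid value, else NaN
propagated = df.sum(axis=1, skipna=False)  # any NaN in the row makes the result NaN

print(default_sum["c"])  # 0.0
print(strict_sum["c"])   # nan
print(propagated["a"])   # nan
```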

df

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

df.mean()  # averages each column by default (axis=0); axis=1 averages each row

    one    3.083333
    two   -2.900000
    dtype: float64

df.mean(axis=1,skipna=False)  # skipna controls whether NaN values are skipped: False keeps them (so the result is NaN), True skips them

    a      NaN
    b    1.300
    c      NaN
    d   -0.275
    dtype: float64

df

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

df.cumsum()

	one	two
a	1.40	NaN
b	8.50	-4.5
c	NaN	NaN
d	9.25	-5.8

df

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

df.describe()

	one	two
count	3.000000	2.000000
mean	3.083333	-2.900000
std	3.493685	2.262742
min	0.750000	-4.500000
25%	1.075000	-3.700000
50%	1.400000	-2.900000
75%	4.250000	-2.100000
max	7.100000	-1.300000

8 Unique values, value counts, and membership

8.1 Related functions

series = Series(list("aabc")*4)
series

    0     a
    1     a
    2     b
    3     c
    4     a
    5     a
    6     b
    7     c
    8     a
    9     a
    10    b
    11    c
    12    a
    13    a
    14    b
    15    c
    dtype: object

# Get the distinct values in the Series
series.unique()

    array(['a', 'b', 'c'], dtype=object)

# Count how many times each value occurs in the Series
series.value_counts()

    a    8
    c    4
    b    4
    dtype: int64

# pandas also exposes value_counts as a top-level function with the same behavior (deprecated in recent pandas releases)
pd.value_counts(series.values,sort=False)

    b    4
    a    8
    c    4
    dtype: int64
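The Series method covers the same ground as the top-level function and takes a few useful options. As a sketch, normalize=True returns relative frequencies instead of counts, and sort_index() orders the result by label rather than by count, which avoids depending on how ties are ordered.

```python
import pandas as pd

series = pd.Series(list("aabc") * 4)

counts = series.value_counts()                # absolute counts, most frequent first
shares = series.value_counts(normalize=True)  # fractions that sum to 1
by_label = series.value_counts().sort_index() # ordered by label: a, b, c

print(counts["a"])   # 8
print(shares["a"])   # 0.5
print(by_label)
```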

8.2 Testing whether Series elements belong to a given set

series

    0     a
    1     a
    2     b
    3     c
    4     a
    5     a
    6     b
    7     c
    8     a
    9     a
    10    b
    11    c
    12    a
    13    a
    14    b
    15    c
    dtype: object

mask = series.isin(["b","c"])

series[mask]   # filter the data with a boolean mask

    2     b
    3     c
    6     b
    7     c
    10    b
    11    c
    14    b
    15    c
    dtype: object
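The same mask can be inverted with ~ to select everything that is not in the set. A brief sketch using the same series (kept and dropped are illustrative names):

```python
import pandas as pd

series = pd.Series(list("aabc") * 4)

mask = series.isin(["b", "c"])  # True where the element is "b" or "c"

kept = series[mask]      # elements inside the set
dropped = series[~mask]  # elements outside the set

print(mask.sum())            # 8, the number of matches
print(dropped.unique())
```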

8.3 Counting occurrences of each value across a DataFrame

data=DataFrame({"qu1":[1,3,4,3,4],
               "qu2":[2,3,1,2,3],
               "qu3":[1,5,2,6,4]})
data

	qu1	qu2	qu3
0	1	2	1
1	3	3	5
2	4	1	2
3	3	2	6
4	4	3	4

data.apply(pd.value_counts,axis=1).fillna(0).astype("int")  # axis=1 counts values within each row; omit axis to count within each column

	1	2	3	4	5	6
0	2	1	0	0	0	0
1	0	0	2	0	1	0
2	1	1	0	1	0	0
3	0	1	1	0	0	1
4	0	0	1	2	0	0
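The table above counts values within each row, since axis=1 was passed. For the per-column version, a sketch that uses the Series method through a lambda (one assumed way of writing it, chosen here to avoid the top-level pd.value_counts) looks like this:

```python
import pandas as pd

data = pd.DataFrame({"qu1": [1, 3, 4, 3, 4],
                     "qu2": [2, 3, 1, 2, 3],
                     "qu3": [1, 5, 2, 6, 4]})

# Default axis=0: the lambda receives one column at a time.
# Values absent from a column come back as NaN, so fill with 0 and restore integers.
per_column = data.apply(lambda col: col.value_counts()).fillna(0).astype(int)
print(per_column)
```

Here the row labels of the result are the distinct values 1 through 6, and the columns are qu1, qu2 and qu3.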

# Detect missing values in a Series
series = Series(["a","b",nan,"c"])
series

    0      a
    1      b
    2    NaN
    3      c
    dtype: object

mask = series.isnull()

~mask  # invert the mask

    0     True
    1     True
    2    False
    3     True
    dtype: bool

# Check whether each element of the Series is not null
series.notnull()

    0     True
    1     True
    2    False
    3     True
    dtype: bool

9 Handling missing data

series

    0      a
    1      b
    2    NaN
    3      c
    dtype: object

# Drop the missing values from the Series
series.dropna()

    0    a
    1    b
    3    c
    dtype: object

# Handling missing values in a DataFrame

data = DataFrame([[1,6.5,3,4],
                  [1,nan,nan,5],
                  [nan,nan,nan,6],
                  [nan,6.5,3,7]])
data

	0	1	2	3
0	1.0	6.5	3.0	4
1	1.0	NaN	NaN	5
2	NaN	NaN	NaN	6
3	NaN	6.5	3.0	7

'''
DataFrame.dropna(axis,how)
axis=0
     drop a row if any of its columns is NaN
axis=1
     drop a column if any of its rows is NaN

how="all"
     with axis=0, drop only rows that are entirely NaN
     with axis=1, drop only columns that are entirely NaN

'''
cleaned = data.dropna(axis=1)

cleaned

	3
0	4
1	5
2	6
3	7

data2 = DataFrame([[1,6.5,3,nan],
                  [1,nan,nan,nan],
                  [nan,nan,nan,nan],
                  [nan,6.5,3,nan]])
data2

	0	1	2	3
0	1.0	6.5	3.0	NaN
1	1.0	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN
3	NaN	6.5	3.0	NaN

data2.dropna(axis=1,how="all")

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0
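Between "drop on any NaN" and "drop only when everything is NaN" there are two finer controls on dropna: thresh keeps rows with at least a given number of non-NaN values, and subset restricts the check to chosen columns. A sketch on the data frame defined earlier in this section, with illustrative variable names:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 6.5, 3, 4],
                     [1, np.nan, np.nan, 5],
                     [np.nan, np.nan, np.nan, 6],
                     [np.nan, 6.5, 3, 7]])

any_nan = data.dropna()              # default: drop rows containing any NaN
at_least_two = data.dropna(thresh=2) # keep rows with at least 2 non-NaN values
by_subset = data.dropna(subset=[0])  # only NaN in column 0 causes a row to drop

print(list(any_nan.index))       # [0]
print(list(at_least_two.index))  # [0, 1, 3]
print(list(by_subset.index))     # [0, 1]
```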

df= DataFrame([[1,2,nan],
              [4,nan,6],
              [nan,5,9]],
             index=['one','two','three'],
             columns=list('abc'))
df

	a	b	c
one	1.0	2.0	NaN
two	4.0	NaN	6.0
three	NaN	5.0	9.0

df.fillna(0)

	a	b	c
one	1.0	2.0	0.0
two	4.0	0.0	6.0
three	0.0	5.0	9.0

# Fill NaN in different columns with different values, keyed by column name
df1 = df.fillna({'a':0,'c':"F"})
df1

	a	b	c
one	1.0	2.0	F
two	4.0	NaN	6
three	0.0	5.0	9

# fillna returns a new object by default (inplace=False)
# inplace=True modifies the original data and returns None
df.fillna(0,inplace=True)
df

	a	b	c
one	1.0	2.0	0.0
two	4.0	0.0	6.0
three	0.0	5.0	9.0
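A constant is not the only possible fill. Two other common choices, sketched here on a fresh copy of the same frame, are filling each column with its own mean and carrying the last valid value forward. Both return new objects and leave the input untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [4, np.nan, 6],
                   [np.nan, 5, 9]],
                  index=["one", "two", "three"],
                  columns=list("abc"))

# df.mean() is a Series keyed by column name, so each column gets its own mean.
with_means = df.fillna(df.mean())

# Forward fill: copy the previous row's value down into each gap.
forward = df.ffill()

print(with_means)
print(forward)
```

Forward fill cannot fill a gap in the first row, so cell ("one", "c") remains NaN in forward.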
