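Pandas: Advanced Usage


1 Arithmetic and data alignment in pandas

  • pandas can perform arithmetic on objects whose indexes differ. When two objects are added and their index labels do not all match, the index of the result is the union of the two indexes.
    Wherever a label on one axis is present in only one of the objects, the result is NaN by default; a fill value (such as 0) can be supplied instead.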
from pandas import Series,DataFrame
import pandas as pd
import numpy as np
from numpy import nan
df1 = DataFrame(np.arange(12).reshape((3,4)),columns=list("abcd"))
df2 = DataFrame(np.arange(20).reshape((4,5)),columns=list("abcde"))
df1
       a  b   c   d
    0  0  1   2   3
    1  4  5   6   7
    2  8  9  10  11
df2
        a   b   c   d   e
    0   0   1   2   3   4
    1   5   6   7   8   9
    2  10  11  12  13  14
    3  15  16  17  18  19
df1.add(df2,fill_value=0)  # positions missing from one frame (row 3 and column e of df1) are treated as 0 before adding
          a     b     c     d     e
    0   0.0   2.0   4.0   6.0   4.0
    1   9.0  11.0  13.0  15.0   9.0
    2  18.0  20.0  22.0  24.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
df1.add(df2).fillna(0)    # add df1 and df2 the normal way, then replace the resulting NaN values with 0
          a     b     c     d    e
    0   0.0   2.0   4.0   6.0  0.0
    1   9.0  11.0  13.0  15.0  0.0
    2  18.0  20.0  22.0  24.0  0.0
    3   0.0   0.0   0.0   0.0  0.0
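'''
Note: these three calls are not equivalent.
df1.add(df2)                 leaves NaN wherever a label exists in only one frame
df1.add(df2,fill_value=0)    substitutes 0 for the missing operand before adding
df1.add(df2).fillna(0)       adds first, then replaces the resulting NaN with 0
'''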
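The difference comes down to when the fill happens. As a minimal sketch, the two small Series below (`s1` and `s2` are illustrative names, not from the original) overlap on only one label, which makes the three results easy to compare:

```python
import numpy as np
import pandas as pd

# Illustrative data: "a" exists only in s1, "c" exists only in s2.
s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

plain = s1.add(s2)                        # unmatched labels give NaN
filled_before = s1.add(s2, fill_value=0)  # missing operand is treated as 0, then added
filled_after = s1.add(s2).fillna(0)       # added first, NaN results replaced by 0

print(plain)
print(filled_before)
print(filled_after)
```

With `fill_value=0` the values that exist on only one side survive in the result; with `.fillna(0)` afterwards they are lost, because the sum was already NaN.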

2 Slicing and indexing with iloc and loc

  • loc: label-based indexing
  • iloc: purely position-based indexing
frame = DataFrame(np.arange(12).reshape((4,3)),
                  columns=list("bde"),
                 index=["Utah","Ohio","Texas","Oregon"])
frame
            b   d   e
    Utah    0   1   2
    Ohio    3   4   5
    Texas   6   7   8
    Oregon  9  10  11
frame.iloc[1]  # select one row by position; iloc[] replaces the removed ix[] indexer
    b    3
    d    4
    e    5
    Name: Ohio, dtype: int32
frame.index
    Index(['Utah', 'Ohio', 'Texas', 'Oregon'], dtype='object')
# select a row by its index label
frame.loc["Oregon"]
    b     9
    d    10
    e    11
    Name: Oregon, dtype: int32
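The examples above select single rows, but the section title also mentions slicing. The sketch below, which rebuilds the same `frame`, shows the one difference that most often surprises people: a label slice with `loc` includes both endpoints, while a position slice with `iloc` excludes the stop position. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Label-based slice: BOTH endpoints ("Ohio" and "Texas") are included.
by_label = frame.loc["Ohio":"Texas", ["b", "e"]]

# Position-based slice: the stop position (3) is excluded, as in plain Python.
by_position = frame.iloc[1:3, [0, 2]]

# A single cell, addressed by labels or by positions.
cell_label = frame.loc["Texas", "d"]
cell_position = frame.iloc[2, 1]

print(by_label)
print(by_position)
print(cell_label, cell_position)
```

Both slices select the same two rows and two columns here.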
# arithmetic between a DataFrame and a Series
frame
            b   d   e
    Utah    0   1   2
    Ohio    3   4   5
    Texas   6   7   8
    Oregon  9  10  11
series = frame.iloc[0]   # same as frame.loc["Utah"]
series
    b    0
    d    1
    e    2
    Name: Utah, dtype: int32

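3 Operations between DataFrame and Series

  • By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame, then broadcasts the operation down the rows.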
frame - series
            b  d  e
    Utah    0  0  0
    Ohio    3  3  3
    Texas   6  6  6
    Oregon  9  9  9
frame
            b   d   e
    Utah    0   1   2
    Ohio    3   4   5
    Texas   6   7   8
    Oregon  9  10  11
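To broadcast in the other direction (matching on the row index and spreading across the columns), the arithmetic methods accept an `axis` argument. The following is a small sketch on the same `frame`; the names `column` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# A Series indexed by the row labels of frame.
column = frame["d"]

# axis="index" matches on the row labels and broadcasts across the columns.
result = frame.sub(column, axis="index")

print(result)
```

Every row now holds its distance from column d, so column d itself becomes all zeros.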

4 Function application and mapping

4.1 Using apply to run a function over the rows or columns of a DataFrame

f = lambda x : x.max() - x.min()  # anonymous (lambda) function

arr = np.array([1,2,3,4,5])


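# the same rule written as a named function (it returns the range, max - min)
def getMax(x):
    return x.max() - x.min()

# getMax(arr) returns the same value as f(arr)
f(arr)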
        4
frame.apply(f)  # the second parameter defaults to axis=0, which applies f to each column; axis=1 applies f to each row
    b    9
    d    9
    e    9
    dtype: int64
frame
            b   d   e
    Utah    0   1   2
    Ohio    3   4   5
    Texas   6   7   8
    Oregon  9  10  11
frame.apply(f,axis=1)
    Utah      2
    Ohio      2
    Texas     2
    Oregon    2
    dtype: int64
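The function passed to `apply` does not have to return a single number. As a sketch, the hypothetical helper `min_max` below returns a small Series for each column, and `apply` assembles those into a DataFrame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def min_max(col):
    # Return two labelled values per column instead of one scalar.
    return pd.Series([col.min(), col.max()], index=["min", "max"])

summary = frame.apply(min_max)  # axis=0: min_max is called once per column

print(summary)
```

The result has one row per label returned by the helper and one column per original column.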

4.2 Using applymap to apply a function to every element of a DataFrame

frame = DataFrame(np.random.randn(12).reshape((4,3)),
                  columns=list("bde"),
                 index=["Utah","Ohio","Texas","Oregon"])
frame
                   b         d         e
    Utah   -0.033554 -0.179060 -0.169456
    Ohio    0.397475 -1.661291  0.611291
    Texas   0.114703 -0.467590 -0.424874
    Oregon -1.497851  1.239364  2.076009
# define a function that formats a float with two decimal places
def twoFixed(num):
    return "%.2f"%num

# the same function written as a lambda
f = lambda num : "%.2f"%num
# apply f to every element of frame (pandas 2.1+ renames applymap to DataFrame.map)
strFrame = frame.applymap(f)
strFrame
                b      d      e
    Utah    -0.03  -0.18  -0.17
    Ohio     0.40  -1.66   0.61
    Texas    0.11  -0.47  -0.42
    Oregon  -1.50   1.24   2.08
frame.dtypes  # data type of each column in the DataFrame
    b    float64
    d    float64
    e    float64
    dtype: object
strFrame.dtypes
    b    object
    d    object
    e    object
    dtype: object
# apply a function to a single column (a Series) with map
frame["d"].map(lambda x :x+10)
    Utah       9.820940
    Ohio       8.338709
    Texas      9.532410
    Oregon    11.239364
    Name: d, dtype: float64
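In pandas 2.1 and later, `DataFrame.applymap` is deprecated in favor of `DataFrame.map`. The sketch below picks whichever method the installed version offers; the small frame and the names `fmt` and `elementwise` are illustrative, with values borrowed from the output above.

```python
import pandas as pd

frame = pd.DataFrame({"b": [0.397475, -1.497851],
                      "d": [-1.661291, 1.239364]},
                     index=["Ohio", "Oregon"])

def fmt(num):
    # Format a float as a string with two decimal places.
    return "%.2f" % num

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
elementwise = frame.map if hasattr(pd.DataFrame, "map") else frame.applymap
str_frame = elementwise(fmt)

print(str_frame)
print(str_frame.dtypes)
```

Either way the result holds strings, so every column ends up with dtype `object`, just like `strFrame` above.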

5 Sorting Series and DataFrames

series = Series(range(4),
                index=list("dabc"))
series
    d    0
    a    1
    b    2
    c    3
    dtype: int64
series.sort_index()
    a    1
    b    2
    c    3
    d    0
    dtype: int64
frame = DataFrame(np.arange(8).reshape((2,4)),
                  index=["three","one"],
                 columns=list("dabc"))
frame
           d  a  b  c
    three  0  1  2  3
    one    4  5  6  7
'''
DataFrame.sort_index(axis,ascending,by)
axis = 0   sort by the row index (index)

axis = 1    sort by the column index (columns)

ascending = False descending order

ascending = True   ascending order

by      sort by the named column(s); not recommended, and removed from sort_index
        in later pandas versions (use sort_values(by=...) instead)
'''
frame.sort_index()
           d  a  b  c
    one    4  5  6  7
    three  0  1  2  3
frame.sort_index(axis=1,ascending=False)
           d  c  b  a
    three  0  3  2  1
    one    4  7  6  5
# sort a DataFrame by the values in one of its columns
df = DataFrame({"a":[4,7,-3,2],
               "b":[0,1,0,1]})
df
       a  b
    0  4  0
    1  7  1
    2 -3  0
    3  2  1
# sort by the values in column a
df.sort_values(by="a")   # prefer df.sort_values(by=...) over the old sort_index(by=...) usage
       a  b
    2 -3  0
    3  2  1
    0  4  0
    1  7  1
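`sort_values` also accepts a list of columns, with a matching list of directions. As a sketch on the same data (the name `ordered` is illustrative), the call below sorts by b ascending and breaks ties by a descending:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 7, -3, 2],
                   "b": [0, 1, 0, 1]})

# Sort by column b first (ascending), then by column a (descending) within each b.
ordered = df.sort_values(by=["b", "a"], ascending=[True, False])

print(ordered)
```

The original row labels travel with the rows, which makes it easy to see how they were reordered.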

6 Handling duplicate index labels in a Series

series = Series(range(5),index=list("aabbc"))
series
    a    0
    a    1
    b    2
    b    3
    c    4
    dtype: int64
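# check whether the Series index contains duplicate labels
series.index.is_unique     # False: duplicates are present; True: every label is unique
    False
# selecting a duplicated label returns all matching entries as a Series
series["a"]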
    a    0
    a    1
    dtype: int64
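Because a duplicated label returns a Series while a unique label returns a scalar, code that indexes by label may need to normalize the index first. Two common options are sketched below, assuming that either keeping the first entry or summing the entries is acceptable for the data at hand; the names `first_only` and `totals` are illustrative.

```python
import pandas as pd

series = pd.Series(range(5), index=list("aabbc"))

# A duplicated label yields a Series, a unique label yields a scalar.
dup = series["a"]
single = series["c"]

# Option 1: keep only the first entry for each label.
first_only = series[~series.index.duplicated(keep="first")]

# Option 2: aggregate the entries that share a label.
totals = series.groupby(level=0).sum()

print(first_only)
print(totals)
```

After either step `index.is_unique` is True, so label lookups return scalars again.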

7 Summary and descriptive statistics

'''
df.sum(axis)

axis=0   sum down each column (default)
axis=1   sum across each row

'''
df = DataFrame([[1.4,nan],
                [7.1,-4.5],
                [nan,nan],
                [0.75,-1.3]],
               index=list("abcd"),
              columns=["one","two"])
df
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.75 -1.3
# sum down each column
df.sum()
    one    9.25
    two   -5.80
    dtype: float64
# sum across each row
df.sum(axis=1)
    a    1.40
    b    2.60
    c    0.00
    d   -0.55
    dtype: float64
df
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.75 -1.3
df.mean()  # column-wise mean by default (axis=0); use axis=1 for the row-wise mean
    one    3.083333
    two   -2.900000
    dtype: float64
df.mean(axis=1,skipna=False)  # skipna controls whether NaN is skipped. False: keep NaN, so any NaN makes the result NaN; True (default): skip NaN
    a      NaN
    b    1.300
    c      NaN
    d   -0.275
    dtype: float64
df
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.75 -1.3
df.cumsum()
        one  two
    a  1.40  NaN
    b  8.50 -4.5
    c   NaN  NaN
    d  9.25 -5.8
df
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.75 -1.3
df.describe()
                one       two
    count  3.000000  2.000000
    mean   3.083333 -2.900000
    std    3.493685  2.262742
    min    0.750000 -4.500000
    25%    1.075000 -3.700000
    50%    1.400000 -2.900000
    75%    4.250000 -2.100000
    max    7.100000 -1.300000
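In the row sums above, row c, which is entirely NaN, comes out as 0.00. If an all-NaN row should stay NaN instead, `sum` accepts a `min_count` argument. The sketch below also uses `idxmax` to show which label holds each column's maximum; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan],
                   [7.1, -4.5],
                   [np.nan, np.nan],
                   [0.75, -1.3]],
                  index=list("abcd"),
                  columns=["one", "two"])

default_sum = df.sum(axis=1)              # an all-NaN row sums to 0.0
strict_sum = df.sum(axis=1, min_count=1)  # require at least one non-NaN value

largest = df.idxmax()  # index label of the maximum in each column (NaN skipped)

print(default_sum)
print(strict_sum)
print(largest)
```

`min_count=1` leaves rows that contain at least one number unchanged and only affects the rows that are missing everywhere.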

8 Unique values, value counts, and membership

8.1 Related functions

series = Series(list("aabc")*4)
series
    0     a
    1     a
    2     b
    3     c
    4     a
    5     a
    6     b
    7     c
    8     a
    9     a
    10    b
    11    c
    12    a
    13    a
    14    b
    15    c
    dtype: object
# get the distinct values in the Series
series.unique()
    array(['a', 'b', 'c'], dtype=object)
# count how many times each value occurs in the Series
series.value_counts()
    a    8
    c    4
    b    4
    dtype: int64
# pandas also provides value_counts() as a top-level function with the same behavior
pd.value_counts(series.values,sort=False)
    b    4
    a    8
    c    4
    dtype: int64
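Recent pandas releases deprecate the top-level `pd.value_counts` in favor of the `Series.value_counts` method, which also accepts a `normalize` flag to return proportions instead of counts. A minimal sketch on the same data (the names `counts` and `shares` are illustrative):

```python
import pandas as pd

series = pd.Series(list("aabc") * 4)

counts = series.value_counts()                # absolute counts, most frequent first
shares = series.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts)
print(shares)
```

The order of tied values (here b and c) can differ between pandas versions, so it is safer to look results up by label than by position.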

8.2 Testing whether the elements of a Series belong to a given set

series
    0     a
    1     a
    2     b
    3     c
    4     a
    5     a
    6     b
    7     c
    8     a
    9     a
    10    b
    11    c
    12    a
    13    a
    14    b
    15    c
    dtype: object
mask = series.isin(["b","c"])
series[mask]   # filter the data with the boolean mask (boolean indexing)
    2     b
    3     c
    6     b
    7     c
    10    b
    11    c
    14    b
    15    c
    dtype: object
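The same mask can be inverted with `~` to keep everything that is not in the set. A short sketch, with `only_a` as an illustrative name:

```python
import pandas as pd

series = pd.Series(list("aabc") * 4)

mask = series.isin(["b", "c"])  # True where the element is "b" or "c"
only_a = series[~mask]          # ~ flips the mask, keeping the other elements

print(only_a)
```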

8.3 Counting how often each value occurs in a DataFrame

data=DataFrame({"qu1":[1,3,4,3,4],
               "qu2":[2,3,1,2,3],
               "qu3":[1,5,2,6,4]})
data
       qu1  qu2  qu3
    0    1    2    1
    1    3    3    5
    2    4    1    2
    3    3    2    6
    4    4    3    4
# axis=1 counts the values within each row; the default axis=0 counts within each column
data.apply(pd.value_counts,axis=1).fillna(0).astype("int")
       1  2  3  4  5  6
    0  2  1  0  0  0  0
    1  0  0  2  0  1  0
    2  1  1  0  1  0  0
    3  0  1  1  0  0  1
    4  0  0  1  2  0  0
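The call above uses `axis=1`, so each row of the result describes one row of `data`. To count within each column instead, the default `axis=0` can be used. The sketch below does this with the `Series.value_counts` method rather than the top-level function; `per_column` is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({"qu1": [1, 3, 4, 3, 4],
                     "qu2": [2, 3, 1, 2, 3],
                     "qu3": [1, 5, 2, 6, 4]})

# Count values column by column; values absent from a column become 0.
per_column = (data.apply(lambda col: col.value_counts())
                  .fillna(0)
                  .astype(int))

print(per_column)
```

Here the row labels of the result are the distinct values found anywhere in `data`, and the columns are the original columns.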
# detect missing values in a Series
series = Series(["a","b",nan,"c"])
series
    0      a
    1      b
    2    NaN
    3      c
    dtype: object
mask = series.isnull()
~mask  # invert the mask
    0     True
    1     True
    2    False
    3     True
    dtype: bool
# check which elements of the Series are not null
series.notnull()
    0     True
    1     True
    2    False
    3     True
    dtype: bool

9 Handling missing data

series
    0      a
    1      b
    2    NaN
    3      c
    dtype: object
# drop the missing values from the Series
series.dropna()
    0    a
    1    b
    3    c
    dtype: object
# handling missing values in a DataFrame
data = DataFrame([[1,6.5,3,4],
                  [1,nan,nan,5],
                  [nan,nan,nan,6],
                  [nan,6.5,3,7]])
data
         0    1    2  3
    0  1.0  6.5  3.0  4
    1  1.0  NaN  NaN  5
    2  NaN  NaN  NaN  6
    3  NaN  6.5  3.0  7
'''
DataFrame.dropna(axis,how)
axis=0 (default)
     drop every row that contains at least one NaN
axis=1
     drop every column that contains at least one NaN

how="all"
     with axis=0, drop only the rows whose values are all NaN
     with axis=1, drop only the columns whose values are all NaN

'''
cleaned = data.dropna(axis=1)
cleaned
       3
    0  4
    1  5
    2  6
    3  7
data2 = DataFrame([[1,6.5,3,nan],
                  [1,nan,nan,nan],
                  [nan,nan,nan,nan],
                  [nan,6.5,3,nan]])
data2
         0    1    2   3
    0  1.0  6.5  3.0 NaN
    1  1.0  NaN  NaN NaN
    2  NaN  NaN  NaN NaN
    3  NaN  6.5  3.0 NaN
data2.dropna(axis=1,how="all")
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0
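Between "any NaN" and "all NaN" there is a middle ground: `dropna` also takes `thresh` (the minimum number of non-NaN values a row must have to be kept) and `subset` (only look at certain columns). The sketch below applies both to the earlier `data` frame; the result names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 6.5, 3, 4],
                     [1, np.nan, np.nan, 5],
                     [np.nan, np.nan, np.nan, 6],
                     [np.nan, 6.5, 3, 7]])

# Keep only the rows that have at least 3 non-NaN values.
at_least_three = data.dropna(thresh=3)

# Drop a row only if column 0 is NaN, ignoring the other columns.
by_subset = data.dropna(subset=[0])

print(at_least_three)
print(by_subset)
```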
df= DataFrame([[1,2,nan],
              [4,nan,6],
              [nan,5,9]],
             index=['one','two','three'],
             columns=list('abc'))
df
             a    b    c
    one    1.0  2.0  NaN
    two    4.0  NaN  6.0
    three  NaN  5.0  9.0
df.fillna(0)
             a    b    c
    one    1.0  2.0  0.0
    two    4.0  0.0  6.0
    three  0.0  5.0  9.0
# replace NaN with a different value per column, keyed by column name
df1 = df.fillna({'a':0,'c':"F"})
df1
             a    b  c
    one    1.0  2.0  F
    two    4.0  NaN  6
    three  0.0  5.0  9
# fillna returns a new object by default (inplace=False)
# inplace=True modifies the original data and returns None
df.fillna(0,inplace=True)
df
             a    b    c
    one    1.0  2.0  0.0
    two    4.0  0.0  6.0
    three  0.0  5.0  9.0
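A constant is not the only possible fill. As a sketch, the code below rebuilds the original `df` (the version before the in-place fill) and shows two alternatives: filling each column with its own mean, and carrying the previous row's value forward. Whether either is appropriate depends on the data; the names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [4, np.nan, 6],
                   [np.nan, 5, 9]],
                  index=["one", "two", "three"],
                  columns=list("abc"))

# Fill each column's NaN with that column's mean (df.mean() is keyed by column name).
mean_filled = df.fillna(df.mean())

# Forward fill: copy the last seen value down each column.
forward_filled = df.ffill()

print(mean_filled)
print(forward_filled)
```

Forward filling cannot fill a NaN in the first row, since there is no earlier value to copy.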

Reposted from blog.csdn.net/qq_27171347/article/details/81475377