pandas Data Aggregation

1. apply

  • Series
  • Series.apply(func, convert_dtype=True, args=(), **kwds)
  • func: the aggregation function; it is called automatically on every element of the Series
>>> import pandas as pd
>>> import numpy as np

>>> series = pd.Series([20, 21, 12], index=['London',
... 'New York','Helsinki'])
>>> series
London      20
New York    21
Helsinki    12
dtype: int64

>>> def square(x):
...     return x**2
>>> series.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64
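Extra positional arguments for func can be passed through the `args` parameter. A small sketch (the `subtract_custom` function here is illustrative, not from the original post):

```python
import pandas as pd

series = pd.Series([20, 21, 12], index=['London', 'New York', 'Helsinki'])

# The element is passed as the first argument; anything in ``args``
# is appended after it on each call.
def subtract_custom(x, value):
    return x - value

result = series.apply(subtract_custom, args=(5,))
print(result)
# London      15
# New York    16
# Helsinki     7
```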
  • DataFrame
  • DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
  • func: as above
  • axis: the axis along which to aggregate

         Each call to func receives a Series.

         axis = 0: apply iterates over every column of the DataFrame, then combines the results into a single Series and returns it.

         axis = 1: apply iterates over every row of the DataFrame, then combines the results into a single Series and returns it.

>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9

>>> df.apply(np.sum, axis=0)
A    12
B    27
dtype: int64

>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64
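Because each call with axis=1 receives a whole row as a Series, columns can be combined element-wise. A minimal sketch using the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# With axis=1, ``row`` is one row of df; its column values are
# accessible by label, so we can compute a per-row ratio.
ratio = df.apply(lambda row: row['B'] / row['A'], axis=1)
print(ratio)  # 2.25 for every row
```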

2. iloc, loc

  • iloc: index by integer position
  • loc: index by label
  • Series
>>> s2 = pd.Series(['a', 'b', 'c'], index=['one', 'two', 'three'])
>>> s2
one      a
two      b
three    c
dtype: object
# loc indexes by label
>>> s2.loc['one']
'a'
# iloc indexes by integer position
>>> s2.iloc[0]
'a'
# [] also accepts an integer position here (positional fallback)
>>> s2[0]
'a'
  • DataFrame
>>> s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['one', 'two', 'three'])
>>> s3
       a  b
one    1  4
two    2  5
three  3  6
# loc and iloc both index rows
>>> s3.iloc[0]
a    1
b    4
Name: one, dtype: int64
>>> s3.loc['one']
a    1
b    4
Name: one, dtype: int64
# [] indexes columns
>>> s3['a']
one      1
two      2
three    3
Name: a, dtype: int64
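Both indexers also accept a (row, column) pair, and loc supports label slices that include both endpoints. A small sketch on the same DataFrame:

```python
import pandas as pd

s3 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=['one', 'two', 'three'])

# (row, column) selection: loc uses labels, iloc uses positions
print(s3.loc['one', 'b'])   # 4
print(s3.iloc[0, 1])        # 4

# Label slices with loc are inclusive of BOTH endpoints
print(s3.loc['one':'two', 'a'])
# one    1
# two    2
```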

3. merge

  • DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
  • how: the join type ('inner': inner join, 'outer': outer join, 'left': left outer join, 'right': right outer join)
  • on: the key column(s) to join on (defaults to the columns common to both tables)
  • left_on: the column(s) in the left table to join on
  • right_on: the column(s) in the right table to join on
  • left_index: whether to use the left table's index as the join key
  • right_index: whether to use the right table's index as the join key
>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8

>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
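The A and B tables above can be built and merged as a self-contained script; `indicator=True` (mentioned in the signature) additionally labels which table each output row came from:

```python
import pandas as pd

A = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                  'value': [1, 2, 3, 4]})
B = pd.DataFrame({'rkey': ['foo', 'bar', 'qux', 'bar'],
                  'value': [5, 6, 7, 8]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = A.merge(B, left_on='lkey', right_on='rkey',
                 how='outer', indicator=True)
print(merged[['lkey', 'rkey', '_merge']])
```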

4. sort_values, sort_index

  • DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
  • axis: the axis to sort along; axis=0 sorts rows, axis=1 sorts columns
  • by: the key(s) to sort by; column names when axis=0, index labels when axis=1
  • ascending: True for ascending order, False for descending
  • inplace: modify the original table in place
>>> p1 = pd.DataFrame({'col1': [1, 2, 3], 
... 'col2': [4, 5, 6],
... 'col3': [7, 8, 9]})
>>> p1
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9
>>> p1.sort_values(by=0, ascending=False, axis=1)
   col3  col2  col1
0     7     4     1
1     8     5     2
2     9     6     3
>>> p1.sort_values(by='col1', ascending=False, axis=0)
   col1  col2  col3
2     3     6     9
1     2     5     8
0     1     4     7
  • DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
  • axis: the axis whose index to sort; axis=0 sorts the row index, axis=1 sorts the column index
>>> p1
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9
>>> p1.sort_index(axis=0, ascending=False)
   col1  col2  col3
2     3     6     9
1     2     5     8
0     1     4     7
>>> p1.sort_index(axis=1, ascending=False)
   col3  col2  col1
0     7     4     1
1     8     5     2
2     9     6     3
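The `na_position` parameter from the signature controls where missing values land after sorting. A small sketch:

```python
import numpy as np
import pandas as pd

p = pd.DataFrame({'col1': [3.0, np.nan, 1.0]})

# na_position='first' puts NaN rows at the top; the default 'last'
# puts them at the bottom.
out = p.sort_values(by='col1', na_position='first')
print(out)
#    col1
# 1   NaN
# 2   1.0
# 0   3.0
```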

5. idxmax

  • Series
  • Series.idxmax(axis=0, skipna=True, *args, **kwargs)
  • Returns the index label of the maximum value
>>> s = pd.Series(data=[1, None, 4, 3, 4],
...               index=['A', 'B', 'C', 'D', 'E'])
>>> s
A    1.0
B    NaN
C    4.0
D    3.0
E    4.0
dtype: float64
>>> s.idxmax()
'C'
  • DataFrame
  • DataFrame.idxmax(axis=0, skipna=True)
  • Returns a Series of the index labels of the maxima
>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.idxmax()
col1    2
col2    0
col3    1
dtype: int64
>>> p2.idxmax(axis=1)
0    col2
1    col3
2    col1
dtype: object

6. Newly constructed features should be converted to a numpy dtype

result['citable doc per person'] = (result['Citable documents'] / result['Population']).astype(np.float64)
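As a self-contained sketch of the line above (the `result` table here is made-up stand-in data, since the original post does not show it):

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the ``result`` table
result = pd.DataFrame({'Citable documents': [100, 200],
                       'Population': [50, 80]})

# Cast the derived column explicitly so its dtype is a concrete
# numpy type rather than whatever the division inferred.
result['citable doc per person'] = (
    result['Citable documents'] / result['Population']
).astype(np.float64)
print(result['citable doc per person'].dtype)  # float64
```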

7. corr

  • Series
  • Series.corr(other, method='pearson', min_periods=None)
  • Computes the correlation coefficient between two Series
>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.col2.corr(p2.col3)
-0.23281119015753007
  • DataFrame
  • DataFrame.corr(method='pearson', min_periods=1)
  • Computes the pairwise correlation coefficients between the columns of a DataFrame
>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.corr()
          col1      col2      col3
col1  1.000000 -0.933257 -0.132068
col2 -0.933257  1.000000 -0.232811
col3 -0.132068 -0.232811  1.000000
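The `method` parameter also accepts 'spearman' (rank correlation) and 'kendall'. A sketch on the same p2 data:

```python
import pandas as pd

p2 = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [6, 1, 0],
                   'col3': [2, 8, 1]})

# Spearman correlates the RANKS of the values; col1 is strictly
# increasing while col2 is strictly decreasing, so the result is -1.
rho = p2.col1.corr(p2.col2, method='spearman')
print(rho)  # -1.0
```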

8. map

  • Series
  • Takes a function or a dict-like object as its argument and returns the values mapped through that function or dict

>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> mapp1 = {1: 10, 2: 12}
>>> p2.col1.map(mapp1)
0    10.0
1    12.0
2     NaN
Name: col1, dtype: float64

>>> p2
   col1  col2  col3
0     1     6     2
1     2     1     8
2     3     0     1
>>> p2.col2.map(lambda x: x+1)
0    7
1    2
2    1
Name: col2, dtype: int64

9. agg

  • DataFrameGroupBy.agg(arg, *args, **kwargs)
  • func: string function name / function / list of functions / dict of column names -> functions (or list of functions)
>>> df
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860
>>> df.groupby('A').agg(['min', 'max'])
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767
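The dict form mentioned in the func description maps each column to its own aggregation. A self-contained sketch (the C values are rounded stand-ins for the random floats above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [0.36, 0.23, 1.27, -0.56]})

# Per-column aggregations: sum B, but take the minimum of C
out = df.groupby('A').agg({'B': 'sum', 'C': 'min'})
print(out)
#    B     C
# A
# 1  3  0.23
# 2  7 -0.56
```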

10. cut, qcut

  • cut
  • pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
  • Converts continuous values into discrete bins of equal width
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
  • qcut
  • pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
  • Converts values into discrete bins of equal frequency (quantile-based)
>>> pd.qcut(range(5), 4)
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
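The `labels` parameter replaces the interval notation with readable category names. A sketch on the same data as the cut example:

```python
import numpy as np
import pandas as pd

# Same values as above, but each bin gets a name instead of an interval
binned = pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
                labels=['low', 'mid', 'high'])
print(list(binned))
# ['low', 'high', 'mid', 'mid', 'high', 'low']
```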

Reposted from blog.csdn.net/Ahead_J/article/details/82911664