Pandas Learning 2

Working on Text Data


  1. Introduction
  2. Concat, Split & Join
  3. Contains, Find
  4. Cleaning Punctuation
  5. Checking Contents of String Data
  6. String Manipulation

Introduction

  • Columns often contain string data.
  • String data is not always in the best of hygiene.
  • Pandas provides a rich set of .str methods to make these tasks easier.
import pandas as pd
df = pd.DataFrame({'A':['a','b','c','d']})
df
A
0 a
1 b
2 c
3 d

Concat, Split & Join

df.A.str.cat(sep=' ')
'a b c d'
df.A.str.cat(['1','2','3','4'])
0    a1
1    b2
2    c3
3    d4
Name: A, dtype: object
horror_data = pd.read_csv('../Data/horror-train.csv')
horror_data
id text author
0 id26305 This process, however, afforded me no means of... EAP
1 id17569 It never once occurred to me that the fumbling... HPL
2 id11008 In his left hand was a gold snuff box, from wh... EAP
3 id27763 How lovely is spring As we looked from Windsor... MWS
4 id12958 Finding nothing else, not even gold, the Super... HPL
... ... ... ...
19574 id17718 I could have fancied, while I looked at it, th... EAP
19575 id08973 The lids clenched themselves together as if in... EAP
19576 id05267 Mais il faut agir that is to say, a Frenchman ... EAP
19577 id17513 For an item of news like this, it strikes us i... EAP
19578 id00393 He laid a gnarled claw on my shoulder, and it ... HPL

19579 rows × 3 columns

horror_data = horror_data.iloc[:10]
horror_data
id text author
0 id26305 This process, however, afforded me no means of... EAP
1 id17569 It never once occurred to me that the fumbling... HPL
2 id11008 In his left hand was a gold snuff box, from wh... EAP
3 id27763 How lovely is spring As we looked from Windsor... MWS
4 id12958 Finding nothing else, not even gold, the Super... HPL
5 id22965 A youth passed in solitude, my best years spen... MWS
6 id09674 The astronomer, perhaps, at this point, took r... EAP
7 id13515 The surcingle hung in ribands from my body. EAP
8 id19322 I knew that you could not say to yourself 'ste... EAP
9 id00912 I confess that neither the structure of langua... MWS
horror_data.text.str.split()
0    [This, process,, however,, afforded, me, no, m...
1    [It, never, once, occurred, to, me, that, the,...
2    [In, his, left, hand, was, a, gold, snuff, box...
3    [How, lovely, is, spring, As, we, looked, from...
4    [Finding, nothing, else,, not, even, gold,, th...
5    [A, youth, passed, in, solitude,, my, best, ye...
6    [The, astronomer,, perhaps,, at, this, point,,...
7    [The, surcingle, hung, in, ribands, from, my, ...
8    [I, knew, that, you, could, not, say, to, your...
9    [I, confess, that, neither, the, structure, of...
Name: text, dtype: object
horror_data.text.str.split(expand=True,n=20)
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 This process, however, afforded me no means of ascertaining the ... of my dungeon; as I might make its circuit, and return to the point whence I set out, with...
1 It never once occurred to me that the fumbling might ... a mere mistake. None None None None None None None
2 In his left hand was a gold snuff box, from ... as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly ...
3 How lovely is spring As we looked from Windsor Terrace ... the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in...
4 Finding nothing else, not even gold, the Superintendent abandoned his ... but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.
5 A youth passed in solitude, my best years spent under ... gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an inte...
6 The astronomer, perhaps, at this point, took refuge in the ... of non luminosity; and here analogy was suddenly let fall.
7 The surcingle hung in ribands from my body. None None ... None None None None None None None None None None
8 I knew that you could not say to yourself 'stereotomy' ... being brought to think of atomies, and thus of the theories of Epicurus; and since, when we d...
9 I confess that neither the structure of languages, nor the ... of governments, nor the politics of various states possessed attractions for me.

10 rows × 21 columns

res = horror_data.text.str.split()
res.str.join(sep=' ')
0    This process, however, afforded me no means of...
1    It never once occurred to me that the fumbling...
2    In his left hand was a gold snuff box, from wh...
3    How lovely is spring As we looked from Windsor...
4    Finding nothing else, not even gold, the Super...
5    A youth passed in solitude, my best years spen...
6    The astronomer, perhaps, at this point, took r...
7          The surcingle hung in ribands from my body.
8    I knew that you could not say to yourself 'ste...
9    I confess that neither the structure of langua...
Name: text, dtype: object

Understanding contains, find & index

horror_data.text.str.contains('This')
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: text, dtype: bool
horror_data.text.str.contains('his')
0     True
1    False
2     True
3    False
4     True
5     True
6     True
7    False
8     True
9    False
Name: text, dtype: bool
horror_data.text.str.find('This')
0    0
1   -1
2   -1
3   -1
4   -1
5   -1
6   -1
7   -1
8   -1
9   -1
Name: text, dtype: int64
horror_data[:5].text.str.index('is')
0     2
1    64
2     4
3    11
4    67
Name: text, dtype: int64
df = pd.DataFrame({'Name':['Abc','Def','Jkl'], 'Email':['awidgmail.com', '[email protected]', '[email protected]']})
df
Name Email
0 Abc awidgmail.com
1 Def [email protected]
2 Jkl [email protected]
df.Email.str.contains(r'\w+@\w+')
0    False
1     True
2     True
Name: Email, dtype: bool
df.Email.str.replace(pat='@',repl='&')
0     awidgmail.com
1     def&gmail.com
2     jkl&yahoo.com
Name: Email, dtype: object

Cleaning Punctuation

import string
table = str.maketrans('', '', string.punctuation)
horror_data.text.str.translate(table)
0    This process however afforded me no means of a...
1    It never once occurred to me that the fumbling...
2    In his left hand was a gold snuff box from whi...
3    How lovely is spring As we looked from Windsor...
4    Finding nothing else not even gold the Superin...
5    A youth passed in solitude my best years spent...
6    The astronomer perhaps at this point took refu...
7           The surcingle hung in ribands from my body
8    I knew that you could not say to yourself ster...
9    I confess that neither the structure of langua...
Name: text, dtype: object
horror_data_tf = horror_data.text.str.translate(table)
horror_data_tf = horror_data_tf.str.lower()

Checking String Contents

  • isalnum() Equivalent to str.isalnum
  • isalpha() Equivalent to str.isalpha
  • isdigit() Equivalent to str.isdigit
  • isspace() Equivalent to str.isspace
  • islower() Equivalent to str.islower
  • isupper() Equivalent to str.isupper
  • istitle() Equivalent to str.istitle
  • isnumeric() Equivalent to str.isnumeric
  • isdecimal() Equivalent to str.isdecimal
df = pd.DataFrame({'A':['1234','123ab','abcde']})
df
A
0 1234
1 123ab
2 abcde
  • Return only the rows whose values consist entirely of digits
df[df.A.str.isdigit()]
A
0 1234
  • Return only the rows whose values consist entirely of letters
df[df.A.str.isalpha()]
A
2 abcde
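The three numeric checks above differ in how permissive they are. A small sketch for comparison (the Series values here are made up):

```python
import pandas as pd

# '123' plain digits, '1.5' has a dot, '²' superscript two, '½' vulgar fraction
s = pd.Series(['123', '1.5', '²', '½'])

decimal = s.str.isdecimal()  # strictest: only the characters 0-9
digit = s.str.isdigit()      # also accepts superscripts such as '²'
numeric = s.str.isnumeric()  # most permissive: also fractions such as '½'
```

So every isdecimal string is also isdigit, and every isdigit string is also isnumeric, but not the other way around.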

String Manipulation

  • slice() Slice each string in the Series
  • slice_replace() Replace slice in each string with passed value
  • count() Count occurrences of pattern
  • startswith() Equivalent to str.startswith(pat) for each element
  • endswith() Equivalent to str.endswith(pat) for each element
  • findall() Compute list of all occurrences of pattern/regex for each string
  • match() Call re.match on each element, returning matched groups as list
  • extract() Call re.search on each element, returning DataFrame with one row for each element and one column for each regex
  • extractall() Like extract(), but returns one row per match, indexed by element and match number
df = pd.DataFrame({'Name':['Rush','Riba','Kunal','Pruthvi'],
                   'Email':['[email protected]','[email protected]','[email protected]','[email protected]']})
df
Name Email
0 Rush [email protected]
1 Riba [email protected]
2 Kunal [email protected]
3 Pruthvi [email protected]
df['Username'] = df.Email.str.slice(start = 0, step=2, stop=-11)
df
Name Email Username
0 Rush [email protected] rs
1 Riba [email protected] rb
2 Kunal [email protected] knl
3 Pruthvi [email protected] puhi
df['UpdatedEmail'] = df.Email.str.slice_replace(start=-11, repl='@zekelabs.com')
df
Name Email Username UpdatedEmail
0 Rush [email protected] rs [email protected]
1 Riba [email protected] rb [email protected]
2 Kunal [email protected] knl [email protected]
3 Pruthvi [email protected] puhi [email protected]
  • Altering value
df.at[2,'Email'] = '[email protected]'
df
Name Email Username UpdatedEmail
0 Rush [email protected] rs [email protected]
1 Riba [email protected] rb [email protected]
2 Kunal [email protected] knl [email protected]
3 Pruthvi [email protected] puhi [email protected]
help(pd.DataFrame.at)
Help on property:

    Access a single value for a row/column label pair.
    
    Similar to ``loc``, in that both provide label-based lookups. Use
    ``at`` if you only need to get or set a single value in a DataFrame
    or Series.
    
    Raises
    ------
    KeyError
        When label does not exist in DataFrame
    
    See Also
    --------
    DataFrame.iat : Access a single value for a row/column pair by integer
        position.
    DataFrame.loc : Access a group of rows and columns by label(s).
    Series.at : Access a single value using a label.
    
    Examples
    --------
    >>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
    ...                   index=[4, 5, 6], columns=['A', 'B', 'C'])
    >>> df
        A   B   C
    4   0   2   3
    5   0   4   1
    6  10  20  30
    
    Get value at specified row/column pair
    
    >>> df.at[4, 'B']
    2
    
    Set value at specified row/column pair
    
    >>> df.at[4, 'B'] = 10
    >>> df.at[4, 'B']
    10
    
    Get value within a Series
    
    >>> df.loc[5].at['B']
    4
  • Filtering based on domain name
df.Email.str.match(r'[\w]*@edyoda.com')
0     True
1     True
2    False
3     True
Name: Email, dtype: bool
  • Extract text based on certain pattern
df.Email.str.extract(r'([\w]*)@([\w.]*)')
0 1
0 rush edyoda.com
1 riba edyoda.com
2 kunal everywhere.com
3 pruthvi edyoda.com
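extract() returns only the first match per element; extractall(), listed above but not demonstrated, returns every match. A minimal sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series(['a1a2', 'b1', 'c0c3'])

# One row per match; the result is indexed by (element index, match number),
# with one column per capture group
matches = s.str.extractall(r'([a-z])(\d)')
```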

Working on Missing Data


  1. Detect missing & existing values.
  2. Return a new Series with missing values removed.
  3. Fill NA/NaN values using the specified method.
  4. Interpolate values according to different methods.

1. Detect missing & existing values

  1. None and numpy.NaN are considered missing values
  2. An empty string is still considered a non-null value
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[1,2,None], 'B':[2, np.NaN, 3]})
df
A B
0 1.0 2.0
1 2.0 NaN
2 NaN 3.0
df.isna()
A B
0 False False
1 False True
2 True False
df.isna().sum()
A    1
B    1
dtype: int64
df.isna().any()
A    True
B    True
dtype: bool
df.isna().sum().sum()
2
  • isnull is an alias of isna
df.isnull()
A B
0 False False
1 False True
2 True False
df = pd.DataFrame({'A':[1,'',None], 'B':[2, np.NaN, 3]})
df
A B
0 1 2.0
1 NaN
2 None 3.0
df.isna()
A B
0 False False
1 False True
2 True False
df.isnull()  # same result as isna() above
A B
0 False False
1 False True
2 True False
  • Handling Empty Strings
df.replace('', np.NaN)  # frequently used to normalize empty strings
A B
0 1.0 2.0
1 NaN NaN
2 NaN 3.0
  • Finding non-null values
df.notna()
A B
0 True True
1 True False
2 False True
  • Filtering data based on series
df[df.A.notna()]
A B
0 1 2.0
1 NaN
df.A.notna()
0     True
1     True
2    False
Name: A, dtype: bool

Return a new Series with missing values removed.

  • Dropping rows which have any missing values
df = pd.DataFrame({'A':[1,'',None], 'B':[2, np.NaN, 3], 'C':[3,4,5]})
df
A B C
0 1 2.0 3
1 NaN 4
2 None 3.0 5
df.dropna()
A B C
0 1 2.0 3
  • Dropping columns which have null values
df.dropna(axis=1)
C
0 3
1 4
2 5

Filling missing values

df.fillna(0)
A B C
0 1 2.0 3
1 0.0 4
2 0 3.0 5
df.fillna({'A':10,'B':11})
#df.fillna({'A':df['A'].mean(),'B':df['B'].mode()})
A B C
0 1 2.0 3
1 11.0 4
2 10 3.0 5
  • Values can be backward fill, forward fill
df.fillna(method='bfill')
A B C
0 1 2.0 3
1 3.0 4
2 None 3.0 5
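Forward and backward fill can also be called directly. A small sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

forward = s.ffill()   # propagate the last valid value forward
backward = s.bfill()  # pull the next valid value backward
```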

Interpolate missing values based on different methods

df = pd.DataFrame({'Name':['Rush','Riba','Kunal','Pruthvi'],
                   'Email':['[email protected]','[email protected]','[email protected]','[email protected]'],
                   'Age':[33,31,None,18]})
df
Name Email Age
0 Rush [email protected] 33.0
1 Riba [email protected] 31.0
2 Kunal [email protected] NaN
3 Pruthvi [email protected] 18.0
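The section sets up a missing Age but never calls interpolate(); a minimal sketch, assuming the default 'linear' method:

```python
import pandas as pd

age = pd.Series([33.0, 31.0, None, 18.0])

# Linear interpolation fills the gap with the midpoint of 31 and 18
filled = age.interpolate()
```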

Styling Pandas Table


  1. applymap for applying a function to the entire table
  2. apply for applying a function column-wise
  3. Highlighting null values
  4. Applying colors to a subset of data

import pandas as pd
import numpy as np
df = pd.read_csv('SalesTrainingDataset.csv')
df = df.iloc[:100,list(range(10))]
df

Applymap

  • Apply styling to the complete table
  • The function returns a CSS string (e.g. 'color: red')
def color_product_sales(val):
    if val > 10000:
        color = 'green'
    elif val < 2000:
        color = 'red'
    else:
        color = 'black'
    return 'color: %s' % color

df.style.applymap(color_product_sales)

Apply

  1. For a series-wise check, apply can be used.
  2. The function argument will be a Series.
  3. By default, it is applied to columns.
  4. With axis=1, it is applied to rows.
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

df.style.apply(highlight_max)
  • Chaining is also supported
df.style.applymap(color_product_sales).apply(highlight_max)

Highlighting Null Values

df.style.highlight_null(null_color='red')

Dealing with subset of Data

  • Selecting subset of columns
df.style.apply(highlight_max, subset=['Outcome_M8','Outcome_M9'])
  • Selecting subset of rows
df.style.apply(highlight_max, subset=pd.IndexSlice[2:14,:], axis=1)
df.style.apply(highlight_max, subset=pd.IndexSlice[2:14, ['Outcome_M6','Outcome_M7','Outcome_M8','Outcome_M9']], axis=1)

Pandas for Computation


  1. Percent change
  2. Covariance
  3. Correlation
  4. Data Ranking
  5. Window Functions
  6. Time aware rolling
  7. Rolling vs Expanding

Statistical Functions

  1. Percent Change - Series and DataFrame have a method pct_change() to compute the percent change over a given number of periods
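Before running it on random sales data, the formula — (current − previous) / previous — can be checked on a tiny made-up Series:

```python
import pandas as pd

s = pd.Series([100, 110, 121])

# The first element has no predecessor, so its change is NaN;
# 100 -> 110 and 110 -> 121 are both +10%
change = s.pct_change()
```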
import pandas as pd
import numpy as np

sales_data = pd.DataFrame(data=np.random.randint(1,100,(10,4)), 
                          columns=['Tea','Milk','Carpet','Cream'], 
                          index=pd.Series(pd.period_range('1/1/2011', freq='M', periods=10)))
sales_data
Tea Milk Carpet Cream
2011-01 58 81 8 49
2011-02 63 29 39 70
2011-03 17 16 17 81
2011-04 26 9 23 53
2011-05 95 97 82 45
2011-06 52 12 72 99
2011-07 66 93 84 15
2011-08 92 76 75 96
2011-09 29 19 46 35
2011-10 83 94 2 4
  • Changes in monthly sales data
sales_data.pct_change(periods=1).round(4)*100
Tea Milk Carpet Cream
2011-01 NaN NaN NaN NaN
2011-02 8.62 -64.20 387.50 42.86
2011-03 -73.02 -44.83 -56.41 15.71
2011-04 52.94 -43.75 35.29 -34.57
2011-05 265.38 977.78 256.52 -15.09
2011-06 -45.26 -87.63 -12.20 120.00
2011-07 26.92 675.00 16.67 -84.85
2011-08 39.39 -18.28 -10.71 540.00
2011-09 -68.48 -75.00 -38.67 -63.54
2011-10 186.21 394.74 -95.65 -88.57

Covariance & Correlation

Covariance measures how much two random variables vary together; cov() calculates the covariance between series.

A correlation coefficient puts a number on the strength of the relationship. Correlation coefficients range between -1 and 1: 0 means no linear relationship between the variables at all, while -1 or 1 means a perfect negative or positive correlation.
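Since the notebook's own frame is random, the relationship is easier to see on a deterministic, made-up frame (df2 is hypothetical):

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 4, 6, 8]})  # B = 2 * A

cov_ab = df2.cov().loc['A', 'B']    # sample covariance: 2 * Var(A) = 10/3
corr_ab = df2.corr().loc['A', 'B']  # B is a perfect linear function of A, so 1.0
```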

df = pd.DataFrame(np.random.randint(10,20,(10,2)), columns=['A','B'])
df
A B
0 13 13
1 19 18
2 17 10
3 16 15
4 12 10
5 13 10
6 18 17
7 12 11
8 14 11
9 15 17

df.cov()

df.corr()
A B
A 1.0000 0.6942
B 0.6942 1.0000
#help(df.corr)
'''
Help on method corr in module pandas.core.frame:

corr(method='pearson', min_periods=1) method of pandas.core.frame.DataFrame instance
    Compute pairwise correlation of columns, excluding NA/null values.

    Parameters
    ----------
    method : {'pearson', 'kendall', 'spearman'} or callable
        * pearson : standard correlation coefficient
        * kendall : Kendall Tau correlation coefficient
        * spearman : Spearman rank correlation
        * callable: callable with input two 1d ndarrays and returning a float.
          Note that the returned matrix from corr will have 1 along the
          diagonal and will be symmetric regardless of the callable's
          behavior.

    min_periods : int, optional
        Minimum number of observations required per pair of columns
        to have a valid result. Currently only available for Pearson
        and Spearman correlation.
'''
  • The rank method produces a data ranking, with ties being assigned the mean of the ranks (by default) for the group:
df
df['A_Rank'] = df.A.rank()
df
A B A_Rank
0 13 13 3.5
1 19 18 10.0
2 17 10 8.0
3 16 15 7.0
4 12 10 1.5
5 13 10 3.5
6 18 17 9.0
7 12 11 1.5
8 14 11 5.0
9 15 17 6.0
#help(df.A.rank)
'''
Help on method rank in module pandas.core.generic:

rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False) method of pandas.core.series.Series instance
    Compute numerical data ranks (1 through n) along axis.

    By default, equal values are assigned a rank that is the average of the
    ranks of those values.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Index to direct ranking.
    method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
        How to rank the group of records that have the same value (i.e. ties):

        * average: average rank of the group
        * min: lowest rank in the group
        * max: highest rank in the group
        * first: ranks assigned in order they appear in the array
        * dense: like 'min', but rank always increases by 1 between groups.
    numeric_only : bool, optional
        For DataFrame objects, rank only numeric columns if set to True.
    na_option : {'keep', 'top', 'bottom'}, default 'keep'
        How to rank NaN values:

        * keep: assign NaN rank to NaN values
        * top: assign smallest rank to NaN values if ascending
        * bottom: assign highest rank to NaN values if ascending.
    ascending : bool, default True
        Whether or not the elements should be ranked in ascending order.
    pct : bool, default False
        Whether or not to display the returned rankings in percentile form.
'''

Window Functions

  1. For working with data, a number of window functions are provided for computing common window or rolling statistics.
  2. Among these are count, sum, mean, median, correlation, variance, covariance, standard deviation, skewness, and kurtosis.
sales_data = pd.read_csv('../Data/sales-data.csv')
sales_data
Month Sales
0 1-01 266.0
1 1-02 145.9
2 1-03 183.1
3 1-04 119.3
4 1-05 180.3
5 1-06 168.5
6 1-07 231.8
7 1-08 224.5
8 1-09 192.8
9 1-10 122.9
10 1-11 336.5
11 1-12 185.9
12 2-01 194.3
13 2-02 149.5
14 2-03 210.1
15 2-04 273.3
16 2-05 191.4
17 2-06 287.0
18 2-07 226.0
19 2-08 303.6
20 2-09 289.9
21 2-10 421.6
22 2-11 264.5
23 2-12 342.3
24 3-01 339.7
25 3-02 440.4
26 3-03 315.9
27 3-04 439.3
28 3-05 401.3
29 3-06 437.4
30 3-07 575.5
31 3-08 407.6
32 3-09 682.0
33 3-10 475.3
34 3-11 581.3
35 3-12 646.9
r = sales_data.Sales.rolling(window=5)
r.count()
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     5.0
6     5.0
7     5.0
8     5.0
9     5.0
10    5.0
11    5.0
12    5.0
13    5.0
14    5.0
15    5.0
16    5.0
17    5.0
18    5.0
19    5.0
20    5.0
21    5.0
22    5.0
23    5.0
24    5.0
25    5.0
26    5.0
27    5.0
28    5.0
29    5.0
30    5.0
31    5.0
32    5.0
33    5.0
34    5.0
35    5.0
Name: Sales, dtype: float64
r.max()
0       NaN
1       NaN
2       NaN
3       NaN
4     266.0
5     183.1
6     231.8
7     231.8
8     231.8
9     231.8
10    336.5
11    336.5
12    336.5
13    336.5
14    336.5
15    273.3
16    273.3
17    287.0
18    287.0
19    303.6
20    303.6
21    421.6
22    421.6
23    421.6
24    421.6
25    440.4
26    440.4
27    440.4
28    440.4
29    440.4
30    575.5
31    575.5
32    682.0
33    682.0
34    682.0
35    682.0
Name: Sales, dtype: float64

Time aware rolling

dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                    index=pd.date_range('20130101 09:00:00',
                                        periods=5,
                                        freq='s'))
dft
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 2.0
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 4.0
dft.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 4.0
r.agg(np.sum)
0        NaN
1        NaN
2        NaN
3        NaN
4      894.6
5      797.1
6      883.0
7      924.4
8      997.9
9      940.5
10    1108.5
11    1062.6
12    1032.4
13     989.1
14    1076.3
15    1013.1
16    1018.6
17    1111.3
18    1187.8
19    1281.3
20    1297.9
21    1528.1
22    1505.6
23    1621.9
24    1658.0
25    1808.5
26    1702.8
27    1877.6
28    1936.6
29    2034.3
30    2169.4
31    2261.1
32    2503.8
33    2577.8
34    2721.7
35    2793.1
Name: Sales, dtype: float64
r.agg([np.sum, np.mean])
sum mean
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 894.6 178.92
5 797.1 159.42
6 883.0 176.60
7 924.4 184.88
8 997.9 199.58
9 940.5 188.10
10 1108.5 221.70
11 1062.6 212.52
12 1032.4 206.48
13 989.1 197.82
14 1076.3 215.26
15 1013.1 202.62
16 1018.6 203.72
17 1111.3 222.26
18 1187.8 237.56
19 1281.3 256.26
20 1297.9 259.58
21 1528.1 305.62
22 1505.6 301.12
23 1621.9 324.38
24 1658.0 331.60
25 1808.5 361.70
26 1702.8 340.56
27 1877.6 375.52
28 1936.6 387.32
29 2034.3 406.86
30 2169.4 433.88
31 2261.1 452.22
32 2503.8 500.76
33 2577.8 515.56
34 2721.7 544.34
35 2793.1 558.62

Rolling vs Expanding

data = pd.DataFrame([
    ['a', 1],
    ['a', 2],
    ['a', 3],
    ['b', 5],
    ['b', 6],
    ['b', 7],
    ['b', 8],
    ['c', 10],
    ['c', 11],
    ['c', 12],
    ['c', 13]
], columns = ['category', 'value'])
data
category value
0 a 1
1 a 2
2 a 3
3 b 5
4 b 6
5 b 7
6 b 8
7 c 10
8 c 11
9 c 12
10 c 13
data.value.expanding(1).sum()
0      1.0
1      3.0
2      6.0
3     11.0
4     17.0
5     24.0
6     32.0
7     42.0
8     53.0
9     65.0
10    78.0
Name: value, dtype: float64
data.value.rolling(2).sum()
0      NaN
1      3.0
2      5.0
3      8.0
4     11.0
5     13.0
6     15.0
7     18.0
8     21.0
9     23.0
10    25.0
Name: value, dtype: float64
  1. Expanding - If we use an expanding window with initial size 1, the window in the first step contains only the first row. In the second step, it contains both the first and the second row. In every step, one additional row is added to the window, and the aggregating function is recalculated.

  2. Rolling - Rolling windows are different. Here we specify the size of a window that moves. What happens when we set the rolling window size to 2?

    • In the first step, the window contains the first row and one undefined row, so the result is NaN.

    • In the second step, the window moves and now contains the first and the second row, so the aggregate function can be calculated. In this example, it is the sum of both rows.

    • In the third step, the window moves again and no longer contains the first row; it now calculates the sum of the second and the third row.

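The sample data above carries a category column that the expanding example does not use; one common pattern (a sketch, not from the original) is to restart the expanding window within each group:

```python
import pandas as pd

data = pd.DataFrame({'category': ['a', 'a', 'b', 'b', 'b'],
                     'value': [1, 2, 5, 6, 7]})

# Expanding sum computed independently within each category,
# aligned back to the original row order via transform
per_group = data.groupby('category')['value'].transform(lambda s: s.expanding().sum())
```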

Data Transformation using Map, Apply & GroupBy


  1. Transforming Series using Map
  2. Transforming across multiple Series using apply
  3. GroupBy - Splitting, Applying & Combine

import pandas as pd
import numpy as np
hr_data = pd.read_csv('../Data/HR_comma_sep.csv.txt')
hr_data.rename(columns={'sales':'department'}, inplace=True)

Transforming Series using Map

hr_data.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
  • Use map to transform the left column into categorical labels
hr_data['left_categorical'] = hr_data.left.map({1:'True',0:'False'})
hr_data.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary left_categorical
0 0.38 0.53 2 157 3 0 1 0 sales low True
1 0.80 0.86 5 262 6 0 1 0 sales medium True
2 0.11 0.88 7 272 4 0 1 0 sales medium True
3 0.72 0.87 5 223 5 0 1 0 sales low True
4 0.37 0.52 2 159 3 0 1 0 sales low True

Transforming data across multiple Series

  1. If satisfaction_level > .9, increase number_project by 1
  2. map cannot operate across multiple columns; we need apply for that
def increase_proj(r):
    if r.satisfaction_level > .9:
        return r.number_project + 1
    else:
        return r.number_project

hr_data['new_number_project'] = hr_data.apply(increase_proj, axis=1)
  • Filtering all the rows for which this happened
hr_data[hr_data.number_project != hr_data.new_number_project].head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary left_categorical new_number_project
7 0.92 0.85 5 259 5 0 1 0 sales low True 6
106 0.91 1.00 4 257 5 0 1 0 accounting medium True 5
191 0.92 0.87 4 226 6 1 1 0 technical medium True 5
231 0.92 0.99 5 255 6 0 1 0 sales low True 6
352 0.91 0.91 4 262 6 0 1 0 support low True 5

GroupBy

grouped = hr_data.groupby(['department'])
  • Compute first & last of group values
grouped.first()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary left_categorical new_number_project
department
IT 0.11 0.93 7 308 4 0 1 0 medium True 7
RandD 0.12 1.00 3 278 4 0 1 0 medium True 3
accounting 0.41 0.46 2 128 3 0 1 0 low True 2
hr 0.45 0.57 2 134 3 0 1 0 low True 2
management 0.85 0.91 5 226 5 0 1 0 medium True 5
marketing 0.40 0.54 2 137 3 0 1 0 medium True 2
product_mng 0.43 0.54 2 153 3 0 1 0 medium True 2
sales 0.38 0.53 2 157 3 0 1 0 low True 2
support 0.40 0.55 2 147 3 0 1 0 low True 2
technical 0.10 0.94 6 255 4 0 1 0 low True 6
grouped.last()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary left_categorical new_number_project
department
IT 0.90 0.92 4 271 5 0 1 0 medium True 4
RandD 0.81 0.92 5 239 5 0 1 0 medium True 5
accounting 0.36 0.54 2 153 3 0 1 0 medium True 2
hr 0.40 0.47 2 144 3 0 1 0 medium True 2
management 0.42 0.57 2 147 3 1 1 0 low True 2
marketing 0.44 0.52 2 149 3 0 1 0 low True 2
product_mng 0.46 0.55 2 147 3 0 1 0 medium True 2
sales 0.39 0.45 2 140 3 0 1 0 medium True 2
support 0.37 0.52 2 158 3 0 1 0 low True 2
technical 0.43 0.57 2 159 3 1 1 0 low True 2
grouped.nth(2)
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary left_categorical new_number_project
department
IT 0.36 0.56 2 132 3 0 1 0 medium True 2
RandD 0.37 0.55 2 127 3 0 1 0 medium True 2
accounting 0.09 0.62 6 294 4 0 1 0 low True 6
hr 0.45 0.55 2 140 3 0 1 0 low True 2
management 0.42 0.48 2 129 3 0 1 0 low True 2
marketing 0.11 0.77 6 291 4 0 1 0 low True 6
product_mng 0.76 0.86 5 223 5 1 1 0 medium True 5
sales 0.11 0.88 7 272 4 0 1 0 medium True 7
support 0.40 0.54 2 148 3 0 1 0 low True 2
technical 0.45 0.50 2 126 3 0 1 0 low True 2
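first(), last() and nth() pick representative rows; the "combine" step of split-apply-combine is usually an aggregation. A sketch on a tiny made-up frame (toy is hypothetical, in case the HR csv is not at hand):

```python
import pandas as pd

toy = pd.DataFrame({'department': ['sales', 'sales', 'hr', 'hr'],
                    'hours': [150, 170, 130, 150]})

# Split by department, apply mean, combine into one value per group
dept_mean = toy.groupby('department')['hours'].mean()
```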
grouped.groups
{'IT': Int64Index([   61,    62,    63,    64,    65,    70,   138,   139,   140,
               141,
             ...
             14808, 14809, 14810, 14815, 14929, 14930, 14931, 14932, 14933,
             14938],
            dtype='int64', length=1227),
 'RandD': Int64Index([  301,   302,   303,   304,   305,   453,   454,   455,   456,
               457,
             ...
             14816, 14817, 14818, 14819, 14820, 14939, 14940, 14941, 14942,
             14943],
            dtype='int64', length=787),
 'accounting': Int64Index([   28,    29,    30,    79,   105,   106,   107,   155,   181,
               182,
             ...
             14849, 14850, 14851, 14896, 14897, 14898, 14946, 14972, 14973,
             14974],
            dtype='int64', length=767),
 'hr': Int64Index([   31,    32,    33,    34,   108,   109,   110,   111,   184,
               185,
             ...
             14854, 14855, 14899, 14900, 14901, 14902, 14975, 14976, 14977,
             14978],
            dtype='int64', length=739),
 'management': Int64Index([   60,    82,   137,   158,   213,   235,   290,   311,   366,
               387,
             ...
             14598, 14653, 14674, 14729, 14750, 14805, 14826, 14873, 14928,
             14949],
            dtype='int64', length=630),
 'marketing': Int64Index([   77,    83,    84,    85,   148,   149,   150,   151,   152,
               153,
             ...
             14827, 14828, 14829, 14874, 14875, 14876, 14944, 14950, 14951,
             14952],
            dtype='int64', length=858),
 'product_mng': Int64Index([   66,    67,    68,    69,    71,    72,    73,    74,    75,
                76,
             ...
             14737, 14738, 14811, 14812, 14813, 14814, 14934, 14935, 14936,
             14937],
            dtype='int64', length=902),
 'sales': Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                 9,
             ...
             14962, 14963, 14964, 14965, 14966, 14967, 14968, 14969, 14970,
             14971],
            dtype='int64', length=4140),
 'support': Int64Index([   46,    47,    48,    49,    50,    51,    52,    53,    54,
                55,
             ...
             14947, 14990, 14991, 14992, 14993, 14994, 14995, 14996, 14997,
             14998],
            dtype='int64', length=2229),
 'technical': Int64Index([   35,    36,    37,    38,    39,    40,    41,    42,    43,
                44,
             ...
             14980, 14981, 14982, 14983, 14984, 14985, 14986, 14987, 14988,
             14989],
            dtype='int64', length=2720)}
hr_data.groupby(['department','salary']).groups
{('IT',
  'high'): Int64Index([ 1281,  1359,  1437,  1515,  3192,  3193,  3194,  3195,  3200,
              3270,  3504,  3799,  3802,  4260,  4264,  4269,  4720,  5097,
              5098,  5557,  5558,  5559,  5560,  5561,  5634,  5712,  5790,
              5865,  5943,  6024,  6093,  6547,  6550,  7009,  7011,  7012,
              7087,  7474,  7544,  7845,  7846,  7847,  7848,  7849,  7998,
              8076,  8154,  8229,  8307,  8308,  8309,  8314,  8385,  8772,
              8917,  8919,  9375,  9835, 10213, 10593, 10671, 10672, 10673,
             10678, 10747, 10749, 10980, 11268, 11601, 11700, 11707, 12804,
             12882, 12883, 12884, 12889, 12958, 12960, 13191, 13479, 13812,
             13911, 13918],
            dtype='int64'),
 ('IT',
  'low'): Int64Index([  138,   139,   140,   141,   142,   147,   214,   215,   216,
               217,
             ...
             14731, 14732, 14733, 14734, 14806, 14807, 14808, 14809, 14810,
             14815],
            dtype='int64', length=609),
 ('IT',
  'medium'): Int64Index([   61,    62,    63,    64,    65,    70,   294,   295,   300,
               376,
             ...
             14511, 14587, 14663, 14739, 14929, 14930, 14931, 14932, 14933,
             14938],
            dtype='int64', length=535),
 ('RandD',
  'high'): Int64Index([ 1827,  1905,  1983,  3201,  3202,  3203,  3657,  3658,  3659,
              3660,  3661,  3735,  3813,  4044,  4729,  5185,  6025,  6026,
              6027,  6028,  6177,  6255,  6330,  6408,  6558,  7018,  7476,
              7477,  7552,  8009,  8315,  8316,  8317,  8318,  8773,  8774,
              8775,  8776,  8777,  8850,  8928,  9382,  9384, 10300, 11659,
             11661, 11662, 13870, 13872, 13873, 14941],
            dtype='int64'),
 ('RandD',
  'low'): Int64Index([  605,   833,   834,   835,   836,   837,   985,   986,   987,
               988,
             ...
             13213, 13214, 13215, 13216, 13217, 13871, 13874, 13875, 14816,
             14942],
            dtype='int64', length=364),
 ('RandD',
  'medium'): Int64Index([  301,   302,   303,   304,   305,   453,   454,   455,   456,
               457,
             ...
             14666, 14667, 14668, 14817, 14818, 14819, 14820, 14939, 14940,
             14943],
            dtype='int64', length=372),
 ('accounting',
  'high'): Int64Index([  384,  1632,  1710,  3234,  3235,  3236,  3387,  3540,  3664,
              3768,  4124,  4228,  4684,  4686,  4734,  4762,  5190,  5524,
              5525,  5526,  5982,  5983,  5984,  6060,  6488,  6564,  6592,
              6795,  7510,  7888,  8349,  8350,  8351,  8502,  8580,  8655,
              8780,  8883,  9084,  9340,  9613,  9614,  9799,  9801,  9843,
              9844,  9877,  9920,  9921,  9996,  9997, 10072, 10073, 10334,
             10636, 10637, 10638, 10866, 10944, 11101, 11214, 11971, 11972,
             12384, 12847, 12848, 12849, 13077, 13155, 13312, 13425, 14182,
             14183, 14595],
            dtype='int64'),
 ('accounting',
  'low'): Int64Index([   28,    29,    30,    79,   155,   224,   225,   232,   410,
               486,
             ...
             14621, 14697, 14698, 14699, 14773, 14774, 14775, 14849, 14850,
             14851],
            dtype='int64', length=358),
 ('accounting',
  'medium'): Int64Index([  105,   106,   107,   181,   182,   183,   258,   259,   260,
               308,
             ...
             14741, 14747, 14823, 14896, 14897, 14898, 14946, 14972, 14973,
             14974],
            dtype='int64', length=335),
 ('hr',
  'high'): Int64Index([  111,  1788,  1866,  3237,  3238,  3465,  3618,  3696,  3772,
              4233,  4687,  4690,  5219,  5527,  5528,  5985,  5986,  5987,
              5988,  6138,  6216,  6594,  7054,  7057,  8352,  8353,  8733,
              8811,  8812,  8887,  9343,  9802,  9805, 10338, 10639, 10640,
             10641, 10642, 12111, 12850, 12851, 12852, 12853, 14322, 14902],
            dtype='int64'),
 ('hr',
  'low'): Int64Index([   31,    32,    33,    34,   226,   227,   228,   565,   566,
               641,
             ...
             14245, 14437, 14438, 14439, 14776, 14777, 14852, 14853, 14854,
             14855],
            dtype='int64', length=335),
 ('hr',
  'medium'): Int64Index([  108,   109,   110,   184,   185,   186,   187,   261,   262,
               263,
             ...
             14744, 14778, 14779, 14899, 14900, 14901, 14975, 14976, 14977,
             14978],
            dtype='int64', length=359),
 ('management',
  'high'): Int64Index([ 1203,  2217,  3114,  3363,  3667,  4127,  4509,  5096,  5325,
              5556,
             ...
             14148, 14149, 14150, 14151, 14186, 14204, 14205, 14206, 14207,
             14208],
            dtype='int64', length=225),
 ('management',
  'low'): Int64Index([   82,   137,   158,   213,   235,   290,   366,   442,   463,
               518,
             ...
             14446, 14501, 14577, 14653, 14674, 14729, 14805, 14873, 14928,
             14949],
            dtype='int64', length=180),
 ('management',
  'medium'): Int64Index([   60,   311,   387,   539,   615,   691,   767,   843,   919,
               974,
             ...
             13727, 13881, 14005, 14089, 14157, 14271, 14522, 14598, 14750,
             14826],
            dtype='int64', length=225),
 ('marketing',
  'high'): Int64Index([  306,   540,   618,  2295,  3289,  3662,  3668,  3891,  4122,
              4128,  4129,  4130,  4434,  4587,  4588,  4732,  5194,  5655,
              6486,  6492,  6493,  6562,  6953,  6954,  6955,  7023,  7029,
              7107,  7260,  7480,  7488,  7942,  8013,  8404,  8474,  8482,
              8778,  9006,  9393,  9471,  9549,  9612,  9624,  9702,  9703,
              9842,  9919,  9995, 10071, 10312, 11099, 11105, 11106, 11107,
             11484, 11608, 11665, 11673, 11796, 11949, 11970, 11998, 12306,
             12540, 12618, 13310, 13316, 13317, 13318, 13695, 13819, 13876,
             13884, 14007, 14160, 14181, 14209, 14517, 14751, 14829],
            dtype='int64'),
 ('marketing',
  'low'): Int64Index([   83,    84,    85,   148,   149,   150,   151,   152,   153,
               159,
             ...
             14524, 14525, 14601, 14752, 14874, 14875, 14876, 14950, 14951,
             14952],
            dtype='int64', length=402),
 ('marketing',
  'medium'): Int64Index([   77,   382,   388,   389,   458,   464,   465,   466,   534,
               542,
             ...
             14669, 14675, 14676, 14677, 14745, 14753, 14821, 14827, 14828,
             14944],
            dtype='int64', length=376),
 ('product_mng',
  'high'): Int64Index([   72,  1593,  1671,  1749,  3196,  3197,  3198,  3199,  3348,
              3426,  3579,  3804,  4267,  4725,  5562,  5563,  6021,  6022,
              6023,  6097,  6099,  6553,  7015,  7548,  7850,  7851,  7852,
              7853,  8310,  8311,  8312,  8313,  8463,  8541,  8619,  8694,
              9379,  9840, 10674, 10675, 10676, 10677, 10827, 10905, 11175,
             11253, 11264, 11445, 11523, 11704, 11835, 11988, 12072, 12885,
             12886, 12887, 12888, 13038, 13116, 13386, 13464, 13475, 13656,
             13734, 13915, 14046, 14199, 14283],
            dtype='int64'),
 ('product_mng',
  'low'): Int64Index([   73,   143,   144,   145,   146,   219,   220,   221,   222,
               448,
             ...
             14659, 14660, 14735, 14736, 14737, 14738, 14811, 14812, 14813,
             14814],
            dtype='int64', length=451),
 ('product_mng',
  'medium'): Int64Index([   66,    67,    68,    69,    71,    74,    75,    76,   296,
               297,
             ...
             14583, 14584, 14585, 14586, 14661, 14662, 14934, 14935, 14936,
             14937],
            dtype='int64', length=383),
 ('sales',
  'high'): Int64Index([  696,   774,   852,   930,  1008,  1086,  1164,  1242,  1320,
              1398,
             ...
             13503, 13581, 13620, 13888, 13890, 13929, 13940, 13944, 13948,
             14121],
            dtype='int64', length=269),
 ('sales',
  'low'): Int64Index([    0,     3,     4,     5,     6,     7,     8,     9,    10,
                11,
             ...
             14958, 14959, 14960, 14961, 14962, 14963, 14964, 14965, 14966,
             14967],
            dtype='int64', length=2099),
 ('sales',
  'medium'): Int64Index([    1,     2,    99,   100,   101,   102,   103,   104,   177,
               178,
             ...
             14891, 14892, 14893, 14894, 14895, 14945, 14968, 14969, 14970,
             14971],
            dtype='int64', length=1772),
 ('support',
  'high'): Int64Index([  657,   735,   813,   891,   969,  2139,  2334,  2412,  2490,
              2568,
             ...
             12949, 13018, 13269, 13313, 13446, 13450, 13542, 13879, 14085,
             14868],
            dtype='int64', length=141),
 ('support',
  'low'): Int64Index([   46,    47,    48,    49,    50,    51,    52,    53,    54,
                55,
             ...
             14947, 14990, 14991, 14992, 14993, 14994, 14995, 14996, 14997,
             14998],
            dtype='int64', length=1146),
 ('support',
  'medium'): Int64Index([  309,   428,   461,   504,   505,   506,   537,   581,   582,
               583,
             ...
             14748, 14792, 14793, 14794, 14795, 14824, 14867, 14870, 14871,
             14872],
            dtype='int64', length=942),
 ('technical',
  'high'): Int64Index([  189,   267,   345,   423,   462,   501,   579,  1047,  1125,
              1944,
             ...
             13851, 13906, 14400, 14478, 14556, 14634, 14673, 14712, 14790,
             14980],
            dtype='int64', length=201),
 ('technical',
  'low'): Int64Index([   35,    36,    37,    38,    39,    40,    41,    42,    43,
                44,
             ...
             14913, 14925, 14926, 14927, 14948, 14981, 14986, 14987, 14988,
             14989],
            dtype='int64', length=1372),
 ('technical',
  'medium'): Int64Index([  113,   114,   115,   116,   188,   191,   192,   193,   194,
               265,
             ...
             14866, 14904, 14905, 14906, 14907, 14979, 14982, 14983, 14984,
             14985],
            dtype='int64', length=1147)}
  • Selecting a group
grouped = hr_data.groupby(['department','salary'])
grouped.get_group(('technical','low')).head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary left_categorical new_number_project
35 0.10 0.94 6 255 4 0 1 0 technical low True 6
36 0.38 0.46 2 137 3 0 1 0 technical low True 2
37 0.45 0.50 2 126 3 0 1 0 technical low True 2
38 0.11 0.89 6 306 4 0 1 0 technical low True 6
39 0.41 0.54 2 152 3 0 1 0 technical low True 2
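When grouping by multiple columns, `get_group` takes the key as a tuple, as above. A minimal sketch on made-up data (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'dept':   ['IT', 'hr', 'IT', 'hr'],
                   'salary': ['low', 'low', 'high', 'low'],
                   'hours':  [130, 140, 150, 160]})

grouped = df.groupby(['dept', 'salary'])
# The key must be a tuple matching the grouping columns, in order.
sub = grouped.get_group(('hr', 'low'))
print(sub)
```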

Aggregation

  • Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.
  • These operations are similar to the aggregating API, window functions API, and resample API.
import numpy as np
grouped.number_project.aggregate(np.mean)
department   salary
IT           high      3.867470
             low       3.794745
             medium    3.833645
RandD        high      3.764706
             low       3.804945
             medium    3.913978
accounting   high      3.905405
             low       3.801676
             medium    3.832836
hr           high      3.888889
             low       3.692537
             medium    3.590529
management   high      3.777778
             low       3.777778
             medium    4.008889
marketing    high      3.425000
             low       3.751244
             medium    3.675532
product_mng  high      3.705882
             low       3.824834
             medium    3.804178
sales        high      3.858736
             low       3.757980
             medium    3.785553
support      high      3.794326
             low       3.787086
             medium    3.825902
technical    high      3.651741
             low       3.910350
             medium    3.878814
Name: number_project, dtype: float64
grouped.agg([np.mean, np.max])
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years new_number_project
mean amax mean amax mean amax mean amax mean amax mean amax mean amax mean amax mean amax
department salary
IT high 0.638193 0.99 0.716627 0.99 3.867470 6 194.927711 275 3.072289 6 0.048193 1 0.048193 1 0.000000 0 4.012048 6
low 0.610099 1.00 0.715665 1.00 3.794745 7 201.382594 308 3.438424 10 0.146141 1 0.282430 1 0.003284 1 3.916256 7
medium 0.624187 1.00 0.718187 1.00 3.833645 7 204.295327 308 3.564486 10 0.132710 1 0.181308 1 0.001869 1 3.945794 7
RandD high 0.586667 0.97 0.700588 0.95 3.764706 6 199.745098 287 3.529412 8 0.176471 1 0.078431 1 0.019608 1 3.843137 6
low 0.623929 1.00 0.714176 1.00 3.804945 7 198.747253 308 3.381868 8 0.195055 1 0.151099 1 0.008242 1 3.903846 7
medium 0.620349 1.00 0.711694 1.00 3.913978 7 202.954301 301 3.330645 6 0.145161 1 0.166667 1 0.061828 1 4.043011 7
accounting high 0.614054 0.97 0.724595 1.00 3.905405 6 205.905405 277 3.216216 8 0.202703 1 0.067568 1 0.081081 1 3.986486 6
low 0.574162 1.00 0.713883 1.00 3.801676 7 199.899441 308 3.438547 10 0.111732 1 0.276536 1 0.005587 1 3.905028 7
medium 0.583642 1.00 0.720299 1.00 3.832836 7 201.465672 310 3.680597 10 0.122388 1 0.298507 1 0.017910 1 3.940299 7
hr high 0.673111 0.99 0.743778 0.99 3.888889 6 209.066667 289 2.911111 6 0.088889 1 0.133333 1 0.044444 1 4.066667 6
low 0.608657 1.00 0.717821 1.00 3.692537 7 202.456716 310 3.259701 6 0.137313 1 0.274627 1 0.005970 1 3.797015 7
medium 0.580306 1.00 0.696100 1.00 3.590529 7 193.863510 310 3.501393 8 0.108635 1 0.325905 1 0.030641 1 3.710306 7
management high 0.653333 0.98 0.715822 1.00 3.777778 6 200.248889 286 5.164444 10 0.160000 1 0.004444 1 0.200000 1 3.844444 6
low 0.610722 1.00 0.712833 1.00 3.777778 7 200.744444 307 3.411111 10 0.166667 1 0.327778 1 0.038889 1 3.900000 7
medium 0.597867 1.00 0.741111 1.00 4.008889 7 202.653333 304 4.155556 10 0.164444 1 0.137778 1 0.075556 1 4.084444 7
marketing high 0.605250 1.00 0.663625 1.00 3.425000 6 185.575000 286 3.512500 10 0.162500 1 0.112500 1 0.062500 1 3.600000 6
low 0.602910 0.99 0.727587 1.00 3.751244 7 204.487562 310 3.527363 10 0.154229 1 0.313433 1 0.027363 1 3.883085 7
medium 0.638218 1.00 0.714495 1.00 3.675532 7 196.869681 300 3.627660 10 0.167553 1 0.180851 1 0.071809 1 3.776596 7
product_mng high 0.614118 0.99 0.665735 0.98 3.705882 6 194.632353 307 3.617647 10 0.191176 1 0.088235 1 0.000000 0 3.882353 6
low 0.620909 1.00 0.725831 1.00 3.824834 7 201.048780 310 3.434590 10 0.150776 1 0.232816 1 0.000000 0 3.937916 7
medium 0.619112 1.00 0.710418 1.00 3.804178 7 199.637076 310 3.498695 10 0.133159 1 0.227154 1 0.000000 0 3.895561 7
sales high 0.648959 1.00 0.699814 0.99 3.858736 7 201.178439 306 3.550186 10 0.137546 1 0.052045 1 0.044610 1 3.988848 7
low 0.600838 1.00 0.709247 1.00 3.757980 7 200.363030 307 3.464030 10 0.126251 1 0.332063 1 0.009528 1 3.870414 7
medium 0.625327 1.00 0.711778 1.00 3.785553 7 201.520316 310 3.614560 10 0.160835 1 0.170993 1 0.038375 1 3.930023 7
support high 0.655035 0.99 0.714113 1.00 3.794326 6 203.985816 286 3.219858 10 0.219858 1 0.056738 1 0.000000 0 3.943262 6
low 0.591710 1.00 0.719494 1.00 3.787086 7 198.900524 310 3.484293 10 0.151832 1 0.339442 1 0.006108 1 3.902269 7
medium 0.645149 1.00 0.728854 1.00 3.825902 7 202.535032 310 3.307856 10 0.148620 1 0.167728 1 0.013800 1 3.951168 7
technical high 0.625970 1.00 0.699453 1.00 3.651741 6 200.044776 284 3.313433 10 0.149254 1 0.124378 1 0.004975 1 3.781095 6
low 0.594322 1.00 0.723367 1.00 3.910350 7 203.064869 310 3.397230 10 0.142128 1 0.275510 1 0.008746 1 4.035714 7
medium 0.620968 1.00 0.722180 1.00 3.878814 7 202.248474 310 3.445510 10 0.136007 1 0.256321 1 0.013078 1 3.993897 7
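Passing a list of functions to `agg` yields one column per (input column, function) pair, i.e. a MultiIndex on the columns, as in the table above. A small sketch with made-up data (string function names like `'mean'` work as well as NumPy callables):

```python
import pandas as pd

df = pd.DataFrame({'dept':  ['IT', 'IT', 'hr', 'hr'],
                   'hours': [130, 150, 140, 160]})

# One output column per aggregation function.
stats = df.groupby('dept')['hours'].agg(['mean', 'max'])
print(stats)
```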

Descriptive statistics of grouped data

grouped.size()
department   salary
IT           high        83
             low        609
             medium     535
RandD        high        51
             low        364
             medium     372
accounting   high        74
             low        358
             medium     335
hr           high        45
             low        335
             medium     359
management   high       225
             low        180
             medium     225
marketing    high        80
             low        402
             medium     376
product_mng  high        68
             low        451
             medium     383
sales        high       269
             low       2099
             medium    1772
support      high       141
             low       1146
             medium     942
technical    high       201
             low       1372
             medium    1147
dtype: int64
grouped.describe()
satisfaction_level last_evaluation ... promotion_last_5years new_number_project
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
department salary
IT high 83.0 0.638193 0.223749 0.15 0.5250 0.650 0.7800 0.99 83.0 0.716627 ... 0.0 0.0 83.0 4.012048 1.152875 2.0 3.0 4.0 5.0 6.0
low 609.0 0.610099 0.258915 0.09 0.4100 0.650 0.8200 1.00 609.0 0.715665 ... 0.0 1.0 609.0 3.916256 1.289334 2.0 3.0 4.0 5.0 7.0
medium 535.0 0.624187 0.243297 0.09 0.4900 0.660 0.8100 1.00 535.0 0.718187 ... 0.0 1.0 535.0 3.945794 1.235726 2.0 3.0 4.0 5.0 7.0
RandD high 51.0 0.586667 0.228785 0.10 0.4400 0.600 0.7450 0.97 51.0 0.700588 ... 0.0 1.0 51.0 3.843137 1.137938 2.0 3.0 4.0 5.0 6.0
low 364.0 0.623929 0.242586 0.09 0.4700 0.675 0.8200 1.00 364.0 0.714176 ... 0.0 1.0 364.0 3.903846 1.187208 2.0 3.0 4.0 5.0 7.0
medium 372.0 0.620349 0.250293 0.09 0.4775 0.650 0.8300 1.00 372.0 0.711694 ... 0.0 1.0 372.0 4.043011 1.262471 2.0 3.0 4.0 5.0 7.0
accounting high 74.0 0.614054 0.237319 0.11 0.5000 0.620 0.8300 0.97 74.0 0.724595 ... 0.0 1.0 74.0 3.986486 1.140695 2.0 3.0 4.0 5.0 6.0
low 358.0 0.574162 0.252250 0.09 0.4000 0.590 0.7800 1.00 358.0 0.713883 ... 0.0 1.0 358.0 3.905028 1.350144 2.0 3.0 4.0 5.0 7.0
medium 335.0 0.583642 0.262273 0.09 0.4000 0.630 0.8000 1.00 335.0 0.720299 ... 0.0 1.0 335.0 3.940299 1.295785 2.0 3.0 4.0 5.0 7.0
hr high 45.0 0.673111 0.250616 0.09 0.5500 0.730 0.8600 0.99 45.0 0.743778 ... 0.0 1.0 45.0 4.066667 1.136182 2.0 4.0 4.0 5.0 6.0
low 335.0 0.608657 0.239902 0.09 0.4400 0.620 0.8100 1.00 335.0 0.717821 ... 0.0 1.0 335.0 3.797015 1.235935 2.0 3.0 4.0 5.0 7.0
medium 359.0 0.580306 0.253324 0.09 0.4050 0.600 0.7950 1.00 359.0 0.696100 ... 0.0 1.0 359.0 3.710306 1.330636 2.0 3.0 4.0 4.0 7.0
management high 225.0 0.653333 0.194436 0.14 0.5300 0.680 0.8000 0.98 225.0 0.715822 ... 0.0 1.0 225.0 3.844444 1.152551 2.0 3.0 4.0 5.0 6.0
low 180.0 0.610722 0.254620 0.09 0.4350 0.655 0.8050 1.00 180.0 0.712833 ... 0.0 1.0 180.0 3.900000 1.268880 2.0 3.0 4.0 5.0 7.0
medium 225.0 0.597867 0.233161 0.09 0.4800 0.630 0.7600 1.00 225.0 0.741111 ... 0.0 1.0 225.0 4.084444 1.234539 2.0 3.0 4.0 5.0 7.0
marketing high 80.0 0.605250 0.255784 0.14 0.4350 0.610 0.8275 1.00 80.0 0.663625 ... 0.0 1.0 80.0 3.600000 1.164887 2.0 3.0 3.0 4.0 6.0
low 402.0 0.602910 0.256258 0.09 0.4200 0.630 0.8100 0.99 402.0 0.727587 ... 0.0 1.0 402.0 3.883085 1.322645 2.0 3.0 4.0 5.0 7.0
medium 376.0 0.638218 0.227333 0.09 0.4900 0.670 0.8200 1.00 376.0 0.714495 ... 0.0 1.0 376.0 3.776596 1.181224 2.0 3.0 4.0 5.0 7.0
product_mng high 68.0 0.614118 0.248038 0.09 0.4500 0.625 0.8225 0.99 68.0 0.665735 ... 0.0 0.0 68.0 3.882353 1.165799 2.0 3.0 4.0 5.0 6.0
low 451.0 0.620909 0.248181 0.09 0.4450 0.650 0.8300 1.00 451.0 0.725831 ... 0.0 0.0 451.0 3.937916 1.335217 2.0 3.0 4.0 5.0 7.0
medium 383.0 0.619112 0.234720 0.09 0.4500 0.640 0.8100 1.00 383.0 0.710418 ... 0.0 0.0 383.0 3.895561 1.271745 2.0 3.0 4.0 5.0 7.0
sales high 269.0 0.648959 0.236264 0.10 0.5200 0.690 0.8200 1.00 269.0 0.699814 ... 0.0 1.0 269.0 3.988848 1.137827 2.0 3.0 4.0 5.0 7.0
low 2099.0 0.600838 0.251686 0.09 0.4200 0.630 0.8100 1.00 2099.0 0.709247 ... 0.0 1.0 2099.0 3.870414 1.352319 2.0 3.0 4.0 5.0 7.0
medium 1772.0 0.625327 0.249707 0.09 0.4500 0.660 0.8300 1.00 1772.0 0.711778 ... 0.0 1.0 1772.0 3.930023 1.224935 2.0 3.0 4.0 5.0 7.0
support high 141.0 0.655035 0.225644 0.15 0.5100 0.670 0.8600 0.99 141.0 0.714113 ... 0.0 0.0 141.0 3.943262 1.080827 2.0 3.0 4.0 5.0 6.0
low 1146.0 0.591710 0.255661 0.09 0.4000 0.630 0.8000 1.00 1146.0 0.719494 ... 0.0 1.0 1146.0 3.902269 1.339052 2.0 3.0 4.0 5.0 7.0
medium 942.0 0.645149 0.234231 0.09 0.5100 0.680 0.8300 1.00 942.0 0.728854 ... 0.0 1.0 942.0 3.951168 1.171642 2.0 3.0 4.0 5.0 7.0
technical high 201.0 0.625970 0.219279 0.10 0.4900 0.640 0.7900 1.00 201.0 0.699453 ... 0.0 1.0 201.0 3.781095 1.136592 2.0 3.0 4.0 5.0 6.0
low 1372.0 0.594322 0.264359 0.09 0.4100 0.630 0.8200 1.00 1372.0 0.723367 ... 0.0 1.0 1372.0 4.035714 1.327829 2.0 3.0 4.0 5.0 7.0
medium 1147.0 0.620968 0.246691 0.09 0.4500 0.660 0.8300 1.00 1147.0 0.722180 ... 0.0 1.0 1147.0 3.993897 1.298730 2.0 3.0 4.0 5.0 7.0

30 rows × 72 columns

grouped = hr_data.groupby(['department'])
grouped.agg(mean_projects=('number_project','mean'), mean_satisfaction=('satisfaction_level','mean'))
mean_projects mean_satisfaction
department
IT 3.816626 0.618142
RandD 3.853875 0.619822
accounting 3.825293 0.582151
hr 3.654939 0.598809
management 3.860317 0.621349
marketing 3.687646 0.618601
product_mng 3.807095 0.619634
sales 3.776329 0.614447
support 3.803948 0.618300
technical 3.877941 0.607897
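This is pandas' named aggregation: each keyword argument is a `(source column, aggregation function)` pair, and the keyword becomes the output column name. A sketch with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({'dept':  ['IT', 'IT', 'hr', 'hr'],
                   'hours': [130, 150, 140, 160],
                   'score': [0.5, 0.7, 0.6, 0.8]})

# keyword = (source column, aggregation function)
out = df.groupby('dept').agg(mean_hours=('hours', 'mean'),
                             max_score=('score', 'max'))
print(out)
```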
grouped.transform(lambda x: x + 2)
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years new_number_project
0 2.38 2.53 4 159 5 2 3 2 4
1 2.80 2.86 7 264 8 2 3 2 7
2 2.11 2.88 9 274 6 2 3 2 9
3 2.72 2.87 7 225 7 2 3 2 7
4 2.37 2.52 4 161 5 2 3 2 4
... ... ... ... ... ... ... ... ... ...
14994 2.40 2.57 4 153 5 2 3 2 4
14995 2.37 2.48 4 162 5 2 3 2 4
14996 2.37 2.53 4 145 5 2 3 2 4
14997 2.11 2.96 8 282 6 2 3 2 8
14998 2.37 2.52 4 160 5 2 3 2 4

14999 rows × 9 columns
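Unlike `agg`, `transform` returns a result with the same shape as the input, so it aligns row-for-row with the original frame. The `x + 2` example above ignores the groups; a more typical use is a group-wise computation such as demeaning within each group (a sketch on made-up data):

```python
import pandas as pd

df = pd.DataFrame({'dept':  ['IT', 'IT', 'hr', 'hr'],
                   'hours': [130, 150, 140, 160]})

# transform keeps the original shape, so the result can be assigned back.
df['hours_demeaned'] = df.groupby('dept')['hours'].transform(lambda x: x - x.mean())
print(df['hours_demeaned'].tolist())   # [-10.0, 10.0, -10.0, 10.0]
```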


Reposted from blog.csdn.net/sinat_23971513/article/details/105290512