Pandas Learning (Continued)
Working on Text Data
Introduction
Concat, Split & Join
Contains, Find
Cleaning Punctuation
Checking contents of String Data
String Manipulation
Introduction
Columns often contain string data, and that data is rarely in perfect shape.
Pandas provides a rich API of .str methods to make string handling easier.
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']})
df
Concat, Split & Join
df.A.str.cat(sep=' ')
'a b c d'
df.A.str.cat(['1', '2', '3', '4'])
0 a1
1 b2
2 c3
3 d4
Name: A, dtype: object
horror_data = pd.read_csv('../Data/horror-train.csv')
horror_data
            id                                               text author
0      id26305  This process, however, afforded me no means of...    EAP
1      id17569  It never once occurred to me that the fumbling...    HPL
2      id11008  In his left hand was a gold snuff box, from wh...    EAP
3      id27763  How lovely is spring As we looked from Windsor...    MWS
4      id12958  Finding nothing else, not even gold, the Super...    HPL
...        ...                                                ...    ...
19574  id17718  I could have fancied, while I looked at it, th...    EAP
19575  id08973  The lids clenched themselves together as if in...    EAP
19576  id05267  Mais il faut agir that is to say, a Frenchman ...    EAP
19577  id17513  For an item of news like this, it strikes us i...    EAP
19578  id00393  He laid a gnarled claw on my shoulder, and it ...    HPL

19579 rows × 3 columns
horror_data = horror_data.iloc[:10]
horror_data
        id                                               text author
0  id26305  This process, however, afforded me no means of...    EAP
1  id17569  It never once occurred to me that the fumbling...    HPL
2  id11008  In his left hand was a gold snuff box, from wh...    EAP
3  id27763  How lovely is spring As we looked from Windsor...    MWS
4  id12958  Finding nothing else, not even gold, the Super...    HPL
5  id22965  A youth passed in solitude, my best years spen...    MWS
6  id09674  The astronomer, perhaps, at this point, took r...    EAP
7  id13515        The surcingle hung in ribands from my body.    EAP
8  id19322  I knew that you could not say to yourself 'ste...    EAP
9  id00912  I confess that neither the structure of langua...    MWS
horror_data.text.str.split()
0 [This, process,, however,, afforded, me, no, m...
1 [It, never, once, occurred, to, me, that, the,...
2 [In, his, left, hand, was, a, gold, snuff, box...
3 [How, lovely, is, spring, As, we, looked, from...
4 [Finding, nothing, else,, not, even, gold,, th...
5 [A, youth, passed, in, solitude,, my, best, ye...
6 [The, astronomer,, perhaps,, at, this, point,,...
7 [The, surcingle, hung, in, ribands, from, my, ...
8 [I, knew, that, you, could, not, say, to, your...
9 [I, confess, that, neither, the, structure, of...
Name: text, dtype: object
horror_data.text.str.split(expand=True, n=20)

         0            1         2         3          4          5       6               7             8             9
0     This     process,  however,  afforded         me         no   means              of  ascertaining           the
1       It        never      once  occurred         to         me    that             the      fumbling         might
2       In          his      left      hand        was          a    gold           snuff          box,          from
3      How       lovely        is    spring         As         we  looked            from       Windsor       Terrace
4  Finding      nothing     else,       not       even      gold,     the  Superintendent     abandoned           his
5        A        youth    passed        in  solitude,         my    best           years         spent         under
6      The  astronomer,  perhaps,        at       this     point,    took          refuge            in           the
7      The    surcingle      hung        in    ribands       from      my           body.          None          None
8        I         knew      that       you      could        not     say              to      yourself  'stereotomy'
9        I      confess      that   neither        the  structure      of      languages,           nor           the

  ...      11            12           13          14            15        16        17        18           19                                                 20
0 ...      of            my     dungeon;          as             I     might      make       its     circuit,  and return to the point whence I set out, with...
1 ...       a          mere     mistake.        None          None      None      None      None         None                                               None
2 ...      as            he      capered        down           the     hill,   cutting       all       manner  of fantastic steps, he took snuff incessantly ...
3 ...     the       sixteen      fertile    counties        spread  beneath,  speckled        by        happy  cottages and wealthier towns, all looked as in...
4 ...     but             a    perplexed        look  occasionally    steals      over       his  countenance                   as he sits thinking at his desk.
5 ...  gentle           and     feminine  fosterage,           has        so   refined       the   groundwork  of my character that I cannot overcome an inte...
6 ...      of           non  luminosity;         and          here   analogy       was  suddenly          let                                              fall.
7 ...    None          None         None        None          None      None      None      None         None                                               None
8 ...   being       brought           to       think            of  atomies,       and      thus           of  the theories of Epicurus; and since, when we d...
9 ...      of  governments,          nor         the      politics        of   various    states    possessed                                attractions for me.

10 rows × 21 columns
res = horror_data.text.str.split()
res.str.join(sep=' ')
0 This process, however, afforded me no means of...
1 It never once occurred to me that the fumbling...
2 In his left hand was a gold snuff box, from wh...
3 How lovely is spring As we looked from Windsor...
4 Finding nothing else, not even gold, the Super...
5 A youth passed in solitude, my best years spen...
6 The astronomer, perhaps, at this point, took r...
7 The surcingle hung in ribands from my body.
8 I knew that you could not say to yourself 'ste...
9 I confess that neither the structure of langua...
Name: text, dtype: object
Understanding contains, find & index
horror_data.text.str.contains('This')
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: text, dtype: bool
horror_data.text.str.contains('his')
0 True
1 False
2 True
3 False
4 True
5 True
6 True
7 False
8 True
9 False
Name: text, dtype: bool
horror_data.text.str.find('This')
0 0
1 -1
2 -1
3 -1
4 -1
5 -1
6 -1
7 -1
8 -1
9 -1
Name: text, dtype: int64
horror_data[:5].text.str.index('is')
0 2
1 64
2 4
3 11
4 67
Name: text, dtype: int64
df = pd.DataFrame({'Name': ['Abc', 'Def', 'Jkl'],
                   'Email': ['awidgmail.com', '[email protected]', '[email protected]']})
df
df.Email.str.contains(r'\w+@\w+')
0 False
1 True
2 True
Name: Email, dtype: bool
df.Email.str.replace(pat='@', repl='&')
0 awidgmail.com
1 def&gmail.com
2 jkl&yahoo.com
Name: Email, dtype: object
Cleaning Punctuation
import string
table = str.maketrans('', '', string.punctuation)
horror_data.text.str.translate(table)
0 This process however afforded me no means of a...
1 It never once occurred to me that the fumbling...
2 In his left hand was a gold snuff box from whi...
3 How lovely is spring As we looked from Windsor...
4 Finding nothing else not even gold the Superin...
5 A youth passed in solitude my best years spent...
6 The astronomer perhaps at this point took refu...
7 The surcingle hung in ribands from my body
8 I knew that you could not say to yourself ster...
9 I confess that neither the structure of langua...
Name: text, dtype: object
horror_data_tf = horror_data.text.str.translate(table)
horror_data_tf = horror_data_tf.str.lower()
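The same cleanup can also be sketched with a regex instead of a translation table: str.replace with a negated character class drops punctuation in one pass. The sample Series below is illustrative, not from the dataset.

```python
import pandas as pd

s = pd.Series(["Hello, world!", "It's fine; really."])
# Regex alternative to str.translate: remove anything that is not
# a word character or whitespace, then lowercase.
cleaned = s.str.replace(r"[^\w\s]", "", regex=True).str.lower()
print(cleaned.tolist())  # ['hello world', 'its fine really']
```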
Check for string contents
isalnum() Equivalent to str.isalnum
isalpha() Equivalent to str.isalpha
isdigit() Equivalent to str.isdigit
isspace() Equivalent to str.isspace
islower() Equivalent to str.islower
isupper() Equivalent to str.isupper
istitle() Equivalent to str.istitle
isnumeric() Equivalent to str.isnumeric
isdecimal() Equivalent to str.isdecimal
df = pd.DataFrame({'A': ['1234', '123ab', 'abcde']})
df
Return only the rows whose values consist entirely of digits
df[df.A.str.isdigit()]
Return only the rows whose values consist entirely of letters
df[df.A.str.isalpha()]
String Manipulation
slice() Slice each string in the Series
slice_replace() Replace slice in each string with passed value
count() Count occurrences of pattern
startswith() Equivalent to str.startswith(pat) for each element
endswith() Equivalent to str.endswith(pat) for each element
findall() Compute list of all occurrences of pattern/regex for each string
match() Call re.match on each element, returning matched groups as list
extract() Call re.search on each element, returning DataFrame with one row for each element and one column for each regex
extractall() Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex group
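Of these, extractall is the least obvious: unlike extract, it returns one row per match rather than per element. A small sketch on an illustrative Series:

```python
import pandas as pd

s = pd.Series(["a1b2", "c3", "no digits"])
# extractall returns a DataFrame indexed by (element, match number);
# elements with no match contribute no rows at all.
matches = s.str.extractall(r"(\d)")
print(matches[0].tolist())  # ['1', '2', '3']
```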
df = pd.DataFrame({'Name': ['Rush', 'Riba', 'Kunal', 'Pruthvi'],
                   'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]']})
df
df['Username'] = df.Email.str.slice(start=0, step=2, stop=-11)
df
df['UpdatedEmail'] = df.Email.str.slice_replace(start=-11, repl='@zekelabs.com')
df
df.at[2, 'Email'] = '[email protected]'
df
help(pd.DataFrame.at)
Help on property:
Access a single value for a row/column label pair.
Similar to ``loc``, in that both provide label-based lookups. Use
``at`` if you only need to get or set a single value in a DataFrame
or Series.
Raises
------
KeyError
When label does not exist in DataFrame
See Also
--------
DataFrame.iat : Access a single value for a row/column pair by integer
position.
DataFrame.loc : Access a group of rows and columns by label(s).
Series.at : Access a single value using a label.
Examples
--------
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
... index=[4, 5, 6], columns=['A', 'B', 'C'])
>>> df
A B C
4 0 2 3
5 0 4 1
6 10 20 30
Get value at specified row/column pair
>>> df.at[4, 'B']
2
Set value at specified row/column pair
>>> df.at[4, 'B'] = 10
>>> df.at[4, 'B']
10
Get value within a Series
>>> df.loc[5].at['B']
4
Filtering based on domain name
df.Email.str.match(r'[\w]*@edyoda.com')
0 True
1 True
2 False
3 True
Name: Email, dtype: bool
Extract text based on a certain pattern
df.Email.str.extract(r'([\w]*)@([\w.]*)')
         0               1
0     rush      edyoda.com
1     riba      edyoda.com
2    kunal  everywhere.com
3  pruthvi      edyoda.com
Working on Missing Data
Detect missing & existing values.
Return a new Series with missing values removed.
Fill NA/NaN values using the specified method.
Interpolate values according to different methods.
1. Detect existing non-missing values
None and numpy.NaN are considered missing values.
An empty string is still considered a non-null value.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, None], 'B': [2, np.NaN, 3]})
df
     A    B
0  1.0  2.0
1  2.0  NaN
2  NaN  3.0
df.isna()
       A      B
0  False  False
1  False   True
2   True  False
df.isna().sum()
A 1
B 1
dtype: int64
df.isna().any()
A True
B True
dtype: bool
df.isna().sum().sum()
2
isnull is an alias of isna
df.isnull()
       A      B
0  False  False
1  False   True
2   True  False
df = pd.DataFrame({'A': [1, '', None], 'B': [2, np.NaN, 3]})
df
      A    B
0     1  2.0
1        NaN
2  None  3.0
df.isna()
       A      B
0  False  False
1  False   True
2   True  False
df.isnull()
       A      B
0  False  False
1  False   True
2   True  False
df.replace('', np.NaN)
     A    B
0  1.0  2.0
1  NaN  NaN
2  NaN  3.0
df.notna()
       A      B
0   True   True
1   True  False
2  False   True
Filtering data based on a boolean Series
df[df.A.notna()]
df.A.notna()
0 True
1 True
2 False
Name: A, dtype: bool
Return a new Series with missing values removed.
Dropping rows which have any missing values
df = pd.DataFrame({'A': [1, '', None], 'B': [2, np.NaN, 3], 'C': [3, 4, 5]})
df
      A    B  C
0     1  2.0  3
1        NaN  4
2  None  3.0  5
df.dropna()
Dropping columns which have null values
df.dropna(axis=1)
Filling missing values
df.fillna(0)
   A    B  C
0  1  2.0  3
1     0.0  4
2  0  3.0  5
df.fillna({'A': 10, 'B': 11})
    A     B  C
0   1   2.0  3
1      11.0  4
2  10   3.0  5
Values can also be filled backward or forward
df.fillna(method='bfill')
      A    B  C
0     1  2.0  3
1        3.0  4
2  None  3.0  5
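The two directions are easy to confuse; a minimal sketch on an illustrative column (ffill/bfill are the method forms of fillna's 'ffill'/'bfill'):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, np.nan, np.nan, 4]})
# Forward fill propagates the last valid value downward;
# backward fill pulls the next valid value upward.
print(df.ffill()["A"].tolist())  # [1.0, 1.0, 1.0, 4.0]
print(df.bfill()["A"].tolist())  # [1.0, 4.0, 4.0, 4.0]
```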
Interpolate missing values based on different methods
df = pd.DataFrame({'Name': ['Rush', 'Riba', 'Kunal', 'Pruthvi'],
                   'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
                   'Age': [33, 31, None, 18]})
df
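The Age column above has a gap that interpolation can fill. A minimal sketch using the same values (linear interpolation, the default, places the missing value halfway between its neighbors):

```python
import pandas as pd

age = pd.Series([33, 31, None, 18], name="Age")
# Linear interpolation: the gap between 31 and 18 is filled
# with their midpoint.
print(age.interpolate().tolist())  # [33.0, 31.0, 24.5, 18.0]
```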
Styling Pandas Table
applymap for styling the entire table
apply for column-wise styling
Highlighting null values
Applying colors to a subset of the data
import pandas as pd
import numpy as np
df = pd.read_csv('SalesTrainingDataset.csv')
df = df.iloc[:100, list(range(10))]
df
Applymap
Applies styling to the complete data; the function returns a CSS attribute string.
def color_product_sales(val):
    if val > 10000:
        color = 'green'
    elif val < 2000:
        color = 'red'
    else:
        color = 'black'
    return 'color: %s' % color

df.style.applymap(color_product_sales)
Apply
In case we want a series-wise check, apply can be used.
The function argument will be a Series: columns by default, rows if axis=1.
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

df.style.apply(highlight_max)
Chaining is also supported
df.style.applymap(color_product_sales).apply(highlight_max)
Highlighting Null Values
df.style.highlight_null(null_color='red')
Dealing with subset of Data
Selecting subset of columns
df.style.apply(highlight_max, subset=['Outcome_M8', 'Outcome_M9'])
df.style.apply(highlight_max, subset=pd.IndexSlice[2:14, :], axis=1)
df.style.apply(highlight_max, subset=pd.IndexSlice[2:14, ['Outcome_M6', 'Outcome_M7', 'Outcome_M8', 'Outcome_M9']], axis=1)
Pandas for Computation
Percent change
Covariance
Correlation
Data Ranking
Window Functions
Time aware rolling
Window Function
Rolling vs Expanding
Statistical Functions
Percent Change - Series and DataFrame have a method pct_change() to compute the percent change over a given number of periods.
import pandas as pd
import numpy as np
sales_data = pd.DataFrame(data=np.random.randint(1, 100, (10, 4)),
                          columns=['Tea', 'Milk', 'Carpet', 'Cream'],
                          index=pd.Series(pd.period_range('1/1/2011', freq='M', periods=10)))
sales_data
         Tea  Milk  Carpet  Cream
2011-01   58    81       8     49
2011-02   63    29      39     70
2011-03   17    16      17     81
2011-04   26     9      23     53
2011-05   95    97      82     45
2011-06   52    12      72     99
2011-07   66    93      84     15
2011-08   92    76      75     96
2011-09   29    19      46     35
2011-10   83    94       2      4
Changes in monthly sales data
sales_data.pct_change(periods=1).round(4) * 100
            Tea    Milk  Carpet   Cream
2011-01     NaN     NaN     NaN     NaN
2011-02    8.62  -64.20  387.50   42.86
2011-03  -73.02  -44.83  -56.41   15.71
2011-04   52.94  -43.75   35.29  -34.57
2011-05  265.38  977.78  256.52  -15.09
2011-06  -45.26  -87.63  -12.20  120.00
2011-07   26.92  675.00   16.67  -84.85
2011-08   39.39  -18.28  -10.71  540.00
2011-09  -68.48  -75.00  -38.67  -63.54
2011-10  186.21  394.74  -95.65  -88.57
Covariance & Correlation
Covariance is a measure of how much two random variables vary together.
A correlation coefficient puts a value to that relationship. Correlation coefficients lie between -1 and 1: 0 means there is no relationship between the variables at all, while -1 or 1 means a perfect negative or positive correlation.
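As a sanity check on the definition, DataFrame.corr with its default Pearson method should agree with numpy's corrcoef on the same columns. The small DataFrame below is illustrative:

```python
import pandas as pd
import numpy as np

# Illustrative data, echoing the style of the random example below.
df = pd.DataFrame({"A": [13, 19, 17, 16, 12], "B": [13, 18, 10, 15, 10]})
# corr() defaults to Pearson; compare against numpy's implementation.
r = df.corr().loc["A", "B"]
assert abs(r - np.corrcoef(df.A, df.B)[0, 1]) < 1e-12
print(round(r, 4))
```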
df = pd.DataFrame(np.random.randint(10, 20, (10, 2)), columns=['A', 'B'])
df
    A   B
0  13  13
1  19  18
2  17  10
3  16  15
4  12  10
5  13  10
6  18  17
7  12  11
8  14  11
9  15  17
df.cov()
df.corr()
        A       B
A  1.0000  0.6942
B  0.6942  1.0000
'''
Help on method corr in module pandas.core.frame:

corr(method='pearson', min_periods=1) method of pandas.core.frame.DataFrame instance
    Compute pairwise correlation of columns, excluding NA/null values.

    Parameters
    ----------
    method : {'pearson', 'kendall', 'spearman'} or callable
        * pearson : standard correlation coefficient
        * kendall : Kendall Tau correlation coefficient
        * spearman : Spearman rank correlation
        * callable : callable with input two 1d ndarrays and returning a
          float. Note that the returned matrix from corr will have 1
          along the diagonals and will be symmetric regardless of the
          callable's behavior.
          .. versionadded:: 0.24.0
    min_periods : int, optional
        Minimum number of observations required per pair of columns
        to have a valid result. Currently only available for Pearson
        and Spearman correlation.
'''
The rank method produces a data ranking, with ties being assigned the mean of the ranks (by default) for the group:
df
df['A_Rank'] = df.A.rank()
df
    A   B  A_Rank
0  13  13     3.5
1  19  18    10.0
2  17  10     8.0
3  16  15     7.0
4  12  10     1.5
5  13  10     3.5
6  18  17     9.0
7  12  11     1.5
8  14  11     5.0
9  15  17     6.0
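The tie-breaking behaviors listed in the help text below are easiest to see side by side; a small sketch comparing the default 'average' method with 'dense':

```python
import pandas as pd

s = pd.Series([12, 12, 13, 13, 19])
# 'average' (the default) gives tied values the mean of their ranks;
# 'dense' keeps ranks consecutive across groups of ties.
print(s.rank().tolist())                # [1.5, 1.5, 3.5, 3.5, 5.0]
print(s.rank(method="dense").tolist())  # [1.0, 1.0, 2.0, 2.0, 3.0]
```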
'''
Help on method rank in module pandas.core.generic:

rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False) method of pandas.core.series.Series instance
    Compute numerical data ranks (1 through n) along axis.

    By default, equal values are assigned a rank that is the average
    of the ranks of those values.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Index to direct ranking.
    method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
        How to rank the group of records that have the same value
        (i.e. ties):

        * average: average rank of the group
        * min: lowest rank in the group
        * max: highest rank in the group
        * first: ranks assigned in order they appear in the array
        * dense: like 'min', but rank always increases by 1 between groups
    numeric_only : bool, optional
        For DataFrame objects, rank only numeric columns if set to True.
    na_option : {'keep', 'top', 'bottom'}, default 'keep'
        How to rank NaN values:

        * keep: assign NaN rank to NaN values
        * top: assign smallest rank to NaN values if ascending
        * bottom: assign highest rank to NaN values if ascending
    ascending : bool, default True
        Whether or not the elements should be ranked in ascending order.
    pct : bool, default False
        Whether or not to display the returned rankings in percentile form.
'''
Window Functions
For working with data, a number of window functions are provided for computing common window or rolling statistics.
Among these are count, sum, mean, median, correlation, variance, covariance, standard deviation, skewness, and kurtosis.
sales_data = pd.read_csv('../Data/sales-data.csv')
sales_data
   Month  Sales
0   1-01  266.0
1   1-02  145.9
2   1-03  183.1
3   1-04  119.3
4   1-05  180.3
5   1-06  168.5
6   1-07  231.8
7   1-08  224.5
8   1-09  192.8
9   1-10  122.9
10  1-11  336.5
11  1-12  185.9
12  2-01  194.3
13  2-02  149.5
14  2-03  210.1
15  2-04  273.3
16  2-05  191.4
17  2-06  287.0
18  2-07  226.0
19  2-08  303.6
20  2-09  289.9
21  2-10  421.6
22  2-11  264.5
23  2-12  342.3
24  3-01  339.7
25  3-02  440.4
26  3-03  315.9
27  3-04  439.3
28  3-05  401.3
29  3-06  437.4
30  3-07  575.5
31  3-08  407.6
32  3-09  682.0
33  3-10  475.3
34  3-11  581.3
35  3-12  646.9
r = sales_data.Sales.rolling(window=5)
r. count( )
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 5.0
6 5.0
7 5.0
8 5.0
9 5.0
10 5.0
11 5.0
12 5.0
13 5.0
14 5.0
15 5.0
16 5.0
17 5.0
18 5.0
19 5.0
20 5.0
21 5.0
22 5.0
23 5.0
24 5.0
25 5.0
26 5.0
27 5.0
28 5.0
29 5.0
30 5.0
31 5.0
32 5.0
33 5.0
34 5.0
35 5.0
Name: Sales, dtype: float64
r.max()
0 NaN
1 NaN
2 NaN
3 NaN
4 266.0
5 183.1
6 231.8
7 231.8
8 231.8
9 231.8
10 336.5
11 336.5
12 336.5
13 336.5
14 336.5
15 273.3
16 273.3
17 287.0
18 287.0
19 303.6
20 303.6
21 421.6
22 421.6
23 421.6
24 421.6
25 440.4
26 440.4
27 440.4
28 440.4
29 440.4
30 575.5
31 575.5
32 682.0
33 682.0
34 682.0
35 682.0
Name: Sales, dtype: float64
Time aware rolling
dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                   index=pd.date_range('20130101 09:00:00',
                                       periods=5,
                                       freq='s'))
dft
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  2.0
2013-01-01 09:00:03  NaN
2013-01-01 09:00:04  4.0
dft.rolling('2s').sum()
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  3.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:04  4.0
r.agg(np.sum)
0 NaN
1 NaN
2 NaN
3 NaN
4 894.6
5 797.1
6 883.0
7 924.4
8 997.9
9 940.5
10 1108.5
11 1062.6
12 1032.4
13 989.1
14 1076.3
15 1013.1
16 1018.6
17 1111.3
18 1187.8
19 1281.3
20 1297.9
21 1528.1
22 1505.6
23 1621.9
24 1658.0
25 1808.5
26 1702.8
27 1877.6
28 1936.6
29 2034.3
30 2169.4
31 2261.1
32 2503.8
33 2577.8
34 2721.7
35 2793.1
Name: Sales, dtype: float64
r.agg([np.sum, np.mean])
       sum    mean
0      NaN     NaN
1      NaN     NaN
2      NaN     NaN
3      NaN     NaN
4    894.6  178.92
5    797.1  159.42
6    883.0  176.60
7    924.4  184.88
8    997.9  199.58
9    940.5  188.10
10  1108.5  221.70
11  1062.6  212.52
12  1032.4  206.48
13   989.1  197.82
14  1076.3  215.26
15  1013.1  202.62
16  1018.6  203.72
17  1111.3  222.26
18  1187.8  237.56
19  1281.3  256.26
20  1297.9  259.58
21  1528.1  305.62
22  1505.6  301.12
23  1621.9  324.38
24  1658.0  331.60
25  1808.5  361.70
26  1702.8  340.56
27  1877.6  375.52
28  1936.6  387.32
29  2034.3  406.86
30  2169.4  433.88
31  2261.1  452.22
32  2503.8  500.76
33  2577.8  515.56
34  2721.7  544.34
35  2793.1  558.62
Rolling vs Expanding
data = pd.DataFrame([
    ['a', 1],
    ['a', 2],
    ['a', 3],
    ['b', 5],
    ['b', 6],
    ['b', 7],
    ['b', 8],
    ['c', 10],
    ['c', 11],
    ['c', 12],
    ['c', 13]
], columns=['category', 'value'])
data
   category  value
0         a      1
1         a      2
2         a      3
3         b      5
4         b      6
5         b      7
6         b      8
7         c     10
8         c     11
9         c     12
10        c     13
data.value.expanding(1).sum()
0 1.0
1 3.0
2 6.0
3 11.0
4 17.0
5 24.0
6 32.0
7 42.0
8 53.0
9 65.0
10 78.0
Name: value, dtype: float64
data.value.rolling(2).sum()
0 NaN
1 3.0
2 5.0
3 8.0
4 11.0
5 13.0
6 15.0
7 18.0
8 21.0
9 23.0
10 25.0
Name: value, dtype: float64
Expanding - If we use the expanding window with initial size 1, it will create a window that in the first step contains only the first row. In the second step, it contains both the first and the second row. In every step, one additional row is added to the window, and the aggregating function is being recalculated.
Rolling - Rolling windows are totally different. In this case, we specify the size of the window which is moving. What happens when we set the rolling window size to 2?
In the first step, it is going to contain the first row and one undefined row, so I am going to get NaN as a result.
In the second step, the window moves and now contains the first and the second row. Now it is possible to calculate the aggregate function. In the case of this example, the sum of both rows.
In the third step, the window moves again and no longer contains the first row. Instead of that now it calculates the sum of the second and the third row.
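The relationship described above can be checked directly: an expanding window with initial size 1 behaves like a rolling window as wide as the whole Series with min_periods=1, since both produce running totals.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 5])
# Both forms compute a cumulative sum: 1, 3, 6, 11.
expanding = s.expanding(1).sum()
rolling_all = s.rolling(window=len(s), min_periods=1).sum()
print(expanding.tolist())    # [1.0, 3.0, 6.0, 11.0]
print(rolling_all.tolist())  # [1.0, 3.0, 6.0, 11.0]
```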
Data Transformation using Map, Apply & GroupBy
Transforming Series using Map
Transforming across multiple Series using apply
GroupBy - Splitting, Applying & Combine
import pandas as pd
import numpy as np
hr_data = pd.read_csv('../Data/HR_comma_sep.csv.txt')
hr_data.rename(columns={'sales': 'department'}, inplace=True)
Transforming Series using Map
hr_data.head()
   satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years department  salary
0                0.38             0.53               2                   157                   3              0     1                      0      sales     low
1                0.80             0.86               5                   262                   6              0     1                      0      sales  medium
2                0.11             0.88               7                   272                   4              0     1                      0      sales  medium
3                0.72             0.87               5                   223                   5              0     1                      0      sales     low
4                0.37             0.52               2                   159                   3              0     1                      0      sales     low
Use map to transform the left column into categorical labels
hr_data['left_categorical'] = hr_data.left.map({1: 'True', 0: 'False'})
hr_data.head()
   satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years department  salary left_categorical
0                0.38             0.53               2                   157                   3              0     1                      0      sales     low              True
1                0.80             0.86               5                   262                   6              0     1                      0      sales  medium              True
2                0.11             0.88               7                   272                   4              0     1                      0      sales  medium              True
3                0.72             0.87               5                   223                   5              0     1                      0      sales     low              True
4                0.37             0.52               2                   159                   3              0     1                      0      sales     low              True
Transforming data across multiple Series
If satisfaction_level > .9, increase number_project by 1.
Multiple columns can't be handled with map; we need apply for that.
def increase_proj(r):
    if r.satisfaction_level > .9:
        return r.number_project + 1
    else:
        return r.number_project

hr_data['new_number_project'] = hr_data.apply(increase_proj, axis=1)
Filtering all the rows where this happened
hr_data[hr_data.number_project != hr_data.new_number_project].head()
     satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  department  salary left_categorical  new_number_project
7                  0.92             0.85               5                   259                   5              0     1                      0       sales     low             True                   6
106                0.91             1.00               4                   257                   5              0     1                      0  accounting  medium             True                   5
191                0.92             0.87               4                   226                   6              1     1                      0   technical  medium             True                   5
231                0.92             0.99               5                   255                   6              0     1                      0       sales     low             True                   6
352                0.91             0.91               4                   262                   6              0     1                      0     support     low             True                   5
GroupBy
grouped = hr_data.groupby(['department'])
Compute first & last of group values
grouped.first()
             satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  salary left_categorical  new_number_project
department
IT                         0.11             0.93               7                   308                   4              0     1                      0  medium             True                   7
RandD                      0.12             1.00               3                   278                   4              0     1                      0  medium             True                   3
accounting                 0.41             0.46               2                   128                   3              0     1                      0     low             True                   2
hr                         0.45             0.57               2                   134                   3              0     1                      0     low             True                   2
management                 0.85             0.91               5                   226                   5              0     1                      0  medium             True                   5
marketing                  0.40             0.54               2                   137                   3              0     1                      0  medium             True                   2
product_mng                0.43             0.54               2                   153                   3              0     1                      0  medium             True                   2
sales                      0.38             0.53               2                   157                   3              0     1                      0     low             True                   2
support                    0.40             0.55               2                   147                   3              0     1                      0     low             True                   2
technical                  0.10             0.94               6                   255                   4              0     1                      0     low             True                   6
grouped.last()
             satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  salary left_categorical  new_number_project
department
IT                         0.90             0.92               4                   271                   5              0     1                      0  medium             True                   4
RandD                      0.81             0.92               5                   239                   5              0     1                      0  medium             True                   5
accounting                 0.36             0.54               2                   153                   3              0     1                      0  medium             True                   2
hr                         0.40             0.47               2                   144                   3              0     1                      0  medium             True                   2
management                 0.42             0.57               2                   147                   3              1     1                      0     low             True                   2
marketing                  0.44             0.52               2                   149                   3              0     1                      0     low             True                   2
product_mng                0.46             0.55               2                   147                   3              0     1                      0  medium             True                   2
sales                      0.39             0.45               2                   140                   3              0     1                      0  medium             True                   2
support                    0.37             0.52               2                   158                   3              0     1                      0     low             True                   2
technical                  0.43             0.57               2                   159                   3              1     1                      0     low             True                   2
grouped.nth(2)
             satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  salary left_categorical  new_number_project
department
IT                         0.36             0.56               2                   132                   3              0     1                      0  medium             True                   2
RandD                      0.37             0.55               2                   127                   3              0     1                      0  medium             True                   2
accounting                 0.09             0.62               6                   294                   4              0     1                      0     low             True                   6
hr                         0.45             0.55               2                   140                   3              0     1                      0     low             True                   2
management                 0.42             0.48               2                   129                   3              0     1                      0     low             True                   2
marketing                  0.11             0.77               6                   291                   4              0     1                      0     low             True                   6
product_mng                0.76             0.86               5                   223                   5              1     1                      0  medium             True                   5
sales                      0.11             0.88               7                   272                   4              0     1                      0  medium             True                   7
support                    0.40             0.54               2                   148                   3              0     1                      0     low             True                   2
technical                  0.45             0.50               2                   126                   3              0     1                      0     low             True                   2
grouped.groups
{'IT': Int64Index([ 61, 62, 63, 64, 65, 70, 138, 139, 140,
141,
...
14808, 14809, 14810, 14815, 14929, 14930, 14931, 14932, 14933,
14938],
dtype='int64', length=1227),
'RandD': Int64Index([ 301, 302, 303, 304, 305, 453, 454, 455, 456,
457,
...
14816, 14817, 14818, 14819, 14820, 14939, 14940, 14941, 14942,
14943],
dtype='int64', length=787),
'accounting': Int64Index([ 28, 29, 30, 79, 105, 106, 107, 155, 181,
182,
...
14849, 14850, 14851, 14896, 14897, 14898, 14946, 14972, 14973,
14974],
dtype='int64', length=767),
'hr': Int64Index([ 31, 32, 33, 34, 108, 109, 110, 111, 184,
185,
...
14854, 14855, 14899, 14900, 14901, 14902, 14975, 14976, 14977,
14978],
dtype='int64', length=739),
'management': Int64Index([ 60, 82, 137, 158, 213, 235, 290, 311, 366,
387,
...
14598, 14653, 14674, 14729, 14750, 14805, 14826, 14873, 14928,
14949],
dtype='int64', length=630),
'marketing': Int64Index([ 77, 83, 84, 85, 148, 149, 150, 151, 152,
153,
...
14827, 14828, 14829, 14874, 14875, 14876, 14944, 14950, 14951,
14952],
dtype='int64', length=858),
'product_mng': Int64Index([ 66, 67, 68, 69, 71, 72, 73, 74, 75,
76,
...
14737, 14738, 14811, 14812, 14813, 14814, 14934, 14935, 14936,
14937],
dtype='int64', length=902),
'sales': Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8,
9,
...
14962, 14963, 14964, 14965, 14966, 14967, 14968, 14969, 14970,
14971],
dtype='int64', length=4140),
'support': Int64Index([ 46, 47, 48, 49, 50, 51, 52, 53, 54,
55,
...
14947, 14990, 14991, 14992, 14993, 14994, 14995, 14996, 14997,
14998],
dtype='int64', length=2229),
'technical': Int64Index([ 35, 36, 37, 38, 39, 40, 41, 42, 43,
44,
...
14980, 14981, 14982, 14983, 14984, 14985, 14986, 14987, 14988,
14989],
dtype='int64', length=2720)}
hr_data.groupby(['department', 'salary']).groups
{('IT',
'high'): Int64Index([ 1281, 1359, 1437, 1515, 3192, 3193, 3194, 3195, 3200,
3270, 3504, 3799, 3802, 4260, 4264, 4269, 4720, 5097,
5098, 5557, 5558, 5559, 5560, 5561, 5634, 5712, 5790,
5865, 5943, 6024, 6093, 6547, 6550, 7009, 7011, 7012,
7087, 7474, 7544, 7845, 7846, 7847, 7848, 7849, 7998,
8076, 8154, 8229, 8307, 8308, 8309, 8314, 8385, 8772,
8917, 8919, 9375, 9835, 10213, 10593, 10671, 10672, 10673,
10678, 10747, 10749, 10980, 11268, 11601, 11700, 11707, 12804,
12882, 12883, 12884, 12889, 12958, 12960, 13191, 13479, 13812,
13911, 13918],
dtype='int64'),
('IT',
'low'): Int64Index([ 138, 139, 140, 141, 142, 147, 214, 215, 216,
217,
...
14731, 14732, 14733, 14734, 14806, 14807, 14808, 14809, 14810,
14815],
dtype='int64', length=609),
('IT',
'medium'): Int64Index([61, 62, 63, 64, 65, 70, 294, 295, 300, 376, ...], dtype='int64', length=535),
 ('RandD', 'high'): Int64Index([1827, 1905, 1983, 3201, 3202, ...], dtype='int64', length=51),
 ('RandD', 'low'): Int64Index([605, 833, 834, 835, 836, ...], dtype='int64', length=364),
 ('RandD', 'medium'): Int64Index([301, 302, 303, 304, 305, ...], dtype='int64', length=372),
 ('accounting', 'high'): Int64Index([384, 1632, 1710, 3234, 3235, ...], dtype='int64', length=74),
 ('accounting', 'low'): Int64Index([28, 29, 30, 79, 155, ...], dtype='int64', length=358),
 ('accounting', 'medium'): Int64Index([105, 106, 107, 181, 182, ...], dtype='int64', length=335),
 ('hr', 'high'): Int64Index([111, 1788, 1866, 3237, 3238, ...], dtype='int64', length=45),
 ('hr', 'low'): Int64Index([31, 32, 33, 34, 226, ...], dtype='int64', length=335),
 ('hr', 'medium'): Int64Index([108, 109, 110, 184, 185, ...], dtype='int64', length=359),
 ('management', 'high'): Int64Index([1203, 2217, 3114, 3363, 3667, ...], dtype='int64', length=225),
 ('management', 'low'): Int64Index([82, 137, 158, 213, 235, ...], dtype='int64', length=180),
 ('management', 'medium'): Int64Index([60, 311, 387, 539, 615, ...], dtype='int64', length=225),
 ('marketing', 'high'): Int64Index([306, 540, 618, 2295, 3289, ...], dtype='int64', length=80),
 ('marketing', 'low'): Int64Index([83, 84, 85, 148, 149, ...], dtype='int64', length=402),
 ('marketing', 'medium'): Int64Index([77, 382, 388, 389, 458, ...], dtype='int64', length=376),
 ('product_mng', 'high'): Int64Index([72, 1593, 1671, 1749, 3196, ...], dtype='int64', length=68),
 ('product_mng', 'low'): Int64Index([73, 143, 144, 145, 146, ...], dtype='int64', length=451),
 ('product_mng', 'medium'): Int64Index([66, 67, 68, 69, 71, ...], dtype='int64', length=383),
 ('sales', 'high'): Int64Index([696, 774, 852, 930, 1008, ...], dtype='int64', length=269),
 ('sales', 'low'): Int64Index([0, 3, 4, 5, 6, ...], dtype='int64', length=2099),
 ('sales', 'medium'): Int64Index([1, 2, 99, 100, 101, ...], dtype='int64', length=1772),
 ('support', 'high'): Int64Index([657, 735, 813, 891, 969, ...], dtype='int64', length=141),
 ('support', 'low'): Int64Index([46, 47, 48, 49, 50, ...], dtype='int64', length=1146),
 ('support', 'medium'): Int64Index([309, 428, 461, 504, 505, ...], dtype='int64', length=942),
 ('technical', 'high'): Int64Index([189, 267, 345, 423, 462, ...], dtype='int64', length=201),
 ('technical', 'low'): Int64Index([35, 36, 37, 38, 39, ...], dtype='int64', length=1372),
 ('technical', 'medium'): Int64Index([113, 114, 115, 116, 188, ...], dtype='int64', length=1147)}
grouped = hr_data.groupby(['department', 'salary'])
grouped.get_group(('technical', 'low')).head()
    satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years department salary left_categorical  new_number_project
35                0.10             0.94               6                   255                   4              0     1                      0  technical    low             True                   6
36                0.38             0.46               2                   137                   3              0     1                      0  technical    low             True                   2
37                0.45             0.50               2                   126                   3              0     1                      0  technical    low             True                   2
38                0.11             0.89               6                   306                   4              0     1                      0  technical    low             True                   6
39                0.41             0.54               2                   152                   3              0     1                      0  technical    low             True                   2
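Besides get_group, a GroupBy object is also iterable: each iteration yields the group key and the matching sub-DataFrame. A minimal sketch on a toy DataFrame (the column names here are illustrative, not taken from hr_data):

```python
import pandas as pd

toy = pd.DataFrame({
    'department': ['sales', 'sales', 'hr', 'hr'],
    'salary':     ['low',   'high',  'low', 'low'],
    'hours':      [150,     200,     140,   160],
})

# Iterating a GroupBy yields (key, sub-DataFrame) pairs;
# with multiple grouping columns the key is a tuple
group_sizes = {key: len(sub) for key, sub in toy.groupby(['department', 'salary'])}
```

This is handy when each group needs separate processing, e.g. writing one file per group.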
Aggregation
Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.
These operations are similar to the aggregating API, window functions API, and resample API.
import numpy as np

grouped.number_project.aggregate(np.mean)
department salary
IT high 3.867470
low 3.794745
medium 3.833645
RandD high 3.764706
low 3.804945
medium 3.913978
accounting high 3.905405
low 3.801676
medium 3.832836
hr high 3.888889
low 3.692537
medium 3.590529
management high 3.777778
low 3.777778
medium 4.008889
marketing high 3.425000
low 3.751244
medium 3.675532
product_mng high 3.705882
low 3.824834
medium 3.804178
sales high 3.858736
low 3.757980
medium 3.785553
support high 3.794326
low 3.787086
medium 3.825902
technical high 3.651741
low 3.910350
medium 3.878814
Name: number_project, dtype: float64
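agg also accepts a dict mapping column names to aggregations, so each column can be summarized differently. A small sketch on toy data (not hr_data):

```python
import pandas as pd

toy = pd.DataFrame({
    'dept':     ['a', 'a', 'b', 'b'],
    'projects': [2,   4,   3,   5],
    'hours':    [120, 160, 140, 180],
})

# Map column name -> aggregation to aggregate each column differently
summary = toy.groupby('dept').agg({'projects': 'mean', 'hours': 'max'})
```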
grouped.agg([np.mean, np.max])
                   satisfaction_level    last_evaluation     number_project    average_montly_hours   time_spend_company   Work_accident        left         promotion_last_5years   new_number_project
                       mean     amax       mean     amax       mean     amax        mean       amax       mean     amax      mean    amax    mean    amax        mean       amax         mean     amax
department salary
IT         high    0.638193     0.99   0.716627     0.99   3.867470        6  194.927711        275   3.072289        6  0.048193       1  0.048193    1    0.000000          0     4.012048        6
           low     0.610099     1.00   0.715665     1.00   3.794745        7  201.382594        308   3.438424       10  0.146141       1  0.282430    1    0.003284          1     3.916256        7
           medium  0.624187     1.00   0.718187     1.00   3.833645        7  204.295327        308   3.564486       10  0.132710       1  0.181308    1    0.001869          1     3.945794        7
RandD      high    0.586667     0.97   0.700588     0.95   3.764706        6  199.745098        287   3.529412        8  0.176471       1  0.078431    1    0.019608          1     3.843137        6
           low     0.623929     1.00   0.714176     1.00   3.804945        7  198.747253        308   3.381868        8  0.195055       1  0.151099    1    0.008242          1     3.903846        7
           medium  0.620349     1.00   0.711694     1.00   3.913978        7  202.954301        301   3.330645        6  0.145161       1  0.166667    1    0.061828          1     4.043011        7
accounting high    0.614054     0.97   0.724595     1.00   3.905405        6  205.905405        277   3.216216        8  0.202703       1  0.067568    1    0.081081          1     3.986486        6
           low     0.574162     1.00   0.713883     1.00   3.801676        7  199.899441        308   3.438547       10  0.111732       1  0.276536    1    0.005587          1     3.905028        7
           medium  0.583642     1.00   0.720299     1.00   3.832836        7  201.465672        310   3.680597       10  0.122388       1  0.298507    1    0.017910          1     3.940299        7
hr         high    0.673111     0.99   0.743778     0.99   3.888889        6  209.066667        289   2.911111        6  0.088889       1  0.133333    1    0.044444          1     4.066667        6
           low     0.608657     1.00   0.717821     1.00   3.692537        7  202.456716        310   3.259701        6  0.137313       1  0.274627    1    0.005970          1     3.797015        7
           medium  0.580306     1.00   0.696100     1.00   3.590529        7  193.863510        310   3.501393        8  0.108635       1  0.325905    1    0.030641          1     3.710306        7
management high    0.653333     0.98   0.715822     1.00   3.777778        6  200.248889        286   5.164444       10  0.160000       1  0.004444    1    0.200000          1     3.844444        6
           low     0.610722     1.00   0.712833     1.00   3.777778        7  200.744444        307   3.411111       10  0.166667       1  0.327778    1    0.038889          1     3.900000        7
           medium  0.597867     1.00   0.741111     1.00   4.008889        7  202.653333        304   4.155556       10  0.164444       1  0.137778    1    0.075556          1     4.084444        7
marketing  high    0.605250     1.00   0.663625     1.00   3.425000        6  185.575000        286   3.512500       10  0.162500       1  0.112500    1    0.062500          1     3.600000        6
           low     0.602910     0.99   0.727587     1.00   3.751244        7  204.487562        310   3.527363       10  0.154229       1  0.313433    1    0.027363          1     3.883085        7
           medium  0.638218     1.00   0.714495     1.00   3.675532        7  196.869681        300   3.627660       10  0.167553       1  0.180851    1    0.071809          1     3.776596        7
product_mng high   0.614118     0.99   0.665735     0.98   3.705882        6  194.632353        307   3.617647       10  0.191176       1  0.088235    1    0.000000          0     3.882353        6
           low     0.620909     1.00   0.725831     1.00   3.824834        7  201.048780        310   3.434590       10  0.150776       1  0.232816    1    0.000000          0     3.937916        7
           medium  0.619112     1.00   0.710418     1.00   3.804178        7  199.637076        310   3.498695       10  0.133159       1  0.227154    1    0.000000          0     3.895561        7
sales      high    0.648959     1.00   0.699814     0.99   3.858736        7  201.178439        306   3.550186       10  0.137546       1  0.052045    1    0.044610          1     3.988848        7
           low     0.600838     1.00   0.709247     1.00   3.757980        7  200.363030        307   3.464030       10  0.126251       1  0.332063    1    0.009528          1     3.870414        7
           medium  0.625327     1.00   0.711778     1.00   3.785553        7  201.520316        310   3.614560       10  0.160835       1  0.170993    1    0.038375          1     3.930023        7
support    high    0.655035     0.99   0.714113     1.00   3.794326        6  203.985816        286   3.219858       10  0.219858       1  0.056738    1    0.000000          0     3.943262        6
           low     0.591710     1.00   0.719494     1.00   3.787086        7  198.900524        310   3.484293       10  0.151832       1  0.339442    1    0.006108          1     3.902269        7
           medium  0.645149     1.00   0.728854     1.00   3.825902        7  202.535032        310   3.307856       10  0.148620       1  0.167728    1    0.013800          1     3.951168        7
technical  high    0.625970     1.00   0.699453     1.00   3.651741        6  200.044776        284   3.313433       10  0.149254       1  0.124378    1    0.004975          1     3.781095        6
           low     0.594322     1.00   0.723367     1.00   3.910350        7  203.064869        310   3.397230       10  0.142128       1  0.275510    1    0.008746          1     4.035714        7
           medium  0.620968     1.00   0.722180     1.00   3.878814        7  202.248474        310   3.445510       10  0.136007       1  0.256321    1    0.013078          1     3.993897        7
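Note the 'amax' column label above: it comes from np.max's internal function name. Passing aggregation names as strings gives cleaner labels; a sketch on toy data (not hr_data):

```python
import pandas as pd

toy = pd.DataFrame({'dept': ['a', 'a', 'b'], 'projects': [2, 4, 3]})

# String aggregation names label the columns 'mean' and 'max' (not 'amax')
stats = toy.groupby('dept')['projects'].agg(['mean', 'max'])
```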
Descriptive statistics of grouped data
grouped.size()
department salary
IT high 83
low 609
medium 535
RandD high 51
low 364
medium 372
accounting high 74
low 358
medium 335
hr high 45
low 335
medium 359
management high 225
low 180
medium 225
marketing high 80
low 402
medium 376
product_mng high 68
low 451
medium 383
sales high 269
low 2099
medium 1772
support high 141
low 1146
medium 942
technical high 201
low 1372
medium 1147
dtype: int64
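Because size() returns a Series with a two-level index, it can be pivoted into a department × salary table with unstack. A sketch on toy data (not hr_data):

```python
import pandas as pd

toy = pd.DataFrame({
    'department': ['sales', 'sales', 'hr'],
    'salary':     ['low',   'high',  'low'],
})

# unstack moves the inner index level ('salary') into columns;
# fill_value=0 fills combinations that never occur
counts = toy.groupby(['department', 'salary']).size().unstack(fill_value=0)
```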
grouped.describe()
                  satisfaction_level                                                     last_evaluation            ...  promotion_last_5years       new_number_project
                   count      mean       std   min     25%    50%     75%   max     count      mean                 ...    75%  max     count      mean       std  min  25%  50%  75%  max
department salary
IT         high     83.0  0.638193  0.223749  0.15  0.5250  0.650  0.7800  0.99      83.0  0.716627                ...    0.0  0.0      83.0  4.012048  1.152875  2.0  3.0  4.0  5.0  6.0
           low     609.0  0.610099  0.258915  0.09  0.4100  0.650  0.8200  1.00     609.0  0.715665                ...    0.0  1.0     609.0  3.916256  1.289334  2.0  3.0  4.0  5.0  7.0
           medium  535.0  0.624187  0.243297  0.09  0.4900  0.660  0.8100  1.00     535.0  0.718187                ...    0.0  1.0     535.0  3.945794  1.235726  2.0  3.0  4.0  5.0  7.0
RandD      high     51.0  0.586667  0.228785  0.10  0.4400  0.600  0.7450  0.97      51.0  0.700588                ...    0.0  1.0      51.0  3.843137  1.137938  2.0  3.0  4.0  5.0  6.0
           low     364.0  0.623929  0.242586  0.09  0.4700  0.675  0.8200  1.00     364.0  0.714176                ...    0.0  1.0     364.0  3.903846  1.187208  2.0  3.0  4.0  5.0  7.0
           medium  372.0  0.620349  0.250293  0.09  0.4775  0.650  0.8300  1.00     372.0  0.711694                ...    0.0  1.0     372.0  4.043011  1.262471  2.0  3.0  4.0  5.0  7.0
accounting high     74.0  0.614054  0.237319  0.11  0.5000  0.620  0.8300  0.97      74.0  0.724595                ...    0.0  1.0      74.0  3.986486  1.140695  2.0  3.0  4.0  5.0  6.0
           low     358.0  0.574162  0.252250  0.09  0.4000  0.590  0.7800  1.00     358.0  0.713883                ...    0.0  1.0     358.0  3.905028  1.350144  2.0  3.0  4.0  5.0  7.0
           medium  335.0  0.583642  0.262273  0.09  0.4000  0.630  0.8000  1.00     335.0  0.720299                ...    0.0  1.0     335.0  3.940299  1.295785  2.0  3.0  4.0  5.0  7.0
hr         high     45.0  0.673111  0.250616  0.09  0.5500  0.730  0.8600  0.99      45.0  0.743778                ...    0.0  1.0      45.0  4.066667  1.136182  2.0  4.0  4.0  5.0  6.0
           low     335.0  0.608657  0.239902  0.09  0.4400  0.620  0.8100  1.00     335.0  0.717821                ...    0.0  1.0     335.0  3.797015  1.235935  2.0  3.0  4.0  5.0  7.0
           medium  359.0  0.580306  0.253324  0.09  0.4050  0.600  0.7950  1.00     359.0  0.696100                ...    0.0  1.0     359.0  3.710306  1.330636  2.0  3.0  4.0  4.0  7.0
management high    225.0  0.653333  0.194436  0.14  0.5300  0.680  0.8000  0.98     225.0  0.715822                ...    0.0  1.0     225.0  3.844444  1.152551  2.0  3.0  4.0  5.0  6.0
           low     180.0  0.610722  0.254620  0.09  0.4350  0.655  0.8050  1.00     180.0  0.712833                ...    0.0  1.0     180.0  3.900000  1.268880  2.0  3.0  4.0  5.0  7.0
           medium  225.0  0.597867  0.233161  0.09  0.4800  0.630  0.7600  1.00     225.0  0.741111                ...    0.0  1.0     225.0  4.084444  1.234539  2.0  3.0  4.0  5.0  7.0
marketing  high     80.0  0.605250  0.255784  0.14  0.4350  0.610  0.8275  1.00      80.0  0.663625                ...    0.0  1.0      80.0  3.600000  1.164887  2.0  3.0  3.0  4.0  6.0
           low     402.0  0.602910  0.256258  0.09  0.4200  0.630  0.8100  0.99     402.0  0.727587                ...    0.0  1.0     402.0  3.883085  1.322645  2.0  3.0  4.0  5.0  7.0
           medium  376.0  0.638218  0.227333  0.09  0.4900  0.670  0.8200  1.00     376.0  0.714495                ...    0.0  1.0     376.0  3.776596  1.181224  2.0  3.0  4.0  5.0  7.0
product_mng high    68.0  0.614118  0.248038  0.09  0.4500  0.625  0.8225  0.99      68.0  0.665735                ...    0.0  0.0      68.0  3.882353  1.165799  2.0  3.0  4.0  5.0  6.0
           low     451.0  0.620909  0.248181  0.09  0.4450  0.650  0.8300  1.00     451.0  0.725831                ...    0.0  0.0     451.0  3.937916  1.335217  2.0  3.0  4.0  5.0  7.0
           medium  383.0  0.619112  0.234720  0.09  0.4500  0.640  0.8100  1.00     383.0  0.710418                ...    0.0  0.0     383.0  3.895561  1.271745  2.0  3.0  4.0  5.0  7.0
sales      high    269.0  0.648959  0.236264  0.10  0.5200  0.690  0.8200  1.00     269.0  0.699814                ...    0.0  1.0     269.0  3.988848  1.137827  2.0  3.0  4.0  5.0  7.0
           low    2099.0  0.600838  0.251686  0.09  0.4200  0.630  0.8100  1.00    2099.0  0.709247                ...    0.0  1.0    2099.0  3.870414  1.352319  2.0  3.0  4.0  5.0  7.0
           medium 1772.0  0.625327  0.249707  0.09  0.4500  0.660  0.8300  1.00    1772.0  0.711778                ...    0.0  1.0    1772.0  3.930023  1.224935  2.0  3.0  4.0  5.0  7.0
support    high    141.0  0.655035  0.225644  0.15  0.5100  0.670  0.8600  0.99     141.0  0.714113                ...    0.0  0.0     141.0  3.943262  1.080827  2.0  3.0  4.0  5.0  6.0
           low    1146.0  0.591710  0.255661  0.09  0.4000  0.630  0.8000  1.00    1146.0  0.719494                ...    0.0  1.0    1146.0  3.902269  1.339052  2.0  3.0  4.0  5.0  7.0
           medium  942.0  0.645149  0.234231  0.09  0.5100  0.680  0.8300  1.00     942.0  0.728854                ...    0.0  1.0     942.0  3.951168  1.171642  2.0  3.0  4.0  5.0  7.0
technical  high    201.0  0.625970  0.219279  0.10  0.4900  0.640  0.7900  1.00     201.0  0.699453                ...    0.0  1.0     201.0  3.781095  1.136592  2.0  3.0  4.0  5.0  6.0
           low    1372.0  0.594322  0.264359  0.09  0.4100  0.630  0.8200  1.00    1372.0  0.723367                ...    0.0  1.0    1372.0  4.035714  1.327829  2.0  3.0  4.0  5.0  7.0
           medium 1147.0  0.620968  0.246691  0.09  0.4500  0.660  0.8300  1.00    1147.0  0.722180                ...    0.0  1.0    1147.0  3.993897  1.298730  2.0  3.0  4.0  5.0  7.0
30 rows × 72 columns
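The full describe() output is very wide (72 columns here), so it is often easier to select one column before describing. A sketch on toy data (not hr_data):

```python
import pandas as pd

toy = pd.DataFrame({
    'dept':  ['a', 'a', 'b', 'b'],
    'hours': [100, 200, 150, 250],
})

# describe() on a single selected column gives one row of summary
# statistics (count, mean, std, quartiles, ...) per group
desc = toy.groupby('dept')['hours'].describe()
```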
grouped = hr_data.groupby(['department'])
grouped.agg(mean_projects=('number_project', 'mean'), mean_satisfaction=('satisfaction_level', 'mean'))
             mean_projects  mean_satisfaction
department
IT                3.816626           0.618142
RandD             3.853875           0.619822
accounting        3.825293           0.582151
hr                3.654939           0.598809
management        3.860317           0.621349
marketing         3.687646           0.618601
product_mng       3.807095           0.619634
sales             3.776329           0.614447
support           3.803948           0.618300
technical         3.877941           0.607897
Transformation
Unlike agg, the transform method returns a result indexed the same as the original data, applying the function group-wise; here it simply adds 2 to every numeric value.
grouped.transform(lambda x: x + 2)
       satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  new_number_project
0                    2.38             2.53               4                   159                   5              2     3                      2                   4
1                    2.80             2.86               7                   264                   8              2     3                      2                   7
2                    2.11             2.88               9                   274                   6              2     3                      2                   9
3                    2.72             2.87               7                   225                   7              2     3                      2                   7
4                    2.37             2.52               4                   161                   5              2     3                      2                   4
...                   ...              ...             ...                   ...                 ...            ...   ...                    ...                 ...
14994                2.40             2.57               4                   153                   5              2     3                      2                   4
14995                2.37             2.48               4                   162                   5              2     3                      2                   4
14996                2.37             2.53               4                   145                   5              2     3                      2                   4
14997                2.11             2.96               8                   282                   6              2     3                      2                   8
14998                2.37             2.52               4                   160                   5              2     3                      2                   4
14999 rows × 9 columns
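A more typical use of transform is broadcasting a group-level statistic back to every row, for example attaching each row's group mean as a new column. A sketch on toy data (not hr_data):

```python
import pandas as pd

toy = pd.DataFrame({
    'dept':  ['a', 'a', 'b', 'b'],
    'hours': [100, 200, 150, 250],
})

# transform returns a result aligned to the original index,
# so each row receives its own group's mean
toy['dept_mean_hours'] = toy.groupby('dept')['hours'].transform('mean')
```

This alignment is what makes transform useful for group-wise normalization or filling missing values with per-group statistics.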