Pandas Learning 2

Working on Text Data


  1. Introduction
  2. Concat, Split & Join
  3. Contains, Find
  4. Cleaning Punctuation
  5. Checking Contents of String Data
  6. String Manipulation

Introduction

  • Columns often contain string data.
  • String data is not always in the best of hygiene.
  • Pandas provides a rich set of .str methods to make these tasks easier.
import pandas as pd
df = pd.DataFrame({'A':['a','b','c','d']})
df
A
0 a
1 b
2 c
3 d

Concat, Split & Join

df.A.str.cat(sep=' ')
'a b c d'
df.A.str.cat(['1','2','3','4'])
0    a1
1    b2
2    c3
3    d4
Name: A, dtype: object
horror_data = pd.read_csv('../Data/horror-train.csv')
horror_data
id text author
0 id26305 This process, however, afforded me no means of... EAP
1 id17569 It never once occurred to me that the fumbling... HPL
2 id11008 In his left hand was a gold snuff box, from wh... EAP
3 id27763 How lovely is spring As we looked from Windsor... MWS
4 id12958 Finding nothing else, not even gold, the Super... HPL
... ... ... ...
19574 id17718 I could have fancied, while I looked at it, th... EAP
19575 id08973 The lids clenched themselves together as if in... EAP
19576 id05267 Mais il faut agir that is to say, a Frenchman ... EAP
19577 id17513 For an item of news like this, it strikes us i... EAP
19578 id00393 He laid a gnarled claw on my shoulder, and it ... HPL

19579 rows × 3 columns

horror_data = horror_data.iloc[:10]
horror_data
id text author
0 id26305 This process, however, afforded me no means of... EAP
1 id17569 It never once occurred to me that the fumbling... HPL
2 id11008 In his left hand was a gold snuff box, from wh... EAP
3 id27763 How lovely is spring As we looked from Windsor... MWS
4 id12958 Finding nothing else, not even gold, the Super... HPL
5 id22965 A youth passed in solitude, my best years spen... MWS
6 id09674 The astronomer, perhaps, at this point, took r... EAP
7 id13515 The surcingle hung in ribands from my body. EAP
8 id19322 I knew that you could not say to yourself 'ste... EAP
9 id00912 I confess that neither the structure of langua... MWS
horror_data.text.str.split()
0    [This, process,, however,, afforded, me, no, m...
1    [It, never, once, occurred, to, me, that, the,...
2    [In, his, left, hand, was, a, gold, snuff, box...
3    [How, lovely, is, spring, As, we, looked, from...
4    [Finding, nothing, else,, not, even, gold,, th...
5    [A, youth, passed, in, solitude,, my, best, ye...
6    [The, astronomer,, perhaps,, at, this, point,,...
7    [The, surcingle, hung, in, ribands, from, my, ...
8    [I, knew, that, you, could, not, say, to, your...
9    [I, confess, that, neither, the, structure, of...
Name: text, dtype: object
horror_data.text.str.split(expand=True,n=20)
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 This process, however, afforded me no means of ascertaining the ... of my dungeon; as I might make its circuit, and return to the point whence I set out, with...
1 It never once occurred to me that the fumbling might ... a mere mistake. None None None None None None None
2 In his left hand was a gold snuff box, from ... as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly ...
3 How lovely is spring As we looked from Windsor Terrace ... the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in...
4 Finding nothing else, not even gold, the Superintendent abandoned his ... but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.
5 A youth passed in solitude, my best years spent under ... gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an inte...
6 The astronomer, perhaps, at this point, took refuge in the ... of non luminosity; and here analogy was suddenly let fall.
7 The surcingle hung in ribands from my body. None None ... None None None None None None None None None None
8 I knew that you could not say to yourself 'stereotomy' ... being brought to think of atomies, and thus of the theories of Epicurus; and since, when we d...
9 I confess that neither the structure of languages, nor the ... of governments, nor the politics of various states possessed attractions for me.

10 rows × 21 columns

res = horror_data.text.str.split()
res.str.join(sep=' ')
0    This process, however, afforded me no means of...
1    It never once occurred to me that the fumbling...
2    In his left hand was a gold snuff box, from wh...
3    How lovely is spring As we looked from Windsor...
4    Finding nothing else, not even gold, the Super...
5    A youth passed in solitude, my best years spen...
6    The astronomer, perhaps, at this point, took r...
7          The surcingle hung in ribands from my body.
8    I knew that you could not say to yourself 'ste...
9    I confess that neither the structure of langua...
Name: text, dtype: object

Understanding contains, find & index

horror_data.text.str.contains('This')
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: text, dtype: bool
horror_data.text.str.contains('his')
0     True
1    False
2     True
3    False
4     True
5     True
6     True
7    False
8     True
9    False
Name: text, dtype: bool
horror_data.text.str.find('This')
0    0
1   -1
2   -1
3   -1
4   -1
5   -1
6   -1
7   -1
8   -1
9   -1
Name: text, dtype: int64
horror_data[:5].text.str.index('is')
0     2
1    64
2     4
3    11
4    67
Name: text, dtype: int64
df = pd.DataFrame({'Name':['Abc','Def','Jkl'], 'Email':['awidgmail.com', '[email protected]', '[email protected]']})
df
Name Email
0 Abc awidgmail.com
1 Def [email protected]
2 Jkl [email protected]
df.Email.str.contains(r'\w+@\w+')
0    False
1     True
2     True
Name: Email, dtype: bool
df.Email.str.replace(pat='@',repl='&')
0     awidgmail.com
1     def&gmail.com
2     jkl&yahoo.com
Name: Email, dtype: object

Cleaning Punctuation

import string
table = str.maketrans('', '', string.punctuation)
horror_data.text.str.translate(table)
0    This process however afforded me no means of a...
1    It never once occurred to me that the fumbling...
2    In his left hand was a gold snuff box from whi...
3    How lovely is spring As we looked from Windsor...
4    Finding nothing else not even gold the Superin...
5    A youth passed in solitude my best years spent...
6    The astronomer perhaps at this point took refu...
7           The surcingle hung in ribands from my body
8    I knew that you could not say to yourself ster...
9    I confess that neither the structure of langua...
Name: text, dtype: object
horror_data_tf = horror_data.text.str.translate(table)
horror_data_tf = horror_data_tf.str.lower()

Checking String Contents

  • isalnum() Equivalent to str.isalnum
  • isalpha() Equivalent to str.isalpha
  • isdigit() Equivalent to str.isdigit
  • isspace() Equivalent to str.isspace
  • islower() Equivalent to str.islower
  • isupper() Equivalent to str.isupper
  • istitle() Equivalent to str.istitle
  • isnumeric() Equivalent to str.isnumeric
  • isdecimal() Equivalent to str.isdecimal
df = pd.DataFrame({'A':['1234','123ab','abcde']})
df
A
0 1234
1 123ab
2 abcde
  • Return only the rows whose values consist entirely of digits
df[df.A.str.isdigit()]
A
0 1234
  • Return only the rows whose values consist entirely of letters
df[df.A.str.isalpha()]
A
2 abcde
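The three numeric checks above differ in how permissive they are. A small sketch for comparison (the Series values here are made up):

```python
import pandas as pd

# '123' plain digits, '1.5' has a dot, '²' superscript two, '½' vulgar fraction
s = pd.Series(['123', '1.5', '²', '½'])

decimal = s.str.isdecimal()  # strictest: only the characters 0-9
digit = s.str.isdigit()      # also accepts superscripts such as '²'
numeric = s.str.isnumeric()  # most permissive: also fractions such as '½'
```

So every isdecimal string is also isdigit, and every isdigit string is also isnumeric, but not the other way around.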

String Manipulation

  • slice() Slice each string in the Series
  • slice_replace() Replace slice in each string with passed value
  • count() Count occurrences of pattern
  • startswith() Equivalent to str.startswith(pat) for each element
  • endswith() Equivalent to str.endswith(pat) for each element
  • findall() Compute list of all occurrences of pattern/regex for each string
  • match() Call re.match on each element, returning matched groups as list
  • extract() Call re.search on each element, returning DataFrame with one row for each element and one column for each regex
  • extractall() Like extract(), but returns one row per match, indexed by element and match number
df = pd.DataFrame({'Name':['Rush','Riba','Kunal','Pruthvi'],
                   'Email':['[email protected]','[email protected]','[email protected]','[email protected]']})
df
Name Email
0 Rush [email protected]
1 Riba [email protected]
2 Kunal [email protected]
3 Pruthvi [email protected]
df['Username'] = df.Email.str.slice(start = 0, step=2, stop=-11)
df
Name Email Username
0 Rush [email protected] rs
1 Riba [email protected] rb
2 Kunal [email protected] knl
3 Pruthvi [email protected] puhi
df['UpdatedEmail'] = df.Email.str.slice_replace(start=-11, repl='@zekelabs.com')
df
Name Email Username UpdatedEmail
0 Rush [email protected] rs [email protected]
1 Riba [email protected] rb [email protected]
2 Kunal [email protected] knl [email protected]
3 Pruthvi [email protected] puhi [email protected]
  • Altering value
df.at[2,'Email'] = '[email protected]'
df
Name Email Username UpdatedEmail
0 Rush [email protected] rs [email protected]
1 Riba [email protected] rb [email protected]
2 Kunal [email protected] knl [email protected]
3 Pruthvi [email protected] puhi [email protected]
help(pd.DataFrame.at)
Help on property:

    Access a single value for a row/column label pair.
    
    Similar to ``loc``, in that both provide label-based lookups. Use
    ``at`` if you only need to get or set a single value in a DataFrame
    or Series.
    
    Raises
    ------
    KeyError
        When label does not exist in DataFrame
    
    See Also
    --------
    DataFrame.iat : Access a single value for a row/column pair by integer
        position.
    DataFrame.loc : Access a group of rows and columns by label(s).
    Series.at : Access a single value using a label.
    
    Examples
    --------
    >>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
    ...                   index=[4, 5, 6], columns=['A', 'B', 'C'])
    >>> df
        A   B   C
    4   0   2   3
    5   0   4   1
    6  10  20  30
    
    Get value at specified row/column pair
    
    >>> df.at[4, 'B']
    2
    
    Set value at specified row/column pair
    
    >>> df.at[4, 'B'] = 10
    >>> df.at[4, 'B']
    10
    
    Get value within a Series
    
    >>> df.loc[5].at['B']
    4
  • Filtering based on domain name
df.Email.str.match(r'[\w]*@edyoda.com')
0     True
1     True
2    False
3     True
Name: Email, dtype: bool
  • Extract text based on certain pattern
df.Email.str.extract(r'([\w]*)@([\w.]*)')
0 1
0 rush edyoda.com
1 riba edyoda.com
2 kunal everywhere.com
3 pruthvi edyoda.com
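extract() returns only the first match per element; extractall(), listed above but not demonstrated, returns every match. A minimal sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series(['a1a2', 'b1', 'c0c3'])

# One row per match; the result is indexed by (element index, match number),
# with one column per capture group
matches = s.str.extractall(r'([a-z])(\d)')
```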

Working on Missing Data


  1. Detect missing & existing values.
  2. Return a new Series with missing values removed.
  3. Fill NA/NaN values using the specified method.
  4. Interpolate values according to different methods.

1. Detect missing & existing values

  1. None and numpy.NaN are considered missing values
  2. An empty string is still considered a non-null value
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[1,2,None], 'B':[2, np.NaN, 3]})
df
A B
0 1.0 2.0
1 2.0 NaN
2 NaN 3.0
df.isna()
A B
0 False False
1 False True
2 True False
df.isna().sum()
A    1
B    1
dtype: int64
df.isna().any()
A    True
B    True
dtype: bool
df.isna().sum().sum()
2
  • isnull is an alias of isna
df.isnull()
A B
0 False False
1 False True
2 True False
df = pd.DataFrame({'A':[1,'',None], 'B':[2, np.NaN, 3]})
df
A B
0 1 2.0
1 NaN
2 None 3.0
df.isna()
A B
0 False False
1 False True
2 True False
df.isnull()  # same result as isna() above
A B
0 False False
1 False True
2 True False
  • Handling Empty Strings
df.replace('', np.NaN)  # frequently used to normalize empty strings
A B
0 1.0 2.0
1 NaN NaN
2 NaN 3.0
  • Finding non-null values
df.notna()
A B
0 True True
1 True False
2 False True
  • Filtering data based on series
df[df.A.notna()]
A B
0 1 2.0
1 NaN
df.A.notna()
0     True
1     True
2    False
Name: A, dtype: bool

Return a new Series with missing values removed.

  • Dropping rows which have any missing values
df = pd.DataFrame({'A':[1,'',None], 'B':[2, np.NaN, 3], 'C':[3,4,5]})
df
A B C
0 1 2.0 3
1 NaN 4
2 None 3.0 5
df.dropna()
A B C
0 1 2.0 3
  • Dropping columns which have null values
df.dropna(axis=1)
C
0 3
1 4
2 5

Filling missing values

df.fillna(0)
A B C
0 1 2.0 3
1 0.0 4
2 0 3.0 5
df.fillna({'A':10,'B':11})
#df.fillna({'A':df['A'].mean(),'B':df['B'].mode()})
A B C
0 1 2.0 3
1 11.0 4
2 10 3.0 5
  • Values can be backward fill, forward fill
df.fillna(method='bfill')
A B C
0 1 2.0 3
1 3.0 4
2 None 3.0 5
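Forward and backward fill can also be called directly. A small sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

forward = s.ffill()   # propagate the last valid value forward
backward = s.bfill()  # pull the next valid value backward
```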

Interpolate missing values based on different methods

df = pd.DataFrame({'Name':['Rush','Riba','Kunal','Pruthvi'],
                   'Email':['[email protected]','[email protected]','[email protected]','[email protected]'],
                   'Age':[33,31,None,18]})
df
Name Email Age
0 Rush [email protected] 33.0
1 Riba [email protected] 31.0
2 Kunal [email protected] NaN
3 Pruthvi [email protected] 18.0
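The section sets up a missing Age but never calls interpolate(); a minimal sketch, assuming the default 'linear' method:

```python
import pandas as pd

age = pd.Series([33.0, 31.0, None, 18.0])

# Linear interpolation fills the gap with the midpoint of 31 and 18
filled = age.interpolate()
```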

Styling Pandas Table


  1. applymap for applying a function to the entire table
  2. apply for applying a function column-wise
  3. Highlighting null values
  4. Applying colors to a subset of data

import pandas as pd
import numpy as np
df = pd.read_csv('SalesTrainingDataset.csv')
df = df.iloc[:100,list(range(10))]
df

Applymap

  • Apply styling to the complete table
  • The function returns a CSS string (e.g. 'color: red')
def color_product_sales(val):
    if val > 10000:
        color = 'green'
    elif val < 2000:
        color = 'red'
    else:
        color = 'black'
    return 'color: %s' % color

df.style.applymap(color_product_sales)

Apply

  1. For a series-wise check, apply can be used.
  2. The function argument will be a Series.
  3. By default, it is applied to columns.
  4. With axis=1, it is applied to rows.
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

df.style.apply(highlight_max)
  • Chaining is also supported
df.style.applymap(color_product_sales).apply(highlight_max)

Highlighting Null Values

df.style.highlight_null(null_color='red')

Dealing with subset of Data

  • Selecting subset of columns
df.style.apply(highlight_max, subset=['Outcome_M8','Outcome_M9'])
  • Selecting subset of rows
df.style.apply(highlight_max, subset=pd.IndexSlice[2:14,:], axis=1)
df.style.apply(highlight_max, subset=pd.IndexSlice[2:14, ['Outcome_M6','Outcome_M7','Outcome_M8','Outcome_M9']], axis=1)

Pandas for Computation


  1. Percent change
  2. Covariance
  3. Correlation
  4. Data Ranking
  5. Window Functions
  6. Time aware rolling
  7. Rolling vs Expanding

Statistical Functions

  1. Percent Change - Series and DataFrame have a method pct_change() to compute the percent change over a given number of periods
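Before running it on random sales data, the formula — (current − previous) / previous — can be checked on a tiny made-up Series:

```python
import pandas as pd

s = pd.Series([100, 110, 121])

# The first element has no predecessor, so its change is NaN;
# 100 -> 110 and 110 -> 121 are both +10%
change = s.pct_change()
```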
import pandas as pd
import numpy as np

sales_data = pd.DataFrame(data=np.random.randint(1,100,(10,4)), 
                          columns=['Tea','Milk','Carpet','Cream'], 
                          index=pd.Series(pd.period_range('1/1/2011', freq='M', periods=10)))
sales_data
Tea Milk Carpet Cream
2011-01 58 81 8 49
2011-02 63 29 39 70
2011-03 17 16 17 81
2011-04 26 9 23 53
2011-05 95 97 82 45
2011-06 52 12 72 99
2011-07 66 93 84 15
2011-08 92 76 75 96
2011-09 29 19 46 35
2011-10 83 94 2 4
  • Changes in monthly sales data
sales_data.pct_change(periods=1).round(4)*100
Tea Milk Carpet Cream
2011-01 NaN NaN NaN NaN
2011-02 8.62 -64.20 387.50 42.86
2011-03 -73.02 -44.83 -56.41 15.71
2011-04 52.94 -43.75 35.29 -34.57
2011-05 265.38 977.78 256.52 -15.09
2011-06 -45.26 -87.63 -12.20 120.00
2011-07 26.92 675.00 16.67 -84.85
2011-08 39.39 -18.28 -10.71 540.00
2011-09 -68.48 -75.00 -38.67 -63.54
2011-10 186.21 394.74 -95.65 -88.57

Covariance & Correlation

Covariance measures how much two random variables vary together; cov() calculates the covariance between series.

A correlation coefficient puts a number on the strength of the relationship. Correlation coefficients range between -1 and 1: 0 means no linear relationship between the variables at all, while -1 or 1 means a perfect negative or positive correlation.
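Since the notebook's own frame is random, the relationship is easier to see on a deterministic, made-up frame (df2 is hypothetical):

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 4, 6, 8]})  # B = 2 * A

cov_ab = df2.cov().loc['A', 'B']    # sample covariance: 2 * Var(A) = 10/3
corr_ab = df2.corr().loc['A', 'B']  # B is a perfect linear function of A, so 1.0
```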

df = pd.DataFrame(np.random.randint(10,20,(10,2)), columns=['A','B'])
df
A B
0 13 13
1 19 18
2 17 10
3 16 15
4 12 10
5 13 10
6 18 17
7 12 11
8 14 11
9 15 17

df.cov()

df.corr()
A B
A 1.0000 0.6942
B 0.6942 1.0000
#help(df.corr)
'''
Help on method corr in module pandas.core.frame:

corr(method='pearson', min_periods=1) method of pandas.core.frame.DataFrame instance
    Compute pairwise correlation of columns, excluding NA/null values.

    Parameters
    ----------
    method : {'pearson', 'kendall', 'spearman'} or callable
        * pearson : standard correlation coefficient
        * kendall : Kendall Tau correlation coefficient
        * spearman : Spearman rank correlation
        * callable: callable with input two 1d ndarrays and returning a float.
          Note that the returned matrix from corr will have 1 along the
          diagonal and will be symmetric regardless of the callable's
          behavior.

    min_periods : int, optional
        Minimum number of observations required per pair of columns
        to have a valid result. Currently only available for Pearson
        and Spearman correlation.
'''
  • The rank method produces a data ranking, with ties being assigned the mean of the ranks (by default) for the group:
df
df['A_Rank'] = df.A.rank()
df
A B A_Rank
0 13 13 3.5
1 19 18 10.0
2 17 10 8.0
3 16 15 7.0
4 12 10 1.5
5 13 10 3.5
6 18 17 9.0
7 12 11 1.5
8 14 11 5.0
9 15 17 6.0
#help(df.A.rank)
'''
Help on method rank in module pandas.core.generic:

rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False) method of pandas.core.series.Series instance
    Compute numerical data ranks (1 through n) along axis.

    By default, equal values are assigned a rank that is the average of the
    ranks of those values.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Index to direct ranking.
    method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
        How to rank the group of records that have the same value (i.e. ties):

        * average: average rank of the group
        * min: lowest rank in the group
        * max: highest rank in the group
        * first: ranks assigned in order they appear in the array
        * dense: like 'min', but rank always increases by 1 between groups.
    numeric_only : bool, optional
        For DataFrame objects, rank only numeric columns if set to True.
    na_option : {'keep', 'top', 'bottom'}, default 'keep'
        How to rank NaN values:

        * keep: assign NaN rank to NaN values
        * top: assign smallest rank to NaN values if ascending
        * bottom: assign highest rank to NaN values if ascending.
    ascending : bool, default True
        Whether or not the elements should be ranked in ascending order.
    pct : bool, default False
        Whether or not to display the returned rankings in percentile form.
'''

Window Functions

  1. For working with data, a number of window functions are provided for computing common window or rolling statistics.
  2. Among these are count, sum, mean, median, correlation, variance, covariance, standard deviation, skewness, and kurtosis.
sales_data = pd.read_csv('../Data/sales-data.csv')
sales_data
Month Sales
0 1-01 266.0
1 1-02 145.9
2 1-03 183.1
3 1-04 119.3
4 1-05 180.3
5 1-06 168.5
6 1-07 231.8
7 1-08 224.5
8 1-09 192.8
9 1-10 122.9
10 1-11 336.5
11 1-12 185.9
12 2-01 194.3
13 2-02 149.5
14 2-03 210.1
15 2-04 273.3
16 2-05 191.4
17 2-06 287.0
18 2-07 226.0
19 2-08 303.6
20 2-09 289.9
21 2-10 421.6
22 2-11 264.5
23 2-12 342.3
24 3-01 339.7
25 3-02 440.4
26 3-03 315.9
27 3-04 439.3
28 3-05 401.3
29 3-06 437.4
30 3-07 575.5
31 3-08 407.6
32 3-09 682.0
33 3-10 475.3
34 3-11 581.3
35 3-12 646.9
r = sales_data.Sales.rolling(window=5)
r.count()
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     5.0
6     5.0
7     5.0
8     5.0
9     5.0
10    5.0
11    5.0
12    5.0
13    5.0
14    5.0
15    5.0
16    5.0
17    5.0
18    5.0
19    5.0
20    5.0
21    5.0
22    5.0
23    5.0
24    5.0
25    5.0
26    5.0
27    5.0
28    5.0
29    5.0
30    5.0
31    5.0
32    5.0
33    5.0
34    5.0
35    5.0
Name: Sales, dtype: float64
r.max()
0       NaN
1       NaN
2       NaN
3       NaN
4     266.0
5     183.1
6     231.8
7     231.8
8     231.8
9     231.8
10    336.5
11    336.5
12    336.5
13    336.5
14    336.5
15    273.3
16    273.3
17    287.0
18    287.0
19    303.6
20    303.6
21    421.6
22    421.6
23    421.6
24    421.6
25    440.4
26    440.4
27    440.4
28    440.4
29    440.4
30    575.5
31    575.5
32    682.0
33    682.0
34    682.0
35    682.0
Name: Sales, dtype: float64

Time aware rolling

dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                    index=pd.date_range('20130101 09:00:00',
                                        periods=5,
                                        freq='s'))
dft
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 2.0
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 4.0
dft.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 4.0
r.agg(np.sum)
0        NaN
1        NaN
2        NaN
3        NaN
4      894.6
5      797.1
6      883.0
7      924.4
8      997.9
9      940.5
10    1108.5
11    1062.6
12    1032.4
13     989.1
14    1076.3
15    1013.1
16    1018.6
17    1111.3
18    1187.8
19    1281.3
20    1297.9
21    1528.1
22    1505.6
23    1621.9
24    1658.0
25    1808.5
26    1702.8
27    1877.6
28    1936.6
29    2034.3
30    2169.4
31    2261.1
32    2503.8
33    2577.8
34    2721.7
35    2793.1
Name: Sales, dtype: float64
r.agg([np.sum, np.mean])
sum mean
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 894.6 178.92
5 797.1 159.42
6 883.0 176.60
7 924.4 184.88
8 997.9 199.58
9 940.5 188.10
10 1108.5 221.70
11 1062.6 212.52
12 1032.4 206.48
13 989.1 197.82
14 1076.3 215.26
15 1013.1 202.62
16 1018.6 203.72
17 1111.3 222.26
18 1187.8 237.56
19 1281.3 256.26
20 1297.9 259.58
21 1528.1 305.62
22 1505.6 301.12
23 1621.9 324.38
24 1658.0 331.60
25 1808.5 361.70
26 1702.8 340.56
27 1877.6 375.52
28 1936.6 387.32
29 2034.3 406.86
30 2169.4 433.88
31 2261.1 452.22
32 2503.8 500.76
33 2577.8 515.56
34 2721.7 544.34
35 2793.1 558.62

Rolling vs Expanding

data = pd.DataFrame([
    ['a', 1],
    ['a', 2],
    ['a', 3],
    ['b', 5],
    ['b', 6],
    ['b', 7],
    ['b', 8],
    ['c', 10],
    ['c', 11],
    ['c', 12],
    ['c', 13]
], columns = ['category', 'value'])
data
category value
0 a 1
1 a 2
2 a 3
3 b 5
4 b 6
5 b 7
6 b 8
7 c 10
8 c 11
9 c 12
10 c 13
data.value.expanding(1).sum()
0      1.0
1      3.0
2      6.0
3     11.0
4     17.0
5     24.0
6     32.0
7     42.0
8     53.0
9     65.0
10    78.0
Name: value, dtype: float64
data.value.rolling(2).sum()
0      NaN
1      3.0
2      5.0
3      8.0
4     11.0
5     13.0
6     15.0
7     18.0
8     21.0
9     23.0
10    25.0
Name: value, dtype: float64
  1. Expanding - If we use an expanding window with initial size 1, the window in the first step contains only the first row. In the second step, it contains both the first and the second row. In every step, one additional row is added to the window, and the aggregating function is recalculated.

  2. Rolling - Rolling windows are different. Here we specify the size of a window that moves. What happens when we set the rolling window size to 2?

    • In the first step, the window contains the first row and one undefined row, so the result is NaN.

    • In the second step, the window moves and now contains the first and the second row, so the aggregate function can be calculated. In this example, it is the sum of both rows.

    • In the third step, the window moves again and no longer contains the first row; it now calculates the sum of the second and the third row.

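The sample data above carries a category column that the expanding example does not use; one common pattern (a sketch, not from the original) is to restart the expanding window within each group:

```python
import pandas as pd

data = pd.DataFrame({'category': ['a', 'a', 'b', 'b', 'b'],
                     'value': [1, 2, 5, 6, 7]})

# Expanding sum computed independently within each category,
# aligned back to the original row order via transform
per_group = data.groupby('category')['value'].transform(lambda s: s.expanding().sum())
```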

Data Transformation using Map, Apply & GroupBy


  1. Transforming Series using Map
  2. Transforming across multiple Series using apply
  3. GroupBy - Splitting, Applying & Combine

import pandas as pd
import numpy as np
hr_data = pd.read_csv('../Data/HR_comma_sep.csv.txt')
hr_data.rename(columns={'sales':'department'}, inplace=True)

Transforming Series using Map

hr_data.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
  • Use map to transform the left column into categorical labels
hr_data['left_categorical'] = hr_data.left.map({1:'True',0:'False'})
hr_data.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary left_categorical
0 0.38 0.53 2 157 3 0 1 0 sales low True
1 0.80 0.86 5 262 6 0 1 0 sales medium True
2 0.11 0.88 7 272 4 0 1 0 sales medium True
3 0.72 0.87 5 223 5 0 1 0 sales low True
4 0.37 0.52 2 159 3 0 1 0 sales low True

Transforming data across multiple Series

  1. If satisfaction_level > .9, increase number_project by 1
  2. map cannot operate across multiple columns; we need apply for that
def increase_proj(r):
    if r.satisfaction_level > .9:
        return r.number_project + 1
    else:
        return r.number_project

hr_data['new_number_project'] = hr_data.apply(increase_proj, axis=1)
  • Filtering all the rows for which this happened
hr_data[hr_data.number_project != hr_data.new_number_project].head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary left_categorical new_number_project
7 0.92 0.85 5 259 5 0 1 0 sales low True 6
106 0.91 1.00 4 257 5 0 1 0 accounting medium True 5
191 0.92 0.87 4 226 6 1 1 0 technical medium True 5
231 0.92 0.99 5 255 6 0 1 0 sales low True 6
352 0.91 0.91 4 262 6 0 1 0 support low True 5

GroupBy

grouped = hr_data.groupby(['department'])
  • Compute first & last of group values
grouped.first()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary left_categorical new_number_project
department
IT 0.11 0.93 7 308 4 0 1 0 medium True 7
RandD 0.12 1.00 3 278 4 0 1 0 medium True 3
accounting 0.41 0.46 2 128 3 0 1 0 low True 2
hr 0.45 0.57 2 134 3 0 1 0 low True 2
management 0.85 0.91 5 226 5 0 1 0 medium True 5
marketing 0.40 0.54 2 137 3 0 1 0 medium True 2
product_mng 0.43 0.54 2 153 3 0 1 0 medium True 2
sales 0.38 0.53 2 157 3 0 1 0 low True 2
support 0.40 0.55 2 147 3 0 1 0 low True 2
technical 0.10 0.94 6 255 4 0 1 0 low True 6
grouped.last()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary left_categorical new_number_project
department
IT 0.90 0.92 4 271 5 0 1 0 medium True 4
RandD 0.81 0.92 5 239 5 0 1 0 medium True 5
accounting 0.36 0.54 2 153 3 0 1 0 medium True 2
hr 0.40 0.47 2 144 3 0 1 0 medium True 2
management 0.42 0.57 2 147 3 1 1 0 low True 2
marketing 0.44 0.52 2 149 3 0 1 0 low True 2
product_mng 0.46 0.55 2 147 3 0 1 0 medium True 2
sales 0.39 0.45 2 140 3 0 1 0 medium True 2
support 0.37 0.52 2 158 3 0 1 0 low True 2
technical 0.43 0.57 2 159 3 1 1 0 low True 2
grouped.nth(2)
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years salary left_categorical new_number_project
department
IT 0.36 0.56 2 132 3 0 1 0 medium True 2
RandD 0.37 0.55 2 127 3 0 1 0 medium True 2
accounting 0.09 0.62 6 294 4 0 1 0 low True 6
hr 0.45 0.55 2 140 3 0 1 0 low True 2
management 0.42 0.48 2 129 3 0 1 0 low True 2
marketing 0.11 0.77 6 291 4 0 1 0 low True 6
product_mng 0.76 0.86 5 223 5 1 1 0 medium True 5
sales 0.11 0.88 7 272 4 0 1 0 medium True 7
support 0.40 0.54 2 148 3 0 1 0 low True 2
technical 0.45 0.50 2 126 3 0 1 0 low True 2
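first(), last() and nth() pick representative rows; the "combine" step of split-apply-combine is usually an aggregation. A sketch on a tiny made-up frame (toy is hypothetical, in case the HR csv is not at hand):

```python
import pandas as pd

toy = pd.DataFrame({'department': ['sales', 'sales', 'hr', 'hr'],
                    'hours': [150, 170, 130, 150]})

# Split by department, apply mean, combine into one value per group
dept_mean = toy.groupby('department')['hours'].mean()
```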
grouped.groups
{'IT': Int64Index([   61,    62,    63,    64,    65,    70,   138,   139,   140,
               141,
             ...
             14808, 14809, 14810, 14815, 14929, 14930, 14931, 14932, 14933,
             14938],
            dtype='int64', length=1227),
 'RandD': Int64Index([  301,   302,   303,   304,   305,   453,   454,   455,   456,
               457,
             ...
             14816, 14817, 14818, 14819, 14820, 14939, 14940, 14941, 14942,
             14943],
            dtype='int64', length=787),
 'accounting': Int64Index([   28,    29,    30,    79,   105,   106,   107,   155,   181,
               182,
             ...
             14849, 14850, 14851, 14896, 14897, 14898, 14946, 14972, 14973,
             14974],
            dtype='int64', length=767),
 'hr': Int64Index([   31,    32,    33,    34,   108,   109,   110,   111,   184,
               185,
             ...
             14854, 14855, 14899, 14900, 14901, 14902, 14975, 14976, 14977,
             14978],
            dtype='int64', length=739),
 'management': Int64Index([   60,    82,   137,   158,   213,   235,   290,   311,   366,
               387,
             ...
             14598, 14653, 14674, 14729, 14750, 14805, 14826, 14873, 14928,
             14949],
            dtype='int64', length=630),
 'marketing': Int64Index([   77,    83,    84,    85,   148,   149,   150,   151,   152,
               153,
             ...
             14827, 14828, 14829, 14874, 14875, 14876, 14944, 14950, 14951,
             14952],
            dtype='int64', length=858),
 'product_mng': Int64Index([   66,    67,    68,    69,    71,    72,    73,    74,    75,
                76,
             ...
             14737, 14738, 14811, 14812, 14813, 14814, 14934, 14935, 14936,
             14937],
            dtype='int64', length=902),
 'sales': Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                 9,
             ...
             14962, 14963, 14964, 14965, 14966, 14967, 14968, 14969, 14970,
             14971],
            dtype='int64', length=4140),
 'support': Int64Index([   46,    47,    48,    49,    50,    51,    52,    53,    54,
                55,
             ...
             14947, 14990, 14991, 14992, 14993, 14994, 14995, 14996, 14997,
             14998],
            dtype='int64', length=2229),
 'technical': Int64Index([   35,    36,    37,    38,    39,    40,    41,    42,    43,
                44,
             ...
             14980, 14981, 14982, 14983, 14984, 14985, 14986, 14987, 14988,
             14989],
            dtype='int64', length=2720)}
hr_data.groupby(['department','salary']).groups
{('IT',
  'high'): Int64Index([ 1281,  1359,  1437,  1515,  3192,  3193,  3194,  3195,  3200,
              3270,  3504,  3799,  3802,  4260,  4264,  4269,  4720,  5097,
              5098,  5557,  5558,  5559,  5560,  5561,  5634,  5712,  5790,
              5865,  5943,  6024,  6093,  6547,  6550,  7009,  7011,  7012,
              7087,  7474,  7544,  7845,  7846,  7847,  7848,  7849,  7998,
              8076,  8154,  8229,  8307,  8308,  8309,  8314,  8385,  8772,
              8917,  8919,  9375,  9835, 10213, 10593, 10671, 10672, 10673,
             10678, 10747, 10749, 10980, 11268, 11601, 11700, 11707, 12804,
             12882, 12883, 12884, 12889, 12958, 12960, 13191, 13479, 13812,
             13911, 13918],
            dtype='int64'),
 ('IT',
  'low'): Int64Index([  138,   139,   140,   141,   142,   147,   214,   215,   216,
               217,
             ...
             14731, 14732, 14733, 14734, 14806, 14807, 14808, 14809, 14810,
             14815],
            dtype='int64', length=609),
 ('IT',
  'medium'): Int64Index([   61,    62,    63,    64,    65,    70,   294,   295,   300,
               376,
             ...
             14511, 14587, 14663, 14739, 14929, 14930, 14931, 14932, 14933,
             14938],
            dtype='int64', length=535),
 ('RandD',
  'high'): Int64Index([ 1827,  1905,  1983,  3201,  3202,  3203,  3657,  3658,  3659,
              3660,  3661,  3735,  3813,  4044,  4729,  5185,  6025,  6026,
              6027,  6028,  6177,  6255,  6330,  6408,  6558,  7018,  7476,
              7477,  7552,  8009,  8315,  8316,  8317,  8318,  8773,  8774,
              8775,  8776,  8777,  8850,  8928,  9382,  9384, 10300, 11659,
             11661, 11662, 13870, 13872, 13873, 14941],
            dtype='int64'),
 ('RandD',
  'low'): Int64Index([  605,   833,   834,   835,   836,   837,   985,   986,   987,
               988,
             ...
             13213, 13214, 13215, 13216, 13217, 13871, 13874, 13875, 14816,
             14942],
            dtype='int64', length=364),
 ('RandD',
  'medium'): Int64Index([  301,   302,   303,   304,   305,   453,   454,   455,   456,
               457,
             ...
             14666, 14667, 14668, 14817, 14818, 14819, 14820, 14939, 14940,
             14943],
            dtype='int64', length=372),
 ('accounting',
  'high'): Int64Index([  384,  1632,  1710,  3234,  3235,  3236,  3387,  3540,  3664,
              3768,  4124,  4228,  4684,  4686,  4734,  4762,  5190,  5524,
              5525,  5526,  5982,  5983,  5984,  6060,  6488,  6564,  6592,
              6795,  7510,  7888,  8349,  8350,  8351,  8502,  8580,  8655,
              8780,  8883,  9084,  9340,  9613,  9614,  9799,  9801,  9843,
              9844,  9877,  9920,  9921,  9996,  9997, 10072, 10073, 10334,
             10636, 10637, 10638, 10866, 10944, 11101, 11214, 11971, 11972,
             12384, 12847, 12848, 12849, 13077, 13155, 13312, 13425, 14182,
             14183, 14595],
            dtype='int64'),
 ('accounting',
  'low'): Int64Index([   28,    29,    30,    79,   155,   224,   225,   232,   410,
               486,
             ...
             14621, 14697, 14698, 14699, 14773, 14774, 14775, 14849, 14850,
             14851],
            dtype='int64', length=358),
 ('accounting',
  'medium'): Int64Index([  105,   106,   107,   181,   182,   183,   258,   259,   260,
               308,
             ...
             14741, 14747, 14823, 14896, 14897, 14898, 14946, 14972, 14973,
             14974],
            dtype='int64', length=335),
 ('hr',
  'high'): Int64Index([  111,  1788,  1866,  3237,  3238,  3465,  3618,  3696,  3772,
              4233,  4687,  4690,  5219,  5527,  5528,  5985,  5986,  5987,
              5988,  6138,  6216,  6594,  7054,  7057,  8352,  8353,  8733,
              8811,  8812,  8887,  9343,  9802,  9805, 10338, 10639, 10640,
             10641, 10642, 12111, 12850, 12851, 12852, 12853, 14322, 14902],
            dtype='int64'),
 ('hr',
  'low'): Int64Index([   31,    32,    33,    34,   226,   227,   228,   565,   566,
               641,
             ...
             14245, 14437, 14438, 14439, 14776, 14777, 14852, 14853, 14854,
             14855],
            dtype='int64', length=335),
 ('hr',
  'medium'): Int64Index([  108,   109,   110,   184,   185,   186,   187,   261,   262,
               263,
             ...
             14744, 14778, 14779, 14899, 14900, 14901, 14975, 14976, 14977,
             14978],
            dtype='int64', length=359),
 ('management',
  'high'): Int64Index([ 1203,  2217,  3114,  3363,  3667,  4127,  4509,  5096,  5325,
              5556,
             ...
             14148, 14149, 14150, 14151, 14186, 14204, 14205, 14206, 14207,
             14208],
            dtype='int64', length=225),
 ('management',
  'low'): Int64Index([   82,   137,   158,   213,   235,   290,   366,   442,   463,
               518,
             ...
             14446, 14501, 14577, 14653, 14674, 14729, 14805, 14873, 14928,
             14949],
            dtype='int64', length=180),
 ('management',
  'medium'): Int64Index([   60,   311,   387,   539,   615,   691,   767,   843,   919,
               974,
             ...
             13727, 13881, 14005, 14089, 14157, 14271, 14522, 14598, 14750,
             14826],
            dtype='int64', length=225),
 ('marketing',
  'high'): Int64Index([  306,   540,   618,  2295,  3289,  3662,  3668,  3891,  4122,
              4128,  4129,  4130,  4434,  4587,  4588,  4732,  5194,  5655,
              6486,  6492,  6493,  6562,  6953,  6954,  6955,  7023,  7029,
              7107,  7260,  7480,  7488,  7942,  8013,  8404,  8474,  8482,
              8778,  9006,  9393,  9471,  9549,  9612,  9624,  9702,  9703,
              9842,  9919,  9995, 10071, 10312, 11099, 11105, 11106, 11107,
             11484, 11608, 11665, 11673, 11796, 11949, 11970, 11998, 12306,
             12540, 12618, 13310, 13316, 13317, 13318, 13695, 13819, 13876,
             13884, 14007, 14160, 14181, 14209, 14517, 14751, 14829],
            dtype='int64'),
 ('marketing',
  'low'): Int64Index([   83,    84,    85,   148,   149,   150,   151,   152,   153,
               159,
             ...
             14524, 14525, 14601, 14752, 14874, 14875, 14876, 14950, 14951,
             14952],
            dtype='int64', length=402),
 ('marketing',
  'medium'): Int64Index([   77,   382,   388,   389,   458,   464,   465,   466,   534,
               542,
             ...
             14669, 14675, 14676, 14677, 14745, 14753, 14821, 14827, 14828,
             14944],
            dtype='int64', length=376),
 ('product_mng',
  'high'): Int64Index([   72,  1593,  1671,  1749,  3196,  3197,  3198,  3199,  3348,
              3426,  3579,  3804,  4267,  4725,  5562,  5563,  6021,  6022,
              6023,  6097,  6099,  6553,  7015,  7548,  7850,  7851,  7852,
              7853,  8310,  8311,  8312,  8313,  8463,  8541,  8619,  8694,
              9379,  9840, 10674, 10675, 10676, 10677, 10827, 10905, 11175,
             11253, 11264, 11445, 11523, 11704, 11835, 11988, 12072, 12885,
             12886, 12887, 12888, 13038, 13116, 13386, 13464, 13475, 13656,
             13734, 13915, 14046, 14199, 14283],
            dtype='int64'),
 ('product_mng',
  'low'): Int64Index([   73,   143,   144,   145,   146,   219,   220,   221,   222,
               448,
             ...
             14659, 14660, 14735, 14736, 14737, 14738, 14811, 14812, 14813,
             14814],
            dtype='int64', length=451),
 ('product_mng',
  'medium'): Int64Index([   66,    67,    68,    69,    71,    74,    75,    76,   296,
               297,
             ...
             14583, 14584, 14585, 14586, 14661, 14662, 14934, 14935, 14936,
             14937],
            dtype='int64', length=383),
 ('sales',
  'high'): Int64Index([  696,   774,   852,   930,  1008,  1086,  1164,  1242,  1320,
              1398,
             ...
             13503, 13581, 13620, 13888, 13890, 13929, 13940, 13944, 13948,
             14121],
            dtype='int64', length=269),
 ('sales',
  'low'): Int64Index([    0,     3,     4,     5,     6,     7,     8,     9,    10,
                11,
             ...
             14958, 14959, 14960, 14961, 14962, 14963, 14964, 14965, 14966,
             14967],
            dtype='int64', length=2099),
 ('sales',
  'medium'): Int64Index([    1,     2,    99,   100,   101,   102,   103,   104,   177,
               178,
             ...
             14891, 14892, 14893, 14894, 14895, 14945, 14968, 14969, 14970,
             14971],
            dtype='int64', length=1772),
 ('support',
  'high'): Int64Index([  657,   735,   813,   891,   969,  2139,  2334,  2412,  2490,
              2568,
             ...
             12949, 13018, 13269, 13313, 13446, 13450, 13542, 13879, 14085,
             14868],
            dtype='int64', length=141),
 ('support',
  'low'): Int64Index([   46,    47,    48,    49,    50,    51,    52,    53,    54,
                55,
             ...
             14947, 14990, 14991, 14992, 14993, 14994, 14995, 14996, 14997,
             14998],
            dtype='int64', length=1146),
 ('support',
  'medium'): Int64Index([  309,   428,   461,   504,   505,   506,   537,   581,   582,
               583,
             ...
             14748, 14792, 14793, 14794, 14795, 14824, 14867, 14870, 14871,
             14872],
            dtype='int64', length=942),
 ('technical',
  'high'): Int64Index([  189,   267,   345,   423,   462,   501,   579,  1047,  1125,
              1944,
             ...
             13851, 13906, 14400, 14478, 14556, 14634, 14673, 14712, 14790,
             14980],
            dtype='int64', length=201),
 ('technical',
  'low'): Int64Index([   35,    36,    37,    38,    39,    40,    41,    42,    43,
                44,
             ...
             14913, 14925, 14926, 14927, 14948, 14981, 14986, 14987, 14988,
             14989],
            dtype='int64', length=1372),
 ('technical',
  'medium'): Int64Index([  113,   114,   115,   116,   188,   191,   192,   193,   194,
               265,
             ...
             14866, 14904, 14905, 14906, 14907, 14979, 14982, 14983, 14984,
             14985],
            dtype='int64', length=1147)}
  • Selecting a group
grouped = hr_data.groupby(['department','salary'])
grouped.get_group(('technical','low')).head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department salary left_categorical new_number_project
35 0.10 0.94 6 255 4 0 1 0 technical low True 6
36 0.38 0.46 2 137 3 0 1 0 technical low True 2
37 0.45 0.50 2 126 3 0 1 0 technical low True 2
38 0.11 0.89 6 306 4 0 1 0 technical low True 6
39 0.41 0.54 2 152 3 0 1 0 technical low True 2
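When grouping by multiple columns, `get_group` takes the key as a tuple, as above. A minimal sketch on made-up data (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'dept':   ['IT', 'hr', 'IT', 'hr'],
                   'salary': ['low', 'low', 'high', 'low'],
                   'hours':  [130, 140, 150, 160]})

grouped = df.groupby(['dept', 'salary'])
# The key must be a tuple matching the grouping columns, in order.
sub = grouped.get_group(('hr', 'low'))
print(sub)
```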

Aggregation

  • Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.
  • These operations are similar to the aggregating API, window functions API, and resample API.
import numpy as np
grouped.number_project.aggregate(np.mean)
department   salary
IT           high      3.867470
             low       3.794745
             medium    3.833645
RandD        high      3.764706
             low       3.804945
             medium    3.913978
accounting   high      3.905405
             low       3.801676
             medium    3.832836
hr           high      3.888889
             low       3.692537
             medium    3.590529
management   high      3.777778
             low       3.777778
             medium    4.008889
marketing    high      3.425000
             low       3.751244
             medium    3.675532
product_mng  high      3.705882
             low       3.824834
             medium    3.804178
sales        high      3.858736
             low       3.757980
             medium    3.785553
support      high      3.794326
             low       3.787086
             medium    3.825902
technical    high      3.651741
             low       3.910350
             medium    3.878814
Name: number_project, dtype: float64
grouped.agg([np.mean, np.max])
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years new_number_project
mean amax mean amax mean amax mean amax mean amax mean amax mean amax mean amax mean amax
department salary
IT high 0.638193 0.99 0.716627 0.99 3.867470 6 194.927711 275 3.072289 6 0.048193 1 0.048193 1 0.000000 0 4.012048 6
low 0.610099 1.00 0.715665 1.00 3.794745 7 201.382594 308 3.438424 10 0.146141 1 0.282430 1 0.003284 1 3.916256 7
medium 0.624187 1.00 0.718187 1.00 3.833645 7 204.295327 308 3.564486 10 0.132710 1 0.181308 1 0.001869 1 3.945794 7
RandD high 0.586667 0.97 0.700588 0.95 3.764706 6 199.745098 287 3.529412 8 0.176471 1 0.078431 1 0.019608 1 3.843137 6
low 0.623929 1.00 0.714176 1.00 3.804945 7 198.747253 308 3.381868 8 0.195055 1 0.151099 1 0.008242 1 3.903846 7
medium 0.620349 1.00 0.711694 1.00 3.913978 7 202.954301 301 3.330645 6 0.145161 1 0.166667 1 0.061828 1 4.043011 7
accounting high 0.614054 0.97 0.724595 1.00 3.905405 6 205.905405 277 3.216216 8 0.202703 1 0.067568 1 0.081081 1 3.986486 6
low 0.574162 1.00 0.713883 1.00 3.801676 7 199.899441 308 3.438547 10 0.111732 1 0.276536 1 0.005587 1 3.905028 7
medium 0.583642 1.00 0.720299 1.00 3.832836 7 201.465672 310 3.680597 10 0.122388 1 0.298507 1 0.017910 1 3.940299 7
hr high 0.673111 0.99 0.743778 0.99 3.888889 6 209.066667 289 2.911111 6 0.088889 1 0.133333 1 0.044444 1 4.066667 6
low 0.608657 1.00 0.717821 1.00 3.692537 7 202.456716 310 3.259701 6 0.137313 1 0.274627 1 0.005970 1 3.797015 7
medium 0.580306 1.00 0.696100 1.00 3.590529 7 193.863510 310 3.501393 8 0.108635 1 0.325905 1 0.030641 1 3.710306 7
management high 0.653333 0.98 0.715822 1.00 3.777778 6 200.248889 286 5.164444 10 0.160000 1 0.004444 1 0.200000 1 3.844444 6
low 0.610722 1.00 0.712833 1.00 3.777778 7 200.744444 307 3.411111 10 0.166667 1 0.327778 1 0.038889 1 3.900000 7
medium 0.597867 1.00 0.741111 1.00 4.008889 7 202.653333 304 4.155556 10 0.164444 1 0.137778 1 0.075556 1 4.084444 7
marketing high 0.605250 1.00 0.663625 1.00 3.425000 6 185.575000 286 3.512500 10 0.162500 1 0.112500 1 0.062500 1 3.600000 6
low 0.602910 0.99 0.727587 1.00 3.751244 7 204.487562 310 3.527363 10 0.154229 1 0.313433 1 0.027363 1 3.883085 7
medium 0.638218 1.00 0.714495 1.00 3.675532 7 196.869681 300 3.627660 10 0.167553 1 0.180851 1 0.071809 1 3.776596 7
product_mng high 0.614118 0.99 0.665735 0.98 3.705882 6 194.632353 307 3.617647 10 0.191176 1 0.088235 1 0.000000 0 3.882353 6
low 0.620909 1.00 0.725831 1.00 3.824834 7 201.048780 310 3.434590 10 0.150776 1 0.232816 1 0.000000 0 3.937916 7
medium 0.619112 1.00 0.710418 1.00 3.804178 7 199.637076 310 3.498695 10 0.133159 1 0.227154 1 0.000000 0 3.895561 7
sales high 0.648959 1.00 0.699814 0.99 3.858736 7 201.178439 306 3.550186 10 0.137546 1 0.052045 1 0.044610 1 3.988848 7
low 0.600838 1.00 0.709247 1.00 3.757980 7 200.363030 307 3.464030 10 0.126251 1 0.332063 1 0.009528 1 3.870414 7
medium 0.625327 1.00 0.711778 1.00 3.785553 7 201.520316 310 3.614560 10 0.160835 1 0.170993 1 0.038375 1 3.930023 7
support high 0.655035 0.99 0.714113 1.00 3.794326 6 203.985816 286 3.219858 10 0.219858 1 0.056738 1 0.000000 0 3.943262 6
low 0.591710 1.00 0.719494 1.00 3.787086 7 198.900524 310 3.484293 10 0.151832 1 0.339442 1 0.006108 1 3.902269 7
medium 0.645149 1.00 0.728854 1.00 3.825902 7 202.535032 310 3.307856 10 0.148620 1 0.167728 1 0.013800 1 3.951168 7
technical high 0.625970 1.00 0.699453 1.00 3.651741 6 200.044776 284 3.313433 10 0.149254 1 0.124378 1 0.004975 1 3.781095 6
low 0.594322 1.00 0.723367 1.00 3.910350 7 203.064869 310 3.397230 10 0.142128 1 0.275510 1 0.008746 1 4.035714 7
medium 0.620968 1.00 0.722180 1.00 3.878814 7 202.248474 310 3.445510 10 0.136007 1 0.256321 1 0.013078 1 3.993897 7
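Passing a list of functions to `agg` yields one column per (input column, function) pair, i.e. a MultiIndex on the columns, as in the table above. A small sketch with made-up data (string function names like `'mean'` work as well as NumPy callables):

```python
import pandas as pd

df = pd.DataFrame({'dept':  ['IT', 'IT', 'hr', 'hr'],
                   'hours': [130, 150, 140, 160]})

# One output column per aggregation function.
stats = df.groupby('dept')['hours'].agg(['mean', 'max'])
print(stats)
```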

Descriptive statistics of grouped data

grouped.size()
department   salary
IT           high        83
             low        609
             medium     535
RandD        high        51
             low        364
             medium     372
accounting   high        74
             low        358
             medium     335
hr           high        45
             low        335
             medium     359
management   high       225
             low        180
             medium     225
marketing    high        80
             low        402
             medium     376
product_mng  high        68
             low        451
             medium     383
sales        high       269
             low       2099
             medium    1772
support      high       141
             low       1146
             medium     942
technical    high       201
             low       1372
             medium    1147
dtype: int64
grouped.describe()
satisfaction_level last_evaluation ... promotion_last_5years new_number_project
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
department salary
IT high 83.0 0.638193 0.223749 0.15 0.5250 0.650 0.7800 0.99 83.0 0.716627 ... 0.0 0.0 83.0 4.012048 1.152875 2.0 3.0 4.0 5.0 6.0
low 609.0 0.610099 0.258915 0.09 0.4100 0.650 0.8200 1.00 609.0 0.715665 ... 0.0 1.0 609.0 3.916256 1.289334 2.0 3.0 4.0 5.0 7.0
medium 535.0 0.624187 0.243297 0.09 0.4900 0.660 0.8100 1.00 535.0 0.718187 ... 0.0 1.0 535.0 3.945794 1.235726 2.0 3.0 4.0 5.0 7.0
RandD high 51.0 0.586667 0.228785 0.10 0.4400 0.600 0.7450 0.97 51.0 0.700588 ... 0.0 1.0 51.0 3.843137 1.137938 2.0 3.0 4.0 5.0 6.0
low 364.0 0.623929 0.242586 0.09 0.4700 0.675 0.8200 1.00 364.0 0.714176 ... 0.0 1.0 364.0 3.903846 1.187208 2.0 3.0 4.0 5.0 7.0
medium 372.0 0.620349 0.250293 0.09 0.4775 0.650 0.8300 1.00 372.0 0.711694 ... 0.0 1.0 372.0 4.043011 1.262471 2.0 3.0 4.0 5.0 7.0
accounting high 74.0 0.614054 0.237319 0.11 0.5000 0.620 0.8300 0.97 74.0 0.724595 ... 0.0 1.0 74.0 3.986486 1.140695 2.0 3.0 4.0 5.0 6.0
low 358.0 0.574162 0.252250 0.09 0.4000 0.590 0.7800 1.00 358.0 0.713883 ... 0.0 1.0 358.0 3.905028 1.350144 2.0 3.0 4.0 5.0 7.0
medium 335.0 0.583642 0.262273 0.09 0.4000 0.630 0.8000 1.00 335.0 0.720299 ... 0.0 1.0 335.0 3.940299 1.295785 2.0 3.0 4.0 5.0 7.0
hr high 45.0 0.673111 0.250616 0.09 0.5500 0.730 0.8600 0.99 45.0 0.743778 ... 0.0 1.0 45.0 4.066667 1.136182 2.0 4.0 4.0 5.0 6.0
low 335.0 0.608657 0.239902 0.09 0.4400 0.620 0.8100 1.00 335.0 0.717821 ... 0.0 1.0 335.0 3.797015 1.235935 2.0 3.0 4.0 5.0 7.0
medium 359.0 0.580306 0.253324 0.09 0.4050 0.600 0.7950 1.00 359.0 0.696100 ... 0.0 1.0 359.0 3.710306 1.330636 2.0 3.0 4.0 4.0 7.0
management high 225.0 0.653333 0.194436 0.14 0.5300 0.680 0.8000 0.98 225.0 0.715822 ... 0.0 1.0 225.0 3.844444 1.152551 2.0 3.0 4.0 5.0 6.0
low 180.0 0.610722 0.254620 0.09 0.4350 0.655 0.8050 1.00 180.0 0.712833 ... 0.0 1.0 180.0 3.900000 1.268880 2.0 3.0 4.0 5.0 7.0
medium 225.0 0.597867 0.233161 0.09 0.4800 0.630 0.7600 1.00 225.0 0.741111 ... 0.0 1.0 225.0 4.084444 1.234539 2.0 3.0 4.0 5.0 7.0
marketing high 80.0 0.605250 0.255784 0.14 0.4350 0.610 0.8275 1.00 80.0 0.663625 ... 0.0 1.0 80.0 3.600000 1.164887 2.0 3.0 3.0 4.0 6.0
low 402.0 0.602910 0.256258 0.09 0.4200 0.630 0.8100 0.99 402.0 0.727587 ... 0.0 1.0 402.0 3.883085 1.322645 2.0 3.0 4.0 5.0 7.0
medium 376.0 0.638218 0.227333 0.09 0.4900 0.670 0.8200 1.00 376.0 0.714495 ... 0.0 1.0 376.0 3.776596 1.181224 2.0 3.0 4.0 5.0 7.0
product_mng high 68.0 0.614118 0.248038 0.09 0.4500 0.625 0.8225 0.99 68.0 0.665735 ... 0.0 0.0 68.0 3.882353 1.165799 2.0 3.0 4.0 5.0 6.0
low 451.0 0.620909 0.248181 0.09 0.4450 0.650 0.8300 1.00 451.0 0.725831 ... 0.0 0.0 451.0 3.937916 1.335217 2.0 3.0 4.0 5.0 7.0
medium 383.0 0.619112 0.234720 0.09 0.4500 0.640 0.8100 1.00 383.0 0.710418 ... 0.0 0.0 383.0 3.895561 1.271745 2.0 3.0 4.0 5.0 7.0
sales high 269.0 0.648959 0.236264 0.10 0.5200 0.690 0.8200 1.00 269.0 0.699814 ... 0.0 1.0 269.0 3.988848 1.137827 2.0 3.0 4.0 5.0 7.0
low 2099.0 0.600838 0.251686 0.09 0.4200 0.630 0.8100 1.00 2099.0 0.709247 ... 0.0 1.0 2099.0 3.870414 1.352319 2.0 3.0 4.0 5.0 7.0
medium 1772.0 0.625327 0.249707 0.09 0.4500 0.660 0.8300 1.00 1772.0 0.711778 ... 0.0 1.0 1772.0 3.930023 1.224935 2.0 3.0 4.0 5.0 7.0
support high 141.0 0.655035 0.225644 0.15 0.5100 0.670 0.8600 0.99 141.0 0.714113 ... 0.0 0.0 141.0 3.943262 1.080827 2.0 3.0 4.0 5.0 6.0
low 1146.0 0.591710 0.255661 0.09 0.4000 0.630 0.8000 1.00 1146.0 0.719494 ... 0.0 1.0 1146.0 3.902269 1.339052 2.0 3.0 4.0 5.0 7.0
medium 942.0 0.645149 0.234231 0.09 0.5100 0.680 0.8300 1.00 942.0 0.728854 ... 0.0 1.0 942.0 3.951168 1.171642 2.0 3.0 4.0 5.0 7.0
technical high 201.0 0.625970 0.219279 0.10 0.4900 0.640 0.7900 1.00 201.0 0.699453 ... 0.0 1.0 201.0 3.781095 1.136592 2.0 3.0 4.0 5.0 6.0
low 1372.0 0.594322 0.264359 0.09 0.4100 0.630 0.8200 1.00 1372.0 0.723367 ... 0.0 1.0 1372.0 4.035714 1.327829 2.0 3.0 4.0 5.0 7.0
medium 1147.0 0.620968 0.246691 0.09 0.4500 0.660 0.8300 1.00 1147.0 0.722180 ... 0.0 1.0 1147.0 3.993897 1.298730 2.0 3.0 4.0 5.0 7.0

30 rows × 72 columns

grouped = hr_data.groupby(['department'])
grouped.agg(mean_projects=('number_project','mean'), mean_satisfaction=('satisfaction_level','mean'))
mean_projects mean_satisfaction
department
IT 3.816626 0.618142
RandD 3.853875 0.619822
accounting 3.825293 0.582151
hr 3.654939 0.598809
management 3.860317 0.621349
marketing 3.687646 0.618601
product_mng 3.807095 0.619634
sales 3.776329 0.614447
support 3.803948 0.618300
technical 3.877941 0.607897
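This is pandas' named aggregation: each keyword argument is a `(source column, aggregation function)` pair, and the keyword becomes the output column name. A sketch with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({'dept':  ['IT', 'IT', 'hr', 'hr'],
                   'hours': [130, 150, 140, 160],
                   'score': [0.5, 0.7, 0.6, 0.8]})

# keyword = (source column, aggregation function)
out = df.groupby('dept').agg(mean_hours=('hours', 'mean'),
                             max_score=('score', 'max'))
print(out)
```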
grouped.transform(lambda x: x + 2)
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years new_number_project
0 2.38 2.53 4 159 5 2 3 2 4
1 2.80 2.86 7 264 8 2 3 2 7
2 2.11 2.88 9 274 6 2 3 2 9
3 2.72 2.87 7 225 7 2 3 2 7
4 2.37 2.52 4 161 5 2 3 2 4
... ... ... ... ... ... ... ... ... ...
14994 2.40 2.57 4 153 5 2 3 2 4
14995 2.37 2.48 4 162 5 2 3 2 4
14996 2.37 2.53 4 145 5 2 3 2 4
14997 2.11 2.96 8 282 6 2 3 2 8
14998 2.37 2.52 4 160 5 2 3 2 4

14999 rows × 9 columns
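Unlike `agg`, `transform` returns a result with the same shape as the input, so it aligns row-for-row with the original frame. The `x + 2` example above ignores the groups; a more typical use is a group-wise computation such as demeaning within each group (a sketch on made-up data):

```python
import pandas as pd

df = pd.DataFrame({'dept':  ['IT', 'IT', 'hr', 'hr'],
                   'hours': [130, 150, 140, 160]})

# transform keeps the original shape, so the result can be assigned back.
df['hours_demeaned'] = df.groupby('dept')['hours'].transform(lambda x: x - x.mean())
print(df['hours_demeaned'].tolist())   # [-10.0, 10.0, -10.0, 10.0]
```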


Reposted from blog.csdn.net/sinat_23971513/article/details/105290512