Data Analysis with Python: pandas

1. Hierarchical Indexing

Hierarchical indexing is an important pandas feature: it lets you have multiple (two or more) index levels on a single axis.

import numpy as np
from pandas import Series

data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

data
Out[6]: 
a  1   -2.842857
   2    0.376199
   3   -0.512978
b  1    0.225243
   2   -1.242407
   3   -0.663188
c  1   -0.149269
   2   -1.079174
d  2   -0.952380
   3   -1.113689
dtype: float64

This is the formatted output of a Series with a MultiIndex. A "gap" in the index display means "use the label directly above".

data.index
Out[7]: 
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

Hierarchical indexing plays an important role in reshaping data and in group-based operations.
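As a small sketch of that role (using the same style of two-level index as above, with fixed values so the result is reproducible): unstack() pivots the inner index level into columns, and stack() reverses it.

```python
import pandas as pd

# Two-level index, same shape as the example above
data = pd.Series(range(6),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 2, 1, 2]])

# Partial indexing: select everything under outer label 'b'
print(data['b'])

# unstack() pivots the inner level into columns, yielding a DataFrame
wide = data.unstack()
print(wide)

# stack() is the inverse, returning to a MultiIndex Series
print(wide.stack())
```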

2. Comparing the parameters of pd.read_csv and DataFrame

help(pd.DataFrame)

 |  Parameters
 |  ----------
 |  data : numpy ndarray (structured or homogeneous), dict, or DataFrame
 |      Dict can contain Series, arrays, constants, or list-like objects
 |  index : Index or array-like
 |      Index to use for resulting frame. Will default to np.arange(n) if
 |      no indexing information part of input data and no index provided
 |  columns : Index or array-like
 |      Column labels to use for resulting frame. Will default to
 |      np.arange(n) if no column labels are provided
 |  dtype : dtype, default None
 |      Data type to force. Only a single dtype is allowed. If None, infer
 |  copy : boolean, default False
 |      Copy data from inputs. Only affects DataFrame / 2d ndarray input

help(pd.read_csv)

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

Comparison:

The two differ in how column names are set. When reading a file with no header row, pass header=None to have pandas auto-assign integer column names, or supply them yourself with names=['a',...'z'].

df = pd.read_csv('data/ex2.csv', names=['a', 'b', 'c', 'd', 'message'],
                 index_col='message')
df

Out[8]:

         a   b   c   d
message               
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
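Since data/ex2.csv is not included here, the same behavior can be reproduced with an in-memory buffer (a sketch; the file contents are assumed to match the output above):

```python
import io
import pandas as pd

# Simulated contents of a headerless CSV like data/ex2.csv (assumed)
csv_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None auto-assigns integer column names 0..4
print(pd.read_csv(io.StringIO(csv_text), header=None))

# names= supplies the labels; index_col promotes one column to the row index
df = pd.read_csv(io.StringIO(csv_text),
                 names=['a', 'b', 'c', 'd', 'message'],
                 index_col='message')
print(df)
```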

3. Binning Data

bins = [18, 25, 35, 60, 100]
ages = [20, 22, 25, 30, 31, 19, 33, 28, 44, 51, 66, 34]
cats = pd.cut(ages, bins)
cats

Out[2]:

[(18, 25], (18, 25], (18, 25], (25, 35], (25, 35], ..., (25, 35], (35, 60], (35, 60], (60, 100], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

cats.codes  # called cats.labels in older pandas versions

Out[3]:

array([0, 0, 0, 1, 1, 0, 1, 1, 2, 2, 3, 1], dtype=int8)
group_names = ['Youth', 'youngAdult', 'MiddleAged', 'Senior']
cats = pd.cut(ages, bins, labels=group_names)
cats

Out[18]:
[Youth, Youth, Youth, youngAdult, youngAdult, ..., youngAdult, MiddleAged, MiddleAged, Senior, youngAdult]
Length: 12
Categories (4, object): [Youth < youngAdult < MiddleAged < Senior]
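Two pd.cut details worth knowing (a sketch using the same bins and ages as above): intervals are closed on the right by default, and right=False flips them; value_counts() tallies how many values fall into each bin.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
ages = [20, 22, 25, 30, 31, 19, 33, 28, 44, 51, 66, 34]

# Default: intervals are right-closed, so 25 falls into (18, 25]
cats = pd.cut(ages, bins)
counts = pd.Series(cats).value_counts()
print(counts)

# right=False makes intervals left-closed, so 25 now falls into [25, 35)
cats_left = pd.cut(ages, bins, right=False)
print(cats_left[2])
```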

4. Aggregation and Grouped Computation with groupby

df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
df

Out[17]:

      data1     data2 key1 key2
0  1.165397 -0.923362    a  one
1  0.849728 -0.937067    a  two
2 -0.751545  0.576415    b  one
3 -0.270348 -0.458194    b  two
4 -0.225201 -0.076616    a  one
grouped = df['data1'].groupby(df['key1'])
grouped.mean()

Out[19]: 
key1
a    0.596641
b   -0.510946
Name: data1, dtype: float64

means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

Out[32]: 
key1  key2
a     one     0.470098
      two     0.849728
b     one    -0.751545
      two    -0.270348
Name: data1, dtype: float64

df.groupby([df['key1'], df['key2']]).size()

Out[26]: 
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
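Grouping by two keys produces a hierarchically indexed result, so the unstack() trick from section 1 applies here as well (a sketch with a fixed data column so the numbers are reproducible):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Grouping by two keys yields a Series with a two-level index
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(means)

# Pivot the inner grouping key into columns
print(means.unstack())
```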

Besides built-in column keys, you can also supply your own mapping, such as a dict or a function, to drive the grouping.

people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['AA', 'BB', 'CC', 'DD', 'EE'])
people

Out[64]:

           a         b         c         d         e
AA -1.969581 -0.467297  1.003785  0.708328 -0.045470
BB  0.783007 -0.097895  2.508619  0.392152 -0.647674
CC  0.744150  0.150627 -2.206021  3.002937 -0.127511
DD  0.575631 -1.202379  1.169723  0.502523  0.889531
EE -0.573331 -0.023822 -1.461885  0.763456  0.763352

mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red'}
people.groupby(mapping, axis=1).sum()

Out[66]:

        blue       red
AA  1.712113 -2.482348
BB  2.900771  0.037438
CC  0.796916  0.767267
DD  1.672246  0.262783
EE -0.698429  0.166199
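A function works as a grouper too: groupby calls it once per index label and groups by the return value. A sketch grouping row labels by their length (made-up two- and three-character labels, fixed values for reproducibility):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.ones((4, 2)),
                      columns=['a', 'b'],
                      index=['AA', 'BBB', 'CC', 'DDD'])

# len() is applied to each index label, so rows group by name length
result = people.groupby(len).sum()
print(result)
```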


5. Data Aggregation

dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
df_obj5

Out[5]:

   data1  data2 key1   key2
0      9      6    a    one
1      1      8    b    one
2      9      8    a    two
3      8      6    b  three
4      5      6    a    two
5      6      7    b    two
6      4      6    a    one
7      7      7    a  three

df_obj5.groupby(['key1','key2']).sum()

Out[6]:

            data1  data2
key1 key2               
a    one       13     12
     three      7      7
     two       14     14
b    one        1      8
     three      8      6
     two        6      7

def peak_range(df):
    return df.max() - df.min()

print(df_obj5.groupby(['key1', 'key2']).agg(peak_range))
print(df_obj5.groupby(['key1', 'key2']).agg(lambda df: df.max() - df.min()))

            data1  data2
key1 key2               
a    one        5      0
     three      0      0
     two        4      2
b    one        0      0
     three      0      0
     two        0      0
            data1  data2
key1 key2               
a    one        5      0
     three      0      0
     two        4      2
b    one        0      0
     three      0      0
     two        0      0
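agg also accepts a list of functions, mixing string names of built-in reductions with custom ones, to compute several statistics at once (a sketch reusing peak_range on a small fixed frame):

```python
import pandas as pd

def peak_range(s):
    """Spread between the max and min of each group."""
    return s.max() - s.min()

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'data1': [1, 5, 2, 2]})

# Strings name built-in reductions; the custom function's __name__
# ('peak_range') becomes its column label
result = df.groupby('key1')['data1'].agg(['min', 'max', peak_range])
print(result)
```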


Reposted from blog.csdn.net/qq_20412595/article/details/81166231