pandas: Hierarchical Indexing

In many applications, data may be spread across a number of files or datasets, or be arranged in a form that is not easy to analyze. This chapter focuses on tools to help combine and rearrange data.

import numpy as np 
import pandas as pd

Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher-dimensional data in a lower-dimensional form. Let's start with a simple example: create a Series with a list of lists (or arrays) as the index:

data = pd.Series(np.random.randn(9),
                index=['a,a,a,b,b,c,c,d,d'.split(','),
                      [1,2,3,1,3,1,2,2,3]])

data

a  1    0.874880
   2    1.424326
   3   -2.028509
b  1   -1.081833
   3   -0.072116
c  1    0.575918
   2   -1.246831
d  2   -1.008064
   3    0.988234
dtype: float64

What you're seeing is a prettified view of a Series with a MultiIndex as its index. The "gaps" in the index display mean "use the label directly above":

data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
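
The index printed above comes from an older pandas release. As a minimal sketch, assuming a recent pandas in which the integer positions are exposed as codes rather than labels, the pieces of a MultiIndex can be inspected directly. Fixed values stand in for the random data so the result is reproducible:

```python
import numpy as np
import pandas as pd

# Same index as above, with fixed values instead of random ones
data = pd.Series(np.arange(9.0),
                 index=['a,a,a,b,b,c,c,d,d'.split(','),
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

idx = data.index
print(idx.nlevels)          # number of index levels
print(list(idx.levels[0]))  # unique labels of the outer level
print(list(idx.levels[1]))  # unique labels of the inner level
print(list(idx.codes[0]))   # position of each row's outer label within levels[0]
print(list(idx.codes[1]))   # position of each row's inner label within levels[1]
```

Each row's label is stored as a pair of small integers pointing into levels, which is why repeated labels are cheap to keep.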

With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

data['b']  # inner labels 1 and 3

1   -1.081833
3   -0.072116
dtype: float64

data['b':'c']  # inner labels 1, 3 (under b) and 1, 2 (under c)

b  1   -1.081833
   3   -0.072116
c  1    0.575918
   2   -1.246831
dtype: float64

data.loc[['b', 'd']]  # loc selects by label; iloc selects by integer position

b  1   -1.081833
   3   -0.072116
d  2   -1.008064
   3    0.988234
dtype: float64

Selection is even possible from an inner level:

data.loc[:, 2]

a    1.424326
c   -1.246831
d   -1.008064
dtype: float64
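
The selections above can be reproduced on fixed data. This is a sketch rather than part of the original notes: the values from np.arange are an assumption made for reproducibility, and xs is shown only as an alternative spelling of the inner-level selection:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(9.0),
                 index=['a,a,a,b,b,c,c,d,d'.split(','),
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

outer = data['b']               # partial indexing on the outer level
inner = data.loc[:, 2]          # every outer label, inner label 2
inner_xs = data.xs(2, level=1)  # the same selection, naming the level explicitly

print(outer)
print(inner)
```

In both cases the level that was selected on is dropped from the result, so what remains is indexed by the other level only.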

Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table. For example, you could rearrange the data into a DataFrame using its unstack method:

data.unstack()

	1	2	3
a	0.874880	1.424326	-2.028509
b	-1.081833	NaN	-0.072116
c	0.575918	-1.246831	NaN
d	NaN	-1.008064	0.988234

The inverse operation of unstack is stack:

data.unstack().stack()  # round trip: same as the original data

a  1    0.874880
   2    1.424326
   3   -2.028509
b  1   -1.081833
   3   -0.072116
c  1    0.575918
   2   -1.246831
d  2   -1.008064
   3    0.988234
dtype: float64
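
A small check of the round trip, again on fixed values. This is a sketch under one stated assumption: how stack treats the NaN cells introduced by unstack is taken to depend on the pandas version, so dropna() is called explicitly instead of relying on the default:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(9.0),
                 index=['a,a,a,b,b,c,c,d,d'.split(','),
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

wide = data.unstack()            # inner level becomes the columns
print(wide.shape)
print(wide.isna().sum().sum())   # combinations absent from data show up as NaN

# Drop the NaN cells explicitly so the result does not depend on stack's defaults
restored = wide.stack().dropna()
print(list(restored.index) == list(data.index))
```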

stack and unstack will be explored in more detail later in this chapter.

With a DataFrame, either axis can have a hierarchical index:

frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                     index=[['a','a','b','b'], [1,2,1,2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                             ['Green', 'Red', 'Green']]
                    )

frame

		Ohio		Colorado
		Green	Red	Green
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:

frame.index.names = ['key1', 'key2']

frame.columns.names = ['state', 'color']

# Both the row and column index levels can be given names
frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

Be careful to distinguish the index names 'state' and 'color' from the row labels.

With partial column indexing you can similarly select groups of columns:

frame['Ohio']

	color	Green	Red
key1	key2
a	1	0	1
a	2	3	4
b	1	6	7
b	2	9	10
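
Column selection works on the inner level too. The following sketch rebuilds the frame with named levels and uses xs for the inner level; the variable names ohio, green, and red are illustrative only:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                       ['Green', 'Red', 'Green']],
                                      names=['state', 'color']),
)

ohio = frame['Ohio']                              # all columns under the outer label
green = frame.xs('Green', axis=1, level='color')  # all columns with this inner label
red = frame[('Ohio', 'Red')]                      # one column, addressed by a full tuple

print(ohio)
print(green)
```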

A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame, with level names, could be created like this:

tmp = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                 ['Green', 'Red', 'Green']],
                                names=['state', 'color'])

tmp

MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
           labels=[[1, 1, 0], [0, 1, 0]],
           names=['state', 'color'])
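
To illustrate the reuse, here is a sketch that builds the same index in two ways and attaches it to two frames. The frames first and second and their contents are made up for the example:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

# The same index built from (state, color) pairs instead of parallel arrays
same = pd.MultiIndex.from_tuples(
    [('Ohio', 'Green'), ('Ohio', 'Red'), ('Colorado', 'Green')],
    names=['state', 'color'])

# Reuse the prebuilt index as the columns of two different frames
first = pd.DataFrame(np.zeros((2, 3)), columns=columns)
second = pd.DataFrame(np.ones((2, 3)), columns=columns)

print(columns.equals(same))
print(list(first.columns.names))
```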

Reordering and Sorting Levels

At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The swaplevel method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):

frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

frame.swaplevel('key1', 'key2')  # swap the two row index levels

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
2	a	3	4	5
1	b	6	7	8
2	b	9	10	11

sort_index, on the other hand, sorts the data by its index values. Called without arguments it sorts lexicographically on all levels; passing level sorts on the chosen level first. When swapping levels, it's not uncommon to also use sort_index so that the result is lexicographically sorted by the indicated level:

frame.sort_index(level=1)

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
b	1	6	7	8
a	2	3	4	5
b	2	9	10	11

# For comparison, sort by the outermost level
frame.sort_index(level=0)

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

# Swap the index levels first, then sort by level 0
frame.swaplevel(0, 1).sort_index(level=0)

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
1	b	6	7	8
2	a	3	4	5
2	b	9	10	11

Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level, that is, the result of calling sort_index().
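
Whether an index is in that sorted state can be checked before and after sorting. This sketch uses the is_monotonic_increasing property of the index as the check, which is my choice for illustration, not something the text above prescribes:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=['x', 'y', 'z'],  # plain columns; only the row index matters here
)

swapped = frame.swaplevel('key1', 'key2')
# Swapping relabels the rows but leaves them in their original order
print(swapped.index.is_monotonic_increasing)

sorted_frame = swapped.sort_index()
# After sort_index the rows are ordered by key2, then key1
print(sorted_frame.index.is_monotonic_increasing)
print(sorted_frame)
```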

Summary Statistics by Level

Many descriptive and summary statistics on DataFrame and Series have a level option with which you can specify the level you want to aggregate by on a particular axis. (This option was deprecated in pandas 1.3 and removed in 2.0, so the calls below need an older release.) Consider the above DataFrame; we can aggregate by level on either the rows or columns like so:

frame
frame.sum(level='key2')

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

state	Ohio		Colorado
color	Green	Red	Green
key2
1	6	8	10
2	12	14	16

frame.sum(level='color', axis=1)

	color	Green	Red
key1	key2
a	1	2	1
a	2	8	4
b	1	14	7
b	2	20	10

Under the hood, this utilizes pandas's groupby machinery, which will be discussed in more detail later in the book.
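
Because level-wise aggregation is a groupby underneath, the same sums can be written with groupby directly. The following is a minimal sketch for pandas versions in which sum no longer accepts level; the transpose round trip used for the column case is one possible approach, not the only one:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                       ['Green', 'Red', 'Green']],
                                      names=['state', 'color']),
)

# Row-wise: add up the rows that share the same key2 label
by_key2 = frame.groupby(level='key2').sum()

# Column-wise: transpose, group the former columns by color, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```

The two results match the tables shown above for sum(level='key2') and sum(level='color', axis=1).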

Indexing with a DataFrame's Columns

It's not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame's columns. Here's an example DataFrame:

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': 'one,one,one,two,two,two,two'.split(','),
    'd': [0, 1, 2, 0, 1, 2, 3]
})

frame

	a	b	c	d
0	0	7	one	0
1	1	6	one	1
2	2	5	one	2
3	3	4	two	0
4	4	3	two	1
5	5	2	two	2
6	6	1	two	3

DataFrame's set_index method will create a new DataFrame using one or more of its columns as the index:

# Use columns c and d as the index; they are removed from the columns
frame2 = frame.set_index(['c', 'd']) 

frame2

		a	b
c	d
one	0	0	7
	1	1	6
	2	2	5
two	0	3	4
	1	4	3
	2	5	2
	3	6	1

By default the columns are removed from the DataFrame, though you can leave them in:

frame.set_index(['c', 'd'], drop=False)

		a	b	c	d
c	d
one	0	0	7	one	0
	1	1	6	one	1
	2	2	5	one	2
two	0	3	4	two	0
	1	4	3	two	1
	2	5	2	two	2
	3	6	1	two	3

reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:

frame2

		a	b
c	d
one	0	0	7
	1	1	6
	2	2	5
two	0	3	4
	1	4	3
	2	5	2
	3	6	1

# Move the hierarchical index levels back into the columns
frame2.reset_index()

	c	d	a	b
0	one	0	0	7
1	one	1	1	6
2	one	2	2	5
3	two	0	3	4
4	two	1	4	3
5	two	2	5	2
6	two	3	6	1
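
The two methods can be checked against each other. In this sketch the names indexed, restored, and only_d are illustrative, and the level argument of reset_index is shown as one way to move back a single level:

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': 'one,one,one,two,two,two,two'.split(','),
    'd': [0, 1, 2, 0, 1, 2, 3],
})

indexed = frame.set_index(['c', 'd'])    # c and d become the row index
restored = indexed.reset_index()         # both levels return as columns, placed first
only_d = indexed.reset_index(level='d')  # move back just the inner level

print(list(restored.columns))
print(only_d.head())
```

Note that the round trip changes the column order: the former index levels come back in front of the remaining columns.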

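# Scratch test, unrelated to indexing. time.clock() was removed in Python 3.8;
# time.perf_counter() is the replacement.
import time
time.perf_counter()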

6e-07

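# Scratch test: a mutable default argument is created once and shared across calls
def f(x, l=[]):
    for i in range(x):
        l.append(i * i)
    print(l)


f(2)             # appends to the shared default list
f(3, [3, 2, 1])  # appends to the list passed in
f(3)             # reuses the shared default, which still holds [0, 1]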

[0, 1]
[3, 2, 1, 0, 1, 4]
[0, 1, 0, 1, 4]