Data Analysis with Python: pandas

1. Hierarchical Indexing

Hierarchical indexing is an important pandas feature: it lets you have multiple (two or more) index levels on a single axis.

import numpy as np
from pandas import Series

data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

data
Out[6]: 
a  1   -2.842857
   2    0.376199
   3   -0.512978
b  1    0.225243
   2   -1.242407
   3   -0.663188
c  1   -0.149269
   2   -1.079174
d  2   -0.952380
   3   -1.113689
dtype: float64

This is the formatted output of a Series with a MultiIndex. A "gap" in the index display means "use the label directly above".

data.index
Out[7]: 
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

Hierarchical indexing plays an important role in reshaping data and in group-based operations.
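As a small sketch of that role (using the same style of two-level index as above, with fixed values so the result is reproducible): unstack() pivots the inner index level into columns, and stack() reverses it.

```python
import pandas as pd

# Two-level index, same shape as the example above
data = pd.Series(range(6),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 2, 1, 2]])

# Partial indexing: select everything under outer label 'b'
print(data['b'])

# unstack() pivots the inner level into columns, yielding a DataFrame
wide = data.unstack()
print(wide)

# stack() is the inverse, returning to a MultiIndex Series
print(wide.stack())
```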

2. Comparing the parameters of pd.read_csv and DataFrame

help(pd.DataFrame)

 |  Parameters
 |  ----------
 |  data : numpy ndarray (structured or homogeneous), dict, or DataFrame
 |      Dict can contain Series, arrays, constants, or list-like objects
 |  index : Index or array-like
 |      Index to use for resulting frame. Will default to np.arange(n) if
 |      no indexing information part of input data and no index provided
 |  columns : Index or array-like
 |      Column labels to use for resulting frame. Will default to
 |      np.arange(n) if no column labels are provided
 |  dtype : dtype, default None
 |      Data type to force. Only a single dtype is allowed. If None, infer
 |  copy : boolean, default False
 |      Copy data from inputs. Only affects DataFrame / 2d ndarray input

help(pd.read_csv)

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

Comparison:

The two differ in how column names are set. When reading a file with no header row, pass header=None to have pandas auto-assign integer column names, or supply them yourself with names=['a',...'z'].

df = pd.read_csv('data/ex2.csv', names=['a', 'b', 'c', 'd', 'message'],
                 index_col='message')
df

Out[8]:

         a   b   c   d
message               
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
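Since data/ex2.csv is not included here, the same behavior can be reproduced with an in-memory buffer (a sketch; the file contents are assumed to match the output above):

```python
import io
import pandas as pd

# Simulated contents of a headerless CSV like data/ex2.csv (assumed)
csv_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None auto-assigns integer column names 0..4
print(pd.read_csv(io.StringIO(csv_text), header=None))

# names= supplies the labels; index_col promotes one column to the row index
df = pd.read_csv(io.StringIO(csv_text),
                 names=['a', 'b', 'c', 'd', 'message'],
                 index_col='message')
print(df)
```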

3. Binning Data

bins = [18, 25, 35, 60, 100]
ages = [20, 22, 25, 30, 31, 19, 33, 28, 44, 51, 66, 34]
cats = pd.cut(ages, bins)
cats

Out[2]:

[(18, 25], (18, 25], (18, 25], (25, 35], (25, 35], ..., (25, 35], (35, 60], (35, 60], (60, 100], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

cats.codes  # called cats.labels in older pandas versions

Out[3]:

array([0, 0, 0, 1, 1, 0, 1, 1, 2, 2, 3, 1], dtype=int8)
group_names = ['Youth', 'youngAdult', 'MiddleAged', 'Senior']
cats = pd.cut(ages, bins, labels=group_names)
cats

Out[18]:
[Youth, Youth, Youth, youngAdult, youngAdult, ..., youngAdult, MiddleAged, MiddleAged, Senior, youngAdult]
Length: 12
Categories (4, object): [Youth < youngAdult < MiddleAged < Senior]
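Two pd.cut details worth knowing (a sketch using the same bins and ages as above): intervals are closed on the right by default, and right=False flips them; value_counts() tallies how many values fall into each bin.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
ages = [20, 22, 25, 30, 31, 19, 33, 28, 44, 51, 66, 34]

# Default: intervals are right-closed, so 25 falls into (18, 25]
cats = pd.cut(ages, bins)
counts = pd.Series(cats).value_counts()
print(counts)

# right=False makes intervals left-closed, so 25 now falls into [25, 35)
cats_left = pd.cut(ages, bins, right=False)
print(cats_left[2])
```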

4. Aggregation and Grouped Computation with groupby

df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
df

Out[17]:

      data1     data2 key1 key2
0  1.165397 -0.923362    a  one
1  0.849728 -0.937067    a  two
2 -0.751545  0.576415    b  one
3 -0.270348 -0.458194    b  two
4 -0.225201 -0.076616    a  one
grouped = df['data1'].groupby(df['key1'])
grouped.mean()

Out[19]: 
key1
a    0.596641
b   -0.510946
Name: data1, dtype: float64

means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

Out[32]: 
key1  key2
a     one     0.470098
      two     0.849728
b     one    -0.751545
      two    -0.270348
Name: data1, dtype: float64

df.groupby([df['key1'], df['key2']]).size()

Out[26]: 
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
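Grouping by two keys produces a hierarchically indexed result, so the unstack() trick from section 1 applies here as well (a sketch with a fixed data column so the numbers are reproducible):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Grouping by two keys yields a Series with a two-level index
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(means)

# Pivot the inner grouping key into columns
print(means.unstack())
```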

Besides built-in column keys, you can also supply your own mapping, such as a dict or a function, to drive the grouping.

people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['AA', 'BB', 'CC', 'DD', 'EE'])
people

Out[64]:

           a         b         c         d         e
AA -1.969581 -0.467297  1.003785  0.708328 -0.045470
BB  0.783007 -0.097895  2.508619  0.392152 -0.647674
CC  0.744150  0.150627 -2.206021  3.002937 -0.127511
DD  0.575631 -1.202379  1.169723  0.502523  0.889531
EE -0.573331 -0.023822 -1.461885  0.763456  0.763352

mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red'}
people.groupby(mapping, axis=1).sum()

Out[66]:

        blue       red
AA  1.712113 -2.482348
BB  2.900771  0.037438
CC  0.796916  0.767267
DD  1.672246  0.262783
EE -0.698429  0.166199
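A function works as a grouper too: groupby calls it once per index label and groups by the return value. A sketch grouping row labels by their length (made-up two- and three-character labels, fixed values for reproducibility):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.ones((4, 2)),
                      columns=['a', 'b'],
                      index=['AA', 'BBB', 'CC', 'DDD'])

# len() is applied to each index label, so rows group by name length
result = people.groupby(len).sum()
print(result)
```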


5. Data Aggregation

dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
df_obj5

Out[5]:

   data1  data2 key1   key2
0      9      6    a    one
1      1      8    b    one
2      9      8    a    two
3      8      6    b  three
4      5      6    a    two
5      6      7    b    two
6      4      6    a    one
7      7      7    a  three

df_obj5.groupby(['key1','key2']).sum()

Out[6]:

            data1  data2
key1 key2               
a    one       13     12
     three      7      7
     two       14     14
b    one        1      8
     three      8      6
     two        6      7

def peak_range(df):
    return df.max() - df.min()

print(df_obj5.groupby(['key1', 'key2']).agg(peak_range))
print(df_obj5.groupby(['key1', 'key2']).agg(lambda df: df.max() - df.min()))

            data1  data2
key1 key2               
a    one        5      0
     three      0      0
     two        4      2
b    one        0      0
     three      0      0
     two        0      0
            data1  data2
key1 key2               
a    one        5      0
     three      0      0
     two        4      2
b    one        0      0
     three      0      0
     two        0      0
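agg also accepts a list of functions, mixing string names of built-in reductions with custom ones, to compute several statistics at once (a sketch reusing peak_range on a small fixed frame):

```python
import pandas as pd

def peak_range(s):
    """Spread between the max and min of each group."""
    return s.max() - s.min()

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'data1': [1, 5, 2, 2]})

# Strings name built-in reductions; the custom function's __name__
# ('peak_range') becomes its column label
result = df.groupby('key1')['data1'].agg(['min', 'max', peak_range])
print(result)
```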


Reposted from blog.csdn.net/qq_20412595/article/details/81166231