Understanding the DataFrame
A DataFrame can be seen as a number of Series objects arranged in order, where "arranged" means that these Series share a common index.
I. Reading files

dt = pd.read_csv(path)
dt = pd.read_excel(path)
dt = pd.read_table(path, sep=',')
II. Indexing

The first indexer is the iloc attribute: selection and slicing use the implicit integer position, e.g. dt.iloc[1:3]  # note: a half-open interval, counting from 0
The second indexer is the loc attribute, which uses the explicit index labels, e.g. dt.loc[:'Illinois', :'pop']  # note: label slices include the end point
The third indexer, ix, mixed the two styles, e.g. dt.ix[:3, :'pop'], but it has been removed from modern pandas.
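A minimal sketch contrasting the two indexers; the state data below is hypothetical, chosen only to mirror the `dt.loc[:'Illinois', :'pop']` example above:

```python
import pandas as pd

# Hypothetical state data (illustrative values, not real figures)
dt = pd.DataFrame({'area': [423967, 695662, 149995],
                   'pop': [38332521, 26448193, 12882135]},
                  index=['California', 'Texas', 'Illinois'])

# iloc: integer positions, half-open interval -> rows at positions 1 and 2
print(dt.iloc[1:3])

# loc: explicit labels, the end label is included -> all three rows, both columns
print(dt.loc[:'Illinois', :'pop'])
```

Note the asymmetry: the iloc slice excludes its end point, while the loc label slice includes it.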
III. Concatenation and merging

1. pd.concat()

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None)

Here axis=0 concatenates along rows and axis=1 along columns (which can also be written axis='columns'). With axis=1, rows are aligned by index. (Note: the join_axes argument has been removed in recent pandas versions.)
ser1 = pd.Series(['A','B','C'])
ser2 = pd.Series(['D','E','F'])
ser3 = pd.Series(['G','H','I'])
a = pd.concat([ser1, ser2])

The result a is shown below; trying to concatenate ser3 along columns directly then raises an error.

0    A
1    B
2    C
0    D
1    E
2    F
dtype: object
pd.concat([a, ser3], axis=1)
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
This is because a keeps the original indexes of both Series, so the labels 0, 1, 2 appear twice. Setting ignore_index=True resets the index.
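The fix described above can be sketched as follows:

```python
import pandas as pd

ser1 = pd.Series(['A', 'B', 'C'])
ser2 = pd.Series(['D', 'E', 'F'])
ser3 = pd.Series(['G', 'H', 'I'])

# ignore_index=True discards the duplicated labels and renumbers 0..5
a = pd.concat([ser1, ser2], ignore_index=True)

# With a unique index, the column-wise concat no longer raises
# InvalidIndexError; rows 3..5 of ser3 are simply filled with NaN.
wide = pd.concat([a, ser3], axis=1)
print(wide)
```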
2. append()

ser1.append(ser2)

Pandas' append() does not update the original object in place; it creates a new object holding the combined data. (Series.append and DataFrame.append were deprecated in pandas 1.4 and removed in 2.0; use pd.concat instead.)
3. pd.merge()
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False,
sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
See help(pd.merge) for the full parameter list.
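A short sketch of a key-based join; the employee tables below are hypothetical, invented for illustration:

```python
import pandas as pd

# Hypothetical employee tables sharing the key column 'employee'
left = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                     'group': ['Accounting', 'Engineering', 'HR']})
right = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                      'hire_date': [2004, 2008, 2012]})

# how='inner' (the default) keeps only keys present in both frames;
# rows are matched by value, not by position.
merged = pd.merge(left, right, on='employee', how='inner')
print(merged)
```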
IV. Grouping (groupby)

Grouped statistics use the groupby method, which works by splitting the data (split), applying a function to each group (apply), and combining the results (combine).
groupby partitions the data according to the specified column and returns a DataFrameGroupBy object. The object holds the groups internally, but nothing is computed until an aggregation function is applied.
import numpy as np
import pandas as pd

# The opening lines of this snippet were missing from the original; the
# 'key' values are reconstructed from the result tables below, and the
# RandomState seed is an assumption.
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},  # the tables below use data2 = [5, 0, 3, 3, 7, 9]
                  columns=['key', 'data1', 'data2'])
print(df)
df.groupby('key')
(1) Aggregation: aggregate

Applying an aggregation function triggers the computation over the DataFrameGroupBy object.
df.groupby('key').aggregate(['min',np.median,max])
| key | data1 min | data1 median | data1 max | data2 min | data2 median | data2 max |
|---|---|---|---|---|---|---|
| A | 0 | 1.5 | 3 | 3 | 4.0 | 5 |
| B | 1 | 2.5 | 4 | 0 | 3.5 | 7 |
| C | 2 | 3.5 | 5 | 3 | 6.0 | 9 |
You can also pass a dictionary to apply a different function to each column:
df.groupby('key').aggregate({'data1':'min', 'data2':'max'})
| key | data1 | data2 |
|---|---|---|
| A | 0 | 5 |
| B | 1 | 7 |
| C | 2 | 9 |
(2) Filtering: filter

def filter_func(x):
    return x['data2'].std() > 4

print(df.groupby('key').std())
print(df.groupby('key').filter(filter_func))
| key | data1 | data2 |
|---|---|---|
| A | 2.12132 | 1.414214 |
| B | 2.12132 | 4.949747 |
| C | 2.12132 | 4.242641 |
|   | key | data1 | data2 |
|---|---|---|---|
| 1 | B | 1 | 0 |
| 2 | C | 2 | 3 |
| 4 | B | 4 | 7 |
| 5 | C | 5 | 9 |
(3) Transformation: transform
df.groupby('key').transform(lambda x:x-x.mean())
|   | data1 | data2 |
|---|---|---|
| 0 | -1.5 | 1.0 |
| 1 | -1.5 | -3.5 |
| 2 | -1.5 | -3.0 |
| 3 | 1.5 | -1.0 |
| 4 | 1.5 | 3.5 |
| 5 | 1.5 | 3.0 |
V. Pivot tables

See help(pd.pivot_table) for the full parameter list.

titanic.pivot_table(index='sex', columns='class', aggfunc={'survived': sum, 'fare': 'mean'})

(Note: when aggfunc is a dictionary, the columns to aggregate are taken from its keys, so the values argument is not passed.)
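The titanic DataFrame above comes from an external dataset (e.g. seaborn's load_dataset('titanic')). A self-contained sketch of the same idea, using a tiny hypothetical stand-in frame:

```python
import pandas as pd

# Tiny hypothetical stand-in for the titanic data
df = pd.DataFrame({'sex': ['male', 'female', 'female', 'male'],
                   'class': ['First', 'First', 'Third', 'Third'],
                   'survived': [0, 1, 1, 0]})

# Mean survival rate, broken out by sex (rows) and class (columns)
table = df.pivot_table('survived', index='sex', columns='class', aggfunc='mean')
print(table)
```

A pivot table is essentially a groupby on two keys with the second key unstacked into columns.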
VI. Other operations (sorting, deduplication, dropping, applying functions along rows or columns, etc.)

1. Sorting

Sort by index:
df.sort_index(ascending=False)
|   | key | data1 | data2 |
|---|---|---|---|
| 5 | C | 5 | 9 |
| 4 | B | 4 | 7 |
| 3 | A | 3 | 3 |
| 2 | C | 2 | 3 |
| 1 | B | 1 | 0 |
| 0 | A | 0 | 5 |
Sort by values:
df.sort_values(by=['key','data2'])
|   | key | data1 | data2 |
|---|---|---|---|
| 3 | A | 3 | 3 |
| 0 | A | 0 | 5 |
| 1 | B | 1 | 0 |
| 4 | B | 4 | 7 |
| 2 | C | 2 | 3 |
| 5 | C | 5 | 9 |
2. Deduplication
df.drop_duplicates('data2', keep='first')
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | A | 0 | 5 |
| 1 | B | 1 | 0 |
| 2 | C | 2 | 3 |
| 4 | B | 4 | 7 |
| 5 | C | 5 | 9 |
To count the distinct elements in a column, use df['data1'].nunique().
3. Dropping rows and columns (drop)

Dropping rows:

First select the index of the rows to delete, then drop those index labels:
df.drop(df.loc[df['key']=='A'].index, axis=0)
|   | key | data1 | data2 |
|---|---|---|---|
| 1 | B | 1 | 0 |
| 2 | C | 2 | 3 |
| 4 | B | 4 | 7 |
| 5 | C | 5 | 9 |
Dropping columns:
df.drop(['key','data2'], axis=1)
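Like append(), drop returns a new object and leaves the original untouched, which the following sketch (using a small hypothetical frame) makes explicit:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B'], 'data1': [0, 1], 'data2': [5, 0]})

# drop returns a new DataFrame; df itself is not modified
trimmed = df.drop(['key', 'data2'], axis=1)
print(trimmed.columns.tolist())  # ['data1']
print(df.columns.tolist())       # ['key', 'data1', 'data2']
```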