http://shzhangji.com/cnblogs/2017/07/23/learn-pandas-from-a-sql-perspective/
First, build the sample data
import pandas as pd
import numpy as np
a = [1,2,3,4,5]
b = [11,2,13,4,15]
c = [1,11,3,4,25]
d = np.array([a,b,c])
df = pd.DataFrame(d,columns=['a','b','c','d','e'])
print(df)
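Understanding axis
The key to axis is direction, not rows versus columns.
With axis=1, a mean is computed horizontally from left to right (one value per row); a concatenation joins the frames side by side; and a drop also changes the frame horizontally, which shows up as fewer columns.
Think in terms of direction: axis=1 is horizontal and axis=0 is vertical, rather than "row" and "column".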
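A minimal sketch of the direction idea. The frame `demo` and its values are made up for illustration and do not come from the notes above.

```python
import pandas as pd

# Hypothetical 2x3 frame used only for illustration.
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=1: the mean runs left to right, giving one value per row.
row_means = demo.mean(axis=1)

# axis=0: the mean runs top to bottom, giving one value per column.
col_means = demo.mean(axis=0)

# axis=1 with drop: the change happens horizontally, so a column disappears.
dropped = demo.drop("b", axis=1)

# axis=1 with concat: the frames are placed side by side.
side_by_side = pd.concat([demo, demo], axis=1)

print(row_means.tolist())     # [3.0, 4.0]
print(col_means.tolist())     # [1.5, 3.5, 5.5]
print(list(dropped.columns))  # ['a', 'c']
print(side_by_side.shape)     # (2, 6)
```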
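Merging two pandas objects
pandas.concat(objs, axis=0, join='outer', ignore_index=False)
objs: the objects to concatenate
axis: the direction of concatenation; the default 0 stacks the objects vertically (rows are appended), 1 places them side by side (columns are appended)
ignore_index: whether to discard the original index labels and renumber the result from 0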
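To make the ignore_index parameter concrete, here is a small sketch. The frames `top` and `bottom` are hypothetical and exist only for this example.

```python
import pandas as pd

# Two hypothetical one-column frames, each with the default index 0, 1.
top = pd.DataFrame({"x": [1, 2]})
bottom = pd.DataFrame({"x": [3, 4]})

# ignore_index=False keeps the original labels, so they repeat.
kept = pd.concat([top, bottom], axis=0, ignore_index=False)

# ignore_index=True throws the labels away and renumbers from 0.
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```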
append
Combines two pandas objects that have the same columns by stacking the rows of one below the other.
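DataFrame.append was removed in pandas 2.0, so on current versions the same row stacking is usually written with pd.concat. A sketch with two made-up frames (`first` and `second` are illustrative names):

```python
import pandas as pd

# Two hypothetical frames with identical columns.
first = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
second = pd.DataFrame({"a": [5], "b": [6]})

# Stacks the rows of `second` below `first`, which is what the old
# first.append(second, ignore_index=True) used to do.
stacked = pd.concat([first, second], ignore_index=True)

print(stacked)
```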
concat
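# Concatenate along rows (axis=0), then along columns (axis=1).
# ignore_index is left at its default so the original labels survive,
# matching the results shown below.
df2 = df[['c','d']]
df3 = pd.concat([df,df2], axis = 0, sort=False)
print(df3)
df3 = pd.concat([df,df2], axis = 1, sort=False)
print(df3)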
Result:
axis = 0
      a     b   c  d     e
0   1.0   2.0   3  4   5.0
1  11.0   2.0  13  4  15.0
2   1.0  11.0   3  4  25.0
0   NaN   NaN   3  4   NaN
1   NaN   NaN  13  4   NaN
2   NaN   NaN   3  4   NaN
axis = 1
    a   b   c  d   e   c  d
0   1   2   3  4   5   3  4
1  11   2  13  4  15  13  4
2   1  11   3  4  25   3  4
merge
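# Left join on column 'c'.
# Note: the result below comes from a five-row df with string-valued columns a-d,
# not the three-row frame built at the top; the key columns must share a dtype.
e = ["2","6","35"]
df2 = pd.DataFrame(np.array([c[:3], e]).T, columns = ['c','E'])
df_merge = pd.merge(df, df2, on = 'c', how = 'left')
print("merge\n", df_merge)
# how='left' keeps every row of the left frame; unmatched rows get NaN in E.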
Result:
   a   b   c   d    E
0  1   1   1   1    2
1  2  12  12   5    6
2  3  13   3   3   35
3  4   1   1  34    2
4  5  15  25  23  NaN
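The printed result has five rows and columns a to d, so it cannot come from the three-row df built at the top. The sketch below reproduces it under two assumptions: the column values are read off the printed table, and they are built as strings because the sort section later reports dtype object for b. How the author actually constructed the frame is not shown, so treat this as a guess.

```python
import numpy as np
import pandas as pd

# Values read off the printed merge result. Storing them as strings is an
# assumption based on the "object" dtype reported in the sort section.
a = ["1", "2", "3", "4", "5"]
b = ["1", "12", "13", "1", "15"]
c = ["1", "12", "3", "1", "25"]
d = ["1", "5", "3", "34", "23"]
e = ["2", "6", "35"]

df = pd.DataFrame(np.array([a, b, c, d]).T, columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.array([c[:3], e]).T, columns=["c", "E"])

# Both 'c' columns hold strings, so the join keys compare equal.
df_merge = pd.merge(df, df2, on="c", how="left")
print(df_merge)
```

With these inputs the last row (c = 25) has no match in df2, which is where the NaN in E comes from.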
Filtering
where
# Compare c numerically without modifying df_merge (its columns hold strings).
condition_1 = df_merge['c'].astype(int) > 5
# E is the column brought in by the merge.
condition_2 = df_merge['E'].isnull()
condition_3 = df_merge['E'].notnull()
print("where\n", df_merge[condition_1 & condition_2])
print("where\n", df_merge[condition_1 & ~condition_2])
print("where\n", df_merge[condition_1 & condition_3])
Result:
   a   b   c   d    E
4  5  15  25  23  NaN
   a   b   c  d  E
1  2  12  12  5  6
   a   b   c  d  E
1  2  12  12  5  6
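Read as SQL, the three filters are WHERE clauses built from boolean masks. A self-contained sketch on a made-up frame (`demo` is hypothetical); each comparison is parenthesised because & binds more tightly than > in Python.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: c is numeric, E has one missing value.
demo = pd.DataFrame({"c": [1, 12, 25], "E": ["2", "6", np.nan]})

# WHERE c > 5 AND E IS NULL
null_rows = demo[(demo["c"] > 5) & demo["E"].isnull()]

# WHERE c > 5 AND E IS NOT NULL  (~ negates a boolean mask)
filled_rows = demo[(demo["c"] > 5) & ~demo["E"].isnull()]

# The same filter written with notnull() instead of the negation.
filled_rows_2 = demo[(demo["c"] > 5) & demo["E"].notnull()]

print(null_rows.index.tolist())    # [2]
print(filled_rows.index.tolist())  # [1]
```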
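Aggregation
Official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
An aggregation has two parts: the columns to group by and the aggregate functions to apply.
---------------------------------------------------------Multiple columns---------------------------------------------------------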
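temp = df_merge.groupby(['b', 'c']).agg({
    'a': 'sum',   # the columns hold strings, so sum concatenates: '1' + '4' -> '14'
    'd': 'max'    # string maximum, so '34' beats '1'
})
print("group\n", temp)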
Result:
        a   d
b  c
1  1   14  34
12 12   2   5
13 3    3   3
15 25   5  23
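The 14 in the first row is not an arithmetic sum: the values of a in that group are the strings '1' and '4', and summing strings joins them. A small sketch of the difference, using a made-up frame (`demo` is hypothetical, and it is forced to object dtype to mirror the string columns here):

```python
import pandas as pd

# Hypothetical frame whose columns hold Python strings (object dtype).
demo = pd.DataFrame({"b": ["1", "1", "12"], "a": ["1", "4", "2"]}, dtype=object)

# Summing strings concatenates them.
as_text = demo.groupby("b")["a"].sum()

# Converting to integers first gives an arithmetic sum.
as_int = demo.assign(a=demo["a"].astype(int)).groupby("b")["a"].sum()

print(as_text["1"])  # 14  (the strings '1' and '4' joined)
print(as_int["1"])   # 5
```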
---------------------------------------------------------Single column---------------------------------------------------------
print(df_merge.groupby('b')['a'].agg(['min', 'max']))
Result:
   min max
b
1    1   4
12   2   2
13   3   3
15   5   5
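When the output columns need explicit names, as with SELECT MIN(a) AS a_min in SQL, pandas also offers named aggregation. A sketch on a made-up frame (`demo`, `a_min` and `a_max` are illustrative names, not from the notes above):

```python
import pandas as pd

# Hypothetical numeric frame.
demo = pd.DataFrame({"b": [1, 1, 12], "a": [1, 4, 2]})

# Each keyword names an output column; the tuple is (input column, function).
summary = demo.groupby("b").agg(a_min=("a", "min"), a_max=("a", "max"))

print(summary)
```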
---------------------------------------------------------Iterating over groups---------------------------------------------------------
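# Group on the column name (not a one-element list) so each key is a plain value.
grouped = df_merge.groupby('b')
for index_b, value in grouped:
    print("b =",index_b,"--index:",value.index, "\n",value)
# b holds strings here, so the key passed to get_group is a string as well.
print("direct get\n",grouped.get_group("13"))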
b = 1
   a  b  c   d  E
0  1  1  1   1  2
3  4  1  1  34  2
b = 12
   a   b   c  d  E
1  2  12  12  5  6
b = 13
   a   b  c  d   E
2  3  13  3  3  35
b = 15
   a   b   c   d    E
4  5  15  25  23  NaN
Fetching that group directly:
   a   b  c  d   E
2  3  13  3  3  35
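A compact sketch of the same iteration pattern on a made-up frame (`demo` is hypothetical). It assumes, as above, that the group keys are strings, so the key given to get_group must be a string too.

```python
import pandas as pd

# Hypothetical frame with string group keys.
demo = pd.DataFrame({"b": ["1", "13", "1"], "a": [1, 3, 4]})
grouped = demo.groupby("b")

# Iterating yields (key, sub-frame) pairs, one per distinct value of b.
keys = [key for key, _ in grouped]

# Number of rows in each group.
sizes = grouped.size().to_dict()

# Fetch a single group by its key without looping.
picked = grouped.get_group("13")

print(keys)   # ['1', '13']
print(sizes)  # {'1': 2, '13': 1}
print(picked)
```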
sort
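There are two kinds of sort: sort_values orders rows by the values in one or more columns, while sort_index orders them by their index labels.
1. sort_values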
# First pass: b and c still hold strings, so the order is lexicographic ('3' > '25').
temp_df = df_merge.sort_values(by=['c', 'b'], ascending=False, inplace = False)
print(temp_df)
print(df_merge['b'].dtype)
# Second pass: convert to integers so the order is numeric.
df_merge['c'] = df_merge['c'].astype(int)
df_merge['b'] = df_merge['b'].astype(int)
temp_df = df_merge.sort_values(by=['c', 'b'], ascending=False, inplace = False)
print(temp_df)
print(df_merge['b'].dtype)
Result:
   a   b   c   d    E
2  3  13   3   3   35
4  5  15  25  23  NaN
1  2  12  12   5    6
0  1   1   1   1    2
3  4   1   1  34    2
object
   a   b   c   d    E
4  5  15  25  23  NaN
1  2  12  12   5    6
2  3  13   3   3   35
0  1   1   1   1    2
3  4   1   1  34    2
int64
2. sort_index
temp_df = df_merge.set_index('b').sort_index()
print(temp_df)
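A short sketch of what sort_index does, on a made-up frame (`demo` is hypothetical): it orders by labels rather than by cell values, and with axis=1 it orders the columns by name.

```python
import pandas as pd

# Hypothetical frame; b becomes the index.
demo = pd.DataFrame({"b": [13, 1, 12], "c": [7, 8, 9], "a": [3, 1, 2]}).set_index("b")

# Rows ordered by index label, ascending and descending.
by_label = demo.sort_index()
by_label_desc = demo.sort_index(ascending=False)

# axis=1 sorts the column labels instead of the row labels.
by_column_name = demo.sort_index(axis=1)

print(by_label.index.tolist())       # [1, 12, 13]
print(by_label_desc.index.tolist())  # [13, 12, 1]
print(list(by_column_name.columns))  # ['a', 'c']
```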
join
Result:
rank
Result:
Full code