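Goal: drop the rows of a pandas DataFrame whose key value occurs too rarely. The line below keeps only the groups whose row count is strictly between 500 and 1000, where df is the data to be filtered: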
df_family_car = df.groupby("PLATE_INFO_EX").filter(lambda x: (len(x) > 500 and len(x)<1000))
For a more detailed look at groupby usage, see: https://blog.csdn.net/songbinxu/article/details/79839363
https://blog.csdn.net/youngbit007/article/details/54288603/
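As a small runnable illustration of the filter call above, the sketch below uses a made-up six-row frame. The `speed` column and the thresholds 1 and 3 are assumptions, chosen only so the result is easy to check by hand. It also shows a `transform("size")` mask, which should select the same rows without calling a Python lambda once per group.

```python
import pandas as pd

# Hypothetical toy data: plate "A" appears 3 times, "B" once, "C" twice
df = pd.DataFrame({"PLATE_INFO_EX": ["A", "A", "B", "A", "C", "C"],
                   "speed": [60, 62, 80, 58, 70, 71]})

# Keep only rows whose plate occurs more than once and fewer than 3 times
kept = df.groupby("PLATE_INFO_EX").filter(lambda g: 1 < len(g) < 3)

# Alternative: broadcast each group's size onto its rows, then use a boolean mask
sizes = df.groupby("PLATE_INFO_EX")["PLATE_INFO_EX"].transform("size")
kept_mask = df[(sizes > 1) & (sizes < 3)]

print(kept)
```

Both variants keep the original row order and index, so only the two "C" rows remain here.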
Create some sample data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'key1': list('aabba'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df
Out[83]:
key1 key2 data1 data2
0 a one -0.643930 -0.856232
1 a two 0.863575 -0.577838
2 b one 0.261961 -1.045156
3 b two 0.820736 0.790127
4 a one -0.991311 -0.999499
Iterating over a groupby: each iteration yields a (name, group) tuple, so the result can be looped over directly.
for name, group in df.groupby('key1'):
    print(name)
    print(group)
# Output:
#a
# data1 data2 key1 key2
#0 -1.389589 0.605121 a one
#1 0.057731 1.387236 a two
#4 0.973961 -1.540356 a one
#b
# data1 data2 key1 key2
#2 -0.476933 -0.110656 b one
#3 -0.015403 0.117257 b two
# With multiple keys
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)
# Output:
#a one
# data1 data2 key1 key2
# 0 -0.474012 0.159072 a one
# 4 -2.049148 0.389898 a one
# a two
# data1 data2 key1 key2
# 1 2.471597 1.335773 a two
# b one
# data1 data2 key1 key2
# 2 0.249875 0.181691 b one
# b two
# data1 data2 key1 key2
# 3 0.458725 0.040619 b two
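Because iteration yields (name, group) pairs, the groups can also be collected into a dict or fetched one at a time. The sketch below is a minimal illustration on fixed numbers (the `data1` values are made up so the result is reproducible). `get_group` is the standard pandas call for pulling out a single group.

```python
import pandas as pd

# Same keys as the sample frame, with fixed (made-up) values instead of random ones
df = pd.DataFrame({"key1": list("aabba"),
                   "key2": ["one", "two", "one", "two", "one"],
                   "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

grouped = df.groupby("key1")

# Collect the (name, group) pairs into a dict keyed by group name
pieces = {name: group for name, group in grouped}
print(pieces["b"])

# get_group fetches one group without looping over all of them
group_a = grouped.get_group("a")
print(group_a.index.tolist())
```

Each group is an ordinary DataFrame that keeps the row index of the original frame.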
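1: Sum, cumulative sum, and product of the values within each key
# Sum within each key
gp = df.groupby(["key1"])["data1"].sum().reset_index()  # reset_index moves the group key back into a column
gp.rename(columns={"data1": "sum_of_value"}, inplace=True)  # rename changes the column name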
gp
Out[85]:
key1 sum_of_value
0 a -0.771667
1 b 1.082697
# Cumulative sum of the values within each key
gp = df.groupby(["key1"])["data1"].cumsum().reset_index()
gp.rename(columns={"data1":"cumsum_of_value"},inplace=True)
gp
Out[88]:
index cumsum_of_value
0 0 -0.643930
1 1 0.219645
2 2 0.261961
3 3 1.082697
4 4 -0.771667
# Product of all values within each key
gp = df.groupby(["key1"])["data1"].prod().reset_index()
gp.rename(columns={"data1":"prod_of_value"},inplace=True)
gp
Out[91]:
key1 prod_of_value
0 a 0.551250
1 b 0.215001
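The sum and product steps above can also be written as one call using named aggregation (available in pandas 0.25 and later), which names the result columns up front instead of renaming afterwards. The sketch below uses made-up fixed values so the numbers are easy to verify. It also shows that `cumsum` returns one value per original row, so it can be assigned straight back as a new column.

```python
import pandas as pd

# Made-up fixed values so the results can be checked by hand
df = pd.DataFrame({"key1": list("aabba"),
                   "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Named aggregation: one row per key, result columns named in the call
summary = df.groupby("key1", as_index=False).agg(
    sum_of_value=("data1", "sum"),
    prod_of_value=("data1", "prod"),
)
print(summary)

# cumsum is aligned with the original index, so it fits back into df directly
df["cumsum_of_value"] = df.groupby("key1")["data1"].cumsum()
print(df)
```

This is an alternative spelling, not a replacement: the reset_index and rename pattern gives the same sums and products.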
2: Mean .mean(), maximum .max(), minimum .min(), and index of the maximum .idxmax() within each key
# Mean within each key
gp = df.groupby(["key1"])["data1"].mean().reset_index()
gp.rename(columns={"data1":"mean_of_value"},inplace=True)
gp
Out[93]:
key1 mean_of_value
0 a -0.257222
1 b 0.541349
# ... max and min are written the same way
# Index in the original DataFrame of the largest value within each key
gp = df.groupby(["key1"])["data1"].idxmax().reset_index()
gp.rename(columns={"data1":"maxidx_of_value"},inplace=True)
gp
Out[95]:
key1 maxidx_of_value
0 a 1
1 b 3
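A common use of the idxmax result is to pull out the full row that holds each key's maximum. The sketch below is one way to do that, using made-up fixed values. It passes the index labels returned by `idxmax` to `df.loc`.

```python
import pandas as pd

# Made-up fixed values: the max of "a" sits in row 1, the max of "b" in row 3
df = pd.DataFrame({"key1": list("aabba"),
                   "key2": ["one", "two", "one", "two", "one"],
                   "data1": [1.0, 5.0, 3.0, 4.0, 2.0]})

# idxmax returns, per key, the original index label of the largest data1
idx = df.groupby("key1")["data1"].idxmax()

# Use those labels to select the complete rows
top_rows = df.loc[idx]
print(top_rows)
```

If several rows tie for the maximum, `idxmax` reports only the first one, so this selects exactly one row per key.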
3: Rank of the values within each key. Tied values share the average of the ranks they span, so fractional ranks such as 2.5 can appear.
# Rank of each value within its key
gp = df.groupby(["key1"])["data1"].rank().reset_index()
gp.rename(columns={"data1": "rank_of_value"}, inplace=True)
gp
Out[97]:
index rank_of_value
0 0 2.0
1 1 3.0
2 2 1.0
3 3 2.0
4 4 1.0
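The sample output above has no ties, so the fractional ranks do not show up there. The sketch below uses made-up values with deliberate ties to show where a rank of 2.5 comes from. It also shows the `method` parameter of `rank`, which changes how ties are handled.

```python
import pandas as pd

# Made-up values with ties: two 20s in group "a", two 7s in group "b"
df = pd.DataFrame({"key1": ["a", "a", "a", "a", "b", "b"],
                   "data1": [10, 20, 20, 30, 7, 7]})

g = df.groupby("key1")["data1"]

# Default method="average": tied values share the mean of ranks 2 and 3
avg_rank = g.rank()

# method="min": tied values all get the lowest rank they span
min_rank = g.rank(method="min")

# method="dense": like "min", but no gaps after a tie
dense_rank = g.rank(method="dense")

print(pd.DataFrame({"avg": avg_rank, "min": min_rank, "dense": dense_rank}))
```

In group "a" the two 20s would occupy ranks 2 and 3, so the default gives both 2.5.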
4: size() counts occurrences per group, similar to a grouped value_counts()
gp = df.groupby(["key1", "key2"]).size().reset_index()
gp.rename(columns={0:"count"},inplace=True)
gp
Out[110]:
key1 key2 count
0 a one 2
1 a two 1
2 b one 1
3 b two 1
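One detail worth checking when counting is how missing values are treated. The sketch below uses made-up data with one NaN and compares `size()`, which counts rows, with `count()`, which counts non-missing values. It also runs the grouped `value_counts()` mentioned above for comparison.

```python
import numpy as np
import pandas as pd

# Made-up data; row 1 has a missing data1 value
df = pd.DataFrame({"key1": list("aabba"),
                   "key2": ["one", "two", "one", "two", "one"],
                   "data1": [1.0, np.nan, 3.0, 4.0, 5.0]})

# size() counts rows in each group, including rows with NaN
sizes = df.groupby(["key1", "key2"]).size()

# count() counts only the non-missing data1 values in each group
counts = df.groupby(["key1", "key2"])["data1"].count()

# Grouped value_counts gives the same numbers as size(), sorted by count within key1
vc = df.groupby("key1")["key2"].value_counts()

print(sizes)
print(counts)
```

Here the ("a", "two") group has size 1 but count 0, because its only data1 value is missing.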
----------------------------------------
Another example:
Count how many times each user visited each brand_id on each day
gp = data.groupby(["user_id","brand_id","day"]).size().reset_index()
gp.rename(columns={0:"count"},inplace=True)
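The `data` frame is not shown in the original, so the sketch below builds a hypothetical visit log with the same three column names to make the snippet runnable. It also uses `reset_index(name="count")`, which names the count column directly and should make the separate rename step unnecessary.

```python
import pandas as pd

# Hypothetical visit log; only the column names come from the snippet above
data = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "brand_id": [10, 10, 20, 10, 10],
    "day":      ["2018-01-01", "2018-01-01", "2018-01-02",
                 "2018-01-01", "2018-01-02"],
})

# One row per (user, brand, day) combination with its number of visits
gp = (data.groupby(["user_id", "brand_id", "day"])
          .size()
          .reset_index(name="count"))  # name= labels the count column

print(gp)
```

Only combinations that actually occur in the log appear in the result, so a user with no visits to a brand on a given day gets no row rather than a zero.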
-------------------------------------------