Foreword
Grouping and aggregation operations exist in many relational and non-relational databases, and the underlying principles are largely the same: data is grouped by one or more fields, and an aggregation is then computed over the fields of each group. pandas provides a very friendly set of grouping and aggregation functions, so data can easily be aggregated along different dimensions.
A brief look at the API
First, construct a sample DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print(df)
The output is:
1. Grouping by a single field
# group by key1
grouped = df.groupby(by='key1')
print(grouped)
print(type(grouped))
for i in grouped:
    print(i)
    print("-" * 50)
The output is:
Grouping by a field produces a GroupBy object. For this object, pandas provides a series of operations (functions) that can be used directly, such as mean() to compute averages and count() to count rows.
Iterating over the GroupBy object shows that each element is a tuple: one item is the group name, and the other is a DataFrame holding that group's rows. All kinds of operations can then be performed per group.
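To make the tuple structure explicit, the loop can unpack each element into a name and a DataFrame; get_group() fetches a single group by name. A small sketch using the same toy DataFrame as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

grouped = df.groupby(by='key1')

# unpack each tuple into the group name and the group's DataFrame
for name, group in grouped:
    print(name)          # the group name: 'a' or 'b'
    print(group.shape)   # the rows belonging to this group

# fetch one group's rows directly by name
print(grouped.get_group('b'))
```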
Computing the mean of one column:
# mean of data1 per group
mean_data1 = grouped['data1'].mean()
print(mean_data1)
2. Grouping by multiple fields
# group by multiple fields, then count data1
multi_group = df.groupby(by=[df['key1'],df['key2']])['data1'].count()
print(multi_group)
It can also be written as follows:
multi_group = df.groupby(by=['key1','key2'])['data1'].count()
print(multi_group)
This is easy to understand: when grouping by multiple fields, the data is generally divided from a larger dimension down to smaller ones. No matter how many levels of grouping there are, the underlying data is still the complete rows and columns; each level simply drills further down. In the example above, the data is first grouped by key1 and then key2, after which the data1 column is taken from each group and counted.
Of course, the column that is ultimately aggregated can be selected either before or after the groupby call, so the following is also a valid way to write it:
multi_group_names = df['data1'].groupby(by=[df['key1'],df['key2']]).count()
print(multi_group_names)
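Since grouping by two fields yields a Series with a two-level MultiIndex, the result can be reshaped into a table with the standard unstack() method. A small sketch using the same toy DataFrame as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

counts = df.groupby(by=['key1', 'key2'])['data1'].count()
print(counts)        # a Series with a (key1, key2) MultiIndex

# pivot the inner index level into columns: one row per key1, one column per key2
table = counts.unstack()
print(table)
```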
Important Note:
The result of the grouping operations above is actually of type Series, but it can also be a DataFrame. Different result types support different functions, so how the column is selected matters. Written as in the following example, the result is of type DataFrame:
multi_group_names = df[['data1']].groupby(by=[df['key1'],df['key2']]).count()
print(type(multi_group_names))
print(multi_group_names)
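The difference in result types can be verified directly with the same toy DataFrame as above: selecting the column with a string yields a Series, while selecting with a one-element list yields a DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# selecting the column with a string -> the result is a Series
s = df.groupby(['key1', 'key2'])['data1'].count()
print(type(s))

# selecting with a one-element list -> the result is a DataFrame
d = df[['data1']].groupby([df['key1'], df['key2']]).count()
print(type(d))
```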
Data aggregation
Aggregation refers to any data transformation that produces scalar values from arrays, for example mean, count, min, and sum. Many common aggregations have optimized implementations for computing group statistics, but those are not the only options: you can define your own aggregation operation, and you can also call any method already defined on the grouped object. For example, quantile computes sample quantiles of a Series or of the columns of a DataFrame.
1. Compute the 0.9 quantile of data1 after grouping by key1
grouped=df.groupby('key1')
print(grouped) # a GroupBy object
print(grouped['data1'].quantile(0.9))
2. For a custom aggregation function, simply pass it to aggregate or agg
grouped=df.groupby('key1')
# custom function: the range (max - min) of each group
def peak_to_peak(arr):
    return arr.max() - arr.min()
print(grouped.agg(peak_to_peak))
Note: custom aggregation functions are slower than the optimized built-in functions, because there is considerable overhead (function calls, data rearrangement, and so on) in constructing the intermediate group data blocks. Use them with care, but in some special business scenarios they are very effective.
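agg also accepts a list that mixes custom functions with the names of built-in aggregations, producing one result column per function. A sketch using the same toy DataFrame and peak_to_peak function as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped = df.groupby('key1')

# one result column per aggregation: the custom range plus built-in mean and count
stats = grouped['data1'].agg([peak_to_peak, 'mean', 'count'])
print(stats)
```

Column names come from the function name for callables and from the string for built-ins.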
Case study
The following example analyzes an Excel table of the 2019 national list of universities; a portion of the data is shown below.
The table lists fields such as province, city, university name, and rank for 2019. A few simple requirements are worked through next.
1. Count the number of universities in each province
iopath = './全国高校名单excel版(2019).xlsx'
df = pd.read_excel(iopath,sheet_name='全国高校')
# count universities per province: first group by province to get a GroupBy object
# iterating the GroupBy object yields tuples, each holding a group name and a DataFrame,
# so group-related functions can then be used for further operations
grouped = df.groupby(by="province")
print(grouped['universityName'].count())
2. Count the number of universities in Hebei Province
group_count = df.groupby(by="province")['universityName'].count()['河北省']
print(group_count)
3. Count the number of universities in a single province by filtering first
hebei_data = df[df['province'] == '河北省']
print(hebei_data['universityName'].count())
4. Count the number of universities in each city of Hebei Province
hebei_data = df[df['province'] == '河北省']
city_grouped = hebei_data.groupby(by='city')
print(city_grouped.count()['universityName'])
5. Multi-field grouping: count the number of universities in each city of each province
multi_group = df.groupby(by=['province','city'])['universityName'].count()
print(multi_group)
As with the multi-field grouping shown earlier, it can also be written as follows:
multi_group_names = df['universityName'].groupby(by=[df['province'],df['city']]).count()
print(multi_group_names)
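The same counting patterns can be tried without the Excel file. The hand-made DataFrame below uses hypothetical sample rows (the real file has many more) that mimic the province, city, and universityName columns:

```python
import pandas as pd

# hypothetical sample rows standing in for the 2019 Excel file
df = pd.DataFrame({
    'province': ['河北省', '河北省', '河北省', '北京市', '北京市'],
    'city': ['石家庄市', '石家庄市', '保定市', '北京市', '北京市'],
    'universityName': ['大学A', '大学B', '大学C', '大学D', '大学E'],
})

# universities per province
per_province = df.groupby('province')['universityName'].count()
print(per_province)

# universities per city within each province
per_city = df.groupby(['province', 'city'])['universityName'].count()
print(per_city)
```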
Plotting
After the analysis and statistics, the results usually need to be presented graphically. Below, the per-province counts obtained above, and the number of universities in each city of Hebei Province, are displayed as charts.
1. Number of universities in different provinces nationwide
import pandas as pd
from matplotlib import pyplot as plt
iopath = './全国高校名单excel版(2019).xlsx'
df = pd.read_excel(iopath,sheet_name='全国高校')
print(df.head(2))
# display Chinese labels correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# show the top 10 provinces by number of universities
grouped = df.groupby(by='province').count()['universityName'].sort_values(ascending=False)[:10]
print(grouped)
_x = grouped.index
_y = grouped.values
plt.bar(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x)
plt.show()
The resulting chart is shown below.
2. Number of universities in the different cities of Hebei Province
# number of universities in each city of Hebei Province
df = df[df['province']=='河北省']
grouped = df.groupby(by='city').count()['universityName'].sort_values(ascending=False)
# set a Chinese font to avoid garbled labels
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
print(grouped)
_x = grouped.index
_y = grouped.values
plt.bar(range(len(_x)),_y,width=0.3,color='orange')
plt.xticks(range(len(_x)),_x)
plt.show()
The details of the charts could be refined further; we will not go into that here.
That is basically it for this post, which gave a simple walkthrough of common grouping operations. Grouping and aggregation are a very important part of pandas, and follow-up posts will continue to explore them in depth. Finally, thanks for reading!