Grouping and aggregating data with pandas

Foreword

Grouping and aggregation operations come up in many relational and non-relational databases, and the underlying principles are essentially the same: group the data by one or more fields, then run aggregation operations on the fields of each group. pandas provides a very friendly set of group-and-aggregate functions, so you can easily group data and aggregate it across different dimensions.

A brief look at the API

The sample DataFrame used below

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print(df)

The result is shown below.
[figure: the printed DataFrame]

1. Grouping by a single field

# group by key1
grouped = df.groupby(by='key1')
print(grouped)
print(type(grouped))

for i in grouped:
    print(i)
    print("-"*50)

The results are shown below.
[figure: the GroupBy object, then each group printed in turn]

Grouping by a field yields a GroupBy object. pandas provides a series of methods (functions) on this object that can be used directly, such as mean() for averages and count() for counts.
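For instance, a minimal sketch against the df defined above (passing numeric_only is an assumption about your pandas version; recent releases require it when string columns like key2 are present):

# direct aggregation methods on the GroupBy object; numeric_only=True
# skips the string column key2 so the means can be computed
print(grouped.mean(numeric_only=True))
print(grouped['data2'].count())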

If you iterate over the GroupBy object, you find that each element is a tuple: the first item is the group name, the second is the rows belonging to that group. From each group you can then perform all kinds of operations.
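A small sketch of that unpacking, using the same grouped object:

# each iteration yields (group name, rows of that group as a DataFrame)
for name, group in grouped:
    print(name)          # the key1 value of this group, e.g. 'a' or 'b'
    print(group.shape)   # number of rows and columns in the group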

Computing the per-group mean of a column

# mean of data1 per group
mean_data1 = grouped['data1'].mean()
print(mean_data1)

[figure: per-group means of data1]
2. Grouping by multiple fields

# group by multiple keys and count data1 in each group
multi_group = df.groupby(by=[df['key1'],df['key2']])['data1'].count()
print(multi_group)

It can also be written as follows:

multi_group = df.groupby(by=['key1','key2'])['data1'].count()
print(multi_group)

This is easy to understand: with multiple grouping keys, the data is generally split from the larger dimension down to the smaller one. However many levels of grouping there are, each final group still contains complete rows and columns of the data, and you drill down level by level to fetch it. In the example above, the data is grouped first by key1 and then by key2, and the data1 column of the grouped rows is counted.
[figure: counts of data1 grouped by key1 and key2]

Of course, the column to be aggregated can be selected before the groupby just as well as after it, so the following also works:

multi_group_names = df['data1'].groupby(by=[df['key1'],df['key2']]).count()
print(multi_group_names)

[figure: the same counts, selecting data1 before grouping]

Important note:
The grouped results obtained above are in fact of type Series, but they can also be of type DataFrame. The functions available on the result differ depending on its type. Written as below (selecting with a list of columns), the result is of type DataFrame:

multi_group_names = df[['data1']].groupby(by=[df['key1'],df['key2']]).count()
print(type(multi_group_names))
print(multi_group_names)

[figure: the DataFrame-typed result and its type]
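As an extra sketch (an addition, not from the original post): a Series result can also be flattened back into an ordinary DataFrame with reset_index, which is often handy downstream; the column name 'n' here is an arbitrary choice:

# flatten the (key1, key2)-indexed counts into a plain DataFrame
flat = df.groupby(by=['key1', 'key2'])['data1'].count().reset_index(name='n')
print(flat)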

Data aggregation

Aggregation refers to any data transformation that produces scalar values from arrays, for example mean, count, min, and sum. Many common aggregations have optimized implementations that compute statistics over the dataset in place, but those are not the only ones you can use. You can define your own aggregation functions, and you can also call any method already defined on the grouped object. For example, quantile computes sample quantiles of a Series or of a DataFrame column.

1. Group by key1, then compute the 0.9 quantile of data1

grouped=df.groupby('key1')
print(grouped)  # returns a GroupBy object
print(grouped['data1'].quantile(0.9))

[figure: the 0.9 quantile of data1 per group]

2. To use your own aggregation function, just pass it to the aggregate or agg method

grouped=df.groupby('key1')

# custom aggregation function: peak-to-peak range of each column
def peak_to_peak(arr):
    return arr.max() - arr.min()

# select the numeric columns explicitly; recent pandas versions no longer
# drop the string column key2 automatically, which would make this raise
print(grouped[['data1', 'data2']].agg(peak_to_peak))

[figure: peak-to-peak range per group]
Note: custom functions are much slower than the optimized built-in ones, because constructing the intermediate group data blocks carries considerable overhead (function calls, data rearrangement, and so on). Use them with care, though in some special business scenarios they are very effective.
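One way to limit that cost (a sketch, not from the original post) is to pass the optimized built-ins by name and apply the custom function only alongside them:

# string names such as 'mean' and 'count' dispatch to pandas' optimized
# implementations; peak_to_peak runs as plain Python on each group
print(grouped[['data1', 'data2']].agg(['mean', 'count', peak_to_peak]))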

Case study

The following uses an Excel table of China's 2019 national list of universities for some related analysis; a portion of the data is shown below.
[figure: excerpt of the 2019 university list]
The table lists fields such as the 2019 provinces, the cities within each province, and each university's name, ranking, and other information. Below, a few simple requirements are worked through.

1. Count the number of universities in each province

iopath = './全国高校名单excel版(2019).xlsx'
df = pd.read_excel(iopath,sheet_name='全国高校')

# Count the universities per province: first group by province to get a GroupBy object.
# Iterating the GroupBy object yields tuples: one element is the group name,
# the other is a DataFrame of that group's rows.
# Group-related functions can then be used for the follow-up operations.
grouped = df.groupby(by="province")
print(grouped['universityName'].count())

2. Count the number of universities in Hebei Province (河北省)

group_count = df.groupby(by="province")['universityName'].count()['河北省']
print(group_count)

3. Count the universities in a single province by filtering first

hebei_data = df[df['province'] == '河北省']
print(hebei_data['universityName'].count())

4. Count the universities in each city of Hebei Province

hebei_data = df[df['province'] == '河北省']
city_grouped = hebei_data.groupby(by='city')
print(city_grouped.count()['universityName'])

[figure: university counts per city in Hebei Province]

5. Multi-field grouping: count the universities in each city of each province

multi_group = df.groupby(by=['province','city'])['universityName'].count()
print(multi_group)

[figure: counts grouped by province and city]

Following the multi-field grouping described above, this can also be written as:

multi_group_names = df['universityName'].groupby(by=[df['province'],df['city']]).count()
print(multi_group_names)
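As a further sketch (an addition to the original post): the two-level province/city result can be pivoted into a wide table with unstack(), which puts cities into columns:

# pivot the inner index level (city) into columns; absent combinations become NaN
table = multi_group.unstack()
print(table.head())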

Plotting

After the analysis and statistics, the results usually need to be presented graphically. Below, the nationwide per-province university counts obtained above, and the per-city counts for Hebei Province, are displayed as charts.

1. Number of universities in different provinces nationwide

import pandas as pd
from matplotlib import pyplot as plt

iopath = './全国高校名单excel版(2019).xlsx'
df = pd.read_excel(iopath,sheet_name='全国高校')
print(df.head(2))

# avoid garbled Chinese characters in the plot
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# show the top 10 provinces by number of universities
grouped = df.groupby(by='province').count()['universityName'].sort_values(ascending=False)[:10]
print(grouped)
_x = grouped.index
_y = grouped.values
plt.bar(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x)
plt.show()

The rendered chart is shown below.
[figure: bar chart of the top 10 provinces by university count]

2. Number of universities in the different cities of Hebei Province

# show the number of universities in each city of Hebei Province
df = df[df['province']=='河北省']
grouped = df.groupby(by='city').count()['universityName'].sort_values(ascending=False)
# set a Chinese display font to avoid garbled labels
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
print(grouped)
_x = grouped.index
_y = grouped.values
plt.bar(range(len(_x)),_y,width=0.3,color='orange')
plt.xticks(range(len(_x)),_x)
plt.show()

The chart details could be refined further; I won't go into that too much here.

That's basically it for this post: a simple walkthrough of the common grouping operations. Grouping and aggregation are a very important part of pandas, and I'll keep doing more in-depth study of them when I have time. Finally, thanks for reading!



Origin: blog.csdn.net/zhangcongyi420/article/details/103964075