Python teaching | Pandas group aggregation and data sorting

The Python teaching column aims to give beginners a systematic, comprehensive introduction to Python programming. It explains the basics of the language and its programming logic step by step and pairs them with practical cases, so that newcomers can pick up Python easily.

Table of contents

Part1 Foreword

Part2 Group Aggregation Overview

Part3 Pandas grouping function: groupby()

Part4 Data sorting

Part5 Summary

Part6 Python Tutorial


Part1 Foreword

Much of our data is stored in flat form, panel data especially. For example, a company has several types of shareholders, and each type may contain several shareholders. To total the investment amount by shareholder type, we group the shareholders by type and then add up the investment amounts of all shareholders within each group. Besides sums, we can also compute the mean, variance, count and other statistics within a group. In Python (Pandas) we can group and aggregate numeric information, and we can also use custom aggregation functions to merge text information within a group. This article introduces basic grouping and aggregation with Pandas and, along the way, how to sort data.

This tutorial is based on pandas version 1.5.3.

All Python code in this article was written in Visual Studio Code (VS Code), using Jupyter Notebook as the interactive environment. Please open the code shared in this article with Jupyter Notebook.

Click the link to the original article to see how to obtain all the demo code and data used for demonstration in this article: Python teaching | Pandas grouping aggregation and data sorting

Part2 Group Aggregation Overview

In practical applications, data sets are often large and complex and contain many kinds of information and variables. Usually the data is organized by individual, with one row describing one individual. Sometimes, however, a single row cannot hold all of the information. For example, suppose we have shareholder data for agriculture-related listed companies with two fields: 股东名称 (shareholder name) and 股东认缴出资额 (subscribed capital contribution). Because a listed company may have more than one shareholder, the data is usually expanded so that each shareholder occupies its own row while the other information is repeated unchanged. When we then want to analyze shareholder contributions across companies, this layout is no longer intuitive. In this situation we can use grouped aggregation: group (classify) the data by certain columns and compute statistics for each group, which simplifies the analysis. In the scenario above, we can group the data by company name and then compute the sum, average and other statistics of shareholder contributions in each group, which gives a clearer picture of the investment situation and provides useful support for further analysis.

Common data processing tools generally support grouping and aggregation. In Excel, sums, averages and similar aggregations per group can be produced with "Pivot Tables"; in Stata, grouped aggregation is done with the collapse command; in a database (SQL), it is implemented with the GROUP BY clause; in Python, you can use the groupby() function in Pandas. Each of these four approaches has its own advantages and disadvantages, summarized briefly below.

  • Excel: The advantages are ease of use, a simple interface and clear interaction. The disadvantage is that it copes poorly with even moderately large data sets, a common problem with office software. Although a worksheet can hold up to 1,048,576 rows, in practice Excel is already close to breaking down at a few hundred thousand rows.

  • Stata: Well suited to social science researchers who already know Stata, and easy to get started with. However, Stata is not widely used and matches the habits of only a small share of people who work with data.

  • SQL: SQL can be called the originator of grouped data processing. Because it can use the hard disk to carry out data operations, it does not crash the computer when processing data larger than memory, although it is somewhat slower. Modifying data in SQL is relatively cumbersome, however; in this respect it resembles Stata and Python and is less flexible than Excel.

  • Python: Pandas is a general-purpose tool for data processing and analysis, and grouping and aggregating data with it is straightforward. Compared with Excel, Pandas can process far more data, and it is not limited to routine aggregations such as sum, count and mean: custom functions allow more tailored aggregations. The entry barrier is relatively high, though, since users need some foundation in the Python language.
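To make the comparison between the SQL and Pandas approaches concrete, here is a minimal sketch that runs the same aggregation both ways. The three rows, the table name shareholders and the alias total are made up for illustration, and an in-memory SQLite database stands in for a real database server.

```python
import sqlite3
import pandas as pd

# Illustrative rows only; the real demo data is introduced later in the article.
data = pd.DataFrame({
    '企业名称': ['上海海融食品科技股份有限公司', '上海海融食品科技股份有限公司', '隆基绿能科技股份有限公司'],
    '认缴出资额(万元)': [1728.0, 2400.0, 377201.6757],
})

# SQL: load the rows into an in-memory SQLite table and aggregate with GROUP BY.
conn = sqlite3.connect(':memory:')
data.to_sql('shareholders', conn, index=False)
sql_result = pd.read_sql_query(
    'SELECT "企业名称", SUM("认缴出资额(万元)") AS total '
    'FROM shareholders GROUP BY "企业名称" ORDER BY "企业名称"',
    conn,
)
conn.close()

# Pandas: the same aggregation with groupby().
pandas_result = data.groupby('企业名称', as_index=False)['认缴出资额(万元)'].sum()

print(sql_result)
print(pandas_result)
```

Both results contain one row per company with the total contribution of its shareholders.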

In this article, we will focus on how to use Pandas in Python to perform grouping and aggregation of data.

Part3 Pandas grouping function: groupby()

Before formally introducing groupby(), we first use pandas to read the sample shareholder data of agriculture-related listed companies that will be used for the demonstrations.

import pandas as pd
# Read the demo data
data = pd.read_csv('./涉农上市公司股东信息表_13条样例数据.csv')
data

[Figure: preview of the 13 rows of sample data]
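If the CSV file is not at hand, an equivalent table can be rebuilt by hand from the 13 rows that appear in the printed grouping output further down in this article. This is only a stand-in, and it assumes the file contains exactly the four columns visible in that output.

```python
import pandas as pd

# Stand-in for the CSV file: the 13 sample rows as printed in this article,
# in their original index order (0 to 12).
rows = [
    ('东阿阿胶股份有限公司', '个人', '国家股', 12108.1385),
    ('东阿阿胶股份有限公司', '企业', '社会公众股', 28763.0164),
    ('金正大生态工程集团股份有限公司', '企业', '雅戈尔投资有限公司', 7500.0),
    ('金正大生态工程集团股份有限公司', '企业', 'CRF化肥投资有限公司', 7500.0),
    ('金正大生态工程集团股份有限公司', '企业', '临沂金正大投资控股有限公司', 30672.0),
    ('金正大生态工程集团股份有限公司', '企业', '人民币普通股', 10000.0),
    ('金正大生态工程集团股份有限公司', '个人', '万连步', 14328.0),
    ('乐山巨星农牧股份有限公司', '企业', '四川乐山振静皮革制品有限公司', 8866.0),
    ('乐山巨星农牧股份有限公司', '个人', '贺正刚', 3471.0),
    ('乐山巨星农牧股份有限公司', '企业', '四川和邦投资集团有限公司', 1663.0),
    ('隆基绿能科技股份有限公司', '个人', '李振国等', 377201.6757),
    ('上海海融食品科技股份有限公司', '个人', '黄海瑚', 1728.0),
    ('上海海融食品科技股份有限公司', '个人', '黄海晓', 2400.0),
]
data = pd.DataFrame(rows, columns=['企业名称', '股东类别', '股东名称', '认缴出资额(万元)'])
print(data.shape)
```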

Next we introduce groupby(), a method of the Pandas DataFrame type, that is, it is called on tabular DataFrame data (the Series type also has a groupby() method, but it is used less often). It groups a table by the specified fields and returns a grouper object (分组器对象). Its basic syntax and the meaning of its common parameters are as follows.

df.groupby(by=None, axis=0, as_index=True, dropna=True, sort=True)

Common parameters, with their allowed values and meaning:

  • by (a single field name or a list of field names): The basis for grouping; in the usage shown here it must be supplied. To group by one field, pass its name; to group by several fields, pass a list containing all of their names, paying attention to their order.

  • axis (0 or 1): The default is 0, which groups by rows; this is the most common choice and matches the usual way of working with data. 1 groups by columns (not covered here).

  • as_index (True or False): The default is True, meaning the grouping field becomes the row index of the aggregated result. With several grouping fields, all of them become row index levels and the result has a multi-level row index. With False, the grouping fields are kept as ordinary columns instead of being set as the index.

  • dropna (True or False): The default is True, meaning rows whose grouping field is missing are dropped; with False, missing values form a group of their own.

  • sort (True or False): The default is True, meaning the groups in the result are sorted by the grouping key. With False no sorting is done and the groups appear in the order in which they first occur in the data, which also speeds up grouping.
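The effect of sort and dropna is easiest to see on a tiny table. The sketch below uses made-up columns key and value (not part of the demo data), with one missing group key.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the group key of the last row is missing.
df = pd.DataFrame({'key': ['b', 'a', 'b', np.nan], 'value': [1, 2, 3, 4]})

default = df.groupby('key')['value'].sum()                 # sort=True, dropna=True
unsorted = df.groupby('key', sort=False)['value'].sum()    # groups in order of first appearance
with_nan = df.groupby('key', dropna=False)['value'].sum()  # the missing key forms its own group

print(default)
print(unsorted)
print(with_nan)
```

With the defaults the row with the missing key is dropped and the groups come out as a, b; with sort=False they come out as b, a; with dropna=False a third group holds the row whose key is missing.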

As mentioned above, groupby() returns a grouper that holds the grouping result. Because a grouped table contains several groups, it is hard to display directly, so what is returned is a grouper whose contents are not shown when printed; the grouper also offers further attributes and supports the aggregation operations. Next we group the demo data by the 企业名称 (company name) field. The code is as follows.

# Group the data by the 企业名称 field
data_grouped = data.groupby(by='企业名称', as_index=False)
# Try printing the grouper
print(data_grouped)
# Output: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F5F2CA3490>
# Printing the grouper only shows a DataFrameGroupBy object

Although we cannot directly view the contents of the grouper, we can use a loop to obtain the contents of different groups. The code is as follows.

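# Use a dict comprehension to turn the grouper into a dictionary:
# each key is a group key, each value is that group's data (a table)
All_groups = {name:group for name, group in data_grouped}
print(All_groups)
# Output (tidied for readability; the actual layout differs slightly)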
"""
{   '上海海融食品科技股份有限公司':             企业名称    股东类别  股东名称    认缴出资额(万元)
                            11  上海海融食品科技股份有限公司   个人     黄海瑚     1728.0
                            12  上海海融食品科技股份有限公司   个人     黄海晓     2400.0,
    '东阿阿胶股份有限公司':                      企业名称    股东类别  股东名称    认缴出资额(万元)
                            0  东阿阿胶股份有限公司         个人        国家股      12108.1385
                            1  东阿阿胶股份有限公司         企业        社会公众股  28763.0164,
    '乐山巨星农牧股份有限公司':                   企业名称    股东类别  股东名称    认缴出资额(万元)
                            7  乐山巨星农牧股份有限公司   企业  四川乐山振静皮革制品有限公司     8866.0
                            8  乐山巨星农牧股份有限公司   个人             贺正刚     3471.0
                            9  乐山巨星农牧股份有限公司   企业    四川和邦投资集团有限公司     1663.0,
    '金正大生态工程集团股份有限公司':              企业名称    股东类别  股东名称                   认缴出资额(万元)
                            2  金正大生态工程集团股份有限公司   企业    雅戈尔投资有限公司          7500.0
                            3  金正大生态工程集团股份有限公司   企业    CRF化肥投资有限公司         7500.0
                            4  金正大生态工程集团股份有限公司   企业    临沂金正大投资控股有限公司   30672.0
                            5  金正大生态工程集团股份有限公司   企业    人民币普通股                10000.0
                            6  金正大生态工程集团股份有限公司   个人    万连步                      14328.0,
    '隆基绿能科技股份有限公司':                   企业名称    股东类别  股东名称    认缴出资额(万元)
                            10  隆基绿能科技股份有限公司        个人    李振国等    377201.6757}
"""

The output shows that groupby() divides the data into groups according to the grouping field given in the by parameter. In the operation above we grouped by the 企业名称 field, so each group contains the rows that share the same company name. This is the basic principle of grouping.

After the data is grouped, it needs to be aggregated. What is aggregation? Having grouped the data in the previous step, aggregation means computing certain fields within each group and then combining the per-group results. For example, the previous step grouped the data by 企业名称. To analyze the total subscribed capital of all shareholders in each company, we can use the "sum" method to add up 认缴出资额(万元) (subscribed capital, in units of 10,000 yuan) within each group and then combine the per-group totals into the final aggregated result. The code is as follows.

# Group, then sum 认缴出资额(万元) within each group
Result = data.groupby(by='企业名称', as_index=False).sum()   # .sum() sums every field that can be summed
Result

[Figure: aggregated result, one row per 企业名称 with the summed 认缴出资额(万元)]

Looking at the result, the aggregated 认缴出资额(万元) values are the sums within each group, and the rows are ordered by the grouping key 企业名称, because the sort parameter defaults to True. In the code above we also set as_index=False explicitly, so the grouping field is kept as an ordinary column rather than becoming the row index. Without this parameter we get the following result.

[Figure: aggregated result with 企业名称 as the row index]

When as_index is not set, the grouping field 企业名称 becomes the row index of the aggregated result, and the string "企业名称" is the name of that index.
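The two forms can be compared side by side on a few rows. In this sketch the three rows are taken from the sample data, and only the key column and the numeric column are kept so that sum() behaves the same across pandas versions.

```python
import pandas as pd

data = pd.DataFrame({
    '企业名称': ['上海海融食品科技股份有限公司', '上海海融食品科技股份有限公司', '隆基绿能科技股份有限公司'],
    '认缴出资额(万元)': [1728.0, 2400.0, 377201.6757],
})

indexed = data.groupby('企业名称').sum()                 # 企业名称 becomes the row index
flat = data.groupby('企业名称', as_index=False).sum()    # 企业名称 stays an ordinary column

print(indexed.index.name)
print(flat.columns.tolist())

# reset_index() turns the index of the first form back into a column.
print(indexed.reset_index().columns.tolist())
```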

Note also that the aggregated result contains only two fields. The grouping basis (the by parameter) is the single field 企业名称, and during aggregation only the numeric field 认缴出资额(万元) meets the requirements of the sum() function, so the result contains just these two columns. This is the behavior under pandas 1.5.3. From pandas 2.0.0 the default behavior of grouped aggregation changed; running the code above under pandas 2.0.0 gives the following result.

[Figure: result of the same code under pandas 2.0.0]
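One way to get the same result regardless of the pandas version is to state explicitly which columns should be summed. The sketch below shows two options on three rows taken from the sample data; both are suggestions rather than the approach used in the rest of this article.

```python
import pandas as pd

data = pd.DataFrame({
    '企业名称': ['上海海融食品科技股份有限公司', '上海海融食品科技股份有限公司', '隆基绿能科技股份有限公司'],
    '股东名称': ['黄海瑚', '黄海晓', '李振国等'],
    '认缴出资额(万元)': [1728.0, 2400.0, 377201.6757],
})

# Option 1: select the column to aggregate before calling sum().
by_column = data.groupby('企业名称', as_index=False)['认缴出资额(万元)'].sum()

# Option 2: ask sum() to ignore non-numeric columns.
numeric_only = data.groupby('企业名称', as_index=False).sum(numeric_only=True)

print(by_column)
print(numeric_only)
```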

In pandas 2.0.0, sum() not only adds up columns that store numeric values but also concatenates columns whose values are strings. To avoid surprises, we can use a more general aggregation function, agg(). agg() is a method of the grouper that lets you apply different aggregation operations to different fields within each group. Below we use agg() with an anonymous function to join the names of all shareholders of the same type in the same company, separated by the enumeration comma (、). The code is as follows.

# Group by 企业名称 and 股东类别, then join the 股东名称 strings, separated by the enumeration comma (、)
Result = data.groupby(by=['企业名称', '股东类别'], as_index=False)\
                .agg({'股东名称': lambda x : '、'.join(x)})  
Result

[Figure: shareholder names joined per 企业名称 and 股东类别]

In the code above, by=['企业名称', '股东类别'] in groupby() means the data is grouped by the two fields 企业名称 and 股东类别 (shareholder type). agg() accepts a dictionary: each key is the name of a field to aggregate, and each value is the aggregation method for that field. Common methods include sum ('sum'), mean ('mean'), standard deviation ('std'), maximum ('max') and minimum ('min'). Most of these are meant for numeric fields. If they do not meet the need, we can use a custom function or an anonymous function to process the field within each group freely. In the code above, for example, an anonymous function joins the string data in each group. When writing the anonymous function, we can treat its input x as a one-dimensional Series holding the values of the field to be aggregated, which is indeed what it is.
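The same aggregation can be written with a named function, which makes it easy to confirm what agg() passes in. In this sketch join_names is a helper introduced only for illustration, and the three rows are taken from the sample data.

```python
import pandas as pd

data = pd.DataFrame({
    '企业名称': ['乐山巨星农牧股份有限公司'] * 3,
    '股东类别': ['企业', '个人', '企业'],
    '股东名称': ['四川乐山振静皮革制品有限公司', '贺正刚', '四川和邦投资集团有限公司'],
})

def join_names(names):
    # `names` is a pandas Series holding the 股东名称 values of one group.
    assert isinstance(names, pd.Series)
    return '、'.join(names)

result = data.groupby(['企业名称', '股东类别'], as_index=False).agg({'股东名称': join_names})
print(result)
```

A named function behaves like the lambda but can carry comments and checks, which helps once the aggregation logic grows beyond one expression.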

In addition, agg() can aggregate different fields of a group in different ways at the same time, and it can apply several aggregation methods to one field. For example, the following code groups by 企业名称 and 股东类别, computes three aggregations (sum, mean and maximum) of the 认缴出资额(万元) field, and at the same time joins the values of the 股东名称 field.

# Apply different aggregations to several fields at the same time
Result = data.groupby(by=['企业名称', '股东类别'], as_index=False)\
                .agg({'认缴出资额(万元)': ['sum', 'mean', 'max'],
                      '股东名称': lambda x: '、'.join(x)})
Result

[Figure: result with sum, mean and max of 认缴出资额(万元) and the joined 股东名称]
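When a field is given a list of aggregation methods, the result has two-level column labels of the form (field, method). If single-level names are easier to work with, one possible follow-up step is to join the two levels. The underscore naming below is just one convention, and the three rows are taken from the sample data.

```python
import pandas as pd

data = pd.DataFrame({
    '企业名称': ['乐山巨星农牧股份有限公司'] * 3,
    '股东类别': ['企业', '个人', '企业'],
    '认缴出资额(万元)': [8866.0, 3471.0, 1663.0],
})

result = data.groupby(['企业名称', '股东类别']).agg({'认缴出资额(万元)': ['sum', 'mean', 'max']})
print(result.columns.tolist())   # two-level column labels: (field, aggregation)

# Join the two levels into single labels, then turn the group keys back into columns.
result.columns = ['_'.join(col) for col in result.columns]
result = result.reset_index()
print(result)
```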

The above is the basic method of using grouped aggregation in Pandas.

Part4 Data sorting

This column has covered a lot of Pandas but has never found an occasion to introduce data sorting. This issue briefly shows how to sort data with Pandas.

The most common way to sort data in Pandas is df.sort_values(), which sorts by the data values in the table. Its basic syntax and the meaning of its common parameters are as follows.

df.sort_values(by=IndexLabel, 
               axis=0, 
               ascending=True, 
               na_position='last', 
               inplace=False, 
               ignore_index=False)

Common parameters, with their allowed values and meaning:

  • by (a single field name or a list of field names): Specifies which column or columns to sort by. For a DataFrame this parameter is required; it has no default.

  • axis (0 or 1): Specifies which axis to sort along. The default is 0, which sorts the data rows and is the more common choice. With 1 the data columns are sorted, which can be useful for transposed data.

  • ascending (True or False; a list of True and False values when there is more than one sort field): Specifies whether to sort in ascending order. The default is True (ascending); False sorts in descending order. If by contains several fields, this parameter can be a list of True and False values. For example, with by=[A, B] you can set ascending=[True, False], meaning the data is sorted by column A in ascending order and then, within equal values of A, by column B in descending order.

  • na_position ('last' or 'first'): Specifies where missing values are placed. The default is 'last', which puts missing values at the end; 'first' puts them at the front.

  • inplace (True or False): Specifies whether to operate on the original DataFrame. The default is False, which leaves the original DataFrame untouched and returns a new, sorted DataFrame. With True, the original data (the caller) is sorted directly and nothing is returned.

  • ignore_index (True or False): Specifies whether to ignore the index. The default is False, which keeps the original index. With True, the original index is discarded and a new consecutive index is generated.
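Several of these parameters can be seen working together on a tiny table. The columns A and B below are made up for illustration, with one missing value in B.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in column B.
df = pd.DataFrame({'A': ['x', 'y', 'x', 'y'], 'B': [2.0, np.nan, 5.0, 1.0]})

# A ascending, then B descending within equal A; missing values first; new 0..n-1 index.
result = df.sort_values(by=['A', 'B'], ascending=[True, False],
                        na_position='first', ignore_index=True)
print(result)

# With inplace=True the caller itself is sorted and the call returns None.
copy = df.copy()
returned = copy.sort_values(by='B', inplace=True)
print(returned)
print(copy)
```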

Sorting involves many details, which is why the function has so many parameters, but they are easy to understand and not troublesome to use. Below is an example of how to sort data.

We still take the agricultural-related listed company data used above as an example. Before sorting, we first use the df.sample() function to disrupt the order of the data. The code is as follows.

# Shuffle the order of the data rows
unordered_data = data.sample(frac=1)
unordered_data    # View the shuffled data

[Figure: the sample data with its rows in shuffled order]
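Because sample() shuffles randomly, each run gives a different order. If a repeatable shuffle is wanted, sample() also accepts a random_state argument. The sketch below uses a made-up ten-row table, and the seed 42 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'value': range(10)})

# The same seed gives the same shuffled order on every run.
shuffled_a = df.sample(frac=1, random_state=42)
shuffled_b = df.sample(frac=1, random_state=42)

print(shuffled_a.index.tolist())
print(shuffled_a.equals(shuffled_b))
```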

Next we sort this out-of-order data. If we want the data rows of the same company to be together after sorting, and the data rows of the same company to be arranged in descending order according to the amount of subscribed capital, then we can use the following code.

# Sort by '企业名称' ascending, then by '认缴出资额(万元)' descending,
# and assign the sorted result to the new variable sorted_data
sorted_data = unordered_data.sort_values(by=['企业名称', '认缴出资额(万元)'],
                                         ascending=[True, False])
sorted_data    # View the sorted data

[Figure: the data sorted by 企业名称, then by 认缴出资额(万元) in descending order]

The result shows that the rows are sorted by 企业名称 and, within each company, in descending order of the subscribed amount. Because ignore_index=True was not set, the index in the sorted result is still the original index. If data needs to be selected later, non-consecutive index values get in the way. To make the index of the sorted data consecutive, either set ignore_index=True or call .reset_index(drop=True) on the sorted result to reset the index. The above is the basic method of sorting data in Pandas.
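The two ways of obtaining a consecutive index can be checked on a small made-up table. The sketch also shows why the leftover labels can be confusing: after sorting, label 0 no longer refers to the first row.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'value': [3, 1, 2]})

kept = df.sort_values('value')                      # original labels are kept
fresh = df.sort_values('value', ignore_index=True)  # labels are renumbered from 0
reset = kept.reset_index(drop=True)                 # same effect, applied afterwards

print(kept.index.tolist())
print(fresh.index.tolist())
print(fresh.equals(reset))

# Label-based and position-based selection disagree while the old labels remain.
print(kept.loc[0, 'value'], kept.iloc[0]['value'])
```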

Part5 Summary

Grouping, aggregation and sorting are basic techniques in data processing. Grouping and aggregation make flat data more intuitive and reduce redundancy; sorting helps us see the patterns in the data more clearly. All three are essential skills for data analysis.

Part6 Python Tutorial


Origin blog.csdn.net/weixin_55633225/article/details/132058223