Pandas groupby usage instructions
1. Function description
The official documentation describes the groupby function as follows; the GROUP BY operation in SQL is a helpful analogy.
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
In short, there are three main steps: split the data by the grouping condition, apply a function to each group on its own, and combine the results into one data structure.
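The three steps can be written out by hand to see what groupby does in a single call. This is only a sketch on a small made-up frame; the variable names pieces, means, and manual are illustrative and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer"],
    "score1": [66, 69, 72, 89],
})

# Split: one sub-DataFrame per distinct product.
pieces = {key: df[df["product"] == key] for key in df["product"].unique()}

# Apply: compute the mean of score1 for each piece independently.
means = {key: piece["score1"].mean() for key, piece in pieces.items()}

# Combine: collect the per-group results into one Series, sorted by key.
manual = pd.Series(means, name="score1").sort_index()

# groupby performs the same three steps in one call.
grouped = df.groupby("product")["score1"].mean()

print(manual)
print(grouped)
```

Both Series hold the same values; the only difference is that the groupby result names its index product.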
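In the apply step, one of the following is typically performed:
- Aggregation: compute a summary statistic for each group, such as sum, mean, std, max, min, or var.
- Transformation: perform a group-specific computation that returns a result indexed like the input, such as standardizing data within a group or filling null values.
- Filtration: keep or discard whole groups based on a group-level computation, such as the group sum or mean.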
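One quick way to tell the three kinds of operation apart is to compare the size of what each returns. The sketch below uses a small made-up frame, and the threshold of 75 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer", "printer", "pad"],
    "score1": [66, 69, 72, 89, 91, 69],
})
g = df.groupby("product")["score1"]

# Aggregation: one value per group.
agg_result = g.mean()

# Transformation: one value per original row, aligned with the original index.
transform_result = g.transform("mean")

# Filtration: whole groups are kept or dropped; surviving rows are unchanged.
filter_result = df.groupby("product").filter(lambda x: x["score1"].mean() > 75)

print(len(agg_result), len(transform_result), len(filter_result))
```

With this data there are 3 groups, 6 rows, and 4 rows left after the pad group (mean 70.5) is dropped.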
The sections below cover grouping first, followed by aggregation, transformation, and filtration.
2. Data grouping
(1) Data preparation
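import numpy as np
import pandas as pd

# Sample data: three products with random integer scores from 60 to 99.
# The month labels '1月' to '4月' mean January to April.
df = pd.DataFrame(
    {
        "product": ['computer','printer','pad','computer','printer','pad','computer','printer'],
        "month": ['1月','2月','3月','4月','1月','2月','3月','4月'],
        "score1": np.random.randint(60,100,8),
        "score2": np.random.randint(60,100,8),
        "score3": np.random.randint(60,100,8)
    })
df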
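Because the scores are random, a fresh run will not reproduce the numbers printed in this article. To follow along with identical values, the frame can instead be built from literals. The values below are copied from the grouped output printed in this article and are not a new data set.

```python
import pandas as pd

# Fixed copy of the sample data, taken from the outputs printed in this article.
df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    "month": ["1月", "2月", "3月", "4月", "1月", "2月", "3月", "4月"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
    "score2": [66, 62, 63, 91, 87, 79, 88, 69],
    "score3": [90, 75, 82, 84, 97, 60, 63, 84],
})
print(df)
```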
(2) Grouping
The first step is to specify how the data is grouped. Common parameters of the groupby function include:
- by: the grouping key; it can be a column name, a list of column names, a Series, a dictionary, or a function. A column name is the most common choice.
- axis: the axis to split along; the default is 0, which means splitting by rows. This parameter is deprecated as of pandas 2.1.
- as_index: whether to use the group keys as the index of the output; the default is True. Setting it to False is roughly equivalent to calling reset_index on the result.
- sort: whether to sort the group keys in the output; the default is True. Setting it to False keeps the groups in order of first appearance and can improve performance.
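The effect of as_index and sort is easiest to see side by side. The following is a small sketch on made-up rows, chosen so that the order of first appearance differs from the sorted order.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["printer", "computer", "pad", "computer"],
    "score1": [69, 66, 72, 89],
})

# as_index=True (default): the group keys become the index of the result.
indexed = df.groupby("product")["score1"].sum()

# as_index=False: the group keys stay as a regular column.
flat = df.groupby("product", as_index=False)["score1"].sum()

# sort=False: groups appear in order of first appearance, not sorted key order.
unsorted = df.groupby("product", sort=False)["score1"].sum()

print(indexed)
print(flat)
print(unsorted)
```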
Group by product:
df.groupby('product',as_index=False)
Note that df.groupby('product',as_index=False) returns a grouping object, not a DataFrame, which can be confusing at first. Displaying it shows only the object itself:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f6a16ccdfd0>
Summary:
groupby divides the original DataFrame into several sub-DataFrames according to the grouping fields. There are as many sub-DataFrames as there are groups.
The operations that follow groupby (such as agg and apply) work on these sub-DataFrames.
With this in mind, the main principles of groupby operations in pandas are easy to follow.
Iterate over the grouping object to look at its contents.
The loop variables name and group can be called anything; name holds the group key (the value of the grouping column) and group holds the corresponding sub-DataFrame.
for name, group in df.groupby('product'):
    print(name)
    print(group)
Running the loop makes the grouping object easier to understand:
computer
product month score1 score2 score3
0 computer 1月 66 66 90
3 computer 4月 89 91 84
6 computer 3月 88 88 63
pad
product month score1 score2 score3
2 pad 3月 72 63 82
5 pad 2月 69 79 60
printer
product month score1 score2 score3
1 printer 2月 69 62 75
4 printer 1月 91 87 97
7 printer 4月 73 69 84
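Besides iterating, the grouping object can be inspected directly. The sketch below uses the product and score1 values printed above and relies on the ngroups, groups, get_group, and size members of the pandas grouping object.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
})
gb = df.groupby("product")

# Number of groups, i.e. the number of sub-DataFrames.
print(gb.ngroups)

# Mapping from each group key to the index labels of its rows.
print(gb.groups)

# A single sub-DataFrame, selected by its key.
pad_rows = gb.get_group("pad")
print(pad_rows)

# Number of rows in each group.
sizes = gb.size()
print(sizes)
```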
(3) Combination grouping
Group by the two columns product and month:
for name, group in df.groupby(['product','month']):
    print(name)
    print(group)
Here name is a tuple of the two key values. The result is as follows:
('computer', '1月')
product month score1 score2 score3
0 computer 1月 66 66 90
('computer', '3月')
product month score1 score2 score3
6 computer 3月 88 88 63
('computer', '4月')
product month score1 score2 score3
3 computer 4月 89 91 84
('pad', '2月')
product month score1 score2 score3
5 pad 2月 69 79 60
('pad', '3月')
product month score1 score2 score3
2 pad 3月 72 63 82
('printer', '1月')
product month score1 score2 score3
4 printer 1月 91 87 97
('printer', '2月')
product month score1 score2 score3
1 printer 2月 69 62 75
('printer', '4月')
product month score1 score2 score3
7 printer 4月 73 69 84
Alternatively, use the list function to display the contents of the groupby object:
list(df.groupby(['product','month']))
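The list built this way contains one (key, sub-DataFrame) pair per group, and with two grouping columns each key is a tuple. A short sketch, using the product, month, and score1 values printed above:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    "month": ["1月", "2月", "3月", "4月", "1月", "2月", "3月", "4月"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
})
gb = df.groupby(["product", "month"])

# Each element is a (key, sub-DataFrame) pair; the key is a tuple of two values.
pairs = list(gb)
first_key, first_frame = pairs[0]
print(first_key)
print(first_frame)

# The same tuple form selects a single group.
pad_feb = gb.get_group(("pad", "2月"))
print(pad_feb)
```

Every product and month combination occurs once in this data, so there are eight groups of one row each.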
(4) first, tail, last, nth
first returns the first entry of each group, that is, the first non-null value in each column for computer, printer, and pad:
df.groupby('product').first()
tail and last both return the last entry of each group.
tail returns the original rows with all columns, including product, and keeps the original index:
df.groupby('product').tail(1)
last returns only the data columns, with product as the index:
df.groupby('product').last()
The nth function returns the nth row of each group, counting from 0, which is convenient for querying rows from the middle of a group.
nth(2) returns the third row of each group. In this example that is the last row of each group, and pad does not appear because it has only two rows:
df.groupby('product').nth(2)
nth(1) returns the second row, which here is the middle row of the computer and printer groups and the last row of pad:
df.groupby('product').nth(1)
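The difference between these selectors shows up most clearly when a group starts with a missing value. In the sketch below the first computer score is replaced by NaN purely for illustration; the other values are the ones printed earlier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    # Assumption for illustration: the first computer score is missing.
    "score1": [np.nan, 69, 72, 89, 91, 69, 88, 73],
})
gb = df.groupby("product")

first_values = gb.first()  # first non-null value per column, indexed by product
head_rows = gb.head(1)     # first row of each group as-is, original index kept
tail_rows = gb.tail(1)     # last row of each group as-is, original index kept
third_rows = gb.nth(2)     # third row of each group; pad has only two rows

print(first_values)
print(head_rows)
print(tail_rows)
print(third_rows)
```

Here first skips the missing value and reports 89 for computer, while head(1) returns the original first row with its NaN. The index of the nth result differs between pandas versions, so it is safer not to rely on it.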
3. Aggregation
(1) Single-column agg aggregation
df.groupby('product')['score1'].agg(['sum', 'mean', 'std'])
This returns the sum, mean, and standard deviation of score1 for each product, one column per statistic.
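A short sketch of what the list form of agg returns, using the score1 values printed earlier. The helper score_range is a hypothetical user-defined function, added only to show that custom functions can sit in the list alongside the built-in names.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
})

def score_range(s):
    """Hypothetical helper: spread between the highest and lowest score."""
    return s.max() - s.min()

# One output column per aggregation; a function's name becomes its column label.
stats = df.groupby("product")["score1"].agg(["sum", "mean", "std", score_range])
print(stats)
```

Note that 'std' is the sample standard deviation (ddof=1), so the pad group with scores 72 and 69 gives about 2.12 rather than 1.5.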
(2) Multi-column aggregation
df.groupby('product').agg({'score1': 'sum', 'score2': 'mean', 'score3': 'std'})
Each column gets its own aggregation. Comparing with the equivalent SQL makes this easier to understand (the name of the standard deviation function varies by database):
select product, sum(score1), avg(score2), stddev(score3) from df group by product;
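To get closer to the shape of the SQL result, with product as an ordinary column and an explicit name for each output column, pandas named aggregation can be used. This is a sketch on the values printed earlier; the output names sum_score1, avg_score2, and std_score3 are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
    "score2": [66, 62, 63, 91, 87, 79, 88, 69],
    "score3": [90, 75, 82, 84, 97, 60, 63, 84],
})

# Each keyword is an output column name mapped to (input column, aggregation).
result = df.groupby("product", as_index=False).agg(
    sum_score1=("score1", "sum"),
    avg_score2=("score2", "mean"),
    std_score3=("score3", "std"),
)
print(result)
```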
(3) Multi-column multi-aggregation calculation
df.groupby('product').agg({'score1': ['sum', 'max'], 'score2': ['mean', 'min'], 'score3': ['std', 'var']})
The result has two levels of column labels: the original column name and the name of the aggregation.
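Two-level column labels can be awkward to work with afterwards. One common option, sketched here on the values printed earlier, is to join the two levels into a single string; the underscore separator is just a convention.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
    "score2": [66, 62, 63, 91, 87, 79, 88, 69],
})

result = df.groupby("product").agg({"score1": ["sum", "max"],
                                    "score2": ["mean", "min"]})

# Columns are tuples such as ('score1', 'sum'); select one with a tuple key.
print(result[("score1", "sum")])

# Flatten the two levels into single names such as 'score1_sum'.
flat = result.copy()
flat.columns = ["_".join(col) for col in result.columns]
print(flat)
```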
(4) apply
The apply function is very widely used, and its processing granularity depends on the object it is called on. For a Series, apply processes each element (a scalar); for a DataFrame, it processes each row or column (a Series); for the grouping object returned by groupby, it processes each group (a sub-DataFrame).
For example, calculate the difference between the means of two columns:
df.groupby('product').apply(lambda x: x['score3'].mean()-x['score1'].mean())
The result is as follows:
product
computer -2.000000
pad 0.500000
printer 7.666667
dtype: float64
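The same numbers can be reproduced from the values printed earlier. The sketch below selects the two score columns before calling apply, which keeps the grouping column out of each sub-DataFrame, and also shows an equivalent calculation built from plain aggregations.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
    "score3": [90, 75, 82, 84, 97, 60, 63, 84],
})

# apply: the lambda receives one sub-DataFrame per product.
diff_apply = df.groupby("product")[["score1", "score3"]].apply(
    lambda x: x["score3"].mean() - x["score1"].mean()
)

# Equivalent without apply: aggregate first, then subtract the columns.
means = df.groupby("product")[["score1", "score3"]].mean()
diff_vector = means["score3"] - means["score1"]

print(diff_apply)
print(diff_vector)
```

When a calculation can be expressed with built-in aggregations, as here, that form is usually faster than apply because it avoids calling a Python function once per group.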
4. Transformation
transform computes a result for every row of the original data. Rows in the same group receive the same value, and the result is aligned one-to-one with the original data through the index.
df.groupby('product')['score1'].transform('mean')
The result is as follows:
0 81.000000
1 77.666667
2 70.500000
3 81.000000
4 77.666667
5 70.500000
6 81.000000
7 77.666667
Name: score1, dtype: float64
Because the result is aligned with the original index, transform can add a group-level statistic directly as a new column. For example, add the mean of score1 for each product:
df['avg_score1'] = df.groupby('product')['score1'].transform('mean')
All rows for computer get the same value, and likewise for the other products.
The difference between transform and agg: agg calculates the mean for each product and returns one value per group, while transform calculates the mean for each product and then broadcasts it back to every row of that group, returning the results in the order of the original index.
The index in this example is 0 through 7.
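Two typical uses of transform mentioned at the start of this article are standardizing data within a group and filling null values. The following is a sketch on the values printed earlier, with one score2 value replaced by NaN purely for illustration; the column names score1_z and score2_filled are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
    # Assumption for illustration: the second pad score2 is missing.
    "score2": [66, 62, 63, 91, 87, np.nan, 88, 69],
})

# Standardize score1 within each product (z-score using the sample std).
g1 = df.groupby("product")["score1"]
df["score1_z"] = (df["score1"] - g1.transform("mean")) / g1.transform("std")

# Fill the missing score2 with the mean of its own product.
group_mean2 = df.groupby("product")["score2"].transform("mean")
df["score2_filled"] = df["score2"].fillna(group_mean2)

print(df)
```

Both steps work because the transform output has the same index as df, so ordinary column arithmetic and fillna line up row by row.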
5. Filtration
Filtration keeps or drops whole groups according to a condition on an aggregate, similar to the HAVING clause in SQL.
df.groupby('product').filter(lambda x: x['score1'].mean()>80)
With the sample values shown earlier, only computer has a mean score1 above 80 (81.0), so its three rows are returned.
df.groupby('product').filter(lambda x: x['score3'].mean()<75)
With the same values, only pad has a mean score3 below 75 (71.0), so its two rows are returned.
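Both filters can be checked against the values printed earlier. The sketch below also shows an alternative that builds a boolean mask with transform; on this data it selects the same rows and avoids calling a Python function per group.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer",
                "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
    "score3": [90, 75, 82, 84, 97, 60, 63, 84],
})

# Keep the products whose mean score1 is above 80.
high1 = df.groupby("product").filter(lambda x: x["score1"].mean() > 80)

# Keep the products whose mean score3 is below 75.
low3 = df.groupby("product").filter(lambda x: x["score3"].mean() < 75)

# Alternative: a row-aligned boolean mask built with transform.
mask = df.groupby("product")["score1"].transform("mean") > 80
high1_mask = df[mask]

print(high1)
print(low3)
```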