Pandas groupby usage instructions

1. Function description

The official pandas documentation explains groupby as follows; it can be understood by analogy with the GROUP BY operation in SQL.
By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure.
    The apply step usually takes one of three forms:
  • Aggregation: compute a summary statistic for each group, such as sum, mean, std, max, min or var.
  • Transformation: compute a result with the same shape as the input, for example to standardize data or fill null values within each group.
  • Filtration: keep or discard whole groups based on a group-level computation, such as sum or mean.
    The following sections introduce grouping first, followed by aggregation, transformation and filtering.
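
The three steps can also be written out by hand, which may help before moving on to the groupby API itself. This is only an illustrative sketch with made-up sample values; the names pieces, means, manual and built_in are hypothetical and do not come from the original article.

```python
import pandas as pd

# Small made-up sample: two rows per product.
df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer", "printer", "pad"],
    "score1": [66, 69, 72, 89, 91, 69],
})

# Split: one sub-DataFrame per distinct product.
pieces = {key: df[df["product"] == key] for key in sorted(df["product"].unique())}

# Apply: compute a value for each piece independently.
means = {key: piece["score1"].mean() for key, piece in pieces.items()}

# Combine: assemble the per-group values into one Series.
manual = pd.Series(means, name="score1").rename_axis("product")

# groupby performs the same three steps in a single call.
built_in = df.groupby("product")["score1"].mean()

print(manual)
print(built_in)
```

Both Series hold the same per-product means; groupby simply does the splitting and recombining for you.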

2. Data grouping

(1) Data preparation

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "product": ['computer','printer','pad','computer','printer','pad','computer','printer'],
        "month": ['1月','2月','3月','4月','1月','2月','3月','4月'],
        "score1": np.random.randint(60,100,8),
        "score2": np.random.randint(60,100,8),
        "score3": np.random.randint(60,100,8)
    })
df
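
Because np.random.randint produces different numbers on every run, the printed results later in this article cannot be reproduced exactly from the code above. A minimal sketch that rebuilds the same table with fixed values, copied from the printed groups in the grouping section below, might look like this. Hard-coding the values is an assumption made here for reproducibility and is not part of the original code.

```python
import pandas as pd

# Fixed values taken from the printed groups shown later in the article.
df = pd.DataFrame(
    {
        "product": ["computer", "printer", "pad", "computer", "printer", "pad", "computer", "printer"],
        "month": ["1月", "2月", "3月", "4月", "1月", "2月", "3月", "4月"],
        "score1": [66, 69, 72, 89, 91, 69, 88, 73],
        "score2": [66, 62, 63, 91, 87, 79, 88, 69],
        "score3": [90, 75, 82, 84, 97, 60, 63, 84],
    }
)
print(df)
```

An alternative that keeps the random draw but makes it repeatable is to seed the generator first, although the numbers would then differ from the ones printed in this article.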


(2) Grouping

The first step is to tell groupby how to split the data. Its common parameters include:

  • by, the grouping key; it can be a column name, a list of column names, a Series, a dictionary or a function. A column name is the most common choice.
  • axis, the direction to split along; the default is 0, which means splitting by rows. This parameter is deprecated in recent pandas releases, so it is best left at its default.
  • as_index, whether to use the group keys as the index of the output; the default is True. Setting it to False keeps the keys as ordinary columns, which is similar to calling reset_index on the result.
  • sort, whether to sort the output by group key; the default is True, much like the ordered output that some SQL engines produce for GROUP BY. Pass sort=False to keep the groups in order of first appearance, which can also be faster.
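
The effect of as_index and sort is easiest to see side by side. The sketch below uses a small made-up table (not the article's data), and the variable names by_index, as_column and unsorted are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["printer", "computer", "pad", "computer"],
    "score1": [69, 66, 72, 89],
})

# Default: the group keys become the index and are sorted.
by_index = df.groupby("product")["score1"].sum()

# as_index=False: the key stays an ordinary column with a fresh integer index.
as_column = df.groupby("product", as_index=False)["score1"].sum()

# sort=False: groups appear in order of first occurrence.
unsorted = df.groupby("product", sort=False)["score1"].sum()

print(by_index)
print(as_column)
print(unsorted)
```

With as_index=False the result is already a flat table, which is often convenient when the output is going to be merged or exported.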

Group by product:

df.groupby('product',as_index=False)   

Note that df.groupby('product', as_index=False) returns a grouping object, not a DataFrame, which can be confusing at first. Displaying it shows only the object type:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f6a16ccdfd0>

Summary:
groupby divides the original DataFrame into several sub-DataFrames according to the grouping fields. There are as many sub-DataFrames as there are groups.
The operations that follow groupby (such as agg, apply, etc.) work on these sub-DataFrames.
Once this is clear, the main principles of working with groupby objects in pandas follow naturally.
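
The grouping object can also be inspected without looping over it. The following sketch uses the score1 values from this article's sample output; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer", "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
})

grouped = df.groupby("product")

n_groups = grouped.ngroups            # number of groups
labels = grouped.groups               # mapping: group key -> row labels
sizes = grouped.size()                # number of rows in each group
pad_rows = grouped.get_group("pad")   # one sub-DataFrame, selected by key

print(n_groups)
print(dict(labels))
print(sizes)
print(pad_rows)
```

get_group is handy for checking a single group while debugging, since it returns exactly the sub-DataFrame that later operations would receive for that key.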

To look at the content of the grouping object, iterate over it.
The loop variables name and group can be called anything; name holds the group key (a value from the grouping column) and group holds the corresponding sub-DataFrame.

for name, group in df.groupby('product'):
    print(name)
    print(group)
  

The printed result makes the grouping object easier to understand:

computer
    product month  score1  score2  score3
0  computer    1月      66      66      90
3  computer    4月      89      91      84
6  computer    3月      88      88      63
pad
  product month  score1  score2  score3
2     pad    3月      72      63      82
5     pad    2月      69      79      60
printer
   product month  score1  score2  score3
1  printer    2月      69      62      75
4  printer    1月      91      87      97
7  printer    4月      73      69      84

(3) Combination grouping

Group by the two columns product and month:

for name, group in df.groupby(['product','month']):
    print(name)
    print(group)

Here name is a tuple holding the two key values. The result is as follows:

('computer', '1月')
    product month  score1  score2  score3
0  computer    1月      66      66      90
('computer', '3月')
    product month  score1  score2  score3
6  computer    3月      88      88      63
('computer', '4月')
    product month  score1  score2  score3
3  computer    4月      89      91      84
('pad', '2月')
  product month  score1  score2  score3
5     pad    2月      69      79      60
('pad', '3月')
  product month  score1  score2  score3
2     pad    3月      72      63      82
('printer', '1月')
   product month  score1  score2  score3
4  printer    1月      91      87      97
('printer', '2月')
   product month  score1  score2  score3
1  printer    2月      69      62      75
('printer', '4月')
   product month  score1  score2  score3
7  printer    4月      73      69      84

Alternatively, use the list function to display the content of the groupby object as a list of (name, group) pairs.

list(df.groupby(['product','month']))   
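
When grouping by two columns, the combined result carries both keys as a two-level index. The sketch below, using values from this article's sample output, shows one way to count rows per key pair and to pivot the second key into columns; counts and wide are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer", "printer", "pad", "computer", "printer"],
    "month": ["1月", "2月", "3月", "4月", "1月", "2月", "3月", "4月"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
})

# Rows per (product, month) pair; the index has two levels.
counts = df.groupby(["product", "month"]).size()

# Mean score1 per pair, with month moved from the index into columns.
# Combinations that never occur become NaN.
wide = df.groupby(["product", "month"])["score1"].mean().unstack("month")

print(counts)
print(wide)
```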

(4) first, last, tail and nth

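first returns the first non-null value of each column within each group, with the group keys as the index.
For computer, printer and pad, this is the data that appears first. Note that first does not take a row count: its first positional argument is numeric_only, so it is called without arguments here.

df.groupby('product').first()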

tail and last both return the last data of each group.
tail(1) returns the last row of each group with all columns, including the grouping column, and keeps the original row index.

df.groupby('product').tail(1)

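last returns the last non-null value of each column per group, indexed by the group key, so product becomes the index rather than a column. Passing numeric_only=True keeps only the numeric score columns.

df.groupby('product').last(numeric_only=True)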

The nth function returns the row at position n within each group, which is convenient for querying data from the middle of a group.
Positions start from 0, so nth(2) returns the third row of each group. In this example that is the last row of computer and printer; pad has only two rows, so it does not appear in the result.

df.groupby('product').nth(2)


df.groupby('product').nth(1)

This returns the second row of each group, which is the middle row for computer and printer and the last row for pad.
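
The differences between these selectors show up most clearly when a group starts with a missing value. The sketch below uses made-up data with one NaN; the variable names are hypothetical. The exact index of the nth result differs between pandas versions, so the example only relies on its values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "computer", "printer", "computer"],
    "score1": [np.nan, 69.0, 89.0, 91.0, 88.0],
})
g = df.groupby("product")

first_value = g.first()   # first non-null value per column, indexed by product
first_row = g.head(1)     # actual first row of each group, original index kept
last_row = g.tail(1)      # actual last row of each group, original index kept
third_row = g.nth(2)      # row at position 2; only computer has three rows

print(first_value)
print(first_row)
print(last_row)
print(third_row)
```

Here first skips the NaN in the computer group and reports 89.0, while head(1) returns the row as it is, NaN included.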

3. Aggregation

(1) Single-column agg aggregation

df.groupby('product')['score1'].agg(['sum', 'mean', 'std'])

This returns the sum, mean and standard deviation of score1 for each product.

(2) Multi-column aggregation

df.groupby('product').agg({'score1': 'sum', 'score2': 'mean', 'score3': 'std'})

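Each column gets its own aggregation: the sum of score1, the mean of score2 and the standard deviation of score3, with one row per product.
Comparing with the corresponding SQL makes this easier to understand:

select product, sum(score1), avg(score2), stddev_samp(score3) from df group by product;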

(3) Multi-column multi-aggregation calculation

df.groupby('product').agg({'score1': ['sum', 'max'], 'score2': ['mean', 'min'], 'score3': ['std', 'var']})

The result has two-level column labels: the top level is the source column and the second level is the name of the aggregation.
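
If flat, self-describing column names are preferred over two-level labels, pandas also supports named aggregation, where each output column is given as name=(source column, function). The sketch below applies it to the values from this article's sample output; the output column names are hypothetical choices.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer", "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
    "score2": [66, 62, 63, 91, 87, 79, 88, 69],
    "score3": [90, 75, 82, 84, 97, 60, 63, 84],
})

# Each keyword is an output column: (source column, aggregation name).
summary = df.groupby("product").agg(
    score1_sum=("score1", "sum"),
    score2_mean=("score2", "mean"),
    score3_std=("score3", "std"),
)
print(summary)
```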

(4) apply

The apply function is very widely used. Its processing granularity depends on the object it is called on: for a Series, apply processes each element (a scalar); for a DataFrame, it processes each row or column (a Series); and for the grouping object returned by groupby, it processes each group (a sub-DataFrame).
For example, calculate the difference between the means of two columns:

df.groupby('product').apply(lambda x: x['score3'].mean()-x['score1'].mean())

The result is as follows:

product
computer   -2.000000
pad         0.500000
printer     7.666667
dtype: float64
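
The same calculation can be written so that apply only sees the columns it needs, and it can be checked against a version built from ordinary aggregations. This is a sketch using the values from this article's sample output; selecting the columns before apply is a choice made here, not the original author's code.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer", "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
    "score3": [90, 75, 82, 84, 97, 60, 63, 84],
})

# Select the needed columns first, so each group passed to the
# function is a sub-DataFrame containing only score1 and score3.
cols = df.groupby("product")[["score1", "score3"]]
diff_apply = cols.apply(lambda g: g["score3"].mean() - g["score1"].mean())

# Equivalent result from built-in aggregations, without a Python
# function call per group.
means = cols.mean()
diff_vectorized = means["score3"] - means["score1"]

print(diff_apply)
print(diff_vectorized)
```

When a built-in aggregation can express the computation, it is usually faster than apply, which calls the Python function once per group.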

4. Transformation

transform returns one result for every row of the original data. Rows in the same group receive the same value, and the result is aligned one-to-one with the original index.

df.groupby('product')['score1'].transform('mean')

The result is as follows:

0    81.000000
1    77.666667
2    70.500000
3    81.000000
4    77.666667
5    70.500000
6    81.000000
7    77.666667
Name: score1, dtype: float64

Because the result is aligned with the original index, transform can add a group-level statistic directly as a new column. Here the group mean of score1 is added as avg_score1:

df['avg_score1'] = df.groupby('product')['score1'].transform('mean')

All rows for the computer product share the same value, and likewise for the other products.

The difference between transform and agg: agg calculates the mean for each product and returns one value per group, whereas transform calculates the mean of each group and then broadcasts it back to every row of that group. Rows in the same group have the same value, and the results are returned in the order of the original index.

The index in this example runs from 0 to 7.
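
The two uses of transformation mentioned at the start of this article, handling null values and standardizing data, can be sketched as follows. The data is made up and the column names score1_filled and score1_z are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "computer", "printer", "computer"],
    "score1": [66.0, 69.0, np.nan, 91.0, 88.0],
})

# Fill each missing value with the mean of its own group.
group_mean = df.groupby("product")["score1"].transform("mean")
df["score1_filled"] = df["score1"].fillna(group_mean)

# Standardize within each group: (value - group mean) / group std.
filled = df.groupby("product")["score1_filled"]
df["score1_z"] = (df["score1_filled"] - filled.transform("mean")) / filled.transform("std")

print(df)
```

Both steps rely on transform returning a Series aligned with the original rows, so the results can be combined with the original columns directly.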

5. Filtration

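filter keeps or drops whole groups according to a condition computed on each group, similar to the HAVING clause in SQL. Unlike HAVING, it returns the original rows of the groups that pass rather than one aggregated row per group.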

df.groupby('product').filter(lambda x: x['score1'].mean()>80)

With the sample values printed earlier, only the computer group has a mean score1 above 80 (it is 81), so the result contains the three computer rows.

df.groupby('product').filter(lambda x: x['score3'].mean()<75)

With the same sample values, only the pad group has a mean score3 below 75 (it is 71), so the result contains the two pad rows.
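
The same row selection can also be expressed with transform and a boolean mask, and the SQL-style result with one row per group can be obtained by filtering the aggregated Series. The sketch below uses the score1 values from this article's sample output; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["computer", "printer", "pad", "computer", "printer", "pad", "computer", "printer"],
    "score1": [66, 69, 72, 89, 91, 69, 88, 73],
})

# filter: keep every row of the groups whose mean score1 exceeds 80.
kept = df.groupby("product").filter(lambda g: g["score1"].mean() > 80)

# Same rows via transform: broadcast the group mean, then mask.
mask = df.groupby("product")["score1"].transform("mean") > 80
kept_via_mask = df[mask]

# HAVING-style result: one aggregated value per qualifying group.
group_means = df.groupby("product")["score1"].mean()
having_like = group_means[group_means > 80]

print(kept)
print(kept_via_mask)
print(having_like)
```

The mask version avoids calling a Python function for every group, which can matter when there are many groups.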

Origin blog.csdn.net/qq_39065491/article/details/131104146