Detailed explanation of pandas groupby usage

Project GitHub address: bitcarmanlee easy-algorithm-interview-and-practice
Stars and comments are welcome; let's learn and improve together.

1. Grouping with groupby

Day-to-day data analysis often calls for grouping: the data is split into groups according to one or more fields, and each group is then analyzed further, for example by finding the number of groups, or the maximum, minimum, and average values within each group. In SQL, this is the well-known GROUP BY operation.
pandas provides a corresponding groupby operation. Let's take a look at how to use it.
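
As a rough sketch of the kinds of analysis mentioned above, the following computes the number of groups and the per-group count, maximum, minimum, and mean. The sample data and the helper name summarize_by_level are made up for illustration only.

```python
import pandas as pd


def summarize_by_level(df):
    # Roughly the pandas counterpart of:
    # SELECT level, COUNT(num), MAX(num), MIN(num), AVG(num) FROM t GROUP BY level
    return df.groupby("level")["num"].agg(["count", "max", "min", "mean"])


# Made-up sample data, for illustration only
sample = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
})

n_groups = sample.groupby("level").ngroups  # number of distinct groups
summary = summarize_by_level(sample)
print(n_groups)
print(summary)
```

The result is indexed by level, with one column per aggregation.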

2. The data structure of groupby

First, let's look at the following code:

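import pandas as pd


def ddd():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    df = pd.DataFrame({"level": levels, "num": nums})
    g = df.groupby('level')
    print(g)        # prints the object's repr, not the grouped data
    print()
    print(list(g))  # materialize the groups so they can be inspected


ddd()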

The output is as follows:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10f6f96d0>

[('L1',   level  num
0    L1   10
1    L1   20
2    L1   30), ('L2',   level  num
3    L2   20
4    L2   15), ('L3',   level  num
5    L3   10
6    L3   12)]

A groupby operation returns a DataFrameGroupBy object. Printing the object directly only shows its type and memory address.
To inspect the data more easily, we convert it with list and find that it is a list of tuples, one per group. The first element of each tuple is the value of level, and the second element is the dataframe containing all the rows of that group.
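
Since each group is a (key, dataframe) pair, the groupby object can also be iterated directly instead of being converted to a list. The following is a minimal sketch using the same data; the variable names group_sizes and l2 are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
})
g = df.groupby("level")

# Iterating yields (group key, sub-dataframe) pairs
group_sizes = {}
for key, sub_df in g:
    group_sizes[key] = len(sub_df)
print(group_sizes)

# get_group returns the sub-dataframe for a single key
l2 = g.get_group("L2")
print(l2)
```

Note that each sub-dataframe keeps the row index of the original dataframe.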

3. Basic usage of groupby

import pandas as pd


def group1():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    scores = [100, 200, 300, 200, 150, 100, 120]
    df = pd.DataFrame({"level": levels, "num": nums, "score": scores})
    # sum num and average score within each level
    result = df.groupby('level').agg({'num': 'sum', 'score': 'mean'})
    # share of each group's num in the overall total
    allnum = result['num'].sum()
    result['rate'] = result['num'].map(lambda x: x / allnum)
    print(result)


group1()

The final output:

       num  score      rate
level                      
L1      60    200  0.512821
L2      35    175  0.299145
L3      22    110  0.188034

The above example shows the basic usage of groupby.
The dataframe is grouped by level, then the num column is summed and the score column is averaged within each group to get the result.
We also want each group's share of the overall num total. So we first compute the total of num, and then use the map method to add a rate column to the result holding that share.
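
The proportion can also be computed with plain Series division instead of map. This is a minimal sketch on the same data, not a replacement for the approach above; the helper name add_rate is hypothetical.

```python
import pandas as pd


def add_rate(result):
    # Dividing a Series by a scalar is applied element-wise,
    # so this yields the same values as map(lambda x: x / allnum)
    out = result.copy()
    out["rate"] = out["num"] / out["num"].sum()
    return out


df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
    "score": [100, 200, 300, 200, 150, 100, 120],
})
result = df.groupby("level").agg({"num": "sum", "score": "mean"})
with_rate = add_rate(result)
print(with_rate)
```

The rate column sums to 1 across the groups, which is a quick sanity check for this kind of calculation.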

4. Using transform

Let's look at a more complex example below.

import pandas as pd


def t10():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    df = pd.DataFrame({"level": levels, "num": nums})
    ret = df.groupby('level')['num'].mean().to_dict()
    df['avg_num'] = df['level'].map(ret)
    print(ret)
    print(df)


t10()

The output is as follows:

{'L1': 20.0, 'L2': 17.5, 'L3': 11.0}
  level  num  avg_num
0    L1   10     20.0
1    L1   20     20.0
2    L1   30     20.0
3    L2   20     17.5
4    L2   15     17.5
5    L3   10     11.0
6    L3   12     11.0

In the example above, after grouping by level we want to add a column to the dataset so that each row carries the average value of its level.
The solution is to first compute the average of each group, convert the result into a dict, and then use the map method to look up the group average for each row.

import pandas as pd


def trans():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    df = pd.DataFrame({"level": levels, "num": nums})
    df['avg_num'] = df.groupby('level')['num'].transform('mean')
    print(df)


trans()

If you use the transform method, the code can be simpler and more intuitive, as shown above.

What transform does: it applies the function to each group and returns an object with the same index as the original df, filled with the transformed values. Because the result is aligned with the original rows, assigning it back is equivalent to adding a column to the original dataframe.
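
To make the index alignment concrete, here is a small sketch on the same data. It checks that the transform result shares the original index, then uses a second transform to compute each row's share of its group total. The column name share_in_level is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
})

avg = df.groupby("level")["num"].transform("mean")

# One value per original row, aligned on the original index
same_index = avg.index.equals(df.index)
print(same_index)

# Because of that alignment, the result combines directly with existing
# columns, e.g. each row's share of its group's total
df["share_in_level"] = df["num"] / df.groupby("level")["num"].transform("sum")
print(df)
```

Within each level the shares add up to 1, since every row is divided by the total of its own group.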

Origin blog.csdn.net/bitcarmanlee/article/details/111501223