The 8 most common techniques for grouping data in Python

pandas' groupby is a very powerful function for data processing. Many readers are already quite familiar with it, but there are still a few tips worth sharing.

For demonstration, we will use a public dataset: the iris dataset.

import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

Randomly sampling 5 rows, the data looks like this.

>>> iris.sample(5)
     sepal_length  sepal_width  petal_length  petal_width     species
95            5.7          3.0           4.2          1.2  versicolor
71            6.1          2.8           4.0          1.3  versicolor
133           6.3          2.8           5.1          1.5   virginica
4             5.0          3.6           1.4          0.2      setosa
33            5.5          4.2           1.4          0.2      setosa

Because groupby is a grouping function, the column used for grouping should contain discrete, category-like values. In this dataset, we take grouping by the species column as the example.
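
Before grouping, it can help to confirm that the grouping column really contains a small set of discrete values; a quick check (for the standard iris dataset, the three class names below are what you should see):

>>> iris['species'].unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)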

First, create a groupby object on species. The groupby object is created separately here because it will be reused repeatedly below; once you are proficient, you can of course just chain the calls directly.

iris_gb = iris.groupby('species')

1. Create a frequency table

If we want to know the number of rows in each species class, the function that can be called directly on the groupby object is size(), as follows.

>>> iris_gb.size()
species
setosa        50
versicolor    50
virginica     50
dtype: int64
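
As a side note, if you only need this frequency table, the same counts can also be obtained without building a groupby object, for example with value_counts(); a minimal sketch (output omitted, but each class has 50 rows, as above):

>>> # equivalent frequency count directly on the column
>>> iris['species'].value_counts()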

2. Calculate commonly used descriptive statistics

For example, to calculate the mean by group, use the mean() function.

>>> # compute the mean by group
>>> iris_gb.mean()
            sepal_length  sepal_width  petal_length  petal_width
species                                                         
setosa             5.006        3.428         1.462        0.246
versicolor         5.936        2.770         4.260        1.326
virginica          6.588        2.974         5.552        2.026

By default, mean() computes the mean of every numeric column. To calculate the mean of only one or two columns, select them first, as shown below.

>>> # single column
>>> iris_gb['sepal_length'].mean()
species
setosa        5.006
versicolor    5.936
virginica     6.588
Name: sepal_length, dtype: float64
>>> # two columns
>>> iris_gb[['sepal_length', 'petal_length']].mean()
            sepal_length  petal_length
species                               
setosa             5.006         1.462
versicolor         5.936         4.260
virginica          6.588         5.552

Similarly, other descriptive statistics such as min(), max(), median(), and std() are used in the same way.
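
For example, the following calls work in exactly the same way (outputs omitted for brevity):

>>> # per-group median of every numeric column
>>> iris_gb.median()
>>> # per-group standard deviation of a single column
>>> iris_gb['sepal_width'].std()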

3. Find the index of the maximum (or minimum) value

If we want to find the index of the maximum or minimum value within each group, there are convenient functions, idxmax() and idxmin(), that can be called directly.

>>> iris_gb.idxmax()
            sepal_length  sepal_width  petal_length  petal_width
species                                                         
setosa                14           15            24           43
versicolor            50           85            83           70
virginica            131          117           118          100
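
The minimum-value counterpart, idxmin(), is used in exactly the same way; a quick sketch (output omitted):

>>> # index of the minimum value of each column within each group
>>> iris_gb.idxmin()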

How can this be applied?

For example, when we want to find the entire record corresponding to the maximum sepal_length in each group, we can use it like this. Note that we retrieve the whole row, which is equivalent to filtering on the condition that sepal_length equals the group maximum.

>>> sepal_largest = iris.loc[iris_gb['sepal_length'].idxmax()]
>>> sepal_largest
     sepal_length  sepal_width  petal_length  petal_width     species
14            5.8          4.0           1.2          0.2      setosa
50            7.0          3.2           4.7          1.4  versicolor
131           7.9          3.8           6.4          2.0   virginica

4. Reset the index after groupby

Very often we need to perform further operations after groupby, so we want to turn the grouping index back into ordinary rows and columns.

The first method is probably the one most people use: call reset_index() to reset the index.

>>> iris_gb.max().reset_index()
      species  sepal_length  sepal_width  petal_length  petal_width
0      setosa           5.8          4.4           1.9          0.6
1  versicolor           7.0          3.4           5.1          1.8
2   virginica           7.9          3.8           6.9          2.5

But there is actually another usage that looks friendlier: set the as_index parameter directly in groupby, which achieves the same effect.

>>> iris.groupby('species', as_index=False).max()
      species  sepal_length  sepal_width  petal_length  petal_width
0      setosa           5.8          4.4           1.9          0.6
1  versicolor           7.0          3.4           5.1          1.8
2   virginica           7.9          3.8           6.9          2.5

5. Summarize multiple statistics at once

The operations above each compute a single statistic. What if we want to compute several at the same time?

Another great use of groupby is combining it with the agg aggregation function.

>>> iris_gb[['sepal_length', 'petal_length']].agg(["min", "mean"])

           sepal_length        petal_length       
                    min   mean          min   mean
species                                           
setosa              4.3  5.006          1.0  1.462
versicolor          4.9  5.936          3.0  4.260
virginica           4.9  6.588          4.5  5.552

Inside agg, we only need to list the names of the statistics to compute several of them for every column at the same time.
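
The same list of statistics can, of course, also be applied to every column at once; a minimal sketch (output omitted):

>>> # min and mean of all numeric columns, per group
>>> iris_gb.agg(["min", "mean"])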

6. Aggregation of specific columns

As we saw, the multiple statistics above are the same for every column. In practice, however, we may have different requirements for different columns.

In that case, we can specify different statistics for different columns individually.

>>> iris_gb.agg({
...     "sepal_length": ["min", "max"],
...     "petal_length": ["mean", "std"]
... })
           sepal_length      petal_length          
                    min  max         mean       std
species                                            
setosa              4.3  5.8        1.462  0.173664
versicolor          4.9  7.0        4.260  0.469911
virginica           4.9  7.9        5.552  0.551895

7. Named statistics with NamedAgg

Now suppose we have a new requirement: the multi-level column index above looks a bit unfriendly, and we would like to merge the statistic and column name under each column into a single name. This can be done with pd.NamedAgg.

>>> iris_gb.agg(
...     sepal_min=pd.NamedAgg(column="sepal_length", aggfunc="min"),
...     sepal_max=pd.NamedAgg(column="sepal_length", aggfunc="max"),
...     petal_mean=pd.NamedAgg(column="petal_length", aggfunc="mean"),
...     petal_std=pd.NamedAgg(column="petal_length", aggfunc="std")
... )
            sepal_min  sepal_max  petal_mean  petal_std
species                                                
setosa            4.3        5.8       1.462   0.173664
versicolor        4.9        7.0       4.260   0.469911
virginica         4.9        7.9       5.552   0.551895

Because NamedAgg is just a namedtuple, we can also assign a plain tuple to each new name directly; the effect is the same, but it looks more concise.

iris_gb.agg(
    sepal_min=("sepal_length", "min"),
    sepal_max=("sepal_length", "max"),
    petal_mean=("petal_length", "mean"),
    petal_std=("petal_length", "std")
)

8. Use custom functions

In the agg examples above, we always passed the name of a statistic. We can also pass a function object directly.

>>> iris_gb.agg(pd.Series.mean)
            sepal_length  sepal_width  petal_length  petal_width
species                                                         
setosa             5.006        3.428         1.462        0.246
versicolor         5.936        2.770         4.260        1.326
virginica          6.588        2.974         5.552        2.026

Not only that, names and function objects can also be mixed together.

iris_gb.agg(["min", pd.Series.mean])

What's more, we can also define custom functions; all of this works.

>>> def double_length(x):
...     return 2*x.mean()
... 
>>> iris_gb.agg(double_length)
            sepal_length  sepal_width  petal_length  petal_width
species                                                         
setosa            10.012        6.856         2.924        0.492
versicolor        11.872        5.540         8.520        2.652
virginica         13.176        5.948        11.104        4.052

Of course, if you want to be even more concise, you can also use a lambda function. In short, the usage is very flexible and can be combined freely.

iris_gb.agg(lambda x: x.mean())

These are the 8 groupby operations you are most likely to use. Once you use it skillfully, you will find that this function is really powerful.

Origin: blog.csdn.net/qq_34160248/article/details/131349594