groupby is a very powerful pandas function for data processing. Even if you are already quite familiar with it, there are still a few tips worth sharing.
To demonstrate, we will use a public dataset.
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
Randomly sample 5 rows to see what the data looks like.
>>> iris.sample(5)
sepal_length sepal_width petal_length petal_width species
95 5.7 3.0 4.2 1.2 versicolor
71 6.1 2.8 4.0 1.3 versicolor
133 6.3 2.8 5.1 1.5 virginica
4 5.0 3.6 1.4 0.2 setosa
33 5.5 4.2 1.4 0.2 setosa
Because groupby is a grouping function, the column to group by should be categorical in nature. In this dataset, we use species as the grouping column.
First, create a groupby object on species. We create the groupby object separately here because it will be reused repeatedly later; once you are proficient, you can also just chain the calls directly.
iris_gb = iris.groupby('species')
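The chained form mentioned above, without keeping a named groupby object, looks like this (a minimal sketch on a small stand-in DataFrame rather than the downloaded iris file):

```python
import pandas as pd

# small stand-in for the iris data used in the article
iris = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.0, 5.2, 6.5, 6.7],
})

# chained form: group and aggregate in one expression
means = iris.groupby('species')['sepal_length'].mean()
print(means)
```

Reusing a named `iris_gb` object avoids repeating the grouping step when you aggregate many times.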
1. Create a frequency table
If I want to know the count of each species class, the groupby method that can be used directly is size(), as follows.
>>> iris_gb.size()
species
setosa 50
versicolor 50
virginica 50
dtype: int64
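A related shortcut (not shown above): calling `value_counts()` on the column gives the same counts without building a groupby object first. A minimal sketch on stand-in data:

```python
import pandas as pd

# stand-in data with the same structure as iris
iris = pd.DataFrame({
    'species': ['setosa'] * 3 + ['versicolor'] * 2,
    'sepal_length': [5.0, 5.1, 5.2, 6.0, 6.1],
})

# group sizes via groupby
sizes = iris.groupby('species').size()
# same counts via value_counts (sorted by count by default)
counts = iris['species'].value_counts()

print(sizes)
print(counts)
```

The two differ only in ordering: `size()` follows the group keys, while `value_counts()` sorts by count.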
2. Calculate commonly used descriptive statistics
For example, to calculate the mean by group, use the mean() function.
>>> # compute the mean
>>> iris_gb.mean()
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
By default, mean() computes the mean over all numeric columns. To compute the mean of just one column, select that column first, as shown below.
>>> # single column
>>> iris_gb['sepal_length'].mean()
species
setosa 5.006
versicolor 5.936
virginica 6.588
Name: sepal_length, dtype: float64
>>> # two columns
>>> iris_gb[['sepal_length', 'petal_length']].mean()
sepal_length petal_length
species
setosa 5.006 1.462
versicolor 5.936 4.260
virginica 6.588 5.552
Similarly, the other descriptive statistics min(), max(), median() and std() are used in the same way.
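A quick sketch of that same pattern with the other statistics, on a small stand-in frame:

```python
import pandas as pd

# stand-in data; two groups, one numeric column
iris = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.6, 5.0, 6.0],
})
iris_gb = iris.groupby('species')

# each statistic is just a method on the groupby object
print(iris_gb.min())
print(iris_gb.max())
print(iris_gb.median())
print(iris_gb.std())
```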
3. Find the index of the maximum (or minimum) value
If we want to find the index of the maximum or minimum value within each group, idxmax() and idxmin() do this directly.
>>> iris_gb.idxmax()
sepal_length sepal_width petal_length petal_width
species
setosa 14 15 24 43
versicolor 50 85 83 70
virginica 131 117 118 100
How can this be applied? For example, to find the entire record corresponding to the maximum sepal_length of each group, we can use it like this. Note that this returns whole records, which is equivalent to filtering on the condition that sepal_length equals the group maximum.
>>> sepal_largest = iris.loc[iris_gb['sepal_length'].idxmax()]
>>> sepal_largest
sepal_length sepal_width petal_length petal_width species
14 5.8 4.0 1.2 0.2 setosa
50 7.0 3.2 4.7 1.4 versicolor
131 7.9 3.8 6.4 2.0 virginica
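`idxmin()` works symmetrically; for example, pulling the record with the smallest sepal_length per group (a sketch on stand-in data):

```python
import pandas as pd

# stand-in data; index labels stand in for the iris row numbers
iris = pd.DataFrame({
    'sepal_length': [5.0, 4.4, 7.0, 6.3],
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
})
iris_gb = iris.groupby('species')

# idxmin returns the index label of each group's minimum,
# and loc turns those labels back into full records
sepal_smallest = iris.loc[iris_gb['sepal_length'].idxmin()]
print(sepal_smallest)
```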
4. Reset the index after groupby
Often we need to perform further operations after a groupby, which means turning the grouping index back into ordinary rows and columns.
The first method, which everyone probably uses, is resetting the index with reset_index().
>>> iris_gb.max().reset_index()
species sepal_length sepal_width petal_length petal_width
0 setosa 5.8 4.4 1.9 0.6
1 versicolor 7.0 3.4 5.1 1.8
2 virginica 7.9 3.8 6.9 2.5
But there is actually another usage that looks friendlier: set the as_index parameter when calling groupby, and you achieve the same effect.
>>> iris.groupby('species', as_index=False).max()
species sepal_length sepal_width petal_length petal_width
0 setosa 5.8 4.4 1.9 0.6
1 versicolor 7.0 3.4 5.1 1.8
2 virginica 7.9 3.8 6.9 2.5
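To confirm the two approaches really are equivalent, a minimal sketch on stand-in data:

```python
import pandas as pd

# stand-in data
iris = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica'],
    'sepal_length': [5.0, 5.8, 7.9],
})

# approach 1: aggregate, then reset the grouping index
a = iris.groupby('species').max().reset_index()
# approach 2: ask groupby not to use the keys as the index
b = iris.groupby('species', as_index=False).max()

# both yield an ordinary RangeIndex with 'species' as a regular column
print(a.equals(b))
```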
5. Summary of various statistics
The examples above each compute a single statistic; what if we want several at the same time?
Another great use of groupby is pairing it with the agg aggregation method.
>>> iris_gb[['sepal_length', 'petal_length']].agg(["min", "mean"])
sepal_length petal_length
min mean min mean
species
setosa 4.3 5.006 1.0 1.462
versicolor 4.9 5.936 3.0 4.260
virginica 4.9 6.588 4.5 5.552
In agg, we only need to list the statistic names to compute multiple statistics on each column at once.
6. Aggregation of specific columns
The multi-statistic operations above apply the same statistics to every column. In practice, we may want something different for each column, in which case we can set different statistics for different columns individually.
>>> iris_gb.agg({
...     "sepal_length": ["min", "max"],
...     "petal_length": ["mean", "std"]
... })
sepal_length petal_length
min max mean std
species
setosa 4.3 5.8 1.462 0.173664
versicolor 4.9 7.0 4.260 0.469911
virginica 4.9 7.9 5.552 0.551895
7. Named statistics with NamedAgg
Now suppose the multi-level column index above looks a bit unfriendly, and we want to flatten the statistic and column name under each column into a single level. This column naming can be done with NamedAgg.
>>> iris_gb.agg(
... sepal_min=pd.NamedAgg(column="sepal_length", aggfunc="min"),
... sepal_max=pd.NamedAgg(column="sepal_length", aggfunc="max"),
... petal_mean=pd.NamedAgg(column="petal_length", aggfunc="mean"),
... petal_std=pd.NamedAgg(column="petal_length", aggfunc="std")
... )
sepal_min sepal_max petal_mean petal_std
species
setosa 4.3 5.8 1.462 0.173664
versicolor 4.9 7.0 4.260 0.469911
virginica 4.9 7.9 5.552 0.551895
Because NamedAgg is just a named tuple, we can also assign a plain tuple directly to the new name; the effect is the same but looks more concise.
iris_gb.agg(
sepal_min=("sepal_length", "min"),
sepal_max=("sepal_length", "max"),
petal_mean=("petal_length", "mean"),
petal_std=("petal_length", "std")
)
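A minimal runnable sketch of the tuple form, on stand-in data, showing that the result has plain single-level column names:

```python
import pandas as pd

# stand-in data
iris = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.0, 5.8, 6.0, 7.9],
    'petal_length': [1.0, 1.9, 4.5, 6.9],
})

# tuple form of named aggregation: new_name=(column, statistic)
named = iris.groupby('species').agg(
    sepal_min=('sepal_length', 'min'),
    petal_mean=('petal_length', 'mean'),
)

print(named)
print(list(named.columns))  # flat names, no MultiIndex
```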
8. Use custom functions
In the agg examples above, we passed statistic names as strings. We can also pass a function object directly.
>>> iris_gb.agg(pd.Series.mean)
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
Not only that, names and function objects can also be used together.
iris_gb.agg(["min", pd.Series.mean])
What's more, we can also define our own custom functions.
>>> def double_length(x):
... return 2*x.mean()
...
>>> iris_gb.agg(double_length)
sepal_length sepal_width petal_length petal_width
species
setosa 10.012 6.856 2.924 0.492
versicolor 11.872 5.540 8.520 2.652
virginica 13.176 5.948 11.104 4.052
Of course, for more concise code you can also use a lambda function. In short, the usage is very flexible and can be freely combined.
iris_gb.agg(lambda x: x.mean())
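Lambdas also mix freely with the named-aggregation syntax; for example, a per-group range (max minus min) under a custom name (stand-in data; the column name `rng` is just illustrative):

```python
import pandas as pd

# stand-in data
iris = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [4.3, 5.8, 4.9, 7.9],
})

result = iris.groupby('species').agg(
    # custom statistic: the range of each group
    rng=('sepal_length', lambda x: x.max() - x.min()),
)
print(result)
```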
These are 8 groupby operations you are likely to use in practice. Once you use it skillfully, you will find this function is really powerful.