Project GitHub address: bitcarmanlee easy-algorithm-interview-and-practice
Everyone is welcome to star the project, leave a message, and learn and improve together.
1. Grouping with groupby
Grouping comes up constantly in day-to-day data analysis. The data is split into groups according to one or more fields, and each group is then analyzed further, for example by computing its row count, maximum, minimum, or average. In SQL, this is the well-known GROUP BY operation.
pandas offers a corresponding groupby operation. Let's take a look at how to use it.
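As a quick preview, the per-group statistics mentioned above can be sketched in a single call. The sample values are the ones used in the examples later in this post; the variable name `stats` and the SQL in the comment are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
})

# Roughly comparable to:
#   SELECT level, COUNT(num), MAX(num), MIN(num), AVG(num) FROM df GROUP BY level
stats = df.groupby("level")["num"].agg(["count", "max", "min", "mean"])
print(stats)
```

Each row of `stats` corresponds to one value of level, and each column holds one of the requested statistics.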
2. The data structure of groupby
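First, let's look at the following code:
import pandas as pd

def ddd():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    df = pd.DataFrame({"level": levels, "num": nums})
    g = df.groupby('level')
    print(g)        # prints the DataFrameGroupBy object itself
    print()
    print(list(g))  # materializes the (key, sub-DataFrame) pairs

ddd()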
The output is as follows:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10f6f96d0>

[('L1',   level  num
0    L1   10
1    L1   20
2    L1   30), ('L2',   level  num
3    L2   20
4    L2   15), ('L3',   level  num
5    L3   10
6    L3   12)]
The groupby operation returns a DataFrameGroupBy object. Printing that object directly only shows its type and memory address.
To inspect the data more easily, we convert it with list() and find that it is a list of tuples, one per group. The first element of each tuple is the value of level, and the second element is the DataFrame holding all rows of that group.
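The same (key, sub-DataFrame) pairs can also be consumed without building a list first. The following is a minimal sketch using the same data; the names `group_sizes` and `l2` are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
})
g = df.groupby("level")

# Iterating over the groupby object yields the same tuples that list(g) shows.
group_sizes = {}
for key, sub_df in g:
    group_sizes[key] = len(sub_df)

# get_group returns the sub-DataFrame for a single key.
l2 = g.get_group("L2")

print(group_sizes)
print(l2["num"].tolist())
```

Iterating is usually enough for inspection, while get_group is handy when only one group is of interest.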
3. Basic usage of groupby
def group1():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    scores = [100, 200, 300, 200, 150, 100, 120]
    df = pd.DataFrame({"level": levels, "num": nums, "score": scores})
    # sum num and average score within each level
    result = df.groupby('level').agg({'num': 'sum', 'score': 'mean'})
    # total of num across all groups
    allnum = result['num'].sum()
    # share of each group's num in the overall total
    result['rate'] = result['num'].map(lambda x: x / allnum)
    print(result)
The final output:
       num  score      rate
level
L1      60    200  0.512821
L2      35    175  0.299145
L3      22    110  0.188034
The example above shows the basic usage of groupby.
We group the DataFrame by level, then sum the num column and average the score column to get the result.
We also want to know what share of the overall num total each group accounts for. So we first compute the total of num, and then use the map method to add a column to the result holding each group's proportion.
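The proportion can also be computed without map, by dividing the whole column by its total. This is a small sketch of that alternative, not a change to the approach above; it assumes the same sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
    "score": [100, 200, 300, 200, 150, 100, 120],
})

result = df.groupby("level").agg({"num": "sum", "score": "mean"})

# Dividing a Series by a scalar is applied element-wise,
# so no lambda is needed to compute each group's share.
result["rate"] = result["num"] / result["num"].sum()
print(result)
```

The rate column should match the values shown above (for example, 60 / 117 for L1), and the rates sum to 1. In recent pandas versions the score column is float64, so it should print as 200.0 rather than 200.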
4. The usage of transform
Let's look at a more complex example below.
def t10():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    df = pd.DataFrame({"level": levels, "num": nums})
    # mean of num per level, as a {level: mean} dict
    ret = df.groupby('level')['num'].mean().to_dict()
    # look up each row's level in the dict to fill the new column
    df['avg_num'] = df['level'].map(ret)
    print(ret)
    print(df)
The output is as follows:
{'L1': 20.0, 'L2': 17.5, 'L3': 11.0}
  level  num  avg_num
0    L1   10     20.0
1    L1   20     20.0
2    L1   30     20.0
3    L2   20     17.5
4    L2   15     17.5
5    L3   10     11.0
6    L3   12     11.0
In the example above, after grouping by level we want to add a column to the data set so that every row carries the average value of its own level.
The solution shown first computes the average of each group, converts the result into a dict, and then uses the map method on the level column to fill in each row's group average.
def trans():
    levels = ["L1", "L1", "L1", "L2", "L2", "L3", "L3"]
    nums = [10, 20, 30, 20, 15, 10, 12]
    df = pd.DataFrame({"level": levels, "num": nums})
    df['avg_num'] = df.groupby('level')['num'].transform('mean')
    print(df)
If you use the transform method, the code can be simpler and more intuitive, as shown above.
What transform does: it calls the given function on each group and returns an object that has the same index as the original df and is filled with the transformed values. Here the result is a Series aligned row by row with df, so assigning it is equivalent to adding a column to the original DataFrame.
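To make the index alignment concrete, here is a minimal sketch with the same data. The column name `diff_from_avg` and the variable `same_index` are hypothetical additions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "level": ["L1", "L1", "L1", "L2", "L2", "L3", "L3"],
    "num": [10, 20, 30, 20, 15, 10, 12],
})

# transform returns one value per original row, not one per group.
avg = df.groupby("level")["num"].transform("mean")

# The result shares the index of df, so it can be assigned or used in arithmetic directly.
same_index = avg.index.equals(df.index)
df["avg_num"] = avg
df["diff_from_avg"] = df["num"] - avg

print(same_index)
print(df)
```

Because the transformed values line up with the original rows, follow-up calculations such as each row's deviation from its group average need no dict or extra merge step.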