[Repost] Pandas data processing (4): grouping and aggregation (Grouping)


 

Understanding of Group

When processing tabular data, grouping takes the values of one column as the reference key: each unique value in that column is associated with the matching rows of the other columns. That description may sound complicated, so comparing the data before and after processing should help:

The source data has 5 columns, namely user_id, age, gender, occupation, and zip_code;

 

 

Next, I group the data by the occupation column and compute, for each occupation, statistics such as the maximum, minimum, and mean of age and the counts for gender. The processing results are as follows:

 

 

That is a brief introduction to grouping and aggregation. The Pandas package provides the groupby function for this everyday operation, and this article gives a simple overview of how to use it.
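
To make the idea concrete, here is a minimal sketch on a small made-up table (the toy DataFrame and its values are assumptions for illustration, not the u.user data). It iterates over the groups that groupby produces, so each unique occupation can be seen next to the ages that belong to it.

```python
import pandas as pd

# Hypothetical toy data, not taken from the u.user file
toy = pd.DataFrame({
    "occupation": ["student", "engineer", "student", "engineer", "artist"],
    "age": [20, 35, 24, 45, 30],
})

# Iterating over a groupby yields (group key, sub-DataFrame) pairs
groups = {name: group["age"].tolist() for name, group in toy.groupby("occupation")}

for name, ages in groups.items():
    print(name, ages)
```

Each key in groups is one unique occupation, and the list holds the ages of the rows that share it; aggregation then reduces each of those lists to a single number.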

1. Import the library and read the data

import pandas as pd

# Read the pipe-separated user table and use user_id as the index
users = pd.read_table("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user", sep="|", index_col="user_id")
users.head()

The data set holds age, gender, occupation, and zip_code for each user; the following steps focus on three of these columns (age, gender, and occupation) as the objects of analysis;

 

 

2. Encode gender as a number

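def gender_to_numric(x):
    # Map 'M' to 1 and 'F' to 0; any other value falls through and returns None
    if x == 'M':
        return 1
    if x == 'F':
        return 0

# Use the new function to create a new column
users['gender_n'] = users['gender'].apply(gender_to_numric)
users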

F becomes 0, M is set to 1
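
As an alternative sketch (not the approach used in this article), the same encoding can be written with Series.map and a dictionary. The sample DataFrame below is a made-up stand-in for the gender column.

```python
import pandas as pd

# Hypothetical sample standing in for users['gender']
sample = pd.DataFrame({"gender": ["M", "F", "F", "M"]})

# map looks each value up in the dictionary; values missing from it become NaN
sample["gender_n"] = sample["gender"].map({"M": 1, "F": 0})
print(sample)
```

For a two-value column both forms give the same result; the dictionary simply keeps the mapping in one place.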

 

 

3. Building on step 2, calculate the percentage of men in each occupation

value_counts() counts how many times each unique value appears in a column; applied to occupation, it gives the total number of samples per occupation

a = users.groupby("occupation").gender_n.sum()/users.occupation.value_counts()*100
a.sort_values(ascending =False)

Then sort from largest to smallest
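
Because gender_n is 1 for men and 0 for women, the group mean of gender_n is already the share of men. The sketch below checks this on made-up data (the toy DataFrame is an assumption for illustration) by comparing the sum / value_counts form with the mean form.

```python
import pandas as pd

# Hypothetical toy data with gender already encoded (1 = male, 0 = female)
toy = pd.DataFrame({
    "occupation": ["student", "student", "engineer", "engineer", "engineer", "artist"],
    "gender_n": [1, 0, 1, 1, 0, 0],
})

# Form used in this article: men per occupation / people per occupation
by_sum = toy.groupby("occupation")["gender_n"].sum() / toy["occupation"].value_counts() * 100

# Equivalent form: the mean of a 0/1 column is the fraction of ones
by_mean = toy.groupby("occupation")["gender_n"].mean() * 100

print(by_mean.sort_values(ascending=False))
```

The division in the first form works because pandas aligns the two Series on their shared index (the occupation names) before dividing.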

 

4. Group by occupation and compute the minimum, maximum, and mean age of each occupation

The agg() function is used here to apply several aggregation functions to each group in a single call

users.groupby("occupation").age.agg(["min","max","mean"])

 

 

agg() is also used to compute statistics on several columns at once. The difference is that a dictionary (dict) is passed: each key is a column name, and each value is the list of statistics to compute for that column, such as max, min, mean, or count;

users.agg({"column_name": ["mean", "max", "min"]})

With the data in this article, age and gender statistics can be viewed together using the following command:

users.groupby("occupation").agg({"age":['mean','max','min'],'gender_n':['sum','count']})
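
A dictionary-based agg returns a DataFrame whose columns form two levels (column name, statistic). The sketch below shows this on made-up data (the toy DataFrame is an assumption for illustration) and how a single statistic can be selected with a tuple.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "occupation": ["student", "student", "engineer", "engineer"],
    "age": [20, 24, 35, 45],
    "gender_n": [1, 0, 1, 1],
})

stats = toy.groupby("occupation").agg({
    "age": ["mean", "max", "min"],
    "gender_n": ["sum", "count"],
})

# Columns are (column name, statistic) pairs
print(list(stats.columns))

# Select one statistic with a tuple key
print(stats[("age", "mean")])
```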

 

 

5. Grouping and aggregating on multiple columns

Above we grouped and aggregated on the occupation column alone. Here the data is grouped by occupation and then, within each occupation, by gender, and the number of people of each gender in each occupation is counted.

groupby(['column_1', 'column_2', ...]) # The order of the column names determines the order of grouping:

# Count men and women in each occupation
gender_occp = users.groupby(["occupation","gender"]).agg({"gender":"count"})
gender_occp
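
The result of a two-column groupby is indexed by (occupation, gender) pairs. As a minimal sketch on made-up data (the toy DataFrame is an assumption for illustration), size() gives the same per-pair counts, and unstack turns the inner gender level into columns for easier reading.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "occupation": ["student", "student", "student", "engineer", "engineer"],
    "gender": ["M", "F", "F", "M", "M"],
})

# One count per (occupation, gender) pair, indexed by a two-level MultiIndex
counts = toy.groupby(["occupation", "gender"]).size()
print(counts)

# Pivot the gender level into columns; pairs that never occur are filled with 0
table = counts.unstack(fill_value=0)
print(table)
```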

 

 

6. Building on step 5, calculate the percentage of each gender in each occupation

The basic idea for calculating the percentage of each gender in each occupation is as follows:

  • 1. Count the number of people of each gender in each occupation;
  • 2. Count the total number of samples in each occupation;
  • 3. Divide the result of 1 by the result of 2, matching rows on occupation;

Code part

# Count men and women in each occupation
gender_occp = users.groupby(["occupation","gender"]).agg({"gender":"count"})

# Compute the count for each occupation
occup_count = users.groupby(['occupation']).agg("count")

# gender_occp

# Perform the division, matching rows on the occupation index level
occup_gender = gender_occp.div(occup_count,level = "occupation")*100


# Select only the gender column
occup_gender.loc[:,'gender']

The DataFrame.div function is used here to divide one DataFrame by another, matching rows on a shared index level; the result has a float dtype. The level parameter names the index level to match on (here, occupation). Besides div, Pandas also provides arithmetic methods such as add, sub, mul, and pow, which are used in a similar way
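
The effect of the level parameter can be seen in a minimal sketch on made-up data (the toy DataFrame is an assumption for illustration). It uses Series rather than DataFrames to keep the output small: each (occupation, gender) count is divided by the total of the occupation it belongs to.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "occupation": ["student", "student", "student", "engineer", "engineer"],
    "gender": ["M", "F", "F", "M", "M"],
})

# Counts per (occupation, gender) pair and per occupation
per_gender = toy.groupby(["occupation", "gender"]).size()
per_occupation = toy.groupby("occupation").size()

# Broadcast the per-occupation totals across the 'occupation' level
share = per_gender.div(per_occupation, level="occupation") * 100
print(share)
```

Within each occupation the percentages add up to 100, which is a quick way to check that the division was matched on the intended level.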

The final result is as follows:

 

 

That covers the basic content of this article. Thank you for reading!


Origin blog.csdn.net/weixin_52071682/article/details/112421341