Group aggregation in Pandas

# Import the required libraries
import numpy as np
import pandas as pd

Creating the data

index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name") 
data = { 
"age": [18, 30, 35, 18, np.nan, 30], 
"city": ["Bei Jing ", "Shang Hai ","Guang Zhou", "Shen Zhen", np.nan, " "], 
"sex": ["male", "male", "female", "male", np.nan, "female"],
"income": [3000, 8000, 8000, 4000, 6000, 7000] 
} 
user_info = pd.DataFrame(data=data, index=index) 
user_info
"""
        age        city     sex  income
name
Tom    18.0   Bei Jing     male    3000
Bob    30.0  Shang Hai     male    8000
Mary   35.0  Guang Zhou  female    8000
James  18.0   Shen Zhen    male    4000
Andy    NaN         NaN     NaN    6000
Alice  30.0              female    7000
"""

Splitting objects into groups

  Before computing group statistics, the first step is to split the data into groups.

# Group by sex
grouped = user_info.groupby(user_info["sex"])
grouped.groups

# A more concise way:
# group by sex
user_info.groupby("sex").groups
# Group first by sex, then by age
user_info.groupby(["sex", "age"]).groups
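To make the structure of .groups concrete, here is a minimal sketch on a small frame that mirrors the article's columns. The names df and as_lists are only for illustration. It suggests that .groups maps each group key to the index labels of its rows, and that rows whose key is NaN are left out by default.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical, modelled on the article's data)
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", np.nan, "female"],
        "income": [3000, 8000, 8000, 4000, 6000, 7000],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name"),
)

# .groups maps each group key to the index labels belonging to that group
groups = df.groupby("sex").groups

# Convert the index objects to plain lists so they are easy to read
as_lists = {key: list(labels) for key, labels in groups.items()}
print(as_lists)
```

Andy, whose sex is missing, does not appear in any group here.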

Turning off sorting

  By default, groupby sorts the group keys during the operation. For better performance, pass sort=False.

grouped = user_info.groupby(["sex"], sort=False) 
grouped.groups 
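The effect of sort=False is easiest to see on the order of the group keys. The following is a small sketch with made-up data (df, sorted_keys and unsorted_keys are illustrative names): with the default, keys come back sorted; with sort=False, they keep the order in which they first appear.

```python
import pandas as pd

# Hypothetical data in which "male" appears before "female"
df = pd.DataFrame(
    {
        "sex": ["male", "female", "male", "female"],
        "income": [3000, 8000, 4000, 7000],
    }
)

# Default: group keys are sorted
sorted_keys = list(df.groupby("sex")["income"].sum().index)

# sort=False: group keys keep their order of first appearance
unsorted_keys = list(df.groupby("sex", sort=False)["income"].sum().index)

print(sorted_keys)
print(unsorted_keys)
```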

Selecting a column

  After grouping with groupby, the [] indexing operation can be used to select a column.

grouped = user_info.groupby("sex")
grouped
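As an illustration of column selection, the sketch below (with hypothetical data and variable names) selects one column with a string and two columns with a list, then sums the selected column per group.

```python
import pandas as pd

# Hypothetical data with the same column names as the article
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "age": [18, 30, 35, 18, 30],
        "income": [3000, 8000, 8000, 4000, 7000],
    }
)
grouped = df.groupby("sex")

# A single column name gives a grouped Series
income_by_sex = grouped["income"]

# A list of column names gives a grouped DataFrame
subset = grouped[["age", "income"]]

# Aggregations then only touch the selected column(s)
totals = income_by_sex.sum()
print(type(income_by_sex).__name__, type(subset).__name__)
print(totals)
```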

Iterating over groups

  After grouping, the groups can be iterated over.
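One common use of iteration is to collect something from each group. This is a minimal sketch with hypothetical data; incomes_by_sex is an illustrative name. Each step of the loop yields the group key and the sub-DataFrame for that key.

```python
import pandas as pd

# Hypothetical data modelled on the article's user_info
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "income": [3000, 8000, 8000, 4000, 7000],
    },
    index=["Tom", "Bob", "Mary", "James", "Alice"],
)

# Each iteration yields (group key, rows belonging to that key)
incomes_by_sex = {}
for key, group in df.groupby("sex"):
    incomes_by_sex[key] = group["income"].tolist()

print(incomes_by_sex)
```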

Iterating over groups of a single field

grouped = user_info.groupby("sex") 
for name, group in grouped: 
    print("name: {}".format(name)) 
    print("group: {}".format(group)) 
    print("--------------") 

"""
name: female
group:         age        city     sex  income
name
Mary   35.0  Guang Zhou  female    8000
Alice  30.0              female    7000
--------------
name: male
group:         age        city   sex  income
name
Tom    18.0   Bei Jing    male    3000
Bob    30.0  Shang Hai    male    8000
James  18.0   Shen Zhen   male    4000
"""

Iterating over groups of multiple fields

grouped = user_info.groupby(["sex", "age"]) 
for name, group in grouped: 
    print("name: {}".format(name)) 
    print("group: {}".format(group)) 
    print("--------------") 

"""
name: ('female', 30.0)
group:         age city     sex  income
name                            
Alice  30.0       female    7000
--------------
name: ('female', 35.0)
group:        age        city     sex  income
name                                  
Mary  35.0  Guang Zhou  female    8000
--------------
name: ('male', 18.0)
group:         age       city   sex  income
name                                
Tom    18.0  Bei Jing   male    3000
James  18.0  Shen Zhen  male    4000
--------------
name: ('male', 30.0)
group:        age        city   sex  income
name                                
Bob   30.0  Shang Hai   male    8000
"""

Selecting a group

.get_group()

  After grouping, the get_group method can be used to select one of the groups.

grouped = user_info.groupby("sex") 
grouped.get_group("male") 
user_info.groupby(["sex", "age"]).get_group(("male", 18)) 
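For illustration, the sketch below runs get_group on a small hypothetical frame and inspects which rows come back. With a single grouping key the argument is a scalar; with several keys it is a tuple in the same order as the keys.

```python
import pandas as pd

# Hypothetical data modelled on the article's user_info
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "age": [18, 30, 35, 18, 30],
        "income": [3000, 8000, 8000, 4000, 7000],
    },
    index=["Tom", "Bob", "Mary", "James", "Alice"],
)

# One grouping key: pass the key itself
males = df.groupby("sex").get_group("male")

# Two grouping keys: pass a tuple, one value per key
young_males = df.groupby(["sex", "age"]).get_group(("male", 18))

print(list(males.index))
print(list(young_males.index))
```

Asking for a key that does not exist raises a KeyError, so it can be worth checking the key against .groups first.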

Aggregation

  The purpose of grouping is to compute statistics, so the next step is to see how to aggregate once the grouping is done. Common aggregation operations include count, sum, maximum, minimum, and mean. The result of an aggregation is an object indexed by the group names.

  One way to perform an aggregation is to call the agg method.

.agg() 

# Get the number of people of each sex
grouped = user_info.groupby("sex")
grouped["age"].agg(len)
# Get the maximum age for each sex
grouped = user_info.groupby("sex")
grouped["age"].agg(np.max)
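One detail worth checking when counting with agg is how missing values are treated. The sketch below uses hypothetical data in which one age is missing: agg(len) counts every row in the group, while the "count" aggregation only counts non-missing values. It also shows that agg accepts a list of aggregation names.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one male row has a missing age
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male"],
        "age": [18, np.nan, 35, 18],
    }
)
ages = df.groupby("sex")["age"]

sizes = ages.agg(len)        # rows per group, missing values included
counts = ages.agg("count")   # non-missing values per group

# Several common aggregations at once, referred to by name
stats = ages.agg(["sum", "max", "min", "mean"])

print(sizes, counts, stats, sep="\n\n")
```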

When aggregating by multiple keys, the result has a multi-level index by default.

grouped = user_info.groupby(["sex", "age"]) 
rs = grouped.agg(len) 
rs
"""
             city  income
sex    age
female 30.0     1       1
       35.0     1       1
male   18.0     2       2
       30.0     1       1
"""

Avoiding the multi-level index

.reset_index()

  Call the reset_index method on the object that has the multi-level index.

rs.reset_index() 

The as_index=False parameter

  When grouping, set the parameter as_index=False.

grouped = user_info.groupby(["sex", "age"], as_index=False) 
grouped["income"].agg(max) 
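The two approaches can be compared side by side. This is a sketch on hypothetical data (via_reset and via_as_index are illustrative names); in both cases the group keys end up as ordinary columns rather than index levels.

```python
import pandas as pd

# Hypothetical data modelled on the article's user_info
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "age": [18, 30, 35, 18, 30],
        "income": [3000, 8000, 8000, 4000, 7000],
    }
)

# Option 1: aggregate, then move the index levels back into columns
via_reset = df.groupby(["sex", "age"])["income"].max().reset_index()

# Option 2: ask groupby not to use the keys as the index at all
via_as_index = df.groupby(["sex", "age"], as_index=False)["income"].max()

print(via_reset)
print(via_as_index)
```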

.describe(): viewing the data

  Both Series and DataFrame have a describe method. After grouping, the same method can be used to get an overview of the data in each group.

grouped = user_info.groupby("sex") 
grouped.describe().reset_index() 
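To see what describe returns for a grouped column, the following sketch (hypothetical data, illustrative names) applies it to a single column and lists the statistics it produces, one row per group.

```python
import pandas as pd

# Hypothetical data modelled on the article's user_info
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "income": [3000, 8000, 8000, 4000, 7000],
    }
)

# One row per group, one column per summary statistic
desc = df.groupby("sex")["income"].describe()

print(list(desc.columns))
print(desc)
```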

Applying multiple aggregation operations

Getting multiple statistics

  Sometimes, after grouping, we want more than one statistic.

# Compute the sum and the mean of income for each sex
grouped = user_info.groupby("sex")
grouped["income"].agg([np.sum, np.mean]).reset_index()

Renaming the statistical results

  To rename the resulting columns, pass a dictionary to rename.

grouped = user_info.groupby("sex") 
grouped["income"].agg([np.sum, np.mean]).rename(columns={"sum": "income_sum", "mean": "income_mean"})
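As an alternative to renaming afterwards, pandas also supports keyword-based ("named") aggregation on a grouped column, where each keyword becomes an output column name. This is not what the article uses; it is a sketch on hypothetical data showing the same result in one step.

```python
import pandas as pd

# Hypothetical data modelled on the article's user_info
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "income": [3000, 8000, 8000, 4000, 7000],
    }
)

# Each keyword is an output column name, each value the aggregation to apply
summary = df.groupby("sex")["income"].agg(income_sum="sum", income_mean="mean")

print(summary)
```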

Applying different aggregation operations to different DataFrame columns

  Sometimes different columns need different aggregation operations.

# Compute the mean age and the total income for each sex
grouped.agg({"age": np.mean, "income": np.sum}).rename(columns={"age": "age_mean", "income": "income_sum"}).reset_index()
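The per-column case can also be written with named aggregation, using (column, aggregation) pairs so that the output names are chosen up front. The sketch below is an alternative to the dictionary-plus-rename form, on hypothetical data.

```python
import pandas as pd

# Hypothetical data modelled on the article's user_info
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "age": [18, 30, 35, 18, 30],
        "income": [3000, 8000, 8000, 4000, 7000],
    }
)

# Output column name = (input column, aggregation name)
summary = (
    df.groupby("sex")
    .agg(age_mean=("age", "mean"), income_sum=("income", "sum"))
    .reset_index()
)

print(summary)
```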

The transform operation

  With the aggregations above, the result is an object indexed by the group names. as_index=False can be specified, but the resulting index is still not the index of the original data. To get results aligned with the original index, the aggregated values would have to be merged back.

  The transform method simplifies this process: it applies the func argument to every group and places the results on the index of the original data (a scalar result is broadcast to every row of its group).

# With agg, the result is indexed by the group names
grouped = user_info.groupby("sex")
grouped["income"].agg(np.mean)

# With transform, the result keeps the original index, so it is automatically aligned with the original rows
grouped = user_info.groupby("sex")
grouped["income"].transform(np.mean)

  As can be seen, the result of transform has the same length as the original data.
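Because the transform result lines up with the original rows, it can be assigned straight back as a new column. The sketch below, on hypothetical data with illustrative column names income_mean and income_diff, attaches each person's group mean and the difference from it.

```python
import pandas as pd

# Hypothetical data modelled on the article's user_info
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "income": [3000, 8000, 8000, 4000, 7000],
    },
    index=["Tom", "Bob", "Mary", "James", "Alice"],
)

# The group mean is broadcast to every row of its group
df["income_mean"] = df.groupby("sex")["income"].transform("mean")

# Row-wise comparison against the group mean
df["income_diff"] = df["income"] - df["income_mean"]

print(df)
```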

The apply operation

  apply splits the object to be processed into pieces, calls the passed-in function on each piece, and finally tries to combine the results with pd.concat(). The return value of func can be a pandas object or a scalar, and the size of the returned object is not restricted.

# Use apply to perform the aggregation above
grouped = user_info.groupby("sex")
grouped["income"].apply(np.mean)

# Get the n highest incomes for each sex
def f1(ser, num=2):
    return ser.nlargest(num).tolist()
grouped["income"].apply(f1)

# Get the mean age for each sex
def f2(df):
    return df["age"].mean()
grouped.apply(f2)
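Extra arguments for the function can be passed through apply as keyword arguments. The sketch below uses hypothetical data and a helper named top_n (an illustrative stand-in for f1) to show a scalar-returning function and a list-returning function, the latter with its default and with num=1.

```python
import pandas as pd

# Hypothetical data modelled on the article's user_info
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "income": [3000, 8000, 8000, 4000, 7000],
    }
)
grouped = df.groupby("sex")["income"]

def top_n(ser, num=2):
    """Return the num largest values of a group as a list."""
    return ser.nlargest(num).tolist()

# Scalar per group: the result is a Series indexed by group name
means = grouped.apply(lambda ser: ser.mean())

# List per group: each element of the result is a list
top2 = grouped.apply(top_n)

# Keyword arguments after the function are forwarded to it
top1 = grouped.apply(top_n, num=1)

print(means, top2, top1, sep="\n\n")
```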

 

 

Origin www.cnblogs.com/zry-yt/p/11811978.html