# Import the required libraries
import numpy as np
import pandas as pd
Creating the data
index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
data = {
    "age": [18, 30, 35, 18, np.nan, 30],
    "city": ["Bei Jing ", "Shang Hai ", "Guang Zhou", "Shen Zhen", np.nan, " "],
    "sex": ["male", "male", "female", "male", np.nan, "female"],
    "income": [3000, 8000, 8000, 4000, 6000, 7000]
}
user_info = pd.DataFrame(data=data, index=index)
user_info
"""
        age        city     sex  income
name
Tom    18.0   Bei Jing     male    3000
Bob    30.0  Shang Hai     male    8000
Mary   35.0  Guang Zhou  female    8000
James  18.0   Shen Zhen    male    4000
Andy    NaN         NaN     NaN    6000
Alice  30.0              female    7000
"""
Splitting an object into groups
Before computing grouped statistics, the first step is to split the data into groups.
# Group by sex
grouped = user_info.groupby(user_info["sex"])
grouped.groups

# A more concise way to group by sex
user_info.groupby("sex").groups

# Group by sex first, then by age
user_info.groupby(["sex", "age"]).groups
Turning off sorting
By default, groupby sorts the group keys during the operation. For better performance, pass sort=False.
grouped = user_info.groupby(["sex"], sort=False)
grouped.groups
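The effect of sort=False is easiest to see on the order of the group keys. The following is a minimal sketch on a small hypothetical table (the frame `df` is made up for illustration and is not the article's user_info):

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "income": [3000, 8000, 4000, 7000],
})

# Default (sort=True): group keys come back in sorted order.
sorted_keys = list(df.groupby("sex")["income"].sum().index)

# sort=False: group keys keep the order in which they first appear.
unsorted_keys = list(df.groupby("sex", sort=False)["income"].sum().index)

print(sorted_keys)    # ['female', 'male']
print(unsorted_keys)  # ['male', 'female']
```

Only the order of the groups changes; the rows inside each group and the aggregated values are the same either way.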
Selecting a column
After grouping with groupby, you can use the indexing operator [] to select a column.
grouped = user_info.groupby("sex")
grouped
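As a rough sketch of column selection on a grouped object, the example below uses a small hypothetical frame `df` (an assumption for illustration, not the article's data). A single column name gives a per-column grouped object, and a list of names keeps several columns:

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "age": [18, 30, 35, 30],
    "income": [3000, 8000, 8000, 7000],
})
grouped = df.groupby("sex")

# A single column name selects that column within every group.
total_income = grouped["income"].sum()

# A list of column names selects several columns at once.
means = grouped[["age", "income"]].mean()

print(total_income)
print(means)
```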
Iterating over groups
After the data has been grouped, you can iterate over the groups.
Iterating over groups keyed by a single field
grouped = user_info.groupby("sex")
for name, group in grouped:
    print("name: {}".format(name))
    print("group: {}".format(group))
    print("--------------")
"""
name: female
group:         age        city     sex  income
name
Mary   35.0  Guang Zhou  female    8000
Alice  30.0              female    7000
--------------
name: male
group:         age        city   sex  income
name
Tom    18.0   Bei Jing   male    3000
Bob    30.0  Shang Hai   male    8000
James  18.0   Shen Zhen  male    4000
--------------
"""
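Andy, whose sex is NaN, does not appear in any group in the output above, because groupby drops rows with a missing key by default. The sketch below shows this on a hypothetical three-row frame, and assumes pandas 1.1 or later for the dropna=False option:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame(
    {"sex": ["male", "female", np.nan], "income": [3000, 8000, 6000]},
    index=pd.Index(["Tom", "Mary", "Andy"], name="name"),
)

# Default: the row whose key is NaN is left out of every group.
default_names = [name for name, _ in df.groupby("sex")]

# dropna=False (assumes pandas >= 1.1) keeps NaN as its own group.
with_nan = df.groupby("sex", dropna=False)["income"].sum()

print(default_names)  # ['female', 'male']
print(len(with_nan))  # 3
```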
Iterating over groups keyed by multiple fields
grouped = user_info.groupby(["sex", "age"])
for name, group in grouped:
    print("name: {}".format(name))
    print("group: {}".format(group))
    print("--------------")
"""
name: ('female', 30.0)
group:         age city     sex  income
name
Alice  30.0       female    7000
--------------
name: ('female', 35.0)
group:        age        city     sex  income
name
Mary  35.0  Guang Zhou  female    8000
--------------
name: ('male', 18.0)
group:         age       city   sex  income
name
Tom    18.0  Bei Jing   male    3000
James  18.0  Shen Zhen  male    4000
--------------
name: ('male', 30.0)
group:        age        city   sex  income
name
Bob   30.0  Shang Hai   male    8000
--------------
"""
Selecting a group
.get_group()
After grouping, we can use the get_group method to select one of the groups.
grouped = user_info.groupby("sex")
grouped.get_group("male")

user_info.groupby(["sex", "age"]).get_group(("male", 18))
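To make the return value of get_group concrete, here is a small sketch on a hypothetical frame `df` (made up for illustration). The selected group is an ordinary DataFrame holding the matching rows under their original index, and with several grouping keys the group is named by a tuple:

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male"],
        "age": [18, 30, 35, 18],
        "income": [3000, 8000, 8000, 4000],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# One grouping key: pass the key value itself.
males = df.groupby("sex").get_group("male")

# Several grouping keys: pass a tuple with one value per key.
young_males = df.groupby(["sex", "age"]).get_group(("male", 18))

print(males.index.tolist())        # ['Tom', 'Bob', 'James']
print(young_males.index.tolist())  # ['Tom', 'James']
```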
Aggregation
Grouping is done in order to compute statistics, and statistics require aggregation, so next we look at how to aggregate once the groups are formed. Common aggregation operations include count, sum, maximum, minimum, and mean. The result is an object indexed by the group names.
One way to perform an aggregation is to call the agg method.
.agg()
# Count the number of people of each sex
grouped = user_info.groupby("sex")
grouped["age"].agg(len)

# Get the maximum age for each sex
grouped = user_info.groupby("sex")
grouped["age"].agg(np.max)
If the aggregation is performed on multiple keys, the result has a multi-level index (MultiIndex) by default.
grouped = user_info.groupby(["sex", "age"])
rs = grouped.agg(len)
rs
"""
             city  income
sex    age
female 30.0     1       1
       35.0     1       1
male   18.0     2       2
       30.0     1       1
"""
Avoiding a multi-level index
.reset_index()
Call the reset_index method on the object that has the multi-level index.
rs.reset_index()
The as_index=False parameter
When grouping, set the parameter as_index=False.
grouped = user_info.groupby(["sex", "age"], as_index=False)
grouped["income"].agg(max)
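The two ways of avoiding a multi-level index can be compared side by side. The sketch below uses a hypothetical frame `df` (an assumption for illustration) and the string name "max" for the aggregation:

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({
    "sex": ["male", "male", "female", "male"],
    "age": [18, 30, 35, 18],
    "income": [3000, 8000, 8000, 4000],
})
keys = ["sex", "age"]

# Default: the group keys become a two-level index.
multi = df.groupby(keys)["income"].max()

# Option 1: flatten afterwards with reset_index().
flat_a = multi.reset_index()

# Option 2: keep the keys as ordinary columns from the start.
flat_b = df.groupby(keys, as_index=False)["income"].max()

print(multi.index.nlevels)   # 2
print(list(flat_a.columns))  # ['sex', 'age', 'income']
print(list(flat_b.columns))  # ['sex', 'age', 'income']
```

Both options end with the same flat table, so the choice is mostly a matter of whether you need the indexed form at some intermediate step.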
.describe(): viewing a summary of the data
Both Series and DataFrame have a describe method. After grouping, we can use the same method to view a summary of each group.
grouped = user_info.groupby("sex")
grouped.describe().reset_index()
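As a small sketch of what a grouped describe returns, the example below runs it on a hypothetical frame `df` (made up for illustration). On a single column the statistics become the columns of the result; on several columns the result has two column levels, one for the original column and one for the statistic:

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({
    "sex": ["male", "male", "female", "male", "female"],
    "age": [18, 30, 35, 18, 30],
    "income": [3000, 8000, 8000, 4000, 7000],
})

# One column: one row per group, one column per statistic.
desc = df.groupby("sex")["income"].describe()

# Several columns: the columns become (column, statistic) pairs.
wide = df.groupby("sex")[["age", "income"]].describe()

print(list(desc.columns))
print(wide.columns.nlevels)  # 2
```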
Applying multiple aggregation operations
Getting multiple statistics
Sometimes, after grouping, you want more than one statistic.
# Compute the sum and mean of income for each sex
grouped = user_info.groupby("sex")
grouped["income"].agg([np.sum, np.mean]).reset_index()
Renaming the statistical results
If you want to rename the resulting columns, pass a dictionary to rename.
grouped = user_info.groupby("sex")
grouped["income"].agg([np.sum, np.mean]).rename(columns={"sum": "income_sum", "mean": "income_mean"})
Applying different aggregation operations to different DataFrame columns
Sometimes you need to apply a different aggregation operation to each column.
# Compute the mean age and the total income for each sex
grouped.agg({"age": np.mean, "income": np.sum}).rename(columns={"age": "age_mean", "income": "income_sum"}).reset_index()
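An alternative to aggregating and then renaming is pandas' named aggregation, where each keyword argument names an output column and maps it to a (column, function) pair. This is a sketch of that alternative on a hypothetical frame `df`, not the approach used in the code above, and it assumes pandas 0.25 or later:

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame({
    "sex": ["male", "male", "female", "male", "female"],
    "age": [18, 30, 35, 18, 30],
    "income": [3000, 8000, 8000, 4000, 7000],
})

# output_name=(input_column, aggregation) does the aggregation
# and the renaming in a single step.
stats = df.groupby("sex").agg(
    age_mean=("age", "mean"),
    income_sum=("income", "sum"),
).reset_index()

print(stats)
```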
The transform operation
In the aggregation operations above, the result is an object indexed by the group names. You can specify as_index=False, but the result still does not carry the original index. If we want results aligned with the index of the original data, we would need an extra merge step.
The transform method simplifies this process: it applies the func argument to every group and places the result on the index of the original data (a scalar result is broadcast to every row of the group).
# With agg, the result is indexed by the group names
grouped = user_info.groupby("sex")
grouped["income"].agg(np.mean)

# With transform, the result keeps the original index, so it lines up with the original data automatically
grouped = user_info.groupby("sex")
grouped["income"].transform(np.mean)
As you can see, the result of transform has the same length as the original data.
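Because the result of transform shares the original index, it can be combined with the original columns directly. The sketch below uses a hypothetical frame `df` and a made-up column name `income_vs_group` to compare each person's income with the mean of their group:

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "income": [3000, 8000, 8000, 4000, 7000],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James", "Alice"], name="name"),
)

# agg: one value per group, indexed by the group name.
agg_mean = df.groupby("sex")["income"].agg("mean")

# transform: the group mean broadcast back to every original row.
group_mean = df.groupby("sex")["income"].transform("mean")

# Same index as df, so plain column arithmetic lines up row by row.
df["income_vs_group"] = df["income"] - group_mean

print(len(agg_mean), len(group_mean))  # 2 5
print(df)
```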
The apply operation
apply splits the object being processed into pieces, calls the passed function on each piece, and then tries to combine the results with pd.concat(). The return value of func may be a pandas object or a scalar, and the size of the returned object is not restricted.
# Use apply to perform the aggregation above
grouped = user_info.groupby("sex")
grouped["income"].apply(np.mean)

# Get the n highest incomes for each sex
def f1(ser, num=2):
    return ser.nlargest(num).tolist()

grouped["income"].apply(f1)

# Get the mean age for each sex
def f2(df):
    return df["age"].mean()

grouped.apply(f2)
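To illustrate that the function passed to apply may return either a scalar or a pandas object of any size, here is a sketch on a hypothetical frame `df`. The helper `top_n` is made up for this example; unlike f1 above it returns a Series rather than a list, so the pieces are combined into one Series indexed by group and original label:

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
df = pd.DataFrame(
    {
        "sex": ["male", "male", "female", "male", "female"],
        "income": [3000, 8000, 8000, 4000, 7000],
    },
    index=pd.Index(["Tom", "Bob", "Mary", "James", "Alice"], name="name"),
)
grouped = df.groupby("sex")["income"]

# Scalar return value: one row per group.
income_range = grouped.apply(lambda ser: ser.max() - ser.min())

# Series return value: each group contributes n rows.
def top_n(ser, n=2):
    return ser.nlargest(n)

top2 = grouped.apply(top_n)

# Extra keyword arguments are forwarded to the function.
top1 = grouped.apply(top_n, n=1)

print(income_range)
print(top2)
print(top1)
```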