(Data Science Learning Notes 69) map(), apply(), applymap(), groupby() and agg() in pandas, explained in detail

* All of the data and code for this and earlier articles in the series have been uploaded to my github repository: https://github.com/CNFeffery/DataScienceStudyNotes

I. Introduction

  pandas provides many simple and convenient methods for operating on single columns, on multiple columns, and on grouped batches of data. Becoming familiar with these methods can greatly improve the efficiency of your data analysis and make your code simpler and more elegant. This article walks through pandas' map(), apply(), applymap(), groupby(), agg() and related methods in detail, with practical examples to help you better grasp their tricks (all code and data used in this article are stored in the folder for this article in my github repository: https://github.com/CNFeffery/DataScienceStudyNotes).

 

II. Non-aggregation methods

  Non-aggregation here means that the data are not grouped during processing and the length of the Series does not change, so this section does not involve groupby(). First we read in the data. This article uses the national baby-name dataset, which contains the number of newborns for each name nationwide from 1880 to 2018 along with some basic information. In jupyterlab we read the data and print it to get a feel for our dataset:

import pandas as pd

# read in the data
data = pd.read_csv('data.csv')
data.head()

# check the column dtypes and the shape of the data frame
print(data.dtypes)
print()
print(data.shape)

2.1 map()

  Like Python's built-in map(), pandas' map() method accepts a function, a dictionary, or some other object that takes a single input value and returns a single output, and applies it element-wise to a single column, linking the results into a new Series. For example, to convert the gender column's F and M values into female and male as a new column, we have the following options:

● dictionary mapping

  Here we write a dictionary defining the one-to-one mapping between F, M and female, male, then use the map() method to get the mapped column:

# define the F -> female, M -> male mapping dictionary
gender2xb = {'F': 'female', 'M': 'male'}
# use map() to obtain the column mapped from the gender column
data.gender.map(gender2xb)

 ● lambda function

  Here we pass a lambda function to map() to achieve the same thing:

# since the gender column is known to contain only F and M, the following lambda suffices
data.gender.map(lambda x: 'female' if x == 'F' else 'male')

● Regular functions

  You can also pass a regular function defined with def:

def gender_to_xb(x):
    return 'female' if x == 'F' else 'male'

data.gender.map(gender_to_xb)

   What you pass to map() can sometimes be quite special, as in the following example:

● special objects

  Some objects that accept a single input value and produce an output can also be handled by the map() method:

data.gender.map("This kid's gender is {}".format)

   map() also has a parameter na_action, similar to na.action in R, which takes the value None or 'ignore' and controls how missing values are handled during the operation; when set to 'ignore', NaN values in the Series are skipped and returned unchanged.
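As a minimal sketch of how na_action behaves (using a toy Series rather than the article's dataset):

```python
import pandas as pd
import numpy as np

# a toy Series standing in for the gender column (hypothetical values)
s = pd.Series(['F', np.nan, 'M'])

# without na_action, NaN reaches the lambda and falls into the else branch
out_default = s.map(lambda x: 'female' if x == 'F' else 'male')

# with na_action='ignore', NaN values are passed through untouched
out_ignore = s.map(lambda x: 'female' if x == 'F' else 'male', na_action='ignore')

print(out_default.tolist())  # ['female', 'male', 'male']
print(out_ignore.tolist())   # ['female', nan, 'male']
```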

 

2.2 apply()

  apply() may be the most frequently called pandas method. Its usage is similar to map(): the function passed in mainly accepts a basic-type input and returns an output. But whereas map() only processes a single Series, one apply() statement can operate on a single column or on multiple columns, covering many usage scenarios. Let's introduce them in turn:

● Single-column data

  Here, following 2.1, we pass a lambda function to apply():

data.gender.apply(lambda x: 'female' if x == 'F' else 'male')

   You can see this achieves the same result as map().

● Multiple columns of data

  The most special thing about apply() is that it can process multiple columns of data at the same time. For example, here we write a function that takes several column values and composes a descriptive sentence for each row, then pass it to apply() wrapped in a lambda that picks out those values (when called as DataFrame.apply(), apply() actually iterates over each row as a Series, rather than over single values as Series.apply() does). Note that when processing multiple values you must give apply() the parameter axis=1:

def generate_descriptive_statement(year, name, gender, count):
    year, count = str(year), str(count)
    gender = 'female' if gender == 'F' else 'male'

    return 'In {}, there were {} newborns named {} of gender {}.'.format(year, count, name, gender)

data.apply(lambda row:generate_descriptive_statement(row['year'],
                                                      row['name'],
                                                      row['gender'],
                                                      row['count']),
           axis = 1)

● Adding a progress bar to the apply() process with tqdm

  We know that apply() still iterates row by row under the hood, so when the computation is heavy it is very pleasant to have a progress bar monitoring how far the run has gotten. In (Data Science Learning Notes 53) on the tqdm module in Python, I introduced adding progress bars to programs with tqdm, and tqdm also has good support for pandas: we can use progress_apply() in place of apply(), and call tqdm.pandas(desc='') before running progress_apply() to start monitoring the apply process, where the desc parameter takes a string describing the progress bar. Below we modify the example from the previous subsection to add a progress bar:

from tqdm import tqdm

def generate_descriptive_statement(year, name, gender, count):
    year, count = str(year), str(count)
    gender = 'female' if gender == 'F' else 'male'

    return 'In {}, there were {} newborns named {} of gender {}.'.format(year, count, name, gender)

# start monitoring the apply process that follows
tqdm.pandas(desc='apply')
data.progress_apply(lambda row:generate_descriptive_statement(row['year'],
                                                      row['name'],
                                                      row['gender'],
                                                      row['count']),
          axis = 1)

   You can see that as the program runs in jupyter lab, a progress bar monitoring the process appears below, so you can see in real time where the apply process has gotten to.

 

2.3 applymap()

  applymap() is the DataFrame-specific counterpart of the map() method. Like map(), you pass in a function, a dictionary, etc. and get the corresponding output; the difference is that applymap() applies what you pass in to every element in the entire data frame, so the shape of its result matches the original data frame. For example, in the simple case below, we lowercase every string value in the baby-name data and return values of other types unchanged:

def lower_all_string(x):
    if isinstance(x, str):
        return x.lower()
    else:
        return x

data.applymap(lower_all_string)

   The shape is unchanged:

 

   With applymap(), many data-processing operations can be completed concisely.

 

III. Aggregation methods

  Sometimes, as with aggregation operations in SQL, we need to group the raw data by one or more discrete columns and then compute aggregates such as sums and means. In pandas, grouped computation is a very elegant affair.

3.1 Grouping with groupby()

  The first step of grouped computation is, of course, grouping. In pandas we group a data frame with the groupby() method, whose main parameter is by. This parameter takes the name(s) of the grouping variable(s): pass a name string for a single variable, or a list of names for several. Calling groupby() on a DataFrame returns a generator-like object; you need to turn it into a list to get the grouped subsets, as in the following example:

# group the baby-name data by year and gender
groups = data.groupby(by=['year', 'gender'])
# check the type of groups
type(groups)

   You can see it is a generator-like object at this point; below we use a list comprehension to extract all the grouped results:

# extract the grouped results with a list comprehension
groups = [group for group in groups]

  Inspecting one of the elements:

 

   You can see each result is a 2-tuple: the first element is the grouping key combination for that group, and the second is the grouped sub-data-frame. The result of DataFrame.groupby() mainly supports the following operations:
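To make the 2-tuple structure concrete, here is a minimal sketch on a tiny hypothetical stand-in for the baby-name data:

```python
import pandas as pd

# a tiny stand-in for the baby-name data (hypothetical rows)
data = pd.DataFrame({'year':   [1880, 1880, 1881],
                     'gender': ['F', 'M', 'F'],
                     'name':   ['Mary', 'John', 'Anna'],
                     'count':  [7065, 9655, 2604]})

groups = [group for group in data.groupby(by=['year', 'gender'])]

# each element is a 2-tuple: (grouping key combination, sub-data-frame)
key, sub_df = groups[0]
print(key)          # (1880, 'F')
print(len(sub_df))  # 1 — rows in this group's sub-data-frame
```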

● Calling aggregation functions directly

  For example, here we select the count column and call max() directly:

# find the highest frequency within each group
data.groupby(by=['year', 'gender'])['count'].max()

 

   Note that the year and gender columns now exist as the index; to restore them as data frame columns, just use reset_index(drop=False):
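A minimal sketch of this round trip, again on a hypothetical toy frame:

```python
import pandas as pd

# hypothetical stand-in rows for the baby-name data
data = pd.DataFrame({'year':   [1880, 1880, 1880],
                     'gender': ['F', 'F', 'M'],
                     'count':  [7065, 2604, 9655]})

# year and gender become a MultiIndex after the grouped max()
agg = data.groupby(by=['year', 'gender'])['count'].max()

# reset_index(drop=False) turns the index levels back into ordinary columns
restored = agg.reset_index(drop=False)
print(restored.columns.tolist())  # ['year', 'gender', 'count']
```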

 

● Combining with apply()

  The grouped result can also call apply() directly, which lets you write freer custom functions to meet your needs. For example, below we use a custom function to find, for each year and each gender, the most frequent name and its frequency. Note that the object passed to apply() here is each grouped sub-data-frame, so the df parameter received directly by the custom function below is each group's sub-data-frame:

def find_most_name(df):
    # idxmax() gives the index label of the row with the highest count
    idx = df['count'].idxmax()
    return str(df['count'].max()) + '-' + df.loc[idx, 'name']

data.groupby(['year', 'gender']).apply(find_most_name).reset_index(drop=False)

 

3.2 More flexible aggregation with agg()

  agg is short for aggregate. In pandas, agg() can aggregate a Series, a DataFrame, or the result of groupby(). It takes a dictionary whose keys are variable names and whose values are the corresponding aggregation function strings; for example, {'v1': ['sum', 'mean'], 'v2': ['median', 'max', 'min']} means computing the sum and mean of column v1, and the median, max and min of column v2. Below are a few simple examples of its concrete usage:

● Aggregating a Series

  When aggregating a Series, since there is only one column, you can skip the dictionary form and pass a list of function names directly:

# the min, max and median of the count column
data['count'].agg(['min', 'max', 'median'])

● Aggregating a data frame

  When aggregating a data frame, since there are multiple columns, the aggregation scheme must be passed in as a dictionary:

data.agg({'year': ['max','min'], 'count': ['mean','std']})

   Note that because the aggregation schemes for the two columns in the example above don't match, NaN values appear in the result.

● Aggregating groupby() results

data.groupby(['year','gender']).agg({'count':['min','max','median']}).reset_index(drop=False)

   Notice that although we use reset_index() to restore the index columns as variables, the aggregated column names end up looking odd (the red box in the original screenshot). In pandas 0.25.0 and later, you can use pd.NamedAgg() to give each aggregated column a new name:

data.groupby(['year','gender']).agg(
    min_count=pd.NamedAgg(column='count', aggfunc='min'),
    max_count=pd.NamedAgg(column='count', aggfunc='max'),
    median=pd.NamedAgg(column='count', aggfunc='median')).reset_index(drop=False)
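As a side note, in pandas 0.25.0+ the same named aggregation can also be written with plain (column, aggfunc) tuples, for which pd.NamedAgg is a readable wrapper; a sketch on a hypothetical toy frame:

```python
import pandas as pd

# hypothetical stand-in rows for the baby-name data
data = pd.DataFrame({'year':   [1880, 1880, 1880, 1880],
                     'gender': ['F', 'F', 'M', 'M'],
                     'count':  [7065, 2604, 9655, 12]})

# (column, aggfunc) tuples are shorthand for pd.NamedAgg
result = (data.groupby(['year', 'gender'])
              .agg(min_count=('count', 'min'),
                   max_count=('count', 'max'))
              .reset_index(drop=False))
print(result.columns.tolist())  # ['year', 'gender', 'min_count', 'max_count']
```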

   

  That's all for this article; if you spot any errors, please point them out!

Origin www.cnblogs.com/feffery/p/11468762.html