Python third-party modules, data analysis: advanced use of the Pandas module


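1. Categorical Data
1. Concept:

A "categorical variable" is a qualitative variable that takes only a limited number of values, which appear as mutually exclusive categories or attributes. In Pandas its dtype is named category, and it comes in two kinds: "ordered" (e.g. degree of improvement) and "unordered" (e.g. gender). Categorical data is usually stored as distinct ints, an approach known as "categorical encoding" or "dictionary encoding"; the int values are called "category codes" or simply "codes". Storing the data this way can greatly improve performance during analysis and save memory, and it lets the categories be transformed while the codes stay unchanged, for example:
① renaming categories
② appending 1 new category without changing the order of the existing categories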
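
To make the encoding idea concrete, here is a minimal sketch. The sample labels and variable names are made up for illustration; it compares memory use before and after conversion and checks that renaming or appending categories leaves the codes untouched.

```python
import pandas as pd

# Hypothetical sample data: three labels repeated many times.
labels = pd.Series(["low", "medium", "high"] * 10_000)
cat = labels.astype("category")

# Each value is stored as a small int that indexes into the categories.
print(cat.cat.categories.tolist())     # ['high', 'low', 'medium'] (sorted)
print(cat.cat.codes.head(3).tolist())  # [1, 2, 0]

# The categorical version needs far less memory than the plain strings.
mem_plain = labels.memory_usage(deep=True)
mem_cat = cat.memory_usage(deep=True)
print(mem_plain, mem_cat)

# Renaming categories, or appending a new one, does not change the codes.
renamed = cat.cat.rename_categories({"low": "L", "medium": "M", "high": "H"})
extended = cat.cat.add_categories(["very high"])
print((renamed.cat.codes == cat.cat.codes).all())
print((extended.cat.codes == cat.cat.codes).all())
```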

2. Create
(1) Create by instantiation:

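pd.Categorical(<values>[,categories=None,ordered=False,dtype=None,fastpath=False])
  #Parameters:
    values: the values; list-like
      #any value that does not appear in categories is replaced with NaN
    categories: the full set of categories; Index-like (must be unique), defaults to the categories that occur in <values>
      #categories may be of any immutable (hashable) type
    ordered: whether the categories are ordered; bool
      #when categories is omitted, the category order is inferred by sorting the unique values (if they are sortable), not taken from their order of appearance
    fastpath: an internal shortcut; recent pandas versions deprecate it, so leave it at its default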

#Example:
>>> pd.Categorical(["A","B","B","C","D","C","A"])
['A', 'B', 'B', 'C', 'D', 'C', 'A']
Categories (4, object): ['A', 'B', 'C', 'D']
>>> pd.Categorical(["A","B","B","C","D","C","A"],ordered=True)
['A', 'B', 'B', 'C', 'D', 'C', 'A']
Categories (4, object): ['A' < 'B' < 'C' < 'D']
>>> pd.Categorical(["A","B","B","D","C","A"],ordered=True)
['A', 'B', 'B', 'D', 'C', 'A']
Categories (4, object): ['A' < 'B' < 'C' < 'D']
>>> pd.Categorical(["A","B","B","C","D","C","A"],categories=["A","B","C"])
['A', 'B', 'B', 'C', NaN, 'C', 'A']
Categories (3, object): ['A', 'B', 'C']
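
When the natural order is not alphabetical, the order can be stated explicitly by passing categories together with ordered=True. A small sketch with made-up labels shows what an ordered Categorical then allows:

```python
import pandas as pd

# The order is given by the categories list, not by alphabetical sorting.
sizes = pd.Categorical(
    ["medium", "low", "high", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)

# min/max and comparisons follow the declared category order.
print(sizes.min(), sizes.max())
print(sizes > "low")

# Sorting a Series built from it also follows the category order.
sorted_sizes = pd.Series(sizes).sort_values().tolist()
print(sorted_sizes)
```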

(2) Created by the from_codes constructor:

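pd.Categorical.from_codes(<codes>[,categories=None,ordered=None,dtype=None])
  #Parameters:
    codes: the data; int array-like (-1 stands for NaN)
      #each int selects the category at that position in categories/dtype
    categories: the full set of categories; Index-like (must be unique)
      #categories may be of any immutable (hashable) type
    ordered: whether the categories are ordered; bool
      #the order follows the order in which the categories are given
    dtype: the dtype; CategoricalDtype/"category"
      #a CategoricalDtype cannot be combined with categories/ordered; either dtype or categories must supply the available categories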

#Example:
>>> pd.Categorical.from_codes([1,0,0,1,1,0],categories=["a","b"])
['b', 'a', 'a', 'b', 'b', 'a']
Categories (2, object): ['a', 'b']
>>> pd.Categorical.from_codes([1,0,0,1,1,0],dtype=pd.Categorical(["A","B"]).dtype)
['B', 'A', 'A', 'B', 'B', 'A']
Categories (2, object): ['A', 'B']
>>> pd.Categorical.from_codes([1,0,0,1,1,0],categories=["a","b"],ordered=True)
['b', 'a', 'a', 'b', 'b', 'a']
Categories (2, object): ['a' < 'b']
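
The -1 code mentioned in the parameter notes can be illustrated with a short sketch (the values are made up). It also shows that the codes and dtype of an existing Categorical are enough to rebuild it:

```python
import pandas as pd

# -1 marks a missing value; every other code indexes into categories.
c = pd.Categorical.from_codes([0, 1, -1, 1], categories=["a", "b"])
print(c)
print(c.isna())
print(c.codes)

# Round trip: codes plus dtype reproduce the original Categorical.
orig = pd.Categorical(["x", "y", "x"])
rebuilt = pd.Categorical.from_codes(orig.codes, dtype=orig.dtype)
print(rebuilt)
```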

(3) Created by Series/DataFrame:

>>> s1=pd.Series(["A","B","C","D"],dtype="category")
>>> s2=pd.Series(["A","B","C","D"])
>>> s2=s2.astype("category")
>>> s1.dtype,s2.dtype
(CategoricalDtype(categories=['A', 'B', 'C', 'D'], ordered=False), CategoricalDtype(categories=['A', 'B', 'C', 'D'], ordered=False))
>>> type(s1.values)
<class 'pandas.core.arrays.categorical.Categorical'>
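
A DataFrame column can be converted the same way. The sketch below uses a hypothetical grades table and passes a CategoricalDtype to astype so that the category order is fixed explicitly rather than inferred:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"name": ["x", "y", "z"], "grade": ["B", "A", "C"]})

# An explicit dtype fixes both the category set and its order (C < B < A).
grade_type = pd.CategoricalDtype(categories=["C", "B", "A"], ordered=True)
df["grade"] = df["grade"].astype(grade_type)

print(df.dtypes)
# Sorting by the column follows the declared order, not the alphabet.
names_by_grade = df.sort_values("grade")["name"].tolist()
print(names_by_grade)
```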

3. The cat accessor:

The cat accessor of a Series is the entry point to the categorical methods:
>>> s1.cat.ordered
False
>>> s2=s2.cat.set_categories(["A","B","C","D"],ordered=True)
>>> s2.cat.ordered
True
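
A few more things reachable through the accessor, as a minimal sketch with made-up data. The accessor exists only on Series with a category dtype, so the last step shows the error raised otherwise.

```python
import pandas as pd

s = pd.Series(["A", "B", "A", "C"], dtype="category")

# Properties and methods of the underlying Categorical are exposed via .cat.
print(s.cat.categories)
print(s.cat.codes.tolist())

# Methods return a new Series; the original is left unchanged.
s_ordered = s.cat.as_ordered()
print(s.cat.ordered, s_ordered.cat.ordered)

# On a non-categorical Series the accessor raises AttributeError.
try:
    pd.Series([1, 2, 3]).cat
except AttributeError as exc:
    print("not categorical:", exc)
```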

4. Properties and methods
(1) Properties:

View all categories: <c>.categories

#Example (continued):
>>> s1.values.categories
Index(['A', 'B', 'C', 'D'], dtype='object')

######################################################################################################################

View all codes: <c>.codes

#Example (continued):
>>> s1.values.codes
array([0, 1, 2, 3], dtype=int8)

######################################################################################################################

Check whether the categories are ordered: <c>.ordered

#Example (continued):
>>> s1.values.ordered
False

######################################################################################################################

View the dtype: <c>.dtype
  #returns a pandas.core.dtypes.dtypes.CategoricalDtype

>>> s1.values.dtype
CategoricalDtype(categories=['A', 'B', 'C', 'D'], ordered=False)

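(2) Methods:
  #Note: the inplace parameter shown in the signatures below was deprecated in pandas 1.3 and removed in pandas 2.0; on current versions, omit it and use the returned object

Add categories: <c>.add_categories(<new_categories>[,inplace=False])
  #Parameters:
    new_categories: the categories to add; a single category or a list-like of categories

#Example (continued):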
>>> s1.values.add_categories("Z")
['A', 'B', 'C', 'D']
Categories (5, object): ['A', 'B', 'C', 'D', 'Z']

######################################################################################################################

Mark the categories as ordered: <c>.as_ordered([inplace=False])

#Example (continued):
>>> s1.values.as_ordered()
['A', 'B', 'C', 'D']
Categories (4, object): ['A' < 'B' < 'C' < 'D']

######################################################################################################################

Mark the categories as unordered: <c>.as_unordered([inplace=False])

######################################################################################################################

Remove the given categories: <c>.remove_categories(<removals>[,inplace=False])
  #Parameters:
    removals: the categories to remove; a single category or a list-like of categories
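
Neither as_unordered nor remove_categories has an example here, so the following is a small sketch with made-up values. It relies on the pandas behavior that values belonging to a removed category become NaN.

```python
import pandas as pd

c = pd.Categorical(["a", "b", "c", "a"], ordered=True)

# as_unordered drops the ordering but keeps categories and values.
unordered = c.as_unordered()
print(unordered)

# Removing a category turns every value of that category into NaN.
removed = c.remove_categories("c")
print(removed)
```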

######################################################################################################################

Remove unused categories: <c>.remove_unused_categories([inplace=False])

#Example:
>>> c=pd.Categorical.from_codes([1,0,0,1,1,0],categories=["a","b"])
>>> c
['b', 'a', 'a', 'b', 'b', 'a']
Categories (2, object): ['a', 'b']
>>> c.remove_unused_categories()
['b', 'a', 'a', 'b', 'b', 'a']
Categories (2, object): ['a', 'b']
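
In the example above both categories are in use, so nothing is removed. A minimal sketch with a category that really is unused (made-up values) makes the effect visible:

```python
import pandas as pd

# "c" is declared as a category but never occurs among the values.
c = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])
print(c.categories)

trimmed = c.remove_unused_categories()
print(trimmed.categories)  # "c" is gone, the values are unchanged
```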

######################################################################################################################

Rename categories: <c>.rename_categories(<new_categories>[,inplace=False])
  #Note: the number of categories cannot change; values in a renamed category take on the new category name
  #Parameters:
    new_categories: the new category names; list-like/dict-like/callable

#Example (continued):
>>> c.rename_categories(["A","B"])
['B', 'A', 'A', 'B', 'B', 'A']
Categories (2, object): ['A', 'B']
>>> c.rename_categories({"a":"A"})
['b', 'A', 'A', 'b', 'b', 'A']
Categories (2, object): ['A', 'b']

######################################################################################################################

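Reorder categories: <c>.reorder_categories(<new_categories>[,ordered=None,inplace=False])
  #new_categories must contain exactly the existing categories, listed in the new order; unlike <c>.rename_categories(), the values keep their labels, and ordered=True also makes the result an ordered CategoricalDtype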
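
A minimal sketch of reordering, with made-up values: the labels stay the same while the codes are remapped to the new positions, and a list that adds or drops a category is rejected.

```python
import pandas as pd

c = pd.Categorical(["b", "a", "a", "b"], categories=["a", "b"])

# Same categories in a new order; ordered=True also marks them as ordered.
r = c.reorder_categories(["b", "a"], ordered=True)
print(r)
print(c.codes, r.codes)  # values unchanged, codes remapped

# The new list must match the old categories exactly.
try:
    c.reorder_categories(["b", "a", "z"])
except ValueError as exc:
    print("rejected:", exc)
```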

######################################################################################################################

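Set new categories: <c>.set_categories(<new_categories>[,ordered=None,rename=False,inplace=False])
  #Note: values whose category is missing from new_categories become NaN
  #Parameters:
    new_categories: the new categories; Index-like
      #a dict is not treated as a mapping here: as the second example shows, only its keys become categories

#Example (continued):
>>> c.set_categories(["A","B"])
[NaN, NaN, NaN, NaN, NaN, NaN]
Categories (2, object): ['A', 'B']
>>> c.set_categories({"a":"A"})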
[NaN, 'a', 'a', NaN, NaN, 'a']
Categories (1, object): ['a']
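
The rename parameter changes how the new list is interpreted. The sketch below uses made-up values and assumes the documented pandas behavior: with rename=False values are matched by label, while with rename=True the new names replace the old ones position by position.

```python
import pandas as pd

c = pd.Categorical(["b", "a", "a", "b"], categories=["a", "b"])

# Default (rename=False): match by label; "b" is missing, so it becomes NaN.
replaced = c.set_categories(["a", "c"])
print(replaced)

# rename=True: the codes are kept and the categories are renamed by position.
renamed = c.set_categories(["A", "B"], rename=True)
print(renamed)
```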

2. Method chaining and the pipe method
1. Method chaining:

Many of the temporary variables created during an analysis are never used again; method chaining avoids them:
>>> df=pd.DataFrame({"key":[1,2,1,2,1,1,2,1,2,2],"k1":[23,33,27,34,93,37,18,73,92,34],"k2":[1,2,3,4,5,6,7,8,9,0],"k3":[9,43,23,65,12,76,91,12,32,66]})
>>> df
   key  k1  k2  k3
0    1  23   1   9
1    2  33   2  43
2    1  27   3  23
3    2  34   4  65
4    1  93   5  12
5    1  37   6  76
6    2  18   7  91
7    1  73   8  12
8    2  92   9  32
9    2  34   0  66
>>> result=(df.assign(k2=df.k1-df.k3.mean()).groupby("key").k2.std())
>>> result
key
1    30.835045
2    28.656587
Name: k2, dtype: float64
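
For comparison, here is a sketch of the same computation written step by step with temporary variables and then as a chain. The chained version passes a lambda to assign, a common variation that avoids referring to df by name inside the chain; the names step_result and chain_result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "key": [1, 2, 1, 2, 1, 1, 2, 1, 2, 2],
    "k1": [23, 33, 27, 34, 93, 37, 18, 73, 92, 34],
    "k2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    "k3": [9, 43, 23, 65, 12, 76, 91, 12, 32, 66],
})

# Step by step: every intermediate result gets its own name.
df2 = df.copy()
df2["k2"] = df2["k1"] - df2["k3"].mean()
grouped = df2.groupby("key")
step_result = grouped["k2"].std()

# Chained: the lambda receives the DataFrame as it is at that point.
chain_result = (
    df.assign(k2=lambda d: d["k1"] - d["k3"].mean())
      .groupby("key")["k2"]
      .std()
)
print(chain_result)
```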

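2. Pipe method:

Apply a function: <s_or_df>.pipe(<func>[,*args,**kwargs])
  #equivalent to <func>(<s_or_df>[,*args,**kwargs]), but pipe makes method chaining easier
  #Parameters:
    func: the function to apply; function
      #must accept at least 1 argument, namely <s_or_df>
    args,kwargs: the arguments to pass to <func>

#Example (continued):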
>>> def f(x):
...     return x+x
...
>>> df.pipe(f)
   key   k1  k2   k3
0    2   46   2   18
1    4   66   4   86
2    2   54   6   46
3    4   68   8  130
4    2  186  10   24
5    2   74  12  152
6    4   36  14  182
7    2  146  16   24
8    4  184  18   64
9    4   68   0  132
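
The benefit of pipe shows once functions take extra arguments and several are applied in a row. In the sketch below, center and scale are hypothetical helpers written only for this illustration; the piped form reads left to right, while the nested call reads inside out.

```python
import pandas as pd

def center(frame, column):
    """Hypothetical helper: subtract the column mean from one column."""
    return frame.assign(**{column: frame[column] - frame[column].mean()})

def scale(frame, column, factor=1.0):
    """Hypothetical helper: multiply one column by a factor."""
    return frame.assign(**{column: frame[column] * factor})

df = pd.DataFrame({"key": [1, 2, 1, 2], "k1": [10, 20, 30, 40]})

# Extra positional and keyword arguments are forwarded to the function.
piped = df.pipe(center, "k1").pipe(scale, "k1", factor=2)

# The equivalent nested call.
nested = scale(center(df, "k1"), "k1", factor=2)
print(piped)
```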

Origin blog.csdn.net/weixin_46131409/article/details/113551846