pandas之数据转换

pandas中的数据转换包括过滤、清理等

去除重复数据

duplicated()  判断各行是否是重复行

drop_duplicated()  移除重复行(保留第一次出现的)

没啥好说的，直接看例子：

In [20]: s = pd.DataFrame({'key':['a']*4+['b']*3,'key0':[1,1,2
    ...: ,3,3,4,4]})

In [21]: s.duplicated()    
Out[21]: 
0    False
1     True
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [22]: s.drop_duplicates()
Out[22]: 
  key  key0
0   a     1
2   a     2
3   a     3
4   b     3
5   b     4

In [23]: s.drop_duplicates('key')    # 可以根据某列去除重复的行
Out[23]: 
  key  key0
0   a     1
4   b     3

In [24]: s.drop_duplicates(['key','key0'])  # 传入一个列组成的列表，去除重复的行

Out[24]: 
  key  key0
0   a     1
2   a     2
3   a     3
4   b     3
5   b     4

In [25]: s.key.drop_duplicates()  # 嗯，这样写也是可以的
Out[25]: 
0    a
4    b
Name: key, dtype: object

利用函数或映射进行数据转换

In [61]: data = pd.DataFrame({'food':['bacon','pulled pork','b
    ...: acon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],
    ...: 'ounces':[4,3,12,6,7.5,8,3,5,6]})
In [62]: data
Out[62]: 
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

假如你想添加一列表示该肉类食物来源的动物类型，我们先编写一个肉类到动物的映射。

In [63]: meat_to_animal = {
    ...:     'bacon':'pig',
    ...:     'pulled pork':'pig',
    ...:     'pastrami':'cow',
    ...:     'corned beef':'cow',
    ...:     'honey ham':'pig',
    ...:     'nova lox':'salmon'
    ...: }

Series的map方法可以接受一个函数或含有映射关系的字典型对象，

但是这里有个问题：有些大写了，有些没有。因此需要先转换大小写

In [64]: data['animal'] = data['food'].map(str.lower).map(meat_to_animal)

下面看一下map用来执行函数，即将data['food']的每个元素应用到隐含函数

In [65]: data['food'].map(lambda x:meat_to_animal[x.lower()])
Out[65]: 
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

替换值

replace()

In [26]: re = pd.Series([1,-9999,-9999,2,3,4,5,-1000,0])

In [27]: re
Out[27]: 
0       1
1   -9999
2   -9999
3       2
4       3
5       4
6       5
7   -1000
8       0
dtype: int64

In [28]: re.replace(-9999,np.nan)  # 替换值
Out[28]: 
0       1.0
1       NaN
2       NaN
3       2.0
4       3.0
5       4.0
6       5.0
7   -1000.0
8       0.0
dtype: float64

In [29]: re.replace([-9999,-1000],np.nan) # 替换多个
Out[29]: 
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    NaN
8    0.0
dtype: float64

In [30]: re.replace([-9999,-1000],[np.nan,0])  # 值与替换值对应的列表
Out[30]: 
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    0.0
8    0.0
dtype: float64

In [32]: re.replace({-9999:np.nan,-1000:0})   # 参数可以是一个字典
Out[32]: 
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    0.0
8    0.0
dtype: float64

重命名轴索引

rename() 会创建数据的副本，也可以传入 inplace=True 参数进行就地修改

In [41]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),inde
    ...: x=pd.Index(['Oh', 'Co'], name='state'),columns=pd.Ind
    ...: ex(['one', 'two', 'three'], name='number'))

In [42]: data.rename(index=str.title,columns=str.upper)
Out[42]: 
number  ONE  TWO  THREE
state
Oh        0    1      2
Co        3    4      5

In [43]: data.rename(index={'co':'sx'},columns={'one':'first'}  # 传入字典，可以部分修改
    ...: )
Out[43]: 
number  first  two  three
state
Oh          0    1      2
Co          3    4      5

离散化和面元划分

cut()

In [45]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [47]: 
In [47]: bins = [18, 25, 35, 60, 100]
In [48]: cats = pd.cut(ages, bins) # 可以指定哪边的区间是开的，例如左闭右开，只需要设置 pd.cut(ages, bins，right=False)

In [49]: cats     # 结果返回的是一个特殊的 Categories 对象
Out[49]: 
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

也可以为面元设置名称：

In [56]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
In [57]: pd.cut(ages, bins, labels=group_names)
Out[57]: 
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

若不传入具体的面元划分边界，只传入划分的面元个数，则会自动等长划分面元：

In [58]: data = np.random.rand(20)
In [59]: pd.cut(data, 4, precision=2)  # 分为4组，精度为2位
Out[59]: 
[(0.29, 0.52], (0.75, 0.98], (0.75, 0.98], (0.057, 0.29], (0.29, 0.52], ...,
 (0.75, 0.98], (0.75, 0.98], (0.75, 0.98], (0.057,0.29], (0.29, 0.52]]
Length: 20
Categories (4, interval[float64]): [(0.057, 0.29] < (0.29, 0.52] < (0.52, 0.75] < (0.75, 0.98]]

qcut函数是一个类似于cut的函数，可以根据样本分位数对数据进行面元划分。根据数据，cut可能无法
是各个面元数量数据点相同，qcut使用的是样本分位数，因此可以得大小基本相等的面元。
qcut就不举例了。

排列和随机采样

下面是随机选取一个DataFrame的一些行，做法就是随机产生行号，然后进行选取即可。

利用 numpy.random.permutation() 函数可以实现随机重排。
In [67]: df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
In [68]: ran = np.random.permutation(5)
In [70]: ran
Out[70]: array([2, 3, 0, 1, 4])
In [71]: df
Out[71]: 
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
In [72]: df.take(ran)
Out[72]: 
    0   1   2   3
2   8   9  10  11
3  12  13  14  15
0   0   1   2   3
1   4   5   6   7
4  16  17  18  19

计算指标/哑变量

将分类变量转换为“哑变量矩阵”(dummy matrix)或“指标矩阵”（indicator matrix）。如果DataFrame的某一列有k各不同的值，可以派生出一个k列的矩阵或者DataFrame（值为1和0）

In [74]: df = pd.DataFrame({'key':['b','b','a','c','a','b'],'d
    ...: ata1' : range(6)})
In [75]: pd.get_dummies(df['key'])
Out[75]: 
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

给指标DataFrame的列加上一个前缀

In [76]: dummies = pd.get_dummies(df['key'],prefix = 'key')
In [77]: dummies
Out[77]: 
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0

顺便看看下面这个例子：

In [80]: df['data1']
Out[80]: 
0    0
1    1
2    2
3    3
4    4
5    5
Name: data1, dtype: int64
In [81]: type(df['data1'])
Out[81]: pandas.core.series.Series
In [82]: df[['data1']]
Out[82]: 
   data1
0      0
1      1
2      2
3      3
4      4
5      5
In [83]: type(df[['data1']])
Out[83]: pandas.core.frame.DataFrame

df['data1']得到一个Series，而df[['data1']]得到一个DataFrame

字符串操作

Python有简单易用的字符串和文本处理功能。大部分文本运算直接做成了字符串对象的内置方法。当然还能用正则表达式。pandas对此进行了加强，能够对数组数据应用字符串表达式和正则表达式，而且能处理烦人的缺失数据。

字符串对象方法

举几个简单的例子：

In [87]: zifuchuan = ' i can be a can, i do not balabala'  # 最前面有个空格
In [88]: sp = zifuchuan.split(',')
In [89]: sp
Out[89]: [' i can be a can', ' i do not balabala']
In [90]: ':::'.join(sp)
Out[90]: ' i can be a can::: i do not balabala'
In [91]: zifuchuan.index('can')
Out[91]: 3
In [92]: zifuchuan.index('i')
Out[92]: 1
In [94]: zifuchuan.count('can')
Out[94]: 2

正则表达式

正则表达式（regex）提供了一种灵活的在文本中搜索、匹配字符串的模式。

python中的正则表达式用的是re模块。re模块的函数分为3类：模式匹配、替换、拆分。

关于正则表达式的总结：http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

几举个简单的例子，以后有时间关于正则表达式再做一个总结。

In [119]: import re             # 首先导入Python的re模块

In [120]: text = 'I    love\t you'   # 注意：这句后面没有空格  
In [121]: re.split('\s+',text)
Out[121]: ['I', 'love', 'you']

In [122]: text = 'I    love\t you '  # 这句末尾多了一个空格，匹配时会把末尾的空格也算一个字符串
In [123]: re.split('\s+',text)
Out[123]: ['I', 'love', 'you', '']

上面的例子首先正则表达式会被编译，然后在text上调用其split方法。

也可以这样写：

In [125]: patten = re.compile('\s+')   #  先编译，得到一个可以重用的 regex 对象
In [126]: patten.split(text)
Out[126]: ['I', 'love', 'you', '']

findall、search、match 、sub

In [132]: text = """Dave [email protected]           
     ...: Steve [email protected]
     ...: Rob [email protected]
     ...: Ryan [email protected]
     ...: """

In [133]: patten = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'   # 匹配邮箱

In [134]: regex = re.compile(patten,flags=re.IGNORECASE)   #先编译，忽略大小写

In [135]: regex.findall(text)                        # 返回所有匹配到的模式
Out[135]: ['[email protected]', '[email protected]', '[email protected]
', '[email protected]']

In [137]: m = regex.search(text)                 #  返回匹配到的第一个模式 
In [138]: m
Out[138]: <_sre.SRE_Match object; span=(5, 20), match='dave@goo
gle.com'>
In [141]: text[m.start():m.end()]
Out[141]: '[email protected]'
In [144]: m.string                 # 返回原始匹配串
Out[144]: 'Dave [email protected]\nSteve [email protected]\nRob rob
@gmail.com\nRyan [email protected]\n'
In [143]: print(regex.match(text))  # match 只匹配开头  这里开头是‘Dave’，所以没有匹配到，返回None
None

In [147]: print(regex.sub('replace',text))  #  将匹配到的模式全部替换
Dave replace
Steve replace
Rob replace
Ryan replace

将匹配到的模式分组：

In [148]: patten = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,
     ...: 4})'
In [149]: regex = re.compile(patten,flags=re.IGNORECASE)
In [153]: regex.findall(text)
Out[153]: 
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
In [155]: s = regex.search(text)
In [156]: s
Out[156]: <_sre.SRE_Match object; span=(5, 20), match='dave@goo
gle.com'>
In [157]: s.groups()
Out[157]: ('dave', 'google', 'com')

给匹配到到的模式命名：

In [163]: regex = re.compile(r"""
     ...: (?P<username>[A-Z0-9._%+-]+)
     ...: @(?P<domain>[A-Z0-9.-]+)
     ...: \.(?P<suffix>[A-Z]{2,4})""",
     ...:  flags=re.IGNORECASE|re.VERBOSE)
     ...: 
In [164]: m = regex.match('[email protected]')
In [165]: m.groupdict()
Out[165]: {'domain': 'bright', 'suffix': 'net', 'username': 'we
sm'}
In [171]: f = regex.search(text)
In [172]: f.group('username')
Out[172]: 'dave'
In [173]: f = regex.findall(text)
In [174]: f
Out[174]: 
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

pandas矢量化字符串

In [176]: series = pd.Series({'Dave': '[email protected]', 'Stev
     ...: e': '[email protected]','Rob': '[email protected]', 'Wes':
     ...:  np.nan})

In [177]: series
Out[177]: 
Dave     [email protected]
Rob        [email protected]
Steve    [email protected]
Wes                  NaN
dtype: object

通过Series的str方法可以对Series的内容进行操作：

In [176]: series = pd.Series({'Dave': '[email protected]', 'Stev
     ...: e': '[email protected]','Rob': '[email protected]', 'Wes':
     ...:  np.nan})
In [177]: series
Out[177]: 
Dave     [email protected]
Rob        [email protected]
Steve    [email protected]
Wes                  NaN
dtype: object
In [178]: series.str.contains('rob')
Out[178]: 
Dave     False
Rob       True
Steve    False
Wes        NaN
dtype: object

In [179]: patten = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,
     ...: 4})'
In [180]: series.str.findall(patten,re.IGNORECASE)
Out[180]: 
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
。。。

map函数在遇到NA值时会报错：

In [199]: matches = series.str.upper()
In [200]: matches
Out[200]: 
Dave     [email protected]
Rob        [email protected]
Steve    [email protected]
Wes                  NaN
dtype: object
In [202]:series.map(str.upper)         
---------------------------------------------------------------
TypeError                     Traceback (most recent call last)
<ipython-input-202-029ad3593723> in <module>()
----> 1 matches = series.map(str.upper)
...
TypeError: descriptor 'upper' requires a 'str' object but received a 'float'