An overview of data processing with pandas

Reposted from: http://www.cnblogs.com/big-face/p/5418416.html

Data processing in Python with pandas (Part 1)

import numpy as np
import pandas as pd
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

1 mutate + ifelse

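df['E'] = np.where(df['D'] >= 0, '>=0', '<0')  # ifelse on a column
df['F'] = np.random.randint(0, 2, 6)
df.assign(G=df.A * df.D)  # alternative way to add a column; returns a new DataFrame
df['F'] = df['F'].apply(str)  # apply a function to a single column
df.applymap(str)  # roughly mutate_each: applied to every cell (renamed DataFrame.map in pandas 2.1)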
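As a small illustration of the mutate and ifelse pattern, here is a sketch on a hand-written frame. The name demo and its values are made up for this example, and astype(str) is used as a version-independent stand-in for the cell-wise string conversion.

```python
import numpy as np
import pandas as pd

# Small deterministic frame (hypothetical data) so the result is easy to check.
demo = pd.DataFrame({"A": [1.0, -2.0, 3.0], "D": [0.5, -0.1, 0.0]})

# ifelse: label each row by the sign of D.
demo["E"] = np.where(demo["D"] >= 0, ">=0", "<0")

# mutate: assign returns a new frame; demo itself is left unchanged.
with_g = demo.assign(G=demo["A"] * demo["D"])

# mutate_each: convert every column to strings.
as_text = demo.astype(str)
```

Note that assign does not modify demo, so the result has to be bound to a name if it is needed later.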

2 table

df['E'].value_counts()  # frequency table of one column
pd.pivot_table(df, index=['E', 'F'])  # mean of each numeric column per (E, F) group
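The sketch below shows what these table-style calls return on a tiny hypothetical frame. pd.crosstab is not part of the original listing; it is added here as one way to get the two-way count table that R's table() would give.

```python
import pandas as pd

# Hypothetical data: two categorical columns and one numeric column.
demo = pd.DataFrame({
    "E": ["<0", ">=0", ">=0", "<0", ">=0"],
    "F": ["0", "1", "1", "0", "0"],
    "A": [1.0, 2.0, 4.0, 3.0, 5.0],
})

counts = demo["E"].value_counts()             # one-way table of counts
two_way = pd.crosstab(demo["E"], demo["F"])   # two-way table of counts
means = pd.pivot_table(demo, index=["E", "F"], values="A")  # default aggfunc is the mean
```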

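3 index: the counterpart of rownames in R, except that a pandas DataFrame can have a multi-level index

df.index
df.set_index(['A'], drop=False, append=True)  # use an existing column as the index; append=True keeps the previous index, drop=True would remove the column from the data
df['dates'] = df.index  # create a new column named dates from the index
df.reset_index(level=0, inplace=True)  # similar: moves the index into a column
df.reset_index(level=['index'])  # similar, selecting the level by name (the level must be named 'index')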
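To make the effect of set_index and reset_index concrete, here is a minimal sketch on a three-row frame. The frame demo is hypothetical; the column name "index" in the result is what pandas uses when the index being reset has no name.

```python
import pandas as pd

demo = pd.DataFrame(
    {"A": [10, 20, 30]},
    index=pd.date_range("20130101", periods=3),
)

# append=True keeps the date index and adds A as a second level;
# drop=False leaves A among the columns as well.
multi = demo.set_index("A", drop=False, append=True)

# reset_index moves the unnamed index into a column called "index".
flat = demo.reset_index()
```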

4 Dropping columns and rows

df = df.drop('index', axis=1)  # pass a list of names to drop several columns
df.drop(df.index[[1, 3]])  # drop rows by position

5 column names

df.columns
df.columns = ['a', 'b', 'c', 'e', 'd', 'f']  # rename by assignment; the list needs one name per column
df.rename(columns={'A': 'aa', 'B': 'bb', 'C': 'cc', 'D': 'dd', 'E': 'ee', 'F': 'ff'}, inplace=True)
df.rename(columns=lambda x: x[1:].upper(), inplace=True)  # a function also works; inplace=True modifies df itself and returns None instead of a renamed copy
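The difference between the default behaviour and inplace=True is easy to get wrong, so here is a short sketch on a hypothetical one-row frame.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1], "B": [2]})

# Default: a renamed copy is returned and demo keeps its column names.
renamed = demo.rename(columns={"A": "aa"})

# inplace=True: demo itself is changed and the call returns None.
result = demo.rename(columns=str.lower, inplace=True)
```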

6 Dummy variables

pd.Series(['a|b', np.nan, 'a|c']).str.get_dummies()  # splits on '|' by default
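For reference, this sketch spells out what the get_dummies call produces for that three-element Series: one indicator column per distinct token, with the missing value becoming a row of zeros.

```python
import numpy as np
import pandas as pd

tags = pd.Series(["a|b", np.nan, "a|c"])

# Each distinct token becomes a 0/1 column; "|" is the default separator.
dummies = tags.str.get_dummies()
```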

7 The bare data matrix of df, without column labels or index

df.values
df.to_numpy()  # replaces df.get_values(), which was removed in pandas 1.0

8 summary

df.describe()  # by default only numeric columns are summarised

9 rbind

df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
pd.concat([df, df2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
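A minimal rbind sketch with pd.concat, using two hypothetical frames that share the same columns:

```python
import pandas as pd

top = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
bottom = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"))

# ignore_index=True renumbers the rows 0..n-1 instead of keeping both old indexes.
stacked = pd.concat([top, bottom], ignore_index=True)
```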

10 group by: grouped aggregation, similar to pivot_table

df.groupby(['E', 'F']).mean()
df.groupby(['E', 'F']).agg(['sum', 'mean'])
pd.pivot_table(df, index=['E', 'F'], aggfunc=['sum', 'mean'])
df.pivot_table(index=['E', 'F'], aggfunc=['sum', 'mean'])  # same as the previous line
df.groupby(['E', 'F']).agg({'A': ['mean', 'sum'], 'B': 'min'})  # per-column aggregations
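The following sketch checks, on a small hypothetical frame, that the dictionary form of agg and pivot_table give the same group means.

```python
import pandas as pd

demo = pd.DataFrame({
    "E": ["<0", "<0", ">=0", ">=0"],
    "F": ["0", "0", "1", "1"],
    "A": [1.0, 3.0, 2.0, 6.0],
    "B": [4.0, 2.0, 8.0, 6.0],
})

# Different aggregations per column; the result has two-level column labels.
by_group = demo.groupby(["E", "F"]).agg({"A": ["mean", "sum"], "B": "min"})

# The same group means through pivot_table.
by_pivot = demo.pivot_table(index=["E", "F"], values="A", aggfunc="mean")
```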

11 Sorting

df.sort_values(['A', 'B'], ascending=[True, False])  # sort by columns; na_position controls where NaN goes (df.sort was removed)
df.sort_index(ascending=False)  # sort by index
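Here is a small sketch of a mixed-direction sort with NaN placement, on hypothetical data:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [2.0, 1.0, np.nan, 1.0], "B": [1, 5, 3, 7]})

# A ascending, ties broken by B descending, rows with NaN in A placed first.
ordered = demo.sort_values(["A", "B"], ascending=[True, False], na_position="first")
```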

12 Filtering

df[(df.A >= -1) & (df.B <= 0)]  # filter on values
df[df.E.str.contains(">")]  # rows containing a substring; the pattern is treated as a regular expression
df[df.F.isin(['1'])]  # rows whose value is in a list
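Because contains interprets its pattern as a regular expression, characters such as "." behave as wildcards. The sketch below, on a hypothetical frame, contrasts that with regex=False.

```python
import pandas as pd

demo = pd.DataFrame({
    "A": [-2.0, -0.5, 0.5],
    "B": [-1.0, 1.0, -3.0],
    "E": ["<0", ">=0", "a.b"],
})

# Combine conditions with & and parenthesise each one.
both = demo[(demo["A"] >= -1) & (demo["B"] <= 0)]

# "." as a regex matches any character, so every non-empty string matches.
regex_hits = demo[demo["E"].str.contains(".")]

# regex=False looks for a literal dot.
literal_hits = demo[demo["E"].str.contains(".", regex=False)]
```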

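13 Selecting variables

df['A']  # a single column
df[0:3]  # rows by position
df['20130102':'20130104']  # rows by index label
df.loc[:,]  # label-based selection of rows and columns, like data frame indexing in R
df.iloc[:,]  # iloc accepts integer positions only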
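One difference worth seeing in code: label slices with loc include both end points, while position slices with iloc exclude the end. A sketch on a hypothetical 4x3 frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.date_range("20130101", periods=4),
    columns=list("ABC"),
)

by_label = demo.loc["20130102":"20130103", ["A", "C"]]  # both end labels included
by_position = demo.iloc[1:3, [0, 2]]                     # end position excluded
```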

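Data processing in Python with pandas (Part 2)

14 Sampling

df.sample(10, replace=True)  # with replacement, so more rows than df holds can be drawn
df.sample(3)
df.sample(frac=0.5)  # sample a fraction of the rows
df.sample(frac=10, replace=True, weights=np.random.randint(1, 10, 6))  # weighted sampling, one weight per row
df.sample(3, axis=1)  # sample columns instead of rows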
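The calls above give different rows on every run. The sketch below uses the random_state parameter of sample to make draws repeatable and shows how weights restrict which rows can appear; the frame and the weight values are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"x": range(6)})

first = demo.sample(3, random_state=0)
second = demo.sample(3, random_state=0)   # same seed, same rows

# Weights need not sum to 1; a weight of 0 means the row is never drawn.
weighted = demo.sample(20, replace=True, weights=[0, 0, 0, 1, 1, 1], random_state=0)
```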

15 join (i.e. merge)

pd.merge(df.sample(4), df.sample(4), how="left", on="A", indicator=True)  # indicator=True adds a _merge column showing where each row came from
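To show what the indicator column contains, here is a left join on two small hypothetical frames sharing a key column:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})

# Every row of left is kept; _merge records whether a match was found in right.
merged = pd.merge(left, right, how="left", on="key", indicator=True)
```

Rows marked left_only are the ones with no partner in right, which makes the indicator handy for anti-join style checks.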

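16 Random numbers

import scipy.stats
np.random.rand(3, 2)  # uniform random numbers in [0, 1) with the given shape
np.random.randn(2, 5)  # standard normal random numbers with the given shape
np.random.randint(2, size=10)  # randint(low, high=None, size=None): integers in [low, high); with high omitted the range is [0, low), and size=None returns a single integer
np.random.bytes(10)  # random bytes
a = np.arange(10)
np.random.shuffle(a)  # shuffle in place
a = np.arange(9).reshape(3, 3)
np.random.shuffle(a)  # for a multi-dimensional array only the first axis is shuffled
np.random.permutation(10)  # random permutation; also accepts multi-dimensional arrays
np.random.permutation(10).reshape(2, 5)
np.random.seed(1000)  # set the seed
np.random.normal(2, 3, [5, 2])  # Gaussian distribution; see the link below for other distributions
# http://docs.scipy.org/doc/numpy-1.10.1/reference/routines.random.html
np.random.seed(12345678)
x = scipy.stats.norm.rvs(loc=5, scale=3, size=100)  # scipy can also generate random variates and ships statistical tests
scipy.stats.shapiro(x)  # Shapiro-Wilk normality test
# http://docs.scipy.org/doc/scipy-0.17.0/reference/stats.html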
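The listing uses NumPy's legacy global functions. As a sketch of the same operations with the Generator interface (np.random.default_rng, available since NumPy 1.17, and not part of the original post), one could write:

```python
import numpy as np

# A Generator carries its own seed instead of relying on global state.
rng = np.random.default_rng(1000)

uniform = rng.random((3, 2))            # uniform in [0, 1)
normal = rng.normal(2, 3, size=(5, 2))  # mean 2, standard deviation 3
ints = rng.integers(0, 2, size=10)      # integers in [0, 2), upper bound excluded
shuffled = rng.permutation(10)          # shuffled copy of 0..9

# Re-creating the generator with the same seed reproduces the same draws.
repeat = np.random.default_rng(1000).random((3, 2))
```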

17 gather and spread

# gather: wide to long
def gather(df, key, value, cols):
    id_vars = [col for col in df.columns if col not in cols]
    return pd.melt(df, id_vars=id_vars, value_vars=cols, var_name=key, value_name=value)
# The function above is only a thin wrapper around melt: wide to long is gather, long to wide is spread
pd.melt(df, id_vars=['E', 'F'], value_vars=['A', 'C'])
# spread: long to wide
df.pivot(index='D', columns='E', values='F')  # the old pd.pivot(index, columns, values) form taking Series is no longer supported
df3 = df2.pivot(index='D', columns='variable', values='value')  # df2 here is a melted frame that still contains column D
df3.reset_index(level=0, inplace=True)  # move the index back into a column so it looks like a plain df again
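A round trip makes the relationship between gather and spread clearer. The sketch below uses a hypothetical two-row frame with an id column; the names wide, long_form and back are illustrative only.

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "A": [0.1, 0.2], "C": [0.3, 0.4]})

# gather: one row per (id, variable) pair.
long_form = pd.melt(wide, id_vars=["id"], value_vars=["A", "C"],
                    var_name="variable", value_name="value")

# spread: back to one column per variable.
back = long_form.pivot(index="id", columns="variable", values="value").reset_index()
back.columns.name = None  # drop the leftover "variable" label on the columns
```

pivot requires each (index, columns) pair to be unique; when duplicates exist, pivot_table with an aggregation function is the usual alternative.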

18 Entropy

scipy.stats.entropy(np.arange(10))  # the input is normalised to sum to 1; natural logarithm by default
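For readers who want to see the computation spelled out, here is a NumPy-only sketch of Shannon entropy with the same conventions (input normalised to a probability vector, natural logarithm). The helper name shannon_entropy is made up for this example and is not a scipy function.

```python
import numpy as np

def shannon_entropy(counts):
    """Entropy in nats of the distribution proportional to counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()   # normalise so the values sum to 1
    p = p[p > 0]      # terms with p == 0 contribute nothing
    return float(-(p * np.log(p)).sum())

uniform_entropy = shannon_entropy([1, 1, 1, 1])  # four equally likely outcomes
certain_entropy = shannon_entropy([5, 0, 0])     # a single possible outcome
```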

19 String concatenation

[",".join(['a', 'b', 'd'])]
df[['E', 'F']].groupby('F')['E'].apply(lambda x: "{%s}" % ', '.join(x))  # concatenate within groups; the joined column must hold strings
df[['E', 'F']].astype(str).groupby('E')['F'].apply(lambda x: "%s" % ', '.join(x))  # so convert to strings first
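A compact sketch of grouped concatenation on hypothetical data, where the joined column starts out as integers:

```python
import pandas as pd

demo = pd.DataFrame({"E": ["<0", ">=0", "<0"], "F": [1, 0, 0]})

# F holds integers, so convert to str before joining within each group.
joined = demo.astype(str).groupby("E")["F"].apply(", ".join)
```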

20 Generating random strings

import random, string
df2 = pd.DataFrame(range(10), columns=['y'])
df2["x"] = [",".join(random.sample(string.ascii_lowercase, random.randint(2, 5))) for i in range(10)]  # string.lowercase exists only in Python 2

21 Splitting a column and building a hash (lookup) table

# uses the sample data from section 20
df3 = pd.DataFrame(df2.x.str.split(',').tolist(), index=df2.y).stack().reset_index(level=0)
df3.columns = ["y", "x"]
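The same long table can be sketched with str.split and explode (available since pandas 0.25), followed by one possible reading of "hash table": a dict from each token to the y values it appears with. The frame and the names long_form and lookup are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({"y": [0, 1], "x": ["a,b", "c"]})

# One row per token: split into lists, then explode the lists into rows.
long_form = (
    demo.assign(x=demo["x"].str.split(","))
        .explode("x")
        .reset_index(drop=True)
)

# Token -> list of y values, as a plain Python dict.
lookup = long_form.groupby("x")["y"].apply(list).to_dict()
```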

22 Removing duplicates

df[["F", "E"]].drop_duplicates()

23 Discretization

pd.cut(df.A, range(-1, 2, 1))  # bin edges -1, 0, 1; values outside the bins become NaN
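To show how the bins behave at the edges, here is a sketch with the same bin edges on a hand-written Series. The labels "neg" and "pos" are made up for illustration.

```python
import pandas as pd

values = pd.Series([-0.5, 0.0, 0.7, 1.5])

# Bins are right-closed by default: (-1, 0] and (0, 1].
binned = pd.cut(values, bins=[-1, 0, 1])

# Optional labels replace the interval objects; 1.5 falls outside and becomes NaN.
labelled = pd.cut(values, bins=[-1, 0, 1], labels=["neg", "pos"])
```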
