pandas basics

DataFrame

1: Select rows with a specific value

print(onlineData[onlineData['User_id'] == 14336199])
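A minimal sketch of the boolean-mask selection above. The sample values and the Action column are made up for illustration; only the column name User_id comes from these notes.

```python
import pandas as pd

# Hypothetical sample data standing in for onlineData.
online_data = pd.DataFrame({
    "User_id": [14336199, 14336199, 10539231],
    "Action": [0, 1, 0],
})

# The comparison produces a boolean Series with one entry per row;
# indexing with it keeps only the rows where the entry is True.
mask = online_data["User_id"] == 14336199
selected = online_data[mask]
print(selected)
```

The original DataFrame is not modified; `selected` is a new object holding the matching rows.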

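2: Remove duplicate rows

drop_duplicates:

def drop_duplicates(self, subset=None, keep='first', inplace=False):

# subset: a column name (string) or a list of column names; rows whose values match in all of these columns are treated as duplicates
# keep: 'first' (the default) keeps the first occurrence of each duplicate, 'last' keeps the last one, and False drops every duplicated row
# inplace: False (the default) returns a new DataFrame and leaves the original unchanged; True modifies the original in place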

onlineData.drop_duplicates(['User_id'])
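The effect of the keep parameter is easiest to see on a tiny table. This is a sketch with made-up data; the Action column is an assumption added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "User_id": [1, 1, 2, 3, 3],
    "Action": ["a", "b", "c", "d", "e"],
})

# keep='first' (default): keep the first row seen for each User_id.
first = df.drop_duplicates(subset=["User_id"])
# keep='last': keep the last row seen for each User_id.
last = df.drop_duplicates(subset=["User_id"], keep="last")
# keep=False: drop every row whose User_id appears more than once.
unique_only = df.drop_duplicates(subset=["User_id"], keep=False)

print(first["Action"].tolist())        # ['a', 'c', 'd']
print(last["Action"].tolist())         # ['b', 'c', 'e']
print(unique_only["Action"].tolist())  # ['c']
```

Because inplace defaults to False, the result has to be assigned to a variable; df itself still has all five rows.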

3: Select rows where a column is not null

onlineData[onlineData['Coupon_id'].notnull()]

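4: Basic summary statistics

def status(x):
    # x is a numeric Series. Series.mad() was removed in pandas 2.0,
    # so the mean absolute deviation is computed by hand.
    mad = (x - x.mean()).abs().mean()
    return pd.Series([x.count(),str(x.min()),str(x.idxmin()),str(x.quantile(.25)),str(x.median()),
                      str(x.quantile(.75)),str(x.mean()),str(x.max()),x.idxmax(),mad,x.var(),
                      x.std(),x.skew(),x.kurt()],index=['count','min','index of min','25% quantile',
                    'median','75% quantile','mean','max','index of max','mean absolute deviation',
                    'variance','standard deviation','skewness','kurtosis'])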

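5: Merging DataFrames

An inner join keeps only the rows whose keys appear in both DataFrames; the other join types behave like their SQL counterparts.

how: 'left', 'right', 'outer', or 'inner', with the same meaning as the corresponding SQL joins
on: the column (or list of columns) to join on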

pd.merge(testData,offlineData,how='inner',on='User_id')
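A small sketch comparing an inner join with a left join. The two tables and the label column are invented for illustration; only User_id and Coupon_id appear in these notes.

```python
import pandas as pd

test_data = pd.DataFrame({"User_id": [1, 2, 3], "label": ["x", "y", "z"]})
offline_data = pd.DataFrame({"User_id": [2, 3, 4], "Coupon_id": [20, 30, 40]})

# Inner join: only User_id values present in both tables survive.
inner = pd.merge(test_data, offline_data, how="inner", on="User_id")
# Left join: every row of test_data survives; missing matches become NaN.
left = pd.merge(test_data, offline_data, how="left", on="User_id")

print(inner["User_id"].tolist())  # [2, 3]
print(left["User_id"].tolist())   # [1, 2, 3]
print(left)
```

In the left join, User_id 1 has no match in offline_data, so its Coupon_id is NaN and the column becomes floating point.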

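6: The loc method

Use loc when you need to assign to the rows and columns selected by a query.

1: loc accepts three kinds of arguments: row labels, column labels, and boolean lists or arrays
2: When selecting, the first argument is the list of rows and the second is the list of columns
3: The row list can be replaced by a boolean list; only the rows whose entry is True are returned
4: Slices also work
Official documentation: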
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html

# Replace the missing values in the Coupon_id column with 0
onlineData.loc[pd.isnull(onlineData['Coupon_id']),'Coupon_id'] = 0
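The different ways of addressing rows and columns with loc can be sketched as follows. The string index labels and the sample values are assumptions chosen to keep the example short.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"User_id": [1, 2, 3], "Coupon_id": [100.0, np.nan, 300.0]},
    index=["a", "b", "c"],
)

# Row list first, then column list.
rows = df.loc[["a", "c"], ["Coupon_id"]]
# Label slices include both endpoints, unlike ordinary Python slices.
sliced = df.loc["a":"b", "User_id"]
# A boolean Series as the row selector, combined with assignment.
df.loc[df["Coupon_id"].isnull(), "Coupon_id"] = 0

print(rows)
print(sliced.tolist())           # [1, 2]
print(df["Coupon_id"].tolist())  # [100.0, 0.0, 300.0]
```

Assigning through a single loc call writes to df directly, which avoids the chained-indexing pattern df[mask]['Coupon_id'] = 0.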

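7: Grouping

Grouping follows the split-apply-combine pattern:
split: divide the DataFrame into groups
apply: run a function on each group
combine: merge the per-group results into one object

# For example, the line below puts the rows with the same User_id into one group;
# the resulting group is a GroupBy object on which computations can then be run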
group = onlineData['Coupon_id'].groupby(onlineData['User_id'])

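After grouping you can call aggregation functions such as size(), count(), and sum(). The difference between the first two is that size() counts every row in the group, including NaN, while count() counts only non-null values.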

group = venueData.groupby(venueData['city']).count()
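The difference between size() and count() shows up as soon as a group contains a missing value. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "User_id": [1, 1, 2],
    "Coupon_id": [10.0, np.nan, 20.0],
})

group = df["Coupon_id"].groupby(df["User_id"])

sizes = group.size()    # rows per group, NaN included
counts = group.count()  # non-null values per group
sums = group.sum()      # NaN is skipped when summing

print(sizes.tolist())   # [2, 1]
print(counts.tolist())  # [1, 1]
print(sums.tolist())    # [10.0, 20.0]
```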

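# Group a Series and compute the difference between the maximum and minimum of each group
# When grouping a Series, groupby accepts either the index or an array of values as the key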
onlineData = onlineData.groupby(onlineData.index).apply(lambda x:x.max() - x.min())
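A sketch of grouping a Series by its own index and computing the spread of each group. The labels u1 and u2 and the values are invented for illustration.

```python
import pandas as pd

# Several observations per index label.
s = pd.Series([1, 5, 2, 9, 4], index=["u1", "u1", "u2", "u2", "u2"])

# The lambda receives one group (a Series) at a time.
spread = s.groupby(s.index).apply(lambda x: x.max() - x.min())

print(spread.to_dict())  # {'u1': 4, 'u2': 7}
```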

Converting each group to a list

# This pattern can be used to build the inverted index for collaborative filtering
rateGroup = rateData['userId'].groupby(rateData['movieId']).apply(lambda x:list(x))
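A minimal sketch of the inverted index this produces: for each movieId, the list of userId values that rated it. The ratings below are made up.

```python
import pandas as pd

rate_data = pd.DataFrame({
    "userId": [1, 2, 1, 3],
    "movieId": [10, 10, 20, 20],
})

# One list of users per movie; the result is a Series indexed by movieId.
rate_group = rate_data["userId"].groupby(rate_data["movieId"]).apply(list)

# A plain dict is often more convenient for lookups.
inverted = rate_group.to_dict()
print(inverted)  # {10: [1, 2], 20: [1, 3]}
```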

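8: Converting a DataFrame column to a Series

The original DataFrame index is kept:

activate = pd.Series(activate['Action'])

To discard it (the new index is numbered from 0):

activate = pd.Series(activate['Action'].values)

Build a Series from two DataFrame columns, one as the values and the other as the index: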

onlineData = pd.Series(onlineData['Date_received'].values,index=onlineData['User_id'])
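The three variants differ only in the index of the resulting Series. A sketch with invented data and a deliberately non-default index:

```python
import pandas as pd

df = pd.DataFrame(
    {"User_id": [7, 8, 9],
     "Date_received": ["20160101", "20160102", "20160103"]},
    index=[10, 11, 12],
)

kept = df["Date_received"]                     # keeps index 10, 11, 12
reset = pd.Series(df["Date_received"].values)  # new index 0, 1, 2
# Values from one column, index from another.
keyed = pd.Series(df["Date_received"].values, index=df["User_id"])

print(kept.index.tolist())   # [10, 11, 12]
print(reset.index.tolist())  # [0, 1, 2]
print(keyed.loc[8])          # 20160102
```

The .values call matters in the last line: passing the column itself together with a new index would make pandas align on the old index labels instead of simply relabelling the values.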

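9: Applying a function to every element of a DataFrame

Note that when you assign to a DataFrame, it must not be a slice of another DataFrame; otherwise pandas may emit a SettingWithCopyWarning. Take an explicit copy with .copy() first.

The function that processes each element:

# value is one element of the DataFrame column being processed
def getRate(self, value):
    if ':' in value:
        rateLs = value.split(':')
        return float(rateLs[1]) / float(rateLs[0])
    # No ':' present: return the element unchanged rather than
    # referencing a variable that was never assigned
    return value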

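map applies a function to every element of a column

# The function below converts a value such as 500:50 (50 off a purchase of 500 or more) into a discount ratio: 50/500 = 0.1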
onlineData['Discount_rate'] = onlineData['Discount_rate'].map(self.getRate)
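A self-contained sketch of the same conversion, written as a plain function rather than a method. The name get_rate, the sample values, and the pass-through behaviour for values without a colon are assumptions made for this example.

```python
import pandas as pd

def get_rate(value):
    """Turn 'threshold:reduction' into reduction / threshold."""
    if ":" in value:
        threshold, reduction = value.split(":")
        return float(reduction) / float(threshold)
    # Assumption: values without ':' are passed through unchanged.
    return value

raw = pd.DataFrame({"Discount_rate": ["500:50", "200:20", "100:5"]})

# Take an explicit copy before assigning, so the result is not a
# slice of raw and no SettingWithCopyWarning is raised.
subset = raw[raw["Discount_rate"].str.contains(":")].copy()
subset["Discount_rate"] = subset["Discount_rate"].map(get_rate)

print(subset["Discount_rate"].tolist())
```

raw keeps its original strings; only the copy holds the numeric ratios.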

apply() on a DataFrame

# axis=1 passes one row at a time to the function; axis=0 passes one column at a time
df.apply(func, axis=0)  # or axis=1
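A short sketch of what the function receives under each axis setting. The DataFrame and its column names a and b are made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the lambda receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)
# axis=1: the lambda receives each row as a Series indexed by column name.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_range.to_dict())  # {'a': 2, 'b': 20}
print(row_sum.tolist())     # [11, 22, 33]
```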

apply() on a groupby object

# Here the function is called once per group
onlineData.groupby(onlineData.index).apply(lambda x:x.max()-x.min())

10: Sorting by index

# Ascending order is the default
s.sort_index()

# To sort in descending order
s.sort_index(ascending=False)
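A quick sketch of both orderings on a small Series with invented labels. sort_index returns a new Series and leaves s untouched.

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["b", "c", "a"])

ascending = s.sort_index()
descending = s.sort_index(ascending=False)

print(ascending.index.tolist())   # ['a', 'b', 'c']
print(descending.index.tolist())  # ['c', 'b', 'a']
```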

11: Reading a file without a header row and naming the columns

train = pd.read_table('Gowalla/train.txt',names=['userId','locId','a','b','c'])
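To try this without the Gowalla file, the sketch below reads the same kind of tab-separated, header-less data from an in-memory string. The sample rows are invented; read_table uses a tab as its default separator.

```python
import io
import pandas as pd

# Two tab-separated rows with no header line.
raw = "1\t100\t0.5\t0.6\t7\n2\t200\t0.1\t0.2\t8\n"

# Passing names tells pandas the file has no header row.
train = pd.read_table(io.StringIO(raw), names=["userId", "locId", "a", "b", "c"])

print(train.columns.tolist())
print(train.shape)  # (2, 5)
```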

12: Adjusting pandas display options (rows, columns, width)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

13: Splitting one string column into several columns

data = data['value'].str.split('::',expand=True)
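A sketch of the split on made-up records. The column names assigned afterwards are hypothetical; with expand=True the result is a DataFrame whose columns are numbered 0, 1, 2, and the pieces remain strings.

```python
import pandas as pd

data = pd.DataFrame({"value": ["1::Toy Story::Animation", "2::Jumanji::Adventure"]})

# expand=True returns one column per piece instead of a column of lists.
parts = data["value"].str.split("::", expand=True)
parts.columns = ["movieId", "title", "genre"]  # hypothetical names

print(parts)
```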

14: Converting a column to a numeric type for comparison

interestRateData = rateData.loc[pd.to_numeric(rateData['rating']) > 2]
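A minimal sketch of filtering on a column that was read as strings. The data is invented; the second call shows errors="coerce", which turns unparseable entries into NaN instead of raising.

```python
import pandas as pd

rate_data = pd.DataFrame({"movieId": [10, 20, 30], "rating": ["1", "3", "5"]})

# Comparing the string column directly with 2 would fail;
# convert it to numbers first.
interest = rate_data.loc[pd.to_numeric(rate_data["rating"]) > 2]
print(interest["movieId"].tolist())  # [20, 30]

# Unparseable strings become NaN when errors="coerce" is used.
messy = pd.to_numeric(pd.Series(["4", "n/a"]), errors="coerce")
print(messy.isnull().tolist())       # [False, True]
```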

15: Setting the pandas output width (to avoid line wrapping)

pd.set_option('display.width', 5000)

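16: Checking for missing values in pandas

# test is a function meant to be passed to apply; item is one row, and pd.isnull reports whether a single value is missing
# Missing values in pandas are usually stored as the float NaN, which prints as nan. NaN never compares
# equal to anything, including itself, so test with pd.isnull / pd.notnull rather than ==; both also recognise None
def test(item):
    city = item.iloc[6]  # the seventh column, by position
    if pd.notnull(city):
        print(city)
    else:
        print("12313123")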
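The behaviour of missing values can be checked directly. This sketch uses invented data and accesses the city column by name rather than by position; the "unknown" placeholder is an assumption for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "city": ["HK", None, np.nan]})

# Both None and NaN are reported as missing.
flags = df["city"].isnull().tolist()
print(flags)  # [False, True, True]

# NaN is not equal to itself, so == cannot detect it.
nan_equal = (np.nan == np.nan)
print(nan_equal)  # False

# Row-wise use inside apply, replacing missing cities with a placeholder.
labels = df.apply(
    lambda row: row["city"] if pd.notnull(row["city"]) else "unknown",
    axis=1,
)
print(labels.tolist())  # ['HK', 'unknown', 'unknown']
```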

17: Sorting by values

data.sort_values(by="userId")

18: Viewing basic descriptive statistics

userInfo.describe()

19: Saving a DataFrame to a file

index=False means the index is not written to the file

data.to_csv('hk_checkin2.csv',index=False)

20: Checking whether a key exists in a Series

'Train' in typeCount.index

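21: Getting the index

series.index.tolist()[0]

.index.tolist() converts the Series index to a list.

[0] takes the first element of that list.

If the query result contains exactly one element, this expression returns the index label of that element.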
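A sketch of looking up the index label of a single matching element. The Series contents and the labels Bus and Ferry are made up; Train echoes the key used in the membership check above.

```python
import pandas as pd

type_count = pd.Series([3, 5, 1], index=["Train", "Bus", "Ferry"])

# Filter down to the one element equal to 5, then read its label.
matches = type_count[type_count == 5]
label = matches.index.tolist()[0]

print(label)  # Bus
```

If the filter can return no rows, tolist() is empty and [0] raises IndexError, so it is worth checking len(matches) first.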