matplotlib must-know series:
https://blog.csdn.net/ssswill/article/details/86419094
sklearn must-know series:
Now for the pandas statements you need to know by heart.
1. Reading files with pandas
import pandas as pd
df_train = pd.read_csv('../input/train.csv')
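To try read_csv without the competition files, here is a minimal sketch that reads from an in-memory string instead of '../input/train.csv'. The column names and values are made up for illustration; only pd.read_csv and its parse_dates argument come from this post.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for '../input/train.csv'.
csv_text = """card_id,first_active_month,target
C_ID_1,2017-06-01,-0.82
C_ID_2,2017-01-01,0.39
C_ID_3,2016-08-01,-33.22
"""

# parse_dates converts the listed columns to datetime while reading.
df_train = pd.read_csv(io.StringIO(csv_text), parse_dates=["first_active_month"])

print(df_train.shape)
print(df_train.dtypes)
```

Checking shape and dtypes right after loading is a cheap way to catch a column that was read as text when you expected numbers or dates.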
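2. Feature engineering / preprocessing with pandas
2.1 Handling missing values
# Both DataFrames have the same columns containing missing values
for df in [df_hist_trans, df_new_merchant_trans]:
    # Assign the result back: fillna(..., inplace=True) on a selected column
    # may not update df in recent pandas versions
    df['category_2'] = df['category_2'].fillna(1.0)
    df['category_3'] = df['category_3'].fillna('A')
    df['merchant_id'] = df['merchant_id'].fillna('M_ID_00a6ca8a8a')
# Fill a numeric column with its mean
df['Age'] = df['Age'].fillna(df['Age'].mean())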
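As a self-contained illustration of the pattern above, the sketch below counts missing values first and then fills several columns in one call by passing a dict to DataFrame.fillna. The toy data is hypothetical; the fill values mirror the ones used in this section.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "category_3": ["A", None, "B"],
    "Age": [20.0, np.nan, 40.0],
})

# How many values are missing in each column?
missing_before = df.isnull().sum()
print(missing_before)

# One fill value per column: a constant for the category, the mean for Age.
df = df.fillna({"category_3": "A", "Age": df["Age"].mean()})
print(df)
```

Passing a dict keeps all fill rules in one place, which is handy when the same rules have to be applied to several DataFrames.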
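2.2 Handling datetime values
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
# Alternatively, parse the dates while reading the file:
# train = pd.read_csv('../input/train.csv', parse_dates=["first_active_month"])
df['year'] = df['purchase_date'].dt.year
# .dt.weekofyear was removed in pandas 2.0; isocalendar().week (pandas >= 1.1) gives the ISO week number
df['weekofyear'] = df['purchase_date'].dt.isocalendar().week
df['month'] = df['purchase_date'].dt.month
df['dayofweek'] = df['purchase_date'].dt.dayofweek
For more detail on datetime handling in pandas, see: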
https://blog.csdn.net/ssswill/article/details/86530045
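Here is a small runnable sketch of the date features on two made-up timestamps. It assumes pandas 1.1 or later, where dt.isocalendar() is available, and it also derives hour, which the aggregation in section 2.4.2 expects as a column.

```python
import pandas as pd

# Two hypothetical purchase timestamps (a Monday and a Saturday).
df = pd.DataFrame({"purchase_date": ["2018-01-01 10:30:00", "2018-02-24 18:05:00"]})
df["purchase_date"] = pd.to_datetime(df["purchase_date"])

df["year"] = df["purchase_date"].dt.year
df["month"] = df["purchase_date"].dt.month
# isocalendar() returns year/week/day columns; cast the week to a plain int.
df["weekofyear"] = df["purchase_date"].dt.isocalendar().week.astype(int)
df["dayofweek"] = df["purchase_date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["hour"] = df["purchase_date"].dt.hour

print(df)
```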
2.3 Type conversion
2.3.1 Converting yes/no strings
df['authorized_flag'] = df['authorized_flag'].map({'Y':1, 'N':0})
df['category_1'] = df['category_1'].map({'Y':1, 'N':0})
or:
historical_transactions['authorized_flag'] = historical_transactions['authorized_flag'].apply(lambda x: 1 if x == 'Y' else 0)
Going further:
map_dict = {'A': 0, 'B': 1, 'C': 2, 'nan': 3}
historical_transactions['category_3'] = historical_transactions['category_3'].apply(lambda x: map_dict[str(x)])
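The 'nan' key above works because str(x) turns a missing value into the string 'nan'. As an alternative sketch (not the approach used in the original kernel), map can do the lookup and fillna can assign the code for missing values. The Series below is toy data.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value.
s = pd.Series(["A", "B", np.nan, "C"])
map_dict = {"A": 0, "B": 1, "C": 2}

# map leaves values without a dictionary entry (including NaN) as NaN.
mapped = s.map(map_dict)

# Give missing values their own code, then cast back to int.
filled = mapped.fillna(3).astype(int)
print(filled.tolist())
```

Unlike the apply version, map does not raise a KeyError on an unexpected category; it silently produces NaN, so it is worth checking for leftovers.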
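Note: map also accepts a Series, so each category can be replaced by a per-group statistic (here, the mean of outliers):
for f in ['feature_1','feature_2','feature_3']:
    order_label = df_train.groupby([f])['outliers'].mean()
    df_train[f] = df_train[f].map(order_label)
    df_test[f] = df_test[f].map(order_label)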
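To see what this encoding does, here is a toy sketch with one feature. The data is invented, and the result is written to a new column (feature_1_enc, a hypothetical name) so the original values stay visible.

```python
import pandas as pd

# Toy train/test sets, invented for illustration.
df_train = pd.DataFrame({"feature_1": [1, 1, 2, 2, 3], "outliers": [0, 1, 0, 0, 1]})
df_test = pd.DataFrame({"feature_1": [1, 2, 4]})

# Mean of 'outliers' per feature value, computed on the training set only.
order_label = df_train.groupby("feature_1")["outliers"].mean()

# map looks each value up in the index of order_label.
df_train["feature_1_enc"] = df_train["feature_1"].map(order_label)
df_test["feature_1_enc"] = df_test["feature_1"].map(order_label)

print(df_train)
print(df_test)
```

A value that appears only in the test set (4 here) has no entry in order_label and ends up as NaN, so it may need a fallback such as the global mean.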
2.3.2 Converting bool to int
df['weekend'] = (df.purchase_date.dt.weekday >=5).astype(int)
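A quick check of the weekend flag on three made-up dates (a Friday, a Saturday and a Sunday). The comparison yields a boolean Series, and astype(int) turns True/False into 1/0.

```python
import pandas as pd

# Hypothetical dates: Friday, Saturday, Sunday.
dates = pd.to_datetime(pd.Series(["2018-02-23", "2018-02-24", "2018-02-25"]))

is_weekend = dates.dt.weekday >= 5   # boolean Series; Saturday=5, Sunday=6
weekend = is_weekend.astype(int)     # 1 for weekend days, 0 otherwise

print(weekend.tolist())
```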
2.3.3 The commonly used get_dummies
df = pd.get_dummies(df, columns=['c_2', 'c_3'])
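Here is a minimal sketch of what get_dummies produces for the two columns. The toy values are invented. dtype=int is added so the indicator columns hold 0/1 rather than booleans, which is the default in recent pandas versions.

```python
import pandas as pd

# Toy data with two categorical columns and one numeric column.
df = pd.DataFrame({"c_2": [1, 2, 1], "c_3": ["A", "B", "A"], "amount": [10, 20, 30]})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["c_2", "c_3"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Columns not listed in columns (amount here) are passed through unchanged and come first in the result.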
2.4 Grouping and aggregation
2.4.1 transform
for col in ['category_2','category_3']:
    df_hist_trans[col+'_mean'] = df_hist_trans.groupby([col])['purchase_amount'].transform('mean')
Explanation:
Reference (this article explains it very well): https://www.jianshu.com/p/509d7b97088c
Take a different example:
df.groupby('order')["ext price"].sum()
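This one is simple: it computes the sum of ext price for each order. Suppose the original data has 12 rows and 3 distinct orders; the code above then naturally returns just 3 rows, one total per order.
But the original data has 12 rows and this result has only 3, so we cannot assign it back as a column directly. In other words, the result is not usable as is. This is where transform comes in: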
#df.groupby('order')["ext price"].sum()
df.groupby('order')["ext price"].transform('sum')
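The result has the same number of rows as the original data.
So we can use it directly, for example to compute each row's share of its order total: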
df["Order_Total"] = df.groupby('order')["ext price"].transform('sum')
df["Percent_of_Order"] = df["ext price"] / df["Order_Total"]
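# Or in a single step, without the intermediate column:
df["Percent_of_Order"] = df["ext price"] / df.groupby('order')["ext price"].transform('sum')
With that, the first code block in this subsection should make sense.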
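The 12-row, 3-order scenario described above can be reproduced with a toy table. The order numbers and prices below are made up; only the column names order and ext price come from the example.

```python
import pandas as pd

# 12 rows spread over 3 hypothetical orders.
df = pd.DataFrame({
    "order": [10001] * 3 + [10005] * 5 + [10006] * 4,
    "ext price": [100.0, 200.0, 100.0,
                  50.0, 50.0, 100.0, 150.0, 150.0,
                  25.0, 25.0, 25.0, 25.0],
})

# Plain aggregation: one row per order.
per_order = df.groupby("order")["ext price"].sum()

# transform: the group total is broadcast back to every original row.
per_row = df.groupby("order")["ext price"].transform("sum")

df["Percent_of_Order"] = df["ext price"] / per_row

print(len(per_order), len(per_row))
print(df)
```

Because per_row keeps the original index, it lines up with df row by row, which is why the division works without a merge.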
2.4.2 groupby+agg
code from:
https://www.kaggle.com/chauhuynh/my-first-kernel-3-699
def get_new_columns(name,aggs):
    return [name + '_' + k + '_' + agg for k in aggs.keys() for agg in aggs[k]]
aggs = {}
for col in ['month','hour','weekofyear','dayofweek','year','subsector_id','merchant_id','merchant_category_id']:
    aggs[col] = ['nunique']
aggs['purchase_amount'] = ['sum','max','min','mean','var']
aggs['installments'] = ['sum','max','min','mean','var']
aggs['purchase_date'] = ['max','min']
aggs['month_lag'] = ['max','min','mean','var']
aggs['month_diff'] = ['mean']
aggs['authorized_flag'] = ['sum', 'mean']
aggs['weekend'] = ['sum', 'mean']
aggs['category_1'] = ['sum', 'mean']
aggs['card_id'] = ['size']
new_columns = get_new_columns('hist',aggs)
df_hist_trans_group = df_hist_trans.groupby('card_id').agg(aggs)
df_hist_trans_group.columns = new_columns
df_hist_trans_group.reset_index(drop=False,inplace=True)
df_train = df_train.merge(df_hist_trans_group,on='card_id',how='left')
df_test = df_test.merge(df_hist_trans_group,on='card_id',how='left')
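A scaled-down sketch of the same groupby + agg pattern on toy data may make the column renaming easier to follow. The data and the reduced aggs dict are invented, and this version leaves out the card_id size aggregate. It also builds the flat names from the resulting MultiIndex columns to show that both approaches agree.

```python
import pandas as pd

def get_new_columns(name, aggs):
    # One flat name per (column, aggregation) pair, in dict order.
    return [name + "_" + k + "_" + agg for k in aggs.keys() for agg in aggs[k]]

# Toy transaction table, invented for illustration.
df_hist_trans = pd.DataFrame({
    "card_id": ["c1", "c1", "c2"],
    "purchase_amount": [1.0, 3.0, 5.0],
    "merchant_id": ["m1", "m2", "m1"],
})

aggs = {
    "merchant_id": ["nunique"],
    "purchase_amount": ["sum", "mean", "count"],
}

# The result has two-level columns: (column, aggregation).
df_group = df_hist_trans.groupby("card_id").agg(aggs)

# Same names, derived from the result itself instead of from the dict.
flat_from_result = ["hist_" + "_".join(col) for col in df_group.columns]

new_columns = get_new_columns("hist", aggs)
df_group.columns = new_columns
df_group = df_group.reset_index()

print(df_group)
```

Deriving the names from df_group.columns is a little more defensive, since it does not rely on the dict order matching the column order of the result.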
2.5 Concatenating and merging
df_train = df_train.merge(df_hist_trans_group,on='card_id',how='left')
df_test = df_test.merge(df_hist_trans_group,on='card_id',how='left')
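Here is a small sketch of what how='left' does, using toy frames. The validate argument is an optional extra, not part of the original code; it asks pandas to raise an error if card_id is not unique in the right-hand table.

```python
import pandas as pd

# Toy tables, invented for illustration.
df_train = pd.DataFrame({"card_id": ["c1", "c2", "c3"], "target": [0.1, -0.5, 1.2]})
df_group = pd.DataFrame({"card_id": ["c1", "c2"], "hist_purchase_amount_sum": [4.0, 5.0]})

# Left join: every row of df_train is kept; cards without history get NaN.
merged = df_train.merge(df_group, on="card_id", how="left", validate="many_to_one")

print(merged)
```

With a left join the row count of df_train should not change, so comparing len(merged) with len(df_train) is a quick sanity check after merging aggregated features.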
2.6 loc and iloc
df_train['outliers'] = 0
df_train.loc[df_train['target'] < -30, 'outliers'] = 1
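The snippet above only uses loc with a boolean mask. As a short sketch of how loc and iloc differ, the toy frame below uses string index labels so that label-based and position-based selection are easy to tell apart. The data is invented.

```python
import pandas as pd

# Toy frame with string labels as the index.
df_train = pd.DataFrame({"target": [0.5, -33.2, 1.1, -31.0]}, index=["a", "b", "c", "d"])

# loc with a boolean mask and a column label: flag the extreme targets.
df_train["outliers"] = 0
df_train.loc[df_train["target"] < -30, "outliers"] = 1

# loc selects by label, iloc by integer position.
first_by_label = df_train.loc["a", "target"]
first_by_pos = df_train.iloc[0, 0]

# Slicing: loc includes the end label, iloc excludes the end position.
rows_by_label = df_train.loc["b":"c"]
rows_by_pos = df_train.iloc[1:3]

print(df_train)
```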