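Capital Inflow and Outflow Prediction: Data Analysis and Exploration

Data Analysis and Exploration

1. Background

Using Yu'e Bao users' purchase and redemption records from July 2013 to August 2014, predict the daily capital inflow and outflow for future dates.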

2. Importing the data

  1. Import the data science and visualization libraries
# Import libraries
import pandas as pd
import numpy as np
import datetime,os
import seaborn as sns
import matplotlib.pyplot as plt 
  2. Read the data, add time fields, and compute the total daily purchase and redemption amounts
# Read the data
data_balance = pd.read_csv("./Data/user_balance_table.csv")
bank = pd.read_csv(r"./Data/mfd_bank_shibor.csv")
share = pd.read_csv(r"./Data/mfd_day_share_interest.csv")
users = pd.read_csv(r"./Data/user_profile_table.csv")
# Add time fields to the user purchase/redemption table
data_balance["date"] = pd.to_datetime(data_balance["report_date"],format="%Y%m%d")  # parse YYYYMMDD into a datetime
data_balance["day"] = data_balance["date"].dt.day
data_balance["month"] = data_balance["date"].dt.month
data_balance["year"] = data_balance["date"].dt.year
data_balance["week"] = data_balance["date"].dt.isocalendar().week.astype(int)  # Series.dt.week was removed in pandas 2.0
data_balance["weekday"] = data_balance["date"].dt.weekday


# Total purchase and redemption amounts per day
total_balance = data_balance.groupby("date",as_index=False)[["total_purchase_amt","total_redeem_amt"]].sum()
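
The daily aggregate keeps only `date` and the two sums, so the calendar fields added to `data_balance` are not carried over, yet later snippets use `weekday`, `week`, and `day` on `total_balance` and on frames derived from it. A minimal sketch of one way to re-derive them; the helper name `add_time_fields` is hypothetical, and the small synthetic frame only stands in for the real aggregate:

```python
import pandas as pd

def add_time_fields(df, date_col="date"):
    """Return a copy of df with calendar columns derived from date_col."""
    out = df.copy()
    dates = out[date_col]
    out["day"] = dates.dt.day
    out["month"] = dates.dt.month
    out["year"] = dates.dt.year
    out["week"] = dates.dt.isocalendar().week.astype(int)  # ISO week number
    out["weekday"] = dates.dt.weekday  # Monday = 0 ... Sunday = 6
    return out

# Synthetic stand-in for total_balance (values are made up for illustration).
demo = pd.DataFrame({
    "date": pd.date_range("2014-04-01", periods=10, freq="D"),
    "total_purchase_amt": range(10),
    "total_redeem_amt": range(10, 20),
})
demo = add_time_fields(demo)
print(demo[["date", "week", "weekday"]].head())
```

With the real data this would be called as `total_balance = add_time_fields(total_balance)` right after the aggregation.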


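3. Plotting the time series

Plot the total purchase and redemption amounts over several time ranges to see how the data behaves.

  • August 2013 to August 2014
# Time series of total purchases and redemptions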
fig = plt.figure(figsize=(20,6))
plt.plot(total_balance["date"],total_balance["total_purchase_amt"],label="purchase")
plt.plot(total_balance["date"],total_balance["total_redeem_amt"],label="redeem")
plt.title("The time series of total amount of purchase and redeem")
plt.legend(loc="best")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()


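  • From April 2014 onward
    The data varies with a roughly weekly cycle.
  • Month by month, from April 2014 onward
    Observations:
    1. At the start of each month purchases exceed redemptions; at the end of each month redemptions exceed purchases.
    2. Each month shows four peaks and four troughs.
    3. Purchases and redemptions appear to be correlated.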
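
The later sections refer to a frame named `total_balance_later_month4` whose construction is not shown in this write-up. A minimal sketch of how it might be built, together with a rough numeric check of the weekly cycle using lag-7 autocorrelation. The helper names `later_than` and `lag_autocorr` and the synthetic series are assumptions for illustration; with the real data, `total_balance` would be passed instead of `demo`.

```python
import numpy as np
import pandas as pd

def later_than(df, start="2014-04-01", date_col="date"):
    """Keep rows on or after `start` and add the weekday (Monday = 0)."""
    out = df[df[date_col] >= pd.Timestamp(start)].copy()
    out["weekday"] = out[date_col].dt.weekday
    return out.reset_index(drop=True)

def lag_autocorr(series, lags=(1, 7)):
    """Autocorrelation of a daily series at the given lags."""
    return {lag: series.autocorr(lag=lag) for lag in lags}

# Synthetic daily totals: high on working days, low on weekends, plus noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2014-03-01", "2014-08-31", freq="D")
pattern = np.where(dates.weekday < 5, 100.0, 40.0)
demo = pd.DataFrame({
    "date": dates,
    "total_purchase_amt": pattern + rng.normal(0, 5, len(dates)),
})

total_balance_later_month4 = later_than(demo)
acf = lag_autocorr(total_balance_later_month4["total_purchase_amt"])
print(acf)
```

A lag-7 autocorrelation well above the lag-1 value is consistent with a weekly cycle; it does not by itself rule out other periodicities.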

4. Differences in total purchases and redemptions within a week

4.1 Distribution

  1. Plot the density distributions
# Distribution of each weekday versus the overall distribution, April 2014 onward
fig = plt.figure(figsize=(10,10))

plt.subplot(2,2,1)
plt.title("The distribution of total purchase")
# violinplot does not accept scatter_kws/line_kws (those belong to regplot), so they are dropped
sns.violinplot(x="weekday",y="total_purchase_amt",data=total_balance_later_month4)

plt.subplot(2,2,2)
plt.title("The distribution of total purchase")
# histplot(kde=True) replaces the deprecated distplot
sns.histplot(total_balance_later_month4["total_purchase_amt"].dropna(),kde=True,stat="density")

plt.subplot(2,2,3)
plt.title("The distribution of total redeem")
sns.violinplot(x="weekday",y="total_redeem_amt",data=total_balance_later_month4)

plt.subplot(2,2,4)
plt.title("The distribution of total redeem")
sns.histplot(total_balance_later_month4["total_redeem_amt"].dropna(),kde=True,stat="density")

  2. Take the median and plot bar charts

# Bar charts of the median
# Aggregate by weekday and take the median
week_data = total_balance_later_month4[["weekday","total_purchase_amt","total_redeem_amt"]].groupby("weekday",as_index=False).median()
plt.figure(figsize=(12,6))
ax = plt.subplot(1,2,1)
plt.title("The median of total purchase with each weekday")
ax = sns.barplot(x="weekday",y="total_purchase_amt",data=week_data,label="Purchase")
ax.legend()

ax = plt.subplot(1,2,2)
plt.title("The median of total redeem with each weekday")
ax = sns.barplot(x="weekday",y="total_redeem_amt",data=week_data,label="Redeem")
ax.legend()

  3. Plot box plots

# Box plots
fig = plt.figure(figsize=(12,5))
ax = plt.subplot(1,2,1)
plt.title("The box plot of total purchase with each weekday")
ax = sns.boxplot(x="weekday",y="total_purchase_amt",data=total_balance_later_month4)
ax = plt.subplot(1,2,2)
plt.title("The box plot of total redeem with each weekday")
ax = sns.boxplot(x="weekday",y="total_redeem_amt",data=total_balance_later_month4)

The distributions plotted above may be related to how Yu'e Bao calculates returns: purchase volume is higher on Monday, Tuesday, Wednesday, and Thursday.

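4.2 Correlation

  1. Split the weekday into one-hot features
  • One-hot encoding turns each value of a categorical feature into its own 0/1 column that can be used in computation, for example: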

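from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3],
         [1, 1, 0],
         [0, 2, 1],
         [1, 0, 2]])
# There are 3 features, one per column.
# The first feature (first column) is [0, 1, 0, 1],
# so it takes two values, 0 or 1,
# and one-hot uses two positions to represent it:
# [1, 0] means 0 and [0, 1] means 1.

# Without toarray() the result is a sparse matrix (index plus value);
# a dense array can also be requested from the encoder itself
# (sparse_output=False in scikit-learn 1.2+, sparse=False before that).
ans = enc.transform([[0, 1, 3]]).toarray()
print(ans)  # [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]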
  • Build the features
# One-hot encode the weekday and collect the resulting columns
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# Derive the weekday from "date" (the daily aggregate has no weekday column);
# reshape(-1,1) gives a single column with as many rows as needed
weekday = total_balance["date"].dt.weekday.to_numpy().reshape(-1,1)
week_feature = encoder.fit_transform(weekday).toarray()
week_feature = pd.DataFrame(week_feature,columns=["weekday_onehot"+str(i) for i in range(week_feature.shape[1])],index=total_balance.index)
feature = pd.concat([total_balance[["total_purchase_amt","total_redeem_amt"]],week_feature,total_balance[["date"]]],axis=1)
feature.head()

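Here the weekday value is treated as one categorical feature, and a matrix of indicator columns is built from it.

  • Plot the Spearman correlation between the weekday features and the targets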
fig = plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
plt.title("The spearman correlation between total purchase and each weekday")
sns.heatmap(feature[[x for x in feature.columns if x not in ["total_redeem_amt","date"]]].corr('spearman'),linewidths=0.1,vmax=0.2,vmin=-0.2)

plt.subplot(1,2,2)
plt.title("The spearman correlation between total redeem and each weekday")
sns.heatmap(feature[[x for x in feature.columns if x not in ["total_purchase_amt","date"]]].corr('spearman'),linewidths=0.1,vmax=0.2,vmin=-0.2)
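
The heatmaps clip the color scale to ±0.2, so the actual coefficients can be hard to read off. A minimal sketch that prints the Spearman correlation of each weekday indicator with a target column as a sorted table. It uses `pd.get_dummies` rather than `OneHotEncoder` for brevity; the helper name `weekday_spearman` and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def weekday_spearman(df, target_col, date_col="date"):
    """Spearman correlation between each weekday indicator and target_col."""
    dummies = pd.get_dummies(
        df[date_col].dt.weekday, prefix="weekday_onehot", prefix_sep=""
    ).astype(float)
    return dummies.corrwith(df[target_col], method="spearman").sort_values()

# Synthetic daily totals: high on working days, low on weekends, plus noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2014-04-01", "2014-08-31", freq="D")
demo = pd.DataFrame({
    "date": dates,
    "total_purchase_amt": np.where(dates.weekday < 5, 100.0, 40.0)
    + rng.normal(0, 5, len(dates)),
})

result = weekday_spearman(demo, "total_purchase_amt")
print(result)
```

On the real data the call would be `weekday_spearman(total_balance, "total_purchase_amt")`, and likewise for `"total_redeem_amt"`.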


5. Monthly distribution of total purchases and redemptions

  1. Kernel density estimate (kdeplot) of the total purchase amount for each month
# KDE of the total purchase amount for each month
fig = plt.figure(figsize=(20,10))
plt.title("The Probability Density of total purchase amount in Each Month")
plt.xlabel("Amount")
plt.ylabel("Probability")
# Select each month by year and month; range(7,13) also covers December 2013, which range(7,12) skipped
for year,months in ((2013,range(7,13)),(2014,range(1,9))):
    for i in months:
        in_month = (total_balance["date"].dt.year == year) & (total_balance["date"].dt.month == i)
        sns.kdeplot(total_balance.loc[in_month,"total_purchase_amt"],label=str(year%100)+"Y,"+str(i)+"M")
plt.legend()

  2. Kernel density estimate (kdeplot) of the total redemption amount for each month

# KDE of the total redemption amount for each month
fig = plt.figure(figsize=(20,10))
plt.title("The Probability Density of total redeem amount in Each Month")
plt.xlabel("Amount")
plt.ylabel("Probability")
# Select each month by year and month; range(7,13) also covers December 2013, which range(7,12) skipped
for year,months in ((2013,range(7,13)),(2014,range(1,9))):
    for i in months:
        in_month = (total_balance["date"].dt.year == year) & (total_balance["date"].dt.month == i)
        sns.kdeplot(total_balance.loc[in_month,"total_redeem_amt"],label=str(year%100)+"Y,"+str(i)+"M")
plt.legend()


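  • The two plots above show that July to October 2013 differ clearly from the other months, while the 2014 months differ little from one another and need a closer look.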
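
Overlaid density curves for fourteen months are hard to compare by eye, so the same grouping can also be summarized numerically. A minimal sketch, assuming a frame shaped like `total_balance`; the helper name `monthly_summary` and the synthetic amounts are illustrative only.

```python
import numpy as np
import pandas as pd

def monthly_summary(df, value_col, date_col="date"):
    """Mean, median, and standard deviation of value_col per calendar month."""
    grouped = df.groupby(df[date_col].dt.to_period("M"))[value_col]
    return grouped.agg(["mean", "median", "std"])

# Synthetic stand-in for total_balance (amounts are made up).
rng = np.random.default_rng(1)
dates = pd.date_range("2013-07-01", "2014-08-31", freq="D")
demo = pd.DataFrame({
    "date": dates,
    "total_purchase_amt": rng.uniform(1e7, 5e8, len(dates)),
})

summary = monthly_summary(demo, "total_purchase_amt")
print(summary.head())
```

Months whose mean or spread sits far from the rest in this table are the ones the density plot separates visually.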
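  3. Plot the purchase/redemption KDEs for May to August 2014
# KDEs for May, June, July, and August 2014

# Monthly slices of the daily totals
in_2014 = total_balance["date"].dt.year == 2014
total_balance_month5 = total_balance[in_2014 & (total_balance["date"].dt.month == 5)]
total_balance_month6 = total_balance[in_2014 & (total_balance["date"].dt.month == 6)]
total_balance_month7 = total_balance[in_2014 & (total_balance["date"].dt.month == 7)]
total_balance_month8 = total_balance[in_2014 & (total_balance["date"].dt.month == 8)]

plt.figure(figsize=(12,10))

ax = plt.subplot(2,1,1)
plt.title('The Probability Density of total purchase and redeem amount from May.14 to August.14')
plt.ylabel('Probability')
plt.xlabel('Amount')
ax = sns.kdeplot(total_balance_month5['total_purchase_amt'],color='Black',label='May')
ax = sns.kdeplot(total_balance_month6['total_purchase_amt'],label='June')
ax = sns.kdeplot(total_balance_month7['total_purchase_amt'],label='July')
ax = sns.kdeplot(total_balance_month8['total_purchase_amt'],label='August')
ax.legend()
ax = plt.subplot(2,1,2)
plt.ylabel('Probability')
plt.xlabel('Amount')
ax = sns.kdeplot(total_balance_month8['total_redeem_amt'],label='August')
ax = sns.kdeplot(total_balance_month7['total_redeem_amt'],label='July')
ax = sns.kdeplot(total_balance_month6['total_redeem_amt'],label='June')
ax = sns.kdeplot(total_balance_month5['total_redeem_amt'],color='Black',label='May')
ax.legend()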

  • Purchase volumes in May and June are fairly close, while June redemptions differ markedly from the other three months.

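6. Daily analysis of total purchases and redemptions

  1. Aggregate the August data by day and plot bar charts of daily total purchases and redemptions
# Aggregate the August data by day
# "day" is derived from "date" in case the monthly slice does not carry it
day_data = total_balance_month8.assign(day=total_balance_month8["date"].dt.day)[["day","total_purchase_amt","total_redeem_amt"]].groupby("day",as_index=False).mean()
fig = plt.figure(figsize=(10,10))
# Daily purchases in August
ax = plt.subplot(2,2,1)
ax=sns.barplot(x="day",y="total_purchase_amt",data=day_data,label="Purchase")
# barplot places bars at positions 0..n-1, so the line is drawn at the same positions
ax.plot(range(len(day_data)),day_data["total_purchase_amt"],label="Purchase")
ax.legend()
# Daily redemptions in August
ax = plt.subplot(2,2,2)
ax=sns.barplot(x="day",y="total_redeem_amt",data=day_data,label="Redeem")
ax.plot(range(len(day_data)),day_data["total_redeem_amt"],label="Redeem")
ax.legend()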


  • Purchase volume is higher at the start of each week
  • On weekends both purchases and redemptions are low; users tend not to trade
  2. Plot a heatmap of all historical data
# Heatmap of every day: rows are weeks, columns are weekdays
# Week and weekday are derived from "date" so the snippet does not depend on precomputed columns
week = total_balance_later_month4["date"].dt.isocalendar().week.astype(int).to_numpy()
weekday = total_balance_later_month4["date"].dt.weekday.to_numpy()

for column,title in (("total_purchase_amt","Purchase"),("total_redeem_amt","Redeem")):
    test = np.zeros((week.max()-week.min()+1,7))
    test[week-week.min(),weekday] = total_balance_later_month4[column].to_numpy()

    fig,ax = plt.subplots(figsize=(10,4))
    sns.heatmap(test,linewidths=0.1,ax=ax)
    ax.set_title(title)
    ax.set_xlabel('weekday')
    ax.set_ylabel('week')


  • In the redemption heatmap, Wednesday of week 12 and Sunday of week 4 look anomalous
  • This day is the first day after the May Day (Labour Day) holiday

7. Holidays and special dates

  1. Bar chart of holiday means versus the ordinary mean
# Bar chart of holiday means versus the ordinary mean
fig = plt.figure()
index_list = ["qingming","Labour","DW","618","Mean"]
value_list = [np.mean(data_qingming["total_purchase_amt"]),np.mean(data_labour["total_purchase_amt"]),np.mean(data_duanwu["total_purchase_amt"]),np.mean(data618["total_purchase_amt"]),np.mean(total_balance_later_month4["total_purchase_amt"])]
plt.bar(index_list,value_list,label="Purchase")
# Different labels keep the redemption bars at separate x positions
index_list = ['QM.','Labour.','DW.','618.','Mean.']
value_list = [np.mean(data_qingming["total_redeem_amt"]),np.mean(data_labour["total_redeem_amt"]),np.mean(data_duanwu["total_redeem_amt"]),np.mean(data618["total_redeem_amt"]),np.mean(total_balance_later_month4["total_redeem_amt"])]
plt.bar(index_list,value_list,label="Redeem")
plt.legend()

The means for Qingming Festival, Labour Day, and the Dragon Boat Festival are clearly below the ordinary mean.
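
The bar chart relies on frames such as `data_qingming` and `data618` whose construction is not shown here. A minimal sketch of how they might be sliced out of the daily totals. The date windows are assumptions: the three holiday windows follow the 2014 public holiday calendar as I understand it, and the single-day "618" window is only a guess, since the source does not say which dates it used. The helper name `slice_window` and the synthetic frame are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed half-open windows [start, end); adjust to the dates actually intended.
HOLIDAY_WINDOWS = {
    "qingming": ("2014-04-05", "2014-04-08"),
    "labour": ("2014-05-01", "2014-05-04"),
    "duanwu": ("2014-05-31", "2014-06-03"),
    "618": ("2014-06-18", "2014-06-19"),  # guess: the shopping day only
}

def slice_window(df, start, end, date_col="date"):
    """Rows with start <= date < end."""
    mask = (df[date_col] >= pd.Timestamp(start)) & (df[date_col] < pd.Timestamp(end))
    return df.loc[mask]

# Synthetic stand-in for total_balance (amounts are made up).
dates = pd.date_range("2014-04-01", "2014-08-31", freq="D")
demo = pd.DataFrame({
    "date": dates,
    "total_purchase_amt": np.arange(len(dates), dtype=float),
    "total_redeem_amt": np.arange(len(dates), dtype=float) * 2,
})

frames = {name: slice_window(demo, *window) for name, window in HOLIDAY_WINDOWS.items()}
data_qingming = frames["qingming"]
data_labour = frames["labour"]
data_duanwu = frames["duanwu"]
data618 = frames["618"]
print({name: len(frame) for name, frame in frames.items()})
```

With the real data, `total_balance` would replace `demo`.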

  2. Time series around each holiday
# Time series around Qingming Festival
data_qingming_around = total_balance[(total_balance["date"]>="2014-04-01")&(total_balance["date"]<"2014-04-13")]
ax = sns.lineplot(x="date",y="total_purchase_amt",data=data_qingming_around,label="Purchase")
ax = sns.lineplot(x="date",y="total_redeem_amt",data=data_qingming_around,label="Redeem",ax=ax)
ax = sns.scatterplot(x="date",y="total_purchase_amt",data=data_qingming,ax=ax)
ax = sns.scatterplot(x="date",y="total_redeem_amt",data=data_qingming,ax=ax)
ax.legend()

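The other dates can be plotted the same way.
From the plots:

  • Before a holiday, both purchases and redemptions fall
  • After a holiday, both purchases and redemptions rise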

8. Large transactions

  1. Box plot of individual user transactions
# Box plot of individual user transactions
sns.boxplot(x=data_balance["total_purchase_amt"])
plt.title("The abnormal value of total purchase")

  2. Inspect the outlier records

data_balance[data_balance["total_purchase_amt"]>2e8]

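  3. Look at the trend around the outlier date

# Total daily purchases on the day of the 200 million single purchase and on nearby days
e2 = total_balance[(total_balance["date"]>="2013-11-01")&(total_balance["date"]<="2013-11-10")]
e2 = e2.assign(day=e2["date"].dt.day)  # the daily aggregate has no "day" column
ax = sns.barplot(x="day",y="total_purchase_amt",data=e2,label="Purchase")
# barplot places bars at positions 0..n-1, so the line is drawn at the same positions
ax.plot(range(len(e2)),e2["total_purchase_amt"],label="Purchase")
plt.title("The influence of the big deal with 200 million purchasing")
ax.legend()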

  4. Time series of the largest single transaction per day

# Time series of the largest single transaction per day and of the daily totals

plt.figure(figsize=(20, 6))
ax = sns.lineplot(x="date", y="total_purchase_amt", data=data_balance[['total_purchase_amt', 'date']].groupby('date', as_index=False).max(), label='MAX_PURCHASE')
ax = sns.lineplot(x="date", y="total_redeem_amt", data=data_balance[['total_redeem_amt', 'date']].groupby('date', as_index=False).max(), label='MAX_REDEEM')
ax = sns.lineplot(x="date", y="total_purchase_amt", data=data_balance[['total_purchase_amt', 'date']].groupby('date', as_index=False).sum(), label='TOTAL_PURCHASE')
ax = sns.lineplot(x="date", y="total_redeem_amt", data=data_balance[['total_redeem_amt', 'date']].groupby('date', as_index=False).sum(), label='TOTAL_REDEEM')
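
Plotting the daily maximum next to the daily total shows both on one scale, but the share of each day's total that comes from its single largest transaction is easier to read as a ratio. A minimal sketch, assuming a per-transaction frame shaped like `data_balance`; the helper name `max_share_per_day` and the toy rows are illustrative only.

```python
import pandas as pd

def max_share_per_day(df, value_col, date_col="date"):
    """Per day: largest single amount, total amount, and their ratio."""
    daily = df.groupby(date_col)[value_col].agg(["max", "sum"])
    # Days with a zero total yield NaN here rather than a meaningful share.
    daily["share"] = daily["max"] / daily["sum"]
    return daily

# Toy per-user records (amounts are made up).
demo = pd.DataFrame({
    "date": pd.to_datetime(["2013-11-04"] * 3 + ["2013-11-05"] * 2),
    "total_purchase_amt": [300, 50, 50, 50, 50],
})

shares = max_share_per_day(demo, "total_purchase_amt")
print(shares)
```

On the real data, `max_share_per_day(data_balance, "total_purchase_amt")` sorted by `share` would list the days most dominated by one large purchase.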


Reposted from blog.csdn.net/ava_zhang2017/article/details/108080195