[Time Series - 05] Facebook / Prophet

Abstract

  • Prophet follows the sklearn model API. We create an instance of the Prophet class and then call its fit and predict methods.

  • Input of Prophet: a dataframe with two columns, ds (the timestamp, formatted as YYYY-MM-DD or YYYY-MM-DD HH:MM:SS) and y (numeric, the measurement to be forecast);

  • Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

  • By default, Prophet uses a linear model for its forecast.
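The fit/predict workflow summarized in the bullets above can be sketched roughly as follows. This is only a minimal illustration: the history values are made up, the helper names make_history and fit_and_forecast are hypothetical, and the import assumes the package is installed as prophet (Prophet 1.0+) or fbprophet (older releases).

```python
from datetime import date, timedelta


def make_history(start, values):
    """Build rows in the two-column layout Prophet expects: ds (date string) and y (numeric)."""
    return [
        {"ds": (start + timedelta(days=i)).isoformat(), "y": float(v)}
        for i, v in enumerate(values)
    ]


def fit_and_forecast(rows, periods):
    """Fit Prophet on the history rows and forecast `periods` days ahead."""
    # Imported lazily so the data-preparation helper works without Prophet installed.
    import pandas as pd
    try:
        from prophet import Prophet      # package name since v1.0
    except ImportError:
        from fbprophet import Prophet    # older releases

    df = pd.DataFrame(rows)
    m = Prophet()                        # create the model instance
    m.fit(df)                            # fit on the ds / y history
    future = m.make_future_dataframe(periods=periods)
    forecast = m.predict(future)         # yhat holds the predicted values
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Toy history (assumption): 60 days of made-up values with a weekly pattern.
rows = make_history(date(2018, 1, 1), [10 + (i % 7) for i in range(60)])
print(rows[0])
```

With Prophet installed, fit_and_forecast(rows, periods=14) would return the fitted history plus 14 forecast days.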

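Most companies need time series forecasting in some form: e-commerce platforms forecast GMV, food-delivery O2O services forecast swings in order volume so they can allocate couriers, hotels forecast room nights to adjust pricing and sales, and so on. Yet forecasting is hard for many of them. Apart from time series forecasting being something of a dark art (just kidding), it requires analysts who have both deep domain knowledge and the statistical background for time series modeling. On top of that, tuning a time series model is complicated and tedious work.

Prophet was created against this background. It turns the common steps and parameters of time series modeling into defaults, so that business analysts with little statistical training can quickly build a reasonably usable model for the task at hand.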

Reference

https://facebook.github.io/prophet/  ## official documentation (English)

https://github.com/facebook/prophet ## GitHub code repository

https://vectorf.github.io/ ## a community Chinese translation of the official tutorial

https://www.zhihu.com/question/56585493 ## Zhihu question

Detail

future = m.make_future_dataframe(periods=6, freq='M')  ## builds the dataframe of dates to predict; freq sets the time granularity

make_future_dataframe(self, periods, freq='D', include_history=True)

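periods: Int number of periods to forecast forward, i.e. how many time units to predict;

freq: the time granularity; 'M' means monthly. Defaults to 'D' (daily) when not specified;

include_history: Boolean to include the historical dates in the data frame for predictions;

future: extends forward from the end of self.history for the requested number of periods, i.e. the returned dataframe holds the dates to predict, derived from the data the model was fit on;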

fcst = m.predict(future)

fcst: the prediction result, a pandas.core.frame.DataFrame with a set of columns, including:

yhat (the predicted value): The predict method will assign each row in future a predicted value which it names yhat.
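To make the roles of periods and include_history concrete, here is a simplified stand-in for the daily case (freq='D'). It is not Prophet's implementation; future_dates is a hypothetical helper that only mimics the described behaviour with the standard library.

```python
from datetime import date, timedelta


def future_dates(history, periods, include_history=True):
    """Extend a list of daily dates `periods` days past the last historical date."""
    last = max(history)
    ahead = [last + timedelta(days=i) for i in range(1, periods + 1)]
    return (sorted(history) + ahead) if include_history else ahead


# Five days of history: 2018-07-01 .. 2018-07-05
history = [date(2018, 7, 1) + timedelta(days=i) for i in range(5)]
with_hist = future_dates(history, periods=3)
only_new = future_dates(history, periods=3, include_history=False)
print(len(with_hist), only_new[0], only_new[-1])
```

With include_history=True the result contains the historical dates as well, which is why predict also returns fitted values for the past.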

Model

  • Prophet uses a linear growth model by default

  • The growth parameter selects the model: m = Prophet(growth='logistic')

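Carrying capacity: cap

  • When forecasting growth, there is usually a maximum achievable value, such as total market size or total population. This is called the carrying capacity, and the forecast should saturate as it approaches it;

  • A cap value must be specified for every row of the data, including the rows of the future dataframe passed to predict;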

Saturating minimum: floor

  • Used the same way as the carrying capacity cap; a floor can only be used together with a cap

  • df['floor'] = 1.5
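A rough sketch of how cap and floor bound a logistic trend, followed by how the two columns might be passed to Prophet. The curve is a simplified illustration rather than Prophet's internal code; saturating_trend and fit_logistic are hypothetical helpers, and the cap value 8.5 is an arbitrary example (only floor = 1.5 comes from the notes above).

```python
import math


def saturating_trend(t, cap, floor=0.0, k=1.0, m=0.0):
    """Logistic curve rising from `floor` towards `cap`; k is the growth rate, m the midpoint."""
    return floor + (cap - floor) / (1.0 + math.exp(-k * (t - m)))


def fit_logistic(df, cap, floor, periods):
    """Fit Prophet with a logistic trend; cap and floor must be set on every row."""
    try:
        from prophet import Prophet      # package name since v1.0
    except ImportError:
        from fbprophet import Prophet    # older releases

    df = df.copy()
    df["cap"] = cap
    df["floor"] = floor
    m = Prophet(growth="logistic")
    m.fit(df)
    future = m.make_future_dataframe(periods=periods)
    future["cap"] = cap                  # the future frame needs the same columns
    future["floor"] = floor
    return m.predict(future)


# The curve stays strictly between floor and cap and flattens near both ends.
for t in (-10, 0, 10):
    print(t, round(saturating_trend(t, cap=8.5, floor=1.5), 3))
```

fit_logistic expects a pandas DataFrame with ds and y columns and is only run when Prophet is installed.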

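Trend changepoints

By default, Prophet will automatically detect these changepoints and will allow the trend to adapt appropriately. This automatic handling can still go wrong in two ways: 1) it misses a rate change; 2) it overfits rate changes in the history.

  • Automatic changepoint detection: by default Prophet only places changepoints in the first 80% of the time series, which helps avoid overfitting fluctuations at the end of the history. The range can be changed with changepoint_range, e.g. m = Prophet(changepoint_range=0.9).

  • Adjusting trend flexibility: changepoint_prior_scale defaults to 0.05, e.g. m = Prophet(changepoint_prior_scale=0.5)

A higher changepoint_prior_scale makes the trend more flexible; a lower value makes it less flexible.

Too much flexibility leads to overfitting; too little leads to underfitting.

  • Specifying changepoint locations manually: changepoints, e.g. m = Prophet(changepoints=['2014-01-01'])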
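What a changepoint does to a linear trend can be illustrated with a small piecewise-linear function. This is a sketch of the idea, not Prophet's code: piecewise_linear_trend is a hypothetical helper, time is a plain number, and the slope changes (deltas) are given by hand, whereas Prophet estimates them, with changepoint_prior_scale limiting how large they can become.

```python
def piecewise_linear_trend(t, k, m, changepoints, deltas):
    """Trend at time t: base slope k and offset m, with the slope changing by
    deltas[j] at changepoints[j]. The offset is adjusted so the line stays continuous."""
    slope, offset = k, m
    for s, delta in zip(changepoints, deltas):
        if t >= s:
            slope += delta
            offset -= s * delta
    return slope * t + offset


# Slope 1.0 until t=10, then 1.0 + (-0.5) = 0.5 afterwards.
for t in (0, 10, 20):
    print(t, piecewise_linear_trend(t, k=1.0, m=0.0, changepoints=[10], deltas=[-0.5]))
```

With all deltas forced towards zero the trend is a single straight line (risk of underfitting); with large deltas at many changepoints it can follow every wiggle in the history (risk of overfitting).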

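Holidays

  • Build a dataframe of holidays with the columns (holiday ||| ds ||| lower_window ||| upper_window):

  1. holiday: the name of the holiday

  2. ds: the date; a holiday can cover multiple dates

  3. [lower_window, upper_window] extends the holiday into a range of days around ds; for example, [-1, 0] adds the day before ds to the range;

  • Once the holiday dataframe is built, pass it in: m = Prophet(holidays=holidays)

  • Setting the prior scale for holidays: holidays_prior_scale, default 10

  1. If the holiday effects turn out to be overfit, lower their prior scale to smooth them;

  2. m = Prophet(holidays=holidays, holidays_prior_scale=1).fit(df)

  • Setting the prior scale for seasonality: seasonality_prior_scale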
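The effect of lower_window and upper_window described above can be checked with a few lines of standard-library Python. expand_holiday is a hypothetical helper written for this illustration; it only lists the days a holiday row would cover.

```python
from datetime import date, timedelta


def expand_holiday(ds, lower_window=0, upper_window=0):
    """Return every date covered by a holiday on `ds` with the given window offsets (in days)."""
    return [ds + timedelta(days=offset) for offset in range(lower_window, upper_window + 1)]


holiday_date = date(2018, 11, 11)
# [-1, 0]: the day before ds plus ds itself
print(expand_holiday(holiday_date, lower_window=-1, upper_window=0))
# [0, 1]: ds plus the day after, as used in the script below
print(expand_holiday(holiday_date, lower_window=0, upper_window=1))
```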

Script - update log

20180714: initial version, m = Prophet(seasonality_mode='multiplicative', weekly_seasonality=True, daily_seasonality=True).fit(class_df)

20180717: added model parameter configuration (by yinque): m.add_seasonality('quarterly', period=91.25, fourier_order=8, mode='additive')
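For intuition about the period and fourier_order arguments of add_seasonality: a seasonal component of this kind is modeled with sine and cosine terms of the given period, and a higher order adds faster-varying terms. The sketch below is a simplified illustration under that assumption; fourier_features is a hypothetical helper and time is measured in days.

```python
import math


def fourier_features(t, period, fourier_order):
    """Return the 2 * fourier_order sin/cos terms for time t (in days) and the given period."""
    features = []
    for n in range(1, fourier_order + 1):
        angle = 2.0 * math.pi * n * t / period
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# Same settings as the 'quarterly' seasonality in the update log.
quarterly = fourier_features(t=30.0, period=91.25, fourier_order=8)
print(len(quarterly))
```

A fourier_order of 8 therefore means 16 regressors for the quarterly component; raising it lets the seasonal curve change faster but also makes overfitting more likely.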

# Python
import pandas as pd
import numpy as np
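try:
    from prophet import Prophet      ## package name since Prophet 1.0
except ImportError:
    from fbprophet import Prophet    ## older releases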
#import matplotlib.pyplot as plt
import xlwt


def num2date(num_list):
    
    ## num format: 20180709 (YYYYMMDD), converted to 'YYYY-MM-DD'
    from datetime import datetime    
    date_list = []
    for num in num_list:
        num = str(num)
        date_list.append(datetime(int(num[0:4]), int(num[4:6]), int(num[6:8])).strftime('%Y-%m-%d'))
    return date_list


year_list = ["2013", "2014", "2015", "2016", "2017", "2018"]
month_list = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

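## Mapping {cate_id: cate_name} of the leaf categories to forecast.
## The original post leaves this as a placeholder; fill in real entries before running.
all_leaf_class_name_dict = {}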


playoffs = pd.DataFrame({
  'holiday': 'playoff',
  'ds': pd.to_datetime(['2008-01-13', '2009-01-03', '2010-01-16',
                        '2010-01-24', '2010-02-07', '2011-01-08',
                        '2013-01-12', '2014-01-12', '2014-01-19',
                        '2014-02-02', '2015-01-11', '2016-01-17',
                        '2016-01-24', '2016-02-07']),
  'lower_window': 0,
  'upper_window': 1,
})
    
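## NOTE: despite the name (kept from the Prophet docs example), these dates are
## 06-18, 11-11 and 12-12 of each year, presumably e-commerce promotion days.
superbowls = pd.DataFrame({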
  'holiday': 'superbowl',
  'ds': pd.to_datetime(['2013-06-18', '2014-06-18', '2015-06-18', '2016-06-18', '2017-06-18', '2018-06-18',
                        '2013-11-11', '2014-11-11', '2015-11-11', '2016-11-11', '2017-11-11', '2018-11-11',
                        '2013-12-12', '2014-12-12', '2015-12-12', '2016-12-12', '2017-12-12', '2018-12-12'
                        ]),
  'lower_window': 0,
  'upper_window': 1,
})

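## NOTE: holidays is built here but never passed to Prophet below;
## use Prophet(holidays=holidays) to apply it.
holidays = pd.concat((playoffs, superbowls))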


def by_days(year_month_day):

    periods = 120
    df = pd.read_csv('source_data_2013_201805.csv', header=0, encoding='gbk')
    df.columns = ['ds', 'cate_id', 'cate_name', 'y']
    
    cate_name = "羽绒服"  ## "down jacket"; matched against the cate_name column, so kept in Chinese
    class_df = df[df.cate_name.str.startswith(cate_name)].reset_index(drop=True)
    class_df['ds'] = num2date(class_df['ds'].tolist())
    class_df = class_df[['ds','y']]
    
    stop_month_index = class_df[class_df.ds == year_month_day].index.tolist()[0]
    class_df = class_df[0: stop_month_index+1]
    class_df_now = class_df[0:len(class_df)-periods]
    
    #class_df["y"] = np.log10(np.log10(df['y']))
#    print(class_df.head())
    
    m = Prophet()
    m.fit(class_df_now)
    future = m.make_future_dataframe(periods = periods)  ## periods: predict days
#    future.tail()
    forecast = m.predict(future)
    print(forecast.tail(periods))
    #
    m.plot(forecast)
    m.plot_components(forecast)



pred_list = []
real_list = []
error_list = []
def by_month(year_month, periods):
    
    print("End time: {}".format(year_month))
    
    save_path = "result_cate_by_month_until_{}_num{}.xls".format(year_month[0:4], periods)
    target_data = xlwt.Workbook(encoding="utf-8")
    
    df = pd.read_csv('./dataset/cate_by_month_histroy.csv', header=0, encoding='gbk')
    df.columns = ['ds', 'cate_id', 'cate_name', 'y']
    
    for cate_id in all_leaf_class_name_dict.keys():
        
        cate_name = all_leaf_class_name_dict[cate_id]
        target_sheet = target_data.add_sheet(u'{}'.format(cate_name))
        
        class_df_all = df[df.cate_name.str.startswith(cate_name)].reset_index(drop=True)
        stop_month_index = class_df_all[class_df_all.ds == year_month].index.tolist()[0]
        class_df_all = class_df_all[0: stop_month_index+1]
        class_df = class_df_all[0:len(class_df_all)-periods]
        
#        print(class_df_all)
#        print(class_df)
        
        class_df = class_df[['ds','y']]
#        class_df['y'] = np.log10(class_df['y'])
#        m = Prophet(seasonality_mode='multiplicative', weekly_seasonality=True, daily_seasonality=True).fit(class_df)
        m = Prophet(seasonality_mode='multiplicative')
        m.add_seasonality('quarterly', period=91.25, fourier_order=8, mode='additive')
        #m.add_regressor('regressor', mode='additive')
        m.fit(class_df)
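        ## NOTE: freq='M' yields month-end dates; if ds holds month-start dates, freq='MS' keeps the forecast aligned with the history
        future = m.make_future_dataframe(periods=periods, freq='M')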
        fcst = m.predict(future)
        pred_pandas = fcst.tail(periods)
        pred_list.clear()
        
        for index, row in pred_pandas.iterrows():
#            pred_list.append(10**row["yhat"])
            pred_list.append(row["yhat"])
            
        # =============================================================================
        real_list.clear()
        for index, row in (class_df_all.tail(periods)).iterrows():
            real_list.append(row["y"])
        
        # =============================================================================
        error_list.clear()
        excel_index_row = 0

        target_sheet.write(0, 1, "real-data")
        target_sheet.write(0, 2, "pred-data")
        target_sheet.write(0, 3, "error")
        
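        ## label each row with its month: the last `periods` months ending at year_month
        ## (the original wrote the same label into every row)
        month_labels = pd.period_range(end=year_month, periods=periods, freq='M').strftime('%Y-%m')
        for i in range(periods):
            target_sheet.write(i+1, 0, month_labels[i])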

        
        for i in range(len(real_list)):
            temp_error = (pred_list[i] - real_list[i])/real_list[i]
            error_list.append(temp_error)
            
            excel_index_row += 1
            target_sheet.write(excel_index_row, 1, real_list[i])
            target_sheet.write(excel_index_row, 2, pred_list[i])
            target_sheet.write(excel_index_row, 3, temp_error)
        
        print("Done: {}".format(cate_name))
        
    target_data.save(save_path)
    
    
# =============================================================================
# main    
# =============================================================================
#by_days("2017-12-31")
periods = 12  ## number of months to forecast
by_month("2015-12", periods)


Reposted from blog.csdn.net/Houchaoqun_XMU/article/details/81462903