Python performs lstm and xgboost sales volume time series modeling forecast analysis on store data

Original link: http://tecdat.cn/?p=17748

 

In the data science learning journey, I often deal with time series data sets in my daily work and make predictions accordingly.

I will go through the following steps:

Exploratory Data Analysis (EDA)

  • Problem definition (what are we going to solve)
  • Variable identification (what data do we have)
  • Univariate analysis (understand each field in the data set)
  • Multivariate analysis (understand the interaction between different fields and goals)
  • Missing value processing
  • Outlier handling
  • Variable conversion

Predictive modeling

  • LSTM
  • XGBoost

Problem definition

We provide the following information about the store in two different tables:

  • Store : ID of each store
  • Sales : Turnover on a specific date (our target variable)
  • Customers : The number of customers on a specific date
  • StateHoliday : Holiday
  • SchoolHoliday : School Holiday
  • StoreType : 4 different stores: a, b, c, d
  • CompetitionDistance : the distance to the nearest competitor store (in meters)
  • CompetitionOpenSince [Month/Year]: Provide the approximate year and month when the nearest competitor opened
  • Promotion : whether the promotion is on the day or not
  • Promo2 : Promo2 is a continuous and continuous promotion for certain stores: 0 = store does not participate, 1 = store is participating
  • PromoInterval : Describe the continuous interval of promotion start, and specify the month when the promotion restarts.

Using all this information, we predict the sales volume for the next 6 weeks.

 

# 让我们导入EDA所需的库:

import numpy as np # 线性代数
import pandas as pd # 数据处理,CSV文件I / O导入(例如pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
plt.style.use("ggplot") # 绘图


#导入训练和测试文件:
train_df = pd.read_csv("../Data/train.csv")
test_df = pd.read_csv("../Data/test.csv")


#文件中有多少数据:
print("在训练集中,我们有", train_df.shape[0], "个观察值和", train_df.shape[1], 列/变量。")
print("在测试集中,我们有", test_df.shape[0], "个观察值和", test_df.shape[1], "列/变量。")
print("在商店集中,我们有", store_df.shape[0], "个观察值和", store_df.shape[1], "列/变量。")

In the training set, we have 1017209 observations and 9 columns/variables.
In the test set, we have 41088 observations and 8 columns/variables.
In the store set, we have 1115 observations and 10 columns/variables.

First let's clean up the  training data set.

 

#查看数据
train_df.head().append(train_df.tail()) #显示前5行。

 

train_df.isnull().all()
Out[5]:

Store            False
DayOfWeek        False
Date             False
Sales            False
Customers        False
Open             False
Promo            False
StateHoliday     False
SchoolHoliday    False
dtype: bool

Let's start with the first variable ->  sales volume



opened_sales = (train_df[(train_df.Open == 1) #如果商店开业
opened_sales.Sales.describe()
Out[6]:

count    422307.000000
mean       6951.782199
std        3101.768685
min         133.000000
25%        4853.000000
50%        6367.000000
75%        8355.000000
max       41551.000000
Name: Sales, dtype: float64


<matplotlib.axes._subplots.AxesSubplot at 0x7f7c38fa6588>

 

Look at customer variables

In [9]:

train_df.Customers.describe()
Out[9]:

count    1.017209e+06
mean     6.331459e+02
std      4.644117e+02
min      0.000000e+00
25%      4.050000e+02
50%      6.090000e+02
75%      8.370000e+02
max      7.388000e+03
Name: Customers, dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x7f7c3565d240>

 
train_df[(train_df.Customers > 6000)]

 

我们看一下假期 变量。

 
train_df.StateHoliday.value_counts()
 
0    855087
0    131072
a     20260
b      6690
c      4100
Name: StateHoliday, dtype: int64

 

train_df.StateHoliday_cat.count()

 

1017209

 

train_df.tail()

 

train_df.isnull().all() #检查缺失
Out[18]:

Store               False
DayOfWeek           False
Date                False
Sales               False
Customers           False
Open                False
Promo               False
SchoolHoliday       False
StateHoliday_cat    False
dtype: bool

Let's proceed to store analysis

 

store_df.head().append(store_df.tail())

 

#缺失数据:


Store                         0.000000
StoreType                     0.000000
Assortment                    0.000000
CompetitionDistance           0.269058
CompetitionOpenSinceMonth    31.748879
CompetitionOpenSinceYear     31.748879
Promo2                        0.000000
Promo2SinceWeek              48.789238
Promo2SinceYear              48.789238
PromoInterval                48.789238
dtype: float64
In [21]:

Let's start with the missing data. The first one is  CompetitionDistance


store_df.CompetitionDistance.plot.box() 

Let me look at the outliers, so we can choose between mean and median to fill in NaN
 

 

缺少数据,因为商店没有竞争。 因此,我建议用零填充缺失的值。

store_df["CompetitionOpenSinceMonth"].fillna(0, inplace = True)

Let's look at the promotion.

 

store_df.groupby(by = "Promo2", axis = 0).count() 

 

If there is no promotion, the NaN in "promotion" should be replaced with zero 

We merge the store data and the training set data, and then continue to analyze.

First, let's compare stores by sales, customers, etc.

 

f, ax = plt.subplots(2, 3, figsize = (20,10))

plt.subplots_adjust(hspace = 0.3)
plt.show()

 

As can be seen from the figure, StoreType A has the most stores, sales and customers. However, StoreType D has the highest average spend per customer. Only 17 stores have StoreType B with the most average customers.

 

We look at trends year by year.

 

sns.factorplot(data = train_store_df, 
# 我们可以看到季节性,但看不到趋势。 该销售额每年保持不变


<seaborn.axisgrid.FacetGrid at 0x7f7c350e0c50>


 

Let's look at the correlation diagram.

  "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Promo2

<matplotlib.axes._subplots.AxesSubplot at 0x7f7c33d79c18>

 

 

 

We can get the correlation:

  • Customers and Sales (0.82)
  • Promotion and sales (0,82)
  • Average customer sales vs promotion (0,28)
  • Store category vs average customer sales (0,44)

My analysis conclusion:

  • Store category A has the most sales and customers.
  • Store category B has the lowest average sales per customer. Therefore, I think customers only come for small commodities.
  • Store category D has the most shopping carts.
  • The promotion is only on working days.
  • Customers tend to buy more items on Monday (promotion) and Sunday (no promotion).
  • I can't see any annual trends. Seasonal mode only.

Most popular insights

1. Use lstm and pytorch for time series prediction in python

2. Use the long and short-term memory model lstm in python for time series prediction analysis

3. Use R language for time series (arima, exponential smoothing) analysis

4. Multivariate copula-garch-model time series prediction in r language

5. R language copulas and financial time series case

6. Use R language random fluctuation model sv to deal with random fluctuations in time series

7. R language time series tar threshold autoregressive model

8. R language k-shape time series clustering method to cluster stock price time series

9. Python3 uses arima model for time series forecasting

Guess you like

Origin blog.csdn.net/qq_19600291/article/details/105670930