2023 MathorCup Big Data Challenge, Question B: modeling analysis. Senior Xiaolu walks you through the modeling!


Question restated:

In the field of e-commerce retail, in order to meet the needs of thousands of merchants and effectively manage inventory, demand forecasting and inventory optimization are key issues. E-commerce platforms must understand the demand for various goods in different warehouses in advance in order to reasonably allocate inventory and ensure timely delivery. This question is divided into three parts:

Question 1: Use historical demand data to predict the demand for goods by each merchant in each warehouse from May 16 to May 30, 2023, in order to optimize inventory. Also, discuss how to classify time series of merchants, warehouses, and items to make forecasts more accurate.

Question 2: Consider the newly emerged merchant + warehouse + product dimensions, and explain how these new dimensions are similar to historical data in order to complete the demand forecast for these new dimensions.

Question 3: E-commerce platforms hold large-scale promotions in June every year, which makes accurate demand forecasting harder. Using the data from last year's Double Eleven period as a reference, predict the demand from June 1 to June 20, 2023.

Question one:

Steps:

Data preprocessing:
- Organize the data in attachments 1-4 to ensure that the data includes date, merchant, warehouse, product and corresponding shipment volume.
- Process the data before time series analysis, including removing missing values and outliers to ensure data quality.
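
As a rough illustration of this preprocessing step, the sketch below cleans a single daily demand series. It rests on two assumptions that are not in the original data description: a missing day is taken to mean zero shipments, and outliers are capped with the common Q3 + 1.5 × IQR rule instead of being deleted. The helper name `clean_series` is hypothetical.

```python
import pandas as pd

def clean_series(series: pd.Series) -> pd.Series:
    """Fill missing days and cap outliers in one daily demand series."""
    # Reindex to a complete daily calendar; missing days are assumed to mean zero shipments
    full_index = pd.date_range(series.index.min(), series.index.max(), freq="D")
    filled = series.reindex(full_index).fillna(0.0)
    # Cap values above Q3 + 1.5 * IQR (a rule of thumb, not the only possible choice)
    q1, q3 = filled.quantile([0.25, 0.75])
    upper = q3 + 1.5 * (q3 - q1)
    return filled.clip(lower=0.0, upper=upper)

# Toy series with one missing day (January 3) and one extreme value
dates = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04", "2023-01-05", "2023-01-06"])
raw = pd.Series([10.0, 12.0, 11.0, 500.0, 9.0], index=dates)
cleaned = clean_series(raw)
print(cleaned)
```

Capping keeps the calendar complete, which the ARIMA and decomposition steps rely on. Whether a spike is an error or a real promotion peak should be checked before capping it.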
Time series decomposition:
- A time series usually contains trend, seasonal, and noise components, which can be separated using time series decomposition methods.
Model building:
- Choose an appropriate time series model, such as the ARIMA (Autoregressive Integrated Moving Average) model, which includes autoregressive (AR), differencing (I), and moving average (MA) components.
  ARIMA is one of the most common time series forecasting models. Its formulas are given below:

ARIMA model general form:

The ARIMA model has three main parts: AR (autoregressive), I (integrated, i.e. differencing used to deal with non-stationarity), and MA (moving average).

AR (autoregressive) part:

The AR part represents the linear relationship between the value at the current time point and the values at past time points.

General form:
$X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \ldots + \phi_p X_{t-p} + \varepsilon_t$
- $X_t$ : The value of the time series at time $t$.
- $c$ : Constant term.
- $\phi_1, \phi_2, \ldots, \phi_p$ : Autoregressive coefficients, estimated by fitting the model.
- $\varepsilon_t$ : Noise term, representing random errors not explained by the model.
I (integrated) part:

The I part deals with the non-stationarity of the time series. It applies a difference operation, which replaces each value with the difference between the value at the current time point and the value at the previous one. The order $d$ is the number of times this operation is applied.

General form (first-order difference, $d = 1$):
$Y_t = X_t - X_{t-1}$

Here, $Y_t$ is the differenced time series.
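
The differencing formula can be checked on a few numbers. The sketch below uses a made-up toy series and only NumPy; it also shows that differencing can be undone, which is how forecasts made on the differenced series are brought back to the original scale.

```python
import numpy as np

x = np.array([100.0, 104.0, 109.0, 115.0, 122.0])  # toy series with an upward trend

# First-order difference: Y_t = X_t - X_{t-1}
y = np.diff(x)

# Second-order difference (d = 2) is the difference of the differences
y2 = np.diff(x, n=2)

# Differencing is invertible given the first value, so values on the differenced
# scale can be mapped back to the original scale by cumulative summation
restored = np.concatenate(([x[0]], x[0] + np.cumsum(y)))

print(y)   # [4. 5. 6. 7.]
print(y2)  # [1. 1. 1.]
```

The first difference still grows, while the second difference is constant, so this toy series would need $d = 2$ to look stationary.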
MA (moving average) part:

The MA part expresses the value at the current time point as a linear combination of the current and past noise terms.

General form:
$X_t = c + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \ldots - \theta_q \varepsilon_{t-q}$
- $c$ : Constant term.
- $\varepsilon_t$ : Noise term.
- $\theta_1, \theta_2, \ldots, \theta_q$ : Moving average coefficients, estimated by fitting the model.

These formulas represent the general form of the ARIMA model. In practical applications, model parameters such as (p), (d) and (q) need to be determined through fitting and tuning. These operations are typically performed using time series analysis tools such as the statsmodels library in Python.
Model fitting:

Use historical data to fit the ARIMA model and estimate the model parameters $\phi$ and $\theta$.

Forecasting:
- Use the fitted ARIMA model to predict demand at future time points.
- The model equation one step ahead is:
  $X_{t+1} = c + \phi_1 X_{t} + \phi_2 X_{t-1} + \ldots + \phi_p X_{t-p+1} - \theta_1 \varepsilon_{t} - \theta_2 \varepsilon_{t-1} - \ldots - \theta_q \varepsilon_{t-q+1} + \varepsilon_{t+1}$
  The future noise $\varepsilon_{t+1}$ is unknown and has zero mean, so the point forecast sets it to zero and replaces the past noise terms with the fitted residuals.
Model evaluation:
- Use appropriate evaluation metrics to evaluate the performance of the model, such as mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), etc.
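
The three metrics named above are short enough to write out directly. The helper name `forecast_errors` and the numbers are made up for illustration.

```python
import numpy as np

def forecast_errors(actual, predicted):
    """Return MSE, RMSE and MAE for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    mse = np.mean(errors ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "MAE": np.mean(np.abs(errors))}

# Errors are -2, 0, 1, -3, so MSE = 14 / 4 = 3.5 and MAE = 6 / 4 = 1.5
metrics = forecast_errors([10, 12, 14, 16], [12, 12, 13, 19])
print(metrics)
```

RMSE is in the same unit as demand and penalises large misses more than MAE does, which matters when a few high-volume items dominate inventory cost.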
Classification:
- The demand time series of different merchants, warehouses, and commodities can be divided into different categories, and the corresponding ARIMA model can be used for each category.
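
One way to form such categories, sketched below, is to classify each series by how often demand occurs and how variable its size is. The scheme (average demand interval, ADI, against the squared coefficient of variation, CV²) and the cut-offs 1.32 and 0.49 are commonly quoted in the intermittent-demand literature; they are an assumption here, not something prescribed by the competition material.

```python
import numpy as np

def classify_demand(series, adi_cut=1.32, cv2_cut=0.49):
    """Label a demand series as smooth, erratic, intermittent or lumpy."""
    values = np.asarray(series, dtype=float)
    nonzero = values[values > 0]
    if nonzero.size == 0:
        return "no demand"
    adi = values.size / nonzero.size             # average interval between demand days
    cv2 = (nonzero.std() / nonzero.mean()) ** 2  # variability of non-zero demand sizes
    if adi < adi_cut:
        return "smooth" if cv2 < cv2_cut else "erratic"
    return "intermittent" if cv2 < cv2_cut else "lumpy"

print(classify_demand([10, 11, 9, 10, 12, 10, 11]))      # steady daily demand
print(classify_demand([0, 0, 5, 0, 0, 0, 6, 0, 0, 5]))   # sparse demand, similar sizes
print(classify_demand([0, 0, 1, 0, 0, 0, 50, 0, 0, 2]))  # sparse demand, very uneven sizes
```

Smooth series are the natural candidates for ARIMA, while sparse series with many zero days usually need a different treatment, so the class can decide which model each series gets.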

In Python, you can use the statsmodels library or another time series analysis library to perform the above steps. Install the dependencies first:

pip install pandas numpy statsmodels matplotlib

You can then use the following Python code example for time series analysis and forecasting:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Read the data (replace with the path to your data file)
data = pd.read_csv('your_data.csv')

# Convert the data to a time series indexed by date
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)
data = data.sort_index()

# Seasonal decomposition; period=7 assumes daily data with a weekly cycle
result = seasonal_decompose(data['Demand'], model='additive', period=7)

# Difference the series to deal with non-stationarity
differenced_data = data['Demand'].diff().dropna()

# Inspect the autocorrelation (ACF) and partial autocorrelation (PACF) plots
# to choose suitable values of p and q
plot_acf(differenced_data, lags=40)
plt.show()
plot_pacf(differenced_data, lags=40)
plt.show()

# Placeholder orders: replace with the values read from the ACF/PACF plots
p, d, q = 1, 1, 1

# Number of time steps to forecast (May 16 to May 30 is 15 days)
n = 15

# Build and fit the ARIMA model
model = ARIMA(data['Demand'], order=(p, d, q))
model_fit = model.fit()

# Forecast future demand
forecast = model_fit.forecast(steps=n)

print("Forecast:")
print(forecast)

Question two:

There are some newly emerged merchant + warehouse + product dimensions (Attachment 5). They may come from newly launched products or from changes in the warehouses where certain products are stored. Discuss how to use the historical data in Attachment 1 to find reference series similar to these new forecasting dimensions and complete their demand forecasts.

To forecast the new merchant + warehouse + product dimensions, use the historical data in Attachment 1 and follow the steps below to find similar sequences:

Data preparation:
- First, integrate the newly emerged merchant + warehouse + product dimension data (Attachment 5) into the historical data.
Similar sequence search:
- For each new dimension, similar historical sequences can be found using:
  - Similarity measure: Use an appropriate similarity measure (such as correlation, Euclidean distance, cosine similarity, etc.) to compare the similarity of the new dimension with the historical dimension. This will help find the most similar historical sequence.
  - Cluster analysis: Use cluster analysis methods, such as K-means clustering or hierarchical clustering, to cluster new dimensions with historical dimensions to find similar sequences.
  - Time series feature extraction: Extract time series features of new dimensions and historical dimensions, such as trends, seasonality, etc., and then use these features to perform similarity comparisons.
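
The feature-based comparison in the last bullet can be sketched as follows. The four features (mean, standard deviation, linear trend slope, autocorrelation at a weekly lag), the helper names, and the toy series are illustrative assumptions; a real solution would pick features that suit the attachments.

```python
import numpy as np

def extract_features(series, season=7):
    """Mean, standard deviation, trend slope and seasonal-lag autocorrelation."""
    x = np.asarray(series, dtype=float)
    t = np.arange(x.size)
    slope = np.polyfit(t, x, 1)[0]  # linear trend per time step
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    # Autocorrelation at the seasonal lag (0 if the series is constant)
    acf = np.sum(centered[season:] * centered[:-season]) / denom if denom > 0 else 0.0
    return np.array([x.mean(), x.std(), slope, acf])

def most_similar(new_series, historical, season=7):
    """Return the key of the historical series closest in feature space."""
    keys = list(historical)
    feats = np.array([extract_features(historical[k], season) for k in keys])
    target = extract_features(new_series, season)
    # Standardise each feature so that no single scale dominates the distance
    mu, sigma = feats.mean(axis=0), feats.std(axis=0)
    sigma[sigma == 0] = 1.0
    dist = np.linalg.norm((feats - mu) / sigma - (target - mu) / sigma, axis=1)
    return keys[int(np.argmin(dist))]

t = np.arange(28)
historical = {
    "weekly": 20 + 5 * np.sin(2 * np.pi * t / 7),  # flat level with a weekly cycle
    "growing": 5 + 1.5 * t,                        # steady growth, no cycle
}
new_series = 22 + 4 * np.sin(2 * np.pi * t[:21] / 7)  # short new series with a weekly cycle
print(most_similar(new_series, historical))
```

Comparing features instead of raw values lets a new dimension with only a few weeks of data be matched against much longer historical series.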
Modeling and forecasting:
- Once similar historical sequences are found, the time series models fitted to them can be used to forecast the new dimensions.
- For each new dimension, use either the models of several similar historical dimensions or only the model of the historical dimension with the highest similarity.
Model evaluation:
- Evaluate the prediction results with appropriate evaluation metrics to determine the performance of the model.
Forecast output:
- Record the forecast results for the new dimensions and use them in actual supply chain management.

Finding similar sequences and completing the forecasts for the new dimensions requires similarity measures and a time series model:

Similarity measure:

Correlation coefficient: used to measure the linear relationship between two time series.
$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$
where $X_i$ and $Y_i$ are the values of the two time series at time $i$, and $\bar{X}$ and $\bar{Y}$ are their respective means.
Euclidean distance: used to measure the spatial distance between two time series.
$d = \sqrt{\sum_{i=1}^{n}(X_i - Y_i)^2}$
Cosine similarity: used to measure the cosine of the angle between the two time series viewed as vectors.
$\cos(\theta) = \frac{\sum_{i=1}^{n} X_i \cdot Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2} \cdot \sqrt{\sum_{i=1}^{n} Y_i^2}}$
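
The three measures translate almost term by term into NumPy. The sketch below uses two made-up series in which the second is exactly twice the first, to show how differently the measures react to a change of level.

```python
import numpy as np

def pearson(x, y):
    # Correlation coefficient: covariance of the centred series over the product of their norms
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(yc ** 2)))

def euclidean_distance(x, y):
    # Square root of the summed squared differences
    return np.sqrt(np.sum((x - y) ** 2))

def cosine(x, y):
    # Dot product divided by the product of the vector lengths
    return np.sum(x * y) / (np.sqrt(np.sum(x ** 2)) * np.sqrt(np.sum(y ** 2)))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # same shape as x, twice the level

print(pearson(x, y))
print(euclidean_distance(x, y))
print(cosine(x, y))
```

Correlation and cosine similarity both come out at (numerically) 1 because they ignore scale, while the Euclidean distance is $\sqrt{30} \approx 5.48$. If a small new merchant should be matched to a large one with the same demand pattern, a scale-free measure or normalised series is the safer choice.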

Time Series Model:

When completing forecasts for the new dimensions, ARIMA models or other appropriate time series models can be used. The parts of the ARIMA model were given under Question one. Combined, they give the following general form, applied to the differenced series when $d > 0$:

$X_t = c + \phi_1 X_{t-1} + \ldots + \phi_p X_{t-p} - \theta_1 \varepsilon_{t-1} - \ldots - \theta_q \varepsilon_{t-q} + \varepsilon_t$

Here, $X_t$ is the value of the time series, $c$ is a constant term, $\phi_1, \ldots, \phi_p$ are the autoregressive coefficients, $\theta_1, \ldots, \theta_q$ are the moving average coefficients, and $\varepsilon_t$ is the noise term.

Using these formulas, similarity measures can be calculated to find similar historical series and then used to make predictions using a time series model.

pip install pandas numpy statsmodels scipy scikit-learn

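import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: both files are in long format with the columns 'SeriesID'
# (one id per merchant + warehouse + product combination), 'Date' and 'Demand'.
# 'SeriesID' is a hypothetical column name; adapt it to the real attachments.

# Read the historical data (replace with the path to the historical data file)
historical_data = pd.read_csv('historical_data.csv', parse_dates=['Date'])

# Read the data for the new dimensions (replace with the path to that file)
new_dimension_data = pd.read_csv('new_dimension_data.csv', parse_dates=['Date'])

def to_wide(df):
    # One row per date, one column per series
    wide = df.pivot_table(index='Date', columns='SeriesID', values='Demand', aggfunc='sum')
    return wide.sort_index()

historical_wide = to_wide(historical_data)
new_wide = to_wide(new_dimension_data)

# Similarity between two equal-length numeric arrays (larger means more similar)
def calculate_similarity(series1, series2, method='cosine'):
    if method == 'cosine':
        similarity = cosine_similarity([series1], [series2])
        return similarity[0][0]
    elif method == 'pearson':
        correlation, _ = pearsonr(series1, series2)
        return correlation
    elif method == 'euclidean':
        distance = euclidean(series1, series2)
        return 1 / (1 + distance)
    raise ValueError(f"Unknown method: {method}")

# Find the most similar historical series for each new series
similar_dimensions = {}
for new_id in new_wide.columns:
    new_series = new_wide[new_id].dropna()
    max_similarity = -np.inf
    most_similar_dimension = None
    for historical_id in historical_wide.columns:
        # Compare only the dates on which both series have observations
        pair = pd.concat([new_series, historical_wide[historical_id]], axis=1, join='inner').dropna()
        if len(pair) < 2:
            continue
        similarity = calculate_similarity(pair.iloc[:, 0].to_numpy(), pair.iloc[:, 1].to_numpy())
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar_dimension = historical_id
    similar_dimensions[new_id] = most_similar_dimension

# Placeholder ARIMA orders and forecast horizon: replace with suitable values
p, d, q = 1, 1, 1
n = 15

# Forecast each new series with an ARIMA model fitted to its most similar historical series
predictions = {}
for new_id, historical_id in similar_dimensions.items():
    if historical_id is None:
        continue  # no overlapping history to compare against
    historical_demand = historical_wide[historical_id].dropna()
    model = ARIMA(historical_demand, order=(p, d, q))
    model_fit = model.fit()
    predictions[new_id] = model_fit.forecast(steps=n)

# Print the forecasts
print("Forecasts:")
for new_id, forecast in predictions.items():
    print(f"New dimension {new_id} (based on {similar_dimensions[new_id]}):")
    print(forecast)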

Question 3: There are regular large-scale promotions in June every year, which pose a great challenge to accurate demand forecasting and order fulfillment.

Several methods can be used to forecast demand around large promotions. One of the most common is seasonal decomposition. The relevant formulas and steps are given below:

Seasonal decomposition:

Seasonal decomposition is a technique commonly used in time series analysis to split time series data into trend, seasonal, and residual components. It helps to better understand and predict the impact of large promotions on demand.

Seasonal decomposition formula:

Time series data $Y_t$ can be broken down into a combination of the following three components:
- Trend component $T_t$ : Describes long-term trends in the data, often estimated using techniques such as moving averages.
- Seasonal component $S_t$ : Describes periodic changes in the data, estimated from recurring patterns such as the large annual promotions.
- Residual component $E_t$ : The part that is not explained by trend and seasonality, usually treated as random noise.
The multiplicative decomposition formula is as follows:

$Y_t = T_t \times S_t \times E_t$
Steps of seasonal decomposition:

a. Use a moving average or another method to estimate the trend component $T_t$.

b. Estimate the seasonal component $S_t$, usually with a periodic analysis method such as averaging over cycles or Fourier analysis.

c. Compute the residual component $E_t = \frac{Y_t}{T_t \times S_t}$.
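
Steps a to c can be followed by hand on a small example. The sketch below is a simplified multiplicative decomposition under stated assumptions: a centred moving average for the trend, an odd period (7 days), and a synthetic series built only for illustration. The function name `multiplicative_decompose` is hypothetical, and a library routine would normally be used instead.

```python
import numpy as np
import pandas as pd

def multiplicative_decompose(y: pd.Series, period: int = 7):
    # a. Trend: centred moving average over one full cycle (odd period assumed)
    trend = y.rolling(window=period, center=True).mean()
    # b. Seasonal indices: average detrended ratio for each position in the cycle
    ratio = y / trend
    position = np.arange(len(y)) % period
    indices = ratio.groupby(position).mean()
    indices = indices / indices.mean()  # normalise so the indices average to 1
    seasonal = pd.Series(indices.loc[position].to_numpy(), index=y.index)
    # c. Residual: whatever trend and seasonality leave unexplained
    resid = y / (trend * seasonal)
    return trend, seasonal, resid

# Synthetic daily demand: linear growth multiplied by a fixed weekly pattern
days = pd.date_range("2023-01-02", periods=56, freq="D")
weekly = np.tile([0.8, 0.9, 1.0, 1.0, 1.1, 1.3, 0.9], 8)
y = pd.Series((100 + 0.5 * np.arange(56)) * weekly, index=days)

trend, seasonal, resid = multiplicative_decompose(y)
print(seasonal.iloc[:7].round(3).tolist())
```

The recovered weekly indices should be close to the pattern used to build the series, and the residual stays near 1 wherever the trend is defined. The first and last three days have no trend value because the centred window does not fit there.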

Forecasting:

Once the seasonal decomposition is available, the following formula can be used to predict demand during major promotions:

$Demand_{\text{forecast}} = T_{\text{forecast}} \times S_{\text{large promotion}}$

where:

$Demand_{\text{forecast}}$ is the demand forecast during the major promotion.
$T_{\text{forecast}}$ is the forecast of the trend component during the major promotion.
$S_{\text{large promotion}}$ is the pattern of the seasonal component during the major promotion.
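
A small numeric example of this formula is given below. Every number in it is hypothetical: a linear trend is extrapolated over a five-day promotion window and multiplied by made-up promotion indices, which in practice would come from the decomposition of a past promotion period.

```python
import numpy as np

# Hypothetical inputs: a fitted linear trend and promotion-period seasonal indices
trend_intercept, trend_slope = 200.0, 2.0                # trend level and daily growth
last_observed_day = 100                                  # index of the last historical day
promotion_indices = np.array([1.0, 1.2, 1.5, 2.0, 1.4])  # S for each promotion day

future_days = last_observed_day + 1 + np.arange(promotion_indices.size)
trend_forecast = trend_intercept + trend_slope * future_days  # T_forecast
demand_forecast = trend_forecast * promotion_indices          # T_forecast x S_large promotion

print(demand_forecast)
```

The trend for days 101 to 105 is 402, 404, 406, 408, 410, so the forecast peaks on the fourth day at 816 units, twice the trend level.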

In application, it is necessary to select appropriate trend estimation methods, seasonal estimation methods and models based on specific data and demand to predict demand during large-scale promotions. These formulas and steps provide a basic framework to help forecast demand more accurately.

Seasonal decomposition and demand forecasting often require time series analysis tools, of which the statsmodels library in Python is very useful. The following is a basic example code that demonstrates how to use seasonal decomposition to make accurate demand forecasts.

pip install pandas numpy statsmodels matplotlib

Then you can use the following example code:

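import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Read the historical demand data
data = pd.read_csv('demand_data.csv')
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)

# Build a regular daily series (one row per date assumed);
# days without records are assumed to have zero demand
demand = data['Demand'].sort_index().asfreq('D').fillna(0)

# Seasonal decomposition. period=365 assumes at least two full years of daily data,
# so that the yearly promotion shows up in the seasonal component. The additive model
# is used here; model='multiplicative' matches the formula above but needs strictly
# positive demand.
decomposition = sm.tsa.seasonal_decompose(demand, model='additive', period=365)

# Plot the original data and the trend, seasonal, and residual components
plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(demand, label='Original data')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(decomposition.trend, label='Trend')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(decomposition.seasonal, label='Seasonal')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(decomposition.resid, label='Residual')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

# Forecast demand for the large promotion in June.
# The seasonal pattern of each June day is the average seasonal value
# observed on the same calendar day in the history.
future_dates = pd.date_range('2023-06-01', '2023-06-30', freq='D')
seasonal = decomposition.seasonal
seasonal_by_day = seasonal.groupby([seasonal.index.month, seasonal.index.day]).mean()
seasonal_pattern = np.array([seasonal_by_day.loc[(day.month, day.day)] for day in future_dates])

# Simplifying assumption: the trend stays at its last estimated level
last_trend = decomposition.trend.dropna().iloc[-1]

# Additive forecast for June: trend plus seasonal pattern
forecasted_demand = pd.Series(last_trend + seasonal_pattern, index=future_dates)

# Print the forecast
print("Demand forecast for June 2023:")
print(forecasted_demand)

# Plot the forecast against the history
plt.figure(figsize=(12, 6))
plt.plot(demand, label='Original data')
plt.plot(forecasted_demand, label='Forecast')
plt.legend(loc='upper left')
plt.title('Demand forecast for June 2023')
plt.show()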

The code reads the historical demand data, performs a seasonal decomposition, and then uses the seasonal pattern to predict demand for June 2023.

Through seasonal decomposition and demand forecast, we can draw the following conclusions:

This study uses seasonal decomposition techniques and time series analysis to deal with the demand forecasting and inventory optimization problems faced by e-commerce platforms. Key findings are as follows:

The Importance of Seasonal Decomposition: Seasonal decomposition of historical demand data allows us to identify trends, seasonal components, and residuals in the data. This helps to better understand the cyclicality and regularity of demand, especially for the large-scale promotions that often occur on e-commerce platforms during June every year.
Demand Forecast: Using the results of seasonal decomposition, we are able to predict demand for future time periods. These forecasts will take into account seasonal effects, especially during major promotions, to provide a more accurate estimate of demand. This helps e-commerce platforms better plan inventory and ensure timely fulfillment of customer orders.
Supply chain optimization: Accurate forecasting of demand is crucial to supply chain management. It can help e-commerce companies better adjust inventory levels, reduce inventory costs, and ensure timely delivery of products. By understanding the seasonality and regularity of demand, companies can better respond to large promotions and seasonal demand fluctuations, thereby making the supply chain more efficient.
Continuous Improvement: To maintain accuracy, predictive models need to be constantly monitored and improved. Actual results are affected by a variety of factors, including market changes, new product launches, etc. Therefore, predictive models need to be continuously optimized and adjusted to adapt to changing conditions.

Question B of the 4th MathorCup Big Data Challenge in 2023! Modeling analysis, senior Xiaolu leads the team to guide full code articles and ideas - CSDN