[Review and Sharing] Question B of the 11th Teddy Cup: Data Analysis and Demand Forecast of Product Orders

It has been more than two months since the competition ended.

The whole process was quite hard. In the early stage the team was mostly laying groundwork and learning; once all the data was released, our energy went almost entirely into modeling.

We received an invitation to the defense, but unfortunately the judges' questions were off the mark and we failed to get our innovations across. In the end we missed the special prize.

Since a summary of our work should be helpful to students preparing for similar competitions, here is a review.

Using Prophet to forecast each product one by one is definitely the wrong approach, and the training time is far too long. The reasonable way is to first integrate the data into a structured dataset and then apply machine learning.

Topic

Task 1: Data Analysis

For the provided historical sales data (order_train1.csv), in-depth data analysis is required. Analysis topics include but are not limited to:

1.1 The impact of different product prices on demand
1.2 The impact of sales region on demand, and the characteristics of product demand in different regions
1.3 The characteristics of product demand under different sales channels (online and offline)
1.4 The impact of different product categories on demand
1.5 The characteristics of product demand in different periods of the month (beginning, middle, and end of the month)
1.6 The impact of holidays on product demand
1.7 The impact of promotions (such as "618" and "Double Eleven") on product demand
1.8 The impact of seasonal factors on product demand

Task 2: Demand Forecasting

Based on the above analysis, a mathematical model needs to be established to predict the monthly demand for the given products (predict_sku1.csv) over the next 3 months (i.e., January, February, and March 2019). The prediction results must be saved to the file result1.xlsx in the given format.

Please also make predictions at daily, weekly, and monthly granularity, and try to analyze how the different forecasting granularities may affect prediction accuracy.

First Question

The first question is exploratory data analysis; there isn't much to say. These days you can just prompt ChatGPT and lightly tweak the output to get decent charts.

The intent may be that the analysis in the first question informs the modeling in the second, and it does look good in the paper, but honestly it is of little use: the second question comes down to experience with things like feature engineering. So the first question is not the key point; here are a few of our charts, without going into detail.

  • Scatterplot of price and quantity demanded

  • Trend chart of offline/online order demand over time

  • Double-ring (nested donut) chart of the share of demand for each category/sub-category

  • Bubble chart of monthly demand for various products

  • Line chart of product demand in different time periods (beginning, mid-month, and end-of-month)

  • Offline/Online Sales Trends

  • Two-way histogram of the categories of top 50 promotional products during "6.18" and "Double Eleven"

Second Question

The second question is about predicting accurately, which tests both learning and coding ability. At the time I read through several sales-forecasting competition solutions, mainly from Kaggle, and adapted them step by step. Once the Baseline is set up you have a first prediction result, and you can then add your own ideas incrementally.

Everything below was worked out step by step after the Baseline was in place, so the narrative jumps around a bit.

Some links (many of which I can no longer find):

Detailed EDA and Random Forest

1st place solution - Part 1 - “Hands on Data”

2.1 Data preprocessing

  • Missing value handling

  • Outlier detection

    • For products with detected outliers:
    • Those that appear in the prediction set are modeled individually (manual forecasts)
    • Products in the prediction set are no longer simply deleted
  • Convert categorical data to numeric

    • Sales channels
    • Product number/product category/sales area
  • Because the sales data fluctuates heavily, we considered two metric options.

    • Smooth the label by taking its logarithm, and evaluate with RMSE
    • Leave the label untransformed, and evaluate with Tweedie deviance
  • Without one of these treatments, evaluating forecast accuracy with plain RMSE is bound to be problematic.

    For example, take a 5-yuan pen that sells about 5,000 units a month with a forecast deviation of 100, and a 2,000-yuan watch that sells about 500 units a month, also with a forecast deviation of 100. Under RMSE the two errors count the same, but the watch's deviation is clearly the bigger problem. Tweedie deviance resolves this.

    Of course, if you log-transform the label first, RMSE also works.

    You can choose either of the two; in the end we used the latter (Tweedie deviance).
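The pen-versus-watch argument above can be checked numerically. Below is a minimal sketch using the standard unit Tweedie deviance formula; the power p = 1.5 is an assumption on my part (any value strictly between 1 and 2 gives a compound Poisson-gamma deviance), not necessarily what our actual code used:

```python
import math

def tweedie_deviance(y, mu, p=1.5):
    """Unit Tweedie deviance for 1 < p < 2 (compound Poisson-gamma)."""
    return 2 * (y ** (2 - p) / ((1 - p) * (2 - p))
                - y * mu ** (1 - p) / (1 - p)
                + mu ** (2 - p) / (2 - p))

# Pen: ~5000 units/month, forecast off by 100 units.
# Watch: ~500 units/month, forecast also off by 100 units.
pen_err = abs(5000 - 4900)    # 100 -> identical absolute error,
watch_err = abs(500 - 400)    # 100 -> so identical under RMSE

pen_dev = tweedie_deviance(5000, 4900)
watch_dev = tweedie_deviance(500, 400)
print(pen_dev, watch_dev)     # the watch's deviance is far larger
```

The deviance scales the error against the level of the series, so the watch's 20% miss is penalized much more heavily than the pen's 2% miss.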

2.2 Dataset analysis

2.2.1 Training set

Here we carried out a very detailed analysis of the data. By examining the trend of every product in every category individually, we found many characteristics. Although most of them went unused for lack of time, this is an important step in real business forecasting: you need a detailed understanding of the dataset in order to handle it in a targeted way.

To list a few points:

  1. 403/404/405: initially online only; offline sales were added from 2017
  2. 406: offline, small-scale orders; in March 2018 it moved from region 105 to other regions
  3. 407: the sales trend shows multiple small peaks, with a seasonal pattern
  4. 411: launched in November 2017
  5. Sales in region 104 ceased from 2017; most of its products were transferred to region 105, and we wrote functions to carry out this data migration
  6. Some products show online sales leading offline sales: if a product's online sales rise, it will very likely also rise offline the next month
  • The data was aggregated monthly for feature engineering and machine learning
    • Consolidate the demand for each product by region and month
    • Build a structured dataset combining sales region, sales month, product, and related information

  • We then adopted a more useful strategy: product stratification. The idea came from marketing-course advertisements: products of different natures and positioning must follow different sales patterns, hence the classification
    • New: products that did not appear on the market until month 36 (date_block_num).
    • Meteor: products that appear suddenly but sell for no more than 5 months before sales drop sharply.
    • Sleeping: products with previously considerable sales that plummeted after a certain point in time, for reasons that are not seasonal.
    • Regular: products that always sell; on sale for more than 39 weeks, or on the market for at least one year.
  • There should in fact also be a Seasonal category, but most such products had less than two years of history, so the algorithm could not identify them and we gave this up
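The monthly aggregation step described above is, at heart, a groupby. A minimal sketch with pandas (the column names here are my own assumptions for illustration, not the competition's actual schema):

```python
import pandas as pd

# Toy order-level data in the spirit of order_train1.csv
# (column names are assumed for illustration).
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2018-01-03", "2018-01-20", "2018-02-05", "2018-02-07"]),
    "sales_region": [105, 105, 105, 101],
    "item_code": [403, 403, 403, 406],
    "ord_qty": [10, 5, 7, 2],
})

orders["month"] = orders["order_date"].dt.to_period("M")

# Consolidate demand per region, product and month into a structured table.
monthly = (orders.groupby(["sales_region", "item_code", "month"],
                          as_index=False)["ord_qty"].sum())
print(monthly)
```

Each row of `monthly` is then one training example (region × product × month), onto which lag, trend, and calendar features can be attached.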

2.2.2 Prediction Set

  • We then wrote a classification function to stratify the products in the prediction set and see which kinds of products had to be predicted

Most turned out to be regular products, though the proportion of new products was not small either. After setting up the Baseline we ran an error analysis (described later; it traces where the prediction error comes from) and found that many new products, along with some highly volatile products, had very large prediction deviations, so we built a separate new-product model.
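A hypothetical sketch of such a classification function, following the stratification definitions above (the month-36 cutoff and 5-month lifespan come from those definitions; the 10% "plummet" threshold for sleeping products is my own assumption):

```python
def classify_product(monthly_sales, first_month, last_month, current_month=36):
    """Stratify a product from its monthly sales history.

    monthly_sales: monthly demand values, oldest first.
    first_month / last_month: date_block_num of the first and last sale.
    """
    lifespan = last_month - first_month + 1
    if first_month >= current_month:
        return "new"        # only appeared from month 36 onward
    if lifespan <= 5 and last_month < current_month:
        return "meteor"     # a brief burst of sales, then gone
    if monthly_sales[-1] < 0.1 * max(monthly_sales):
        return "sleeping"   # considerable sales, then a collapse
    return "regular"        # keeps selling month after month
```

Running this over every product in the prediction set yields the stratum counts mentioned above (mostly regular, with a sizable share of new products).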

2.3 Feature Engineering

Feature engineering is the most important part and determines the model's final prediction accuracy. The conventional features are lag features, trend features, and so on. Keep adding new features, keep training the model to verify their effect, and finally delete the useless ones.

  • Remember not to leak data: never introduce future information when building features. For example, a trend feature should be the change from the month before last to last month, not from last month to this month, because this month's data is what we are forecasting
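The leakage-safe construction can be sketched with pandas (column names are assumed): when predicting month t, the trend feature only uses months t-2 and t-1.

```python
import pandas as pd

monthly = pd.DataFrame({
    "item_code": [403] * 4,
    "month_num": [0, 1, 2, 3],
    "ord_qty":   [100, 120, 90, 110],
}).sort_values(["item_code", "month_num"])

g = monthly.groupby("item_code")["ord_qty"]
monthly["lag_1"] = g.shift(1)    # demand last month (t-1)
monthly["lag_2"] = g.shift(2)    # demand the month before last (t-2)
# Trend from t-2 to t-1; the current month's target is never touched.
monthly["trend"] = monthly["lag_1"] / monthly["lag_2"] - 1
print(monthly)
```

The `shift` per product group guarantees that each row sees only strictly earlier months, which is exactly the no-leakage rule above.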

2.4 Model establishment

2.4.1 Model framework and evaluation indicators

  • The problem statement is rather outrageous in requiring separate modeling and forecasting by day/week/month. In practice, doing a good job at the monthly level is already enough, because otherwise you would have to build three full sets of features, which is infeasible.

    • Our solution was to keep optimizing the monthly forecast and, for the daily/weekly forecasts, simply run Prophet. In the process, however, we found that Prophet can do more than predict: it can also extract seasonal features.

    • Because the features we had built actually lacked seasonality, we incorporated the features extracted by Prophet and found that the effect was really good.
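In our pipeline the seasonal features came from Prophet's fitted components. As a dependency-free stand-in, the same idea can be sketched as a month-of-year seasonal index computed on history only; this is a deliberate simplification of Prophet's yearly component, not our actual code, and the toy data is assumed:

```python
import pandas as pd

hist = pd.DataFrame({
    "month_num": list(range(24)),                       # two years of history
    "ord_qty":   [100, 80, 90, 120, 150, 160, 200, 180,
                  120, 110, 300, 250] * 2,              # assumed year-end peak
})
hist["month_of_year"] = hist["month_num"] % 12

# Seasonal index: average demand in that calendar month / overall average.
seasonal_index = (hist.groupby("month_of_year")["ord_qty"].mean()
                  / hist["ord_qty"].mean())

# Merge back as a feature column for the learner.
hist["season_feat"] = hist["month_of_year"].map(seasonal_index)
print(seasonal_index.round(2))
```

Whether the index comes from Prophet's yearly curve or a simple historical average, the point is the same: the tree model gets an explicit seasonality signal it cannot easily learn from short lag windows alone.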

2.4.2 Model establishment

  • For model selection, our Baseline used LightGBM, because it trains fastest, which made continuous iteration convenient
    • In the end, three gradient-boosted tree algorithms (LightGBM, CatBoost, XGBoost) were fused
    • How to put it: fusion certainly improves the score, but it also invites overfitting. It doesn't have to be that complicated; a single model may well work best
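The fusion itself was a simple combination of the three models' predictions. A sketch of weighted averaging with numpy (the prediction arrays and weights are placeholders; in practice the weights would be tuned on a held-out validation month):

```python
import numpy as np

# Predictions from three hypothetical boosted-tree models for the same items.
pred_lgbm = np.array([120.0, 45.0, 300.0])
pred_cat  = np.array([110.0, 50.0, 280.0])
pred_xgb  = np.array([130.0, 40.0, 310.0])

# Placeholder validation-tuned weights; they must sum to 1.
w = np.array([0.5, 0.3, 0.2])
fused = w[0] * pred_lgbm + w[1] * pred_cat + w[2] * pred_xgb
print(fused)
```

Averaging tends to cancel the individual models' idiosyncratic errors, which is where the score gain comes from, but it also makes the result harder to interpret and debug.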

2.4.3 Error analysis and feature selection

  • Error analysis
    • Helps a lot before retraining
    • Re-predict the products with large errors and overlay the new predictions onto the original model's output
  • Feature screening
    • Eliminate useless features

2.4.4 New Product Model

  • For new products, we used a sliding window to extract each month's new products, forming the training and prediction sets of the new-product model
  • We also redid the feature engineering: new products have no historical data, so prediction can only rely on information from similar products, and our features were built in that direction
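The sliding-window extraction can be sketched as follows: group products by their debut month, and let each debut month's products form one slice of the new-product training set (column names and the month-36 cutoff match the stratification above; the toy data is assumed):

```python
import pandas as pd

monthly = pd.DataFrame({
    "item_code": [1, 1, 1, 2, 2, 3],
    "month_num": [34, 35, 36, 36, 37, 38],
    "ord_qty":   [10, 12, 11, 5, 8, 4],
})

# Debut month of each product; "new" means first seen from month 36 on.
debut = monthly.groupby("item_code")["month_num"].min()

# Sliding window: one slice per debut month, holding the products
# that first appeared in that month.
slices = {m: debut.index[debut == m].tolist()
          for m in sorted(debut[debut >= 36].unique())}
print(slices)
```

Each slice then gets similarity-based features (same category, same region, comparable price band) rather than the product's own nonexistent history.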

2.5 Model Fusion

After comparison, we chose model fusion.

Again, a more complex model does not necessarily predict better in reality, but this kind of work is needed for the presentation in the paper.

2.6 Prediction method

We also tested three prediction methods, since the problem requires forecasting the next three months.

Direct forecasting and rolling forecasting should be easy to understand.

Lagged forecasting requires redoing the features: for example, when forecasting month M+2 sales, we cannot use month M+1 data as features.
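The difference between the methods can be sketched with a toy one-step model (the `model` functions here are placeholders standing in for the real LightGBM pipeline, purely to show the data flow):

```python
# Toy history: last observed months of demand for one product.
history = [100, 110, 120]

def model(last_value):
    """Placeholder one-step model: next month ~= 1.1x last month."""
    return last_value * 1.1

# Rolling (recursive) forecast: feed each prediction back in as a feature.
rolling, last = [], history[-1]
for _ in range(3):                      # months M+1, M+2, M+3
    last = model(last)
    rolling.append(last)

# Direct forecast: one model per horizon, each conditioned only on
# truly observed data -- when forecasting M+2, month M+1 is not used.
def model_h(last_observed, horizon):
    """Placeholder horizon-specific model."""
    return last_observed * (1.1 ** horizon)

direct = [model_h(history[-1], h) for h in (1, 2, 3)]
```

Rolling forecasts compound their own errors across horizons, while direct forecasts need separate feature sets per horizon, which is exactly why the lag features must be rebuilt.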

2.7 Summary

Closing

First, about the competition problem: the data quality was not great, which was a headache early on (perhaps a common issue with sales data), and the second question's requirement to forecast separately at daily/weekly/monthly granularity is hard to make sense of. As for the judges: I personally think any team that reached the defense would have trained on the whole dataset with machine learning or deep learning, and the judges should have focused on the innovations in our work. But they seemed unable to follow, asking how we could possibly use Prophet without training product by product; they apparently found it hard to understand how machine learning predicts every product at once. What we achieve is a global optimum rather than a per-product optimum, which has nothing to do with whether Prophet is used (we only used Prophet to extract features product by product, while the overall pipeline was continuously optimized with LightGBM).

In addition, this competition required submitting both the paper and the forecast data (for January, February, and March 2019); the actual January-March data was released the day after submission, and we were then asked to predict April, May, and June. That was over the May Day holiday. That morning I found that actual January sales were very high, roughly 2 to 3 times our overall forecast, and realized May might be high too, so I changed the code again, summarized the monthly sales characteristics of each product category, and spent another day re-predicting. I believe the result should be good. Making reasonable use of tricks to improve prediction accuracy is also an essential part of winning!

Finally, thanks: to my two teammates for their hard work, to my partner for drawing figures and for the understanding during the competition, to the seniors for their help and defense guidance, and to my advisor. I hope this summary helps others.


Origin blog.csdn.net/weixin_57345774/article/details/131883066