Using Linear Regression to Predict Online Store Sales

Description

The basic situation of the store is as follows: it has been operating for more than a year, and its traffic, orders, and sales have all grown significantly. After a period of observation, the owner found that product sales are closely related to the intensity of advertising. The store advertises on its WeChat official account, on Weibo, and on some other websites. Naturally, the more money invested in promotion, the higher the total sales. The shop owner wondered: "Based on the advertising amounts and product sales recorded in the past, can a machine learning algorithm predict the sales that a given advertising budget will achieve at some point in the future?" Here we use a linear regression model to forecast online store sales.

The main practical contents of this task include:

1. Definition of online store sales regression problem

2. Data collection, analysis and preprocessing

3. How to build a machine learning model

4. How to find the best parameters by gradient descent

5. Realization of linear regression model

Unary (univariate) linear regression model
Multivariate (Multivariate) Linear Regression Models

Environment

Operating system: Windows10, Ubuntu18.04
Tool software: Anaconda3 2019, Python3.7
Hardware environment: no special requirements
Dependency library list
```
scikit-learn	0.24.2
```

Analysis

1. The overall machine learning workflow

Clearly define the problem to be solved: forecasting online store sales.
Data collection and preprocessing, completed in five steps:
- Data collection: the store's advertising and sales records are required
- Data visualization: plot the collected data to take a look at it
- Feature engineering: make the data easier for the machine to process
- Split the dataset into a training set and a test set
- Feature scaling: compress the data values into a smaller interval
Model selection and construction:
- Identify the machine learning algorithm, in this case linear regression
Model training
Model evaluation

2. Data introduction

The weekly advertising amounts and sales figures from the past have been organized in an Excel table and saved as advertising.csv (a comma-delimited file format that is easy to read from Python). Each row records one week's spending on each type of advertising together with that week's product sales.

List of past weekly ad spend and item sales (raw format)

This important data record is the basis for implementing the machine learning project of this experiment.

The WeChat official account advertising amount (hereinafter "WeChat"), the Weibo advertising amount (hereinafter "Weibo"), and the amount spent on other types of advertising (hereinafter "others") are the 3 features (they are also what the shop owner can adjust).
Product sales (hereinafter "sales") is the label (it is also what the shop owner wants to predict).

The advertising amount of each type of advertisement is a feature, so this data set contains 3 features. That is, it is a multiple regression problem.

3. Split the data set into training set and test set

Before starting modeling, the data set needs to be split into two parts: training set and test set. In a normal machine learning project, at least these two data sets should be included, one is used to train the machine to determine the model, and the other is used to test the accuracy of the model. Not only that, but a validation set is often required to add validation before final testing. At present, this problem is relatively simple and the amount of data is small. We have simplified the process and combined the verification and testing links.

These two data sets need to be randomly assigned, and there should be no obvious differences between the two. Therefore, before splitting, pay attention to whether the data has been sorted or classified, and if so, scramble it first.

Use the following code snippet to split the dataset 80% (training set) and 20% (testing set):

# Split the dataset into 80% (training set) and 20% (test set)
from sklearn.model_selection import train_test_split
# X is the feature data and y is the label data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                   test_size=0.2, random_state=0)

The train_test_split function in scikit-learn is a common tool for splitting data sets in machine learning.
test_size=0.2 means that the test set accounts for 20% of the total samples.
If you print the new data sets after splitting (such as X_train and X_test), you will find that the tool has also shuffled the data, because the default value of the shuffle parameter is True.
The random_state parameter controls the randomization of the split. If an integer is specified, it is used as the random seed. Using the same seed every time guarantees the same training set and test set; otherwise the split differs from run to run.
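
The effect of random_state and shuffle described above can be checked with a small sketch. The arrays here are hypothetical stand-ins, not the advertising data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples with one feature each
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2

# Same seed twice: the two splits are identical
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train_a.shape, X_test_a.shape)      # (8, 1) (2, 1)
print(np.array_equal(X_test_a, X_test_b))   # True

# shuffle=False keeps the original order: the last 20% becomes the test set
_, X_test_ordered, _, _ = train_test_split(X, y, test_size=0.2, shuffle=False)
print(X_test_ordered.ravel().tolist())      # [8, 9]
```

With shuffle=False, a data set that was sorted beforehand would put only the largest values in the test set, which is why shuffling matters for sorted data.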

4. Data normalization

Common methods of feature scaling include standardization, data compression (also called normalization), and scaling each sample to unit norm. Feature scaling is particularly important in machine learning: it lets the machine read the data more "comfortably" and train more efficiently.

Here the data is normalized. Normalization is a proportional linear scaling. After normalization the shape of the data distribution is unchanged, but all values fall into a small specific interval, such as 0~1 or -1~+1.

A common normalization formula is as follows:

x' = (x - x_min) / (x_max - x_min)

where x_min and x_max are the minimum and maximum values of the feature.

Data normalization can be achieved through MinMaxScaler in the preprocessing (data preprocessing) tool in the Sklearn library. The core code is as follows:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Normalize the features; data_to_scale is a placeholder for a 2D array
result = scaler.fit_transform(data_to_scale)
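
As a sanity check, min-max normalization, (x - min) / (max - min), can be computed by hand and compared with MinMaxScaler. This is a minimal sketch; the amounts are made-up values, not rows from advertising.csv:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ad amounts, one sample per row and a single feature column
data = np.array([[50.0], [100.0], [200.0], [300.0]])

# Manual min-max normalization, computed column by column
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

# The same scaling with scikit-learn
scaled = MinMaxScaler().fit_transform(data)

print(manual.ravel())                 # values 0, 0.2, 0.6, 1
print(np.allclose(manual, scaled))    # True
```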

5. Linear regression model

In this case, although a straight line through the data does not pass exactly through every point, it already reflects the relationship between the feature (the WeChat official account advertising amount) and the label (product sales), and the fit is quite good.

This simple model is a linear function of one variable (shown below):

One-variable linear function y=ax+b

Among them, the mathematical meaning of parameter a is the slope (steepness) of the line, and b is the intercept (the position where it intersects with the y-axis).
In machine learning the parameter names are slightly different, and the model is written as: y=wx+b

Here, a in the equation becomes w, which in machine learning stands for weight: with multiple variables (multiple features), the larger the w of a feature, the more that feature contributes to the prediction. The parameter b is called the bias.

This simple linear function will repeatedly exert its power as a basic computing unit in the subsequent machine learning process.
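
To make y=wx+b concrete, and to illustrate the gradient descent idea listed among the task contents, here is a minimal NumPy sketch. The data, the learning rate, and the helper names predict and mse are illustrative assumptions; scikit-learn's LinearRegression, used later, solves the least-squares problem directly rather than running this loop.

```python
import numpy as np

def predict(x, w, b):
    """Linear model y = w * x + b."""
    return w * x + b

def mse(x, y, w, b):
    """Mean squared error between predictions and labels."""
    return np.mean((predict(x, w, b) - y) ** 2)

# Hypothetical normalized data generated from y = 0.6 * x + 0.2 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 0.6 * x + 0.2 + rng.normal(0, 0.01, size=100)

w, b = 0.0, 0.0        # initial guess
learning_rate = 0.5    # illustrative value, not tuned
for _ in range(2000):
    error = predict(x, w, b) - y
    w -= learning_rate * 2 * np.mean(error * x)   # step along -dMSE/dw
    b -= learning_rate * 2 * np.mean(error)       # step along -dMSE/db

print(round(w, 2), round(b, 2), mse(x, y, w, b))
```

Each iteration nudges w and b in the direction that reduces the mean squared error, so the loop ends close to the weight and bias used to generate the data.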

Implementation

1. Problem definition

The questions we define are all related to the amount of advertising and product sales, and we hope to find out the answer through machine learning algorithms.

(1) How relevant are the various advertisements and product sales?

(2) What is the relationship between various advertisements and product sales?

(3) Which type of advertisement has the greatest impact on product sales?

(4) Given a specific allocation of advertising amounts, predict future product sales.

The above is the problem we defined. There is obviously a correlation between the amount of advertising and the sales of goods.

Machine learning algorithms discover the relationship between the two by analyzing existing data, that is, they find a function that can infer "that" from "this". This lesson uses regression analysis to find this function. Regression analysis is a statistical method for determining the quantitative relationship between two or more interdependent variables, that is, for studying how the dependent variable changes when the independent variables change. In machine learning, regression is applied when the quantity to predict is continuous (such as passenger flow, rainfall, or sales), so it is well suited to this problem.

2. Data collection and preprocessing

The online store sales data is in advertising.csv; see the "Data introduction" section above for details.

2.1 Data reading and visualization

Create a new notebook file named 1-数据读取和可视化.ipynb ("1 - Data reading and visualization"), then run the following code to read the data into the Python runtime environment:

import numpy as np # NumPy, the numerical toolbox
import pandas as pd # Pandas, the data processing toolbox
# Read the data and show the first few rows to check that the file was read correctly
# Adjust the path if the file is stored elsewhere on your machine
df_ads = pd.read_csv('../dataset/advertising.csv')
df_ads.head()

The variable is named df_ads: df indicates that it is a Pandas DataFrame, and ads is short for advertisement. The output (shown below) confirms that the data has been read into the DataFrame.

Display the first 5 rows of data

2.2 Correlation analysis of data

Correlation analysis is then performed on the data. Through the correlation coefficient, we can understand the correlation between any pair of variables (a, b) in the data set. The correlation coefficient is a value between -1 and 1; a positive value indicates a positive correlation and a negative value a negative correlation. The larger the absolute value, the stronger the correlation. If the correlation coefficient of a and b is 1, then b is an exact increasing linear function of a. If it is 0.9, then b changes markedly as a changes, and in the same direction. If it is 0.3, there is only a weak linear relationship between the two.
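
The following sketch computes the Pearson correlation coefficient by hand and compares it with NumPy's np.corrcoef. The helper name pearson and the arrays are illustrative assumptions. It also shows that a coefficient of 1 means a perfect linear relationship, not that the two variables are equal:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two 1D arrays."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return np.sum(a_centered * b_centered) / np.sqrt(
        np.sum(a_centered ** 2) * np.sum(b_centered ** 2))

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + 1    # perfectly linear in a, but never equal to a
c = -a           # perfectly linear with a negative slope

print(pearson(a, b))   # approximately 1.0
print(pearson(a, c))   # approximately -1.0
print(np.isclose(pearson(a, b), np.corrcoef(a, b)[0, 1]))   # True
```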

In Python, correlation analysis can be implemented with a few lines of code, and can be displayed very intuitively in the form of a heatmap:

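# Import the libraries needed for data visualization
import matplotlib.pyplot as plt # Matplotlib, the Python plotting library
import seaborn as sns # Seaborn, a statistical data visualization library
# Show the pairwise correlations of all features and the label as a heatmap
sns.heatmap(df_ads.corr(), cmap="YlGnBu", annot = True)
plt.show() # plt is short for plot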

Result:

Correlation Heatmap

After running the code, the correlation coefficients between every pair of the 4 variables (3 features plus 1 label) are displayed as a matrix; the higher the correlation, the darker the color. The results suggest that, with a limited budget, spending it on WeChat official account advertising is the most reasonable choice.

The other two kinds of advertising appear to have much less effect on the store's sales.

2.3 Scatter plot of the data

Next, scatter plots are used to show, pair by pair, how product sales relate to each advertising amount, so that we can focus on what matters. In regression analysis, a scatter plot shows the distribution of data points on a Cartesian plane, and it is a very effective data visualization tool.

# Scatter plots of sales against each advertising amount
sns.pairplot(df_ads, x_vars=['wechat', 'weibo', 'others'], 
                          y_vars='sales', 
                          height=4, aspect=1, kind='scatter')
plt.show()

The output result is shown in the figure below:

Scatterplot between item sales and various ad placement amounts

The scatter plot output after the code runs clearly shows the general trend of sales changing with the amount of various advertisements. Based on this information, you can choose an appropriate function to fit the data points.

2.4 Dataset cleaning and normalization

By observing the correlation and scatter plots, it is found that among the three characteristics of this example, the correlation between the amount of WeChat advertising and product sales is relatively high. Therefore, in order to simplify the model, we will temporarily ignore the two groups of characteristics of Weibo advertisements and other types of advertisements, and only leave the data of WeChat advertisements. In this way, multivariate regression analysis is simplified to univariate regression analysis.

The following code reads the WeChat official account advertising amount field of df_ads into a NumPy array X, which drops the other two feature fields, and reads the label into the array y:

X = np.array(df_ads.wechat) # feature set: WeChat advertising is the only feature
y = np.array(df_ads.sales) # label set: sales
print ("Rank of tensor X:",X.ndim)
print ("Shape of tensor X:", X.shape)
print ("Contents of tensor X:", X)

The output shows that the feature set is a 1D tensor, that is, a tensor of rank 1. It contains 200 samples, which are the weekly WeChat advertising amounts.

The shape (200,) denotes a rank-1 tensor with 200 samples, in other words a vector. The array currently holds a single feature. Is this 1D feature tensor in a format that machine learning algorithms accept?

For numerical data sets in regression problems, the canonical input format of a machine learning model is a 2D tensor, that is, a matrix of shape (number of samples, number of features). The rows are the samples and the columns are the features; you can think of it as an Excel table. The shape of the current feature tensor X must therefore be changed from (200,) to (200, 1) before training, using the reshape method. The code is as follows:

# Equivalent explicit form: X = X.reshape((len(X),1)), where len(X) is the number of samples
X = X.reshape(-1,1) # turn the vector into a matrix; -1 lets NumPy infer the number of rows
y = y.reshape(-1,1) # same for the labels
print ("Rank of tensor X:",X.ndim)
print ("Shape of tensor X:", X.shape)

Output result:

Rank of tensor X: 2
Shape of tensor X: (200, 1)

The tensor has now been promoted to a 2D matrix, and each data sample occupies one row of the matrix.

Now the data format has changed from (200,) to (200, 1). There are still 200 numbers, but the structure has changed from a 1D array to a 2D matrix with rows and columns. Again, for common continuous numerical datasets (also called vector datasets), the input feature set is a 2D matrix with two axes.
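
As a side note, a 2D feature matrix can also be taken from the DataFrame directly by selecting the column with double brackets. The sketch below uses a made-up three-row table rather than advertising.csv, so the values are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the advertising table
df = pd.DataFrame({'wechat': [304.4, 1011.9, 1091.1],
                   'sales': [5.7, 7.5, 7.2]})

X_1d = np.array(df.wechat)             # shape (3,): a vector
X_2d = X_1d.reshape(-1, 1)             # shape (3, 1): one row per sample
X_direct = df[['wechat']].to_numpy()   # double brackets select a 2D sub-table

print(X_1d.shape, X_2d.shape, X_direct.shape)   # (3,) (3, 1) (3, 1)
print(np.array_equal(X_2d, X_direct))           # True
```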

2.5 Split the data set into training set and test set

Use the following code snippet to split the dataset 80% (training set) and 20% (testing set):

# Split the dataset into 80% (training set) and 20% (test set)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                   test_size=0.2, random_state=0)

2.6 Normalize the data

Data normalization can be achieved through MinMaxScaler in the preprocessing (data preprocessing) tool in the Sklearn library.

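# Normalize with scikit-learn's MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()   # scaler for the feature
y_scaler = MinMaxScaler() # separate scaler for the label
# Fit on the training set only, then apply the same scaling to the test set
X_train,X_test = scaler.fit_transform(X_train),scaler.transform(X_test)
y_train,y_test = y_scaler.fit_transform(y_train),y_scaler.transform(y_test)
print('Normalized training features:\n',X_train)

Output result:

Normalized training features: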
 [[0.39995488]
 [0.72629521]
 [0.22746071]
 [0.66952402]
 [0.81803143]
 [0.35341003]
 [0.24355215]
 [0.44852996]
 [0.44544703]
 [0.71636965]
 [0.46597489]
 [0.46319272]
 [0.11594857]
 [0.07353936]
 [0.97706594]
 [0.45770359]
 [0.22204677]
 [0.1898639 ]
...

The following code shows the scatter plot after the data is compressed and processed. The shape is exactly the same as the previous plot, but the value has been limited to a smaller interval:

# Use the plot method of matplotlib.pyplot (imported earlier) to draw the scatter plot
plt.plot(X_train,y_train,'r.', label='Training data')
plt.xlabel('Wechat Ads') # x-axis label
plt.ylabel('Sales') # y-axis label
plt.legend() # show the legend
plt.show() # show the figure

Manually draw a linear regression line between x and y (the value has been normalized and compressed from two to three hundred to a relatively small value).

Data preparation, analysis, and simple feature engineering are now complete. What follows is the key part: building and training the machine learning model.

3. Model construction and training

3.1 Determine the linear regression model

The interface for building a linear regression model is LinearRegression(), which is included in the sklearn.linear_model module. After creating the model, you can train it directly with the fit() interface.

3.2 Build the model

Create a linear regression model using the `LinearRegression()` interface:

from sklearn.linear_model import LinearRegression # import the linear regression model
model = LinearRegression() # use the linear regression algorithm

3.3 Model training

Use the fit() interface to train the model. Its parameters are the training set features X_train and the training set labels y_train:

# Train the model
model.fit(X_train, y_train) # fit the function to the training data to determine the parameters
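
After training, the learned weight w and bias b of y=wx+b can be read from the model's coef_ and intercept_ attributes. The sketch below uses hypothetical points that lie exactly on a known line, so the recovered values are easy to check; the names X_demo, y_demo, and demo_model are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 0.5 * x + 0.1
X_demo = np.array([[0.0], [0.25], [0.5], [1.0]])
y_demo = 0.5 * X_demo + 0.1    # shape (4, 1), like the reshaped labels above

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

w = demo_model.coef_[0][0]     # weight; coef_ is 2D because y is 2D
b = demo_model.intercept_[0]   # bias
print(w, b)                    # approximately 0.5 and 0.1
```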

4. Model evaluation and prediction

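Evaluate the model through its score() interface, which for a regression model returns the coefficient of determination R² (1.0 is a perfect fit). The interface parameters are the test set features X_test and the true test set labels y_test.

# Model evaluation
y_pred = model.predict(X_test) # predict y for the test set
print ('True sales (test set)',y_test)
print ('Predicted sales (test set)',y_pred)
print("Linear regression score:", model.score(X_test, y_test)) # evaluate the predictions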

Output result:

True sales (test set) [[ 0.37815126]
 [ 0.90336134]
 [ 0.73529412]
 [ 0.71428571]
 [ 0.14285714]
 [ 0.76890756]
 [ 0.58823529]
 [ 0.79831933]
 [ 0.16806723]
 [ 0.63865546]
 [ 0.74369748]
 [ 0.31092437]
 [ 0.45378151]
 [ 0.65966387]
 [ 0.88235294]
 [ 0.10504202]
 [ 0.42016807]
 [ 0.81512605]
 [ 0.71008403]
 [-0.06722689]
 [ 0.51680672]
 [ 0.56302521]
 [ 0.73529412]
 [ 0.31932773]
 [ 0.96638655]
 [ 0.34033613]
 [ 0.55462185]
 [ 0.31512605]
 [ 0.74369748]
 [ 0.23109244]
 [ 0.28991597]
 [ 0.7394958 ]
 [ 0.39495798]
 [ 0.60084034]
 [ 0.73529412]
 [ 0.70168067]
 [ 0.71428571]
 [ 0.79411765]
 [ 0.36554622]
 [ 0.36134454]]
Predicted sales (test set) [[0.47947693]
 [0.66283604]
 [0.70328437]
 [0.60156726]
 [0.18237553]
 [0.68022783]
 [0.63227619]
 [0.73200566]
 [0.18441286]
 [0.68952001]
 [0.63863661]
 [0.36752352]
 [0.37771013]
 [0.7014955 ]
 [0.75550942]
 [0.18371719]
 [0.47266929]
 [0.62825123]
 [0.66328326]
 [0.1663254 ]
 [0.51878236]
 [0.6711841 ]
 [0.70576891]
 [0.24816617]
 [0.81498932]
 [0.32121168]
 [0.55475852]
 [0.39172294]
 [0.81136189]
 [0.20294753]
 [0.26580641]
 [0.78353504]
 [0.31420527]
 [0.7014955 ]
 [0.58278414]
 [0.59068498]
 [0.72634091]
 [0.68499815]
 [0.34605708]
 [0.29472647]]
Linear regression score: 0.8493759729652093

The results show that the predicted values are close to the true values, with a score of about 0.85, so the model is usable.
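
For a regression model, score() returns the coefficient of determination R². The sketch below recomputes it by hand on hypothetical noisy data (not the advertising data) to show what the number measures:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy linear data standing in for the normalized ad data
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(50, 1))
y_demo = 0.7 * X_demo[:, 0] + 0.1 + rng.normal(0, 0.05, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, demo_model.score(X_demo, y_demo)))   # True
```

A value near 1 means the model explains most of the variance in the label; a value near 0 means it does little better than always predicting the mean.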

Model Forecasting Using the Linear Ridge Regression Algorithm

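from sklearn.linear_model import Ridge # import the ridge regression model
model = Ridge() # use ridge regression (linear regression with an L2 penalty)
model.fit(X_train, y_train) # fit the function to the training data to determine the parameters
y_pred = model.predict(X_test) # predict y for the test set
print("Ridge regression score:", model.score(X_test, y_test)) # evaluate the predictions

Output result:

Ridge regression score: 0.832199701556618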

The prediction results of Ridge and LinearRegression models are basically the same.
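
Ridge differs from LinearRegression only by a penalty on the size of the weights, controlled by the alpha parameter (default 1.0). The sketch below, on hypothetical single-feature data, shows how the weight shrinks as alpha grows; the alpha value of 100 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical normalized single-feature data
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(100, 1))
y_demo = 0.6 * X_demo[:, 0] + 0.2 + rng.normal(0, 0.05, size=100)

ols = LinearRegression().fit(X_demo, y_demo)
ridge_default = Ridge(alpha=1.0).fit(X_demo, y_demo)    # alpha=1.0 is the default
ridge_strong = Ridge(alpha=100.0).fit(X_demo, y_demo)   # much stronger penalty

# The penalty shrinks the weight toward zero as alpha grows
print(ols.coef_[0], ridge_default.coef_[0], ridge_strong.coef_[0])
```

With a mild penalty the ridge weight stays close to the ordinary least-squares weight, which is consistent with the two models scoring similarly.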

5. Realize Multiple Linear Regression

Multivariate means multiple variables, that is, multi-dimensional features. Our original data already has 3 features; only to simplify the teaching did we select the WeChat official account advertising amount as the sole feature and build a univariate linear regression model. We now use all three features.

Read the data

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
df_ads = pd.read_csv('advertising.csv')
X = np.array(df_ads) # this time keep all fields when building the features: wechat, weibo, others
X = np.delete(X,[3],axis=1) # drop the label column (index 3)
y = np.array(df_ads.sales) # label set: sales
print('Rank of tensor X:',X.ndim)
print('Shape of tensor X:',X.shape)

The output result is the basic information of the tensor, which can be found to be a two-dimensional matrix, containing 200 rows and 3 columns.

Rank of tensor X: 2
Shape of tensor X: (200, 3)

Dataset splitting and preprocessing

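from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                   test_size=0.2, random_state=0) # split the dataset
scaler = MinMaxScaler()                 # new scaler for the three features
X_train = scaler.fit_transform(X_train) # fit on the training features and scale them
X_test = scaler.transform(X_test)       # scale the test features with the training min/max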

Construct Multiple Linear Regression

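lr = LinearRegression() # linear regression
lr.fit(X_train,y_train) # fit the model (scikit-learn solves the least-squares problem directly)
score = lr.score(X_test,y_test) # R² on the test set
print("scikit-learn linear regression score: {:.2f}%".format(score*100))

The output is the model's score on the test set:

scikit-learn linear regression score: 85.86%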

Prediction

When defining the problem, we asked: in some future week, if we allocate specific advertising amounts (for example, 250 yuan, 50 yuan, and 50 yuan for WeChat, Weibo, and others), roughly how much will product sales be?

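X_plan=[[250,50,50]] # feature values to predict for: wechat, weibo, others
X_plan=scaler.transform(X_plan) # the input must be scaled the same way as the training data
pred = lr.predict(X_plan)
print(pred[0],"thousand yuan")

The forecasted online store sales are:

8.48645854453084 thousand yuan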

According to this model, the expected product sales are about 8,500 yuan.
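
The question of which type of advertisement has the greatest impact can be explored by comparing the learned weights once the features share a common scale. This is a sketch on synthetic data whose true weights are chosen by hand (the first channel is made the strongest on purpose); it is not a result from advertising.csv:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: three ad channels, with the first one mattering most
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 300, size=(200, 3))
y_demo = (0.03 * X_raw[:, 0] + 0.01 * X_raw[:, 1] + 0.001 * X_raw[:, 2]
          + rng.normal(0, 0.5, size=200))

X_scaled = MinMaxScaler().fit_transform(X_raw)   # put the features on the same scale
demo_lr = LinearRegression().fit(X_scaled, y_demo)

names = ['wechat', 'weibo', 'others']
for name, weight in zip(names, demo_lr.coef_):
    print(name, round(weight, 3))

strongest = names[int(np.argmax(np.abs(demo_lr.coef_)))]
print('largest weight:', strongest)
```

Because the features were normalized to the same range before fitting, the magnitudes of the weights can be compared directly.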