[Machine Learning] Linear Regression 2: Basic Implementation (Comparing the Single-Feature and Multi-Feature Cases)

1. Linear Model Construction

1. Initialize the model

import numpy as np
from utils.features import prepare_for_training  # import the preprocessing utilities

class LinearRegression:

    def __init__(self, data, labels, polynomial_degree=0, sinusoid_degree=0, normalize_data=True):
        """
        1. Preprocess the data
        2. Record the feature parameters
        3. Initialize the parameter matrix
        """
        # Preprocessing: returns the processed data plus the mean and standard
        # deviation used for normalization
        (data_processed,
         features_mean,
         features_deviation) = prepare_for_training(data,
                                                    polynomial_degree=polynomial_degree,
                                                    sinusoid_degree=sinusoid_degree,
                                                    normalize_data=normalize_data)

        self.data = data_processed
        self.labels = labels
        self.features_mean = features_mean
        self.features_deviation = features_deviation
        self.polynomial_degree = polynomial_degree
        self.sinusoid_degree = sinusoid_degree
        self.normalize_data = normalize_data

        # One theta entry per column (feature) of the processed data
        num_features = self.data.shape[1]
        self.theta = np.zeros((num_features, 1))

2. Model training function

    def train(self, alpha, num_iterations=500):
        """
        Training module: run gradient descent
        :param alpha: learning rate
        :param num_iterations: number of iterations
        :return: the learned theta and the cost history
        """
        cost_history = self.gradient_descent(alpha, num_iterations)  # cost at each iteration
        return self.theta, cost_history

    def gradient_descent(self, alpha, num_iterations=500):
        """
        Iteration module: perform num_iterations gradient steps
        :param alpha:
        :param num_iterations:
        :return: list of cost values, one per iteration
        """
        cost_history = []
        for _ in range(num_iterations):
            self.gradient_step(alpha)
            cost_history.append(self.cost_function(self.data, self.labels))

        return cost_history
    def gradient_step(self, alpha):
        """
        Parameter update for one step of gradient descent.
        Note that this is a matrix operation.
        """
        num_examples = self.data.shape[0]
        prediction = LinearRegression.hypothesis(self.data, self.theta)

        delta = prediction - self.labels  # (m, 1) residuals

        theta = self.theta
        # np.dot(delta.T, self.data) has shape (1, n); transpose it back to (n, 1) to match theta
        theta = theta - alpha * (1 / num_examples) * (np.dot(delta.T, self.data)).T
        self.theta = theta
    @staticmethod
    def hypothesis(data, theta):
        """
        Prediction function
        :param data:
        :param theta:
        :return:
        """
        # np.dot is a matrix product here: (m, n) data times (n, 1) theta -> (m, 1) predictions
        predictions = np.dot(data, theta)
        return predictions
    def get_cost(self, data, labels):
        """
        Get the current cost on (data, labels)
        :param data:
        :param labels:
        :return:
        """
        data_processed = prepare_for_training(data,
                                              self.polynomial_degree,
                                              self.sinusoid_degree,
                                              self.normalize_data,
                                              )[0]

        return self.cost_function(data_processed, labels)
    def cost_function(self, data, labels):
        """
        Cost computation
        :param data:
        :param labels:
        :return:
        """
        num_examples = data.shape[0]
        delta = LinearRegression.hypothesis(data, self.theta) - labels
        # The cost formula from the previous article (mean squared error with a 1/2 factor)
        cost = (1 / 2) * np.dot(delta.T, delta) / num_examples
        return cost[0][0]
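
For reference, the update in gradient_step and the cost in cost_function correspond to the standard batch gradient descent formulas, where $X$ is the processed data matrix, $y$ the labels, $m$ the number of examples, and $\alpha$ the learning rate:

$$\theta := \theta - \frac{\alpha}{m} X^\top (X\theta - y), \qquad J(\theta) = \frac{1}{2m}(X\theta - y)^\top (X\theta - y)$$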

3. Predictive Model

    def predict(self, data):
        """
                    用训练的参数模型,与预测得到回归值结果
        """
        data_processed = prepare_for_training(data,
                                              self.polynomial_degree,
                                              self.sinusoid_degree,
                                              self.normalize_data
                                              )[0]

        predictions = LinearRegression.hypothesis(data_processed, self.theta)

        return predictions

2. Single-Feature Variable Model

Goal: Predict happiness score through GDP

1. Load the dataset

The dataset for this model gives a happiness score for each country, together with a number of per-country features.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"] = ["SimHei"]  # set a font that can render Chinese characters
plt.rcParams["axes.unicode_minus"] = False    # render minus signs in plots correctly
from line import LinearRegression
data = pd.read_csv('./data/world-happiness-report-2017.csv')

2. Get training and test data

# Get the training and test data
train_data = data.sample(frac=0.8)
test_data = data.drop(train_data.index)

input_param_name = 'Economy..GDP.per.Capita.'
out_param_name = 'Happiness.Score'

x_train = train_data[[input_param_name]].values
y_train = train_data[[out_param_name]].values

x_test = test_data[input_param_name].values
y_test = test_data[out_param_name].values

Training and test data are usually split 7:3 or 8:2.
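
If you want the split to be reproducible across runs, pandas' sample accepts a random_state seed; a minimal variant of the split above (the seed value is arbitrary):

train_data = data.sample(frac=0.8, random_state=42)  # fix the seed so the 8:2 split is repeatable
test_data = data.drop(train_data.index)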

3. Draw a scatter plot to observe the distribution of the data set

plt.scatter(x_train, y_train, label='Train data')
plt.scatter(x_test, y_test, label='Test data')
plt.xlabel(input_param_name)
plt.ylabel(out_param_name)
plt.title('Country happiness')
plt.legend()
plt.show()

[Figure: scatter plot of the training and test data]
A linear relationship is clearly visible, so next we train the model.

4. Training

We set the number of iterations to 500 and the learning rate to 0.01

num_iterations = 500
learning_rate = 0.01  # learning rate

linear_regression = LinearRegression(x_train, y_train)
(theta, cost_history) = linear_regression.train(learning_rate, num_iterations)

Run the training to get the loss history, then print the loss at the beginning and at the end.
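
The printing code is omitted here; a minimal sketch, matching what the multi-feature section does below:

print('Initial loss:', cost_history[0])
print('Loss after training:', cost_history[-1])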
[Figure: console output showing the initial and final loss]
The final loss is much smaller than the initial one; the smaller the loss, the better the fit.

Plotting the loss values shows that the loss declines gradually and then levels off.
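
The plotting code is also omitted in this section; a minimal sketch, analogous to the loss curve drawn in the multi-feature section:

plt.plot(range(1, num_iterations + 1), cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Training loss')
plt.show()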
[Figure: loss curve over the 500 iterations]

5. Prediction results

Generate 100 evenly spaced input values over the training range and get their predictions

predictions_num = 100
# 100 evenly spaced values spanning the training range, reshaped into a (100, 1) column
x_predictions = np.linspace(x_train.min(), x_train.max(), num=predictions_num).reshape(predictions_num, 1)
y_predictions = linear_regression.predict(x_predictions)

Show the predictions together with the scatter plots

plt.scatter(x_train, y_train, label='Train data')
plt.scatter(x_test, y_test, label='Test data')
plt.plot(x_predictions, y_predictions, 'r', label='Prediction')
plt.xlabel(input_param_name)
plt.ylabel(out_param_name)
plt.title('Happiness prediction')
plt.legend()
plt.show()

[Figure: training and test points with the fitted regression line]
The predictions lie along the fitted line: happiness is linearly related to GDP, and the higher the GDP, the higher the happiness score. But this is only one feature; below we train with two features.

3. Multi-Feature Variable Model

Goal: predict happiness from GDP and freedom

Here we use plotly for plotting (pip install plotly).
It produces very attractive interactive visualizations; if you are interested, it is worth learning about.

1. Load data

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import plotly.offline
import matplotlib.pyplot as plt
import plotly.graph_objs as go


plt.rcParams["font.sans-serif"] = ["SimHei"]  # set a font that can render Chinese characters
plt.rcParams["axes.unicode_minus"] = False    # render minus signs in plots correctly
from line import LinearRegression

data=pd.read_csv('./data/world-happiness-report-2017.csv')

2. Get training data and test data

  • Almost the same operation as above
train_data = data.sample(frac=0.8)
test_data = data.drop(train_data.index)

input_param_name1 = 'Economy..GDP.per.Capita.'
input_param_name2 = 'Freedom'
input_param_name3 = 'Health..Life.Expectancy.'

out_param_name = 'Happiness.Score'

x_train = train_data[[input_param_name1, input_param_name2]].values
y_train = train_data[[out_param_name]].values

x_test = test_data[[input_param_name1, input_param_name2]].values
y_test = test_data[out_param_name].values

3. Draw an interactive 3D scatter plot


plot_training_trace = go.Scatter3d(
    # x_train[:, 0] selects the first feature column (all rows); flatten() turns it into a 1-D array
    x=x_train[:, 0].flatten(),
    y=x_train[:, 1].flatten(),
    z=y_train.flatten(),
    name='Training set',
    mode='markers',
    marker={
        'size': 9,
        'opacity': 0.9,
        'line': {
            'color': 'rgb(255,255,255)',
            'width': 1
        }
    }
)

plot_test_trace = go.Scatter3d(
    x=x_test[:, 0].flatten(),
    y=x_test[:, 1].flatten(),
    z=y_test.flatten(),
    name='Test set',
    mode='markers',
    marker={
        'size': 9,
        'opacity': 1,
        'line': {
            'color': 'rgb(255,255,255)',
            'width': 1
        }
    }
)
# Layout
plot_layout = go.Layout(
    title='Data set',
    scene={
        'xaxis': {'title': input_param_name1},
        'yaxis': {'title': input_param_name2},
        'zaxis': {'title': out_param_name}
    },
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0}
)

plot_data = [plot_training_trace, plot_test_trace]
plot_figure = go.Figure(data=plot_data, layout=plot_layout)
plotly.offline.iplot(plot_figure)  # iplot embeds in a notebook; plotly.offline.plot opens the figure in a browser

[Figure: interactive 3D scatter plot of the training and test data]

  • The blue points are the training data and the red points are the test data
  • The data can be inspected from different dimensions by rotating the plot
  • In general, the higher the GDP and the freedom score, the higher the happiness

4. Data training

num_iterations = 500
learning_rate = 0.01
linear_regression = LinearRegression(x_train, y_train)
(theta, cost_history) = linear_regression.train(alpha=learning_rate, num_iterations=num_iterations)

print('Initial loss:', cost_history[0])
print('Loss after training:', cost_history[-1])

[Figure: console output showing the initial and final loss]

Compared with the single-feature model, the loss after training on two features is smaller, which suggests a more reliable fit.
Next, draw the loss curve.


plt.plot(range(1, num_iterations + 1), cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Training loss')
plt.show()

[Figure: loss curve over the 500 iterations]

5. Prediction results

For multi-dimensional inputs, np.hstack() can combine two or more arrays column-wise into a new array.
We generate 100 evenly spaced values along each axis of the training range, enumerate all 100 × 100 combinations into two (10000, 1) column vectors, and then use np.hstack() to build a (10000, 2) matrix.

predictions_num = 100

x_min = x_train[:, 0].min()
x_max = x_train[:, 0].max()
y_min = x_train[:, 1].min()
y_max = x_train[:, 1].max()
x_axis = np.linspace(x_min, x_max, predictions_num)
y_axis = np.linspace(y_min, y_max, predictions_num)

# Enumerate every (x, y) combination on the 100 x 100 grid
x_predictions = np.zeros((predictions_num * predictions_num, 1))
y_predictions = np.zeros((predictions_num * predictions_num, 1))
x_y_index = 0
for x_index, x_value in enumerate(x_axis):
    for y_index, y_value in enumerate(y_axis):
        x_predictions[x_y_index] = x_value
        y_predictions[x_y_index] = y_value
        x_y_index += 1

# np.hstack() concatenates arrays horizontally (column-wise) into a new array
z_predictions = linear_regression.predict(np.hstack((x_predictions, y_predictions)))
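
As an aside, the double loop above can be replaced with np.meshgrid, which enumerates the same grid in a vectorized way; an equivalent sketch, not part of the original code:

# Vectorized equivalent of the double loop above
xx, yy = np.meshgrid(x_axis, y_axis, indexing='ij')
grid = np.hstack((xx.reshape(-1, 1), yy.reshape(-1, 1)))  # shape (10000, 2)
z_predictions = linear_regression.predict(grid)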

After obtaining the predictions, we add them to the interactive scatter plot to inspect the fit.

plot_predictions_trace = go.Scatter3d(
    x=x_predictions.flatten(),
    y=y_predictions.flatten(),
    z=z_predictions.flatten(),
    name='Prediction Plane',
    mode='markers',
    marker={
        'size': 1,
    },
    opacity=0.8,
    surfaceaxis=2,  # fill a surface through the points along the z axis
)

plot_data = [plot_training_trace, plot_test_trace, plot_predictions_trace]
plot_figure = go.Figure(data=plot_data, layout=plot_layout)
plotly.offline.iplot(plot_figure)

[Figure: fitted prediction plane together with the training and test points]
Looking at the fitted plane, the predicted results agree well with the actual values.

Summary

That is about it for this post. We compared the loss values for a single feature and for multiple features: with more informative features, the training loss after fitting is smaller and the predictions fit the data better.
If there is a problem above, please correct me.
I hope you will support this series; let's study hard together, and I will share more novel and interesting things in the future.


Origin blog.csdn.net/qq_61260911/article/details/129911794