Python machine learning and data mining: linear regression

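Note:
    Posts in this series only cover the practical use of common machine learning algorithms and do not go into the theory behind them. The theory will be covered separately in the machine learning column. Learn how to use the algorithms first; the theory can come gradually.
References:
    python数据挖掘与机器学习实战 (Python Data Mining and Machine Learning in Practice). 方魏. 机械工业出版社 (China Machine Press). 2019.05
    机器学习基础:从入门到求职 (Machine Learning Basics: From Getting Started to Job Hunting). 胡欢武. 电子工业出版社 (Publishing House of Electronics Industry). 2019.03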

Regression analysis is a widely used quantitative analysis method. It analyzes the statistical relationship between things, focusing on how variables change together quantitatively, and describes that relationship in the form of a regression equation. This helps people understand how strongly a variable is affected by one or more other variables, and provides a basis for prediction. In big data analysis, regression analysis is a predictive modeling technique that studies the relationship between a dependent variable (the target) and independent variables (the predictors). It is commonly used for predictive analysis, time series modeling, and discovering causal relationships between variables.

Simple linear regression analysis

The concept of linear regression

In linear regression analysis, if there is only one independent variable and one dependent variable, and their relationship can be roughly expressed by a straight line, it is called simple linear regression analysis. If a strong correlation is found between the dependent variable Y and the independent variable X, a straight-line equation can be determined so that all data points are as close as possible to the fitted line. The model of simple linear regression analysis can be expressed by the following equation: Y = a + bx, where Y is the dependent variable, a is the intercept, b is the regression coefficient (the slope), and x is the independent variable.
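The coefficients a and b in Y = a + bx are usually estimated by ordinary least squares. As a rough illustration, the plain-Python sketch below computes them in closed form; the data points are made up and the helper name fit_simple_linear is hypothetical.

```python
# Minimal sketch: closed-form least-squares fit of Y = a + b*x.
# The data points are made up for illustration only.

def fit_simple_linear(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y)
    a = mean_y - b * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated exactly from y = 1 + 2x
a, b = fit_simple_linear(xs, ys)
print(a, b)  # 1.0 2.0
```

Because the made-up points lie exactly on a line, the fit recovers the intercept 1 and slope 2; with real, noisy data the line only approximates the points.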

Examples of linear regression

Predicting house prices: A simple example of linear regression is predicting the value of a house. Generally speaking, the bigger the house, the higher its value, so the value of a house is related to its area. In this case, the house area is the independent variable and the house price is the dependent variable. To predict the house price, you need to find a linear equation of the form Y = a + bx that fits the given data set.

Predicting house prices from the house area data set
# predict_house_price.py

# 1. Import the required packages
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas as pd

# 2. Read the data
def get_data(file_name):
    data = pd.read_csv(file_name)
    x_parameter = []
    y_parameter = []
    # Iterate over the rows; each value is wrapped in a list so that
    # X and Y are 2-D, as scikit-learn expects
    for single_square_feet, single_price_value in zip(data['square_feet'], data['price']):
        x_parameter.append([float(single_square_feet)])
        y_parameter.append([float(single_price_value)])
    return x_parameter, y_parameter

# 3. Fit the data to a linear model
def linear_model_main(X_parameters, Y_parameters, predict_value):
    regr = linear_model.LinearRegression()
    # Train the model
    regr.fit(X_parameters, Y_parameters)
    predict_outcome = regr.predict(predict_value)
    predictions = {}
    predictions['intercept'] = regr.intercept_
    predictions['coefficient'] = regr.coef_
    predictions['predicted_value'] = predict_outcome
    return predictions

# 4. Plot the fitted line
def show_linear_line(X, Y):
    regr = linear_model.LinearRegression()
    regr.fit(X, Y)
    plt.scatter(X, Y, color='blue')
    plt.plot(X, regr.predict(X), color="red")
    plt.xticks()
    plt.yticks()
    plt.show()

# Read the data and make a prediction
X, Y = get_data('E:/Data/6/input_data.csv')
show_linear_line(X, Y)
# predict() expects a 2-D array: one row with one feature
predictvalue = [[700]]
result = linear_model_main(X, Y, predictvalue)
print("Intercept a:", result['intercept'])
print("Coefficient b:", result['coefficient'])
print("Predicted price:", result['predicted_value'])

Intercept a: [1771.80851064]
Coefficient b: [[28.77659574]]
Predicted price: [[21915.42553191]]

The intercept is the value of a, and the coefficient is the value of b. The predicted price for an area of 700 is 21915.4255, which completes the house price prediction. To check whether a linear model suits the data, the show_linear_line function takes X and Y, fits the model, and draws the fitted straight line over the scatter plot. The figure shows that the line fits the data points reasonably well.

Multiple linear regression analysis

The concept of multiple linear regression

Multiple linear regression analysis is a generalization of simple linear regression analysis. In its broadest sense it refers to the regression of several dependent variables on several independent variables, but the most common case, also called multiple regression analysis, has one dependent variable and several independent variables. The general form of multiple regression analysis is: Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + ... + b_k X_k, where a is the intercept and b_1, b_2, b_3, ..., b_k are the regression coefficients.
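To make the general form concrete, the sketch below estimates the intercept and the regression coefficients by solving the least-squares normal equations in plain Python. It is only an illustration under stated assumptions: the data are made up, the helper names solve_linear_system and fit_multiple_linear are hypothetical, and scikit-learn uses a more robust numerical routine internally.

```python
# Minimal sketch: fit Y = a + b_1*X_1 + ... + b_k*X_k via the normal equations.
# Data and helper names are made up for illustration.

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_multiple_linear(X, y):
    """Return [a, b_1, ..., b_k] solving (A^T A) w = A^T y, where A = [1 | X]."""
    A = [[1.0] + list(row) for row in X]  # column of ones for the intercept
    k = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(k)] for i in range(k)]
    Aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(k)]
    return solve_linear_system(AtA, Aty)

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in X]  # exactly Y = 1 + 2*X_1 + 3*X_2
weights = fit_multiple_linear(X, y)
print([round(w, 6) for w in weights])  # [1.0, 2.0, 3.0]
```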

Examples of multiple linear regression

Advertising investment: When several factors influence the result, a multiple linear regression model can be used. For example, the sales of a product may be related to spending on television, radio, and newspaper advertising. Therefore:
Sales = β_0 + β_1 × TV + β_2 × Radio + β_3 × Newspaper

1. Read in data

from sklearn import linear_model
import pandas as pd

# 1. Read in the data (the first column of the CSV is used as the index)
data = pd.read_csv("E:/Data/6/Advertising.csv", header=0, index_col=0)
data
        TV  Radio  Newspaper  Sales
1    230.1   37.8       69.2   22.1
2     44.5   39.3       45.1   10.4
3     17.2   45.9       69.3    9.3
4    151.5   41.3       58.5   18.5
5    180.8   10.8       58.4   12.9
...    ...    ...        ...    ...
196   38.2    3.7       13.8    7.6
197   94.2    4.9        8.1    9.7
198  177.0    9.3        6.4   12.8
199  283.6   42.0       66.2   25.5
200  232.1    8.6        8.7   13.4

200 rows × 4 columns

The data set has the following columns:

  • TV: advertising spend on TV (in thousands of US dollars);
  • Radio: advertising spend on radio;
  • Newspaper: advertising spend on newspapers;
  • Sales: sales volume of the corresponding product (in thousands of units).

In this case, product sales are predicted from the different advertising investments. Because the response variable is continuous, this is a regression problem. The data set contains 200 observations, and each observation corresponds to one market.

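2. Examine the relationships in the data

# 2. Visualize the relationship between each feature and the response
import matplotlib.pyplot as plt
import seaborn as sns

# height replaces the older size argument (renamed in seaborn 0.9)
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars=['Sales'], height=7, kind='reg')
plt.show()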

The figure shows that TV and Sales have a strong linear relationship, Radio and Sales have a weaker one, and Newspaper and Sales have the weakest.
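One way to put a number on such a visual impression is the Pearson correlation coefficient. The sketch below computes it in plain Python for the TV and Sales values of only the first five rows printed in the table above, so the result is illustrative and is not the correlation of the full data set. The helper name pearson_r is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# First five rows of the Advertising table shown earlier
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

r = pearson_r(tv, sales)
print(round(r, 2))
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one. With pandas, calling data.corr() on the full DataFrame would give the same kind of measure for every pair of columns.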

3. Construct feature vectors and labels

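# 3. Build X (the feature matrix) and y (the labels) with pandas
'''
scikit-learn expects X to be a feature matrix and y to be a NumPy vector.
pandas is built on top of NumPy, so X can be a pandas DataFrame and y a
pandas Series; scikit-learn understands both.
'''

# Create the list of feature names
feature_cols = ['TV','Radio','Newspaper']

# Select a subset of the original DataFrame with the list to build the feature matrix
X = data[feature_cols]
# Equivalent, with the column names written inline
X = data[['TV','Radio','Newspaper']]

# Select a Series from the DataFrame
y = data['Sales']
# Equivalent, using attribute access
y = data.Sales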

X
        TV  Radio  Newspaper
1    230.1   37.8       69.2
2     44.5   39.3       45.1
3     17.2   45.9       69.3
4    151.5   41.3       58.5
5    180.8   10.8       58.4
...    ...    ...        ...
196   38.2    3.7       13.8
197   94.2    4.9        8.1
198  177.0    9.3        6.4
199  283.6   42.0       66.2
200  232.1    8.6        8.7

200 rows × 3 columns

y
1      22.1
2      10.4
3       9.3
4      18.5
5      12.9
       ... 
196     7.6
197     9.7
198    12.8
199    25.5
200    13.4
Name: Sales, Length: 200, dtype: float64

4. Build training set and test set
Build the training set and the test set, and save them in X_train, y_train, X_test and y_test respectively.

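# 4. Build the training set and the test set
# This is a hold-out split. train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y) # by default, 75% for training and 25% for testing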
print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)
(150, 3)
(150,)
(50, 3)
(50,)
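Conceptually, a hold-out split just shuffles the rows and sets a fraction aside for testing. The plain-Python sketch below illustrates that idea on 200 dummy rows. It is not scikit-learn's implementation, and simple_train_test_split, test_fraction and seed are hypothetical names. In scikit-learn itself, train_test_split accepts test_size and random_state arguments to control the split size and make it reproducible.

```python
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the row indices and split them into a train part and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a reproducible split
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(200))  # stand-in for the 200 observations
train, test = simple_train_test_split(rows)
print(len(train), len(test))  # 150 50
```

Because the split is random, the fitted coefficients and scores reported in the following steps change from run to run unless the seed is fixed.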

5. Train the model
To use sklearn for linear regression, first import the linear regression model, then fit it to the training data.

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
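# Train the model
model = linreg.fit(X_train,y_train)
print(model)
print(linreg.intercept_) # intercept
print(linreg.coef_)      # coefficients
 
# Pair each feature name with its coefficient
# (zip returns a lazy iterator; wrap it in list(...) to display the pairs)
zip(feature_cols,linreg.coef_) 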

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
3.18324257719733
[0.04355291 0.19015449 0.00187221]

<zip at 0x1fcd01a5788>

With the Radio and Newspaper budgets held fixed, each additional unit spent on TV advertising is associated with an increase of 0.04355 units in sales. Since the advertising budget is measured in units of US $1,000 and sales in units of 1,000, every additional US $1,000 of TV advertising corresponds to about 43.55 more items sold. The other two coefficients can be interpreted in the same way, holding the remaining variables fixed.
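The arithmetic behind this interpretation can be checked directly. The sketch below plugs the intercept and coefficients printed in the model output above into the regression equation; the budgets passed to the hypothetical helper predict_sales are made-up values.

```python
# Intercept and coefficients copied from the fitted model output above
intercept = 3.18324257719733
coef = {"TV": 0.04355291, "Radio": 0.19015449, "Newspaper": 0.00187221}

def predict_sales(tv, radio, newspaper):
    """Evaluate Sales = b0 + b1*TV + b2*Radio + b3*Newspaper."""
    return (intercept
            + coef["TV"] * tv
            + coef["Radio"] * radio
            + coef["Newspaper"] * newspaper)

# Hypothetical budgets; only the TV budget changes, by one unit
base = predict_sales(100.0, 20.0, 30.0)
more_tv = predict_sales(101.0, 20.0, 30.0)

# The difference equals the TV coefficient, whatever the other budgets are
print(round(more_tv - base, 5))  # 0.04355
```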

6. Make predictions
Once the regression model has been fitted, it can be used to make predictions with the predict function.

y_pred = linreg.predict(X_test)
print(y_pred)

# Return the model's R² score (coefficient of determination) on the test set
print("Model score:", linreg.score(X_test, y_test))

[10.74391068 23.14564258  8.32679708 15.21951218 18.28945003 14.87833614
 13.7803107   7.63132692  8.92202965 11.01760334 12.53937187 14.68253695
 15.55290183 11.01469897 11.76553368 17.74441368 16.85875042  9.38817803
 20.61167465  4.771148   10.76013573 18.12281809 17.31568335 15.00207018
 16.24914813  8.14154239 18.41459647 21.86644162 21.18666811 16.48384668
 24.61985703 21.02422026 11.4813219  20.97928742 13.30921663  7.2209075
 15.33535201  7.60112899 12.52375857 16.99771224 12.79018816 11.62956032
 20.72425247 17.31572264 11.89161417  6.34236115 20.1308101  10.79413639
 17.74580942  9.98156724]
Model score: 0.8631218575476306

7. Model evaluation
For classification problems, the usual evaluation measure is accuracy, but accuracy does not apply to regression problems. Here are three commonly used evaluation measures for linear regression.

  • Mean Absolute Error (MAE);
  • Mean Squared Error (MSE);
  • Root Mean Squared Error (RMSE).
# RMSE is used here
import numpy as np

sum_sq = 0
for i in range(len(y_pred)):
    sum_sq += (y_pred[i] - y_test.values[i]) ** 2
# Take the square root of the mean once, after the loop
rmse = np.sqrt(sum_sq / len(y_pred))
print("RMSE:", rmse)

RMSE: 2.094615925156935
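For comparison, all three measures listed above can be computed in a few lines of plain Python. This is a minimal sketch on made-up values, with a hypothetical helper name regression_errors; scikit-learn also provides mean_absolute_error and mean_squared_error in sklearn.metrics.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    rmse = math.sqrt(mse)                      # root mean squared error
    return mae, mse, rmse

# Made-up true values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 11.0]

mae, mse, rmse = regression_errors(y_true, y_hat)
print(mae, mse, round(rmse, 4))  # 1.0 1.5 1.2247
```

RMSE is expressed in the same units as the target and penalizes large errors more heavily than MAE does.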

# Plot the predicted and actual values for the test set
import matplotlib.pyplot as plt

def show_predictions():
    plt.figure()
    plt.plot(range(len(y_pred)), y_pred, 'b', label="predict")
    plt.plot(range(len(y_pred)), y_test, 'r', label="test")
    plt.legend(loc="upper right")
    plt.xlabel("Test sample index")
    plt.ylabel("Value of sales")
    plt.show()

show_predictions()

Summary

That concludes the introduction to ordinary linear regression. In scikit-learn it is implemented by the linear_model.LinearRegression class. The main parameters, attributes and methods of this class are summarized below.

  • Parameters
    • fit_intercept: whether to compute the bias constant b. The default is True, meaning it is computed.
    • normalize: whether to normalize the data before fitting. The default is False, meaning no normalization. (This parameter was deprecated in scikit-learn 1.0 and removed in 1.2.)
    • n_jobs: the number of CPU cores to use for parallel computation. The default in the version used here is 1; -1 means that all available CPU cores are used.
  • Attributes
    • coef_: the weight vector w of the linear regression model.
    • intercept_: the bias constant b of the linear regression model.
  • Methods
    • fit(X_train, y_train): train the model on the training set (X_train, y_train).
    • predict(X): use the trained model to predict on the data set X and return the corresponding predictions.
    • score(X_test, y_test): return the coefficient of determination R² of the model on the test set (X_test, y_test), computed as R² = 1 - Σ(y_i - ŷ_i)² / Σ(y_i - ȳ)², where ŷ_i is the prediction for sample i and ȳ is the mean of the true values. The score is at most 1 and can be negative; a larger value indicates better predictive performance.
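The value returned by score for a regressor is the coefficient of determination R², defined as one minus the ratio of the residual sum of squares to the total sum of squares. The plain-Python sketch below, with a hypothetical helper name r2_score_sketch and made-up values, shows why the score is at most 1 and can become negative.

```python
def r2_score_sketch(y_true, y_pred):
    """R^2 = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]  # made-up true values, mean is 6.0

print(r2_score_sketch(y_true, [3.0, 5.0, 7.0, 9.0]))  # 1.0: perfect predictions
print(r2_score_sketch(y_true, [6.0, 6.0, 6.0, 6.0]))  # 0.0: always predicting the mean
print(r2_score_sketch(y_true, [9.0, 7.0, 5.0, 3.0]))  # -3.0: worse than predicting the mean
```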

Origin www.cnblogs.com/sinlearn/p/12683168.html