Machine Learning (1): Regression

I. Overview

Regression starts from a set of data and determines the quantitative relationship between certain variables, i.e., it builds a mathematical model and estimates its unknown parameters. The purpose of regression is to predict a numeric target value: the goal is to find the equation that best fits the given continuous data and use it to predict specific values. The equation sought is called the regression equation. To solve for the regression equation, first determine the model; the simplest regression model is simple linear regression (e.g. y = kx + b). Then compute the regression coefficients of the regression equation (i.e., the values of k and b).

II. Linear regression

The linear regression model is defined mathematically as \[f(x) = \sum\limits_{i=1}^{n} \omega_i x_i + \omega_0 = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \cdots + \omega_n x_n\]

In matrix form this is \(f(x) = XW\), where \(X = \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_n \end{bmatrix}\) and \(W = \begin{bmatrix} \omega_0 \\ \omega_1 \\ \vdots \\ \omega_n \end{bmatrix}\). Here \(X\) is the augmented feature vector and \(W\) is the augmented weight vector; linear regression is the process of solving for this augmented weight vector.
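As a quick sanity check, the matrix form can be evaluated with NumPy. This is a minimal sketch; the feature and weight values below are made up purely for illustration:

```python
import numpy as np

# Raw features x_1..x_3 (made-up values for illustration)
x = np.array([2.0, 3.0, 4.0])

# Augmented feature vector X = [1, x_1, ..., x_n]; the leading 1 pairs with omega_0
X = np.concatenate(([1.0], x))

# Augmented weight vector W = [omega_0, omega_1, ..., omega_n]
W = np.array([0.5, 1.0, -2.0, 3.0])

# f(x) = XW, identical to omega_0 + omega_1*x_1 + omega_2*x_2 + omega_3*x_3
f = X @ W
print(f)  # 0.5 + 1*2 - 2*3 + 3*4 = 8.5
```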

2.1 A simple linear regression example

To find the augmented weight vector, we must first collect samples: data that is representative of the problem we want to study. For example, suppose I want to predict housing prices near Wuhan University. I must first gather price data for the area near Wuhan University over recent years (data source: Fangtianxia). Suppose we obtain a series of samples, each covering two dimensions: time and price. (In reality, house prices depend not only on time but also on location, floor, neighborhood environment, property management, and other factors; here we study only the relationship between time and price, so "price" refers to the average price.) We can then build a two-dimensional Cartesian coordinate system with the horizontal axis representing time and the vertical axis representing the average price, as shown below:

[Figure: scatter plot of average house prices in Hongshan District, Wuhan, 2018.11-2019.10]

Code is as follows:

# -*- coding: utf-8 -*-
import numpy as np
from datetime import datetime 
from matplotlib import pyplot as plt
import matplotlib.dates as mdates

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus']= False 

x = ['2018-11','2018-12','2019-01','2019-02','2019-03','2019-04','2019-05','2019-06','2019-07','2019-08','2019-09','2019-10']
y = np.array([20128, 20144, 20331, 20065, 20017, 19972, 19902, 19706, 19997, 20057, 20213, 20341])
x = [datetime.strptime(d, '%Y-%m') for d in x]

plt.title("武汉市洪山区平均房价")
plt.ylim((19500, 20500))
plt.ylabel(u'平均房价(元/平方米)')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))  # set the display format of the time labels
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.plot(x,y,'.')
plt.xlabel(u'月份(2018.11-2019.10)')
plt.gcf().autofmt_xdate()
plt.show()

If Chinese characters do not display correctly, refer to the first answer of the corresponding question on Zhihu.

Linear regression solves for a straight line of the form y = kx + b that fits these scattered points. In the figure below, the blue and green lines are both fitted through all the sample points. Which straight line fits better?

[Figure: the same scatter plot with a blue candidate fit line and a green candidate fit line]

Intuitively, the blue line simply connects the first point to the last point (which is in fact what I did in the code), while the green line connects the fifth point to the second-to-last point (in the actual code I did not do this; the image just happens to look that way). Most points lie below the blue line, whereas the points are roughly balanced on both sides of the green line. So we can roughly conclude that the green line fits better than the blue line. This is the rule often mentioned in high-school mathematics (and physics): a good fit line should have the sample points evenly distributed on both sides of the straight line (and clearly erroneous data points should be discarded).
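The "evenly distributed on both sides" intuition can be checked numerically: for each candidate line, count how many sample points fall above and below it. This is a sketch using the price data from the plot above; the two line choices mirror the description in the text, not the exact pixels of the figure:

```python
import numpy as np

# Average prices from the plot above (2018-11 .. 2019-10)
y = np.array([20128, 20144, 20331, 20065, 20017, 19972, 19902,
              19706, 19997, 20057, 20213, 20341], dtype=float)
t = np.arange(len(y), dtype=float)  # month index 0..11 as the x-axis

def line_through(i, j):
    """Slope k and intercept b of the straight line through sample points i and j."""
    k = (y[j] - y[i]) / (t[j] - t[i])
    return k, y[i] - k * t[i]

def balance(k, b):
    """Number of sample points strictly above and strictly below y = k*t + b."""
    resid = y - (k * t + b)
    return int((resid > 1e-6).sum()), int((resid < -1e-6).sum())

blue = line_through(0, -1)    # first point joined to last point
green = line_through(4, -2)   # fifth point joined to second-to-last point
print(balance(*blue))         # (1, 9): most points end up below the blue line
print(balance(*green))        # (5, 5): balanced on both sides of the green line
```

The counts confirm the visual impression: the blue line leaves nine points below it and only one above, while the green line splits the points evenly.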

2.2 Error in linear regression

In linear regression we want to find the best-fit line, but how do we decide which line is best? This leads to the concept of error. In middle-school physics, error is defined as the difference between the measured value and the true value; in regression analysis we can roughly define the error as the actual value minus the predicted value, written mathematically as \[e = y_a - y_p \quad (a \text{ for actual}, \ p \text{ for predicted})\] where \(y_a\) is the ordinate of a sample point and \(y_p\) is the ordinate of the corresponding point on the line. (More precisely, the error can be defined via Euclidean distance: linear regression then seeks the line that minimizes the total Euclidean distance from all sample points to the line.) The best-fit line is the one that minimizes the total error \(\sum\limits_{i=1}^{n} e_i\). In linear regression we use a loss function \(J(w)\) to measure the size of the error; usually the mean squared error is taken as the loss function, written as \(J(w) = \cfrac{1}{n} \sum\limits_{i=1}^{n} (y_a - y_p)^2\). To minimize the mean squared error we generally use the method of least squares.
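The loss function and its least-squares minimizer can both be sketched in a few lines of NumPy, again using the price data from the earlier plot; `np.linalg.lstsq` solves the least-squares problem directly:

```python
import numpy as np

# Same average-price samples as in the plot above
y = np.array([20128, 20144, 20331, 20065, 20017, 19972, 19902,
              19706, 19997, 20057, 20213, 20341], dtype=float)
t = np.arange(len(y), dtype=float)  # month index as the x-axis

# Augmented design matrix: row i is [1, t_i], matching X = [1, x_1, ...]
X = np.column_stack([np.ones_like(t), t])

def J(w):
    """Mean squared error loss J(w) = (1/n) * sum((y_a - y_p)^2)."""
    return float(np.mean((y - X @ w) ** 2))

# Least squares: the w = [omega_0, omega_1] that minimizes J(w)
w_best, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_best, J(w_best))

# Sanity check: the least-squares line beats the "connect the endpoints" blue line
k_blue = (y[-1] - y[0]) / (t[-1] - t[0])
w_blue = np.array([y[0], k_blue])  # [intercept, slope]
print(J(w_blue) >= J(w_best))      # True
```

Because the model includes an intercept term, the residuals of the least-squares fit sum to zero, which is one way to see that the fitted line passes "through the middle" of the data.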

Source: www.cnblogs.com/liyier/p/12516646.html