Least Squares Method: Regression Practice

A regression practice with the least squares algorithm in Python

Basic concepts

Linear regression model

For different data sets, the process of data mining or machine learning is the process of building a data model. For regression problems, the general prediction formula of a linear model is as follows:

y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + … + w[p]*x[p] + b

Here x[0] to x[p] represent the features of a single data point (the number of features in this example is p+1), w and b are the parameters learned by the model, and y is the prediction. For a data set with a single feature, the formula is as follows:

y = w[0]*x[0] + b

As you can see, this is very similar to the equation of a straight line from high school mathematics, where w[0] is the slope. For data sets with more features, w contains one slope for each feature. Alternatively, you can think of the predicted value as a weighted sum of the input features, with the weights given by the elements of w.
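The weighted-sum view of the formula above can be sketched with hypothetical numbers (the weights, intercept, and feature values below are made up for illustration):

```python
import numpy as np

# Hypothetical learned parameters: weights w and intercept b
w = np.array([0.5, -1.2, 2.0])
b = 0.3

# Hypothetical features of one data point
x = np.array([1.0, 2.0, 3.0])

# The prediction is the weighted sum of the features plus the intercept
y = np.dot(w, x) + b
print(y)  # 0.5*1.0 - 1.2*2.0 + 2.0*3.0 + 0.3 = 4.4
```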

Introduction to the algorithm

Least squares algorithm

Ordinary least squares, or linear regression, is the simplest and most classic linear method for regression problems. Linear regression finds the parameters w and b that minimize the mean squared error between the predictions on the training set and the true regression targets y.
The mean squared error is the sum of the squared differences between the predicted and true values, divided by the number of samples.
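The definition above maps directly onto code. A minimal sketch on made-up predictions and targets:

```python
import numpy as np

# Hypothetical true values and model predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# Mean squared error: sum of squared differences divided by the sample count
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167
```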

Data Sources

Split data of first-person fps game csgo: https://www.kaggle.com/sp1nalcord/mycsgo-data

csgo is a first-person shooter game. The data includes each player's network latency (ping), number of kills, number of deaths, score, and so on.
Hey, I do know this game well. This may be the only data set where the blogger can understand every column without reading the English description of the original data.

Data mining

1. Import third-party libraries

import pandas as pd
import numpy as np
import winreg
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression  # import the linear regression algorithm
from sklearn.metrics import r2_score

As usual, first import each module needed for modeling.
2. Read the file

import winreg

# Resolve the absolute path of the desktop via the Windows registry
real_address = winreg.OpenKey(winreg.HKEY_CURRENT_USER,
                              r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders')
file_address = winreg.QueryValueEx(real_address, "Desktop")[0]
file_origin = file_address + "\\源数据-分析\\csgo.csv"  # absolute desktop path of the source data file
csgo = pd.read_csv(file_origin)  # https://www.kaggle.com/sp1nalcord/mycsgo-data

Every time you download data, you normally have to move the file to the Python root directory or read it from the download folder, which is troublesome. So I set up the absolute desktop path through the winreg library; this way I only need to download the data to the desktop, or paste it into a specific folder on the desktop, to read it, and it won't get mixed up with other data.
In fact, up to this step we are just going through the motions; basically every data mining project has to do this, so there is nothing more to say.

3. Cleaning the data

It can be seen that this data contains no missing values, and there is no overlap between the feature attributes, so no processing is needed for now.

4. Modeling

X_train,X_test,y_train,y_test=train_test_split(csgo[["Ping","Kills","Assists","Deaths","MVP","HSP"]],csgo["Score"],random_state=1)

The score is taken as the predicted value, the other attributes are taken as feature values, and the data is split into a training set and a test set.

LR=LinearRegression()
LR.fit(X_train,y_train)
prediction=LR.predict(X_test)
r2_score(y_test,prediction)

After building the model with the linear regression algorithm, the accuracy on the test set is scored, and the result is as follows:

It can be seen that the accuracy of the model is about 94%.
At this point, the modeling of this data set is complete.
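Since w and b are the parameters the model learns, it can be instructive to inspect them after fitting. A minimal sketch on synthetic data standing in for the csgo features (the true weights 2, -1 and intercept 5 below are made up, so the fitted model should recover them):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: targets generated from a known linear rule
rng = np.random.RandomState(1)
X = rng.rand(100, 2)
y = 2 * X[:, 0] - 1 * X[:, 1] + 5

LR = LinearRegression().fit(X, y)
print(LR.coef_)       # learned slopes w, close to [2, -1]
print(LR.intercept_)  # learned intercept b, close to 5
```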

5. Summary

1. You can feel that the least squares algorithm is not difficult, but it is indeed one of the most classic and important algorithms. Because many other linear regression algorithms are derived from its model formula, you must understand the formula of this model.
You should have your own understanding of the principle of the algorithm. You may not specialize in algorithm research and development, but you must know how to use the algorithm and in what scenarios it applies.

2. You can see that this algorithm requires no parameter tuning, because it has no hyperparameters at all. This is an advantage, but it also means there is no way to control the complexity of the model, and no way to improve its accuracy by adjusting the algorithm itself.
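To illustrate this point, ridge regression, one of the linear models derived from the same formula, does expose an alpha parameter that controls model complexity by shrinking the weights, while LinearRegression has no such knob. A sketch on synthetic data (the data and alpha values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data from a known linear rule
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5

# Ordinary least squares: nothing to tune
ols = LinearRegression().fit(X, y)

# Ridge adds an alpha parameter that penalizes large weights;
# a larger alpha shrinks the coefficients toward zero
ridge_small = Ridge(alpha=0.1).fit(X, y)
ridge_large = Ridge(alpha=100.0).fit(X, y)

print(np.linalg.norm(ridge_large.coef_) < np.linalg.norm(ridge_small.coef_))  # True
```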

There are many places here that could be done better. Readers are welcome to make suggestions, and I hope to meet some friends to discuss this together.


Origin blog.csdn.net/weixin_43580339/article/details/112271333