Machine Learning Notes - Linear and Logistic Regression

1. Overview of Linear Regression

1. Overview

        Linear regression is a method for predicting continuous variables. Its basic idea is to set up a mathematical model of the relationship between the dependent variable and the independent variables, and to fit that model to the sample points. The whole algorithm is about finding the best-fitting model.

        The linear regression algorithm has two core steps. First, assume a suitable model, for example whether to fit a straight line (a first-degree curve) or a quadratic curve. Second, find the best-fitting parameters: different parameters give the model different shapes, so how the best parameters are found is critical.
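
        As a minimal sketch of these two steps, the snippet below fits both a straight line and a quadratic curve to the same points with NumPy's polyfit and compares the remaining error. The data, the helper name fit_and_score and the chosen degrees are assumptions made up for illustration.

```python
import numpy as np

# Hypothetical sample points: a quadratic relationship plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2.0 * x**2 - 1.0 * x + 0.5 + rng.normal(scale=0.3, size=x.size)

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (coefficients, mean squared error)."""
    coeffs = np.polyfit(x, y, degree)      # step 2: find the best parameters
    y_hat = np.polyval(coeffs, x)
    return coeffs, float(np.mean((y - y_hat) ** 2))

line_coeffs, line_mse = fit_and_score(1)   # step 1, option A: straight line
quad_coeffs, quad_mse = fit_and_score(2)   # step 1, option B: quadratic curve

print("line MSE:", line_mse)
print("quadratic MSE:", quad_mse)
```

        On data like this the quadratic model leaves a much smaller error, which is the kind of evidence used to choose between candidate models.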

        The purpose of regression is to predict a numerical target value. The most straightforward way is to write a formula for calculating the target value based on the input. If you wanted to predict the number of car sales in the next quarter, you might use the following formula.

nums = 0.005*d - 0.00099*f

        This is the so-called regression equation, where 0.005 and -0.00099 are called regression weights (or regression coefficients), and the process of finding these regression weights is regression.
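
        A small sketch may make the two directions concrete: evaluating the equation with known weights, and recovering the weights from data. The weights are the ones written in the equation above; the input values and the function name predict are hypothetical.

```python
import numpy as np

# Weights taken from the example equation above (it has no intercept term)
weights = np.array([0.005, -0.00099])

def predict(inputs):
    """Evaluate the regression equation: a weighted sum of the inputs."""
    return np.asarray(inputs, dtype=float) @ weights

# Hypothetical inputs (d, f), made up for illustration
samples = np.array([[1000.0, 2000.0],
                    [1500.0,  500.0],
                    [ 300.0, 4000.0]])
targets = predict(samples)

# "Regression" is the reverse direction: recover the weights from inputs and targets
recovered, *_ = np.linalg.lstsq(samples, targets, rcond=None)
print(recovered)
```

        Because the targets here are generated without noise, least squares recovers the original weights up to floating-point error; with real data the recovered weights are only the best fit.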

2. Advantages and disadvantages

Advantages: Results are easy to understand, and the computation is inexpensive.

Disadvantage: Does not fit nonlinear data well.

Applicable data types: numeric and nominal data.

3. The usual application process

(1) Collect data: collect data using any method.

(2) Prepare data: Regression requires numeric values, so nominal data must be converted into binary values.

(3) Analyze data: Plotting the data in a two-dimensional graph helps to understand and analyze it. After new regression coefficients are obtained with a shrinkage method, the new fitted line can be drawn on the same graph for comparison.

(4) Training algorithm: find regression coefficients.

(5) Test algorithm: Use R² or the agreement between the predicted values and the data to analyze how well the model performs.

(6) Using an algorithm: Using regression, a value can be predicted given an input, which is an improvement over classification methods because it can predict continuous data rather than just discrete class labels.
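
        Steps (2), (4) and (5) can be sketched as follows. The data frame, its column names and the values are hypothetical, and for brevity R² is computed on the training data itself rather than on a held-out test set.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric feature, one nominal feature, one numeric target
df = pd.DataFrame({
    "size":  [50.0, 60.0, 80.0, 100.0, 120.0, 150.0],
    "city":  ["A", "B", "A", "B", "A", "B"],
    "price": [112.0, 148.0, 171.0, 229.0, 252.0, 328.0],
})

# (2) Prepare: turn the nominal column into a 0/1 indicator column
features = pd.get_dummies(df[["size", "city"]], columns=["city"], drop_first=True)
features = features.astype(float)
X = np.column_stack([np.ones(len(df)), features.to_numpy()])  # prepend intercept column
y = df["price"].to_numpy()

# (4) Train: ordinary least squares solution for the regression coefficients
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# (5) Test: R^2 = 1 - SS_res / SS_tot
y_hat = X @ coef
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print("coefficients:", coef)
print("R^2:", r2)
```

        An R² close to 1 means the model explains most of the variation in the target; a value near 0 means it does little better than always predicting the mean.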

2. Overview of Logistic Regression

1. Overview

        Logistic regression can be used for binary or multiclass classification.

        Logistic regression is one of the most popular machine learning algorithms and is a supervised learning technique. It predicts a categorical outcome from a given set of independent variables.

        Logistic regression is a statistical model that uses the logistic function to model conditional probabilities. The idea of logistic regression is to find a relationship between the features and the probability of a particular outcome.
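
        The logistic function itself is short enough to write out. The sketch below shows how a linear combination of a feature is mapped to a probability; the coefficient values and the function name predict_proba are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, weight, bias):
    """P(y = 1 | x) for a single feature under a logistic model."""
    return sigmoid(weight * x + bias)

# Hypothetical coefficients, chosen only for illustration
weight, bias = 0.8, -4.0
for x in (0.0, 5.0, 10.0):
    print(x, predict_proba(x, weight, bias))
```

        With these coefficients the probability is exactly one half at x = 5, below one half for smaller x and above it for larger x.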

2. Advantages and disadvantages

Advantages: Not computationally expensive, easy to understand and implement.

Disadvantages: It is easy to underfit, and the classification accuracy may not be high.

Applicable data types: numeric and nominal data.

3. The usual application process

(1) Collect data: collect data using any method.

(2) Prepare data: Since distance calculation is required, the data must be numeric. In addition, a structured data format is best.

(3) Analyze data: use any method to analyze the data.

(4) Training algorithm: Most of the time is spent on training, whose purpose is to find the best regression coefficients for classification.

(5) Testing the algorithm: Once the training step is complete, the classification will be fast.

(6) Using the algorithm: First, input some data and convert it into the corresponding structured values. Then, using the trained regression coefficients, perform a simple regression calculation on these values to determine which category they belong to. After this, further analysis can be done on the output categories.
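
        Steps (4) and (6) can be sketched with plain NumPy. This is one possible training procedure (batch gradient descent on the log loss), not necessarily the one a given library uses; the function names, learning rate, iteration count and data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, n_iter=5000):
    """Batch gradient descent on the average log loss; returns the weight vector."""
    Xb = np.column_stack([np.ones(len(X)), X])   # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                      # predicted P(y = 1)
        grad = Xb.T @ (p - y) / len(y)           # gradient of the average log loss
        w -= lr * grad
    return w

def classify(X, w, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(Xb @ w) >= threshold).astype(int)

# Hypothetical one-feature data: small values are class 0, large values are class 1
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w = train_logistic(X, y)
print(classify(np.array([[2.5], [6.5]]), w))
```

        Once the coefficients are trained, classifying a new point costs only one dot product and one sigmoid evaluation, which is why the classification step is fast.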

3. Similarities and differences between linear regression and logistic regression

        There are three major problems in machine learning, namely regression, classification and clustering.

        Linear regression solves regression problems, whereas logistic regression solves classification problems.

        Although the two solve completely different problems, they have a lot in common if we look at the essence of the algorithms. For example, both can use the gradient descent method to find the best-fitting model.

        However, the goal of a linear regression fit is to get the data points to fall as close as possible to a straight line, while the goal of logistic regression is to put points of different classes on opposite sides of the line.
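
        As an illustration of the gradient descent idea for the linear case, here is a minimal sketch that minimizes the mean squared error. The function name, learning rate, iteration count and data are assumptions; in practice linear regression is also often solved in closed form.

```python
import numpy as np

def linear_gradient_descent(X, y, lr=0.05, n_iter=5000):
    """Minimize the mean squared error of Xb @ w against y by batch gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])   # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        residual = Xb @ w - y
        grad = 2.0 * Xb.T @ residual / len(y)    # gradient of the mean squared error
        w -= lr * grad
    return w

# Hypothetical noise-free data generated from y = 1 + 2x
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 1.0 + 2.0 * X[:, 0]

w = linear_gradient_descent(X, y)
print(w)   # intercept first, then slope
```

        For logistic regression the loop has the same shape: only the prediction (a sigmoid of the linear term) and the loss (log loss instead of squared error) change.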

Linear regression on the left, logistic regression on the right

        Table comparison

Linear regression | Logistic regression
----------------- | -------------------
Linear regression is a supervised regression model. | Logistic regression is a supervised classification model.
In linear regression, we predict a continuous numeric value. | In logistic regression, we predict a value of 1 or 0.
No activation function is used. | An activation function (the sigmoid) is used to convert the linear regression equation into the logistic regression equation.
No threshold is required. | A threshold is required.
The fit is usually measured with the root mean square error (RMSE), and the weights are updated to reduce the squared error. | The fit is usually measured with classification accuracy, and the weights are updated to increase the likelihood of the training labels.
The dependent (response) variable should be numeric and continuous. | The dependent variable contains only two categories. Given a set of quantitative or categorical independent variables, logistic regression estimates the odds of the outcome.
It is based on least squares estimation. | It is based on maximum likelihood estimation.
When we plot the training dataset, we can draw a straight line that passes through or near as many points as possible. | Any change in the coefficients changes the direction and steepness of the logistic function: a positive slope results in an S-shaped curve, and a negative slope results in a Z-shaped curve.
Used to estimate how the dependent variable changes when the independent variables change. For example, predicting house prices. | Used to calculate the probability of an event. For example, classifying whether tissue is benign or malignant.
Assumes a normal (Gaussian) distribution of the dependent variable. | Assumes a binomial distribution of the dependent variable.
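
        Two rows of the comparison, the threshold and the S-shaped versus Z-shaped curve, are easy to check numerically. The coefficient values and the threshold of 0.5 below are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5.0, 5.0, 11)   # -5, -4, ..., 5

s_curve = sigmoid(+1.5 * x)      # positive coefficient: probability rises with x
z_curve = sigmoid(-1.5 * x)      # negative coefficient: probability falls with x

# Logistic regression needs a threshold to turn probabilities into classes
threshold = 0.5
labels = (s_curve >= threshold).astype(int)
print(labels)
```

        Flipping the sign of the coefficient mirrors the curve, and moving the threshold shifts the point at which the predicted class changes from 0 to 1.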

4. Simple example 

1. Simple Linear Regression

        The dataset is very simple, one column for time, one column for score.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Read the data: the first column is the feature (time), the second is the target (score)
dataset = pd.read_csv('studentscores.csv')
X = dataset.iloc[:, :1].values
Y = dataset.iloc[:, 1].values

# Split into training data and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/4, random_state=0)

# Train a linear regression model
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

# Predict on the test data
Y_pred = regressor.predict(X_test)

# Visualize the training results
plt.scatter(X_train, Y_train, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.show()

# Visualize the test results
plt.scatter(X_test, Y_test, color='red')
plt.plot(X_test, Y_pred, color='blue')
plt.show()

         The following figure is a visualization of the training results

Training results on the left, testing results on the right
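
        Besides plotting, the fit can be checked numerically. The sketch below repeats the same workflow on synthetic data so that it runs without the CSV file; the generated values are a stand-in and not the real studentscores.csv, and the variable names hours and scores are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the time/score data (not the real studentscores.csv)
rng = np.random.default_rng(0)
hours = rng.uniform(1.0, 9.0, size=40).reshape(-1, 1)
scores = 10.0 * hours[:, 0] + 5.0 + rng.normal(scale=3.0, size=40)

X_train, X_test, Y_train, Y_test = train_test_split(
    hours, scores, test_size=0.25, random_state=0)

regressor = LinearRegression().fit(X_train, Y_train)
Y_pred = regressor.predict(X_test)

print("slope:", regressor.coef_[0], "intercept:", regressor.intercept_)
print("test MSE:", mean_squared_error(Y_test, Y_pred))
print("test R^2:", r2_score(Y_test, Y_pred))
```

        Reporting the error on the test split rather than the training split gives a fairer picture of how the model behaves on unseen data.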

2. Multiple Linear Regression

        An example of the dataset is as follows

R&D Spend | Administration | Marketing Spend | State | Profit
--------- | -------------- | --------------- | ----- | ------
165349.2 | 136897.8 | 471784.1 | New York | 192261.83
162597.7 | 151377.59 | 443898.53 | California | 191792.06
153441.51 | 101145.55 | 407934.54 | Florida | 191050.39
144372.41 | 118671.85 | 383199.62 | New York | 182901.99

        The reference code is as follows 

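import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Read the data: the first four columns are features, the last column (Profit) is the target
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 4].values

# One-hot encode only the State column (index 3); the numeric columns pass through unchanged.
# The encoded State columns come first in the output.
transformer = ColumnTransformer([('state', OneHotEncoder(), [3])],
                                remainder='passthrough', sparse_threshold=0)
X = transformer.fit_transform(X).astype(float)

# Drop the first dummy column to avoid the dummy variable trap
X = X[:, 1:]

# Split into training data and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Train the model
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

# Predict on the test data
Y_pred = regressor.predict(X_test)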
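
        The X = X[:, 1:] step in the code above drops one of the dummy columns. The sketch below illustrates why, using only the State values from the sample rows: with every dummy column kept, the columns add up to the intercept column and the design matrix loses rank. The variable names are assumptions, and pandas get_dummies is used here only as a compact way to build the dummy columns.

```python
import numpy as np
import pandas as pd

# State values from the four sample rows shown above
states = pd.Series(["New York", "California", "Florida", "New York"], name="State")

all_dummies = pd.get_dummies(states).astype(float)               # one column per state
reduced = pd.get_dummies(states, drop_first=True).astype(float)  # first column dropped

intercept = np.ones((len(states), 1))
full_design = np.hstack([intercept, all_dummies.to_numpy()])
reduced_design = np.hstack([intercept, reduced.to_numpy()])

# With all three dummies, the dummy columns sum to the intercept column
print(np.linalg.matrix_rank(full_design), "of", full_design.shape[1], "columns")
print(np.linalg.matrix_rank(reduced_design), "of", reduced_design.shape[1], "columns")
```

        The full design has four columns but only rank 3, so its coefficients are not uniquely determined; after one dummy column is dropped, the remaining three columns are linearly independent.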

3. Simple logistic regression

         The dataset contains information about users of a social network: user ID, gender, age and estimated salary. A car company has just unveiled a brand new luxury SUV, and we want to see which users of the social network will buy it. The last column records whether the user bought the SUV. We will build a model that predicts whether a user will buy the SUV or not, based on two variables: age and estimated salary. So our feature matrix has only these two columns. We want to find some correlation between a user's age and estimated salary and the decision to buy the SUV.

        An example of the dataset is as follows

User ID | Gender | Age | EstimatedSalary | Purchased
------- | ------ | --- | --------------- | ---------
15624510 | Male | 19 | 19000 | 0
15810944 | Male | 35 | 20000 | 0
15668575 | Female | 26 | 43000 | 0
15603246 | Female | 27 | 57000 | 0
15804002 | Male | 19 | 76000 | 0
15728773 | Male | 27 | 58000 | 0
15598044 | Female | 27 | 84000 | 0
15694829 | Female | 32 | 150000 | 1

        The code reference is as follows

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Read the dataset: Age and EstimatedSalary are the features, Purchased is the label
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Split into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling: fit the scaler on the training data only, then apply it to the test data
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Create and train the classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predict on the test data
y_pred = classifier.predict(X_test)

# Evaluate the predictions with a confusion matrix
cm = confusion_matrix(y_test, y_pred)
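
        A confusion matrix is easier to use once it is reduced to a few summary numbers. The sketch below shows one way to read it, using small hand-made label arrays; these are hypothetical values, not results from Social_Network_Ads.csv, and the names y_true and y_hat are assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, made up for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted buyers that really bought
recall = tp / (tp + fn)           # share of real buyers that were found
print(cm)
print(accuracy, precision, recall)
```

        For these made-up labels the matrix is [[5, 1], [1, 3]], which gives an accuracy of 0.8 and a precision and recall of 0.75 each.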

Origin blog.csdn.net/bashendixie5/article/details/123588716