4. Linear Support Vector Machine Algorithm (LinearSVC, Linear Support Vector Classification) (supervised learning)

LinearSVC (Linear Support Vector Classification) is similar to SVC with a linear kernel, SVC(kernel='linear'). The difference lies in the implementation: LinearSVC() is built on liblinear, while SVC(kernel='linear') is built on libsvm. As a result, LinearSVC() offers more flexibility in the choice of penalty and loss function, and it scales better to large numbers of samples.
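
The comparison above can be tried out with a minimal sketch. The synthetic data from make_classification is an assumption made for illustration; it is not the data set used later in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic two-class data (an assumption, not the article's data set).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

# Same model family, two different libraries underneath.
svc = SVC(kernel="linear", C=1.0).fit(X, y)                     # libsvm
lsvc = LinearSVC(C=1.0, dual=False, max_iter=10000).fit(X, y)   # liblinear

# Both expose one weight per feature for the separating hyperplane.
print(svc.coef_.shape, lsvc.coef_.shape)

# The two models usually agree on most predictions, but not necessarily all,
# because the default loss and the intercept handling differ.
agreement = np.mean(svc.predict(X) == lsvc.predict(X))
print(agreement)
```

The two classifiers are close relatives rather than identical models, so small differences in the learned weights are expected.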

1. Algorithm idea

LinearSVC is essentially an optimized variant of the SVM, and the underlying principles are the same. For the detailed algorithm idea, refer to the earlier blog post: 3. Support Vector Machine Algorithm (SVC, Support Vector Classification) (supervised learning)

2. Official website API

class sklearn.svm.LinearSVC(penalty='l2', loss='squared_hinge', *, dual='warn', tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)

There are quite a few parameters here. For details on each one, work through the demos on the official website and experiment; only the most commonly used parameters are explained below.
Import: from sklearn.svm import LinearSVC

①Penalty term penalty

Selects the penalty term, that is, the norm used in the penalty.
The l2 penalty is the standard used by SVC; l1 makes the coef_ vector sparse.
Put simply, regularization is a constraint added to the loss function.

In linear regression, L1 regularization (Lasso regression) can produce sparse models.
In linear regression, L2 regularization (Ridge regression) keeps the parameters small to prevent overfitting.

'l1': add L1 regularization
'l2': add L2 regularization. This is the default and the standard for SVC.
LinearSVC() accepts only 'l1' or 'l2'; there is no None option. SVC() has no penalty parameter and always uses the L2 penalty.

Usage

LinearSVC(penalty='l2')
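
The sparsity effect of the l1 penalty can be checked with a small sketch. The synthetic data and the small value C=0.01 are assumptions chosen to make the effect visible; dual=False is set because the l1 penalty is only available with the primal formulation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data with many uninformative features (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

# Fit the same model twice, changing only the penalty.
l1_model = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                     C=0.01, max_iter=10000).fit(X, y)
l2_model = LinearSVC(penalty="l2", loss="squared_hinge", dual=False,
                     C=0.01, max_iter=10000).fit(X, y)

# Count coefficients that are exactly zero in each model.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("zero coefficients with l1:", n_zero_l1)
print("zero coefficients with l2:", n_zero_l2)
```

With l1, many weights are driven to exactly zero, while l2 only shrinks them towards zero.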

②Loss function loss

Specifies the loss function.
hinge is the standard SVM loss (used, for example, by the SVC class), and squared_hinge is the square of the hinge loss.

The combination of penalty='l1' and loss='hinge' is not supported.
penalty='l2' is the standard for SVC and loss='hinge' is the standard SVM loss function, so hinge can only be paired with the l2 penalty.

'hinge': the standard SVM loss function
'squared_hinge': the square of the hinge loss

Usage

LinearSVC(loss='squared_hinge')
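
To make the two losses concrete, here is a small sketch that evaluates them by hand and then tries the unsupported combination. The margin values and the four-point toy data set are made up for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Margins y * f(x) for four hypothetical samples.
margins = np.array([2.0, 1.0, 0.5, -1.0])

# hinge: max(0, 1 - margin); squared_hinge: the same value squared.
hinge = np.maximum(0.0, 1.0 - margins)
squared_hinge = hinge ** 2
print(hinge)
print(squared_hinge)

# The unsupported combination is rejected when fit() is called.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 1, 0, 1])
try:
    LinearSVC(penalty="l1", loss="hinge", dual=False).fit(X, y)
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

Samples with a margin of at least 1 contribute no loss under either function; squared_hinge penalizes large violations more heavily than hinge.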

③Regularization parameter C

The strength of the regularization is inversely proportional to C: the smaller C is, the stronger the regularization. With penalty='l2' the penalty is the squared L2 norm. C is a float and must be strictly positive.

Usage

LinearSVC(C=2.0)
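
The inverse relationship between C and the regularization strength can be seen in the size of the learned weights. This is a sketch on synthetic data (an assumption); the three C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic two-class data (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

# Fit the same model with increasing C and record the weight norm.
norms = {}
for C in (0.01, 1.0, 100.0):
    model = LinearSVC(C=C, dual=False, max_iter=10000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(C, norms[C])
```

A small C shrinks the weights strongly; a large C lets the model fit the training data more closely.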

④dual

Selects whether the algorithm solves the dual or the primal optimization problem. In the signature above the default is 'warn', which behaves like True.

'auto': the value is chosen automatically based on n_samples, n_features, loss, multi_class and penalty.
If n_samples < n_features and the optimizer supports the chosen loss, multi_class and penalty, dual is set to True; otherwise it is set to False.
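
Usage

As a rough sketch of the rule above, dual can also be set by hand. The variable use_dual is a hypothetical name, the data is synthetic, and the check deliberately ignores the "optimizer supports the combination" part of the rule, which holds for the default penalty and loss used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data with far more samples than features (assumption).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)
n_samples, n_features = X.shape

# Simplified version of the rule: dual only when n_samples < n_features.
use_dual = n_samples < n_features
print(use_dual)

# Both formulations describe the same problem, so the predictions
# should be nearly identical; only the solver differs.
primal = LinearSVC(dual=False, max_iter=10000).fit(X, y)
dual = LinearSVC(dual=True, max_iter=10000, random_state=0).fit(X, y)
agreement = np.mean(primal.predict(X) == dual.predict(X))
print(agreement)
```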

⑤Random seed random_state

To compare runs under controlled conditions, set the random seed to the same integer every time.

Usage

LinearSVC(random_state=42)
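
A quick way to see what the seed does is to fit the same model twice with the same random_state. This sketch uses synthetic data (an assumption) and dual=True, because the dual solver is the one that shuffles the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic two-class data (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

# Two independent fits with the same seed.
model_a = LinearSVC(dual=True, random_state=42, max_iter=10000).fit(X, y)
model_b = LinearSVC(dual=True, random_state=42, max_iter=10000).fit(X, y)

# With the same seed the learned weights should match.
same_weights = np.allclose(model_a.coef_, model_b.coef_)
print(same_weights)
```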

⑥Finally, build the model

LinearSVC(penalty='l2',loss='squared_hinge',C=2.0,random_state=42)

3. Code implementation

①Import packages

The model needs to be trained, evaluated, saved and loaded, so the packages below are required. If an import fails, install the missing package with pip.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import joblib
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

②Load the data set

The data set is a simple CSV file that you can create yourself. Mine has 6 independent variables X and 1 dependent variable Y.

fiber = pd.read_csv("./fiber.csv")
fiber.head(5)  # show the first 5 rows

③Divide the data set

The first six columns are the independent variable X, and the last column is the dependent variable Y

train_test_split is the commonly used function for splitting a data set. Its main parameters are:
test_size: proportion of the data used for the test set
train_size: proportion of the data used for the training set
random_state: random seed
shuffle: whether to shuffle the data before splitting
My data set has 48 samples in total; with train_size=0.75 and test_size=0.25 that gives 36 training samples and 12 test samples.

X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']

X_train, X_test, y_train, y_test = train_test_split(X,Y,train_size=0.75,test_size=0.25,random_state=42,shuffle=True)

print(X_train.shape) #(36,6)
print(y_train.shape) #(36,)
print(X_test.shape) #(12,6)
print(y_test.shape) #(12,)

④Build LinearSVC model

You can try setting and adjusting the parameters yourself.

lsvc = LinearSVC(penalty='l2',loss='squared_hinge',C=2.0,random_state=42)

⑤Model training

Training is that simple: a single call to fit trains the model.

lsvc.fit(X_train,y_train)
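
The sample rows used later in this post have columns on very different scales (values around 16 next to values above 1,000,000), and linear SVMs are generally sensitive to feature scale. The following is a sketch of one common remedy, standardizing the features inside a pipeline. It is an optional variation rather than the method used in this post, and it runs on synthetic data whose columns are stretched by made-up factors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data (assumption), stretched to very different column scales.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)
X = X * np.array([1.0, 1e4, 1e3, 1e3, 10.0, 1e6])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# The scaler is fitted on the training data only, then applied to any
# data passed to predict() or score().
scaled_lsvc = make_pipeline(
    StandardScaler(),
    LinearSVC(penalty="l2", loss="squared_hinge", C=2.0,
              dual=False, random_state=42),
)
scaled_lsvc.fit(X_train, y_train)
scaled_score = scaled_lsvc.score(X_test, y_test)
print(scaled_score)
```

The pipeline object can be used anywhere the plain model is used, including saving and loading with joblib.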

⑥Model evaluation

Pass the test set to the model to get the predictions.

y_pred = lsvc.predict(X_test)

Compare the predictions with the actual test labels: each match counts as 1 and each mismatch as 0, so the mean is the accuracy.

accuracy = np.mean(y_pred==y_test)
print(accuracy)

The model can also be evaluated with score. The result and the idea are the same: both give the fraction of test samples that the model predicts correctly. score is a method of the fitted model, so nothing extra has to be imported; it takes the test features and the true labels. The accuracy_score function from sklearn.metrics computes the same number but takes the true labels and the predictions instead.

score = lsvc.score(X_test,y_test)  # accuracy on the test set
print(score)
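
The imports at the top also bring in accuracy_score, confusion_matrix and classification_report. Here is a sketch of how they relate to the two accuracy calculations above, run on synthetic three-class data (an assumption) rather than fiber.csv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic three-class data (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LinearSVC(dual=False, max_iter=10000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Three ways to compute the same accuracy.
acc_manual = np.mean(y_pred == y_test)
acc_method = model.score(X_test, y_test)
acc_metric = accuracy_score(y_test, y_pred)
print(acc_manual, acc_method, acc_metric)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1.
print(classification_report(y_test, y_pred))
```

The confusion matrix and the report show which classes are confused with each other, which a single accuracy number hides.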

⑦Model testing

Take a single sample and run it through the trained model.
Here I pick six arbitrary values for the independent variables, pass them to the model, and print the prediction to check whether it matches the correct result.

test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])
prediction = lsvc.predict(test)
print(prediction) #[2]
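
When a model is fitted on a pandas DataFrame, recent scikit-learn versions remember the column names and warn if a bare NumPy array is passed to predict. A sketch of one way around this is to wrap the sample in a DataFrame with the same columns. The column names x1 to x6 are hypothetical, since the CSV headers are not listed here, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data wrapped in a DataFrame with hypothetical column names.
X_arr, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                               n_redundant=0, random_state=0)
columns = [f"x{i}" for i in range(1, 7)]
X = pd.DataFrame(X_arr, columns=columns)

model = LinearSVC(dual=False, max_iter=10000).fit(X, y)

# Build the single sample as a one-row DataFrame with the training columns.
sample = pd.DataFrame([X_arr[0]], columns=X.columns)
prediction = model.predict(sample)
print(prediction)
```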

⑧Save the model

lsvc is the name of the trained model variable and must match the one used above.
The second argument is the path where the model is saved.

joblib.dump(lsvc, './lsvc.model')  # save the model

⑨Load and use the model

lsvc_yy = joblib.load('./lsvc.model')

test = np.array([[11,99498,5369,9045.27,28.47,3827588.56]])  # an arbitrary sample
prediction = lsvc_yy.predict(test)  # feed in the sample and predict
print(prediction) #[4]
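
To confirm that saving and loading preserves the model, the round trip can be checked end to end. This sketch uses synthetic data and a temporary directory (both assumptions) so that it does not touch ./lsvc.model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic two-class data (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)
model = LinearSVC(dual=False, max_iter=10000).fit(X, y)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lsvc.model")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should predict exactly like the original.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

A saved model should be loaded with the same scikit-learn version that created it, since the file format is not guaranteed to be compatible across versions.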

Complete code

This covers model training and evaluation; steps ⑧ and ⑨ (saving and loading) are not included.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import joblib
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

fiber = pd.read_csv("./fiber.csv")
# split into independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)

lsvc = LinearSVC(penalty='l2',loss='squared_hinge',C=2.0,random_state=42)
lsvc.fit(X_train,y_train)

y_pred = lsvc.predict(X_test)
accuracy = np.mean(y_pred==y_test)
print(accuracy)
score = lsvc.score(X_test,y_test)  # accuracy on the test set
print(score)

test = np.array([[16,18312.5,6614.5,2842.31,25.23,1147430.19]])
prediction = lsvc.predict(test)
print(prediction) #[2]