Machine learning practice based on sklearn

LinearRegression

Getting Started with Linear Regression

Data generation

To see the idea of the algorithm intuitively, we first generate some two-dimensional data that we can visualize.

import numpy as np
import matplotlib.pyplot as plt 

def true_fun(X):  # the true function we define, i.e. the ground-truth model
    return 1.5*X + 0.2

np.random.seed(0)  # set the random seed
n_samples = 30     # number of sampled data points

'''Generate random data as the training set and add some noise'''
X_train = np.sort(np.random.rand(n_samples)) 
y_train = (true_fun(X_train) + np.random.randn(n_samples) * 0.05).reshape(n_samples,1)
# the training targets carry a small amount of random noise

Define model

We can directly use LinearRegression in sklearn:

from sklearn.linear_model import LinearRegression
model = LinearRegression()  # this is our model
model.fit(X_train[:, np.newaxis], y_train)  # train the model
print("Parameter w:", model.coef_)
print("Parameter b:", model.intercept_)
Parameter w: [[1.4474774]]
Parameter b: [0.22557542]

Note the np.newaxis in the code above. Because X_train is a one-dimensional vector, np.newaxis turns it into an N×1 two-dimensional matrix. Writing X_train[:, None] has exactly the same effect.

As for why this is necessary: try it without the reshape and sklearn will raise an error like:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Put simply, sklearn requires the training features to be a two-dimensional array; a one-dimensional vector is not accepted.
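A minimal sketch (illustrative only) of the equivalent ways to turn a 1-D vector into the N×1 array sklearn expects:

import numpy as np

X_demo = np.sort(np.random.rand(5))   # shape (5,)
print(X_demo[:, np.newaxis].shape)    # (5, 1)
print(X_demo[:, None].shape)          # (5, 1), same as np.newaxis
print(X_demo.reshape(-1, 1).shape)    # (5, 1), as the error message suggests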

Model testing and comparison

You can see that the fitted parameters are about 1.45 and 0.23, quite close to the true values of 1.5 and 0.2. Next we generate a set of test points to check the fit:

X_test = np.linspace(0, 1, 100)  # 100 evenly spaced points between 0 and 1
plt.plot(X_test, model.predict(X_test[:, np.newaxis]), label = "Model")  # plot the fitted line
plt.plot(X_test, true_fun(X_test), label = "True function")  # the true function
plt.scatter(X_train, y_train)  # plot the training points
plt.legend(loc="best")  # place the legend at the best position
plt.show()


The case above is the simplest one. When the relationship between x and y is non-linear, we need polynomial regression.

polynomial regression

Implementation

For polynomial regression, linear regression on polynomial features is generally used, i.e. we fit $y=\sum_{i=1}^{m} b_i x^i$:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures  # class that builds polynomial features
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score  # cross-validation

def true_fun(X):  # the true function
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 30 

X = np.sort(np.random.rand(n_samples))  # random samples, sorted
y = true_fun(X) + np.random.randn(n_samples) * 0.1

degrees = [1, 4, 15]  # maximum polynomial degree; we try to fit with degrees 1, 4 and 15
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i+1)  # three subplots in total; get the handle of the (i+1)-th
    plt.setp(ax, xticks=(), yticks=())  # set axes properties (hide the ticks)
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    # polynomial feature builder: the first argument is the maximum degree, the second whether to add a bias column
    linear_regression = LinearRegression()  # linear regression
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])  # chain the two steps with a pipeline
    pipeline.fit(X[:, np.newaxis], y) 
    scores = cross_val_score(pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10) 
    # cross-validation: the arguments are the model, the inputs, the labels, the scoring metric and the number of folds
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")    
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(degrees[i], -scores.mean(), scores.std()))
plt.show()


Here are two places to explain:

  • PolynomialFeatures: this class constructs features. Our original X is a one-dimensional vector, i.e. a polynomial of degree 1. To build a polynomial we need $X^1, X^2, \ldots, X^m$ (this is the single-variable case; with several variables the cross terms are also generated), and this class performs exactly that operation, constructing the m features.
  • Pipeline: a convenience that chains modules together so we do not have to run each step by hand. Here it chains PolynomialFeatures and LinearRegression: the input first goes through feature construction and then linear regression, so we only need to fit the pipeline.

We also use cross-validation here; it is very common, so I will not explain it further.
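A tiny illustration (not part of the original example) of what PolynomialFeatures actually produces for a single input column:

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

x = np.array([[2.0], [3.0]])  # one feature, two samples
poly = PolynomialFeatures(degree=3, include_bias=False)
print(poly.fit_transform(x))
# [[ 2.  4.  8.]
#  [ 3.  9. 27.]]   -> the columns are x, x^2, x^3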

LogisticRegression

Brief description of algorithmic ideas

Logistic regression is mostly used for binary classification. Given data $X=\{x_1,x_2,\ldots\}$, $Y=\{y_1,y_2,\ldots\}$, consider the binary classification task. The hypothesis function is
$$h_{\theta}(x) = g(\theta^T x) = g(w^T x + b) = \frac{1}{1+e^{-(w^T x + b)}}$$
which represents the probability of class 1 (one minus it gives the probability of class 0).

The loss function is generally derived by maximum likelihood estimation:
$$L(\theta)=\prod_{i=1} p(y_i \mid x_i)=h_{\theta}(x_1)\,(1-h_{\theta}(x_2))\cdots$$
(here assuming $y_1=1$, $y_2=0$). Minimizing the negative log-likelihood gives
$$\theta^{*}=\arg\min_{\theta}\bigl(-\ln L(\theta)\bigr)=\sum_{i=1}\bigl(-y_i\,\theta^T x_i+\ln(1+e^{\theta^T x_i})\bigr)$$
which can be solved with gradient descent.
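As a sanity check on the derivation, here is a minimal NumPy sketch (not sklearn's implementation) of logistic regression trained by gradient descent; names and hyperparameters are illustrative:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, n_iters=1000):
    # X: (n, d) features with a column of ones appended for the bias; y: (n,) labels in {0, 1}
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ theta)           # predicted probability of class 1
        grad = X.T @ (p - y) / len(y)    # gradient of the (averaged) negative log-likelihood
        theta -= lr * grad
    return theta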

Algorithm implementation

# sklearn version below
import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784")  # load the data
X, y = mnist['data'], mnist['target']
X_train = np.array(X[:60000], dtype = float)
y_train = np.array(y[:60000], dtype = float)
X_test = np.array(X[60000:], dtype = float)
y_test = np.array(y[60000:], dtype = float)  # build the training and test sets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(60000, 784)
(60000,)
(10000, 784)
(10000,)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l1', solver='saga', tol=0.1)
# penalty chooses l1 or l2 regularization, tol is the stopping tolerance, solver is the optimizer
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("Test score with L1 penalty: %.4f" % score)
Test score with L1 penalty: 0.9245

One thing I was curious about: logistic regression is a binary classifier, yet here we feed it a multi-class dataset directly. Why does this work? It turns out the class handles it internally, extending the binary model to the multi-class case (for example one-vs-rest or a multinomial formulation, depending on the solver).
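A quick, illustrative check that the fitted classifier is indeed treating the problem as ten classes (the exact printed values depend on the run):

print(clf.classes_)                          # the ten digit classes 0.0 ... 9.0
print(clf.predict_proba(X_test[:1]).shape)   # (1, 10): one probability per class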

# PyTorch version below
from torch.utils.data import DataLoader
from torchvision import datasets
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np
# p_parent_path should point to the directory that contains the datasets folder
train_dataset = datasets.MNIST(root = p_parent_path+'/datasets/', train = True, transform = transforms.ToTensor(), download = False)
test_dataset = datasets.MNIST(root = p_parent_path+'/datasets/', train = False, transform = transforms.ToTensor(), download = False)
# load the datasets
batch_size = len(train_dataset)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)
# data loaders
X_train, y_train = next(iter(train_loader))
X_test, y_test = next(iter(test_loader))
# show the first 100 images
images, labels = X_train[:100], y_train[:100] 
# build a grid of images, 10 per row
img = torchvision.utils.make_grid(images, nrow=10)
# plt.imshow expects (height, width, channels) while img is (channels, height, width),
# so use .transpose() to move the colour channel to the last axis
img = img.numpy().transpose(1,2,0)
print(images.shape)
print(labels.reshape(10,10))
print(img.shape)
plt.imshow(img)
plt.show()
torch.Size([100, 1, 28, 28])
tensor([[4, 7, 0, 9, 3, 6, 1, 7, 7, 8],
        [8, 3, 2, 7, 2, 4, 4, 3, 8, 0],
        [5, 6, 4, 9, 0, 6, 1, 2, 3, 3],
        [6, 0, 4, 3, 7, 0, 7, 6, 5, 1],
        [4, 3, 4, 8, 5, 3, 1, 5, 2, 4],
        [5, 4, 8, 5, 5, 1, 1, 6, 0, 4],
        [5, 4, 5, 1, 4, 4, 8, 2, 7, 3],
        [8, 1, 8, 6, 3, 7, 7, 9, 5, 9],
        [8, 4, 7, 0, 3, 6, 6, 2, 5, 3],
        [2, 0, 6, 5, 1, 7, 2, 7, 1, 2]])
(302, 302, 3)


X_train, y_train = X_train.cpu().numpy(), y_train.cpu().numpy()  # convert the tensors to arrays
X_test, y_test = X_test.cpu().numpy(), y_test.cpu().numpy()      # convert the tensors to arrays
X_train = X_train.reshape(X_train.shape[0], 784)  # flatten each image into a vector of length 28*28 = 784
X_test = X_test.reshape(X_test.shape[0], 784)
model = LogisticRegression(solver='lbfgs', max_iter = 400)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))  # print the report
precision    recall  f1-score   support

           0       0.50      0.75      0.60         4
           1       0.71      1.00      0.83        10
           2       0.79      0.85      0.81        13
           3       0.79      0.69      0.73        16
           4       0.83      0.91      0.87        11
           5       0.60      0.23      0.33        13
           6       1.00      1.00      1.00         5
           7       0.88      1.00      0.93         7
           8       0.67      0.83      0.74        12
           9       0.71      0.56      0.63         9

    accuracy                           0.75       100
   macro avg       0.75      0.78      0.75       100
weighted avg       0.74      0.75      0.73       100

Decision Tree

First, we introduce a dataset: the iris dataset. It contains 150 records in 3 classes, 50 per class. Each record has 4 features: sepal length, sepal width, petal length and petal width, which can be used to predict which species an iris flower belongs to (iris-setosa, iris-versicolour, iris-virginica).

import seaborn as sns
from pandas import plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

# load the dataset
data = load_iris()
# convert to a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Species'] = data.target  # add the species column
# inspect the dataset
print(f"Dataset info:\n{df.info()}")
# look at the first 5 rows
print(f"First 5 rows:\n{df.head()}")
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   Species            150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB
Dataset info:
None
First 5 rows:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   Species  
0        0  
1        0  
2        0  
3        0  
4        0  


The above is a preliminary observation of the data. Let’s look at the specific algorithm implementation:

# replace the numeric codes with the class names
target = np.unique(data.target)  # deduplicate
print(target)
target_names = np.unique(data.target_names)
print(target_names)
targets = dict(zip(target, target_names))
print(targets)
df['Species'] = df['Species'].replace(targets)

# extract the features and labels
X = df.drop(columns = 'Species')  # dropping the label column leaves the features
y = df['Species']
feature_names = X.columns
labels = y.unique()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)
# split into training and test sets; the test set takes 40%, random seed 42
model = DecisionTreeClassifier(max_depth = 3, random_state = 42)  # maximum tree depth of 3
model.fit(X_train, y_train)
# print the tree as text
text_representation = tree.export_text(model)
print(text_representation)

# draw the tree as a figure
plt.figure(figsize=(30, 10), facecolor='w')
a = tree.plot_tree(model,
                  feature_names = feature_names,
                  class_names = labels,
                  rounded = True,
                  filled = True,
                  fontsize = 14)
plt.show()
[0 1 2]
['setosa' 'versicolor' 'virginica']
{0: 'setosa', 1: 'versicolor', 2: 'virginica'}
|--- feature_2 <= 2.45
|   |--- class: setosa
|--- feature_2 >  2.45
|   |--- feature_3 <= 1.75
|   |   |--- feature_2 <= 5.35
|   |   |   |--- class: versicolor
|   |   |--- feature_2 >  5.35
|   |   |   |--- class: virginica
|   |--- feature_3 >  1.75
|   |   |--- feature_2 <= 4.85
|   |   |   |--- class: virginica
|   |   |--- feature_2 >  4.85
|   |   |   |--- class: virginica


MLP

For an introduction to multi-layer perceptrons, you can read my blog post on neural networks.

Next we focus on algorithm implementation

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml("mnist_784")  # load the dataset
X, y = mnist['data'], mnist['target']
X_train = np.array(X[:60000], dtype=float)
y_train = np.array(y[:60000], dtype=float)
X_test = np.array(X[60000:], dtype=float)
y_test = np.array(y[60000:], dtype=float)

clf = MLPClassifier(alpha = 1e-5, hidden_layer_sizes = (15,15), random_state=1)
# alpha is the regularization strength; hidden_layer_sizes gives the number of nodes per hidden layer, here 2 layers of 15 each

clf.fit(X_train, y_train)

score = clf.score(X_test, y_test)
score
0.9124

Then there are some parameters worth noting:

  • activation: the activation function, one of {'identity', 'logistic', 'tanh', 'relu'}; the default is 'relu'
  • solver: the weight optimizer, one of {'lbfgs', 'sgd', 'adam'}; the default is 'adam'
  • learning_rate_init: the initial learning rate, only used with the 'sgd' or 'adam' solvers
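For illustration only (the hyperparameter values below are arbitrary, not a recommendation), this is how those options would be set:

from sklearn.neural_network import MLPClassifier

clf_sgd = MLPClassifier(hidden_layer_sizes=(15, 15),
                        activation='tanh',        # try tanh instead of the default relu
                        solver='sgd',             # plain stochastic gradient descent
                        learning_rate_init=0.01,  # only used by sgd/adam
                        alpha=1e-5,
                        random_state=1)
# clf_sgd.fit(X_train, y_train) would then train it exactly as above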

SVM

We still focus on the implementation of the SVM algorithm.

Choosing different kernels mainly comes down to specifying the kernel parameter of svm.SVC.

Linear SVM

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
data = np.array([
    [0.1, 0.7],
    [0.3, 0.6],
    [0.4, 0.1],
    [0.5, 0.4],
    [0.8, 0.04],
    [0.42, 0.6],
    [0.9, 0.4],
    [0.6, 0.5],
    [0.7, 0.2],
    [0.7, 0.67],
    [0.27, 0.8],
    [0.5, 0.72]
])  # build the dataset
label = [1] * 6 + [0] * 6  # the first 6 points have label 1, the last 6 label 0
x_min, x_max = data[:,0].min() - 0.2, data[:,0].max() + 0.2
y_min, y_max = data[:,1].min() - 0.2, data[:,1].max() + 0.2
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.002),
                    np.arange(y_min, y_max, 0.002))  # build the grid
model_linear = svm.SVC(kernel = 'linear', C = 0.001)  # linear SVM model
model_linear.fit(data, label)
Z = model_linear.predict(np.c_[xx.ravel(), yy.ravel()])
# flatten xx (size*size) and yy (size*size) to 1-D, stack them as two columns and feed the result to predict
Z = Z.reshape(xx.shape)
plt.contourf(xx,yy, Z, cmap=plt.cm.ocean, alpha=0.6)
# draws filled contours: xx gives the x coordinates, yy the y coordinates, Z the value at each grid point, cmap the colour scheme
plt.scatter(data[:6,0],data[:6,1], marker='o', color='r', s=100, lw=3)
plt.scatter(data[6:,0],data[6:,1], marker='x', color='k', s=100, lw=3)
plt.title("Linear SVM")
plt.show()


polynomial kernel

plt.figure(figsize=(16,15))

# compare several polynomial degrees
for i, degree in enumerate([1,3,5,7,9,12]):
    model_poly = svm.SVC(C=0.001, kernel='poly', degree = degree)  # polynomial kernel
    model_poly.fit(data, label)
    Z = model_poly.predict(np.c_[xx.ravel(), yy.ravel()])  # predict
    Z = Z.reshape(xx.shape)
    
    plt.subplot(3,2, i+1)
    plt.subplots_adjust(wspace=0.2, hspace=0.2)  # adjust the spacing between subplots
    plt.contourf(xx,yy, Z, cmap=plt.cm.ocean, alpha=0.6)
    
    plt.scatter(data[:6, 0], data[:6, 1], marker='o', color='r', s=100, lw=3)
    plt.scatter(data[6:, 0], data[6:, 1], marker='x', color='k', s=100, lw=3)
    plt.title('Poly SVM with $\degree=$' + str(degree))
    
plt.show()


Gaussian kernel

plt.figure(figsize=(16,15))

for i, gamma in enumerate([1,5,15,35,45,55]):
    model_rbf = svm.SVC(kernel='rbf', gamma=gamma, C = 0.001)  # Gaussian (RBF) kernel model
    model_rbf.fit(data, label)
    Z = model_rbf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.subplot(3, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    plt.contourf(xx, yy, Z, cmap=plt.cm.ocean, alpha=0.6)
 
    # plot the training points
    plt.scatter(data[:6, 0], data[:6, 1], marker='o', color='r', s=100, lw=3)
    plt.scatter(data[6:, 0], data[6:, 1], marker='x', color='k', s=100, lw=3)
    plt.title('RBF SVM with $\gamma=$' + str(gamma))
plt.show()    


Compare the effects of different kernels on MNIST

Read data
from sklearn import svm
import numpy as np
from time import time
from sklearn.metrics import accuracy_score
from struct import unpack
from sklearn.model_selection import GridSearchCV

def readimage(path):
    with open(path, 'rb') as f:
        magic, num, rows, cols = unpack('>4I', f.read(16))
        img = np.fromfile(f, dtype=np.uint8).reshape(num, 784)
    return img

def readlabel(path):
    with open(path, 'rb') as f:
        magic, num = unpack('>2I', f.read(8))
        lab = np.fromfile(f, dtype=np.uint8)
    return lab

train_data  = readimage("../../datasets/MNIST/raw/train-images-idx3-ubyte")  # read the data
train_label = readlabel("../../datasets/MNIST/raw/train-labels-idx1-ubyte")
test_data   = readimage("../../datasets/MNIST/raw/t10k-images-idx3-ubyte")
test_label  = readlabel("../../datasets/MNIST/raw/t10k-labels-idx1-ubyte")
print(train_data.shape)
print(train_label.shape)
(60000, 784)
(60000,)
Gaussian kernel
# The dataset is large; to save time we train on only the first 4000 images
train_data=train_data[:4000]
train_label=train_label[:4000]
test_data=test_data[:400]
test_label=test_label[:400]

svc=svm.SVC()
parameters = {"kernel":['rbf'], "C":[1]}
print("Train....")
clf = GridSearchCV(svc, parameters, n_jobs=-1)  # grid search to choose the parameters
start = time()
clf.fit(train_data, train_label)
end = time()
t = end - start
print("Training time: %dmin%.3fsec" % (t//60, t-60 * (t//60)))
prediction = clf.predict(test_data)
print("accuracy:",accuracy_score(prediction, test_label))
accurate = [0] * 10
sumall = [0] * 10

i = 0
j = 0
while i < len(test_label):
    sumall[test_label[i]] += 1
    if prediction[i] == test_label[i]:
        j += 1
    i += 1
print("Test set accuracy:", j/400)
Train....
Training time: 0min7.548sec
accuracy: 0.955
Test set accuracy: 0.955
polynomial kernel
parameters = {'kernel':['poly'], 'C':[1]}  # polynomial kernel
print("Train...")
clf=GridSearchCV(svc,parameters,n_jobs=-1)
start = time()
clf.fit(train_data, train_label)
end = time()
t = end - start
print('Train:%dmin%.3fsec' % (t//60, t - 60 * (t//60)))
prediction = clf.predict(test_data)
print("accuracy: ", accuracy_score(prediction, test_label))
accurate=[0]*10
sumall=[0]*10
i=0
j=0
while i<len(test_label):  # compute the test-set accuracy
    sumall[test_label[i]]+=1
    if prediction[i]==test_label[i]:
        j+=1
    i+=1
print("Test set accuracy:",j/400)
Train...
Train:0min6.438sec
accuracy:  0.93
Test set accuracy: 0.93
linear kernel
parameters = {'kernel':['linear'], 'C':[1]}  # linear kernel
print("Train...")
clf=GridSearchCV(svc,parameters,n_jobs=-1)
start = time()
clf.fit(train_data, train_label)
end = time()
t = end - start
print('Train:%dmin%.3fsec' % (t//60, t - 60 * (t//60)))
prediction = clf.predict(test_data)
print("accuracy: ", accuracy_score(prediction, test_label))
accurate=[0]*10
sumall=[0]*10
i=0
j=0
while i<len(test_label):  # compute the test-set accuracy
    sumall[test_label[i]]+=1
    if prediction[i]==test_label[i]:
        j+=1
    i+=1
print("Test set accuracy:",j/400)
Train...
Train:0min3.712sec
accuracy:  0.9175
Test set accuracy: 0.9175

NBayes

The call to this part of the algorithm is relatively simple:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.datasets import make_blobs
# make_blobs: generates datasets for clustering
# n_samples: number of points, n_features: dimensionality, centers: the cluster centres (or their number), default 3
# cluster_std: standard deviation(s) of the clusters, default 1.0; random_state: random seed
X, y = make_blobs(n_samples = 100, n_features = 2, centers = 2, random_state = 2, cluster_std = 1.5)
plt.scatter(X[:,0], X[:,1], c = y, s = 50, cmap = 'RdBu')
plt.show()

First draw a scatter plot of the training set:


Next we build our own test set to see the effect:

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()  # naive Bayes with Gaussian likelihoods
model.fit(X, y)
rng = np.random.RandomState(0)
X_test = [-6, -14] + [14,18] * rng.rand(5000,2)  # generate the test set
y_pred = model.predict(X_test)
# plot training and test data together: large, saturated points are the training set, small faint points the test set
plt.scatter(X[:,0],X[:,1], c = y, s = 50, cmap = 'RdBu')
lim = plt.axis()  # save the current axis limits
plt.scatter(X_test[:,0], X_test[:,1], c = y_pred, s = 20, cmap='RdBu', alpha = 0.1)
plt.axis(lim)
plt.show()


You can see that the boundary between the two classes is quite clear.

We can also look at what the predicted probability looks like:

yprob = model.predict_proba(X_test)
yprob[:20].round(2)


array([[0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.94, 0.06],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.01, 0.99],
       [0.  , 1.  ],
       [0.  , 1.  ]])

bagging and random forests

For an introduction to this topic, you can read my blog post.

Next we continue to focus on the implementation of the algorithm:

The first is the loading of the data set:

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

wine = load_wine()  # use the wine dataset
print(f"All features: {wine.feature_names}")
X = pd.DataFrame(wine.data, columns = wine.feature_names)
y = pd.Series(wine.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
All features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

Next, let's briefly look at what the results will be if we use a single decision tree:

# build and train the decision tree classifier
base_model = DecisionTreeClassifier(max_depth = 1, criterion='gini', random_state = 1)  
# use the Gini index as the splitting criterion
base_model.fit(X_train, y_train)
y_pred = base_model.predict(X_test)
print(f"Decision tree accuracy: {accuracy_score(y_test, y_pred):.3f}")
Decision tree accuracy: 0.694

As you can see, a single shallow decision tree is not accurate enough.

Then let's try the bagging ensemble using this decision tree as the base classifier to see how much improvement it can make:

from sklearn.ensemble import BaggingClassifier
# the base estimator is the decision tree built above; it has already been fit once, but that does not matter since it is refit here
model = BaggingClassifier(base_estimator = base_model,
                         n_estimators = 50,  # at most 50 weak learners
                         random_state = 1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # predict
print(f"BaggingClassifier accuracy: {accuracy_score(y_test,y_pred):.3f}")
BaggingClassifier accuracy: 0.917

The improvement is quite obvious! Next we focus on an important parameter: the effect of the number of base classifiers on the results:

# test the effect of the number of base classifiers
x = list(range(2, 102, 2))  # the even numbers from 2 to 100
y = []
for i in x:
    model = BaggingClassifier(base_estimator = base_model,
                             n_estimators = i,
                             random_state = 1)
    model.fit(X_train, y_train)
    model_test_sc = accuracy_score(y_test, model.predict(X_test))
    y.append(model_test_sc)  # store the score
    
plt.style.use('ggplot')  # set the plotting style
plt.title("Effect of n_estimators", pad = 20)
plt.xlabel("Number of base estimators")
plt.ylabel("Test accuracy of BaggingClassifier")
plt.plot(x,y)
plt.show()


As you can see, more base classifiers is not always better: too many can introduce redundancy and hurt classification performance.

Next, let's look at the improved algorithm, random forest:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier( n_estimators = 50,
                              random_state = 1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"RandomForestClassifier accuracy: {accuracy_score(y_test,y_pred):.3f}")
RandomForestClassifier accuracy: 0.972

Because random forest adds feature randomness, the diversity of its base classifiers increases and the classification accuracy improves further.

Let’s also observe the impact of the number of base classifiers on the results:

x = list(range(2, 102, 2))  # number of estimators (n_estimators); here we take the even numbers in [2, 100]
y = []

for i in x:
    model = RandomForestClassifier(n_estimators=i,
                                   random_state=1)
    model.fit(X_train, y_train)
    model_test_sc = accuracy_score(y_test, model.predict(X_test))
    y.append(model_test_sc)

plt.style.use('ggplot')
plt.title("Effect of n_estimators", pad=20)
plt.xlabel("Number of base estimators")
plt.ylabel("Test accuracy of RandomForestClassifier")
plt.plot(x, y)
plt.show()


For random forest, I think the added feature randomness makes it less sensitive to the number of base classifiers.

AdaBoost

For an introduction to AdaBoost, you can also read my blog post.

Below we still focus on the implementation of the algorithm:

Also import the data first, and then see how good the model is on a single decision tree:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()  # use the wine dataset
print(f"All features: {wine.feature_names}")
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

base_model = DecisionTreeClassifier(max_depth = 1, criterion='gini', random_state = 1)
base_model.fit(X_train, y_train)
y_pred = base_model.predict(X_test)
print(f"Decision tree accuracy: {accuracy_score(y_test,y_pred):.3f}")
Decision tree accuracy: 0.694

The result is the same as before.

Then we try to apply the AdaBoost algorithm to fit:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
model = AdaBoostClassifier(base_estimator=base_model, n_estimators=50, learning_rate = 0.8)
# n_estimators and learning_rate are the parameters to tune; learning_rate shrinks the contribution of each weak learner
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = metrics.accuracy_score(y_test, y_pred)  # accuracy
print(f"Accuracy: {acc:.2}")
Accuracy: 0.97

You can see a big improvement! But those parameters were chosen arbitrarily, so let's use grid search to look for the parameters that perform best on the training set:

hyperparameter_space = {"n_estimators": list(range(2, 102, 2)),
                        "learning_rate": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]}
gs = GridSearchCV(AdaBoostClassifier(algorithm='SAMME.R', random_state = 1),
                 param_grid = hyperparameter_space,
                 scoring = 'accuracy', n_jobs = -1, cv = 5)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
print("Best score:", gs.best_score_)
Best parameters: {'learning_rate': 0.8, 'n_estimators': 42}
Best score: 0.9857142857142858

Let’s look at its score on the test set:

model = AdaBoostClassifier(base_estimator=base_model, n_estimators=42, learning_rate = 0.8)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = metrics.accuracy_score(y_test, y_pred)  # accuracy
print(f"Accuracy: {acc:.2}")
Accuracy: 0.94

You can see that it is actually no better than our earlier parameters. The point to note is that grid search performs K-fold cross-validation on the training set; I had initially assumed it simply looked for the parameters that fit the training set best, which is not the case.

k-means algorithm

For a detailed introduction to clustering algorithms, you can read my blog post.

Next we continue to focus on algorithm implementation.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
from sklearn import datasets
%matplotlib inline
# before clustering
X = np.random.rand(1000,2)
plt.scatter(X[:,0], X[:,1], marker='o')


# initialize the centroids by picking k points from the original data
def IiniCentroids(X, k):
    index = np.random.randint(0, len(X)-1, k)
    return X[index]
# after clustering
kmeans = KMeans(n_clusters = 4)  # split into 4 clusters
kmeans.fit(X)
label_pred = kmeans.labels_
plt.scatter(X[:,0], X[:,1], c= label_pred)
plt.show()


The example code is quite minimal, so I looked up the other parameters of the KMeans class:

  • n_clusters: int, the number of clusters to form; default is 8
  • max_iter: int, the maximum number of iterations; default is 300
  • n_init: the algorithm is run this many times with different initial centroids, and the best run is kept as the result
  • init: has three possible values
    • k-means++: the default; picks the initial centroids with a special scheme that speeds up convergence
    • random: pick the centroids randomly from the training data
    • an ndarray: specify the centroids yourself
  • n_jobs
  • random_state

Its main attributes are:

  • cluster_centers_: the final cluster centres
  • labels_: the cluster assigned to each sample
  • inertia_: the sum of squared distances of samples to their cluster centre; the smaller it is, the tighter the clusters, and it can be used to pick a suitable number of clusters (see the sketch after the output below)
print("聚类中心为:",kmeans.cluster_centers_)
print("评估:",kmeans.inertia_)
聚类中心为: [[0.79862048 0.71591318]
 [0.22582347 0.26005466]
 [0.73845863 0.23886344]
 [0.29972473 0.76998545]]
评估: 41.37635968102986
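As a rough sketch of how inertia_ can guide the choice of k (the elbow method; the range of k here is arbitrary):

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    inertias.append(km.inertia_)   # inertia always decreases with k; look for the "elbow"
plt.plot(ks, inertias, marker='o')
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()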

KNN

For an introduction to this algorithm, you can read my blog, which explains the algorithm and the detailed Python implementation process.

Below we focus on the implementation process of using sklearn.

The neighbors module of the sklearn library implements KNN related algorithms, where:

  • The KNeighborsClassifier class is used to implement classification problems
  • The KNeighborsRegressor class is used for regression problems (simply put, the prediction is the average of the target values of the nearby points)

The construction methods of these two classes are basically the same. Here we mainly introduce the KNeighborsClassifier class. The prototype is as follows:

KNeighborsClassifier(
    n_neighbors=5, 
    weights='uniform', 
    algorithm='auto', 
    leaf_size=30, 
    p=2, 
    metric='minkowski', 
    metric_params=None, 
    n_jobs=None, 
    **kwargs)

Mainly focus on these parameters:

  • n_neighbors: The K value in KNN, the default value 5 is generally used.
  • weights: used to determine the weights of neighbors. There are three ways:
    • weights=uniform means that all neighbors have the same weight
    • weights=distance means that the weight is the reciprocal of the distance, that is, it is inversely proportional to the distance.
    • Custom functions can customize the weights corresponding to different distances. Generally, there is no need to define functions yourself.
  • algorithm: used to set the algorithm for calculating neighbors. There are four ways:
    • algorithm=auto, automatically selects an appropriate algorithm based on the data situation
    • algorithm=kd_tree, use KD tree algorithm
      • KD tree is suitable for situations with fewer dimensions. Generally, the number of dimensions does not exceed 20. If the number of dimensions exceeds 20, the efficiency will decrease.
    • algorithm=ball_tree, use ball tree algorithm
      • Ball trees are more suitable for situations with larger dimensions
    • algorithm=brute, called brute force search
      • Compared with KD tree, it uses linear scanning instead of constructing a tree structure for fast retrieval.
      • The disadvantage is that when the training set is large, the efficiency is very low
  • leaf_size: Indicates the number of leaf nodes when constructing a KD tree or ball tree, the default is 30
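For illustration only, here is how some non-default choices would look (the values are hypothetical, not a recommendation):

from sklearn.neighbors import KNeighborsClassifier

knn_custom = KNeighborsClassifier(n_neighbors=3,       # K = 3
                                  weights='distance',  # closer neighbours get larger weights
                                  algorithm='kd_tree', # use a KD tree for the neighbour search
                                  leaf_size=30)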

Let’s get into the actual code:

from sklearn.datasets import load_digits
import pandas as pd
digits = load_digits()
data = digits.data      # features
target = digits.target  # targets
data_pd = pd.DataFrame(data)
data_pd


You can see there are 64 dimensions, i.e. each sample is a point in a 64-dimensional space.

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(
    data, target, test_size=0.25, random_state=33)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(train_x, train_y)
predict_y = knn.predict(test_x)
from sklearn.metrics import accuracy_score
score = accuracy_score(test_y, predict_y)
score
0.9844444444444445

PCA

For a detailed explanation of the PCA algorithm, you can read my blog, which explains the PCA algorithm and the process of implementing PCA in Numpy.

Next we continue to focus on the implementation process of the algorithm:

# first generate random data and visualize it
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
from sklearn.datasets import make_blobs
# X holds the sample features, y the cluster labels; 10000 samples, 3 features each, 4 clusters
X, y = make_blobs(n_samples=10000, n_features=3, centers=[[3,3, 3], [0,0,0], [1,1,1], [2,2,2]], 
                  cluster_std=[0.2, 0.1, 0.2, 0.2], random_state =9)
fig = plt.figure()
ax = Axes3D(fig, rect=[0,0,1,1], elev = 20, azim = 10)  
# rect is (left, bottom, width, height) of the axes; elev is the vertical viewing angle, azim the horizontal one
ax.scatter(X[:,0], X[:,1], X[:,2], marker='o')


Because PCA is concerned with how much variance is retained when reducing dimensionality, we first do not reduce at all but only project the data, to see how the variance is distributed across the three projected dimensions:

from sklearn.decomposition import PCA
pca = PCA(n_components = 3)
pca.fit(X)
print(pca.explained_variance_ratio_)  # percentage of variance explained by each component
print(pca.explained_variance_)        # raw variance of each component
[0.98318212 0.00850037 0.00831751]
[3.78521638 0.03272613 0.03202212]

You can see that the first component accounts for about 98% of the variance.

Then try to reduce to 2 dimensions:

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
[0.98318212 0.00850037]
[3.78521638 0.03272613]

With two dimensions retained, PCA keeps the two components with the largest variance and discards the third.

We draw the dimensionally reduced picture:

X_new = pca.transform(X)
plt.scatter(X_new[:, 0], X_new[:, 1],marker='o')
plt.show()


Above we specified the number of dimensions to keep; we can also specify the proportion of variance to retain:

pca = PCA(n_components=0.95)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)
[0.98318212]
[3.78521638]
1

Because the first component accounts for 98% of the variance, retaining 95% only requires keeping the first dimension.

We can also let the MLE algorithm choose the result of dimensionality reduction:

pca = PCA(n_components='mle')
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)
[0.98318212]
[3.78521638]
1

It can be seen that the MLE algorithm only retains the first feature.

Here are the main parameters of this class:

  • n_components: the number of dimensions to keep after reduction, or the proportion of variance to retain; it can also be set to 'mle' to choose the number automatically
  • copy: boolean, whether to copy the original training data
  • whiten: boolean, whether to whiten the components so that each has the same variance

The properties of this class are:

  • n_components_: Returns the number of retained features
  • explained_variance_ratio_: Returns the variance percentage of each feature retained
  • explained_variance_: Returns the variance of each feature retained

Commonly used methods are:

  • fit_transform(X): train and return the dimensionally reduced data
  • inverse_transform(newData): Convert the dimensionally reduced data newData back to the original data, which may be a little different
  • transform(X): Convert X into dimensionally reduced data

Try restoring the dimensionally reduced data:

new_Data = pca.transform(X)
X_regan = pca.inverse_transform(new_Data)
X-X_regan
array([[ 0.14364008, -0.1352249 , -0.00781994],
       [ 0.05135552, -0.01316744, -0.03802959],
       [-0.03610653,  0.07254754, -0.03665018],
       ...,
       [ 0.18537785, -0.0907325 , -0.09400653],
       [-0.2618617 ,  0.20035984,  0.06048799],
       [-0.02015389,  0.12283753, -0.10292754]])

There is still a noticeable gap from the original data.

HMM

For an introduction to the principles of HMM, I highly recommend watching this video, it is really well explained!

We continue to focus on the implementation of the program.

hmmlearn implements three HMM model classes, which can be divided into two categories according to whether the observation state is a continuous state or a discrete state. GaussianHMM and GMMHMM are HMM models of continuous observation states, while MultinomialHMM is a model of discrete observation states. So let’s try using it:

#pip install hmmlearn

import numpy as np
import matplotlib.pyplot as plt

from hmmlearn import hmm

# Prepare parameters for a 4-components HMM
# Initial population probability
startprob = np.array([0.6, 0.3, 0.1, 0.0])
# The transition matrix, note that there are no transitions possible
# between component 1 and 3
transmat = np.array([[0.7, 0.2, 0.0, 0.1],
                     [0.3, 0.5, 0.2, 0.0],
                     [0.0, 0.3, 0.5, 0.2],
                     [0.2, 0.0, 0.2, 0.6]])
# The means of each component
means = np.array([[0.0, 0.0],
                  [0.0, 11.0],
                  [9.0, 10.0],
                  [11.0, -1.0]])
# The covariance of each component
covars = .5 * np.tile(np.identity(2), (4, 1, 1))

# Build an HMM instance and set parameters
gen_model = hmm.GaussianHMM(n_components=4, covariance_type="full")

# Instead of fitting it from the data, we directly set the estimated
# parameters, the means and covariance of the components
gen_model.startprob_ = startprob
gen_model.transmat_ = transmat
gen_model.means_ = means
gen_model.covars_ = covars

# Generate samples
X, Z = gen_model.sample(500)

# Plot the sampled data
fig, ax = plt.subplots()
ax.plot(X[:, 0], X[:, 1], ".-", label="observations", ms=6,
        mfc="orange", alpha=0.7)

# Indicate the component numbers
for i, m in enumerate(means):
    ax.text(m[0], m[1], 'Component %i' % (i + 1),
            size=17, horizontalalignment='center',
            bbox=dict(alpha=.7, facecolor='w'))
ax.legend(loc='best')
fig.show()


scores = list()
models = list()
for n_components in (3, 4, 5):
    # define our hidden Markov model
    model = hmm.GaussianHMM(n_components=n_components,
                            covariance_type='full', n_iter=10)
    model.fit(X[:X.shape[0] // 2])  # 50/50 train/validate
    models.append(model)
    scores.append(model.score(X[X.shape[0] // 2:]))
    print(f'Converged: {model.monitor_.converged}'
          f'\tScore: {scores[-1]}')

# get the best model
model = models[np.argmax(scores)]
n_states = model.n_components
print(f'The best model had a score of {max(scores)} and {n_states} '
      'states')

# use the Viterbi algorithm to predict the most likely sequence of states
# given the model
states = model.predict(X)
Converged: True	Score: -1065.5259488089373
Converged: True	Score: -904.2908933008515
Converged: True	Score: -905.5449538166446
The best model had a score of -904.2908933008515 and 4 states
# compare our recovered states with the generated states and the transition matrix to evaluate the model
# plot model states over time
fig, ax = plt.subplots()
ax.plot(Z, states)
ax.set_title('States compared to generated')
ax.set_xlabel('Generated State')
ax.set_ylabel('Recovered State')
fig.show()

# plot the transition matrix
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 5))
ax1.imshow(gen_model.transmat_, aspect='auto', cmap='spring')
ax1.set_title('Generated Transition Matrix')
ax2.imshow(model.transmat_, aspect='auto', cmap='spring')
ax2.set_title('Recovered Transition Matrix')
for ax in (ax1, ax2):
    ax.set_xlabel('State To')
    ax.set_ylabel('State From')

fig.tight_layout()
fig.show()



Visualization report

This chapter mainly explains the related visualization part of machine learning, which is implemented using Scikit-Plot. It mainly includes the following parts:

  • estimators: used to draw various algorithms
  • metrics: used to draw machine learning curves such as the confusion matrix, ROC AUC curves and precision-recall curves
  • cluster: mainly used to draw clusters
  • decomposition: mainly used to draw PCA dimensionality reduction

First load the required modules:

# load the required modules
import scikitplot as skplt

import sklearn
from sklearn.datasets import load_digits, load_boston, load_breast_cancer
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

import sys

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)

If the skplt library is not installed, you can directly:

pip install scikit-plot

Load dataset

Handwriting dataset
digits = load_digits()
X_digits, Y_digits = digits.data, digits.target

print("Digits Dataset Size : ", X_digits.shape, Y_digits.shape)

X_digits_train, X_digits_test, Y_digits_train, Y_digits_test = train_test_split(X_digits, Y_digits,
                                                                                train_size=0.8,
                                                                                stratify=Y_digits,
                                                                                random_state=1)

print("Digits Train/Test Sizes : ",X_digits_train.shape, X_digits_test.shape, Y_digits_train.shape, Y_digits_test.shape)
Digits Dataset Size :  (1797, 64) (1797,)
Digits Train/Test Sizes :  (1437, 64) (360, 64) (1437,) (360,)
Tumor dataset
cancer = load_breast_cancer()
X_cancer, Y_cancer = cancer.data, cancer.target

print("Feautre Names : ", cancer.feature_names)
print("Cancer Dataset Size : ", X_cancer.shape, Y_cancer.shape)
X_cancer_train, X_cancer_test, Y_cancer_train, Y_cancer_test = train_test_split(X_cancer, Y_cancer,
                                                                                train_size=0.8,
                                                                                stratify=Y_cancer,
                                                                                random_state=1)

print("Cancer Train/Test Sizes : ",X_cancer_train.shape, X_cancer_test.shape, Y_cancer_train.shape, Y_cancer_test.shape)
Feature Names :  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Cancer Dataset Size :  (569, 30) (569,)
Cancer Train/Test Sizes :  (455, 30) (114, 30) (455,) (114,)
Boston house price data set
boston = load_boston()
X_boston, Y_boston = boston.data, boston.target

print("Boston Dataset Size : ", X_boston.shape, Y_boston.shape)

print("Boston Dataset Features : ", boston.feature_names)
X_boston_train, X_boston_test, Y_boston_train, Y_boston_test = train_test_split(X_boston, Y_boston,
                                                                                train_size=0.8,
                                                                                random_state=1)

print("Boston Train/Test Sizes : ",X_boston_train.shape, X_boston_test.shape, Y_boston_train.shape, Y_boston_test.shape)
Boston Dataset Size :  (506, 13) (506,)
Boston Dataset Features :  ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Boston Train/Test Sizes :  (404, 13) (102, 13) (404,) (102,)

Performance visualization

cross validation plot

We plot the cross-validation learning curve for logistic regression:

skplt.estimators.plot_learning_curve(LogisticRegression(), X_digits, Y_digits,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Digits Classification Learning Curve")
plt.show()

skplt.estimators.plot_learning_curve(LinearRegression(), X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston Regression Learning Curve ");
plt.show()                                


It should be noted that because the evaluation metric on the second data set uses r2, its score is slightly different from the first one.

Feature importance plot

Good features have the following characteristics:

  • Discriminative and not redundant with other features
  • features are independent of each other
  • Simple and easy to understand

Therefore, a feature importance plot lets us see at a glance which features the model considers the most important.

rf_reg = RandomForestRegressor()  # random forest
rf_reg.fit(X_boston_train, Y_boston_train)
print(rf_reg.score(X_boston_test, Y_boston_test))
gb_classif = GradientBoostingClassifier()  # gradient boosting
gb_classif.fit(X_cancer_train, Y_cancer_train)
print(gb_classif.score(X_cancer_test, Y_cancer_test))
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)  # two subplots; ax1 draws on the first one
skplt.estimators.plot_feature_importances(rf_reg, feature_names=boston.feature_names, 
                                         title = "Random Forest Regressor Feature Importance",
                                         x_tick_rotation = 90, order="ascending", ax=ax1)
# x_tick_rotation rotates the x-axis labels by 90 degrees
ax2 = fig.add_subplot(122)
skplt.estimators.plot_feature_importances(gb_classif, feature_names=cancer.feature_names,
                                         title="Gradient Boosting Classifier Feature Importance",
                                         x_tick_rotation=90,
                                         ax=ax2);

plt.tight_layout()  # automatically adjust subplot parameters to fill the figure area
plt.show()


Machine learning metrics

confusion matrix

For binary classification, the confusion matrix simply tabulates predictions against true labels: true positives, false positives, false negatives and true negatives.

We often use it to compute precision and recall, and from them the F1 score. For multi-class problems the square matrix simply grows in dimension.
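A toy illustration (made-up labels) of computing those metrics with sklearn:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75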

log_reg = LogisticRegression()
log_reg.fit(X_digits_train, Y_digits_train)
log_reg.score(X_digits_test, Y_digits_test)
Y_test_pred = log_reg.predict(X_digits_test)

fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(1,2,1)
skplt.metrics.plot_confusion_matrix(Y_digits_test, Y_test_pred, title="Confusion Matrix", cmap="Oranges", ax=ax1)
ax2 = fig.add_subplot(1,2,2)
skplt.metrics.plot_confusion_matrix(Y_digits_test, Y_test_pred,
                                    normalize=True,  # normalize the counts to proportions
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    ax=ax2);
plt.show()


The second plot adds normalize=True, which rescales each entry to a proportion between 0 and 1.

ROC, AUC curve

To understand the ROC curve, we start from the confusion matrix, where:

  • TP: predicted 1, actually 1 (true positives)
  • FP: predicted 1, actually 0 (false positives)
  • TN: predicted 0, actually 0 (true negatives)
  • FN: predicted 0, actually 1 (false negatives)

The total number of real positives in the sample is TP + FN, so the proportion of positives that are correctly predicted as positive is
$$TPR=\frac{TP}{TP+FN}$$
Similarly, the real negatives number FP + TN, so the proportion of negatives that are wrongly predicted as positive is
$$FPR=\frac{FP}{TN+FP}$$
Another concept is the cut-off point t, which means that when the model's predicted probability of a sample is greater than t, it is classified as a positive class, otherwise it is classified as a negative class.

The ROC curve is then the two-dimensional curve traced out by (FPR, TPR) as the cut-off point t takes different values on the dataset.

AUC is the area under the ROC curve.
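A small hand-rolled sketch (toy numbers) of one point on the ROC curve, i.e. TPR and FPR at a single cut-off:

import numpy as np

y_true  = np.array([1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.6, 0.4, 0.2, 0.3, 0.7])  # predicted probability of class 1
t = 0.5                                              # cut-off point
y_pred = (y_score >= t).astype(int)

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
TN = np.sum((y_pred == 0) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))
print(TP / (TP + FN), FP / (FP + TN))  # TPR = 0.667, FPR = 0.333 at this cut-off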

Y_test_probs = log_reg.predict_proba(X_digits_test)
skplt.metrics.plot_roc_curve(Y_digits_test, Y_test_probs, title="Digits ROC Curve", figsize=(12,6))
plt.show()


PR curve

The PR curve is drawn in the same way as the ROC curve. The two selected indicators are precision and recall:
$$\text{precision}=\frac{TP}{TP+FP}, \qquad \text{recall}=\frac{TP}{TP+FN}$$
Again, different cut-off points are chosen to trace out the curve.

skplt.metrics.plot_precision_recall_curve(Y_digits_test, Y_test_probs, title="Digits Precision-Recall Curve", figsize=(12,6))
plt.show()


Silhouette analysis

Silhouette analysis is, simply put, a way to judge the quality of a clustering.
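As a numeric counterpart to the plot below, sklearn also exposes the mean silhouette coefficient directly; a toy sketch:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_toy = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
toy_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_toy)
print(silhouette_score(X_toy, toy_labels))  # close to 1 for well-separated clusters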

kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_digits_train, Y_digits_train)
cluster_labels = kmeans.predict(X_digits_test)
skplt.metrics.plot_silhouette(X_digits_test, cluster_labels,figsize=(8,6))
plt.show()


reliability curve

This tests how reliable (well calibrated) a model's predicted probabilities are.

lr_probas = LogisticRegression().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
rf_probas = RandomForestClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
gb_probas = GradientBoostingClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
et_scores = ExtraTreesClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)

probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'Extra Trees Classifier']
skplt.metrics.plot_calibration_curve(Y_cancer_test,
                                     probas_list,
                                     clf_names, n_bins=15,
                                     figsize=(12,6)
                                     )
plt.show()


KS test

The KS test is used to test whether two samples follow the same distribution.
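The same idea in plain SciPy, on made-up samples (unrelated to the classifier below):

import numpy as np
from scipy.stats import ks_2samp

a = np.random.randn(200)         # sample from N(0, 1)
b = np.random.randn(200) + 0.5   # sample from a shifted distribution
print(ks_2samp(a, b))            # a small p-value suggests the two distributions differ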

rf = RandomForestClassifier()
rf.fit(X_cancer_train, Y_cancer_train)
Y_cancer_probas = rf.predict_proba(X_cancer_test)

skplt.metrics.plot_ks_statistic(Y_cancer_test, Y_cancer_probas, figsize=(10,6))
plt.show()


Cumulative gains curve
skplt.metrics.plot_cumulative_gain(Y_cancer_test, Y_cancer_probas, figsize=(10,6))
plt.show()


Lift curve
skplt.metrics.plot_lift_curve(Y_cancer_test, Y_cancer_probas, figsize=(10,6))
plt.show()


clustering method

elbow method

Used to choose how many clusters to use for clustering.

skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
                               X_digits,
                               cluster_ranges=range(2, 20),
                               figsize=(8,6))
plt.show()


Dimensionality reduction method

PCA

You can view the proportion of variance accounted for by the first n principal components of PCA:

pca = PCA(random_state=1)
pca.fit(X_digits)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6))
plt.show()


2-D Projection

2D projection:

skplt.decomposition.plot_pca_2d_projection(pca, X_digits, Y_digits,
                                           figsize=(10,10),
                                           cmap="tab10")
plt.show()

