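1. Introduction

This project analyzes the Dry Bean dataset in detail to uncover the information it contains and the relationships among its features. The dataset covers 7 varieties of Turkish dry beans, and each sample is described by 16 features that capture the size, shape, and form of the bean. Exploratory analysis of these features shows how the varieties differ and which factors are most likely to drive classification.

The analysis uses Python with several data analysis libraries, including pandas, seaborn, and matplotlib. We first take an overall look at the dataset, covering the class distribution and descriptive statistics of the features. We then visualize the features with distribution histograms and a correlation matrix, which makes the relationships between features and the shape of the data easier to see.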

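After the initial exploratory analysis of the Dry Bean dataset, we apply machine learning to classify the beans. Machine learning is a branch of artificial intelligence in which a model learns from data and uses what it has learned to make predictions or decisions. The goal of this project is to use machine learning models to identify and classify the dry bean varieties from their features.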

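First, we split the dataset into a training set and a test set so that model performance can be evaluated after training. We then preprocess the data: scaling feature values, handling missing data, and encoding the class labels. These steps help improve the accuracy and stability of the models.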
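The split-then-scale order described above can be sketched in plain Python. This is a minimal illustration, not the code used in this article: the helper names (`holdout_split`, `fit_standardizer`, `standardize`) and the single-feature data are hypothetical. The point it shows is that the scaling statistics are learned from the training part only and then reused on the test part.

```python
import random
import statistics

def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training values only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def standardize(values, mean, std):
    """Apply previously learned statistics to any set of values."""
    return [(v - mean) / std for v in values]

# Hypothetical single-feature data (not taken from the Dry Bean file).
areas = [float(20000 + 1500 * i) for i in range(20)]
train, test = holdout_split(areas)
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuses the training statistics
print(len(train), len(test))
```

In scikit-learn the same pattern is `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`, which is what the code in the later sections does.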

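Next, we try several machine learning algorithms, such as support vector machines (SVM), random forests (RF), k-nearest neighbors (KNN), and gradient boosting (GB), to find the model best suited to the dry bean classification problem. To compare the models, we use cross-validation and compute accuracy, precision, recall, and F1 score.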
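As a rough sketch of what cross-validation does, the generator below splits sample indices into k folds and uses each fold once for validation. The function name `k_fold_indices` is hypothetical, and the folds here are contiguous for readability; on real data one would normally shuffle first and keep class proportions similar in every fold (scikit-learn provides `StratifiedKFold` for this).

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

Each sample ends up in exactly one validation fold, so averaging the k validation scores uses all of the data for evaluation without ever scoring a model on samples it was trained on.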

2. Data Exploration 1

import pandas as pd

# Read the Excel file
data = pd.read_excel('Dataset.xlsx')

# Compute basic descriptive statistics
stats = data.describe()

# Print the statistics
print(stats)

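The output of this code shows that the dataset contains 13611 observations, each described by 16 features. The basic statistics give the mean, standard deviation, minimum, maximum, and quartiles of every feature.

A brief summary of the statistics for each feature:

Area: the mean area is 53048.28 with a standard deviation of 29324.10. Values range from 20420 to 254616, a large spread.

Perimeter: the mean perimeter is 855.28 with a standard deviation of 214.29. Values range from 524.74 to 1985.37, again a large spread.

MajorAxisLength and MinorAxisLength: the major and minor axis lengths have means of 320.14 and 202.27 and standard deviations of 85.69 and 44.97, respectively. Together these two features describe the size and elongation of the bean.

AspectRation: the aspect ratio has a mean of 1.58 and a standard deviation of 0.25, ranging from 1.02 to 2.43, so elongation varies considerably.

Eccentricity: the eccentricity has a mean of 0.75 and a standard deviation of 0.09, ranging from 0.22 to 0.91, so how elliptical the shapes are varies widely.

ConvexArea: the mean convex hull area is 53768.20 with a standard deviation of 29774.92. Values range from 20684 to 263261, a large spread.

EquivDiameter: the equivalent diameter has a mean of 253.06 and a standard deviation of 59.18, ranging from 161.24 to 569.37, so bean size varies considerably.

Extent: the extent has a mean of 0.75 and a standard deviation of 0.05, ranging from 0.56 to 0.87, a moderate spread.

Solidity: the solidity has a mean of 0.99 and a standard deviation of 0.00466, ranging from 0.92 to 0.99, so the beans are all close to fully solid.

roundness: the roundness has a mean of 0.87 and a standard deviation of 0.06, ranging from 0.49 to 0.99, a moderate spread.

Compactness: the compactness has a mean of 0.80 and a standard deviation of 0.06, ranging from 0.64 to 0.99, a moderate spread.

ShapeFactor1, ShapeFactor2, ShapeFactor3, and ShapeFactor4: the (mean, standard deviation) pairs of these four shape factors are (0.00656, 0.00113), (0.00172, 0.00060), (0.6436, 0.0990), and (0.9951, 0.00437). Each shape factor describes a different aspect of the shape.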
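To make the shape descriptors more concrete, the sketch below derives several of them for an idealised ellipse from its two axis lengths. The formulas are assumptions, since this article does not define the features: they are the usual textbook definitions (equivalent diameter as the diameter of a circle with the same area, compactness as equivalent diameter divided by major axis length). The function name `ellipse_shape_features` is hypothetical, and real beans are not perfect ellipses, so the values only approximate the dataset means.

```python
import math

def ellipse_shape_features(major_axis, minor_axis):
    """Shape descriptors of an ideal ellipse with the given full axis lengths."""
    a, b = major_axis / 2.0, minor_axis / 2.0      # semi-axes
    area = math.pi * a * b
    equiv_diameter = math.sqrt(4.0 * area / math.pi)
    return {
        "area": area,
        "aspect_ratio": major_axis / minor_axis,
        "eccentricity": math.sqrt(1.0 - (b / a) ** 2),
        "equiv_diameter": equiv_diameter,
        "compactness": equiv_diameter / major_axis,
    }

# Mean MajorAxisLength and MinorAxisLength reported in the statistics above.
features = ellipse_shape_features(320.14, 202.27)
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the aspect ratio (about 1.58), eccentricity (about 0.78), and compactness (about 0.79) land close to the reported means, which also hints at why several of these features carry overlapping information.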

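These statistics give a rough picture of how the features are distributed. Note that some features are likely to be correlated (for example, Area and Perimeter), which needs to be kept in mind during later analysis and modeling. A correlation matrix or a scatter-plot matrix can be used to explore these relationships further.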
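A small worked example shows why Area and Perimeter tend to move together. The sketch computes the Pearson correlation by hand for hypothetical circles of growing radius; the data is illustrative and not taken from the dataset, and the helper name `pearson` is an assumption. With pandas, `data.corr(numeric_only=True)` would give the full matrix for the real features.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical circles: area grows with the square of the radius,
# perimeter grows linearly with it.
radii = [5 + i for i in range(10)]
areas = [math.pi * r * r for r in radii]
perimeters = [2 * math.pi * r for r in radii]

r_value = pearson(areas, perimeters)
print(f"correlation(area, perimeter) = {r_value:.3f}")
```

Even though the relationship is quadratic rather than linear, the coefficient stays close to 1, so two such features add little independent information to a distance-based model.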

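In addition, preprocessing the data before modeling (for example, handling missing values, handling outliers, and scaling features) is a key step that helps improve model performance and stability.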
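The outlier step used later in this article is the common 1.5 × IQR rule. A minimal standard-library sketch of that rule on one feature follows; the helper names and the sample values are hypothetical. The `inclusive` quantile method is chosen because it interpolates linearly between data points, as the default of pandas `quantile` does.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=1.5):
    """Keep only the values that lie inside the fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

sample = [10, 11, 12, 12, 13, 13, 14, 15, 100]
print(iqr_bounds(sample))
cleaned = drop_outliers(sample)
print(cleaned)
```

One design caveat: the rule is applied per feature over all classes at once, so a class whose members are all unusually large can be removed entirely, which is worth checking by comparing class counts before and after filtering.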

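Data Exploration 2

PCA (principal component analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding) are two widely used methods for dimensionality reduction and visualization. Both project high-dimensional data into a low-dimensional space (usually two or three dimensions) so that the structure of the data is easier to see. This is useful in data exploration, feature selection, and model evaluation.

PCA Visualization

PCA is a linear dimensionality reduction method that applies an orthogonal transformation to project the data into a new feature space. Each axis of the new space (a principal component) is the direction of greatest remaining variance in the data, orthogonal to the components before it. PCA visualization has the following advantages:

Explained variance: PCA keeps the directions of largest variance, so as much information as possible is retained in the lower-dimensional space.
Computational efficiency: compared with many other dimensionality reduction methods, PCA is cheap to compute and therefore performs well on large datasets.
Feature selection: PCA helps identify the main directions of variation in the data, which provides useful information for feature selection.
However, PCA only captures linear structure. For data with nonlinear structure, it may fail to reveal the more complex relationships.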
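For two features, the idea of explained variance can be worked out by hand, because the eigenvalues of a 2×2 covariance matrix have a closed form. The sketch below is only an illustration under that restriction: the function name and the five data points are hypothetical, and a real analysis of the 16 features would use a library routine such as scikit-learn's `PCA`.

```python
import math

def pca_2d_explained_variance(points):
    """Explained-variance ratios of the two principal components of 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2.0
    radius = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + radius, half_trace - radius
    total = lam1 + lam2
    return lam1 / total, lam2 / total

# Hypothetical points lying almost on the line y = 2x.
points = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.8), (4.0, 8.1)]
first, second = pca_2d_explained_variance(points)
print(f"PC1: {first:.4f}, PC2: {second:.4f}")
```

Because the points are nearly collinear, almost all of the variance falls on the first component, which is the situation in which projecting onto one axis loses very little information.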

t-SNE Visualization

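t-SNE is a nonlinear dimensionality reduction method based on probability distributions. It reduces dimensionality by keeping the similarities between samples in the low-dimensional space as close as possible to those in the high-dimensional space. t-SNE visualization has the following advantages:

Nonlinearity: t-SNE can capture nonlinear structure, so it may have the edge when the data is structurally complex.
Cluster identification: t-SNE reveals cluster structure with good resolution, which helps identify distinct subsets of the data.
However, t-SNE is computationally expensive and can be slow on large datasets. Its results also depend on hyperparameters (such as the perplexity and the number of iterations), so several attempts are often needed to find suitable settings.

Overall, both PCA and t-SNE visualizations help in understanding the structure of the data. PCA suits data with linear structure and is computationally efficient, while t-SNE handles nonlinear structure and reveals clusters. In practice, the method can be chosen according to the characteristics of the data and the needs of the task.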

3. Choosing Preprocessing for KNN

Without Pre-Processing

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_excel('Dataset.xlsx')
print(data['Class'].value_counts())

# Define the features and the target variable
X = data.drop('Class', axis=1)
y = data['Class']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Results without preprocessing:

DERMASON    3546
SIRA        2636
SEKER       2027
HOROZ       1928
CALI        1630
BARBUNYA    1322
BOMBAY       522
Name: Class, dtype: int64
Confusion Matrix:
[[118   0 104   0  30   0   9]
 [  0 117   0   0   0   0   0]
 [100   0 202   0  14   0   1]
 [  0   0   0 597   1  31  42]
 [ 42   0  24  14 268   1  59]
 [  2   0   0  84   8 257  62]
 [  0   0   0  69  45  22 400]]

Classification Report:
              precision    recall  f1-score   support

    BARBUNYA       0.45      0.45      0.45       261
      BOMBAY       1.00      1.00      1.00       117
        CALI       0.61      0.64      0.62       317
    DERMASON       0.78      0.89      0.83       671
       HOROZ       0.73      0.66      0.69       408
       SEKER       0.83      0.62      0.71       413
        SIRA       0.70      0.75      0.72       536

    accuracy                           0.72      2723
   macro avg       0.73      0.71      0.72      2723
weighted avg       0.72      0.72      0.72      2723

With Pre-Processing

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_excel('Dataset.xlsx')
print(data['Class'].value_counts())

# 1. Handle missing values
data = data.dropna()

# 2. Remove outliers with the 1.5 * IQR rule (numeric columns only,
#    so the string column 'Class' is not passed to quantile())
numeric = data.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
data = data[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]

# Define the features and the target variable
X = data.drop('Class', axis=1)
y = data['Class']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Feature scaling: standardize with StandardScaler first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Feature scaling: then normalize with MinMaxScaler
min_max_scaler = MinMaxScaler()
X_train_normalized = min_max_scaler.fit_transform(X_train_scaled)
X_test_normalized = min_max_scaler.transform(X_test_scaled)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_normalized, y_train)

y_pred = knn.predict(X_test_normalized)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Results with preprocessing:

DERMASON    3546
SIRA        2636
SEKER       2027
HOROZ       1928
CALI        1630
BARBUNYA    1322
BOMBAY       522
Name: Class, dtype: int64
Confusion Matrix:
[[288  26   0   2   1  10]
 [ 11 339   0   5   1   3]
 [  0   0 939   3   9  48]
 [  0   9   3 307   0  11]
 [  3   0  17   0 337   8]
 [  3   2  93   6  12 683]]

Classification Report:
              precision    recall  f1-score   support

    BARBUNYA       0.94      0.88      0.91       327
        CALI       0.90      0.94      0.92       359
    DERMASON       0.89      0.94      0.92       999
       HOROZ       0.95      0.93      0.94       330
       SEKER       0.94      0.92      0.93       365
        SIRA       0.90      0.85      0.87       799

    accuracy                           0.91      3179
   macro avg       0.92      0.91      0.92      3179
weighted avg       0.91      0.91      0.91      3179

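Here we compare two ways of classifying with KNN: one without data preprocessing and one with it. The headline numbers from the confusion matrices and classification reports are:

KNN without preprocessing:
Accuracy: 0.72
Weighted-average F1 score: 0.72
KNN with preprocessing:
Accuracy: 0.91
Weighted-average F1 score: 0.91

The KNN classifier with preprocessing scores markedly higher on both accuracy and F1. The two runs are not strictly like for like, however: the second report contains six classes instead of seven because the outlier filter removed every BOMBAY sample, and the test fraction changed from 0.2 to 0.3. A closer look at each step:

Missing values:
The preprocessing pipeline first drops rows with missing values. When missing values are present, they can lead to a poor fit on the training data and to inaccurate predictions, so handling them helps model performance.
Outliers:
Outlier handling can also improve performance. Outliers may pull the model toward unusual training samples, so that it overfits the training data and generalizes worse to the test data. In this run the IQR rule also removed the whole BOMBAY class, so part of the change reflects a different label set.
Feature scaling:
The pipeline standardizes the features with StandardScaler and then normalizes them with MinMaxScaler. Because min-max scaling is unaffected by an earlier linear rescaling, the result is the same as applying MinMaxScaler alone. Scaling matters especially for KNN, which relies on a distance metric: features on different scales distort the distance computation and degrade performance.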
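The effect of scale on a distance-based method can be seen with a few made-up beans. In the sketch below the four (Area, roundness) pairs are hypothetical and the helper names are assumptions. In raw units the nearest neighbour of the first bean is decided by Area alone; after min-max scaling, roundness counts as well and the nearest neighbour changes. The last lines check that standardizing before min-max scaling gives the same values as min-max scaling alone.

```python
import math
import statistics

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean, std = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical beans: Area in pixels, roundness between 0 and 1.
areas = [30000.0, 30500.0, 36000.0, 90000.0]
roundness = [0.95, 0.60, 0.94, 0.80]

raw = list(zip(areas, roundness))
scaled = list(zip(min_max(areas), min_max(roundness)))

# Nearest neighbour of bean 0 among beans 1..3, before and after scaling.
raw_nearest = min((1, 2, 3), key=lambda j: euclidean(raw[0], raw[j]))
scaled_nearest = min((1, 2, 3), key=lambda j: euclidean(scaled[0], scaled[j]))
print(raw_nearest, scaled_nearest)

# Min-max scaling after standardization equals min-max scaling alone.
double_scaled = min_max(standardize(areas))
single_scaled = min_max(areas)
print(all(abs(a - b) < 1e-9 for a, b in zip(double_scaled, single_scaled)))
```

Bean 1 differs strongly in roundness but barely in Area, so it wins in raw units; bean 2 is similar on both features and wins once the features are on a comparable scale.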

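In summary, data preprocessing is critical to the performance of the KNN classifier. In practice, the preprocessing steps deserve careful thought so that the data is used well and the classifier performs better.

Per-class results

After preprocessing, performance improves for every class that remains. Below is a comparison of precision, recall, and F1 score for each class without and with preprocessing (BOMBAY is omitted because it no longer appears after outlier removal):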

BARBUNYA:

Without preprocessing:
Precision: 0.45
Recall: 0.45
F1 score: 0.45

With preprocessing:
Precision: 0.94
Recall: 0.88
F1 score: 0.91

Improvement: precision up 0.49, recall up 0.43, F1 score up 0.46.

CALI:

Without preprocessing:
Precision: 0.61
Recall: 0.64
F1 score: 0.62

With preprocessing:
Precision: 0.90
Recall: 0.94
F1 score: 0.92

Improvement: precision up 0.29, recall up 0.30, F1 score up 0.30.

DERMASON:

Without preprocessing:
Precision: 0.78
Recall: 0.89
F1 score: 0.83

With preprocessing:
Precision: 0.89
Recall: 0.94
F1 score: 0.92

Improvement: precision up 0.11, recall up 0.05, F1 score up 0.09.

HOROZ:

Without preprocessing:
Precision: 0.73
Recall: 0.66
F1 score: 0.69

With preprocessing:
Precision: 0.95
Recall: 0.93
F1 score: 0.94

Improvement: precision up 0.22, recall up 0.27, F1 score up 0.25.

SEKER:

Without preprocessing:
Precision: 0.83
Recall: 0.62
F1 score: 0.71

With preprocessing:
Precision: 0.94
Recall: 0.92
F1 score: 0.93

Improvement: precision up 0.11, recall up 0.30, F1 score up 0.22.

SIRA:

Without preprocessing:
Precision: 0.70
Recall: 0.75
F1 score: 0.72

With preprocessing:
Precision: 0.90
Recall: 0.85
F1 score: 0.87

Improvement: precision up 0.20, recall up 0.10, F1 score up 0.15.

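The per-class comparison shows that preprocessing improves the metrics of every remaining class, most of them substantially. This again confirms the important role of preprocessing in the performance of the KNN classifier.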
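The per-class numbers in a classification report follow directly from the confusion matrix. The sketch below recomputes them from the matrix printed above for KNN with preprocessing (rows are true classes, columns are predicted classes, as in scikit-learn's `confusion_matrix`). The helper name `per_class_metrics` is hypothetical; the class order is taken from the report.

```python
def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple for every class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_as_k = sum(matrix[i][k] for i in range(n))  # column sum
        actually_k = sum(matrix[k])                           # row sum
        precision = true_positive / predicted_as_k
        recall = true_positive / actually_k
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results

labels = ["BARBUNYA", "CALI", "DERMASON", "HOROZ", "SEKER", "SIRA"]
knn_with_preprocessing = [
    [288, 26, 0, 2, 1, 10],
    [11, 339, 0, 5, 1, 3],
    [0, 0, 939, 3, 9, 48],
    [0, 9, 3, 307, 0, 11],
    [3, 0, 17, 0, 337, 8],
    [3, 2, 93, 6, 12, 683],
]

metrics = per_class_metrics(knn_with_preprocessing)
for label, (precision, recall, f1) in zip(labels, metrics):
    print(f"{label:<10} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

total = sum(sum(row) for row in knn_with_preprocessing)
accuracy = sum(knn_with_preprocessing[k][k] for k in range(len(labels))) / total
print(f"accuracy={accuracy:.2f}")
```

Precision divides by the column sum (everything predicted as the class) and recall by the row sum (everything that truly belongs to it), which is why the two can move in different directions for the same class.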

4. Adding Dimensionality Reduction to the Preprocessing (LDA in Place of PCA)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

data = pd.read_excel('Dataset.xlsx')
print(data['Class'].value_counts())

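# 1. Handle missing values
data = data.dropna()

# 2. Remove outliers with the 1.5 * IQR rule (numeric columns only,
#    so the string column 'Class' is not passed to quantile())
numeric = data.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
data = data[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]

# Feature selection: keep a representative subset of the features
selected_features = ['Area', 'AspectRation', 'Compactness', 'roundness', 'Extent']
X = data[selected_features]
y = data['Class']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Dimensionality reduction with LDA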
lda = LDA()
X_train_lda = lda.fit_transform(X_train_scaled, y_train)
X_test_lda = lda.transform(X_test_scaled)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_lda, y_train)

y_pred = knn.predict(X_test_lda)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Results

DERMASON    3546
SIRA        2636
SEKER       2027
HOROZ       1928
CALI        1630
BARBUNYA    1322
BOMBAY       522
Name: Class, dtype: int64
Confusion Matrix:
[[294  22   1   4   0   6]
 [ 14 337   0   4   1   3]
 [  0   0 916   3  23  57]
 [  1   8   4 307   0  10]
 [  3   0  14   0 337  11]
 [  3   0  89   7  17 683]]

Classification Report:
              precision    recall  f1-score   support

    BARBUNYA       0.93      0.90      0.92       327
        CALI       0.92      0.94      0.93       359
    DERMASON       0.89      0.92      0.91       999
       HOROZ       0.94      0.93      0.94       330
       SEKER       0.89      0.92      0.91       365
        SIRA       0.89      0.85      0.87       799

    accuracy                           0.90      3179
   macro avg       0.91      0.91      0.91      3179
weighted avg       0.90      0.90      0.90      3179

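These two results let us compare classification with the dimensionality reduction step against the Section 3 pipeline without it. Note that the code above selects five features and applies LDA (linear discriminant analysis), not PCA.

With dimensionality reduction:

Accuracy: 0.90
Weighted-average F1 score: 0.90
Without dimensionality reduction:

Accuracy: 0.91
Weighted-average F1 score: 0.91
The pipeline without dimensionality reduction scores slightly higher on both accuracy and F1. For this particular problem, skipping the step may therefore give better classification performance, although a gap of 0.01 on a single split is small.

Possible reasons:

Data structure: PCA and LDA are both linear projections. If the data contains nonlinear relationships, a linear projection may not represent them well, and working in the original feature space may preserve more of the complex structure.

Information loss: reducing the dimensionality discards information, which can hurt classification. Here only 5 of the 16 features are kept, and with 5 input features and 6 classes LDA returns at most 5 components, so the loss comes mainly from the feature selection. Keeping all features retains more of the original information.

Model flexibility: a linear projection may make the model better suited to linearly structured data but less flexible on nonlinear data, whereas the model without the projection may adapt better to nonlinear structure.

In the end, we kept the preprocessing strategy from Section 3, without the dimensionality reduction step.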

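5. Ensemble Methods

Ensemble methods combine several base models (base learners) to improve predictive accuracy. In classification tasks they offer the following advantages:

Higher predictive accuracy: by combining the predictions of several base learners, an ensemble can reduce variance (as in bagging) or bias (as in boosting). This improves accuracy, especially on high-dimensional and complex data.

Less overfitting: because an ensemble combines several base learners, the risk that any single model overfits is reduced. This makes ensembles, averaging-based ones in particular, more stable on noisy data and outliers.

Better generalization: an ensemble can exploit the different strengths of its base learners, so the overall model generalizes better to new, unseen data.

Handling imbalanced data: with imbalanced classes, ensemble methods such as random forests and gradient-boosted trees can be combined with oversampling of minority classes, undersampling of majority classes, or class weights to improve classification performance.

Easy parallelization: many ensemble methods (such as random forests and bagging) can train their base learners in parallel, which greatly speeds up training.

For this classification task, choosing ensemble methods lets us benefit from these advantages.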
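A small calculation illustrates why combining models can help. The sketch assumes a deliberately simplified setting that is not the one in this article: a binary task, an odd number of models, each correct with the same probability, and fully independent errors. Under those assumptions the accuracy of a majority vote is a binomial tail probability. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb

def majority_vote_accuracy(p, n_models):
    """Probability that more than half of n independent models are correct."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p ** k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

for n in (1, 3, 11):
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```

With three models that are each right 70% of the time, the vote is right about 78% of the time, and the figure keeps rising with more models. Real base learners are trained on the same data and make correlated errors, so the gain in practice is smaller than this idealised calculation suggests.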

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb

data = pd.read_excel('Dataset.xlsx')
print(data['Class'].value_counts())

# 1. Handle missing values
data = data.dropna()

# 2. Remove outliers with the 1.5 * IQR rule (numeric columns only,
#    so the string column 'Class' is not passed to quantile())
numeric = data.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
data = data[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]

# Define the features and the target variable
X = data.drop('Class', axis=1)
y = data['Class']

# Encode the target variable as integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Feature scaling: standardize with StandardScaler first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Feature scaling: then normalize with MinMaxScaler
min_max_scaler = MinMaxScaler()
X_train_normalized = min_max_scaler.fit_transform(X_train_scaled)
X_test_normalized = min_max_scaler.transform(X_test_scaled)

# List of classifiers to compare
classifiers = [
    ('GBDT', GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)),
    ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=0)),
    ('XGBoost', xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=0)),
    ('LightGBM', lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=0))
]

# Train and evaluate each classifier
for name, clf in classifiers:
    clf.fit(X_train_normalized, y_train)
    y_pred = clf.predict(X_test_normalized)

    print(f"\n{name} - Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print(f"\n{name} - Classification Report:")
    print(classification_report(y_test, y_pred))

DERMASON    3546
SIRA        2636
SEKER       2027
HOROZ       1928
CALI        1630
BARBUNYA    1322
BOMBAY       522
Name: Class, dtype: int64

GBDT - Confusion Matrix:
[[297  17   0   3   2   8]
 [ 10 339   0   6   1   3]
 [  0   0 940   2  10  47]
 [  0   8   3 305   0  14]
 [  1   0  15   0 339  10]
 [  3   0  91   7  11 687]]

GBDT - Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.91      0.93       327
           1       0.93      0.94      0.94       359
           2       0.90      0.94      0.92       999
           3       0.94      0.92      0.93       330
           4       0.93      0.93      0.93       365
           5       0.89      0.86      0.88       799

    accuracy                           0.91      3179
   macro avg       0.93      0.92      0.92      3179
weighted avg       0.91      0.91      0.91      3179


Random Forest - Confusion Matrix:
[[297  17   0   3   2   8]
 [ 14 337   0   4   1   3]
 [  0   0 939   1  10  49]
 [  0   7   3 304   0  16]
 [  2   0   9   0 345   9]
 [  3   0  91   3  16 686]]

Random Forest - Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.91      0.92       327
           1       0.93      0.94      0.94       359
           2       0.90      0.94      0.92       999
           3       0.97      0.92      0.94       330
           4       0.92      0.95      0.93       365
           5       0.89      0.86      0.87       799

    accuracy                           0.91      3179
   macro avg       0.93      0.92      0.92      3179
weighted avg       0.92      0.91      0.91      3179


XGBoost - Confusion Matrix:
[[301  14   0   1   2   9]
 [  8 343   0   5   1   2]
 [  0   0 942   2   9  46]
 [  0   7   3 308   0  12]
 [  1   0  15   0 341   8]
 [  3   0  90   3  13 690]]

XGBoost - Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.92      0.94       327
           1       0.94      0.96      0.95       359
           2       0.90      0.94      0.92       999
           3       0.97      0.93      0.95       330
           4       0.93      0.93      0.93       365
           5       0.90      0.86      0.88       799

    accuracy                           0.92      3179
   macro avg       0.93      0.93      0.93      3179
weighted avg       0.92      0.92      0.92      3179


LightGBM - Confusion Matrix:
[[305  13   0   1   2   6]
 [  9 342   0   4   1   3]
 [  0   0 934   2  10  53]
 [  0  10   4 304   0  12]
 [  2   0  19   0 334  10]
 [  2   1  91   4  13 688]]

LightGBM - Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.93      0.95       327
           1       0.93      0.95      0.94       359
           2       0.89      0.93      0.91       999
           3       0.97      0.92      0.94       330
           4       0.93      0.92      0.92       365
           5       0.89      0.86      0.88       799

    accuracy                           0.91      3179
   macro avg       0.93      0.92      0.92      3179
weighted avg       0.91      0.91      0.91      3179

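The classification reports above show how each classifier performs on precision, recall, and F1 score. The integer labels 0 to 5 produced by LabelEncoder correspond to BARBUNYA, CALI, DERMASON, HOROZ, SEKER, and SIRA, in that order. A brief comparison:

GBDT (Gradient Boosting Decision Tree):

Accuracy: 0.91
Macro-average F1 score: 0.92
Weighted-average F1 score: 0.91
Good overall, particularly on classes 0, 1, 3, and 4.
Random Forest:

Accuracy: 0.91
Macro-average F1 score: 0.92
Weighted-average F1 score: 0.91
Random forest performs much like GBDT, with slightly higher recall on class 4 (0.95 vs. 0.93) and a slightly higher F1 score on class 3 (0.94 vs. 0.93).
XGBoost:

Accuracy: 0.92
Macro-average F1 score: 0.93
Weighted-average F1 score: 0.92
On this problem XGBoost performs slightly better than GBDT and random forest. It does well on all classes, especially classes 1 and 3.
LightGBM:

Accuracy: 0.91
Macro-average F1 score: 0.92
Weighted-average F1 score: 0.91
LightGBM performs much like GBDT and random forest, with a slightly lower F1 score on class 2.
Overall, XGBoost performs slightly better than the other three algorithms, though the margin is about 0.01 on a single train/test split. In practice, the choice of algorithm depends on the requirements of the task. For example, if training time and model size are tightly constrained, LightGBM may be preferable; if parallel computing resources need to be fully used, XGBoost may be the better choice.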
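To judge how much weight such small gaps deserve, one rough check is to put a binomial standard error next to each accuracy. The sketch below uses the numbers of correctly classified test samples, obtained by summing the diagonals of the four confusion matrices printed above, out of 3179 test samples. The helper name `accuracy_with_se` is hypothetical, and the standard error treats test samples as independent draws, which is only an approximation.

```python
import math

TEST_SIZE = 3179

# Diagonal sums of the confusion matrices printed above.
correct = {
    "GBDT": 2907,
    "Random Forest": 2908,
    "XGBoost": 2925,
    "LightGBM": 2907,
}

def accuracy_with_se(n_correct, n_total):
    """Accuracy and its binomial standard error."""
    acc = n_correct / n_total
    se = math.sqrt(acc * (1.0 - acc) / n_total)
    return acc, se

summary = {name: accuracy_with_se(c, TEST_SIZE) for name, c in correct.items()}
for name, (acc, se) in summary.items():
    print(f"{name:<14} accuracy={acc:.4f} +/- {se:.4f}")
```

XGBoost classifies 17 or 18 more test samples correctly than the other three models, a gap of roughly one standard error. That is consistent with a slight edge, but repeated splits or cross-validation would be needed before calling the difference reliable.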

6. MLP

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
import pandas as pd

class MLP(nn.Module):
    def __init__(self, input_size, num_classes):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes)
        )

    def forward(self, x):
        return self.layers(x)

data = pd.read_excel('Dataset.xlsx')
print(data['Class'].value_counts())

# 1. Handle missing values
data = data.dropna()

# 2. Remove outliers with the 1.5 * IQR rule (numeric columns only,
#    so the string column 'Class' is not passed to quantile())
numeric = data.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
data = data[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]

# Define the features and the target variable
X = data.drop('Class', axis=1)
y = data['Class']

# Encode the target variable as integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Feature scaling: standardize with StandardScaler first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Feature scaling: then normalize with MinMaxScaler
min_max_scaler = MinMaxScaler()
X_train_normalized = min_max_scaler.fit_transform(X_train_scaled)
X_test_normalized = min_max_scaler.transform(X_test_scaled)

# Create the PyTorch datasets and data loaders
X_train_tensor = torch.FloatTensor(X_train_normalized)
y_train_tensor = torch.LongTensor(y_train)
X_test_tensor = torch.FloatTensor(X_test_normalized)
y_test_tensor = torch.LongTensor(y_test)

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

input_size = X_train_normalized.shape[1]
num_classes = len(set(y))

# Create the model, loss function, and optimizer
model = MLP(input_size, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 100
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

# Evaluate the model on the test set
model.eval()
y_pred = []
y_true = []

with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        y_pred.extend(predicted.tolist())
        y_true.extend(targets.tolist())

# Print the results
print("MLP - Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))
print("MLP - Classification Report:")
print(classification_report(y_true, y_pred))

MLP - Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.90      0.93       327
           1       0.91      0.95      0.93       359
           2       0.92      0.92      0.92       999
           3       0.96      0.93      0.94       330
           4       0.92      0.95      0.94       365
           5       0.89      0.89      0.89       799

    accuracy                           0.92      3179
   macro avg       0.93      0.92      0.93      3179
weighted avg       0.92      0.92      0.92      3179

7. ResNet

ResNet-18 - Confusion Matrix:
[[344   0  29   1   2   9  10]
 [  0 161   0   0   0   0   0]
 [  9   0 455   0   7   5   3]
 [  0   0   0 944   0  31  68]
 [  1   0  11   7 533   0  36]
 [  1   0   1   6   0 600  11]
 [  0   0   0  56   4  28 711]]
ResNet-18 - Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.87      0.92       395
           1       1.00      1.00      1.00       161
           2       0.92      0.95      0.93       479
           3       0.93      0.91      0.92      1043
           4       0.98      0.91      0.94       588
           5       0.89      0.97      0.93       619
           6       0.85      0.89      0.87       799

    accuracy                           0.92      4084
   macro avg       0.93      0.93      0.93      4084
weighted avg       0.92      0.92      0.92      4084

Code

The code given for this section is the same script as the MLP code in Section 6 and does not define a ResNet. The ResNet-18 report above covers all seven classes and 4084 test samples, which suggests that it was produced without the outlier filter, so its scores are not directly comparable with the six-class results in Sections 3 to 6.
