NSL-KDD Multi-class Classification (PyTorch)

Dataset download: NSL-KDD dataset; official site: dataset description page

Why column 42 (the last column) is removed:

Column 42 comes from an additional experiment by the dataset's creators, who used 21 learners to classify every sample. For each sample, the value in column 42 is the number of those 21 learners that classified it correctly; it indicates how hard the sample is to classify. It carries no meaning for this experiment, so it is dropped. The original wording from the dataset's official site follows:

In order to perform our experiments, we randomly created three smaller subsets of the KDD train set each of which included fifty thousand records of information. Each of the learners where trained over the created train sets. We then employed the 21 learned machines (7 learners, each trained 3 times) to label the records of the entire KDD train and test sets, which provides us with 21 predicated labels for each record. Further, we annotated each record of the data set with a #successfulPrediction value, which was initialized to zero. Now, since the KDD data set provides the correct label for each record, we compared the predicated label of each record given by a specific learner with the actual label, where we incremented #successfulPrediction by one if a match was found. Through this process, we calculated the number of learners that were able to correctly label that given record. The highest value for #successfulPrediction is 21, which conveys the fact that all learners were able to correctly predict the label of that record.
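As a quick illustration, here is a minimal sketch (assuming the same KDDTrain+.txt path used in the code below) that inspects the distribution of column 42 before it is dropped:

import pandas as pd

# The raw file has no header row, so pass header=None
df = pd.read_csv('./data/NSL_KDD_Dataset/KDDTrain+.txt', header=None)
# Column 42 is #successfulPrediction (0..21): how many of the 21 learners
# classified each record correctly
print(df[42].value_counts().sort_index())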

import torch
from torch import nn
from torch.nn import init
import numpy as np
import pandas as pd
import torch.utils.data as Data
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
import matplotlib.pyplot as plt


# Use the GPU if one is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
batch_size = 256

# Read the NSL-KDD data; the .txt files have no header row, so pass header=None
# (otherwise the first record would be silently consumed as column names)
df1 = pd.read_csv('./data/NSL_KDD_Dataset/KDDTrain+.txt', header=None)
df2 = pd.read_csv('./data/NSL_KDD_Dataset/KDDTest+.txt', header=None)
# Name the columns 0..42
df1.columns = range(43)
df2.columns = range(43)

# Drop test-set samples whose attack type never appears in the training set
s1 = set(np.array(df1[41]).tolist())
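# Optional sanity check: these attack types occur only in the test set
# and are removed by the filter on the next line
print(set(df2[41]) - s1)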
df2 = df2[df2[41].isin(s1)]
df = pd.concat([df1,df2])
# Column 42 is not useful (see above); drop it
del df[42]
# Separate features and labels
labels = df.iloc[:, 41]
data = df.drop(columns= [41])
# Encode the string labels as integer class indices
le = LabelEncoder()
labels = le.fit_transform(labels).astype(np.int64)
print(le.classes_)  # alphabetically sorted class names; indices match this order


# Encode the three categorical feature columns (protocol_type, service, flag)
data[1] = le.fit_transform(data[1])
data[2] = le.fit_transform(data[2])
data[3] = le.fit_transform(data[3])

# Convert features and labels to numpy arrays
data = np.array(data)
labels = np.array(labels)

# Scale features to [0, 1]
min_max_scaler = MinMaxScaler()
data = min_max_scaler.fit_transform(data)

# Convert to torch tensors
labels = torch.from_numpy(labels)
data = torch.from_numpy(data).float()

# Split back into train and test at the concatenation boundary
n_train = len(df1)
x_train, x_test, y_train, y_test = data[:n_train], data[n_train:], labels[:n_train], labels[n_train:]

# Wrap the tensors in TensorDatasets (with torchvision-style attributes attached)
train_dataset = Data.TensorDataset(x_train, y_train)
train_dataset.data = train_dataset.tensors[0]
train_dataset.targets = train_dataset.tensors[1]

test_dataset = Data.TensorDataset(x_test, y_test)
test_dataset.data = test_dataset.tensors[0]
test_dataset.targets = test_dataset.tensors[1]

# Class names in label-encoder order (le.classes_ is sorted alphabetically)
class_names = ['back', 'buffer_overflow', 'ftp_write', 'guess_passwd', 'imap', 'ipsweep',
               'land', 'loadmodule', 'multihop', 'neptune', 'nmap', 'normal', 'perl', 'phf',
               'pod', 'portsweep', 'rootkit', 'satan', 'smurf', 'spy', 'teardrop',
               'warezclient', 'warezmaster']

train_dataset.classes = class_names
test_dataset.classes = class_names

class_to_idx = {label: i for i, label in enumerate(class_names)}
train_dataset.class_to_idx = class_to_idx
test_dataset.class_to_idx = class_to_idx


train_iter = Data.DataLoader(train_dataset, batch_size, shuffle=True)
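# Optional sanity check: a batch should have shape [batch_size, 41] / [batch_size]
xb, yb = next(iter(train_iter))
print(xb.shape, yb.shape)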


# Multilayer perceptron (one hidden layer)
# Define the model: 41 input features, 50 hidden units, 23 classes
num_inputs, num_hiddens, num_outputs = 41, 50, 23
net = nn.Sequential(
    nn.Linear(num_inputs, num_hiddens),
    nn.ReLU(),
    nn.Linear(num_hiddens, num_outputs)
).to(device)

# Initialize parameters with small Gaussian noise
for params in net.parameters():
    init.normal_(params, mean=0, std=0.01)

# Loss: cross-entropy (applies log-softmax internally, so the model outputs raw logits)
loss = torch.nn.CrossEntropyLoss()
# Optimizer: Adam with L2 weight decay
optimizer = torch.optim.Adam(net.parameters(), lr=0.001, weight_decay=1e-4)


num_epochs = 25
list_acc = []

# Train the model
for epoch in range(1, num_epochs+1):
    
    train_l_sum, train_acc_sum, n =0.0, 0.0, 0
    test_acc_sum = 0.0
    
    for data, label in train_iter:
        # Move the batch to the GPU if available
        data = data.to(device)
        label = label.to(device)
        
        output = net(data)
        
        # CrossEntropyLoss already averages over the batch
        l = loss(output, label)
        
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
        
        # Weight by batch size so train_l_sum / n is the per-sample loss
        train_l_sum += l.item() * label.shape[0]
        train_acc_sum += (output.argmax(dim=1) == label).sum().item()
        n += label.shape[0]
        
    
    # Evaluate on the full test set (no gradients needed)
    with torch.no_grad():
        x_test = x_test.to(device)
        y_test = y_test.to(device)
        
        output = net(x_test)
        
        test_acc_sum = (output.argmax(dim=1) == y_test).sum().item()
              
    print('epoch %d, train loss %.6f,  train acc %.3f, test acc %.3f'
          % (epoch , train_l_sum/n, train_acc_sum/n, test_acc_sum /y_test.shape[0]))
    list_acc.append(test_acc_sum/y_test.shape[0])

# Plot test accuracy over epochs
plt.plot(range(1, len(list_acc) + 1), list_acc)
plt.xlabel('epoch')
plt.ylabel('test accuracy')
plt.show()
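Because this is a 23-class problem with heavy class imbalance, overall accuracy hides per-class behavior. Below is a minimal sketch of a per-class breakdown, not part of the original post, reusing net, x_test, y_test, and class_names from above:

from sklearn.metrics import classification_report

# Predict on the full test set
with torch.no_grad():
    preds = net(x_test.to(device)).argmax(dim=1).cpu().numpy()

# labels=range(23) keeps every class in the report even if it is never
# predicted; zero_division=0 silences the resulting warnings
print(classification_report(y_test.cpu().numpy(), preds,
                            labels=range(len(class_names)),
                            target_names=class_names, zero_division=0))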

Experimental results: (the training log and test-accuracy curve produced by the code above)

Source: blog.csdn.net/stay_zezo/article/details/112798297