Intrusion detection with the kNN algorithm on the KDD CUP 99 dataset

1. Background introduction

KDD is the abbreviation of Knowledge Discovery and Data Mining. KDD CUP is an annual competition organized by SIGKDD (the Special Interest Group on Knowledge Discovery and Data Mining) of the ACM (Association for Computing Machinery). The "KDD CUP 99 dataset" is the dataset used in the competition held in 1999.

In 1998, the Defense Advanced Research Projects Agency (DARPA) conducted an intrusion detection evaluation project at MIT Lincoln Laboratory. Lincoln Laboratory set up a network environment simulating a U.S. Air Force LAN and collected TCPdump network connection and system audit data for 9 weeks, simulating various user types, kinds of network traffic, and attack methods so that it resembled a real network environment. The raw data collected by these TCPdumps is divided into two parts: the 7 weeks of training data contain approximately 5,000,000 network connection records, and the remaining 2 weeks of test data contain approximately 2,000,000 network connection records.

A network connection is defined as a sequence of TCP packets from start to finish within a certain period of time, during which data travels from a source IP address to a destination IP address under a predefined protocol (such as TCP or UDP). Each network connection is labeled as normal or abnormal (attack), and the anomalies are subdivided into 4 categories (DoS, R2L, U2R, and Probing) covering 39 attack types in total, of which 22 attack types appear in the training set and another 17 unknown attack types appear only in the test set.
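To make the format concrete: in the raw dataset, each connection record is one line of 41 comma-separated features (duration, protocol_type, service, flag, src_bytes, dst_bytes, and so on) followed by the label. The record below is abridged and its values are illustrative, not copied from the actual dataset:

0,tcp,http,SF,181,5450,0,0, ... ,1.00,0.00,0.11,normal.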

The main labels of the kddcup99 data are shown in the following figure:

[Figure: the main labels of the kddcup99 records]

These labels appear in the last column of each record as the markers that distinguish normal access from attacks.

Next, let's talk about the kNN algorithm.

2. Principle of KNN algorithm

1. Core idea: The core idea of the kNN algorithm is that if most of the k nearest neighbors of a sample in the feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in that category. In making the classification decision, this method determines the category of the sample to be classified only according to the categories of the nearest one or few samples.

2. Algorithm introduction:

The simplest and most basic classifier just records the categories of all the training data; a test object can then be classified whenever its attributes exactly match the attributes of some training object. But how could every test object find a training object that exactly matches it? And there is a second problem: a test object may match multiple training objects at the same time, so that it would be assigned to multiple classes. KNN was created precisely to address these problems.

KNN classifies by measuring the distance between different feature vectors. The idea is: if most of the k most similar samples in the feature space (that is, the nearest neighbors in the feature space) belong to a certain category, then the sample also belongs to that category, where k is usually an integer not greater than 20. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. In the classification decision, the method determines the category of the sample to be classified only according to the categories of the nearest one or few samples.

Let's illustrate with a simple example. As shown in the figure below, which class should the green circle be assigned to: the red triangles or the blue squares? If K = 3, then since red triangles make up 2/3 of the neighbors, the green circle is assigned to the red triangle class; if K = 5, then since blue squares make up 3/5 of the neighbors, the green circle is assigned to the blue square class.


This also shows that the result of the KNN algorithm depends to a large extent on the choice of K.

In KNN, the distance between objects is used as a dissimilarity measure between them, which avoids the object-matching problem. The distance is generally the Euclidean distance or the Manhattan distance: the Euclidean distance is the straight-line distance between points in the coordinate system, and the same formula extends naturally as the number of dimensions increases.
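To make the two distance measures concrete, here is a minimal numpy sketch (the vectors are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean distance: sqrt(sum_i (x_i - y_i)^2)
euclidean = np.sqrt(((x - y) ** 2).sum())

# Manhattan distance: sum_i |x_i - y_i|
manhattan = np.abs(x - y).sum()

print(euclidean)  # 3.6055... = sqrt(13)
print(manhattan)  # 5.0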

To summarize the idea of the KNN algorithm: given a training set with known data and labels, for each input test sample, compare its features with the corresponding features in the training set and find the K most similar samples in the training set; the category of the test sample is then the category that occurs most often among those K samples. The algorithm can be described as follows:

1) Calculate the distance between the test data and each training sample;

2) Sort by increasing distance;

3) Select the K points with the smallest distances;

4) Count the frequency of each category among these K points;

5) Return the most frequent category among these K points as the predicted classification of the test data.
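The steps above translate almost directly into code. Below is a minimal, self-contained sketch of such a classifier (plain numpy, with hypothetical toy data; an illustration only, not the implementation used in the next section):

import numpy as np
from collections import Counter

def knn_classify(test_vec, train_mat, train_labels, k):
    # 1) distance between the test sample and every training sample
    distances = np.sqrt(((train_mat - test_vec) ** 2).sum(axis=1))
    # 2) + 3) indices of the k nearest training samples
    nearest = distances.argsort()[:k]
    # 4) count category frequencies among the k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    # 5) return the most frequent category
    return votes.most_common(1)[0][0]

# Toy data: two 2-D clusters labeled 1 (normal) and 2 (abnormal)
train_mat = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_labels = [1, 1, 2, 2]
print(knn_classify(np.array([0.9, 0.9]), train_mat, train_labels, k=3))  # -> 1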


3. Code implementation

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: LiRuikun
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt




def classify(input_vct, data_set):
    data_set_size = data_set.shape[0]
    # Tile input_vct to the same shape as data_set and subtract
    diff_mat = np.tile(input_vct, (data_set_size, 1)) - data_set
    sq_diff_mat = diff_mat ** 2  # square every element of the matrix
    distance = sq_diff_mat.sum(axis=1) ** 0.5  # row-wise sum, then square root (Euclidean distance)
    k = int(distance.argmin())  # index of the nearest training sample (k = 1)
    # Look up the label of that sample in the training table.
    # (Reloading the CSV on every call is slow; kept to match the original logic.)
    my_matrix = np.loadtxt(open("training.csv", "r", encoding='utf-8'), delimiter=",", skiprows=0)
    label = my_matrix[k][-1]
    return label  # classification result: the label of the nearest neighbor
def file2mat(test_filename, para_num):
    """Load a CSV table into a matrix.

    test_filename is the table path; para_num is the number of columns
    stored in the matrix. Returns the feature matrix and the label of
    each row.
    """
    fr = open(test_filename)
    lines = fr.readlines()
    line_nums = len(lines)
    result_mat = np.zeros((line_nums, para_num))  # line_nums rows, para_num columns
    class_label = []
    for i in range(line_nums):
        line = lines[i].strip()
        item_mat = line.split(',')
        result_mat[i, :] = item_mat[0:para_num]
        class_label.append(item_mat[-1])  # last column of the table: normal = 1, abnormal = 2
    fr.close()
    return result_mat, class_label


def roc(data_set):
    normal = 0
    data_set_size = data_set.shape[1]
    roc_rate = np.zeros((2, 1000))  # one row per rate, one column per threshold step
    for i in range(data_set_size):
        if data_set[2][i] == 1:
            normal += 1
    abnormal = data_set_size - normal
    max_dis = data_set[1].max()
    for j in range(1000):
        threshold = max_dis / 1000 * j
        normal1 = 0
        abnormal1 = 0
        for k in range(data_set_size):
            if data_set[1][k] > threshold and data_set[2][k] == 1:
                normal1 += 1
            if data_set[1][k] > threshold and data_set[2][k] == 2:
                abnormal1 += 1
        roc_rate[0][j] = normal1 / normal      # normal points above the threshold / all normal points
        roc_rate[1][j] = abnormal1 / abnormal  # abnormal points above the threshold / all abnormal points
    return roc_rate


def test(training_filename, test_filename):
    training_mat, training_label = file2mat(training_filename, 32)
    test_mat, test_label = file2mat(test_filename, 32)
    test_size = test_mat.shape[0]
    error_count = 0.0
    for i in range(test_size):
        classifier_result = classify(test_mat[i], training_mat)
        print("the classifier came back with: %d, the real answer is: %d"
              % (int(classifier_result), int(test_label[i])))
        if int(classifier_result) != int(test_label[i]):
            error_count += 1.0
    print("Total errors: %d" % error_count)  # number of misclassified test samples
    print("The total accuracy rate is %f" % (1.0 - error_count / float(test_size)))


if __name__ == "__main__":
    test('training.csv', 'test.csv')
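The roc function above is never called in the main flow shown here; presumably it was used to draw a ROC-style curve. A minimal sketch of one possible use, assuming a hypothetical 3 x N array data_set whose row 1 holds the distances and row 2 the 1/2 labels:

# Hypothetical usage of roc(): plot the two rates against each other.
# data_set is assumed to be 3 x N: row 1 = distance, row 2 = label (1 normal, 2 abnormal).
rate = roc(data_set)
plt.plot(rate[0], rate[1])  # x: normal points above threshold, y: abnormal points above threshold
plt.xlabel("fraction of normal points above threshold")
plt.ylabel("fraction of abnormal points above threshold")
plt.show()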
4. Results display

The classification accuracy here is low because normal data accounts for 99% of the selected training set.

Since the vast majority of the training sample set is normal data, we choose k = 1; that is, the test point is assigned the class of the single training point closest to it, on the assumption that this has little effect on the result. This choice also saves a large number of calculations. But there is a disadvantage: the training set contains very few of the unknown intrusions while the test set contains many, so the misclassification rate may be high. For example, in this experiment, normal data accounted for 99% of the training set, while abnormal data accounted for about half of the test set, and most of that abnormal data never appeared in the training set. It is clear that the lazy (passive) learning mechanism of kNN requires a large amount of training data for support.
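As a quick sanity check on this imbalance, the label proportions can be counted with the file2mat helper from the code above (a sketch; file names as in the main program):

from collections import Counter

# Count how often each label occurs in the training and test tables.
_, train_labels = file2mat('training.csv', 32)
_, test_labels = file2mat('test.csv', 32)
print(Counter(train_labels))  # expected to be dominated by the normal label
print(Counter(test_labels))   # abnormal records make up roughly half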

Adding the test set data to the training set and re-testing gives the following results:


Since the test data has now been included in the learning set, the accuracy rate improves considerably.

5. Summary

Advantages of KNN algorithm:

1. The idea is simple and the theory is mature; it can be used for both classification and regression;
2. It can be used for non-linear classification;
3. The training time complexity is O(n);
4. High accuracy, no assumptions about the data, and insensitive to outliers;

Disadvantages:
1. Large amount of computation (it becomes very slow if the dimensionality is high and weighted classification is used);
2. The sample imbalance problem (i.e., some categories have many samples while others have very few);
3. It requires a large amount of memory;



