Web Security and Machine Learning (KNN Part)

The k-nearest neighbors (KNN) algorithm in sklearn

Using the nearest neighbor algorithm to detect abnormal operations

Data source: the Masquerading User Data page on http://www.schonlau.net/. It contains operation logs for 50 users, and each log holds 15,000 commands. The first 5,000 commands are normal operations, and the remaining 10,000 contain some abnormal operations. For details, see the book "Web Security Machine Learning Getting Started".

Basics

I originally meant to write up the basics myself, but there is already plenty of material online, so there is no need. Reading what others have written is enough:

After reading those two, you should have basically no problems.
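
For readers who want the core idea in one place, here is a minimal sketch of how KNN classifies a point: measure the distance to every training sample, keep the k closest, and let them vote. The function name knn_predict and the toy data are made up for illustration; sklearn's KNeighborsClassifier, used later in this post, does the same job with many more options.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, point, k=3):
    """Classify `point` by majority vote among its k nearest training samples."""
    # Pair each training sample's Euclidean distance to `point` with its label.
    distances = [
        (math.dist(sample, point), label)
        for sample, label in zip(train_x, train_y)
    ]
    # Keep the k closest samples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among them wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy feature vectors (hypothetical values): label 0 = normal, 1 = abnormal.
train_x = [[10, 9, 8], [11, 10, 7], [12, 9, 8], [30, 2, 1], [28, 3, 0]]
train_y = [0, 0, 0, 1, 1]

print(knn_predict(train_x, train_y, [11, 9, 8]))  # close to the normal cluster
print(knn_predict(train_x, train_y, [29, 2, 1]))  # close to the abnormal cluster
```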

Implementation

Step 1: Pick any one user's log, in which each row is a single command. Group every 150 commands into one operation sequence and save the sequences in a list.

def load_user(filename):
    most_cmd = []
    mini_cmd = []
    cmd_list = []
    cmd_seq = []
    # Build the operation sequences
    with open(filename, 'r', encoding="utf-8") as f:
        cmd = f.readline()
        temp = []
        cnt = 0
        while cmd:
            cmd_list.append(cmd.strip('\n'))
            temp.append(cmd.strip('\n'))
            cnt = cnt + 1
            # Unlike the book, I use 150 commands per sequence so that the
            # sequences line up with the labels, which have only 100 values
            if cnt == 150:
                cmd_seq.append(temp)
                cnt = 0
                temp = []
            cmd = f.readline()
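
The chunking in Step 1 can also be written with list slicing. This is a minimal sketch, assuming the commands are already in a list; split_into_sequences is a hypothetical helper name, and like the loop above it drops any trailing commands that do not fill a whole sequence.

```python
def split_into_sequences(commands, size=150):
    """Split a flat list of commands into consecutive sequences of `size`."""
    full = len(commands) // size  # number of complete sequences
    return [commands[i * size:(i + 1) * size] for i in range(full)]


# Stand-in for one user's log: 15,000 placeholder commands.
commands = ["cmd%d" % (i % 40) for i in range(15000)]
sequences = split_into_sequences(commands)
print(len(sequences), len(sequences[0]))  # 100 150
```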

Step 2: Count every command in the user's log, then take the 50 most frequent commands and the 50 least frequent commands.

# Get the 50 most frequent and the 50 least frequent commands
fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)  # sort by frequency, highest first
most_cmd = [item[0] for item in fdist[:50]]
mini_cmd = [item[0] for item in fdist[-50:]]
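
The same ranking can be done with the standard library alone. This sketch uses collections.Counter in place of nltk's FreqDist, which is a substitution for illustration and not what the code in this post uses; top_and_bottom is a hypothetical helper name. Note that when a log has fewer than 2n distinct commands, the two returned lists overlap.

```python
from collections import Counter


def top_and_bottom(commands, n=50):
    """Return the n most frequent and n least frequent distinct commands."""
    # most_common() sorts by count, highest first.
    ranked = [cmd for cmd, _ in Counter(commands).most_common()]
    return ranked[:n], ranked[-n:]


commands = ["ls"] * 5 + ["cd"] * 3 + ["vi"] * 2 + ["rm"]
most, least = top_and_bottom(commands, n=2)
print(most)   # ['ls', 'cd']
print(least)  # ['vi', 'rm']
```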

Step 3: Feature extraction. Taking each operation sequence from Step 1 as a unit, compute: ① the number of distinct commands, ② the 10 most frequent commands, ③ the 10 least frequent commands.

user_feature = []
for cmd_list in user_cmd_list:
    # Number of distinct commands in this sequence
    seq_len = len(set(cmd_list))
    # Commands of this sequence ordered by frequency, highest first
    fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)
    seq_freq = [item[0] for item in fdist]

    # The 10 most frequent and the 10 least frequent commands
    f2 = seq_freq[:10]
    f3 = seq_freq[-10:]

Step 4: KNN only accepts numeric input. Features ② and ③ from Step 3 are lists of strings, so they need to be converted to scalars. The conversion: count how many of them also appear in the user's 50 most frequent commands and 50 least frequent commands, respectively.

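# Compute the overlap
f2 = len(set(f2) & set(user_max_freq))
f3 = len(set(f3) & set(user_min_freq))
# Combine the features: ① number of distinct commands in the sequence;
# ② overlap between the sequence's 10 most frequent commands and the user's 50 most frequent;
# ③ overlap between the sequence's 10 least frequent commands and the user's 50 least frequent
user_feature.append([seq_len, f2, f3])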
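
To make the three features concrete, here is a small worked example. The sequence and the user-level command lists are invented, and a top 3 stands in for the top 10 of the real code so the numbers are easy to check by hand; sequence_features is a hypothetical name, and Counter is used here instead of FreqDist.

```python
from collections import Counter


def sequence_features(seq, user_most, user_least, n=10):
    """Return [distinct count, overlap with user_most, overlap with user_least]."""
    ranked = [cmd for cmd, _ in Counter(seq).most_common()]
    distinct = len(ranked)
    top_overlap = len(set(ranked[:n]) & set(user_most))
    bottom_overlap = len(set(ranked[-n:]) & set(user_least))
    return [distinct, top_overlap, bottom_overlap]


seq = ["ls"] * 4 + ["cd"] * 3 + ["cat"] * 2 + ["gcc", "rm"]
user_most = ["ls", "cd", "vi"]      # hypothetical user-wide favorites
user_least = ["rm", "tar", "gcc"]   # hypothetical user-wide rarities

print(sequence_features(seq, user_most, user_least, n=3))  # [5, 2, 2]
```

The sequence has 5 distinct commands; its top 3 (ls, cd, cat) share 2 with user_most, and its bottom 3 (cat, gcc, rm) share 2 with user_least.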

The complete Python 3 code is as follows:

from nltk.probability import FreqDist  # counts how often each command occurs
import operator
from sklearn.neighbors import KNeighborsClassifier
import numpy as np


def load_user(filename):
    cmd_list = []
    cmd_seq = []
    # Build the operation sequences
    with open(filename, 'r', encoding="utf-8") as f:
        cmd = f.readline()
        temp = []
        cnt = 0
        while cmd:
            cmd_list.append(cmd.strip('\n'))
            temp.append(cmd.strip('\n'))
            cnt = cnt + 1
            # Unlike the book, I use 150 commands per sequence so that the
            # sequences line up with the labels, which have only 100 values
            if cnt == 150:
                cmd_seq.append(temp)
                cnt = 0
                temp = []
            cmd = f.readline()

    # Get the 50 most frequent and the 50 least frequent commands
    fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)  # sort by frequency, highest first
    most_cmd = [item[0] for item in fdist[:50]]
    mini_cmd = [item[0] for item in fdist[-50:]]
    return cmd_seq, most_cmd, mini_cmd


def get_user_feature(user_cmd_list, user_max_freq, user_min_freq):
    user_feature = []
    for cmd_list in user_cmd_list:
        # Number of distinct commands in this sequence
        seq_len = len(set(cmd_list))
        # Commands of this sequence ordered by frequency, highest first
        fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)
        seq_freq = [item[0] for item in fdist]

        # The 10 most frequent and the 10 least frequent commands
        f2 = seq_freq[:10]
        f3 = seq_freq[-10:]
        # Compute the overlap
        f2 = len(set(f2) & set(user_max_freq))
        f3 = len(set(f3) & set(user_min_freq))
        # Combine the features: ① number of distinct commands in the sequence;
        # ② overlap between the sequence's 10 most frequent commands and the user's 50 most frequent;
        # ③ overlap between the sequence's 10 least frequent commands and the user's 50 least frequent
        user_feature.append([seq_len, f2, f3])

    return user_feature


def get_labels(filename):  # read the labels in the third column
    labels = []
    with open(filename, 'r', encoding="utf-8") as f:
        temp = f.readline().strip('\n')
        while temp:
            # Character index 4 is the third column, assuming the columns
            # are single digits separated by single characters
            labels.append(int(temp[4]))
            temp = f.readline().strip('\n')
    return labels


if __name__ == "__main__":
    user_cmd_list, user_max_freq, user_min_freq = load_user('user.txt')
    user_feature = get_user_feature(user_cmd_list, user_max_freq, user_min_freq)
    labels = get_labels('labels.txt')

    # Split the data into a training set and a test set
    x_train = user_feature[0:70]
    y_train = labels[0:70]
    x_test = user_feature[70:]
    y_test = labels[70:]
    # Train
    neight = KNeighborsClassifier(n_neighbors=3)
    neight.fit(x_train, y_train)
    # Predict
    y_predict = neight.predict(x_test)
    # Compute the score as a percentage
    score = np.mean(np.array(y_test) == y_predict) * 100
    print(score)  # 90.0

In the end, the accuracy is 90%.
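
For reference, the score is simply the share of test sequences whose predicted label matches the true one. With 100 sequences and the first 70 used for training, 30 remain for testing, so 90% means 27 of 30 were classified correctly. A sketch in plain Python, with made-up label lists and a hypothetical function name:

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


# Made-up labels for 30 test sequences: 3 abnormal ones, all missed.
y_true = [0] * 27 + [1] * 3
y_pred = [0] * 30
print(accuracy_percent(y_true, y_pred))  # 90.0
```

In this made-up case a predictor that always answers "normal" also reaches 90%, so it may be worth checking how many abnormal sequences are actually caught.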

Summary

① The way I get the 50 most frequent and the 50 least frequent commands is not the book's way; I changed the code here. ② The labels and the data do not line up: there are 15,000 commands in total but only 100 labels. The book's approach is to treat every 100 commands as one operation sequence, which gives 150 sequences, and then to prepend 50 labels to the front of the label list. My code treats every 150 commands as one sequence, so the counts match exactly.
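
The two ways of lining up labels with sequences can be compared in a few lines. This sketch uses placeholder commands and labels; the assumption that the 50 prepended labels are 0 (normal) follows from the first 5,000 commands being normal operations.

```python
TOTAL_COMMANDS = 15000
commands = ["cmd"] * TOTAL_COMMANDS   # placeholder log
labels = [0] * 100                    # placeholder for the 100 labels in the file

# Book's approach as described above: 100 commands per sequence gives 150
# sequences, so 50 labels are prepended for the first 5,000 normal commands.
# (Assumption: the prepended labels are 0, meaning normal.)
book_sequences = [commands[i:i + 100] for i in range(0, TOTAL_COMMANDS, 100)]
book_labels = [0] * 50 + labels

# This post's approach: 150 commands per sequence gives 100 sequences,
# which matches the 100 labels one to one.
post_sequences = [commands[i:i + 150] for i in range(0, TOTAL_COMMANDS, 150)]

print(len(book_sequences), len(book_labels))  # 150 150
print(len(post_sequences), len(labels))       # 100 100
```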

Finally, a recommendation for my personal blog: https://unihac.github.io/

Origin blog.51cto.com/13155409/2464935