The k-nearest neighbors (KNN) algorithm in sklearn
Using the nearest-neighbor algorithm to detect abnormal operations
Source: the Masquerading User Data page at http://www.schonlau.net/. The data set contains operation logs for 50 users. Each log holds 15,000 commands: the first 5,000 are normal operations, and the following 10,000 contain abnormal operations. For details, see the book "Web Security Machine Learning Getting Started".
Basics
I originally meant to write this part myself, but there is already so much material online that there is no need. Just read what the experts have already written:
Once you have read those two, you should have basically no problems.
Programming
Step 1: Pick any one user's log, in which each line is one command. Group every 150 commands into one operation sequence and save the sequences in a list.
def load_user(filename):
    most_cmd = []
    mini_cmd = []
    cmd_list = []
    cmd_seq = []
    # Read the operation sequences
    with open(filename, 'r', encoding="utf-8") as f:
        cmd = f.readline()
        temp = []
        cnt = 0
        while cmd:
            cmd_list.append(cmd.strip('\n'))
            temp.append(cmd.strip('\n'))
            cnt = cnt + 1
            # Unlike the book, I use 150 commands per sequence, which lines up
            # exactly with the labels, since there are only 100 label values
            if cnt == 150:
                cmd_seq.append(temp)
                cnt = 0
                temp = []
            cmd = f.readline()
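The grouping in Step 1 can also be sketched with list slicing. This is only an illustration: `split_into_sequences` and the toy command list are made-up names and data, not part of the original code.

```python
def split_into_sequences(commands, size=150):
    """Split a flat list of commands into consecutive chunks of `size`.

    Any incomplete chunk at the end is dropped, which matches the
    behaviour of the counter-based loop in load_user.
    """
    full = len(commands) - len(commands) % size
    return [commands[i:i + size] for i in range(0, full, size)]


# Toy data standing in for a real log: 15,000 fake commands.
toy_commands = ["cmd%d" % (i % 7) for i in range(15000)]
toy_sequences = split_into_sequences(toy_commands)
print(len(toy_sequences), len(toy_sequences[0]))
```

With 15,000 commands and 150 commands per sequence, this yields 100 sequences, one per label.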
Step 2: Count every command in the user's log, then take the 50 most frequent commands and the 50 least frequent commands.
    # Get the 50 most frequent and the 50 least frequent commands
    fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)  # sort by frequency, descending
    most_cmd = [item[0] for item in fdist[:50]]
    mini_cmd = [item[0] for item in fdist[-50:]]
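For readers without NLTK installed, the same ranking can be sketched with the standard library's `collections.Counter` in place of `FreqDist`. The helper name `most_and_least_frequent` and the toy commands are hypothetical and only for illustration.

```python
from collections import Counter


def most_and_least_frequent(commands, n=50):
    """Return (n most frequent, n least frequent) commands.

    Counter.most_common() sorts by count in descending order, so the
    head of the list holds the frequent commands and the tail the rare ones.
    """
    ranked = [cmd for cmd, _ in Counter(commands).most_common()]
    return ranked[:n], ranked[-n:]


# Toy log: "ls" appears 5 times, "cd" 3 times, "vi" twice, "rm" once.
toy = ["ls"] * 5 + ["cd"] * 3 + ["vi"] * 2 + ["rm"]
top, bottom = most_and_least_frequent(toy, n=2)
print(top, bottom)
```

Note that when a log has fewer than 2n distinct commands, the two lists overlap; the same is true of the slicing used above.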
Step 3: Feature extraction. Taking each operation sequence from Step 1 as one unit, compute ① the number of distinct commands, ② the 10 most frequent commands, and ③ the 10 least frequent commands.
    user_feature = []
    for cmd_list in user_cmd_list:
        # Number of distinct commands in this sequence
        seq_len = len(set(cmd_list))
        # Rank the commands in this sequence by frequency, highest first
        fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)
        seq_freq = [item[0] for item in fdist]
        # Take the 10 most frequent and the 10 least frequent commands
        f2 = seq_freq[:10]
        f3 = seq_freq[-10:]
Step 4: KNN only accepts numeric input. Features ② and ③ from Step 3 are lists of strings, so they have to be turned into scalars. The conversion: count how many of them also appear among the user's 50 most frequent commands and 50 least frequent commands, respectively; that is, compute the degree of overlap.
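        # Compute the overlap
        f2 = len(set(f2) & set(user_max_freq))
        f3 = len(set(f3) & set(user_min_freq))
        # Combine the features: ① number of distinct commands in the sequence;
        # ② overlap between the sequence's 10 most frequent commands and the user's 50 most frequent commands;
        # ③ overlap between the sequence's 10 least frequent commands and the user's 50 least frequent commands
        user_feature.append([seq_len, f2, f3])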
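To make Steps 3 and 4 concrete, here is a small worked example on made-up data. The function `sequence_features`, its parameter `k`, and the toy command lists are assumptions for illustration; it uses `collections.Counter` rather than `FreqDist` so that it runs without NLTK.

```python
from collections import Counter


def sequence_features(seq, user_top, user_bottom, k=10):
    """Turn one command sequence into three numbers:
    [distinct commands, overlap with user_top, overlap with user_bottom].
    """
    ranked = [cmd for cmd, _ in Counter(seq).most_common()]
    return [
        len(set(seq)),
        len(set(ranked[:k]) & set(user_top)),      # feature 2
        len(set(ranked[-k:]) & set(user_bottom)),  # feature 3
    ]


# Toy sequence ranked by frequency: ls (3), cd (2), vi (1).
toy_seq = ["ls", "ls", "ls", "cd", "cd", "vi"]
features = sequence_features(
    toy_seq, user_top=["ls", "cd"], user_bottom=["rm", "vi"], k=2
)
print(features)  # [3, 2, 1]
```

Here the sequence has 3 distinct commands, both of its top-2 commands are in the user's top list, and only "vi" from its bottom-2 commands is in the user's bottom list.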
The complete Python 3 code is as follows:
from nltk.probability import FreqDist  # counts how often each command occurs
import operator
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
def load_user(filename):
    most_cmd = []
    mini_cmd = []
    cmd_list = []
    cmd_seq = []
    # Read the operation sequences
    with open(filename, 'r', encoding="utf-8") as f:
        cmd = f.readline()
        temp = []
        cnt = 0
        while cmd:
            cmd_list.append(cmd.strip('\n'))
            temp.append(cmd.strip('\n'))
            cnt = cnt + 1
            # Unlike the book, I use 150 commands per sequence, which lines up
            # exactly with the labels, since there are only 100 label values
            if cnt == 150:
                cmd_seq.append(temp)
                cnt = 0
                temp = []
            cmd = f.readline()
    # Get the 50 most frequent and the 50 least frequent commands
    fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)  # sort by frequency, descending
    most_cmd = [item[0] for item in fdist[:50]]
    mini_cmd = [item[0] for item in fdist[-50:]]
    return cmd_seq, most_cmd, mini_cmd
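def get_user_feature(user_cmd_list, user_max_freq, user_min_freq):
    user_feature = []
    for cmd_list in user_cmd_list:
        # Number of distinct commands in this sequence
        seq_len = len(set(cmd_list))
        # Rank the commands in this sequence by frequency, highest first
        fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)
        seq_freq = [item[0] for item in fdist]
        # Take the 10 most frequent and the 10 least frequent commands
        f2 = seq_freq[:10]
        f3 = seq_freq[-10:]
        # Compute the overlap
        f2 = len(set(f2) & set(user_max_freq))
        f3 = len(set(f3) & set(user_min_freq))
        # Combine the features: ① number of distinct commands in the sequence;
        # ② overlap between the sequence's 10 most frequent commands and the user's 50 most frequent commands;
        # ③ overlap between the sequence's 10 least frequent commands and the user's 50 least frequent commands
        user_feature.append([seq_len, f2, f3])
    return user_feature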
def get_labels(filename):  # read the labels in the third column
    labels = []
    with open(filename, 'r', encoding="utf-8") as f:
        temp = f.readline().strip('\n')
        while temp:
            # Character index 4 is the third column, assuming single-digit
            # values separated by single spaces
            labels.append(int(temp[4]))
            temp = f.readline().strip('\n')
    return labels
if __name__ == "__main__":
    user_cmd_list, user_max_freq, user_min_freq = load_user('user.txt')
    user_feature = get_user_feature(user_cmd_list, user_max_freq, user_min_freq)
    labels = get_labels('labels.txt')
    # Split the data set into a training set and a test set
    x_train = user_feature[0:70]
    y_train = labels[0:70]
    x_test = user_feature[70:]
    y_test = labels[70:]
    # Train
    neight = KNeighborsClassifier(n_neighbors=3)
    neight.fit(x_train, y_train)
    # Predict
    y_predict = neight.predict(x_test)
    # Compute the score (percentage of test labels predicted correctly)
    score = np.mean(np.array(y_test) == y_predict) * 100
    print(score)  # 90.0
The final accuracy is 90%.
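As a rough sketch of what the classifier and the score are doing, here is a pure-Python 3-nearest-neighbor vote and accuracy calculation on made-up feature vectors. It is not sklearn's implementation (which, among other things, may break ties differently); `knn_predict`, `accuracy_percent`, and all the toy numbers are hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance; it ranks neighbours the same way
    as the true Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(x_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest
    training points."""
    order = sorted(range(len(x_train)),
                   key=lambda i: squared_distance(x_train[i], x))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy_percent(y_true, y_pred):
    """Percentage of predictions that equal the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * matches / len(y_true)


# Toy features: [distinct commands, overlap with top 50, overlap with bottom 50].
toy_x_train = [[20, 9, 8], [22, 10, 7], [21, 9, 9], [40, 3, 1], [38, 2, 2]]
toy_y_train = [0, 0, 0, 1, 1]  # assumed convention: 0 = normal, 1 = abnormal
toy_x_test = [[21, 10, 8], [39, 3, 1]]
toy_y_test = [0, 1]

toy_pred = [knn_predict(toy_x_train, toy_y_train, x) for x in toy_x_test]
toy_score = accuracy_percent(toy_y_test, toy_pred)
print(toy_pred, toy_score)
```

With only 30 test sequences in the real run, each misclassified sequence moves the score by about 3.3 percentage points, so the 90% figure should be read as a rough estimate.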
Summary
① To get the 50 most frequent and the 50 least frequent commands, I did not follow the book's method; I changed the code here. ② The labels and the data do not line up: there are 15,000 commands in total but only 100 labels. The book's approach is to use 100 commands per operation sequence, which gives 150 sequences, and then to add 50 extra labels at the front of the label list. My code uses 150 commands per sequence, so the counts match exactly.
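The arithmetic behind point ② can be checked in a few lines. The padding value 0 is an assumption (0 meaning "normal", since the first 5,000 commands are normal), and the placeholder labels are made up.

```python
TOTAL_COMMANDS = 15000
NUM_LABELS = 100

# Book's split: 100 commands per sequence.
book_sequences = TOTAL_COMMANDS // 100          # 150 sequences
book_padding = book_sequences - NUM_LABELS      # 50 labels are missing

# This post's split: 150 commands per sequence.
my_sequences = TOTAL_COMMANDS // 150            # 100 sequences, no padding needed

# Sketch of the padding step, with placeholder labels standing in for the file.
placeholder_labels = [0] * NUM_LABELS
padded_labels = [0] * book_padding + placeholder_labels  # assumed: 0 = normal

print(book_sequences, book_padding, my_sequences, len(padded_labels))
```

Either way the number of sequences ends up equal to the number of labels, which is what `fit` requires.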
Finally, a plug for my personal blog: https://unihac.github.io/