# URL anomaly detection

(Isolation Forest, unsupervised) The algorithm is an extension of the random forest idea.
iTree construction: randomly select an attribute, then randomly select a split value for that attribute and divide the samples into two halves; repeat this operation recursively.
Once the iTrees are built, they can be used for prediction: run each record down the iTrees and see which leaf node it falls into. The assumption that lets an iTree detect anomalies effectively is that outliers are rare, so they are isolated into leaf nodes very quickly; the path length h(x) from that leaf back to the root can therefore be used to decide whether a record x is an outlier.
(Figure: anomaly score formula, s(x, n) = 2^(-E[h(x)] / c(n)))

A score close to 1 indicates the point is very likely anomalous;
a score close to 0 indicates the point is very likely normal;
if most training samples have s(x, n) close to 0.5, the data contains no obvious anomalies.
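
A quick numeric illustration of the score (a sketch based on the formula from the original Isolation Forest paper, not sklearn's internals; c(n) is the average path length of an unsuccessful BST search over n samples):

import numpy as np

def c(n):
    # average path length of an unsuccessful search in a binary search tree with n nodes
    euler_gamma = 0.5772156649
    return 2.0 * (np.log(n - 1) + euler_gamma) - 2.0 * (n - 1) / n

def s(avg_path_length, n):
    # anomaly score s(x, n) = 2^(-E[h(x)] / c(n))
    return 2.0 ** (-avg_path_length / c(n))

n = 256                 # sub-sample size per tree
print(s(1.0, n))        # very short path -> score near 1 (likely anomaly)
print(s(c(n), n))       # average path    -> score exactly 0.5
print(s(20.0, n))       # long path       -> lower score (likely normal)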

iForest construction: each tree is built on a random sub-sample of the data, which guarantees that the trees differ from one another.
Differences from Random Forest (a short sklearn sketch follows this list):
1. Random Forest draws a bootstrap sample equal in size to the training set, while iForest samples randomly with a sub-sample size smaller than the training set. Since the goal is anomaly detection, a portion of the samples is usually enough to separate out the outliers.
2. iForest randomly selects the split feature and a random split threshold for that feature, whereas RF selects the split feature and threshold based on information gain or the information gain ratio.
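
A minimal sketch of both points with sklearn's IsolationForest (the toy data and parameter values here are illustrative, not from the original post): each tree is grown on a random sub-sample (max_samples), and every split uses a random feature and a random threshold.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = 0.3 * rng.randn(200, 2)                        # dense cluster of normal points
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))    # scattered outliers
X = np.vstack([X_normal, X_outliers])

clf = IsolationForest(n_estimators=100, max_samples=128, random_state=42)
clf.fit(X)
pred = clf.predict(X)                                     # +1 for normal, -1 for anomaly
print((pred[-10:] == -1).sum(), "of the 10 scattered points flagged as anomalies")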

URL anomaly detection (IForest)

Data source here.
Normal requests:

/103886/
/rcanimal/
/458010b88d9ce/
/cclogovs/
/using-localization/
/121006_dakotacwpressconf/

Malicious requests:

/top.php?stuff='uname >q36497765 #
/h21y8w52.nsf?
/ca000001.pl?action=showcart&hop=">&path=acatalog/
/scripts/edit_image.php?dn=1&userfile=/etc/passwd&userfile_name= ;id;

Here the text needs to be vectorized; we use sklearn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer  
vectorizer=CountVectorizer()
corpus=["I come to China to travel", 
    "This is a car polupar in China",          
    "I love tea and Apple ",   
    "The work is to write some papers in science"] 
print(vectorizer.fit_transform(corpus))
'''(document index, word index, term frequency)'''
     (0, 16)         1
     (0, 3)          1
     (0, 15)         2
print(vectorizer.get_feature_names())  # 'I' is a stop word and is not counted; print the extracted vocabulary
['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love', 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this', 'to', 'travel', 'work', 'write']
print(vectorizer.fit_transform(corpus).toarray())  # convert to a matrix and print: the four rows are the four sentences, the numbers in the columns are term frequencies, and the resulting 19-dimensional count features serve as input for text classification

(Figure: count matrix output)
TF-IDF is a statistical method for evaluating how important a term is to a document in a corpus.
TF (term frequency): the number of times, or the rate at which, a word appears in a document.
IDF (inverse document frequency): a measure of a word's "weight"; the more common the word, the lower its IDF.
IDF formula:

IDF(x) = log( N / N(x) )

where N is the total number of documents in the corpus and N(x) is the number of documents in the corpus that contain the word x. The TF-IDF value is then TF-IDF(x) = TF(x) * IDF(x).
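
As a quick hand check of the formula on the toy corpus below (note that sklearn's TfidfTransformer defaults to a smoothed variant, idf(x) = ln((N+1)/(N(x)+1)) + 1):

import numpy as np

N = 4          # documents in the toy corpus below
N_china = 2    # "china" appears in 2 of the 4 documents
idf_plain = np.log(N / N_china)                     # plain formula: ~0.693
idf_smooth = np.log((N + 1) / (N_china + 1)) + 1    # sklearn's default (smooth_idf=True): ~1.511
print(idf_plain, idf_smooth)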

from sklearn.feature_extraction.text import TfidfTransformer  
from sklearn.feature_extraction.text import CountVectorizer  

corpus=["I come to China to travel", 
    "This is a car polupar in China",          
    "I love tea and Apple ",   
    "The work is to write some papers in science"] 

vectorizer=CountVectorizer()

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))  
print(tfidf)

The output is (document index, word index, TF-IDF value):
(Figure: TF-IDF output)
Related reference articles for this part:
https://www.cnblogs.com/pinard/p/6693230.html

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix
import itertools
from sklearn.metrics import accuracy_score
#data
good_data=pd.read_csv('goodqueries.txt',names=['url'])
good_data['label']=0
data=good_data
data.head()
##feature
vectorizer = TfidfVectorizer(min_df = 0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,3)) #converting data to vectors
X = vectorizer.fit_transform(data['url'].values.astype('U'))  # TF-IDF vectorization
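
An optional sanity check (a sketch, not part of the original post): inspect which character 1-3 grams the fitted vectorizer extracts from one URL, and the shape of the resulting feature matrix.

analyzer = vectorizer.build_analyzer()
print(analyzer("/rcanimal/")[:10])   # single characters first, then 2-grams and 3-grams
print(X.shape)                       # (number of URLs, number of char n-gram features)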

(Figure: TF-IDF feature matrix output)
Split into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, data['label'].values, test_size=0.2, random_state=42) #splitting data
print(X_train)  # the y_test labels are all set to 0, i.e. everything is normal data
clf=IsolationForest()
clf.fit(X_train)
y_pre = clf.predict(X_test)
ny_pre = np.asarray(y_pre)
ny_pre[ny_pre==1] = 0  ## the model outputs 1 for normal and -1 for anomalies ==> remap so that 0 is normal and 1 is an anomaly
ny_pre[ny_pre==-1] = 1

ny_test = np.asarray(y_test)  # y_test is all 0 because only the goodqueries.txt dataset was loaded
accuracy_score(ny_test,ny_pre)
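
As a further sanity check (a hedged sketch, not from the original post; it assumes badqueries.txt from the later sections is available), the malicious queries can be transformed with the same fitted vectorizer and scored to see how many get flagged:

bad_data = pd.read_csv('badqueries.txt', names=['url'])
X_bad = vectorizer.transform(bad_data['url'].values.astype('U'))   # reuse the fitted TF-IDF vocabulary
y_bad_pre = clf.predict(X_bad)
print((y_bad_pre == -1).mean())   # fraction of malicious URLs flagged as anomalies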

(Figure: accuracy score output)

URL anomaly detection (LSTM)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/6/4 9:25
# @Author  : afanti

import sys
import os
import json
import pandas as pd
import numpy
import optparse
from keras.callbacks import TensorBoard
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from collections import OrderedDict
from keras.models import load_model
from keras.models import model_from_json
def model():
    dataframe = pd.read_csv('goodqueries.txt', names=['url'])
    dataframe['label']=0
    # dataframe.head()
    dataframe1 = pd.read_csv('badqueries.txt', names=['url'])
    dataframe1['label']=1
    # dataframe1.head()
    dataset=pd.concat([dataframe,dataframe1])
    dataset=dataset.sample(frac=1).values
    X = dataset[:,0]
    Y = dataset[:,1]
    for i in range(len(X)):
        if type(X[i])==float:
            X[i]=str(X[i])
    tokenizer = Tokenizer(filters='\t\n', char_level=True)
    tokenizer.fit_on_texts(X)
    X = tokenizer.texts_to_sequences(X)  # a list of sequences; each sequence corresponds to one input text
    word_dict_file = 'build/word-dictionary.json'

    if not os.path.exists(os.path.dirname(word_dict_file)):
        os.makedirs(os.path.dirname(word_dict_file))

    with open(word_dict_file, 'w',encoding='utf-8') as outfile:
        json.dump(tokenizer.word_index, outfile, ensure_ascii=False)  # maps each word (string) to its rank/index

    num_words = len(tokenizer.word_index)+1 #174

    max_log_length = 100
    train_size = int(len(dataset) * .75)

    # padding
    X_processed = sequence.pad_sequences(X, maxlen=max_log_length)
    # split the dataset
    X_train, X_test = X_processed[0:train_size], X_processed[train_size:len(X_processed)]
    Y_train, Y_test = Y[0:train_size], Y[train_size:len(Y)]

    model = Sequential()
    model.add(Embedding(num_words, 32, input_length=max_log_length))
    model.add(Dropout(0.5))
    model.add(LSTM(64, recurrent_dropout=0.5))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    model.fit(X_train, Y_train, validation_split=0.25, epochs=3, batch_size=128)
    # Evaluate model
    score, acc = model.evaluate(X_test, Y_test, verbose=1, batch_size=128)
    print("Model Accuracy: {:0.2f}%".format(acc * 100))
    # Save model
    model.save_weights('securitai-lstm-weights.h5')
    model.save('securitai-lstm-model.h5')
    with open('securitai-lstm-model.json', 'w') as outfile:
        outfile.write(model.to_json())

    df_black = pd.read_csv('badqueries.txt', names=['url'], nrows=20000)
    df_black['label'] = 1
    X_waf = df_black['url'].values.astype('str')
    Y_waf = df_black['label'].values.astype('str')
    X_sequences = tokenizer.texts_to_sequences(X_waf)
    X_processed = sequence.pad_sequences(X_sequences, maxlen=max_log_length)
    score, acc = model.evaluate(X_processed, Y_waf, verbose=1, batch_size=128)
    print("Model Accuracy: {:0.2f}%".format(acc * 100))

Here we use Keras's Tokenizer:

from keras.preprocessing.text import Tokenizer
import keras
tokenizer = Tokenizer(char_level=True)
text = ["/javascript/nets.png", "/javascript/legacy.swf"]
tokenizer.fit_on_texts(text)
# tokenizer.word_counts                       # with char_level=False, texts_to_sequences(["nets swf"]) would give [[2, 5]]
tokenizer.texts_to_sequences(["nets swf"]) 
tokenizer.word_index

When char_level=True, character frequencies are counted at the character level; as the figure below shows, 'a' appears most often.
(Figure: char-level tokenizer.word_index)

Output; looking 'nets swf' up against the figure above:

#[[11, 12, 6, 3, 3, 17, 18]]  (char_level=True; the word-level semantics are gone)

The input data is padded; here every sequence is padded to max_log_length (100 in the code above):

# padding
X_processed = sequence.pad_sequences(X, maxlen=max_log_length)
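
For intuition, a small sketch applying the same padding to the toy tokenizer example above (the exact indices depend on the fitted word_index):

from keras.preprocessing import sequence
seqs = tokenizer.texts_to_sequences(["nets swf"])   # tokenizer fitted in the toy example above
print(sequence.pad_sequences(seqs, maxlen=10))      # shorter sequences are left-padded with zeros up to maxlen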

(Figure: padded sequence output)
For more on the Tokenizer, see:
https://blog.csdn.net/wcy23580/article/details/84885734
https://blog.csdn.net/wcy23580/article/details/84957471
Before training, the Embedding layer performs a word embedding. Word embedding is a family of methods that represent words and documents with dense vectors. output_dim is the size of the vector space in which the words are embedded; input_dim is the number of possible vocabulary values in the text data, here num_words = 174 (len(word_index) + 1); input_length is the length of each input document, e.g. if every document consists of 1000 tokens, then input_length is 1000.
Below we define an embedding layer with a vocabulary of 174 (characters integer-encoded from 1 to 173, plus 0 for padding), a 32-dimensional vector space in which the characters are embedded, and input documents of max_log_length (100) characters each.
The embedding layer has trainable weights; if the model is saved to a file, the embedding layer's weights are included.

model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=32, input_length=max_log_length))
model.add(Dropout(0.5))
model.add(LSTM(64, recurrent_dropout=0.5))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In short, the approach is: char-level tokenization + count vectorization + word embedding + an LSTM neural network.

Related reference here: https://juejin.im/entry/5acc23f26fb9a028d1416bb3

URL anomaly detection (LR)

Import Data

# -*- coding:utf8 -*-
from sklearn.externals import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import urllib.parse
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

def loadFile(name):
    directory = str(os.getcwd())
    filepath = os.path.join(directory, name)
    with open(filepath,'r',encoding="utf8") as f:
        data = f.readlines()
    data = list(set(data))
    result = []
    for d in data:
        d = str(urllib.parse.unquote(d))   #converting url encoded data to simple string
        result.append(d)
    return result

badQueries = loadFile('badqueries.txt')
validQueries = loadFile('goodqueries.txt')

badQueries = list(set(badQueries))
validQueries = list(set(validQueries))
allQueries = badQueries + validQueries
yBad = [1 for i in range(0, len(badQueries))]  #labels, 1 for malicious and 0 for clean
yGood = [0 for i in range(0, len(validQueries))]
y = yBad + yGood
queries = allQueries

Split into training and test sets

X_train, X_test, y_train, y_test = train_test_split(queries, y, test_size=0.2, random_state=42) #splitting data
badCount = len(badQueries) #44532
validCount = len(validQueries) #1265974

Pipeline chains multiple processing steps, such as feature extraction, normalization, and classification, into a typical machine-learning workflow.
In TfidfVectorizer, the analyzer parameter defines whether features are built from word or character n-grams.
sublinear_tf applies sublinear TF scaling, i.e. replaces tf with 1 + log(tf).
ngram_range gives the lower and upper bounds of n for the extracted n-grams. class_weight = {0: 0.9, 1: 0.1} would give class 0 a weight of 90% and class 1 a weight of 10%; the code below uses class_weight="balanced" instead.

pipe_lr = Pipeline([('tfidf', TfidfVectorizer(min_df = 0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,3))),
                    ('clf', LogisticRegression(class_weight="balanced"))
                    ])
pipe_lr.fit(X_train, y_train)

predicted = pipe_lr.predict(X_test)
predicted=list(predicted)
fpr, tpr, _ = metrics.roc_curve(y_test, (pipe_lr.predict_proba(X_test)[:, 1]))
auc = metrics.auc(fpr, tpr)

print("Bad samples: %d" % badCount)
print("Good samples: %d" % validCount)
print("Baseline Constant negative: %.6f" % (validCount / (validCount + badCount)))
print("------------")
print("Accuracy: %f" % pipe_lr.score(X_test, y_test))  #checking the accuracy
print("Precision: %f" % metrics.precision_score(y_test, predicted))
print("Recall: %f" % metrics.recall_score(y_test, predicted))
print("F1-Score: %f" % metrics.f1_score(y_test, predicted))
print("AUC: %f" % auc)
joblib.dump(pipe_lr,"lr.pkl")

(Figure: accuracy, precision, recall, F1 and AUC output)
Test

from urllib.parse import urlparse
from sklearn.externals import joblib
lr=joblib.load("lr.pkl")
def url(url):
    try:
        parsed_url=urlparse(url)
        paths=parsed_url.path+parsed_url.query
        result=lr.predict([paths])
        
        if result==[0]:
            return False
        else:
            return True
    except Exception as err:
        #print(str(err))
        pass

result=url('http://br-ofertasimperdiveis.epizy.com/examples/jsp/cal/feedsplitter.php?format=../../../../../../../../../../etc/passwd\x00&debug=1')
result1=url('http://br-ofertasimperdiveis.epizy.com/produto.php?linkcompleto=iphone-6-plus-apple-64gb-cinza')
result2=url('http://br-ofertasimperdiveis.epizy.com/?q=select * from x')

(Figure: prediction results)

With char-level / word-level features + n-grams + TF-IDF as one recipe, we can solve many problems involving both long and short text; a great deal of key information in security can be treated as long or short text, such as domain name requests, malicious code, and malicious files.

https://www.freebuf.com/articles/network/131279.html
https://github.com/exp-db/AI-Driven-WAF/blob/master/waf.py
https://www.kdnuggets.com/2017/02/machine-learning-driven-firewall.html
Understanding the N-Gram model
Overall reference link:
https://xz.aliyun.com/t/5288#toc-4

Origin: www.cnblogs.com/afanti/p/10974676.html