Anomaly Detection in Practice: Isolation Forest (IF)

Table of contents

1. Principle

2. Detailed explanation of parameters

3. Hands-on practice

1. Data introduction

2. Data preprocessing

3. Feature derivation

4. Anomaly analysis and feature screening

5. Model construction and evaluation

6. Optimization of categorical variable encoding

Follow us

Reference


Preface (important)

        In anomaly detection, algorithm selection is only one part of the process. The most important early-stage work is to mine target-related features based on the business scenario and business goals (for credit or transaction fraud, for example, focus on mining fraud-related features), to understand the data distribution and its characteristics, and to screen features before choosing an algorithm suited to their distribution. In addition, interpretability must be considered in some business scenarios; and since anomaly detection is an unsupervised technique, it is better suited to assisting supervised models than to making decisions on its own. This series will cover these points as comprehensively as possible, and readers are welcome to discuss.

        

        Paper address: Isolation Forest

1. Principle

        Isolation Forest is a tree-based anomaly detection algorithm. Its principle can be summarized as follows: by randomly selecting features and randomly selecting split points, the data set is recursively partitioned into smaller and smaller subsets until each subset contains only one data point, so that every point ends up isolated. Outliers need fewer splits to be isolated, so their path lengths are shorter than those of normal points; averaged over all trees, samples with shorter paths receive higher anomaly scores. The specific calculation process is as follows:

  1. Construct the isolation forest: set the number of trees and the maximum depth of each tree. For each tree, randomly sample a subset of the data, randomly select features and split values, and recursively build a binary tree until each leaf contains only one sample or the depth threshold is reached.

  2. Calculate the path length: for each data point x, compute its path length h(x, T) in every tree T, and take the average path length over all trees, E(h(x)).

  3. Calculate the anomaly score: from the average path length, compute the anomaly score s(x) of each data point:

     $$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

     where n is the number of samples in the data set and c(n) is the expected average path length of an unsuccessful search in a binary search tree built on n samples, used for normalization:

     $$c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + \gamma$$

     where H(i) is the i-th harmonic number and γ ≈ 0.5772 is the Euler-Mascheroni constant.

  4. Filter out anomalies: compare each anomaly score against a predetermined threshold; if $s(x)$ is greater than the threshold, $x$ is considered an outlier. (A small numeric sketch of the scoring formula follows this list.)
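        To make the scoring formula concrete, here is a minimal numeric sketch (not from the original article) that computes c(n) and s(x) from an average path length:

import numpy as np

def c(n):
    # Expected average path length of an unsuccessful BST search over n samples
    if n <= 1:
        return 0.0
    euler_gamma = 0.5772156649
    harmonic = np.log(n - 1) + euler_gamma  # H(n-1) ≈ ln(n-1) + γ
    return 2 * harmonic - 2 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # s(x, n) = 2 ** (-E(h(x)) / c(n)); values close to 1 indicate anomalies
    return 2 ** (-avg_path_length / c(n))

# With n = 256 samples, a point isolated after ~4 splits scores noticeably higher
# than one that needs ~12 splits:
print(round(anomaly_score(4, 256), 3), round(anomaly_score(12, 256), 3))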

2. Detailed explanation of parameters

params = {
    'n_estimators': 1000,      # number of isolation trees
    'max_samples': 'auto',     # samples drawn per tree; 'auto' uses min(256, n_samples)
    'contamination': 'auto',   # expected proportion of outliers; 'auto' uses the threshold from the original paper
    'max_features': 1.0,       # fraction of features sampled per tree
    'bootstrap': False,        # whether to sample with replacement
    'n_jobs': 4,               # number of parallel workers
    'random_state': 1,         # random seed
    'verbose': 0               # verbosity level
}
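        As a quick illustration of how these parameters plug into scikit-learn (a toy example with assumed data, not part of the original article):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[0.10], [0.20], [0.15], [0.18], [5.0]])  # 5.0 is an obvious outlier

model = IsolationForest(n_estimators=100, max_samples='auto',
                        contamination='auto', random_state=1)
model.fit(X)

print(model.predict(X))            # -1 marks outliers, 1 marks inliers
print(model.decision_function(X))  # lower values are more anomalous
print(model.score_samples(X))      # negated anomaly score from the original paper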

3. Hands-on practice

1. Data introduction

        Data source: the UEBA-based user abnormal online behavior analysis competition data on DataFountain. You can download it from the link below (registration and login required), or follow the WeChat public account "Python risk control model and data analysis" and reply "anomaly detection IF practice" to obtain it.

                Analysis of abnormal user online behavior based on UEBA - Competitions - DataFountain

        Competition description:

        With the continuous improvement of enterprise informatization, treating data as an asset has become a consensus among more and more enterprises. In production, operation, and management activities such as industrial and service business, marketing support, business operations, and risk management, enterprises handle a large amount of trade secrets, work secrets, and private information of employees and customers.
  At present, the vast majority of enterprises have introduced management measures and codes of conduct around sensitive data protection, but abnormal operational behaviors that lead to sensitive data leakage still occur. The "Securonix 2020 Insider Threat Report" pointed out that 60% of internal network security and data leakage incidents are related to abnormal operating behavior of enterprise users.
  To effectively protect corporate sensitive data, enforce corporate security codes of conduct, and prevent data leakage incidents caused by abnormal operations, analyzing and identifying abnormal user behavior has become a key and difficult technical problem.

        The task: using machine learning, deep learning, UEBA, and other artificial intelligence methods, build a baseline of user online behavior and an online behavior evaluation model from unlabeled daily online log data, and score each record by how far it deviates from the baseline.
  (1) Construct a behavioral baseline from users' daily online data;
  (2) Use an unsupervised learning model, based on features of users' online behavior, to evaluate the degree to which each behavior deviates from the baseline.

2. Data preprocessing

        The training set contains 520,000+ records, with ret as the given anomaly score (the evaluation target). Apply label encoding to the account, group, IP, url, switchIP, port, and vlan fields, and extract year, month, week, day, hour, and minute from the time field.

3. Feature derivation

        Count the number of unique values of IP, url, switchIP, port, vlan, etc., grouped by group and by account, producing features such as group_IP_nunique;

        Count the number of records per (group, IP) combination to obtain group_IP_cnt, and likewise for the other field pairs;

Code for steps 1-3:

# Import packages
import re
import os
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve,roc_auc_score
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import gc
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
import math
from sklearn import metrics
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
import time
from sklearn.model_selection import KFold,cross_val_score


# 1. Read the data
df=pd.read_csv('数据文件/基于UEBA的用户上网异常行为分析.csv',encoding='gbk')
print(df.shape)
df.head()

# 2. Feature encoding and feature derivation
def get_label_encode(df,col_list):
    
    def label_encode(df,col):
        value_list=list(df[col].unique())
        return {value:i+1 for value,i in zip(value_list,range(len(value_list)))}

    result={}
    for col in col_list:
        result[col]=label_encode(df,col)
    return result
    
def data_label_encode(df,map_dic):
    df_copy=df.copy()
    for col in map_dic.keys():
        if col in df.columns:
            df_copy[col]=df_copy[col].map(map_dic[col])
    return df_copy

def time_pre(df):
    df.time=pd.to_datetime(df.time)
    df['year']=df.time.dt.year
    df['month']=df.time.dt.month
    df['week']=df.time.dt.isocalendar().week  # dt.week is deprecated in recent pandas versions
    df['day']=df.time.dt.day
    df['hour']=df.time.dt.hour
    df['minute']=df.time.dt.minute
    return df

def fea_derive(df_copy):
    result=(
        df_copy
        .merge(df_copy.groupby(['group','IP']).id.count().reset_index().rename(columns={'id':'group_IP_cnt'}),how='left',on=['group','IP'])
        .merge(df_copy.groupby(['group','switchIP']).id.count().reset_index().rename(columns={'id':'group_switchIP_cnt'}),how='left',on=['group','switchIP'])
        .merge(df_copy.groupby(['group','vlan']).id.count().reset_index().rename(columns={'id':'group_vlan_cnt'}),how='left',on=['group','vlan'])
        .merge(df_copy.groupby(['group','port']).id.count().reset_index().rename(columns={'id':'group_port_cnt'}),how='left',on=['group','port'])
        .merge(df_copy.groupby(['account','IP']).id.count().reset_index().rename(columns={'id':'account_IP_cnt'}),how='left',on=['account','IP'])
        .merge(df_copy.groupby(['account','switchIP']).id.count().reset_index().rename(columns={'id':'account_switchIP_cnt'}),how='left',on=['account','switchIP'])
        .merge(df_copy.groupby(['account','vlan']).id.count().reset_index().rename(columns={'id':'account_vlan_cnt'}),how='left',on=['account','vlan'])
        .merge(df_copy.groupby(['account','port']).id.count().reset_index().rename(columns={'id':'account_port_cnt'}),how='left',on=['account','port'])
        
        .merge(df_copy.groupby(['group']).IP.nunique().reset_index().rename(columns={'IP':'group_IP_nunique'}),how='left',on=['group'])
        .merge(df_copy.groupby(['group']).switchIP.nunique().reset_index().rename(columns={'switchIP':'group_switchIP_nunique'}),how='left',on=['group'])
        .merge(df_copy.groupby(['group']).vlan.nunique().reset_index().rename(columns={'vlan':'group_vlan_nunique'}),how='left',on=['group'])
        .merge(df_copy.groupby(['group']).port.nunique().reset_index().rename(columns={'port':'group_port_nunique'}),how='left',on=['group'])
        .merge(df_copy.groupby(['account']).IP.nunique().reset_index().rename(columns={'IP':'account_IP_nunique'}),how='left',on=['account'])
        .merge(df_copy.groupby(['account']).switchIP.nunique().reset_index().rename(columns={'switchIP':'account_switchIP_nunique'}),how='left',on=['account'])
        .merge(df_copy.groupby(['account']).vlan.nunique().reset_index().rename(columns={'vlan':'account_vlan_nunique'}),how='left',on=['account'])
        .merge(df_copy.groupby(['account']).port.nunique().reset_index().rename(columns={'port':'account_port_nunique'}),how='left',on=['account'])

    )
    
    return result

map_dic=get_label_encode(df,['account', 'group', 'IP', 'url', 'switchIP','port','vlan'])
df_copy=df.pipe(data_label_encode,map_dic).pipe(time_pre).pipe(fea_derive)
df_copy.head()

4. Anomaly analysis and feature screening

        Since anomaly detection works on the data distribution, values that differ from the bulk of the data (low-frequency values) are treated as outliers, so it is worth statistically analyzing each feature's distribution.

        First, examine the IP, url, switchIP, port, and vlan features. url and port contain a large number of low-frequency values; such features contribute little to anomaly detection, so they are eliminated.

        The distributions of IP, vlan, and switchIP look reasonable, so these features are retained.

        Among the time features, month, week, and day are roughly uniformly distributed, so only hour is retained. As shown in the figure, 12:00 to 23:00 is the low-frequency online period.

        The cnt and nunique statistical features are screened by distribution as well; only account_IP_nunique, account_switchIP_nunique, and account_port_nunique are retained.
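        As a rough way to quantify this screening (the threshold of 10 occurrences below is an assumption, not from the original analysis), the share of low-frequency values in each candidate field can be checked before plotting:

# Sketch: fraction of category values seen fewer than 10 times in each candidate field
for col in ['IP', 'url', 'switchIP', 'port', 'vlan']:
    vc = df_copy[col].value_counts()
    low_freq_share = (vc < 10).mean()
    print(col, 'unique:', df_copy[col].nunique(), 'low-freq share:', round(low_freq_share, 3))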

import pyecharts.options as opts
from pyecharts.charts import Bar, Line

def distribution_plot_df(df,col,plot='bar'):
    sta=df[col].value_counts().sort_index()  # use the passed-in dataframe, not the global df_copy
    x,y=list(sta.index),list(sta)
    return distribution_plot(x,y,col,plot)

def distribution_plot(x,y,col,plot='bar'):
    if plot=='bar':
        return distribution_bar(x,y,col)
    else:
        return distribution_line(x,y,col)

def distribution_bar(x,y,col):
    bar = (
        Bar(init_opts=opts.InitOpts(width="700px", height="500px"))
        .add_xaxis(x)
        .add_yaxis(
            col, 
            y,
            label_opts=opts.LabelOpts(is_show=False),
            markpoint_opts=opts.MarkPointOpts(
                label_opts=opts.LabelOpts(
                    font_size=13,
                    border_width=10
                ),
            ),
        )

        .set_global_opts(
#             title_opts=opts.TitleOpts(title=title),
            tooltip_opts=opts.TooltipOpts(trigger="axis"),
            yaxis_opts=opts.AxisOpts(
                splitline_opts=opts.SplitLineOpts(
                    is_show=True, linestyle_opts=opts.LineStyleOpts(opacity=1)
                ),
            ),
        )
    )
    return bar

def distribution_line(x,y,col):
    line = (
        Line(init_opts=opts.InitOpts(width="900px", height="500px"))
        .add_xaxis(x)
        .add_yaxis(
            col, 
            y,
            label_opts=opts.LabelOpts(is_show=False),
            is_smooth=True,
            is_symbol_show=False,
            linestyle_opts=opts.LineStyleOpts(width=1.3),
        )

        .set_global_opts(
#             title_opts=opts.TitleOpts(title=title),
            tooltip_opts=opts.TooltipOpts(trigger="axis"),
            yaxis_opts=opts.AxisOpts(
                type_="value",
                splitline_opts=opts.SplitLineOpts(
                    is_show=True, linestyle_opts=opts.LineStyleOpts(opacity=1)
                ),
            ),
        )
    )
    return line
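        A possible call, assuming a Jupyter notebook environment, to inspect the hour distribution used in the screening above:

# Example usage: bar chart of the hour distribution
distribution_plot_df(df_copy, 'hour', plot='bar').render_notebook()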

5. Model construction and evaluation

        Final feature list: account,group,hour,vlan,switchIP,IP,account_IP_nunique,account_switchIP_nunique,account_port_nunique

        Split the data into training and test sets at a ratio of 8:2, then fit an IF model and predict. Since ret lies in the range 0-1 while the raw IF output (decision_function) lies roughly between -0.2 and 0.1, the predictions are min-max normalized and inverted into the 0-1 range. Finally, RMSE and Score = 1/(1+RMSE) are computed on the training set and the test set; the closer Score is to 1, the better.

        The distributions of true and predicted values are shown below; adjusting the prediction distribution to better approximate the target distribution is worth considering.

from sklearn.ensemble import IsolationForest
def distribution_line2(x,y1,y2):  # true vs. predicted value curves
    line = (
        Line(init_opts=opts.InitOpts(width="650px", height="400px"))
        .add_xaxis(x)
        .add_yaxis(
            series_name="true",
            y_axis=y1,
            label_opts=opts.LabelOpts(is_show=False),
            is_smooth=True,
            is_symbol_show=False,
            linestyle_opts=opts.LineStyleOpts(width=1.1),
        )
        .add_yaxis(
            series_name="pred",
            y_axis=y2,
            label_opts=opts.LabelOpts(is_show=False),
            is_smooth=True,
            is_symbol_show=False,
            linestyle_opts=opts.LineStyleOpts(width=1.1),
        )
        .set_global_opts(
            tooltip_opts=opts.TooltipOpts(trigger="axis"),
            xaxis_opts=opts.AxisOpts(
                type_="category",
                axislabel_opts=opts.LabelOpts(rotate=30)
            ),
            yaxis_opts=opts.AxisOpts(
                type_="value",
                splitline_opts=opts.SplitLineOpts(
                    is_show=True, linestyle_opts=opts.LineStyleOpts(opacity=1)
                ),
            ),
        )
        .set_series_opts(
            areastyle_opts=opts.AreaStyleOpts(opacity=0.05),
            label_opts=opts.LabelOpts(is_show=False),
        )
    )

    return line

def rmse_value(y_true,y_pred):
    mse=mean_squared_error(y_true, y_pred)
    rmse=mse**0.5
    score=1/(1+rmse)
    return rmse,score

def get_if_model(df,fea_list): # train the isolation forest model
    params={
        'n_estimators' : 1000 ,   # number of isolation trees
        'max_samples' : 'auto' ,  # samples drawn per tree; 'auto' uses min(256, n_samples)
        'contamination' : 'auto' ,  # expected proportion of outliers; 'auto' uses the threshold from the original paper
        'max_features' : 1.0 ,  # fraction of features sampled per tree
        'bootstrap' : False ,  # whether to sample with replacement
        'n_jobs' : 4  ,  # number of parallel workers
        'random_state' : 1 , 
        'verbose' : 0  # verbosity level
    }
    if_model = IsolationForest(**params)
    if_model.fit(df[df['sample']=='train'][fea_list])
    
    return if_model

fea_list=['account', 'group', 'hour', 'vlan', 'switchIP', 'IP', 
           'account_IP_nunique', 'account_switchIP_nunique', 'account_port_nunique']

if_model=get_if_model(df_copy,fea_list)
df_copy['if_pred']=if_model.decision_function(df_copy[fea_list])

scaler = MinMaxScaler()
df_copy['if_pred_adjust']=1-pd.DataFrame(scaler.fit_transform(df_copy[['if_pred']]))[0]
rmse_value(df_copy.ret,df_copy.if_pred_adjust)
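        The snippet above assumes a 'sample' column marking the 8:2 split described earlier. A minimal sketch of how that column and the per-split evaluation might look (column name, seed, and helper usage are assumptions, not taken from the original code):

# Hypothetical sketch: mark an 8:2 train/test split in a 'sample' column
# (this must run before get_if_model, which fits only on the 'train' part)
from sklearn.model_selection import train_test_split

train_idx, test_idx = train_test_split(df_copy.index, test_size=0.2, random_state=1)
df_copy['sample'] = 'test'
df_copy.loc[train_idx, 'sample'] = 'train'

# Then evaluate RMSE / Score on each split separately
for part in ['train', 'test']:
    part_df = df_copy[df_copy['sample'] == part]
    print(part, rmse_value(part_df.ret, part_df.if_pred_adjust))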

6. Optimization of categorical variable encoding

        In section 2 (Data preprocessing), label encoding was applied directly to categorical variables such as IP and vlan. Here we instead encode by frequency, assigning codes so that low-frequency values sit at the tail of a long-tailed distribution, which makes it easier for the isolation forest to split off low-frequency samples early (see the sketch below).
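        A minimal sketch of such a frequency-based encoding, reusing the pipeline functions defined above (the helper name get_freq_encode is an assumption; the author's exact implementation is only available through the original channel):

# Sketch: frequency encoding. The most frequent value gets code 1, rarer values
# get larger codes, so low-frequency samples sit in the tail of the distribution.
def get_freq_encode(df, col_list):
    result = {}
    for col in col_list:
        freq = df[col].value_counts()  # sorted by frequency, descending
        result[col] = {value: rank + 1 for rank, value in enumerate(freq.index)}
    return result

map_dic_freq = get_freq_encode(df, ['account', 'group', 'IP', 'url', 'switchIP', 'port', 'vlan'])
df_copy_freq = df.pipe(data_label_encode, map_dic_freq).pipe(time_pre).pipe(fea_derive)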

        Keeping the feature list and model parameters unchanged, retrain the IF model and evaluate it again. The test-set Score improved from 0.8202 to 0.8216.

Follow us

        Follow the WeChat public account "Python risk control model and data analysis" and reply "anomaly detection IF practice" to get the data and complete code for this article, along with more theory and code sharing.

Reference

        Competition entry solutions from the 2021 CCF BDCI results compilation
