[Paper Reading] Graph Neural Network-Based Anomaly Detection in Multivariate Time Series

1. The main research content of the paper

Paper: Graph Neural Network-Based Anomaly Detection in Multivariate Time Series

Goal: to develop a method specific to multivariate time-series data that explicitly learns a graph of the relationships between sensors.

Weaknesses of existing approaches: existing methods do not explicitly learn the structure of the relationships between variables, nor do they use it to predict the expected behavior of the time series.

Improvement: to take full advantage of the complex relationships between sensors in multivariate time series, graph neural networks (GNNs) are used to learn a graph of the relationships between sensors.

Graph-based methods provide a way to model the relationships between sensors by representing their interdependencies as edges.

  • In general, a GNN assumes that the state of a node is influenced by the states of its neighbors;
  • Graph Convolutional Networks (GCNs) model a node's feature representation by aggregating the representations of its one-hop neighbors;
  • Graph Attention Networks (GATs) use attention functions to compute different weights for different neighbors during aggregation (a toy contrast of the two aggregation styles follows this list).
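To make the contrast concrete, here is a small illustrative sketch (toy code, not from the paper) of uniform GCN-style aggregation versus attention-weighted GAT-style aggregation for a single node:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 nodes with 3-dimensional features; node 0 has neighbors 1, 2, 3.
X = rng.random((4, 3))
neighbors = [1, 2, 3]

# GCN-style aggregation: every one-hop neighbor contributes equally.
gcn_agg = X[neighbors].mean(axis=0)

# GAT-style aggregation: an attention function scores each neighbor,
# and a softmax turns the scores into per-neighbor weights.
scores = X[neighbors] @ X[0]                     # toy dot-product attention scores
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over neighbors
gat_agg = weights @ X[neighbors]                 # weighted sum of neighbor features
```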

Proposed method: Graph Deviation Network (GDN). The method learns a graph of relationships between sensors and detects deviations from the learned patterns. It consists of four main parts:

  1. Sensor Embedding: uses embedding vectors to flexibly capture the unique characteristics of each sensor;
  2. Graph Structure Learning: learns the relationships between sensor pairs and encodes them as edges in a graph;
  3. Graph Attention-Based Forecasting: predicts the future behavior of a sensor with an attention function over its adjacent sensors in the graph;
  4. Graph Deviation Scoring: identifies deviations from the learned sensor relationships, and localizes and explains these deviations.

2. Graph Deviation Network (GDN)

1. Problem Statement

Training data: data from $N$ sensors over $T_{\text{train}}$ time ticks: $s_{\text{train}} = [s_{\text{train}}^{(1)}, s_{\text{train}}^{(2)}, \ldots, s_{\text{train}}^{(T_{\text{train}})}]$, where each $s_{\text{train}}^{(t)}$ is an $N$-dimensional vector giving the values of the $N$ sensors at time $t$.
Following the usual unsupervised anomaly detection formulation, it is assumed that the training data contains only normal data.

Our goal is to detect anomalies in the test data, represented as: $s_{\text{test}} = [s_{\text{test}}^{(1)}, s_{\text{test}}^{(2)}, \ldots, s_{\text{test}}^{(T_{\text{test}})}]$

The output of the algorithm is a set of $T_{\text{test}}$ binary labels indicating whether an anomaly occurred at each time $t$: $a(t) \in \{0, 1\}$, where $a(t) = 1$ means an anomaly occurred at time $t$.
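In terms of array shapes, the setup looks like this (a minimal sketch; the sizes are made up for illustration):

```python
import numpy as np

# Hypothetical sizes, for illustration only.
N, T_train, T_test = 5, 1000, 200

s_train = np.random.rand(T_train, N)  # assumed to contain only normal behavior
s_test = np.random.rand(T_test, N)    # may contain anomalies

# Desired output: one binary label per test tick; a[t] = 1 marks an anomaly at tick t.
a = np.zeros(T_test, dtype=int)
```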

2. Overview

As introduced above, the method consists of four main components: (1) Sensor Embedding, (2) Graph Structure Learning, (3) Graph Attention-Based Forecasting, and (4) Graph Deviation Scoring. Each is detailed below.

[Figure: overview of the four components of the GDN framework]

(1) Sensor Embedding

Introduce an embedding vector for each sensor to represent its characteristics: $\mathbf{v}_i \in \mathbb{R}^d$, for $i \in \{1, 2, \ldots, N\}$

Similarity between the embedding vectors $\mathbf{v}_i$ indicates similarity of behavior, so sensors with similar embedding values should have a high tendency to be related to one another.
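In implementation terms, this is just a learnable embedding table; a PyTorch sketch (the sizes here are illustrative assumptions):

```python
import torch.nn as nn

N, d = 5, 16                    # illustrative: 5 sensors, 16-dim embeddings
embedding = nn.Embedding(N, d)  # one learnable d-dimensional vector per sensor

v = embedding.weight            # v[i] is the embedding of sensor i; shape (N, d)
```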

(2) Graph Structure Learning

Graph structure learning learns a directed graph whose nodes represent sensors and whose edges represent dependency relationships between them.
For sensor $i$, we compute the similarity $e_{ji}$ (a normalized dot product) between its embedding vector and the embedding vectors of its candidate relations $\mathcal{C}_i$:

$$e_{ji} = \frac{\mathbf{v}_i^\top \mathbf{v}_j}{\lVert \mathbf{v}_i \rVert \cdot \lVert \mathbf{v}_j \rVert} \quad \text{for } j \in \mathcal{C}_i$$

Then we select the top $k$ of these normalized dot products; the value of $k$ can be chosen by the user according to the desired sparsity level:

$$A_{ji} = \mathbb{1}\{ j \in \text{TopK}(\{ e_{ki} : k \in \mathcal{C}_i \}) \}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function: $\mathbb{1}\{\text{expression}\} = 1$ if the expression is true and $0$ otherwise.

In the absence of prior information, the candidate relations of sensor $i$ are all sensors other than itself.
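A minimal sketch of this step under that assumption (cosine similarities plus a per-sensor TopK; sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def learn_graph(v: torch.Tensor, k: int) -> torch.Tensor:
    """Build the adjacency matrix A from sensor embeddings v of shape (N, d)."""
    N = v.size(0)
    v_norm = F.normalize(v, dim=1)
    e = v_norm @ v_norm.T                # e[j, i]: normalized dot product of v_j, v_i
    e.fill_diagonal_(float('-inf'))      # candidates exclude the sensor itself
    topk = e.topk(k, dim=0).indices      # for each i, the k most similar sensors j
    A = torch.zeros(N, N)
    A[topk, torch.arange(N)] = 1.0       # A[j, i] = 1 iff j is in TopK for i
    return A

A = learn_graph(torch.randn(5, 16), k=2)  # illustrative sizes
```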


(3) Graph Attention-Based Forecasting

At time $t$, we use a sliding window of size $w$ over the historical time series data; the model input is defined as $\mathbf{x}^{(t)} := [\mathbf{s}^{(t-w)}, \mathbf{s}^{(t-w+1)}, \ldots, \mathbf{s}^{(t-1)}]$. The target output the model must predict is the sensor data at the current tick, i.e., $\mathbf{s}^{(t)}$.
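Constructing these window/target pairs might look like this (a small sketch; array sizes are illustrative):

```python
import numpy as np

def make_windows(s: np.ndarray, w: int):
    """s: (T, N) series. Returns inputs x[t] = [s(t-w), ..., s(t-1)] and targets s(t)."""
    x = np.stack([s[t - w:t].T for t in range(w, len(s))])  # shape (T - w, N, w)
    y = s[w:]                                               # shape (T - w, N)
    return x, y

x, y = make_windows(np.random.rand(100, 5), w=10)  # x: (90, 5, 10), y: (90, 5)
```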

To capture the different behaviors of sensors, we introduce a graph attention-based feature extractor that fuses a node's information with that of its neighbors, based on the learned graph structure:

$$\mathbf{z}_i^{(t)} = \text{ReLU}\Big( \alpha_{i,i} \mathbf{W} \mathbf{x}_i^{(t)} + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j} \mathbf{W} \mathbf{x}_j^{(t)} \Big)$$

where $\mathbf{x}_i^{(t)} \in \mathbb{R}^w$ is the model input for node $i$, $\mathcal{N}(i) = \{ j \mid A_{ji} > 0 \}$ is the neighbor set of node $i$, and $\mathbf{W} \in \mathbb{R}^{d \times w}$ is a trained weight matrix. The attention coefficients $\alpha_{i,j}$ are computed as:

$$\mathbf{g}_i^{(t)} = \mathbf{v}_i \oplus \mathbf{W}\mathbf{x}_i^{(t)}$$
$$\pi(i, j) = \text{LeakyReLU}\big( \mathbf{a}^\top (\mathbf{g}_i^{(t)} \oplus \mathbf{g}_j^{(t)}) \big)$$
$$\alpha_{i,j} = \frac{\exp(\pi(i, j))}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp(\pi(i, k))}$$

where $\oplus$ denotes concatenation and $\mathbf{a}$ is a learned vector of attention coefficients.
In this way, we obtain the representations of all $N$ nodes: $\{\mathbf{z}_1^{(t)}, \mathbf{z}_2^{(t)}, \ldots, \mathbf{z}_N^{(t)}\}$
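A simplified, unbatched sketch of this feature extractor (dense masking instead of sparse message passing; the names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, w, d = 5, 10, 16                        # illustrative sizes
W = nn.Linear(w, d, bias=False)            # plays the role of the shared matrix W
a = nn.Parameter(torch.randn(4 * d))       # attention vector over g_i ⊕ g_j

def extract_features(x, v, A):
    """x: (N, w) windows, v: (N, d) embeddings, A: (N, N) adjacency -> z: (N, d)."""
    Wx = W(x)                                            # W x_i for every node
    g = torch.cat([v, Wx], dim=1)                        # g_i = v_i ⊕ W x_i, shape (N, 2d)
    gi = g.unsqueeze(1).expand(N, N, -1)                 # entry [i, j] holds g_i
    gj = g.unsqueeze(0).expand(N, N, -1)                 # entry [i, j] holds g_j
    pi = F.leaky_relu(torch.cat([gi, gj], dim=-1) @ a)   # pi(i, j), shape (N, N)
    mask = A.T.bool() | torch.eye(N, dtype=torch.bool)   # j in N(i), or j == i
    alpha = pi.masked_fill(~mask, float('-inf')).softmax(dim=1)
    return F.relu(alpha @ Wx)                            # z_i = ReLU(sum_j alpha_ij W x_j)
```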

For each $\mathbf{z}_i^{(t)}$, we multiply it element-wise (denoted $\circ$) with the corresponding embedding vector $\mathbf{v}_i$, and feed the results for all nodes into fully connected layers with output dimension $N$ to predict the vector of sensor values $\mathbf{s}^{(t)}$ at time $t$:

$$\hat{\mathbf{s}}^{(t)} = f_\theta\big( [\mathbf{v}_1 \circ \mathbf{z}_1^{(t)}, \mathbf{v}_2 \circ \mathbf{z}_2^{(t)}, \ldots, \mathbf{v}_N \circ \mathbf{z}_N^{(t)}] \big)$$
We want the model's predicted output to be as close as possible to the observed values, so we minimize the mean squared error between the predicted output $\hat{\mathbf{s}}^{(t)}$ and the observed data $\mathbf{s}^{(t)}$ as the loss function:

$$L_{\text{MSE}} = \frac{1}{T_{\text{train}} - w} \sum_{t=w+1}^{T_{\text{train}}} \lVert \hat{\mathbf{s}}^{(t)} - \mathbf{s}^{(t)} \rVert_2^2$$
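The output layer and objective then look roughly like this (a sketch continuing the shapes above; the depth and width of $f_\theta$ are assumptions for illustration):

```python
import torch
import torch.nn as nn

N, d = 5, 16
# f_theta: fully connected layers mapping the N stacked d-dim products to N outputs.
f_theta = nn.Sequential(nn.Linear(N * d, 64), nn.ReLU(), nn.Linear(64, N))

def forecast(z, v):
    """z: (N, d) node representations, v: (N, d) embeddings -> s_hat of shape (N,)."""
    h = (v * z).flatten()  # element-wise products v_i ∘ z_i, stacked into one vector
    return f_theta(h)

loss_fn = nn.MSELoss()     # L_MSE between s_hat and the observed s(t)
```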


(4) Graph Deviation Scoring

Given the learned relationships, we wish to detect and explain anomalies that deviate from these relationships.

The deviation between the predicted and observed behavior of sensor $i$ at time $t$ is: $\text{Err}_i(t) = \lvert s_i^{(t)} - \hat{s}_i^{(t)} \rvert$

The deviations of different sensors may have very different scales, so we robustly normalize each sensor's deviation:

$$a_i(t) = \frac{\text{Err}_i(t) - \widetilde{\mu}_i}{\widetilde{\sigma}_i}$$

where $\widetilde{\mu}_i$ and $\widetilde{\sigma}_i$ are the median and inter-quartile range (IQR) of the $\text{Err}_i(t)$ values, respectively.

The inter-quartile range (IQR) is the difference between the first quartile (Q1) and the third quartile (Q3) of a distribution or set of values, i.e., IQR = Q3 − Q1; it is a robust measure of the spread of the distribution.

To compute the overall anomalousness at time $t$, we aggregate over sensors with the max function (since anomalies typically affect only a small subset of sensors, or even a single sensor):

$$A(t) = \max_i a_i(t)$$

If $A(t)$ exceeds a set threshold, the data at time $t$ is labeled as an anomaly.
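Putting the scoring step together (a sketch; the error matrix and the threshold value here are illustrative, not from the paper):

```python
import numpy as np

def anomaly_scores(err: np.ndarray) -> np.ndarray:
    """err: (T, N) absolute forecast errors |s_i(t) - s_hat_i(t)|. Returns A(t) per tick."""
    med = np.median(err, axis=0)                   # robust center per sensor
    q1, q3 = np.percentile(err, [25, 75], axis=0)
    iqr = q3 - q1                                  # robust scale per sensor
    a = (err - med) / iqr                          # normalized deviations a_i(t)
    return a.max(axis=1)                           # A(t) = max_i a_i(t)

scores = anomaly_scores(np.abs(np.random.randn(200, 5)))
is_anomaly = scores > 2.5  # made-up threshold for illustration
```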

Source: blog.csdn.net/qq_42757191/article/details/126303195