"Rethinking Video Anomaly Detection - A Continual Learning Approach" Anomaly Detection WACV-2022

Source address of the paper: Rethinking Video Anomaly Detection - A Continual Learning Approach WACV-2022

1. Summary of the main idea

Overview: current video anomaly detection research mainly targets the pre-labeled abnormal frames in the three popular benchmark datasets. A shortcoming of this line of work is that models are trained only to recognize the characteristics of the normal samples provided in the dataset. As a result, frames showing events that are normal but never appeared in the training set may be flagged as abnormal.
This setup is somewhat at odds with how abnormal activities actually occur in real life. The authors therefore re-examine video anomaly detection, introduce a new dataset and a new evaluation metric, and on that basis propose a new detection algorithm that detects abnormal activities through continual learning.

Research innovations:
1. Design a framework for continual learning and propose a new performance metric based on detection delay and alarm precision.

2. Introduce a new comprehensive dataset for continual learning in video anomaly detection (VAD).

3. Propose a novel algorithm that significantly outperforms state-of-the-art methods in online activity detection and continual learning, and provide guidance for future algorithm design.

2. Related Work

Current status of anomaly detection:

Recent algorithms can be broadly divided into two groups: reconstruction-based methods, which (for example, using generative adversarial networks) classify frames based on reconstruction error, and prediction-based methods, which predict future frames and detect anomalies by comparing the prediction with the actual ground truth. Both mainly target locating abnormal frames, do not account for temporal continuity in video anomaly detection, and are far removed from how abnormal events occur in real life.

3. Continual Video Anomaly Detection

In this section, the authors propose the concept of continual video anomaly detection.

Ideally, when a video anomaly detection system is faced with new detection information, it should be able to update its recognition of nominal patterns/behaviors to avoid false alarms.

However, this is not easy for existing algorithms, because they rely heavily on end-to-end deep neural networks trained on relatively large amounts of data. Such networks are prone to catastrophic forgetting: when trained continually, they tend to forget previously learned information.

In this context, the authors first carefully define a framework for continual learning in video anomaly detection. They then propose a new metric for evaluating online activity detection and an effective algorithm for continual VAD.

New Dataset: NOLA

The authors introduce a new dataset, NOLA, consisting of 110 training video clips and 50 test clips captured from a moving camera on a famous street in New Orleans, Louisiana, USA.
Dataset link address: https://www.earthcam.com/usa/louisiana/neworleans/bourbonstreet/cam=catsmeow2

Each clip is 9000 frames long, extracted at 30 frames per second.

The paper defines what counts as an anomaly in this dataset and gives detailed examples; see the original text.

Overall, the dataset consists of 990,000 training frames and 450,000 testing frames, making it significantly larger than any other available dataset, as shown in Table 1 of the paper. The dataset was manually collected, cleaned, and annotated by the authors. The training set is divided into 11 smaller batches to evaluate continual learning performance: one split is used for initial training, and the remaining 10 splits are used to evaluate continual learning (Fig. 1).
Time series diagram
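
The dataset figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers in the text; the even division of clips across splits is an assumption, not something the summary states.

```python
# Figures quoted in the text above.
TRAIN_CLIPS = 110
TEST_CLIPS = 50
FRAMES_PER_CLIP = 9000
FPS = 30
NUM_SPLITS = 11  # 1 initial training split + 10 continual-learning splits

train_frames = TRAIN_CLIPS * FRAMES_PER_CLIP
test_frames = TEST_CLIPS * FRAMES_PER_CLIP
clip_seconds = FRAMES_PER_CLIP / FPS

# Assumption: training clips are divided evenly across the splits.
clips_per_split = TRAIN_CLIPS // NUM_SPLITS

print(train_frames)     # 990000
print(test_frames)      # 450000
print(clip_seconds)     # 300.0
print(clips_per_split)  # 10
```

The products match the 990,000 training frames and 450,000 testing frames reported above, and each clip corresponds to five minutes of video.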

3.1. Problem Formulation

Although a stream of video frames F = {f1, f2, . . . } is a standard data structure for general video processing, for anomaly detection, a video frame is not a natural data unit due to two main reasons: lack of temporal continuity and interpretability.

Activities happening in a video are what give it temporal continuity (e.g., a running person or a falling object); that is, an event unfolds over a sequence of consecutive frames rather than in a single frame.

The authors therefore redefine the data unit and consider a data structure of streaming video activities X = {x1, x2, . . . }.

The specific definition is given in the original paper.
One thing to note: the anomaly detection task does not require explicitly identifying the activities in the video, which separates it from the activity recognition task.

Two competing objectives make VAD a meaningful and challenging problem: raising an alarm as soon as possible when an anomalous activity takes place, and raising an alarm only when it is relevant.

About the new anomaly detection metrics:

Detection Delay: the detection delay of anomalous activity i is δi = Ti − τi, where τi is the time at which the activity starts and Ti is the time at which the alarm is raised.
![Detection delay formula](https://img-blog.csdnimg.cn/f8254da537234ec7951e51e07c02748d.png)
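
As a minimal sketch of the delay definition δi = Ti − τi, the snippet below computes per-activity delays and their average. The function name `detection_delays` and the frame indices are hypothetical, chosen only for illustration.

```python
from statistics import mean


def detection_delays(onsets, alarm_times):
    """Return delta_i = T_i - tau_i for each anomalous activity i."""
    if len(onsets) != len(alarm_times):
        raise ValueError("need one alarm time per anomalous activity")
    return [t - tau for tau, t in zip(onsets, alarm_times)]


# Hypothetical frame indices: tau_i (activity start) and T_i (alarm raised).
onsets = [100, 450, 900]
alarm_times = [112, 471, 903]

delays = detection_delays(onsets, alarm_times)
print(delays)        # [12, 21, 3]
print(mean(delays))  # 12
```

The mean of the individual delays is what the text later calls the average detection delay.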

Alarm Precision: raise an alarm only when necessary. This is equivalent to the well-known precision measure in binary classification: maximizing alarm precision means maximizing the ratio of true alarms to total alarms.

If alarm j is raised within the relevant duration of an abnormal activity, i.e., Tj+1 ∈ ∪[τi, τi + δmax], it is a true alarm; all other alarms are regarded as false alarms.

The alarm precision is then the ratio described above:
precision = (number of true alarms) / (total number of alarms)
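
The rule above can be sketched as follows: an alarm counts as true if its time falls inside some window [τi, τi + δmax], and precision is the fraction of true alarms. The function name, the example times, and the choice to report 0 when no alarm is raised are assumptions made for this illustration.

```python
def alarm_precision(alarm_times, onsets, delta_max):
    """Fraction of alarms that fall in some window [tau_i, tau_i + delta_max]."""
    if not alarm_times:
        return 0.0  # assumption: with no alarms, precision is reported as 0
    true_alarms = sum(
        any(tau <= t <= tau + delta_max for tau in onsets)
        for t in alarm_times
    )
    return true_alarms / len(alarm_times)


# Hypothetical example: two anomalous activities, four alarms.
onsets = [100, 450]
alarm_times = [90, 105, 300, 460]

# 105 and 460 fall inside a relevant window; 90 and 300 are false alarms.
print(alarm_precision(alarm_times, onsets, delta_max=30))  # 0.5
```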

Average Precision Delay: to obtain a single metric for conveniently comparing VAD algorithms, the authors propose a new metric called Average Precision Delay (APD), which combines average detection delay and alarm precision.

Similar to the way the popular AUC metric summarizes TPR and FPR, APD measures the area under the Precision vs. normalized ADD (NADD) curve.

It follows that, to achieve a high APD, an algorithm must have high precision and low delay in its alarms: an algorithm with an APD value close to 1 has both.
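
To make the "area under the precision vs. NADD curve" idea concrete, here is a small sketch that integrates a set of curve points with the trapezoidal rule. The summary does not say how the curve points are produced or which integration rule the paper uses, so both the input format, a list of (NADD, precision) pairs in [0, 1], and the trapezoidal rule are assumptions.

```python
def average_precision_delay(points):
    """Trapezoidal area under a precision vs. normalized-ADD curve.

    points: iterable of (nadd, precision) pairs, both assumed to lie in [0, 1].
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical curve: precision rises as more delay is tolerated.
curve = [(0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(average_precision_delay(curve))  # 0.875

# A detector that is perfectly precise at every delay reaches the maximum of 1.
print(average_precision_delay([(0.0, 1.0), (1.0, 1.0)]))  # 1.0
```

Under this reading, a curve that reaches high precision already at small normalized delay encloses more area, which matches the statement that an APD close to 1 requires both high precision and low delay.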

The goal of the continual learning setting is to consistently improve APD performance at each training split k. A successful continual VAD algorithm keeps improving its APD as it sees more training splits.

3.2. Continual VAD Algorithm

Continual learning is addressed with a two-stage approach: first, an end-to-end deep learning model extracts a low-dimensional feature embedding for each frame; then, a k-nearest neighbor (kNN) based RNN model is used to prevent catastrophic forgetting.

Specific measures include:

  1. First, the authors use a pre-trained object detector, such as YOLO-v4 [29], to detect objects in each frame. They then use the extracted bounding boxes to construct a feature embedding that represents the spatio-temporal activities observed in the frame. In particular, the algorithm monitors the number of objects detected for each object class, the number of object classes observed, and the day of the week and time of day to which the video frame belongs. Time is further grouped into weekdays versus weekends and daytime versus night, so that whether an activity is abnormal can be judged in that context.
  2. Second, to extract more complex features from each detected object, the authors also use a re-identification and tracking algorithm called DeepSORT, which tracks the path of each detected object in real time. The extracted object paths are fed to an RNN that predicts future paths. The prediction errors of all object paths are then stacked together with the spatio-temporal features into a single feature vector.
  3. Next, the kNN distance of the feature vector is computed with respect to the set of nominal feature vectors stored in the memory module.
    To continually update the k-DNN (the kNN regression network), the authors use experience replay, i.e., in addition to the most recent feature vector and its kNN value, previous feature vectors and kNN values are also used to update the k-DNN.
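
The steps above can be sketched in plain Python. This is an illustrative sketch only, not the authors' code: the class list, the use of the mean distance to the k nearest neighbors as the "kNN distance", and the helper names `frame_features`, `knn_distance`, and `replay_batch` are all assumptions. Object detection, tracking, and the path-prediction error are omitted.

```python
import math
import random
from collections import Counter

TRACKED_CLASSES = ["person", "car", "bicycle"]  # hypothetical subset of classes


def frame_features(detected_classes, is_weekend, is_night):
    """Per-frame features: count per class, number of classes, time flags."""
    counts = Counter(detected_classes)
    vec = [float(counts[c]) for c in TRACKED_CLASSES]
    vec.append(float(len(counts)))           # number of distinct classes seen
    vec.append(1.0 if is_weekend else 0.0)   # weekday vs. weekend
    vec.append(1.0 if is_night else 0.0)     # daytime vs. night
    return vec


def knn_distance(query, memory, k=2):
    """Mean Euclidean distance from query to its k nearest nominal vectors."""
    if not memory:
        raise ValueError("memory of nominal vectors is empty")
    nearest = sorted(math.dist(query, m) for m in memory)[:k]
    return sum(nearest) / len(nearest)


def replay_batch(recent, history, n_old, rng):
    """Experience replay: mix recent (vector, kNN) pairs with sampled old ones."""
    old = rng.sample(history, min(n_old, len(history)))
    return list(recent) + old


# Memory of nominal (normal) feature vectors from earlier frames.
memory = [
    frame_features(["person", "person"], is_weekend=False, is_night=False),
    frame_features(["person", "car"], is_weekend=False, is_night=False),
    frame_features(["person"], is_weekend=False, is_night=False),
]

nominal = frame_features(["person", "person"], False, False)
crowded = frame_features(["person"] * 12 + ["car"], False, True)

nominal_score = knn_distance(nominal, memory)
crowded_score = knn_distance(crowded, memory)
print(nominal_score < crowded_score)  # True

# Training batch for the regressor: the newest pair plus replayed old pairs.
history = [(vec, knn_distance(vec, memory)) for vec in memory]
batch = replay_batch([(crowded, crowded_score)], history, n_old=2,
                     rng=random.Random(0))
print(len(batch))  # 3
```

A frame whose feature vector lies far from everything in the nominal memory gets a large kNN distance, and mixing old pairs into each update batch is what keeps the regressor from forgetting earlier nominal patterns.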

4. Implementation Details

  1. For the kNN regression network (k-DNN), the authors use a fully connected deep neural network with 3 hidden layers of 20 neurons each. They empirically chose the simplest network that gave a significantly lower prediction error. A single-hidden-layer LSTM with two input time steps is used for the decision RNN.
  2. The YOLO object detector is trained on the MS-COCO dataset containing 80 classes, while the DeepSORT object tracker is trained on the MOT16 dataset. For path prediction, an LSTM with three hidden layers and 20 input time steps is used.
  3. The authors remove trajectories with a duration of less than 50 frames. All features are normalized to [0,1] using the maximum and minimum values observed in training.
  4. The entire pipeline runs at approximately 18 fps on an RTX 2070 GPU, which can be significantly improved by using a better GPU or more lightweight models.
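
Two of the preprocessing details above, dropping short trajectories and min-max normalization with training statistics, are simple enough to sketch. The helper names are hypothetical, and the handling of constant features and of test values outside the training range (clipping) is an assumption that the summary does not specify.

```python
MIN_TRACK_LENGTH = 50  # frames, as stated in the text


def filter_short_tracks(tracks, min_length=MIN_TRACK_LENGTH):
    """Drop trajectories that last fewer than min_length frames."""
    return {tid: pts for tid, pts in tracks.items() if len(pts) >= min_length}


def fit_min_max(train_vectors):
    """Per-feature minimum and maximum over the training vectors."""
    columns = list(zip(*train_vectors))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize(vec, mins, maxs):
    """Scale each feature to [0, 1] using the training min and max."""
    out = []
    for v, lo, hi in zip(vec, mins, maxs):
        if hi == lo:
            out.append(0.0)  # assumption: a constant feature maps to 0
        else:
            scaled = (v - lo) / (hi - lo)
            # Assumption: values outside the training range are clipped.
            out.append(min(1.0, max(0.0, scaled)))
    return out


# Hypothetical tracks: track id -> list of (x, y) positions, one per frame.
tracks = {1: [(0, 0)] * 60, 2: [(0, 0)] * 10}
print(sorted(filter_short_tracks(tracks)))  # [1]

train = [[0.0, 10.0], [4.0, 20.0]]
mins, maxs = fit_min_max(train)
print(normalize([2.0, 25.0], mins, maxs))  # [0.5, 1.0]
```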

Algorithm Model Framework Display

5. Experimental results:

Following the continual learning framework, the authors modified two existing algorithms with publicly available code, tested them on the new dataset, and obtained comparative results. Under the continual learning framework, the APD gradually increases with the number of training batches, which demonstrates the benefit of continual learning.

6. Conclusion

The authors propose a new framework and a new comprehensive dataset for continual learning in video anomaly detection, together with new definitions of the data structure and of the anomaly detection task. They also propose a new video anomaly detector that is capable of continual learning through experience replay.

Through extensive testing on the proposed NOLA dataset and on available benchmark datasets, the authors show that the proposed algorithm outperforms two state-of-the-art methods in terms of continual learning as well as the standard frame-level AUC metric.

For future work, the authors plan to leverage audio and video in a multi-modal setup for improved detection performance.

Origin blog.csdn.net/qq_45496282/article/details/124936264