# Donut: unsupervised anomaly detection using VAE

In this post, we are going to use Donut, an unsupervised anomaly detection algorithm based on Variational Autoencoder which can work when the data is unlabeled but can also take advantage of the occasional labels when available. In particular, we are going to focus on detecting anomalies on time series KPIs (key performance indicators) which are time-series data, measuring metrics such as the number of page views, online users, and the number of orders monitored by large internet-based companies. KPIs time series have a very simple structure: a timestamp and a value as we can see in the next figure.

Thus, we use the term anomalies to denote the recorded points which do not follow normal patterns (sudden spikes and dips), while using abnormal to denote both anomalies and missing points.

Anomaly detection on KPIs can be formulated as follows: for any time t, given historical observations xt−T+1,…,xt, determine whether an anomaly occurs (denoted by yt=1). An anomaly detection algorithm typically computes a real-valued score indicating the certainty of having yt=1 such as p(yt=1|xt−T+1,…,xt), instead of directly computing yt. Human operators can then affect whether to declare an anomaly by choosing a threshold, where a data point with a score exceeding this threshold indicates an anomaly.

### Variational Autoencoder

Variational Autoencoders (VAEs) are deep generative models in which we have some data X distributed according to an unknown distribution Preal(x). Its task is to learn a distribution Pgen(x) such that Pgen(x) is as similar as possible to Preal(x). Nevertheless, it is intractable to learn Pgen(x) directly, but it may be easier to choose some distribution P(z) and instead model P(x|z). Thus, VAE learns Pgen(x) by first learning an encoded representation of x (encode), which we will call z, drawn from a normal distribution P(z). Afterward, it will use z to generate a new sample x‘ (decode), which will be similar to the original x, drawn from the latent distribution P(x|z). Pgen(x) and P(x|z) are both modeled by a neural network called encoder and decoder respectively.

To deal with the anomaly detection task, a standard VAE model may first recognize normal regions in the original space x or in the latent feature space z, and then compute the anomaly score by measuring how far an observation is from the normal regions. However, existing VAE based anomaly detection methods were not designed for KPIs, thus, not perform well on this kind of datasets.

Important: in order to better understand the following sections, basic knowledge about VAE is required.

### Donut

The network structure of Donut is very similar to is the one of a variational autoencoder, however, VAE is not a sequential model, thus we apply sliding windows of length W over the KPIs: for each point xt, we use xt−W+1,…,xt as the x input vector of VAE. The second main difference is that while the means are derived from linear layers (as in VAE), the standard deviations are derived from soft-plus layers, plus a non-negative small number ϵ (remind that Softplus(a) = log[exp(a) + 1]). The variances are derived in such a way, instead of deriving them by using linear layers because the local variations in the KPIs are so small that the computed variances would probably get extremely close to zero, making their logs unbounded. This would cause severe numerical problems when computing the likelihood of Gaussian variables. Finally, the hidden layers keep the fully-connected structure in order to make the overall architecture fairly simple.

Besides its structure, the are three key techniques that allow Donut to greatly outperform state-of-art supervised and VAE based anomaly detection algorithms: Modified ELBO and Missing Data Injection during training and MCMC Imputation during detection.

### Training with Modified ELBO and Missing Data Injection

VAE based anomaly detection models work by learning normal patterns in the data while avoiding to learn abnormal patterns whenever possible. Therefore, during training, missing points are simply filled with zeros and the ELBO is modified such to exclude the contribution of anomalies and missing points. We will call this particular technique as Modified ELBO (M-ELBO). With M-ELBO, the loss is calculated in the following way:

$L(x)=E_{q_\phi(z|x)}[\sum_{w=1}^{W}\alpha_w \, log \, p_\theta(x_w|z)+\beta \, log \, p_\theta(z)-log \, q_\phi(z|x)]$

$\beta=\sum_{w=1}^{W}\alpha_w/W$

Where αw is defined as an indicator, if αw=1 it indicates xw being not an anomaly or missing point, and αw=0 otherwise. Note this equation still holds when there are no labeled anomalies in the training data (β=1). The contribution of pθ(xw|z) from labeled anomalies and missing points are directly excluded by αw, while the scaling factor β shrinks the contribution of pθ(z) according to the ratio of normal points in x. This modification trains Donut to correctly reconstruct the normal points within x, even if some points in x are abnormal.

Furthermore, during training is also applied the Missing Data Injection technique: randomly is set a ratio λ of normal points to be zero, as if they are missing points. Such that with more missing points, Donut is trained more often to reconstruct normal points when given abnormal x, thus the effect of M-ELBO is amplified. This injection is done before every epoch, and the points are recovered once the epoch is finished.

### Detection with MCMC imputation

Unlike discriminative models that are designed for just one purpose (a classifier is designed for just computing the classification probability p(y|x)), generative models like VAE can derive various outputs. In the scope of anomaly detection, the output the likelihood of observation window x (pθ(xw|z) in VAE), since we want to see how well a given x follows the normal patterns rather than generating a new sample. The main problem is that during detection, the anomalies and missing points in a testing window x can bring bias to the mapped z, and further make the reconstruction probability inaccurate. However, since the missing points are always known, we have the chance to eliminate the biases introduced by missing points by exploiting a novel method known as MCMC Imputation (Markov chain Monte Carlo). It works in the following way: the testing data x is divided into observed and missing parts, (xo,xm). Firstly, a z sample is obtained from qϕ(z|xo,xm), then a reconstruction sample (x’o,x’m) is obtained from pθ(xo,xm|z). Finally, (xo,xm) is replaced by (xo,x’m) which results in the observed points being kept fixed and the missing points replaced by the new values. This process is iterated for M times and then the final (xo,x’m) is used for computing the reconstruction probability. The intermediate x’m will keep getting closer to normal values during the whole procedure. Given sufficiently large M, the biases can be reduced, so that Donut can get a more accurate reconstruction probability.

After MCMC, we take L samples of z to compute the reconstruction probability by Monte Carlo integration. During the reconstruction phase, is computed only the reconstruction probability score for the last point in the sliding window (xt in xt−W+1,…,xt), since we want to respond to anomalies as soon as possible during the detection.

### Implementation

Donut implementation relies on two important libraries: Zhusuan and TFsnippet. Once installed, we can proceed to install the Donut package itself. More information about how to set up the working environment can be found on the official repository: https://github.com/haowen-xu/donut.

Important: Donut and Zhusuan are not supported for Tensorflow v2.x. Update as 12/2019

Then, let’s start by importing the required packages and loading the csv file containing the dataset.

import numpy as np
import pandas as pd
from donut import complete_timestamp, standardize_kpi, Donut, DonutTrainer, DonutPredictor
from tensorflow import keras as K
from tfsnippet.modules import Sequential
from tfsnippet.utils import get_variables_as_dict, VariableSaver
import tensorflow.compat.v1 as tf
import os
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import f1_score
tf.disable_v2_behavior()

# path to the dataset
file_csv = "cpu4.csv"

timestamp = data["timestamp"]
values = data["value"]
labels = data["label"]
dataset_name = file_csv.split('.')[0]
print("Timestamps: {}".format(timestamp.shape[0]))

As mentioned in the introduction, KPIs datasets are actually pretty simple, in our case we have just three columns: timestamp, value, and label (0 normal, 1 anomaly). We can then print the number of rows which should be exactly 99999.

Nevertheless, the dataset may contain some missing points (for example, xn and xn+2 are present but xn+1 is missing). The complete_timestamp function fills these missing points with 0s and returns an array indicating their position in the augmented dataset.

# Complete the timestamp filling missing points with zeros, and obtain the missing point indicators.
timestamp, missing, (values, labels) = complete_timestamp(timestamp, (values, labels))
print("Missing points: {}".format(np.sum(missing == 1)))
print("Labeled anomalies: {}".format(np.sum(labels == 1)))

The resulting augmented dataset should contain 3037 missing points and 9168 anomalies labeled.

Next, as usual, we split the dataset in training and test set.

# Split the training and testing data.
test_portion = 0.3
test_n = int(len(values) * test_portion)
train_values, test_values = values[:-test_n], values[-test_n:]
train_labels, test_labels = labels[:-test_n], labels[-test_n:]
train_missing, test_missing = missing[:-test_n], missing[-test_n:]
print("Rows in test set: {}".format(test_values.shape[0]))
print("Anomalies in test set: {}".format(np.sum(test_labels == 1)))

There are 1023 anomaly points in the test set our model has to find out of 30910 rows.

The last step of the data preprocessing phase consists of standardizing both the train and test set but excluding the missing points or the anomalies from the computation of the mean and the standard deviation.

# Standardize the training and testing data, anomaly points or missing points are excluded
train_values, mean, std = standardize_kpi(
train_values, excludes=np.logical_or(train_labels, train_missing))
test_values, _, _ = standardize_kpi(test_values, mean=mean, std=std)
print("Train values mean: {}".format(mean))
print("Train values std: {}".format(std))

Now that the data is ready, let’s define the model building it within the scope of “model_vs”.

sliding_window = 120

# define the model inside the 'model_vs' scope
with tf.variable_scope('model') as model_vs:
model = Donut(
h_for_p_x=Sequential([
K.layers.Dense(100, kernel_regularizer=K.regularizers.l2(0.001),
activation=tf.nn.relu),
K.layers.Dense(100, kernel_regularizer=K.regularizers.l2(0.001),
activation=tf.nn.relu),
]),
h_for_q_z=Sequential([
K.layers.Dense(100, kernel_regularizer=K.regularizers.l2(0.001),
activation=tf.nn.relu),
K.layers.Dense(100, kernel_regularizer=K.regularizers.l2(0.001),
activation=tf.nn.relu),
]),
x_dims=sliding_window,
z_dims=5,
)
# use DonutTrainer class to train the model
trainer = DonutTrainer(model=model, model_vs=model_vs, max_epoch=30)
# use DonutPredictor class to make predictions
predictor = DonutPredictor(model)

The class DonutTrainer is used to train the model while predictions are made through the class DonutPredictor. Our model has a sliding window of 120 points and a feature space of 5 neurons. Both encoder and decoder are symmetrical, composed of two hidden layers of 100 neurons each and Relu activation function.

Next, we are going to train it and create a file to save it so that we can reuse the same model in the future to make new predictions avoiding to retain it again.

save_dir = "model/" + dataset_name + "/"
if not os.path.exists(save_dir):
os.makedirs(save_dir)
saved = True
if len(os.listdir(save_dir)) == 0:
saved = False

if saved is False:
with tf.Session().as_default():
# train the model
trainer.fit(train_values, train_labels, train_missing, mean, std)
# save variables to 'save_dir' directory
var_dict = get_variables_as_dict(model_vs)
saver = VariableSaver(var_dict, save_dir)
saver.save()
saved = True

Once the model has been trained, we can conclude the code with the prediction phase.

if saved:
with tf.Session().as_default():
# restore variables from 'save_dir'
saver = VariableSaver(get_variables_as_dict(model_vs), save_dir)
saver.restore()
# make predictions
test_score = predictor.get_score(test_values, test_missing)
print("Number of predictions: {}".format(test_score.shape[0]))
# try different thresholds
best_threshold = 0
best_f1 = 0
best_predictions = []
thresholds = np.arange(5, 50, 0.2)
for t in thresholds:
threshold = t  # can be changed to better fit the training data
anomaly_predictions = []
for l in test_score:
if abs(l) > threshold:
anomaly_predictions.append(1)
else:
anomaly_predictions.append(0)
# strategy to compute modified metrics
# https://arxiv.org/pdf/1802.03903.pdf, fig 7
for i in range(sliding_window-1, len(anomaly_predictions)):
if anomaly_predictions[i-sliding_window+1] == 1 and test_labels[i] == 1:  # true positive
j = i-1
while j >= sliding_window-1 and test_labels[j] == 1\
and anomaly_predictions[j-sliding_window+1] == 0:
anomaly_predictions[j-sliding_window+1] = 1
j -= 1
j = i+1
while j < len(anomaly_predictions) and test_labels[j] == 1\
and anomaly_predictions[j-sliding_window+1] == 0:
anomaly_predictions[j-sliding_window+1] = 1
j += 1
f1 = f1_score(test_labels[sliding_window-1:], anomaly_predictions, average='binary')
if f1 > best_f1:
best_f1 = f1
best_threshold = threshold
best_predictions = anomaly_predictions

anomaly_predictions = np.array(best_predictions)
print("--  final results --")
print("Best anomaly threshold {}".format(best_threshold))
print("Anomalies found: {}/{}".format(np.sum(anomaly_predictions == 1), np.sum(test_labels == 1)))
prfs = precision_recall_fscore_support(test_labels[sliding_window-1:], anomaly_predictions)
print("--- anomaly rows ---")
print("precision: {:.3f}".format(prfs[0][1]))
print("recall: {:.3f}".format(prfs[1][1]))
print("fscore: {:.3f}".format(prfs[2][1]))

Note that in order to find the best threshold we test different threshold values memorizing only the one giving the highest value of F1 metric computed on the detected anomalies. Is it also worth reminding that the first W-1 point, where W is the length of the sliding window, are excluded from the final results since their label can’t be predicted.

### Evaluation method

During the evaluation, the testing results are computed according to the following simple strategy: if any point in an anomaly segment in the ground truth can be detected by a chosen threshold, we say this segment is detected correctly, and all points in this segment are treated as if they can be detected by this threshold. Meanwhile, the points outside the anomaly segments are treated as usual. Thus, the precision, recall, and F1 and are then computed according to this method.

In the above figure, the first row is the ground truth with 10 contiguous points and two anomaly segments highlighted in the shaded squares. The detector scores are shown in the second row. The third row shows the point-wise detector results with a threshold of 0.5. The fourth row shows the detector results after adjustment. In this way, we shall get precision 0.6, and recall 0.5.

### Results

After executing the code, we should find the best threshold having a value of 46.8 and have been founded 1032 out of 1023 anomalies, thus finding some false positives. We should also get a precision value of around 0.99, 100% recall, and an F1 equal to 0.996. For simplicity, we tested Donut on a simple and small dataset extracted from the KPIs of a large internet-based company. On normal KPIs, the best F1 scores of Donut range from 0.75 to 0.90 which is still enough to outperform other state-of-art supervised and VAE-based anomaly detection algorithms.

The full code, which can be used to reproduce the same results or as a baseline for other projects, can be found on my GitHub repository linked in the next section.