**Anomaly detection** plays an important role in the management of modern large-scale distributed systems. Logs are widely used for anomaly detection, recording system runtime information, and errors.

Traditionally, operators have to go through the logs manually with keyword searching and rule matching. The increasing scale and complexity of modern systems, however, make the volume of logs explode, which renders the infeasibility of manual inspection. To reduce manual eﬀort, we need anomaly detection methods based on automated log analysis.

Raw log messages are usually unstructured texts. To enable automated mining of unstructured logs, the ﬁrst step is to perform log parsing, whereby unstructured raw log messages can be transformed into a sequence of structured events. Then we are able to do anomaly detection based on these sequences.

The process of log analysis for anomaly detection involves four main steps：

**Log collection****Log parsing****Feature extraction****Anomaly detection**

**Important**: The Python code to run the last three steps of the anomaly detection pipeline, as well as the log file used for the experiment, can be found on GitHub.

### Log collection

Large-scale systems routinely generate logs to record system states and runtime information, each comprising a timestamp and a log message indicating what has happened. This valuable information could be utilized for multiple purposes (e.g., anomaly detection), and thereby logs are collected first for further usage.

### Log parsing

Logs are plain text that consists of constant parts and variable parts, which may vary among different occurrences.

For instance, given the logs of “Connection from 10.10.34.12 closed” and “Connection from 10.10.34.13 closed”, the words “Connection”, “from” and “closed” are considered as constant parts because they always stay the same, while the remaining parts are called variable parts as they are not fixed.

Constant parts are predefined in source codes by developers, and variable parts are often generated dynamically (e.g., port number, IP address) that could not be well utilized in anomaly detection. The purpose of log parsing is to separate constant parts from variable parts and form a well-established log event (i.e., “Connection from * closed” in the example).

There are two types of log parsing methods: clustering-based (e.g., LKE, LogSig) and heuristic-based (e.g., iPLoM, SLCT). In clustering-based log parsers, distances between logs are calculated first, and clustering techniques are often employed to group logs into different clusters in the next step.

Finally, the **event template** is generated from each cluster. For heuristic-based approaches, the occurrences of each word on each log position are counted. Next, frequent words are selected and composed as the event candidates. Finally, some candidates are chosen to be the log events.

### Feature extraction

The main purpose of this step is to extract valuable features from log events that could be fed into anomaly detection models. The input of feature extraction is log events generated in the log parsing step, and the output is an event count matrix. In order to extract features, we first need to separate log data into various groups, where each group represents a log sequence.

To do so, windowing is applied to divide a log dataset into finite chunks. After constructing the log sequences with windowing techniques, an event count matrix X is generated. In each log sequence, we count the occurrence number of each log event to form the event count vector.

For example, if the event count vector is [0, 0, 2, 3, 0, 1, 0], it means that event 3 occurred twice and event 4 occurred three times in this log sequence. Finally, plenty of event count vectors are constructed to be an event count matrix X, where entry Xij records how many times the event j occurred in the i-th log sequence.

The main purpose of this step is to extract valuable features from log events that could be fed into anomaly detection models. The input of feature extraction is log events generated in the log parsing step, and the output is an event count matrix.

In order to extract features, we first need to separate log data into various groups, where each group represents a log sequence. To do so, windowing is applied to divide a log dataset into finite chunks.

### Anomaly detection

Finally, the feature matrix can be fed to machine learning models for training, and thus generate a model for anomaly detection. The constructed model can be used to identify whether or not a new incoming log sequence is an anomaly.

In this post, we are going to analyze three unsupervised machine learning models: Log Clustering, PCA, and **Invariants Mining**. Unlike supervised methods, unsupervised learning is another common machine learning task but its training data is unlabeled. Unsupervised methods are more applicable in a real-world production environment due to the lack of labels.

### Log Clustering

Log Cluster method requires two training phases, namely the knowledge base initialization phase, and the online learning phase. Thus, the training instances are divided into two parts for these two phases, respectively.

The knowledgebase initialization phase contains three steps: log vectorization, log clustering, and representative vector extraction.

Firstly, log sequences are vectorized as event count vectors, which are further revised by Inverse Document Frequency (IDF) and normalization. Secondly, LogCluster clusters normal and abnormal event count vectors separately with agglomerative hierarchical clustering, which generates two sets of vector clusters (i.e., normal clusters and abnormal clusters) as a knowledge base. Finally, we select a representative vector for each cluster by computing its centroid.

The online learning phase is used to further adjust the clusters constructed in the knowledge-base initialization phase. In the online learning phase, event count vectors are added to the knowledge base one by one. Given an event count vector, the distances between it and existing representative vectors are computed.

If the smallest distance is less than a threshold, this event count vector will be added to the nearest cluster and the representative vector of this cluster will be updated. Otherwise, LogCluster creates a new cluster using this event count vector.

After constructing the knowledge base and completing the online learning process, LogCluster can be employed to detect anomalies.

Specifically, to determine the state of a new log sequence, we compute its distance to representative vectors in the knowledge base. If the smallest distance is larger than a threshold, the log sequence is reported as an anomaly. Otherwise, if the nearest cluster is a normal/ abnormal cluster, the log sequence is reported as normal/abnormal.

### PCA

Principal Component Analysis (PCA) is a statistical method that has been widely used to conduct dimension reduction. The basic idea behind PCA is to project high-dimension data (e.g., high-dimension points) to a new coordinate system composed of *k* principal components (i.e., *k *dimensions), where *k* is set to be less than the original dimension.

PCA calculates the k principal components by finding components (i.e., axes) that catch the most variance among the high-dimension data. Thus, the PCA-transformed low-dimension data can preserve the major characteristics (e.g., the similarity between two points) of the original high-dimension data.

Employing PCA, two subspaces are generated, namely normal space Sn and anomaly space Sa. Sn is constructed by the first k principal components and Sn is constructed by the remaining (n−k), where n is the original dimension.

Then, the projection ya of an event count vector y to Sa is calculated, where P = [v1,v2, …,vk] is the first k principal component. If the length of *y _{a}* is larger than a threshold, the corresponding event count vector will be reported as an anomaly.

### Invariants Mining

Program Invariants are the linear relationships that always hold during system running even with various inputs and under different workloads. Logs that have the same session id (e.g., block id in HDFS) often represent the program execution flow of that session.

Intuitively, Invariants Mining could uncover the linear relationships (e.g., n (A) = n (B)) between multiple log events that represent system normal execution behaviors. Linear relationships prevail in real-world system events. For example, normally, a file must be closed after it was opened.

Thus, the log with the phrase “open file” and the log with the phrase “close file” would appear in pairs. If the number of log events “open file” and of “close file” in an instance is not equal, it will be marked abnormal because it violates the linear relationship.

Invariants mining, which aims at finding invariants (i.e., linear relationships), contains three steps. The input of invariants mining is an event count matrix generated from log sequences, where each row is an event count vector.

Firstly, the invariant space is estimated using singular value decomposition, which determines the amount *r* of invariants that need to be mined.

Secondly, this method finds out the invariants by a brute-force search algorithm. Finally, each mined invariant candidate is validated by comparing its support with a threshold (e.g., supported by 98% of the event count vectors). This step will continue until *r* independent invariants are obtained.

In anomaly detection based on invariants, when a new log sequence arrives, we check whether it obeys the invariants. The log sequence will be reported as an anomaly if at least one invariant is broken.

### Log file: HDFS

The log analysis has been performed on the HDFS (http://hadoop.apache.org/hdfs) log file where HDFS is the Hadoop Distributed File System designed to run on commodity hardware.

**Important**: the original HDFS log ﬁle has a size of over 1.5G. Running the parser and the anomaly detection algorithms on this huge ﬁle can cause both memory and running time problems depending on your hardware performance.

In particular, the Log Cluster algorithm doesn’t scale very well with the number of log lines resulting in an extremely long time to ﬁnish its work. Hence, the size of the original HDFS log has been reduced to about 100k lines due to hardware limitations.

However, reducing the size of the log ﬁle results in less data to train the machine learning models, thus, reducing their performances when detecting anomalies.

### Parser: IPLom

In order to parse the raw log file has been used the IPLom algorithm. IPLoM (Iterative Partitioning Log Mining) is a log parsing method based on heuristics specially designed according to the characteristics of log messages.

Speciﬁcally, IPLoM performs log parsing through a three-step hierarchical partitioning process before template generation:

**Partition by event size**: log messages are partitioned into different clusters according to different lengths.**Partition by token position**: for each partition, words at different positions are counted. Then the position with the least number of unique words is used to split the log messages.**Partition by search for mapping**: further partition is performed on clusters by searching for mapping relationships between the set of unique tokens in two token positions selected using a heuristic criterion.**Log template generation**: Clusters with enough log messages are selected from candidates. Then, the log messages in each cluster can be combined to generate a log template, while the remaining log messages are placed into an outlier cluster.

### Models comparison

Among the three anomaly detection algorithms, PCA in the one presenting the lowest **recall** with a value of just 44% while the other two algorithms achieve over 55% and 61% recall values respectively.

Conversely, all of the analyzed models present a very high **precision** achieving values of near 100% score. This is very important in the anomaly detection context since most of the detected anomalies will result to be true anomalies rather than false positives. Nevertheless, because of the low recall, only a modest number of anomalies have been discovered.

The **F1** metric depends on both the precision and the recall, in particular, it indicates the harmonic mean of precision and recall. Thus, having a low recall comports a signiﬁcant impact on the F1 value of the three models with PCA performing the worst with a metric value equal to 60% while the other two achieve a score of about 70% with Log Clustering performing slightly better.

It is still important to underline that these results have been inﬂuenced by the fact that the models have been trained on a log ﬁle of reduced size with respect to the original one.

Finally, by comparing the **running time** of the three algorithms, it’s easy to observe that LogClustering is by far the slowest one taking about 100ms to complete its detections, and InvariantsMining is the fastest one ﬁnishing its work in a few milliseconds.

Anyway, the bottleneck of the whole process is the time required to train the models which can take a few minutes on the reduced log file but might take several hours or even days when training on the full HDFS log.