Developing the Splunk App for Anomaly Detection

Anomaly detection is one of the most common problems that Splunk users are interested in solving via machine learning. This is highly intuitive, as one of the main reasons our Splunk customers are ingesting, indexing, and searching their systems’ logs and metrics is to find problems in their systems, either before, during, or after the problem takes place. In particular, one of the types of anomaly detection that our customers are interested in is time series anomaly detection. Many quantities within Splunk can be represented as a time series, which is a sequence of numeric values recorded over time. Some of the time series Splunk customers like to monitor are number of logins, error counts, and network traffic. Time series anomaly detection refers to the problem of finding unexpected behavior in the time series, which our customers want to be notified about so they can investigate what is going on. For example, an anomaly in a time series representing an app’s error counts over time could indicate that a bug has been introduced in a recent release.

Splunk today has existing anomaly detection approaches that can be applied to time series data, both in SPL and, for more advanced users, in the Machine Learning Toolkit (MLTK). After considering these existing approaches, our team determined that we could deliver an additional time series anomaly detection solution that would be easier to use for users who are not familiar with machine learning. Thus, we built the Splunk App for Anomaly Detection for .conf23.

In this technical blog, we’ll walk through our app’s features, as well as explain how our time series anomaly detection algorithm works and how we evaluated our algorithm. If you are looking for a product overview of the app, you can read "Fastest Time-to-Value Anomaly Detection in Splunk: The Splunk App for Anomaly Detection 1.1.0."

Remediation Workflow

One of the largest obstacles to developing an extensible solution for anomaly detection is the wide variety of pathologies that can arise in real-world time series data. In production IT and security environments, it’s common for data to arrive sporadically at irregularly-spaced intervals and to contain missing or malformed/non-numeric values. These issues cause the majority of time series forecasting and anomaly detection methods - which typically assume an uninterrupted sequence of regularly-spaced numeric values - to suffer degraded performance at best and to break altogether at worst, especially when it comes to detecting seasonal/frequency anomalies.

Our team is invested in a long-term effort to compare, improve, and combine a wide variety of approaches for time series anomaly detection in order to provide users with a state-of-the-art solution. This goal becomes far more difficult if different algorithms have to make different assumptions about the data and each requires its own custom data preprocessing routine. Most importantly, since the purpose of the app is to provide an entirely off-the-shelf solution for anomaly detection, we want to relieve users of the burden of manually cleaning or augmenting their data. As such, we developed a universal remediation workflow that can be applied to any time series in order to standardize it for subsequent analysis by any algorithm.

  1. Regularity Check: Upon input to the app, we first check whether a series’s timestamps are evenly-spaced (“regular”), modulo some possible missing values. For example, a series with most timestamps 5 seconds apart but a few 10 seconds apart can be considered regular with a 5 second resolution and some missing values. We check whether the series is regular with respect to the smallest observed resolution in the data and contains no more than 10% missing values under this resolution.
  2. Downsampling: If the series is irregular or contains too many missing values, we ask the user to downsample their series to a larger resolution. We recommend suitable resolutions that maintain the series’ fidelity while minimizing the amount of missing data, and provide an easy-to-use UI element that allows the user to choose the aggregation function to be applied in the downsampling.
  3. Missing Value Imputation: If the series contains missing or non-numeric values, we perform linear interpolation between the nearest observed points in order to impute these values. Thus, once the time series hits our algorithm, we are assured that it is both regular and contains no missing or malformed values.
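
To make these steps concrete, here is a minimal sketch of this kind of preprocessing, assuming the data is a pandas Series indexed by timestamps. The function names, the 10% missing-value threshold from step 1, and the resolution-doubling fallback in step 2 are illustrative choices; the app's actual implementation (including its UI-driven resolution suggestions) differs.

```python
# A minimal sketch of the remediation workflow, not the app's implementation.
import pandas as pd


def check_regularity(series: pd.Series, max_missing_frac: float = 0.10):
    """Is the series evenly spaced at its smallest observed resolution,
    modulo at most `max_missing_frac` missing points?"""
    deltas = series.index.to_series().diff().dropna()
    resolution = deltas.min()
    span = series.index[-1] - series.index[0]
    expected_points = int(span / resolution) + 1
    missing_frac = 1.0 - len(series) / expected_points
    return missing_frac <= max_missing_frac, resolution


def remediate(series: pd.Series, agg: str = "mean") -> pd.Series:
    """Regularize the series, then linearly interpolate missing values."""
    is_regular, resolution = check_regularity(series)
    if not is_regular:
        # In the app the user picks a coarser resolution and an aggregation
        # function; doubling the finest observed resolution stands in here.
        resolution = 2 * resolution
    regularized = series.resample(resolution).agg(agg)
    return regularized.interpolate(method="linear")


# Example: a 5-second series with the 00:00:10 reading missing.
ts = pd.Series(
    [0.0, 1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    index=pd.date_range("2024-01-01", periods=11, freq="5s").delete(2),
)
print(remediate(ts))  # the missing 00:00:10 slot is filled in as 2.0
```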


An example of an irregularly-spaced time series (left). Our workflow suggests that the data be aggregated over a 1 second resolution; the resultant series has a number of missing values (top right), which we automatically impute using linear interpolation (bottom right).


An added benefit of modularizing the timestamp handling in this way is that it gives us explicit control over how missing data should influence the alerts that a user ultimately receives. When users operationalize their anomaly detection job, they can configure alerts both on the detected anomalies and on missing data in the new data that has arrived at the monitored index since the job was configured. Missing data alerting is a new capability, not previously delivered within Splunk, that is enabled by a custom search command our team developed. The command scans the data and returns the maximum number of consecutive missing data points; the user can then set up alerting on that maximum run of missing data points, as shown in the following image of our alert setup dialog.

The alert setup dialog, where the user configures alerting on the maximum number of consecutive missing data points.
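
To illustrate the statistic that the custom search command returns, the following sketch computes the longest run of consecutive missing values in a regularized series. It is a simplified stand-in for illustration, not the command's actual implementation.

```python
# A minimal sketch of the "longest run of missing points" statistic.
import math


def max_consecutive_missing(values):
    """Return the length of the longest run of missing (None/NaN) values."""
    longest = current = 0
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# A user could then alert when this number exceeds their chosen threshold.
print(max_consecutive_missing([1.0, None, None, 3.0, float("nan"), 2.0]))  # 2
```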

Missing data alerting also benefits the anomaly detection algorithm: because the user can choose to be notified of any missing data, we can more safely rely on linear interpolation to fill in that missing data before sending the series to our algorithm. If we weren’t alerting on missing data, we would have to worry that linear interpolation could cover up the fact that some data was missing, and we might need to flag anomalies on either side of a missing data segment. By separating the concerns of detecting missing data and detecting anomalies, we simplify the user’s workflow and require less input from the user.

Anomaly Detection Algorithm

The new Splunk App for Anomaly Detection is designed with a powerful detection algorithm at its core, leveraging a sequential ensemble of two detectors. This section describes how the system works, highlighting its two primary components: the Anomaly Detection based on Explicit Seasonality detection using Clustering Analysis (ADESCA) detector and an ensemble of simple detectors.

ADESCA Detector

The ADESCA detector serves as a foundational element in our sequential ensemble. Initially developed for the ITSI Assisted Thresholding project, the ADESCA detector utilizes a three-step process for anomaly detection:

  • Subsequence Partitioning and Clustering Analysis: The time series is partitioned into subsequences, and clustering analysis is performed on this collection of subsequences. This step helps identify underlying patterns in the data, particularly clear seasonal patterns.
  • Establishing Normal Behavior Representation: The subsequences closest to their corresponding cluster centers are used to establish a representation of the time series' normal behavior. This representation forms the basis for deriving the anomaly boundary.
  • Anomaly Detection Decision: Each data point is evaluated based on its distance from the established anomaly boundary. If a data point falls significantly outside this boundary, it is flagged as an anomaly.
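
To give a rough sense of these three steps, here is a simplified sketch that assumes the seasonal period is already known and uses scikit-learn's KMeans for the clustering; ADESCA itself detects the period and uses its own clustering and boundary logic, so treat this only as an illustration.

```python
# A rough, illustrative sketch of the three steps above; not ADESCA itself.
import numpy as np
from sklearn.cluster import KMeans


def detect_seasonal_anomalies(values, period, n_clusters=2, tolerance=5.0):
    values = np.asarray(values, dtype=float)

    # 1. Partition the series into period-length subsequences and cluster them.
    n_subseq = len(values) // period
    subsequences = values[: n_subseq * period].reshape(n_subseq, period)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(subsequences)

    anomalies = []
    for c in range(n_clusters):
        member_idx = np.flatnonzero(kmeans.labels_ == c)
        members = subsequences[member_idx]
        # 2. Use the member subsequences closest to the cluster center as the
        #    representation of normal behavior for this cluster.
        dists = np.linalg.norm(members - kmeans.cluster_centers_[c], axis=1)
        closest = members[np.argsort(dists)[: max(1, len(members) // 2)]]
        center = closest.mean(axis=0)
        spread = closest.std(axis=0) + 1e-6
        # 3. Flag points that fall far outside the boundary around the center.
        for i in member_idx:
            deviations = np.abs(subsequences[i] - center) / spread
            for j in np.flatnonzero(deviations > tolerance):
                anomalies.append(int(i) * period + int(j))
    return sorted(anomalies)


# Example: two weeks of hourly data with quiet weekend days and one injected spike.
rng = np.random.default_rng(0)
day = np.sin(np.linspace(0, 2 * np.pi, 24))
week = np.concatenate([day] * 5 + [0.1 * day] * 2)   # five busy days, two quiet days
series = np.tile(week, 2) + rng.normal(0, 0.05, 24 * 14)
series[60] += 4                                       # spike inside the third day
print(detect_seasonal_anomalies(series, period=24))   # the spike at index 60 should be flagged
```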

For example, the time series below exhibits a weekly pattern, with two anomalies: one on 2022-10-17, characterized by lower-than-normal activity, and another on 2022-10-19, featuring unusual activity on a day that is typically inactive.

The example time series, which exhibits a weekly pattern with anomalies on 2022-10-17 and 2022-10-19.

The ADESCA algorithm can detect the weekly pattern and divide the time series into one-day subsequences:

The time series partitioned into one-day subsequences.

With clustering analysis, the collection of subsequences is divided into two clusters:

The two clusters of subsequences: one representing workdays and one representing off-days.

The anomaly on 2022-10-19 is clearly visible in the cluster representing off-days. The anomaly on 2022-10-17 can be identified by isolating the 5th workday from the cluster of workdays.

The figure below shows the final result of the ADESCA detector on this time series.

The final result of the ADESCA detector on the example time series.

Ensemble of Simple Detectors

In cases where a clear seasonal pattern cannot be detected in the time series or the data is too noisy for reliable pattern detection, the ensemble of simple detectors comes into play. It is a parallel ensemble (Earthgecko Skyline) comprising several individual detectors that collectively make decisions using a supermajority vote.

  1. Rolling Average Detector: Calculates the test statistic of a data point as the ratio between the absolute distance of the data point from the rolling mean and the rolling standard deviation.
  2. Rolling Average without Smoothing Detector: Similar to the Rolling Average Detector, but uses the raw value of a data point without applying local smoothing.
  3. Exponential Weighted Moving Average (EWMA) Detector: Similar to above, but applies exponential smoothing to the time series before calculating the rolling mean and standard deviation.
  4. Median Absolute Deviation (MAD) Detector: Calculates the test statistic as the ratio between the absolute deviation from the median and the median of these absolute deviations.
  5. Same-Hour-Yesterday Average Detector: Compares the data point's absolute deviation from the average of the same time yesterday over a one-hour period with the standard deviation over the same period.
  6. Histogram Detector: Calculates the test statistic as the ratio between a pre-configured threshold and the cardinality of points falling in the same histogram bin as the data point to be tested, effectively reporting anomalies when few points fall within a histogram bin.
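
As an illustration, the following sketch implements simplified versions of two of these detectors and the supermajority vote; the window size, thresholds, and two-detector vote are illustrative choices rather than the app's actual parameters.

```python
# Simplified sketches of two simple detectors plus a supermajority vote.
import numpy as np


def rolling_average_detector(values, window=24, threshold=3.0):
    """Flag points far from the rolling mean, in units of rolling std."""
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for i in range(window, len(values)):
        history = values[i - window:i]
        std = history.std() + 1e-6
        flags[i] = abs(values[i] - history.mean()) / std > threshold
    return flags


def mad_detector(values, threshold=6.0):
    """Flag points whose absolute deviation from the median is large
    relative to the median absolute deviation (MAD)."""
    values = np.asarray(values, dtype=float)
    deviations = np.abs(values - np.median(values))
    mad = np.median(deviations) + 1e-6
    return deviations / mad > threshold


def supermajority_vote(detector_flags, min_fraction=2 / 3):
    """A point is anomalous if at least `min_fraction` of detectors agree."""
    votes = np.vstack(detector_flags)
    return votes.mean(axis=0) >= min_fraction


# Example: noise with one injected spike; the spike at index 200 should be flagged.
series = np.concatenate([np.random.default_rng(0).normal(0, 1, 200), [12.0]])
flags = supermajority_vote([rolling_average_detector(series), mad_detector(series)])
print(np.flatnonzero(flags))
```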

Combined Synergy

The ADESCA detector and the ensemble of simple detectors work harmoniously to provide comprehensive anomaly detection:

  • In cases where a clear seasonal pattern is detected, the ADESCA detector establishes a robust anomaly boundary based on normal behavior, resulting in highly confident anomaly reports.
  • When no discernible pattern exists or the data is too noisy for pattern detection, the ensemble of simple detectors collectively evaluates the time series, effectively making comprehensive decisions for anomaly detection.
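
A minimal sketch of this sequential decision logic is shown below. The autocorrelation-based period check is only a simple stand-in for ADESCA's clustering-based pattern detection, and the two detector functions are passed in as parameters rather than being the app's real implementations.

```python
# An illustrative sketch of the sequential ensemble's decision logic.
import numpy as np


def find_seasonal_period(values, min_corr=0.8):
    """Return a candidate period if some lag shows strong autocorrelation,
    otherwise None. A simple stand-in for ADESCA's pattern detection."""
    values = np.asarray(values, dtype=float)
    values = values - values.mean()
    best_lag, best_corr = None, 0.0
    for lag in range(2, len(values) // 2):
        corr = np.corrcoef(values[:-lag], values[lag:])[0, 1]
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag if best_corr >= min_corr else None


def detect(values, seasonal_detector, fallback_detector):
    """Prefer the seasonal detector when a clear pattern exists; otherwise
    fall back to the ensemble of simple detectors."""
    period = find_seasonal_period(values)
    if period is not None:
        return seasonal_detector(values, period)
    return fallback_detector(values)
```

In the app, the seasonal path corresponds to ADESCA and the fallback path corresponds to the Skyline-style ensemble of simple detectors described above.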

Evaluation Methodology

In order to enable data-driven development of the app’s algorithm and to quantify our performance relative to other methods available both within Splunk and from the open-source community at large, we assembled an in-house evaluation suite with various datasets, algorithms, and metrics of relevance to our specific problem setting. The evaluation runs automatically in our codebase’s CI/CD pipelines, so it’s effortless for us to see the performance impacts of every proposed change to the app’s code.

Datasets

The evaluation suite currently contains 200 time series spanning a variety of both internal and external sources. These sources were chosen under the criteria of being domain-specific and representative of the types of time series that we expect customers to bring to the app, while also covering a broad variety of characteristics and different types of anomalies (e.g. contextual, seasonal, frequency, spike, and drift anomalies). While most of the externally-sourced datasets came with explicit anomaly labels, we were responsible for labeling the internally-sourced data ourselves; to do so, we leveraged Label Studio to label regions of anomalous behavior in the time series (using a consensus protocol across multiple labelers).


An example of one of our evaluation time series with an anomalous region labeled in Label Studio

It’s important to note that we label anomalies at the interval/region level rather than at the point level; this is because many contextual- and frequency-based anomalies do not occur cleanly at a single point in time, but instead manifest over a sustained duration. This is the same labeling schema used by our external data sources as well, and as such we can represent all labels across the evaluation suite with a common format.

Algorithms

In order to justify shipping our custom anomaly detection algorithm as a new app, we wanted to make sure that it compares favorably against (i) the other anomaly detection offerings available within Splunk today, as well as (ii) some open-source alternatives for unsupervised anomaly detection on time series. To do so, we included the following anomaly detection algorithms in our evaluation suite:

  • Splunk’s anomalydetection command
  • Splunk’s anomalousvalue command
  • Splunk’s outlier command
  • Splunk’s anomalies command
  • Splunk’s DensityFunction
  • Earthgecko Skyline
  • ADESCA: our in-house seasonal anomaly detector described above
  • ADESCA + Skyline: ADESCA ensembled with the Earthgecko Skyline algorithm (the approach described in full in the previous section)

Metrics

While the anomalies in our evaluation data are labeled as regions, some algorithms are only designed to throw individual point anomalies. To handle evaluation in this setting, we introduce the following terminology. Note that regardless of whether an algorithm’s detection consists of a single point or multiple consecutive points, we consider it to be a single detection.

True Anomalies (TA): The number of anomalous regions labeled in a time series

True Anomalies Detected (TAD): The number of labeled regions in which an algorithm called at least one point anomalous

True Positives (TP): The number of anomalies thrown by an algorithm which fell (at least 50%) inside some labeled region

False Positives (FP): The number of anomalies thrown by an algorithm which did not fall (at least 50%) inside some labeled region
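
The following sketch shows one way these counts can be computed, assuming every labeled region and every detection is represented as an inclusive (start, end) index interval (a point detection is an interval of length one). It follows the definitions above but is not the evaluation suite's actual code.

```python
# An illustrative sketch of the TA/TAD/TP/FP counting rules defined above.

def overlap_fraction(detection, region):
    """Fraction of the detection interval that falls inside the labeled region."""
    d_start, d_end = detection
    r_start, r_end = region
    overlap = max(0, min(d_end, r_end) - max(d_start, r_start) + 1)
    return overlap / (d_end - d_start + 1)


def count_detections(detections, labeled_regions):
    """Apply the definitions above to lists of detections and labeled regions."""
    ta = len(labeled_regions)
    tad = sum(1 for r in labeled_regions
              if any(overlap_fraction(d, r) > 0 for d in detections))
    tp = sum(1 for d in detections
             if any(overlap_fraction(d, r) >= 0.5 for r in labeled_regions))
    fp = len(detections) - tp
    return ta, tad, tp, fp


# One labeled region; one 3-point detection inside it and one point detection outside,
# mirroring the illustrated example further below.
print(count_detections(detections=[(10, 12), (40, 40)], labeled_regions=[(8, 20)]))
# -> (1, 1, 1, 1)
```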

Precision (P): What proportion of the algorithm’s detections were correct?

P = TP / (TP + FP)

Recall (R): What proportion of the true anomalous regions did we detect?

R = TAD / TA

F1: Harmonic mean of Precision and Recall

F1 = 2 * (P * R) / (P + R)

Example illustration of metrics; circles indicate an algorithm's detections. The algorithm detected one anomaly (consisting of 3 consecutive points) inside the true anomalous region => 1 True Anomaly Detected (TAD) and 1 True Positive (TP). It also detected one anomaly outside the region => 1 False Positive (FP).

These metrics were selected so as to judge an algorithm’s ability to detect a period of anomalous behavior irrespective of how it detects it (i.e. with a point or with an interval). The alerting in AnomalyApp works by periodically running a scheduled search on a user-defined cadence to see how many anomalies were detected during that period, and subsequently triggering an alert if the number of detected anomalies exceeds a user-specified minimum value. Since both types of detection (point and interval) trigger the same end-user alerting mechanism, neither one is more valuable than the other to the overall AnomalyApp workflow, and we wanted to reflect this indifference in our evaluation.

Under traditional pointwise metrics, algorithms would be encouraged to throw more anomalies overall, as the reward for detecting a larger percentage of a labeled region - despite still only producing a single alert - would outweigh the cost of throwing some additional false positives. This is clearly a poor tradeoff from the end-user’s perspective, as false positives result in alert fatigue and excess work for the human operators who need to investigate each alert.

The main purpose of the alerts from AnomalyApp is to focus the human operator’s attention on the interesting subsets of the time series they are monitoring rather than to find all the anomalous points in those subsets; this fits with our team’s vision of building assistive intelligence experiences. For the sources we consulted when coming up with our metrics, please see the last three citations in our References section.
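
As a worked example of these formulas, the following snippet computes the metrics from the counts reported for the ADESCA-Skyline ensemble in the Results table below.

```python
# Counts for the ADESCA-Skyline ensemble from the Results table below.
ta, tad, tp, fp = 530, 198, 259, 334

precision = tp / (tp + fp)                            # 259 / 593 ≈ 0.437
recall = tad / ta                                     # 198 / 530 ≈ 0.374
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.403
print(round(precision, 3), round(recall, 3), round(f1, 3))
```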

Results

The table below presents our metrics for each of the algorithms we evaluated, which include a mix of SPL, MLTK, open-source, and proprietary algorithms. In the future, we will add other algorithms to our evaluation.

| Algorithm | TA | TAD | TP | FP | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ADESCA-Skyline Ensemble | 530 | 198 | 259 | 334 | 0.437 | 0.374 | 0.403 |
| ADESCA | 530 | 183 | 224 | 268 | 0.455 | 0.345 | 0.393 |
| anomalydetection | 530 | 208 | 466 | 845 | 0.355 | 0.392 | 0.373 |
| Earthgecko Skyline | 530 | 128 | 229 | 650 | 0.261 | 0.242 | 0.251 |
| anomalousvalue | 530 | 284 | 1281 | 9528 | 0.119 | 0.536 | 0.194 |
| outlier | 530 | 136 | 1208 | 12488 | 0.088 | 0.257 | 0.131 |
| Density Function | 530 | 271 | 901 | 11117 | 0.075 | 0.511 | 0.131 |
| anomalies | 530 | 296 | 1675 | 34346 | 0.047 | 0.558 | 0.086 |

The results show that the ADESCA-Skyline ensemble - the algorithm described in detail above which we ship in the Splunk App for Anomaly Detection - achieves the overall best F1 score. A close runner-up among the others is the anomalydetection algorithm from SPL. We have already started investigating how we can leverage insights from how the anomalydetection algorithm works to continue to improve our algorithm.

Future Work

For the next release of the Splunk App for Anomaly Detection, we plan to further improve our anomaly detection algorithm. In particular, we plan to experiment with several approaches from the academic literature that leverage deep learning for time series anomaly detection, as well as different techniques and decision logic for ensembling algorithms together. In parallel, we expect to expand our evaluation suite to include more datasets as well as more top-performing open-source algorithms.

We also plan to explore a tighter integration between our app’s features and the rest of Splunk’s platform so that users can leverage our capabilities without having to leave the context in which they typically work with their data. Stay tuned for more exciting announcements around anomaly detection in Splunk!

Summary

The Splunk App for Anomaly Detection provides high-quality time series anomaly detection results with an easy-to-use workflow for any Splunk user, including those with no experience with machine learning. In this blog post, we have presented some of the key capabilities delivered in the app, from the remediation workflow for time series datasets with uneven spacing or missing data, to our anomaly detection algorithm and operationalization/alerting. We have also presented our methodology for evaluating the algorithm, including hand-labeling a mix of open-source and internal datasets, and the results of our evaluation, in which our algorithm achieves the top performance. We plan to deliver even better performance and more features in the next version! Make sure to download the free app to get started with machine learning today.

Co-Authors

Houwu Bai is a Senior Data Scientist at Splunk. His work is focused on time series analysis and anomaly detection. Before joining Splunk, he worked for various companies in the San Francisco Bay Area, including SignalFx, Apple, and Arena Solutions. He received his Master's degree in Machine Learning and Pattern Recognition from the Institute of Automation at the Chinese Academy of Sciences.

Kristal Curtis is a Senior Engineering Manager at Splunk. She leads a team that is responsible for delivering machine learning solutions for anomaly detection throughout the Splunk portfolio. Prior to becoming a manager, she worked as an engineer and researcher on various problems related to integrating machine learning into Splunk’s products. Before joining Splunk, Kristal earned her PhD in Computer Science at UC Berkeley, where she was advised by David Patterson and Armando Fox and belonged to the RAD and AMP Labs.

Will Deaderick is a Senior Data Scientist at Splunk where he builds real-time machine learning systems that leverage user interactions to continually learn. Driven by interests in neuroscience and cognitive psychology, he previously pursued graduate studies with a focus on reinforcement learning, receiving an M.S. in Computational and Mathematical Engineering from Stanford University.

References
