Real-Time Anomaly Detection Using Distributed Tracing in Microservice Cloud Applications

Mahsa Raeiszadeh, Amin Ebrahimzadeh, Ahsan Saleem, Roch Glitho, Johan Eker, Raquel Mini

Research output: Chapter in Book/Report/Conference proceeding › Paper in conference proceeding › peer-review

Abstract

Distributed tracing plays a vital role in microservice infrastructure, and learning-based trace analysis has been used to detect anomalies in such systems. However, existing learning-based approaches to trace-based anomaly detection have limitations. Some assume that trace patterns can be learned solely from normal executions, while others depend on anomaly injection to generate traces labeled as normal or anomalous. In practice, anomalies may also occur during normal execution, and the wide variety of anomalies that arise cannot be captured through anomaly injection alone. To address these issues, we propose a Trace-Driven Anomaly Detection (TDAD) approach based on a Span Causal Graph (SCG) representation, which trains a model using a Graph Neural Network (GNN) and Positive and Unlabeled (PU) learning. This technique allows the model parameters to be optimized by estimating the underlying data distribution. As a result, TDAD can be trained effectively using a small number of labeled anomalous traces along with a relatively large number of unlabeled traces. Our evaluation shows that TDAD outperforms existing unsupervised trace-based anomaly detection methods by 11.9% in F1-score, and a supervised learning-based benchmark by 12x in detection time.
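
The abstract does not say which PU learning objective TDAD uses. As a rough illustration only, the sketch below computes one commonly used objective, the non-negative PU risk estimator, from model scores for a few labeled anomalous traces and a larger set of unlabeled traces. The function names, the sigmoid surrogate loss, and the `prior` parameter (the assumed fraction of anomalies among unlabeled traces) are assumptions made for this example and are not taken from the paper.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss; close to 0 when the score's sign matches label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk estimate (illustrative, not the paper's implementation).

    pos_scores: model scores for labeled anomalous (positive) traces
    unl_scores: model scores for unlabeled traces
    prior:      assumed fraction of anomalies among the unlabeled traces
    """
    # Risk of treating labeled anomalies as anomalous (should be small).
    pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    # Risk of treating labeled anomalies as normal, used as a correction term.
    pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    # Risk of treating every unlabeled trace as normal.
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Estimated risk on the hidden normal class, clipped at zero.
    neg_risk = max(0.0, unl_as_neg - prior * pos_as_neg)
    return prior * pos_as_pos + neg_risk


# Hypothetical scores: higher means "more anomalous".
labeled_anomalies = [3.0, 4.0]
unlabeled = [-3.0, -4.0, -2.0, 3.5]
print(nn_pu_risk(labeled_anomalies, unlabeled, prior=0.25))
```

In a training loop, a quantity like this would be minimized with respect to the model parameters; a scorer that ranks the labeled anomalies high and most unlabeled traces low yields a lower risk than one that does the opposite.
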
Original language: Swedish
Title of host publication: Proceeding of IEEE CloudNet 2023
Publication status: Published - 2023 Nov 1
Externally published: Yes

Subject classification (UKÄ)

  • Control Engineering
