TY - JOUR
T1 - The value of human data annotation for machine learning based anomaly detection in environmental systems
AU - Russo, Stefania
AU - Besmer, Michael D.
AU - Blumensaat, Frank
AU - Bouffard, Damien
AU - Disch, Andy
AU - Hammes, Frederik
AU - Hess, Angelika
AU - Lürig, Moritz
AU - Matthews, Blake
AU - Minaudo, Camille
AU - Morgenroth, Eberhard
AU - Tran-Khac, Viet
AU - Villez, Kris
N1 - Publisher Copyright: © 2021
PY - 2021/11/1
Y1 - 2021/11/1
N2 - Anomaly detection is the process of identifying unexpected data samples in datasets. Automated anomaly detection is either performed using supervised machine learning models, which require a labelled dataset for their calibration, or unsupervised models, which do not require labels. While academic research has produced a vast array of tools and machine learning models for automated anomaly detection, the research community focused on environmental systems still lacks a comparative analysis that is simultaneously comprehensive, objective, and systematic. This knowledge gap is addressed for the first time in this study, where 15 different supervised and unsupervised anomaly detection models are evaluated on 5 different environmental datasets from engineered and natural aquatic systems. To this end, anomaly detection performance, labelling efforts, as well as the impact of model and algorithm tuning are taken into account. As a result, our analysis reveals the relative strengths and weaknesses of the different approaches in an objective manner without bias for any particular paradigm in machine learning. Most importantly, our results show that expert-based data annotation is extremely valuable for anomaly detection based on machine learning.
KW - Anomaly detection
KW - Environmental systems
KW - Labels
KW - Machine learning
U2 - 10.1016/j.watres.2021.117695
DO - 10.1016/j.watres.2021.117695
M3 - Article
C2 - 34626884
AN - SCOPUS:85116532784
SN - 0043-1354
VL - 206
JO - Water Research
JF - Water Research
M1 - 117695
ER -