Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Extended Histogram-based Outlier Score (EHBOS)

Created by
  • Haebom

Author

Tanvir Islam

Outline

The Histogram-Based Outlier Score (HBOS) is a widely used outlier detection method due to its computational efficiency and simplicity. However, because it assumes independence between features, its ability to detect outliers in datasets where feature interactions are significant is limited. In this paper, we propose the Extended Histogram-Based Outlier Score (EHBOS), an enhancement of HBOS that incorporates two-dimensional histograms to capture dependencies between feature pairs. This extension enables EHBOS to identify contextual and dependency-based anomalies that HBOS fails to detect. Using 17 benchmark datasets, we evaluate the effectiveness and robustness of EHBOS in various anomaly detection scenarios. EHBOS outperforms HBOS on several datasets where feature interactions are crucial for defining the anomaly structure, achieving significant improvements in ROC AUC. These results demonstrate that EHBOS can be a valuable extension of HBOS for modeling complex feature dependencies. Especially in datasets where contextual or relational anomalies play a significant role, EHBOS provides a powerful new anomaly detection tool.
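To make the difference concrete, here is a minimal NumPy sketch of both scoring ideas. It is an illustration under stated assumptions, not the paper's implementation: this summary does not give the exact normalization, binning, or combination rule, so the sketch assumes equal-width bins, heights normalized by the tallest bin, and a sum of negative log heights over all feature pairs. The function names, the `bins` and `eps` parameters, and the demo data are hypothetical.

```python
import numpy as np
from itertools import combinations


def hbos_scores(X, bins=10, eps=1e-9):
    """HBOS-style score: one 1D histogram per feature, features treated independently."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        counts, edges = np.histogram(X[:, j], bins=bins)
        # Interior edges give bin indices 0..bins-1; the maximum lands in the last bin.
        idx = np.digitize(X[:, j], edges[1:-1])
        freq = counts / counts.max()          # normalize so the tallest bin is 1
        scores += -np.log(freq[idx] + eps)    # rare bins contribute large scores
    return scores


def ehbos_scores(X, bins=10, eps=1e-9):
    """Pairwise sketch: one 2D histogram per feature pair (assumed: all pairs)."""
    n, d = X.shape
    scores = np.zeros(n)
    for i, j in combinations(range(d), 2):
        counts, x_edges, y_edges = np.histogram2d(X[:, i], X[:, j], bins=bins)
        xi = np.digitize(X[:, i], x_edges[1:-1])
        yi = np.digitize(X[:, j], y_edges[1:-1])
        freq = counts / counts.max()
        scores += -np.log(freq[xi, yi] + eps)  # rare joint cells contribute large scores
    return scores


def make_demo_data(seed=0, n=1000):
    """Two strongly correlated features plus one point that breaks the correlation."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    inliers = np.column_stack([x, x + rng.normal(scale=0.1, size=n)])
    # Each coordinate is ordinary on its own; only the combination is unusual.
    return np.vstack([inliers, [-1.5, 1.5]])
```

In the demo data, the last row sits well inside both marginal distributions, so the per-feature score has no reason to rank it above points in the tails. Its joint cell contains only that one point, so the pairwise score assigns it the highest value a populated cell can receive.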

Takeaways, Limitations

Takeaways:
Proposes EHBOS, which overcomes HBOS's feature-independence assumption by modeling dependencies between feature pairs with two-dimensional histograms.
Experimentally verifies the effectiveness and robustness of EHBOS on 17 benchmark datasets covering various anomaly detection scenarios.
Improves ROC AUC over HBOS on datasets where feature interactions are important.
Provides a new tool for detecting contextual or relational outliers.
Limitations:
Potential scalability problems and higher computational cost on high-dimensional datasets, because two-dimensional histograms are built over feature pairs.
Further research is needed to determine optimal histogram bin sizes.
Generalization to various types of outlier patterns still needs to be evaluated.
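The cost concern for high-dimensional data can be made concrete with a rough count of histogram cells. The sketch below assumes a histogram is built for every feature pair and that each axis uses the same number of equal-width bins; neither detail is confirmed by this summary, and the function name is hypothetical.

```python
from math import comb


def histogram_cell_counts(n_features, bins):
    """Return (cells for per-feature 1D histograms, cells for all-pairs 2D histograms)."""
    one_dim_cells = n_features * bins                   # d histograms of `bins` cells
    two_dim_cells = comb(n_features, 2) * bins ** 2     # d(d-1)/2 histograms of bins^2 cells
    return one_dim_cells, two_dim_cells


for d in (10, 100):
    print(d, histogram_cell_counts(d, bins=10))
```

Under these assumptions the number of cells grows quadratically in both the feature count and the bin count, which is also why the choice of bin size matters more here than in the one-dimensional case: with a fixed sample size, finer bins leave many joint cells empty or nearly empty.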