Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All summaries here are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DistJoin: A Decoupled Join Cardinality Estimator based on Adaptive Neural Predicate Modulation

Created by
  • Haebom

Authors

Kaixin Zhang, Hongzhi Wang, Ziqi Li, Yabin Lu, Yingze Li, Yu Yan, Yiming Guan

Outline

This paper defines the "trilemma of cardinality estimation": learned cardinality estimators struggle to balance generality, accuracy, and updatability, which hinders their practical deployment. To address this, the authors present DistJoin, a join cardinality estimator built on efficient distribution prediction with multiple autoregressive models. DistJoin estimates join cardinality from the probability distributions of the individual tables in a decoupled manner, and it relies on a high-throughput distribution estimation model, Adaptive Neural Predicate Modulation (ANPM), to keep this efficient. Through a formal variance analysis, the authors show that existing similar approaches suffer from variance accumulation, and DistJoin reduces this variance by inferring join cardinality with a selectivity-based approach. DistJoin is the first data-driven method to support both equi-joins and non-equi-joins, and it combines high accuracy with fast, flexible updates. Experimental results show that DistJoin achieves the highest accuracy, robustness to data updates, and generality among the compared methods, with comparable update and inference speed.
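To make the decoupled idea concrete, here is a minimal sketch of how a join cardinality (the join result size) can be derived from per-table join-key distributions alone. It is not the authors' implementation: exact empirical histograms stand in for the learned autoregressive models, and the function names and toy columns are hypothetical. The sketch computes the probability that a random pair of rows satisfies the join predicate (a selectivity) and scales it by the size of the cross product, which works for equi-join and non-equi-join predicates alike.

```python
from collections import Counter
from itertools import product


def key_distribution(values):
    """Empirical probability distribution of a join-key column.

    Assumption: this exact histogram is a stand-in for a learned
    per-table distribution model; it is not ANPM.
    """
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}


def join_cardinality(left, right, predicate):
    """Estimate |left JOIN right| from the two per-table distributions.

    predicate(u, v) decides whether key u from `left` joins key v
    from `right`, so the same code covers equi- and non-equi-joins.
    """
    p_left = key_distribution(left)
    p_right = key_distribution(right)
    # Probability that a random (left row, right row) pair joins.
    selectivity = sum(
        pu * pv
        for (u, pu), (v, pv) in product(p_left.items(), p_right.items())
        if predicate(u, v)
    )
    # Scale the selectivity by the size of the cross product.
    return selectivity * len(left) * len(right)


# Hypothetical join-key columns of two small tables.
r_keys = [1, 1, 2, 3, 3, 3]
s_keys = [1, 2, 2, 3, 4]

equi = join_cardinality(r_keys, s_keys, lambda u, v: u == v)
non_equi = join_cardinality(r_keys, s_keys, lambda u, v: u < v)
```

Because the histograms are exact, both estimates on these toy columns equal the true join sizes up to floating-point rounding. With learned distributions each per-table term carries estimation error, which is where the variance concerns discussed above come in.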

Takeaways, Limitations

Takeaways:
DistJoin is the first data-driven method that supports both equi-joins and non-equi-joins.
Achieves higher accuracy, robustness (to data updates), and generality compared to existing methods.
Provides fast, flexible updates and inference.
Identifies the variance accumulation problem of existing similar approaches and proposes a selectivity-based solution.
Limitations:
Lack of detailed description of the specific structure and training process of the ANPM model.
Lack of detailed information about the experimental environment, datasets, and compared methods, which makes reproducibility difficult to ensure.
Lack of performance evaluation in actual large-scale production environments.