Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DistJoin: A Decoupled Join Cardinality Estimator based on Adaptive Neural Predicate Modulation

Created by
  • Haebom

Authors

Kaixin Zhang, Hongzhi Wang, Ziqi Li, Yabin Lu, Yingze Li, Yu Yan, Yiming Guan

Outline

This paper frames the three competing requirements of learned cardinality estimation (generality, accuracy, and updatability) as the "Trilemma of Cardinality Estimation" and proposes DistJoin, a join cardinality estimator built on efficient distribution prediction with multiple autoregressive models. DistJoin estimates join cardinality from the probability distributions of the individual tables in a decoupled manner, and it obtains the required efficiency from Adaptive Neural Predicate Modulation (ANPM), a high-throughput distribution estimation model. The authors also formally analyze the variance accumulation problem of existing similar approaches and reduce that variance by inferring join cardinality from selectivities. According to the paper, DistJoin is the first data-driven method to support both equi-joins and non-equi-joins while offering high accuracy and fast, flexible updates. In the reported experiments, DistJoin achieves the highest accuracy, robustness to data updates, and generality among the compared methods, with update and inference speeds comparable to theirs.
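The idea of estimating a join from per-table distributions can be illustrated with a minimal sketch. It is not the paper's implementation: exact value frequencies stand in for the learned autoregressive models and ANPM, and the function names (`key_distribution`, `estimate_join_cardinality`) and the toy tables are hypothetical. The sketch treats the join selectivity as the probability that two independently drawn join-key values satisfy the join operator, then scales it by the two table sizes.

```python
from collections import Counter
from itertools import product
import operator


def key_distribution(rows, key):
    """Return {value: probability} for one join-key column of one table.

    Assumption: exact frequencies replace a learned per-table model.
    """
    counts = Counter(row[key] for row in rows)
    total = len(rows)
    return {value: n / total for value, n in counts.items()}


def estimate_join_cardinality(n_left, dist_left, n_right, dist_right,
                              op=operator.eq):
    """Estimate |L join R| as |L| * |R| * P(op(l, r)).

    l and r are join-key values drawn independently from the two
    per-table distributions; `op` is the join comparison.
    """
    selectivity = sum(
        p_l * p_r
        for (v_l, p_l), (v_r, p_r) in product(dist_left.items(),
                                               dist_right.items())
        if op(v_l, v_r)
    )
    return n_left * n_right * selectivity


# Hypothetical toy tables.
orders = [{"cust": 1}, {"cust": 1}, {"cust": 2}, {"cust": 3}]
customers = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 4}]

d_orders = key_distribution(orders, "cust")
d_customers = key_distribution(customers, "id")

# Equi-join: orders.cust = customers.id
equi = estimate_join_cardinality(len(orders), d_orders,
                                 len(customers), d_customers)

# Non-equi join: orders.cust < customers.id
non_equi = estimate_join_cardinality(len(orders), d_orders,
                                     len(customers), d_customers,
                                     operator.lt)
```

Because the distributions here are exact, both estimates coincide with the true join sizes (4 rows for the equi-join, 8 for the non-equi join). With learned distributions each factor carries estimation error, which is the setting in which the variance concerns mentioned above arise. Note that only the comparison operator changes between the two cases; each table's distribution is computed independently of the other table.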

Takeaways, Limitations

Takeaways:
Presents the first data-driven method that supports both equi-joins and non-equi-joins.
Achieves high accuracy, robustness to data updates, and generality simultaneously.
Provides fast and flexible update capabilities.
Presents a new, selectivity-based approach to the variance accumulation problem of existing methods.
Limitations:
The specific implementation of ANPM and its performance gains are not explained in detail.
Additional experimental results are needed for different datasets and join types.
Additional verification of scalability and stability in real-world operating environments is required.