Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes

Created by
  • Haebom

Author

Aamod Khatiwada, Harsha Kokel, Ibrahim Abdelaziz, Subhajit Chaudhury, Julian Dolby, Oktie Hassanzadeh, Zhenhan Huang, Tejaswini Pedapati, Horst Samulowitz, Kavitha Srinivas

Outline

This paper presents TabSketchFM, a neural network-based table model, to address the growing enterprise need to identify related tables (tables that are unionable, joinable, or subsets of each other) in data lakes. TabSketchFM improves the data discovery effectiveness of neural table models through a sketch-based pretraining method, and the pretrained model is fine-tuned to identify unionable, joinable, and subset table pairs. It achieves significant performance gains over existing neural table models, and detailed ablation studies reveal which sketches are crucial for each task. Furthermore, the fine-tuned model is used to perform table search (finding other tables in the data lake that are unionable with, joinable with, or subsets of a query table), yielding significant improvements in F1 score over state-of-the-art techniques. Finally, strong transfer learning performance across diverse datasets and tasks demonstrates the model's generalizability.
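As a rough intuition for how sketches can stand in for full columns during data discovery, the following minimal example uses a MinHash-style column sketch to estimate the overlap (Jaccard similarity) between two columns, a common signal for joinability. This is an illustrative sketch only, not the paper's actual pretraining inputs or implementation; all function names and parameters here are hypothetical.

```python
import hashlib

def minhash_sketch(values, num_hashes=64):
    """Build a MinHash sketch of a column: for each of num_hashes seeded
    hash functions, keep the minimum hash value over all cell values.
    The sketch has fixed size regardless of column length."""
    sketch = []
    for seed in range(num_hashes):
        min_h = min(
            int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16)
            for v in set(values)
        )
        sketch.append(min_h)
    return sketch

def estimate_jaccard(sketch_a, sketch_b):
    """The fraction of matching sketch slots approximates the Jaccard
    similarity of the underlying value sets (a joinability signal)."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

# Two ID columns drawn from overlapping domains (true Jaccard = 50/150 ≈ 0.33)
col_a = [f"id{i}" for i in range(100)]
col_b = [f"id{i}" for i in range(50, 150)]

est = estimate_jaccard(minhash_sketch(col_a), minhash_sketch(col_b))
print(round(est, 2))  # roughly 0.33, up to sketching error
```

Because the sketches are compact and fixed-size, comparisons like this scale to data-lake-sized collections without scanning full columns; TabSketchFM's contribution is feeding such sketch information into a neural model rather than using it only for direct similarity estimates.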

Takeaways, Limitations

Takeaways:
Sketch-based pretraining can improve the data discovery performance of neural table models.
The model outperforms existing methods at identifying unionable, joinable, and subset table pairs, as well as at table search.
Strong transfer learning performance across diverse datasets and tasks demonstrates the model's generalization ability.
Ablation studies identify which sketches are important for each task.
Limitations:
The generalizability of the proposed sketch-based pretraining method needs further study; it may overfit to particular datasets or tasks.
Performance and scalability in real-world, large-scale data lake environments are not evaluated.
Applicability to other types of data (e.g., unstructured data) remains to be explored.