In this paper, we present InfoMax, a novel data pruning method that maximizes information content and minimizes redundancy. InfoMax measures the information content of individual samples using their importance scores and quantifies redundancy based on the similarity between samples. The coreset selection problem is formulated as a discrete quadratic programming (DQP) problem, maximizing the sum of the contributions of individual samples minus the redundancy introduced by similar samples. Using an efficient gradient-based solver, a sparsification technique for the similarity matrix, and a dataset partitioning strategy, we ensure scalability even to datasets with millions of samples. We experimentally demonstrate InfoMax's superior performance on various data pruning tasks, including image classification, vision-language pretraining, and instruction tuning of large-scale language models. The code is available at https://github.com/hrtan/InfoMax.
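To make the DQP formulation concrete, here is a minimal sketch of how such an objective might be optimized: maximize s^T x − λ·x^T A x over binary selections x with a budget of k samples, relaxing x to [0,1], taking projected gradient steps, and rounding to the top-k coordinates. This is not the authors' implementation; the function names, the λ weight, the learning rate and step count, and the top-k rounding are all illustrative assumptions, and the paper's similarity-matrix sparsification and dataset partitioning are not reproduced here.

```python
import numpy as np

def infomax_objective(x, scores, sim, lam):
    """Value of a (relaxed) selection x in [0,1]^n:
    summed per-sample importance minus pairwise redundancy."""
    return scores @ x - lam * x @ (sim @ x)

def select_coreset(scores, sim, k, lam=0.5, lr=0.1, steps=200):
    """Illustrative relaxed solver for the DQP
    max_x s^T x - lam * x^T A x, s.t. x in {0,1}^n, sum(x) = k.
    The binary constraint is relaxed to the box [0,1]; the budget
    is enforced only at the end by keeping the top-k coordinates."""
    n = len(scores)
    x = np.full(n, k / n)  # feasible relaxed starting point
    for _ in range(steps):
        grad = scores - lam * (sim + sim.T) @ x  # gradient of the objective
        x = np.clip(x + lr * grad, 0.0, 1.0)     # ascent step + box projection
    return np.argsort(-x)[:k]                    # round: keep the k largest

# Toy usage: 6 samples with cosine-style similarities (hypothetical data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sim = np.maximum(feats @ feats.T - np.eye(6), 0.0)  # zero out self-similarity
scores = rng.uniform(0.5, 1.0, size=6)              # per-sample importance
idx = select_coreset(scores, sim, k=3)
x = np.zeros(6); x[idx] = 1.0
print(idx, infomax_objective(x, scores, sim, lam=0.5))
```

The rounding step is the crudest part of this sketch; a faithful solver would keep the iterates on (or project them back to) the budget constraint during optimization rather than only at the end.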
◦ InfoMax, a new data pruning method (coreset selection) based on information content, is presented.
◦ Scalable algorithms are developed that can be applied efficiently to large-scale datasets.
◦ Strong performance is verified across a range of tasks, including image classification, vision-language pre-training, and fine-tuning of large-scale language models.
◦ Reproducibility is ensured through the publicly released code.
• Limitations:
◦ Further analysis is needed of the performance and efficiency of the gradient-based solver used for the DQP problem.
◦ Additional validation of generalization across diverse datasets and models is needed.
◦ The sample importance scores and similarity measures leave room for improvement.