Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language

Created by
  • Haebom

Authors

Guangfu Hao, Haojie Wen, Liangxuan Guo, Yang Chen, Yanchao Bi, Shan Yu

Outline

This paper presents a parameter-efficient, interpretable computational model that mimics humans' flexible tool selection ability. The authors develop a framework that connects visual tool recognition and verbal task understanding through low-dimensional attribute representations. They build a dataset (ToolNet) containing 115 common tools labeled with 13 attributes covering physical, functional, and psychological characteristics, paired with natural language scenarios describing tool use. A visual encoder (ResNet or ViT) extracts attributes from tool images, and a fine-tuned language model (GPT-2, LLaMA, DeepSeek) extracts the required attributes from task descriptions; tool selection then reduces to aligning the two attribute representations. The proposed approach achieves 74% accuracy on tool selection tasks, significantly outperforming direct tool matching (20%) and small multimodal models (21%-58%), and is comparable to the much larger GPT-4o (73%). Human evaluation studies indicate that the framework matches human decision-making patterns, and generalization experiments show effective performance on novel tool categories. Ablation studies show that manipulation-related attributes (graspability, length, hand relevance) are the most important across all modalities.
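The attribute-alignment step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute values are invented, only the three attributes named in the summary are used instead of all 13, and cosine similarity is an assumed matching rule, since the summary does not state which metric the paper uses. The dictionaries stand in for the outputs of the visual encoder and the language model.

```python
import math

# Hypothetical subset of the 13 ToolNet attributes (only these three are
# named in the summary above).
ATTRIBUTES = ["graspability", "length", "hand_relevance"]

# Stand-in for the visual encoder: one attribute vector per tool image.
# All numbers are invented for illustration.
TOOL_ATTRIBUTES = {
    "hammer": [0.9, 0.6, 0.9],
    "ladder": [0.3, 1.0, 0.2],
    "bowl": [0.5, 0.1, 0.4],
}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tool(required, tools):
    """Return the tool whose attribute vector best aligns with the required one."""
    return max(tools, key=lambda name: cosine_similarity(required, tools[name]))


# Stand-in for the language model: attributes required by a task such as
# "reach a book on a high shelf" (values invented).
required_attributes = [0.2, 1.0, 0.1]
print(select_tool(required_attributes, TOOL_ATTRIBUTES))
```

Because both modalities are projected into the same small attribute space, the selection itself needs no learned parameters, and each decision can be inspected attribute by attribute.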

Takeaways, Limitations

Takeaways:
Presents a parameter-efficient, interpretable computational model that mimics humans' flexible tool selection ability.
Introduces a novel framework linking visual tool recognition and linguistic task understanding.
Achieves high accuracy (74%) on tool selection tasks.
Validates that model behavior is consistent with human decision-making patterns.
Demonstrates generalization to novel tool categories.
Reveals the importance of manipulation-related attributes.
Limitations:
The ToolNet dataset is relatively small (115 tools), which may limit its coverage.
Generalization to all types of tools and tasks is not guaranteed.
Accuracy is only on par with GPT-4o (74% vs. 73%), not clearly ahead of it.
The model may be biased toward certain attributes.
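The attribute-importance finding, and the possible attribute bias noted above, can be probed with a simple ablation. The sketch below is a toy illustration under stated assumptions, not the paper's protocol: the tools, tasks, and values are invented, cosine similarity is an assumed matching rule, and "ablating" an attribute here just means dropping that dimension from both vectors before matching.

```python
import math

# Hypothetical attribute names and toy data (all values invented).
ATTRIBUTES = ["graspability", "length", "hand_relevance"]
TOOLS = {
    "hammer": [0.9, 0.6, 0.9],
    "ladder": [0.3, 1.0, 0.2],
    "bowl": [0.5, 0.1, 0.4],
}
# Each task pairs a required-attribute vector with the expected tool.
TASKS = [
    ([0.8, 0.5, 1.0], "hammer"),
    ([0.2, 1.0, 0.1], "ladder"),
    ([0.6, 0.1, 0.3], "bowl"),
]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def accuracy(tasks, tools, kept):
    """Selection accuracy when only the attribute indices in `kept` are used."""
    correct = 0
    for required, answer in tasks:
        req = [required[i] for i in kept]
        best = max(tools, key=lambda name: cosine([tools[name][i] for i in kept], req))
        correct += best == answer
    return correct / len(tasks)


def ablation_drops(tasks, tools, attributes):
    """Accuracy lost when each attribute is removed in turn."""
    all_idx = list(range(len(attributes)))
    baseline = accuracy(tasks, tools, all_idx)
    drops = {}
    for i, name in enumerate(attributes):
        kept = [j for j in all_idx if j != i]
        drops[name] = baseline - accuracy(tasks, tools, kept)
    return drops


for name, drop in ablation_drops(TASKS, TOOLS, ATTRIBUTES).items():
    print(f"{name}: accuracy drop {drop:.2f}")
```

A larger drop for an attribute suggests the selection depends more heavily on it; a consistently dominant attribute would be one sign of the bias mentioned in the limitations.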