This paper presents a parameter-efficient, interpretable computational model that emulates the flexibility of human tool selection. We develop a framework that connects visual tool recognition with verbal task understanding through low-dimensional attribute representations. We build a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 attributes spanning physical, functional, and psychological characteristics, paired with natural-language scenarios describing tool use. A visual encoder (ResNet or ViT) extracts attributes from tool images, and a fine-tuned language model (GPT-2, LLaMA, DeepSeek) extracts the attributes required by the task from its description. The proposed approach achieves 74% accuracy on tool selection, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%), and approaching the performance of GPT-4o (73%), which has far more parameters. Human evaluation studies show that the proposed framework matches human decision-making patterns, and generalization experiments demonstrate effective performance on novel tool categories. Ablation studies reveal that manipulation-related attributes (graspability, length, hand relevance) are the most important across all modalities.
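
To make the pipeline concrete, the sketch below illustrates the attribute-matching idea described above under simplifying assumptions: a visual encoder maps each tool image to a 13-dimensional attribute vector, a language model maps the task description to the required attributes, and the tool with the most similar attribute vector is selected. The function and variable names (`select_tool`, `task_vec`, `tool_matrix`) and the cosine-similarity matching rule are illustrative stand-ins, not the paper's exact formulation.

```python
# Hypothetical sketch of attribute-based tool selection.
# Encoder outputs are replaced by random vectors for illustration.
import numpy as np

N_ATTRIBUTES = 13  # physical, functional, and psychological attributes

def select_tool(task_attrs: np.ndarray, tool_attrs: np.ndarray) -> int:
    """Return the index of the tool whose attribute vector is most
    similar (cosine similarity) to the task-required attributes."""
    task = task_attrs / (np.linalg.norm(task_attrs) + 1e-8)
    tools = tool_attrs / (np.linalg.norm(tool_attrs, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(tools @ task))

rng = np.random.default_rng(0)
task_vec = rng.random(N_ATTRIBUTES)             # stand-in for language-model output
tool_matrix = rng.random((115, N_ATTRIBUTES))   # stand-in for visual-encoder output, one row per tool
print(select_tool(task_vec, tool_matrix))       # index of the selected tool
```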