Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

Created by
  • Haebom

Author

Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, Somak Aditya

Outline

This paper presents NLKI, an end-to-end framework for improving small vision-language models (sVLMs) on commonsense visual question answering (VQA). NLKI retrieves natural-language facts, uses an LLM to generate natural-language explanations from them, and feeds both signals to the sVLM. Using ColBERTv2 with entity-rich prompts for fact retrieval, the generated explanations reduce hallucinations and improve accuracy by up to 7%. Further fine-tuning with a noise-robust loss function yields additional gains of 2.5% on the CRIC dataset and 5.5% on AOKVQA, lifting sVLMs such as FLAVA to the level of mid-sized VLMs like Qwen-2 VL-2B and SmolVLM-2.5B. The study shows that LLM-generated commonsense knowledge is more effective than retrieval from commonsense knowledge bases, that noise-aware training stabilizes small models under external knowledge augmentation, and that parameter-efficient commonsense reasoning is feasible even at 250 million parameters.
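The retrieve-explain-augment flow described above can be sketched as follows. This is a minimal illustration only: the toy lexical-overlap retriever stands in for ColBERTv2, and the explanation step stands in for the LLM; all function names here are hypothetical, not the paper's actual implementation.

```python
def retrieve_facts(question, fact_store, top_k=2):
    """Toy lexical-overlap retriever standing in for ColBERTv2."""
    q_tokens = set(question.lower().split())
    return sorted(
        fact_store,
        key=lambda fact: len(q_tokens & set(fact.lower().split())),
        reverse=True,
    )[:top_k]

def generate_explanation(question, facts):
    """Stand-in for the LLM explanation step: stitches retrieved facts together."""
    return f"Relevant knowledge: {' '.join(facts)}"

def build_svlm_prompt(question, explanation):
    """Augment the sVLM's text input with the generated explanation."""
    return f"{explanation}\nQuestion: {question}\nAnswer:"

fact_store = [
    "Umbrellas are used to stay dry in the rain.",
    "Bicycles have two wheels.",
    "Rain falls from clouds.",
]
question = "Why is the person holding an umbrella in the rain?"
facts = retrieve_facts(question, fact_store)
prompt = build_svlm_prompt(question, generate_explanation(question, facts))
print(prompt)
```

The key design point is that the sVLM itself is untouched at retrieval time; the external knowledge enters only through the augmented text input.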

Takeaways, Limitations

Takeaways:
Integrating LLM-generated commonsense knowledge can improve the commonsense VQA performance of small vision-language models.
Fine-tuning with a noise-robust loss function is effective for improving the performance of small models.
Parameter-efficient commonsense reasoning is possible even in models with 250 million parameters.
Integrating commonsense knowledge via LLMs can be more effective than retrieving from commonsense knowledge bases.
Limitations:
The datasets used contain label noise, which requires further analysis.
The generalization performance of the proposed method needs further verification.
Experiments on a wider range of sVLMs are needed, along with analysis of how performance varies with model characteristics.
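The summary mentions a noise-robust loss but does not name it. As one common choice for learning under label noise, a sketch of generalized cross-entropy (GCE) is shown below, purely for illustration; the paper's actual loss may differ.

```python
import numpy as np

def generalized_cross_entropy(probs, labels, q=0.7):
    """GCE loss: mean of (1 - p_y^q) / q over the batch.

    Interpolates between standard cross-entropy (q -> 0) and the
    noise-robust mean absolute error (q = 1).

    probs:  (N, C) array of predicted class probabilities
    labels: (N,) array of integer class labels
    """
    p_y = probs[np.arange(len(labels)), labels]  # probability of the true class
    return np.mean((1.0 - p_y ** q) / q)

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])
loss = generalized_cross_entropy(probs, labels, q=0.7)
```

Because a mislabeled example with low predicted probability contributes a bounded loss as q grows, GCE down-weights noisy labels relative to cross-entropy, which matches the paper's motivation for noise-aware fine-tuning.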