This paper presents NLKI, an end-to-end framework for improving the performance of small vision-language models (sVLMs) on commonsense visual question answering (VQA). NLKI retrieves natural-language facts, generates natural-language explanations with an LLM, and feeds both signals to the sVLM. By leveraging ColBERTv2 and entity-rich prompts for fact retrieval, the generated explanations reduce hallucinations and improve accuracy by up to 7%. Further fine-tuning with a noise-robust loss function yields an additional 2.5% accuracy on the CRIC dataset and 5.5% on AOKVQA, bringing sVLMs such as FLAVA to the level of mid-sized VLMs such as Qwen2-VL-2B and SmolVLM-2.5B. The study shows that LLM-sourced commonsense knowledge is more effective than retrieval from commonsense knowledge bases, that noise-aware training stabilizes small models under external-knowledge augmentation, and that parameter-efficient commonsense reasoning is achievable even at the 250-million-parameter scale.
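The retrieve-explain-feed pipeline described above can be sketched as follows. This is a minimal illustrative mock, not the paper's implementation: the function names, the toy in-memory fact corpus, and the `[EXPL]` separator token are all hypothetical stand-ins for the actual ColBERTv2 retriever, LLM explainer, and sVLM input format.

```python
# Hypothetical sketch of the NLKI pipeline; all names and the toy
# corpus are illustrative assumptions, not the paper's API.

def retrieve_facts(question, image_entities, k=3):
    # Stand-in for ColBERTv2 retrieval over a commonsense fact corpus,
    # queried with an entity-rich prompt built from the question and
    # entities detected in the image.
    corpus = {
        "umbrella": "An umbrella is used to stay dry in the rain.",
        "rain": "Rain makes outdoor surfaces wet.",
    }
    return [corpus[e] for e in image_entities if e in corpus][:k]

def generate_explanation(question, facts):
    # Stand-in for an LLM that writes a natural-language explanation
    # conditioned on the question and the retrieved facts.
    return " ".join(facts) + f" This helps answer: {question}"

def nlki_input(question, image_entities):
    # The sVLM receives the question augmented with the explanation;
    # "[EXPL]" is an assumed separator token.
    facts = retrieve_facts(question, image_entities)
    explanation = generate_explanation(question, facts)
    return f"{question} [EXPL] {explanation}"
```

In this sketch the sVLM never queries the knowledge base directly; it only sees the question plus the LLM-written explanation, which mirrors the abstract's claim that LLM-generated explanations outperform raw knowledge-base retrieval.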