This paper systematically develops image captioning models at the intersection of computer vision and natural language processing. We present five models, from Genesis to Nexus, progressing from a simple CNN-LSTM encoder-decoder to a final architecture equipped with an efficient attention mechanism, and we experimentally analyze the performance impact of each architectural change. In particular, we show that merely upgrading the visual backbone of a CNN-LSTM architecture can degrade performance, underscoring the importance of the attention mechanism. The final model, Nexus, trained on the MS COCO 2017 dataset, achieves a BLEU-4 score of 31.4, outperforming several baseline models and validating the effectiveness of the iterative design process. This work provides a clear, replicable blueprint for understanding the core architectural principles underlying modern vision-language tasks.