This document proposes VisionAid, an image captioning model intended to address several shortcomings of existing approaches. It conducts a literature review of transformer-based image captioning methods to identify candidate techniques. VisionAid combines grid-level feature extraction, training-data augmentation with BERT embeddings to increase caption diversity, and a mix of normalized self-attention and geometric self-attention that models object relationships while mitigating internal covariate shift within the attention layers. By drawing on the techniques surveyed in the review, the model aims to generate captions that are both more accurate and more diverse.
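To make the attention design concrete, here is a minimal NumPy sketch of a single head that combines the two mechanisms named above: queries are normalized across the token axis (the normalized self-attention idea, which stabilizes the score distribution and counters internal covariate shift), and a position-derived bias is added to the content scores (the geometric self-attention idea). The function name, the log-distance bias, and all weight shapes are illustrative assumptions, not the document's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def normalized_geometric_attention(X, pos, Wq, Wk, Wv, w_g=1.0):
    """One attention head over grid features (illustrative sketch).

    X:   (n, d) grid-level features, one row per grid cell
    pos: (n, 2) grid-cell center coordinates
    Wq, Wk, Wv: (d, d) projection matrices
    w_g: scalar weight on the geometric bias (hypothetical parameter)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Normalized self-attention: normalize queries per channel across
    # tokens so the score distribution stays stable during training.
    Q = (Q - Q.mean(axis=0)) / (Q.std(axis=0) + 1e-6)
    d = Q.shape[-1]
    content = Q @ K.T / np.sqrt(d)  # (n, n) content-based scores
    # Geometric self-attention: bias scores by relative grid distance,
    # a simple stand-in for a learned relative-geometry term.
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    geom = -w_g * np.log1p(dist)    # nearer cells receive larger bias
    A = softmax(content + geom, axis=-1)
    return A @ V                    # (n, d) attended features
```

In a full model this head would be replicated across multiple heads and layers, with the geometric weighting learned rather than fixed; the sketch only shows how the content scores and the geometry bias combine before the softmax.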