This paper presents Movinet, a framework that integrates vision transformers with natural language processing models for cross-modal scene understanding, with the goal of improving spatial awareness in robotics and autonomous navigation. The framework is evaluated on scene classification, object detection, and semantic segmentation, where it outperforms traditional unimodal baselines. These findings highlight the potential of combining visual and textual information to advance machine understanding of real-world environments.
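As a rough illustration of the kind of vision-plus-language fusion the abstract describes, the sketch below projects a visual embedding and a text embedding into a shared space and classifies the fused representation. Everything in it (module names, embedding dimensions, and the concatenation-based late fusion) is a hypothetical stand-in; the abstract does not specify Movinet's actual architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy cross-modal scene classifier: visual and text embeddings are
    projected into a shared space, concatenated, and classified.
    All dimensions below are illustrative assumptions."""
    def __init__(self, vis_dim=768, txt_dim=512, shared_dim=256, num_classes=10):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, shared_dim)  # project image-encoder features
        self.txt_proj = nn.Linear(txt_dim, shared_dim)  # project language-model features
        self.classifier = nn.Sequential(
            nn.LayerNorm(2 * shared_dim),
            nn.Linear(2 * shared_dim, num_classes),     # scene-classification head
        )

    def forward(self, vis_feat, txt_feat):
        v = self.vis_proj(vis_feat)
        t = self.txt_proj(txt_feat)
        # Late fusion: concatenate the two modalities, then classify.
        return self.classifier(torch.cat([v, t], dim=-1))

# Random stand-in features; in practice these would come from a pretrained
# vision transformer and a pretrained language model.
model = CrossModalFusion()
vis = torch.randn(4, 768)   # batch of 4 image embeddings
txt = torch.randn(4, 512)   # batch of 4 caption embeddings
logits = model(vis, txt)    # shape: (4, 10)
print(logits.shape)
```

A concatenation head like this is only one of several common fusion strategies; cross-attention between the two encoders is an equally plausible design for the tasks listed above.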