This paper presents a Discriminative Latent Semantic Graph (D-LSG) framework for generating natural language captions that summarize the visual content of long videos. The model combines a conditional graph that enhances object proposals, a dynamic graph that aggregates the enhanced proposals into high-level visual concepts, and a validation module that checks whether the generated captions accurately reflect the original visual content. Together, these components address two central challenges in video captioning: modeling complex interactions among objects and distilling high-level visual concepts from spatio-temporal data.
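To make the aggregation step concrete, the sketch below shows one plausible way a dynamic graph could pool a variable number of object-proposal features into a fixed set of latent semantic nodes via attention, where edge weights between latent nodes and proposals are computed per input. This is a minimal illustration under stated assumptions, not the paper's exact design: the names `LatentProposalAggregation`, `num_latent`, and `d_model`, as well as the specific attention formulation, are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentProposalAggregation(nn.Module):
    """Aggregate per-clip object-proposal features into a small set of
    latent semantic nodes via attention (a dynamic, input-dependent graph).

    Illustrative sketch only; module and parameter names are assumptions,
    not the D-LSG paper's actual implementation.
    """

    def __init__(self, d_model: int = 512, num_latent: int = 8):
        super().__init__()
        # Learnable latent nodes; their edges to proposals are recomputed
        # for every input, which is what makes the graph "dynamic".
        self.latent = nn.Parameter(torch.randn(num_latent, d_model))
        self.to_key = nn.Linear(d_model, d_model)
        self.to_value = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, proposals: torch.Tensor) -> torch.Tensor:
        # proposals: (batch, num_proposals, d_model) visual features.
        keys = self.to_key(proposals)       # (B, N, D)
        values = self.to_value(proposals)   # (B, N, D)
        # Edge weights between each latent node (K) and each proposal (N).
        attn = torch.einsum('kd,bnd->bkn', self.latent, keys) * self.scale
        attn = F.softmax(attn, dim=-1)      # (B, K, N)
        # Each latent node becomes a weighted sum of proposal features.
        return torch.einsum('bkn,bnd->bkd', attn, values)  # (B, K, D)

if __name__ == "__main__":
    model = LatentProposalAggregation(d_model=512, num_latent=8)
    feats = torch.randn(2, 36, 512)   # 2 clips, 36 proposals each
    print(model(feats).shape)         # torch.Size([2, 8, 512])
```

One design point worth noting: because the latent nodes are a fixed-size set, the downstream caption decoder receives a constant-shape summary regardless of how many object proposals each video produces, which is one common motivation for this style of aggregation.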