This document proposes enhancing video summarization with a vision-language embedding. A two-branch neural network is trained to map visual and textual features into a joint embedding space, so that representativeness and interestingness objectives for summarization can be computed in the shared semantic space. The approach is evaluated on egocentric and TV-episode datasets, where it improves summarization accuracy over vision-only baselines as measured by ROUGE recall and F-measure. Text descriptions can also be supplied to guide the summarization process.
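
To make the two-branch idea concrete, below is a minimal PyTorch sketch of how such a joint embedding and an embedding-space representativeness score might look. The layer sizes, the bidirectional margin ranking loss, and the centroid-similarity scoring are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedding(nn.Module):
    """Illustrative two-branch network: separate MLPs map visual and
    textual features into a shared, L2-normalized embedding space.
    Dimensions are assumptions, not the paper's actual configuration."""
    def __init__(self, visual_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.visual_branch = nn.Sequential(
            nn.Linear(visual_dim, 1024), nn.ReLU(), nn.Linear(1024, embed_dim))
        self.text_branch = nn.Sequential(
            nn.Linear(text_dim, 1024), nn.ReLU(), nn.Linear(1024, embed_dim))

    def embed_visual(self, v):
        return F.normalize(self.visual_branch(v), dim=-1)

    def embed_text(self, t):
        return F.normalize(self.text_branch(t), dim=-1)


def ranking_loss(v_emb, t_emb, margin=0.2):
    """Bidirectional margin ranking loss over a batch of matched
    (visual, text) pairs; off-diagonal pairs serve as negatives.
    One common choice for training joint embeddings (an assumption here)."""
    scores = v_emb @ t_emb.t()                       # cosine similarities
    pos = scores.diag().unsqueeze(1)                 # matched-pair scores
    cost_t = (margin + scores - pos).clamp(min=0)    # text negatives per video clip
    cost_v = (margin + scores - pos.t()).clamp(min=0)  # visual negatives per sentence
    eye = torch.eye(scores.size(0), dtype=torch.bool)
    cost_t[eye] = 0
    cost_v[eye] = 0
    return cost_t.mean() + cost_v.mean()


def representativeness(segment_emb):
    """Score each video segment by its similarity to the mean embedding
    of all segments in the joint space -- one simple proxy for the
    representativeness objective computed in the shared semantic space."""
    centroid = F.normalize(segment_emb.mean(dim=0, keepdim=True), dim=-1)
    return (segment_emb @ centroid.t()).squeeze(1)
```

In this sketch, summary selection would pick the segments with the highest representativeness scores; guidance from a user-provided text description could be approximated by scoring segments against the embedded description instead of the video centroid.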