The document covers techniques in audio-visual learning, focusing on self-supervised methods that combine visual features from video with audio representations. It discusses specific research contributions, including the use of ambient sound for visual learning and convolutional neural networks (CNNs) for sound representation. It also highlights several studies on cross-modal retrieval and speech synthesis from silent video.