This document presents a talk on deep learning for multimedia given by Xavier Giro-i-Nieto at an event in Barcelona. The talk focuses on integrating speech, vision, and text across applications, and it surveys the encoding and decoding techniques used in tasks such as image classification, speech recognition, and translation, citing significant research papers in the field. It emphasizes advances in cross-modal learning and representation that improve multimedia tasks.