From the course: Introduction to Multimodal Prompting for Generative AI

What is multimodality?

- Before we start leveraging multimodality, it's a good idea to take a moment to understand what a modality is in AI. We can think of a modality as the way something exists or is represented as far as an AI model is concerned.

Let's take something such as an apple. Here we have the word "apple," which is a text representation of an apple. We can also have a drawing or an image as a visual representation of an apple, or a crunch sound as an audio representation of that apple.

Now, what's similar to modalities when we think about the way we perceive the world? If you thought about senses, that's a comparison that is often made. Common modalities include text, images, audio, and video.

Multimodality refers to systems and models that can take input or produce output in more than one modality. For example, an image generator can take an image and a text description and modify the image accordingly. A model can also take an image and a text question and output a text answer.

In this course, we'll also look at systems such as music generators. They take in text, which is one modality, and produce audio, which is a different modality. While such a system works across different modalities, it's debatable whether it is an example of multimodality, because the input is a single modality and the output is a single modality.
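To make the distinction in the examples above concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Modality` enum, the `System` class, and the two method names are not from the course or from any library. The two checks are one possible way to phrase the debate: a strict reading that requires more than one modality on the input side or on the output side, and a looser reading that only requires the system to touch more than one modality overall.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    """The modalities named in the transcript (illustrative, not exhaustive)."""
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass(frozen=True)
class System:
    """Hypothetical description of a generative system by its modalities."""
    name: str
    inputs: frozenset
    outputs: frozenset

    def mixes_modalities_on_one_side(self) -> bool:
        # Strict reading: more than one modality in the input or in the output.
        return len(self.inputs) > 1 or len(self.outputs) > 1

    def crosses_modalities(self) -> bool:
        # Looser reading: more than one modality across input and output combined.
        return len(self.inputs | self.outputs) > 1


# The three examples from the transcript, encoded under these assumptions.
image_editor = System(
    "image editor",
    inputs=frozenset({Modality.IMAGE, Modality.TEXT}),
    outputs=frozenset({Modality.IMAGE}),
)
visual_qa = System(
    "visual question answering",
    inputs=frozenset({Modality.IMAGE, Modality.TEXT}),
    outputs=frozenset({Modality.TEXT}),
)
music_generator = System(
    "music generator",
    inputs=frozenset({Modality.TEXT}),
    outputs=frozenset({Modality.AUDIO}),
)

for system in (image_editor, visual_qa, music_generator):
    print(
        system.name,
        "| strict:", system.mixes_modalities_on_one_side(),
        "| loose:", system.crosses_modalities(),
    )
```

Under these assumptions, the image editor and the question-answering model pass both checks, while the music generator passes only the looser one, which is the case the transcript describes as debatable.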
