Tokenization of images: How vision transformers (ViTs) see - PyTorch Tutorial
- [Instructor] Hey, everyone, welcome to chapter four, video two: tokenization of images and how ViTs see. We're finally going to make the jump from language to vision in transformers and understand exactly what tweaks we need to make to the architecture we just learned so that it works for computer vision. It's important to understand that when the paper "Attention Is All You Need" came out in 2017, it was entirely about language. Now, let's not diminish the work of the authors; clearly, they thought the approach could spread to other modalities, but the field just wasn't ready for that kind of jump yet. That didn't stop researchers from drilling down, right after the paper came out, on exactly how to take all the awesomeness of attention and self-attention mechanisms and translate it to other modalities. So here's a fun little guess: if you've been following along, this is a good moment to test yourself. What about the transformer…
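As a preview of where this video is headed, here is a minimal PyTorch sketch of the core tweak ViTs make: cutting an image into fixed-size patches and linearly projecting each one into a token embedding, the image analogue of word tokens. It assumes the standard ViT-Base configuration (224x224 images, 16x16 patches, 768-dimensional embeddings), and the class name PatchEmbedding is just illustrative; it is not the course's own code.

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split an image into fixed-size patches and linearly project
        each patch into an embedding vector (one 'token' per patch)."""

        def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2
            # A conv whose kernel and stride both equal the patch size is
            # equivalent to slicing the image into non-overlapping patches
            # and applying one shared linear projection to each patch.
            self.proj = nn.Conv2d(in_channels, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):
            # x: (batch, channels, height, width)
            x = self.proj(x)                  # (batch, embed_dim, H/ps, W/ps)
            x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
            return x

    tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
    print(tokens.shape)  # torch.Size([1, 196, 768]) -- 196 patch tokens

With these assumed values, a 224x224 image becomes a sequence of 196 patch tokens of dimension 768, which can then be fed into the same transformer encoder we covered for language.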