Tokenization of images: How vision transformers (ViTs) see - PyTorch Tutorial
- [Instructor] Hey, everyone, welcome to chapter four, video two: tokenization of images and how ViTs see. We're finally going to make the jump from language to vision in transformers and understand exactly what tweaks we need to make to the architecture we just learned so that it works for computer vision. It's important to understand that when the paper "Attention Is All You Need" came out in 2017, it was entirely about language. Now, let's not diminish the work of the authors; clearly, they thought the approach could spread to other modalities, but the field just wasn't ready for that kind of jump yet. That didn't stop researchers from drilling down, right after the paper came out, on exactly how to take all the awesomeness of attention and self-attention mechanisms and translate it to other modalities. So here's a fun little guess: if you've been following along, this is a good moment to test yourself. What about the transformer…
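As a preview of where this video is headed, here is a minimal PyTorch sketch of the core tweak ViTs make: cutting an image into fixed-size patches and linearly projecting each one into a token embedding, the image analogue of word tokens. It assumes the standard ViT-Base configuration (224x224 images, 16x16 patches, 768-dimensional embeddings), and the class name PatchEmbedding is just illustrative; it is not the course's own code.

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split an image into fixed-size patches and linearly project
        each patch into an embedding vector (one 'token' per patch)."""

        def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2
            # A conv whose kernel and stride both equal the patch size is
            # equivalent to slicing the image into non-overlapping patches
            # and applying one shared linear projection to each patch.
            self.proj = nn.Conv2d(in_channels, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):
            # x: (batch, channels, height, width)
            x = self.proj(x)                  # (batch, embed_dim, H/ps, W/ps)
            x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
            return x

    tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
    print(tokens.shape)  # torch.Size([1, 196, 768]) -- 196 patch tokens

With these assumed values, a 224x224 image becomes a sequence of 196 patch tokens of dimension 768, which can then be fed into the same transformer encoder we covered for language.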