In this video, we discuss the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” which introduced the Vision Transformer (ViT) architecture. We start with the motivation behind this remarkable architecture, continue with early work on transformers for computer vision, dive deep into the intricate details of the ViT architecture, unpack ViT training and fine-tuning methodologies, and highlight significant developments from recent follow-up papers.
Enjoy the video? Show your support with a Like, and don't forget to Subscribe for more insightful discussions. Any feedback, questions, or innovative ideas are always welcome in the comment section below!
Slides: [ Link ]
Personal links:
- Twitter: [ Link ]
- LinkedIn: [ Link ]
- GitHub: [ Link ]
- Deep Learning Revision Newsletter:
[ Link ]
- Personal website: [ Link ]
- Complete Machine Learning Package: [ Link ]
Some links from the video:
- ViT paper: [ Link ]
- Big vision repo: [ Link ]
- ViT PyTorch: [ Link ]
- Yann LeCun tweet on ViT vs. CNNs: [ Link ]
#deeplearning #ai #computervision #transformers