Description
We're going to read and explain ViT (Vision Transformer) from the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale"! Video where I implement and train the ViT from scratch below ↓
Want to support the channel? Hit that like button and subscribe!
Implement and Train ViT From Scratch
[ Ссылка ]
ViT (Vision Transformer) is introduced in the paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
[ Ссылка ]
What should I read next? Let me know in the comments!
00:00 Introduction
01:10 ViT Explanation and Intuition
13:20 Paper Overview
20:20 Outro
![](https://i.ytimg.com/vi/8phM16htKbU/maxresdefault.jpg)