This video covers self-attention in the Vision Transformer (ViT) and its implementation from scratch.
I go over all the details and explain everything happening inside attention in the Vision Transformer through visualizations, and also walk through a from-scratch implementation of self-attention in PyTorch.
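As a companion to the video, here is a minimal sketch of what such a from-scratch multi-head self-attention module might look like in PyTorch. This is an illustrative implementation of standard scaled dot-product attention over patch tokens (the class name and layer layout are my own choices, not necessarily those used in the video); it also shows the combined Wq, Wk, Wv projection discussed in the timestamps.

```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Minimal multi-head self-attention over ViT patch tokens (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0, "embedding dim must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # One linear layer producing Q, K, V together (the combined Wq, Wk, Wv matrix).
        self.qkv = nn.Linear(dim, 3 * dim)
        # Output projection applied after concatenating the heads.
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape  # batch, number of patches, embedding dim
        # Split into Q, K, V, each shaped (B, num_heads, N, head_dim).
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(head_dim)) V.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        # Merge heads back into the embedding dimension and project.
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

For a ViT-Base-style setup (196 patch tokens, 768-dim embeddings, 12 heads), `SelfAttention(768, 12)(torch.randn(1, 196, 768))` returns a tensor of the same `(1, 196, 768)` shape, since attention builds a context representation per patch without changing the token count.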
I cover the Vision Transformer (ViT) in three parts:
1. Patch Embedding in Vision Transformer ViT - [ Link ]
2. Self Attention in Vision Transformer ViT - This video
3. Building Vision Transformer and Visualizations - [ Link ]
*Paper Link* - [ Link ]
*Implementation* - [ Link ]
*Other Good Resources*
Yannic Kilcher | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained) - [ Link ]
AI Coffee Break with Letitia | An image is worth 16x16 words: ViT | Vision Transformer explained - [ Link ]
James Briggs | Vision Transformers (ViT) Explained + Fine-tuning in Python - [ Link ]
Good place to understand the general transformer further - [ Link ]
*TimeStamps* :
00:00 Intro
00:33 Intuition of What Attention Is & Why It's Helpful
03:23 Inside Attention - What Is Relevant
07:53 Inside Attention - Building Context Representation
08:45 Building Context Representation For All Patches
09:45 Why Multi Head Attention
11:15 Building Context Representation For Multi Head Attention
12:35 Combining Wq, Wk, Wv Matrices
13:34 Shapes of Every Matrix in Attention
14:48 Implementation Parts of Attention
15:12 PyTorch Implementation of Attention in Vision Transformer ViT
18:26 Outro
*Subscribe to Channel* - [ Link ]
Background Track - Fruits of Life by Jimena Contreras
Email - [explainingai.official@gmail.com](mailto:explainingai.official@gmail.com)