What is masked language modelling? Or next sentence prediction? And why do they work so well? If you have ever wondered which tasks Transformer architectures are trained on, and how the Multimodal Transformer learns the connection between images and text, then this is the right video for you!
➡️ AI Coffee Break Merch! 🛍️ [ Link ]
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, buy us a coffee to boost our Coffee Bean production! ☕
Patreon: [ Link ]
Ko-fi: [ Link ]
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🎬 Ms. Coffee Bean explained the Multimodal Transformer: [ Link ]
🎬 She also explained the language-based Transformer: [ Link ]
Content:
* 00:00 Pre-training strategies
* 00:48 Masked language modelling
* 03:37 Next sentence prediction
* 04:31 Sentence image alignment
* 05:07 Image region classification
* 06:14 Image region regression
* 06:53 Pre-training and fine-tuning on the downstream task
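To make the first chapter concrete: masked language modelling hides a fraction of the input tokens (BERT masks about 15%) and trains the model to predict the originals. A minimal sketch of the masking step, with illustrative names (real implementations like BERT's also sometimes substitute random tokens or leave the selected token unchanged):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Replace ~mask_prob of the tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked
    position to its original token (the prediction targets).
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok          # the model must recover this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

# Example: mask a short sentence
tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
```

The loss is then computed only at the masked positions, comparing the model's predictions against `targets`.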
📄 This video has been enabled by the beautiful overview table in the Appendix of this paper:
VL-BERT: Su, Weijie, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. "VL-BERT: Pre-training of generic visual-linguistic representations." arXiv preprint arXiv:1908.08530 (2019). [ Link ]
🔗 Links:
YouTube: [ Link ]
Twitter: [ Link ]
Reddit: [ Link ]
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research #BERT
Video and thumbnail contain emojis designed by OpenMoji – the open-source emoji and icon project. License: CC BY-SA 4.0