A walkthrough of Anthropic's paper, Toy Models of Superposition. I'm joined by Jess Smith; we read through the paper together, share intuitions, and discuss. See the paper here: [ Link ]
And an explainer I wrote on some of the ideas in the paper: [ Link ]
This walkthrough mostly focuses on high-level ideas and themes; let me know if you'd like a part 2 that goes through the rest of the paper in detail!
If you enjoyed this video, you might enjoy learning more about reverse engineering language models! Check out Concrete Steps to Get Started in Transformer Mechanistic Interpretability: [ Link ]
(We had some audio and connection issues while recording; sorry for any disruptions to the viewing experience!)
00:00 Toy Models of Superposition
03:53 Overview
10:20 Polysemanticity vs Superposition
24:05 Feature Importance
34:32 Large Dimensional Spaces
37:41 Interference vs Internal Representation
42:02 Superposition in Toy Models vs. Real Transformers
50:46 Activation Functions and Interference
59:25 Internal Representations of Features
01:22:54 Definitions of Features
01:42:47 Sparsity Diagram
01:44:16 Simulating Bigger Models
01:48:23 A Hierarchy of Feature Properties
01:51:42 Experimental Setup
01:59:18 Experimental Results
02:18:19 A Mathematical Analysis
02:25:22 Final Takeaways