Transformer Notes

Transformer visualizations:

https://bbycroft.net/llm
https://poloclub.github.io/transformer-explainer/
Autograd / miniGPT by hand (blog roadmap)
CS336 @ Stanford
Transformers from scratch - sensibly written, broken down by topic
Provable optimal transport with transformers - optimal transport - math for finding efficiently probability distribution 1 -> prob distrib 2
- Wasserstein distance - between two probability distribs
- Sinkhorn algorithm - iterative, converges towards a shortest
Transformers learn in-context from gradient descent (ETH Zurich)
- in-context learning - e.g. english followed by french, a “circuit” appears in the “neurons” called an “induction head”
- naively this might be, in the output, copying the english words - so the circuit is doing the copy, and then it passes the same tokenization of the english through the french neurons
- and this, but scaled up to another level of abstraction, is what is generally referred to as in-context learning (src)
- assertion of the ETH Zurich paper is that gradient descent happens in the forward pass
Unelicitable backdoors in Language Models via Cryptographic Transformer Circuits - feb2025, wrote a language to compile sha256 into a pytorch transformer, that only triggers when a particular input is present
[Trail of Bits’ Anamorpher] - Hiding text that only appears when an image is downscaled - text then interpreted by the LLM