I got curious how Stable Diffusion works.

Here are some links:

  • https://github.com/AUTOMATIC1111/stable-diffusion-webui
  • https://github.com/Stability-AI/stablediffusion?tab=readme-ov-file
  • https://github.com/facebookresearch/xformers?tab=readme-ov-file
  • https://bbycroft.net/llm
  • https://www.rand.org/pubs/research_reports/RRA2849-1.html

Diffusion models, in short:

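  • PNG -> some “latent space”
  • Compressed with a variational autoencoder (VAE): an encoder / decoder pair
  • Add Gaussian noise to the latent
  • Pass the noisy latent repr of the image into a U-Net
  • The U-Net predicts the noise, which gets subtracted out (repeated over many steps)
  • Pass the denoised latent through the VAE decoder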

PNG -> Latent -> Noise -> U-Net -> Decoder -> New PNG.
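
A rough sketch of that pipeline in NumPy, to make the steps concrete. Everything here is a stand-in and an assumption on my part: `encode` and `decode` are toy pooling/upsampling functions instead of a trained VAE, the linear noise schedule uses DDPM-style values, and the "U-Net" is faked by handing back the true noise. It is not how Stable Diffusion is implemented; it only shows how the noising and denoising arithmetic fits together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a real system uses a trained VAE here.
def encode(image):
    """Toy 'encoder': 8x8 average pooling in place of the VAE encoder."""
    h, w, c = image.shape
    return image.reshape(h // 8, 8, w // 8, 8, c).mean(axis=(1, 3))

def decode(latent):
    """Toy 'decoder': nearest-neighbour upsampling in place of the VAE decoder."""
    return latent.repeat(8, axis=0).repeat(8, axis=1)

# Linear noise schedule (assumed DDPM-style values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(z0, t, eps):
    """Forward process: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def estimate_clean(z_t, t, eps_pred):
    """Invert the forward process using a prediction of the noise."""
    return (z_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

image = rng.random((64, 64, 3))       # pretend this is the decoded PNG
z0 = encode(image)                    # latent, shape (8, 8, 3)
eps = rng.standard_normal(z0.shape)   # Gaussian noise
t = 500
z_t = add_noise(z0, t, eps)           # noisy latent

# A trained U-Net would predict eps from (z_t, t); here we cheat with the true noise.
eps_pred = eps
z0_hat = estimate_clean(z_t, t, eps_pred)
new_image = decode(z0_hat)            # back to pixel space
```

With a real U-Net the noise prediction is imperfect, so the denoising is done gradually over many timesteps rather than in one jump like this.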