Explainer of Diffusion LLMs from Andrej Karpathy: “Most of the LLMs you’ve been seeing are ~clones as far as the core modeling approach goes. They’re all trained “autoregressively”, i.e. predicting tokens from left to right. Diffusion is different - it doesn’t go left to right, but all at once. You start with noise and gradually denoise into a token stream.”
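The contrast Karpathy describes can be sketched with a toy example. The "model" below is a hypothetical stand-in (a fixed lookup table playing the role of a neural network's predictions); the point is only the shape of the two loops: autoregressive decoding commits one token at a time, left to right, while diffusion-style decoding starts from a fully noised (all-mask) sequence and fills in positions in parallel over a few denoising steps.

```python
import random

# Hypothetical toy "model": a fixed target sequence stands in for
# whatever a real network would predict. MASK plays the role of noise.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]
MASK = "<mask>"

def autoregressive_decode(length):
    """Left to right: each token is predicted given all earlier ones."""
    out = []
    for i in range(length):
        out.append(TARGET[i])  # stand-in for model(out) -> next token
    return out

def diffusion_decode(length, steps=3, seed=0):
    """Diffusion-style: begin fully noised (all masks); at each step the
    model predicts every position at once, and a growing subset of
    positions is committed until nothing is masked."""
    rng = random.Random(seed)
    seq = [MASK] * length
    for step in range(1, steps + 1):
        preds = list(TARGET)  # stand-in for model(seq) over all positions
        # target number of committed tokens after this step
        k = round(length * step / steps)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        to_unmask = min(k - (length - len(masked)), len(masked))
        for i in rng.sample(masked, to_unmask):
            seq[i] = preds[i]
    return seq

print(autoregressive_decode(6))
print(diffusion_decode(6))
```

Both loops end at the same sequence here; the difference is the order of commitment, which is what lets diffusion decoders refine many tokens per step instead of one.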
The premise is sort of hilarious. “Everybody’s just blindly copying this one kind of network. We made the bold decision to copy the other one.”