Imagine trying to reconstruct a sand castle from a pile of jumbled sand. The only way to do it is to have watched thousands of sand castles being knocked down — so you know what the 'unscrambling' should look like. How does this connect to how an image generator might work?
Rote answer
"Diffusion models learn to remove noise from images."
Understood
"The model watches millions of images being gradually scrambled into noise — like a sand castle being kicked apart in slow motion — and learns to reverse each step. To generate a new image, it starts with a pile of random noise (blank sand) and applies that learned 'unscrambling' over and over until something meaningful appears."