A deep dive into AI models that turn random static into amazingly realistic art.
Introduction
Imagine an image that starts as a cloud of random static, like the fuzz on an old TV, and gradually transforms into a realistic picture of a cat, a cityscape, or a whole imagined world.
This is what diffusion models do.
This is the magic behind AI art tools like DALL·E, Midjourney, and Stable Diffusion, which can turn a simple text prompt into incredibly realistic images.
But how does it all work? Let’s break down the science.
Explore this interactive explainer: https://poloclub.github.io/difusion-explainer/
What exactly is a diffusion model?
A diffusion model is a type of generative AI, meaning it can generate new data (images, sounds, and more) rather than simply analyzing existing data.
The idea is inspired by diffusion in physics, the natural process by which particles spread out, like a drop of ink dispersing in water.
In AI, we apply the same idea, but we use it backwards:
The model corrupts an image by adding random noise, step by step.
The model then learns to reverse the process, cleaning up the noise bit by bit until the image emerges again.
Two important steps
Diffusion models involve two key processes: the forward process and the reverse process.
1. Forward process – adding noise
During training, the model is given a real image, and a small amount of random noise is added at each step. This repeats hundreds or thousands of times until the image dissolves into pure static, completely unrecognizable. Through this, the AI learns how images “break down” step by step.
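To make this concrete, here is a minimal PyTorch sketch of the forward process, assuming a simple linear noise schedule (the numbers, shapes, and function names are illustrative, not any particular model's settings):

```python
import torch

# Hypothetical linear noise schedule: beta_t controls how much extra noise is
# mixed in at each of T timesteps (the exact values here are illustrative).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative "how much signal is left" factor

def add_noise(x0, t):
    """Jump straight to timestep t: blend the clean image x0 with fresh random noise."""
    noise = torch.randn_like(x0)  # random static
    noisy = torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * noise
    return noisy, noise

# Example: a fake 3x64x64 "image" noised to timestep 500 (about halfway to pure static).
x0 = torch.rand(3, 64, 64)
x_t, true_noise = add_noise(x0, t=500)
```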
2. Reverse process – noise removal
Having learned how images break down, the model now learns to reverse the process: step by step, it removes noise until the original image is recovered.
During training, the AI’s goal is to predict exactly what noise was added.
If it predicts the noise correctly, it can subtract that noise, in effect “cleaning” the image.
After enough training, the model becomes skilled at removing noise, which means it can start from pure random noise and create something entirely new.
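Conceptually, a single reverse step might look like the sketch below. The `predict_noise` function is just a placeholder for the trained network, and the update rule is a simplified DDPM-style step rather than a full production sampler:

```python
import torch

def predict_noise(x_t, t):
    """Placeholder for the trained network: it would return its estimate of the noise in x_t."""
    return torch.zeros_like(x_t)

def denoise_step(x_t, t, betas, alpha_bars):
    """One simplified reverse step: subtract the (scaled) predicted noise, then rescale."""
    eps_hat = predict_noise(x_t, t)
    alpha_t = 1.0 - betas[t]
    x_prev = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps_hat) / torch.sqrt(alpha_t)
    return x_prev

# Example: one step on pure static, using an illustrative schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)
x = denoise_step(torch.randn(3, 64, 64), t=T - 1, betas=betas, alpha_bars=alpha_bars)
```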
Model training
Dataset: Training uses a dataset of millions of real images.
Noise Schedule: Specifies how much noise is added at each time step.
Loss function: The model is scored on how accurately it predicts the noise that was added.
Architecture: Most models use a U-Net, which captures both fine detail and global image structure.
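Here is a hedged sketch of how those four ingredients meet in a single training step. The `TinyDenoiser` below is a stand-in for a real U-Net (a real denoising network is far larger and also embeds the timestep), and the schedule values are illustrative:

```python
import torch
import torch.nn as nn

# Stand-in for a real U-Net: any network that maps a noisy image (plus a timestep)
# to a noise prediction of the same shape could slot in here.
class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x_t, t):
        return self.net(x_t)  # ignores t for brevity; real models embed it

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # the noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

model = TinyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(clean_images):
    """One optimization step: the model must guess the noise we just added."""
    b = clean_images.shape[0]
    t = torch.randint(0, T, (b,))                      # random timestep per image
    noise = torch.randn_like(clean_images)
    a = alpha_bars[t].view(b, 1, 1, 1)
    noisy = torch.sqrt(a) * clean_images + torch.sqrt(1 - a) * noise
    predicted = model(noisy, t)
    loss = nn.functional.mse_loss(predicted, noise)    # the loss function: noise-prediction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with a fake batch of 4 RGB images standing in for the dataset.
print(training_step(torch.rand(4, 3, 64, 64)))
```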
From Noise to Art: How Images Are Made
1. The initial input to the model is pure noise (random pixels).
2. The model then denoises this image over many steps, perhaps 50, 100, or even 1,000.
3. Each time the model denoises the image, it becomes a little clearer.
4. After all the denoising steps are done, you have a realistic, detailed image.
The starting noise is different every time, so each final image is unique.
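Putting those steps together, generation is essentially a loop. This is a simplified DDPM-style sketch with a placeholder model; real samplers such as DDIM use tuned schedules and refinements omitted here:

```python
import torch

def generate(predict_noise, steps=1000, shape=(3, 64, 64)):
    """Start from pure static and repeatedly ask the model to clean it up."""
    betas = torch.linspace(1e-4, 0.02, steps)        # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                           # step 1: pure random noise
    for t in reversed(range(steps)):                 # steps 2-3: denoise over and over
        eps_hat = predict_noise(x, t)                # the model's guess at the noise
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:                                    # re-inject a little fresh noise (DDPM detail)
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                         # step 4: the finished image

# Example with a do-nothing placeholder model (a real one would be a trained U-Net).
image = generate(lambda x, t: torch.zeros_like(x), steps=50)
```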
Conditioning: How text cues guide models
When you type a prompt like “panda on a bike”, another model (such as CLIP or T5) translates your text into a vector, a mathematical representation of its meaning.
The diffusion model uses this vector throughout the denoising process to steer the final image toward matching your prompt.
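As a rough sketch of how conditioning can work in code: the prompt is encoded once into an embedding, and that embedding is handed to the denoiser at every step. The classifier-free guidance blend shown here is a common technique for making the image follow the prompt more strongly; the encoder and denoiser below are placeholders, not any real model's API:

```python
import torch

def encode_text(prompt: str) -> torch.Tensor:
    """Placeholder for a text encoder such as CLIP or T5: prompt -> embedding vector."""
    torch.manual_seed(abs(hash(prompt)) % (2**31))  # fake but deterministic embedding
    return torch.randn(512)

def predict_noise(x_t, t, text_embedding=None):
    """Placeholder denoiser; a real model attends to the text embedding internally."""
    return torch.zeros_like(x_t)

def guided_prediction(x_t, t, prompt, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction toward the prompt-conditioned one."""
    emb = encode_text(prompt)
    eps_uncond = predict_noise(x_t, t, text_embedding=None)  # "no prompt" prediction
    eps_cond = predict_noise(x_t, t, text_embedding=emb)     # prompt-conditioned prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Example call at one denoising step.
eps = guided_prediction(torch.randn(3, 64, 64), t=500, prompt="panda on a bike")
```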
Why are diffusion models so good?
Diffusion models outperform older AI image-generation approaches such as GANs for four main reasons.
1. They refine images gradually, so fine detail is preserved throughout the generation process.
2. Their training is more stable, and they produce higher-quality images than GANs.
3. They don’t suffer from “mode collapse”, a common GAN failure that produces repetitive images.
4. They can be conditioned on text, depth maps, and sketches, which gives extra control over the generated images.
Real world applications
Diffusion models are not strictly limited to artistic applications – their use spans a number of industries:
🎨 AI Art: Tools like DALL·E, Midjourney, and Stable Diffusion
🏥 Healthcare: Generating synthetic medical images for safe research
🧬 Science: Designing novel molecules and/or drug candidates
🎬 Film and Design: Concept art, visual effects, and animation
Challenges
Despite their strengths, diffusion models have drawbacks:
- They require massive amounts of computing power and GPU time.
- Training can take weeks or months.
- They depend on the quality of the data. Biased data will produce biased results.
- Producing high-resolution images can still be relatively slow.
From chaos to creativity
Diffusion models illustrate the idea that order can emerge from chaos. They start with noise and end in creativity, showing that pure static can be transformed into artistry with the help of a well-trained model.
As artificial intelligence develops and diffusion models evolve, they will move beyond images, helping with world building, material design, and visualizing the unimaginable.
“Every masterpiece begins with a little noise.”