Hey, I made this, thanks for posting!
It’s purposefully high level and non-technical for a general audience - my theory was that most people who aren’t into tech/AI don’t care too much about training, or how the system got to be the way that it is.
But they do have some interest in how it actually operates once you’ve typed in a prompt.
Happy to answer any questions or take on board feedback.
I think some of the visualizations would be much better if you used a pixel-space model instead of a latent diffusion model.
Right now we are only seeing the denoising process after it's been morphed by the latent decoder, which looks a lot less intuitive than actual pixel diffusion.
If you can't find a suitable pixel-space model, then you can just trivially generate a forward process and play it backwards.
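A rough sketch of that last idea, in case it helps: noise a clean image step by step, then reverse the frame order for playback. The linear blend schedule, the toy 8x8 image, and the function name are all assumptions for illustration, not anything from the article.

```python
import numpy as np

def forward_noising_frames(image, num_steps=30, seed=0):
    """Return frames going from the clean image (t=0) to pure noise (t=1)."""
    rng = np.random.default_rng(seed)
    # One fixed noise sample, so consecutive frames change smoothly.
    noise = rng.standard_normal(image.shape)
    frames = []
    for t in np.linspace(0.0, 1.0, num_steps):
        # Assumed schedule: a plain linear blend between image and noise.
        frames.append((1.0 - t) * image + t * noise)
    return frames

# Toy stand-in for a real picture: a white square on a black background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

frames = forward_noising_frames(image)
# Played in this order, the frames look like denoising in pixel space.
reversed_frames = frames[::-1]
```

This only mimics what denoising looks like; no model is involved, so it shows the shape of the process rather than what a real sampler does at each step.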
Thanks, that’s a great suggestion.
Thanks for this!
Has there been any study of grammar and other word order effects in the result? Is "Dog fetches ball with tail" more likely to produce an image of a dog holding a ball with its tail than "tail ball dog fetch with"?
As with search engines, ambiguity is an issue: if a user searches for "best price on windows", do they mean Windows the OS or glass windows?
My impression, at least with the image generators I've used, is that while there is some mapping of words and maybe phrases through the latent space to an image, it's very weak. If you put "red ball" in a long prompt, it's nearly as likely that "red" will get applied to some other part of the description as to the ball.
Honestly I don’t know the answer to that but it’s a good question and something interesting to look into. The PRX model I used ran pretty well on my MacBook M4 so you could play around, although I guess it will depend on the specifics of the model.
When I was building this I did have to rework the prompts quite a bit so they worked nicely with the word-by-word reveal visualisation, i.e. they mention the subject early, then add adjectives about setting and light etc.
Loved the writeup!
Found the manual latent space exploration part really interesting.
Too many LLM/diffusion explanations fall in the proverbial “how to draw an owl” meme without giving a taste as to what’s going on.
It's quite clever and thoughtful. Thanks for making it!
I enjoyed this a lot.
The interpolations between butterfly and snail were pretty horrifying. But with something like Z-Image you could basically concatenate the text and end up with a normal image of both. Is the point in latent space for "butterfly and snail" just well off the path between the two individually?
It's hard to imagine what is nearby in latent space and how text contributes, so I did really like the section adding words to the prompt 1-by-1.
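One way to make the "off the path" question concrete is to measure how far a third vector sits from the straight line between two others. This is only a sketch: the three vectors below are made-up 3-d stand-ins, not real embeddings, and `distance_to_path` is a hypothetical helper.

```python
import numpy as np

def interpolate(a, b, t):
    """Point a fraction t of the way along the straight path from a to b."""
    return (1.0 - t) * a + t * b

def distance_to_path(point, a, b):
    """Distance from point to the segment a-b, plus the t of the closest point."""
    direction = b - a
    t = np.dot(point - a, direction) / np.dot(direction, direction)
    t = float(np.clip(t, 0.0, 1.0))  # stay on the segment
    closest = interpolate(a, b, t)
    return float(np.linalg.norm(point - closest)), t

# Made-up vectors standing in for three prompt embeddings.
butterfly = np.array([1.0, 0.0, 0.0])
snail = np.array([0.0, 1.0, 0.0])
butterfly_and_snail = np.array([0.5, 0.5, 1.0])

distance, t = distance_to_path(butterfly_and_snail, butterfly, snail)
print(f"closest point at t={t:.2f}, distance from path={distance:.2f}")
```

With a real model you would swap in the text encoder's outputs for the three prompts; a large distance would suggest the combined prompt is not on the interpolation path at all.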
This is awesome. If you made a book or video course that takes this kind of high-level explanation and translates it to the technical and then the mathematical level, I would buy it in a heartbeat.
This is what I think is missing in most AI (broad sense) learning resources. They focus so much on the math that I miss the intuitive process behind it.
Pretty cool, playing with the guidance scale slider here taught me more than re-reading the DDPM paper did.
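For anyone curious what that slider usually does under the hood: assuming it controls classifier-free guidance (the common meaning of "guidance scale", though I haven't checked the article's code), each step blends two model predictions like this. The two arrays are hypothetical stand-ins for the model's outputs.

```python
import numpy as np

def guided_prediction(uncond, cond, guidance_scale):
    """Classifier-free guidance: start from the prediction made without the
    prompt and move along the direction the prompt adds, scaled by the slider."""
    return uncond + guidance_scale * (cond - uncond)

# Hypothetical predictions at one denoising step.
uncond = np.array([0.0, 1.0])  # prediction with an empty prompt
cond = np.array([1.0, 1.0])    # prediction with the real prompt

for scale in (0.0, 1.0, 7.5):
    print(scale, guided_prediction(uncond, cond, scale))
```

A scale of 0 ignores the prompt, 1 uses the prompted prediction as is, and larger values push past it, which is why high settings follow the prompt more aggressively.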
Thanks for sharing!!
Thanks for this article, this is the best explanation and visualization I have seen for explaining this flow. Great work!
Scroll to visualise steps is such a great idea! Great writeup.
Ha, I was going to say the exact opposite. My first thought was that the website was broken.
The scroll trigger was something I’ve seen and wanted to play around with, but I know it’s controversial so I added the toggle as well (upper left corner).
Oh I particularly loved that you made the prompts themselves interchangeable. Very well done!
Amazing explanations!! I absolutely love this. In 10 minutes it’s given me a huge boost in my intuition on diffusion, which I’ve been missing for years.
If the prompt is the compass, and represents a point in space, why walk there? Why not just go to that point in image space directly? What would be there? When does the random seed matter if you're aiming at the same point anyway? Don't you end up there regardless? Does the prompt vector not exist in the image manifold, or is there some local sampling done to pick images which are more represented in the training data?
I’m not an expert, and this post was just based on my understanding, but as I understand it: the prompt embedding space and the latent image space are different “spaces”, so there is no single “point” in the latent image space that represents a given prompt. There are regions that are more or less consistent with the prompt, and through cross-attention between the text embedding vector and the latent image vector, the model is able to guide the diffusion process in a suitable direction.
So different seeds lead to slightly different end points, because you’re just moving closer to the “consistent region” at each step, but approaching from a different angle.
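To give a feel for the cross-attention part, here is a minimal single-head sketch in NumPy. The shapes and the random weight matrices are assumptions for illustration; a real model learns these weights and stacks many such layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Each image patch asks a question (query); prompt tokens supply the
    keys and values that answer it."""
    queries = image_tokens @ w_q   # (patches, dim)
    keys = text_tokens @ w_k       # (words, dim)
    values = text_tokens @ w_v     # (words, dim)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)  # how much each patch attends to each word
    return weights @ values, weights

rng = np.random.default_rng(0)
num_patches, num_words, dim = 16, 5, 8
image_tokens = rng.standard_normal((num_patches, dim))  # stand-in latent patches
text_tokens = rng.standard_normal((num_words, dim))     # stand-in prompt embedding
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))

update, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

The text never becomes a location in image space here; it only shapes the update applied to each patch, which fits the "regions consistent with the prompt" picture.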
One way of thinking about diffusion is that you're learning a velocity field from unlikely to likely images in the latent space, and that field changes depending on your conditioning prompt. You start from a known starting point (a noise distribution), and then take small steps following the velocity field, eventually ending up at a stable endpoint (which corresponds to the final image). Because your starting point is a random sample from a noise distribution, if you pick a slightly different starting point (seed), you'll end up at a slightly different endpoint.
You can't jump to the endpoint because you don't know where it is - all you can compute is 'from where I am, which direction should my next step be.' This is also why the results for few-step diffusion are so poor - if you take big jumps over the velocity field you're only going in approximately the right direction, so you won't end up at a properly stable point which corresponds to a "likely" image.
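A toy version of the velocity-field picture above, as a sketch. The field here is made up: it pulls points toward the unit circle, which stands in for the set of "likely" images. It is not a trained model, and the step counts are arbitrary.

```python
import numpy as np

def velocity(x):
    """Toy field (an assumption): point from x toward the nearest spot on
    the unit circle."""
    nearest = x / np.linalg.norm(x)
    return nearest - x

def sample(start, num_steps, total_time=4.0):
    """Follow the field from start using fixed-size Euler steps."""
    x = np.asarray(start, dtype=float)
    dt = total_time / num_steps
    for _ in range(num_steps):
        x = x + dt * velocity(x)
    return x

def distance_from_circle(x):
    return abs(np.linalg.norm(x) - 1.0)

# Two seeds: different starting noise, same field, different endpoints.
start_a = np.random.default_rng(0).standard_normal(2)
start_b = np.random.default_rng(1).standard_normal(2)
end_a = sample(start_a, num_steps=100)
end_b = sample(start_b, num_steps=100)

# Same start, many small steps versus two big jumps.
fine = sample([0.3, 0.4], num_steps=100)
coarse = sample([0.3, 0.4], num_steps=2)
print("100 steps:", distance_from_circle(fine))
print("2 steps:  ", distance_from_circle(coarse))
```

In this toy the two big jumps overshoot the circle and bounce back, while the small steps settle onto it. Real few-step samplers fail for messier reasons, but the intuition is similar: each step only knows the local direction.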
Scrolling through pics on mobile is difficult. Wanted to see all 29 steps but couldn't scroll it reliably.
Turning off the scroll mode worked very well for me on a mobile.