Diffusion Models
To recap, generative models train a generator \(G\) that maps a latent space to data space and approximates the real data distribution. Variational Autoencoders add to regular Autoencoders the ability to generate new samples and use a probabilistic latent space (assumed to be a multivariate Gaussian), while GANs have a generator that starts from Gaussian noise and produces a data point in order to fool the discriminator.
Denoising diffusion
Denoising diffusion models consist of two processes:
- a forward process that adds noise
- a reverse process that denoises to generate data
A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
\[Pr(X_{n+1}=x|X_1=x_1, X_2 = x_2, ..., X_n = x_n)=Pr(X_{n+1}=x|X_{n}=x_n)\]
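As a toy illustration (the two-state chain and transition matrix below are made up for this example), simulating a Markov chain only ever needs the current state:
```python
import numpy as np

# Hypothetical two-state Markov chain: state 0 = "sunny", state 1 = "rainy".
# Row i of P is the distribution of the next state given the current state i.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state = 0
trajectory = [state]
for _ in range(10):
    # The next state depends only on the current state (Markov property).
    state = rng.choice(2, p=P[state])
    trajectory.append(state)
print(trajectory)
```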
Forward Process
A forward process gradually adds noise to the images over \(T\) timesteps.
\[x_0\rightarrow x_1 \rightarrow x_2 \rightarrow ... \rightarrow x_T\]
The model is also tasked with undoing this noise in the so-called "reverse" process. Often the forward process is fixed and only the reverse process needs to be trained.
A forward process has the following formulation:
\[q(x_t|x_{t-1})=N(\sqrt{1-\beta_t}\,x_{t-1},\beta_t I)\Rightarrow q(x_{1:T}|x_0)=\prod_{t=1}^Tq(x_t|x_{t-1})\]
Define \(\alpha_t = (1-\beta_t)\), where \(\beta_t\) is the step size (schedule), then
\[\bar{\alpha}_t=\prod_{s=1}^t(1-\beta_s)\Rightarrow q(x_t|x_0)=N(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)\]
We can then sample directly at the desired timestep:
\[x_t=\sqrt{\bar{\alpha}_t}x_0 + \sqrt{(1-\bar{\alpha}_t)}\epsilon,\ \epsilon \sim N(0, I)\]
Formally, we are applying a Gaussian convolution to the data at each timestep. Practically, we are smoothing the distribution out towards a Gaussian one.
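A minimal PyTorch sketch of this closed-form sampling (the linear schedule endpoints and the helper name `q_sample` are illustrative choices, not from these notes):
```python
import torch

T = 1000
# Linear beta schedule; alpha_bar_t is the cumulative product of (1 - beta_s).
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) directly: sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps."""
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)         # broadcast over (B, C, H, W)
    b = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + b * eps

# Usage: noise a batch of images at random timesteps.
x0 = torch.randn(8, 3, 32, 32)    # stand-in for real images
t = torch.randint(0, T, (8,))
eps = torch.randn_like(x0)
x_t = q_sample(x0, t, eps)
```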
Generation Process
Given that \(q(x_T)\approx N(0, I)\), we sample \(x_T\sim N(0, I)\) and then iteratively sample \(x_{t-1}\sim q(x_{t-1}|x_t)\).
\(q(x_{t-1}|x_t) \propto q(x_{t-1})q(x_t|x_{t-1})\) is generally intractable, but we can approximate it with another Gaussian.
- \(p(x_T)\sim N(0, I)\)
- \(p_{\theta}(x_{t-1}|x_t)\sim N(\mu_{\theta}(x_t, t), \sigma_t^2 I)\)
- then \(p_{\theta}(x_{0:T})=p(x_T)\prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_t)\)
Noising schedule
We can control the variance of the forward diffusion and reverse denoising processes through \(\beta_t\) and \(\sigma^2_t\) respectively. Often a linear schedule is used for \(\beta_t\), and \(\sigma^2_t\) is set equal to \(\beta_t\).
Kingma et al. introduce a new parametrisation of diffusion models using the signal-to-noise ratio (SNR), and show how to learn the noise schedule by minimising the variance of the training objective.
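A short sketch of the common linear schedule and the signal-to-noise ratio \(\mathrm{SNR}(t)=\bar{\alpha}_t/(1-\bar{\alpha}_t)\) it induces (the endpoints 1e-4 and 0.02 are the usual DDPM choices, used here only for illustration):
```python
import torch

T = 1000
# Linear schedule for beta_t; at sampling time sigma_t^2 is often simply set to beta_t.
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# SNR of the marginal q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).
snr = alpha_bar / (1.0 - alpha_bar)

# SNR decreases monotonically: the signal is progressively drowned in noise.
for t in (0, 249, 499, 749, 999):
    print(f"t={t+1:4d}  beta={betas[t].item():.4f}  "
          f"alpha_bar={alpha_bar[t].item():.4f}  SNR={snr[t].item():.4f}")
```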
Connection to VAEs and GANs
- Latent variables have the same dimensionality as the data.
- The same model is applied across different timesteps.
- The model is trained by reweighting the variational bound.
Training Parametrisation
We can train the model in a similar fashion to a VAE, using a variational upper bound:
\[L=\mathbb{E}_{q(x_0)}[-\log p_\theta (x_0)]\le \mathbb{E}_{q(x_0)q(x_{1:T}|x_0)}[-\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}]\]
This bound can be divided into three terms:
\[L=\mathbb{E}_q\Big[D_{KL}(q(x_T|x_0)\,||\,p(x_T))+\sum_{t>1}D_{KL}(q(x_{t-1}|x_t, x_0)\,||\,p_{\theta}(x_{t-1}|x_t))-\log p_{\theta}(x_0|x_1)\Big]\]
- the first term is fixed (it does not depend on \(\theta\))
- the second term is a sum of KL divergences between Gaussians
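For reference, the KL divergence between two \(d\)-dimensional Gaussians has a standard closed form:
\[D_{KL}\big(N(\mu_1,\Sigma_1)\,||\,N(\mu_2,\Sigma_2)\big)=\frac{1}{2}\left[\mathrm{tr}(\Sigma_2^{-1}\Sigma_1)+(\mu_2-\mu_1)^\top\Sigma_2^{-1}(\mu_2-\mu_1)-d+\log\frac{\det\Sigma_2}{\det\Sigma_1}\right]\]
With the diagonal covariances used here, it reduces to a variance-weighted squared difference of the means.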
The KL between Gaussians has this nice closed form, but Ho et al. show (with some algebra) that training can be simplified to a noise-prediction problem, obtaining a new loss:
\[L_{simple}=\mathbb{E}_{x_0\sim q(x_0),\ \epsilon \sim N(0, I),\ t\sim U(1, T)}\left[||\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon, t)||^2\right]\]
Training Algorithm:
repeat
- \(x_0 \sim q(x_0)\)
- \(t \sim Uniform(\{ 1, ..., T \})\)
- \(\epsilon \sim N(0, I)\)
- Take gradient descent step on \(\nabla_\theta || \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon, t)||^2\)
until converged
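A minimal PyTorch sketch of one step of this loop, assuming a noise-prediction network `eps_model` that takes \((x_t, t)\) (e.g. a U-Net) and a precomputed `alpha_bar` as above; names and hyperparameters are illustrative:
```python
import torch
import torch.nn.functional as F

def train_step(eps_model, optimizer, x0, alpha_bar, T):
    """One DDPM training step: predict the noise added to x0 at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # t ~ Uniform({1..T})
    eps = torch.randn_like(x0)                                   # eps ~ N(0, I)
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    b = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    x_t = a * x0 + b * eps                                       # closed-form forward sample
    loss = F.mse_loss(eps_model(x_t, t), eps)                    # ||eps - eps_theta(x_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```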
Sampling algorithm:
- \(x_T \sim N(0, I)\)
- for \(t=T, ..., 1\) do
- \(z\sim N(0, I)\) if \(t > 1\), else \(z = 0\)
- \(x_{t-1}=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta (x_t, t))+\sigma_t z\)
- end for
- return \(x_0\)
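A matching sketch of the sampling loop, again assuming the same hypothetical `eps_model`, with \(\sigma_t^2=\beta_t\) as mentioned in the noising-schedule section:
```python
import torch

@torch.no_grad()
def sample(eps_model, shape, betas):
    """DDPM ancestral sampling: start from pure noise and iteratively denoise."""
    T = betas.shape[0]
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                        # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        z = torch.randn(shape) if t > 0 else torch.zeros(shape)   # no noise at the final step
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)
        mean = (x - (1.0 - alphas[t]) / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * z                            # sigma_t^2 = beta_t
    return x
```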
The choice of architecture is free; for images a U-Net is typically used.
U-Net
The U-Net architecture contains two paths.
- The first path is the contracting path (also called the encoder), which captures the context in the image. The encoder is a traditional stack of convolutional and max-pooling layers.
- The second path is the symmetric expanding path (also called the decoder), which enables precise localisation using transposed convolutions.
- It is an end-to-end fully convolutional network, i.e. it only contains convolutional layers and no dense layers, so it can accept images of varying size.
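A heavily simplified PyTorch sketch, just to make the two paths and the skip connection concrete; real diffusion U-Nets add timestep embeddings, residual blocks, and attention, and all layer sizes here are arbitrary:
```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net: contracting path, bottleneck, and expanding path with a skip connection."""
    def __init__(self, ch=3, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.mid  = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.up   = nn.ConvTranspose2d(base * 2, base, 2, stride=2)  # decoder upsampling
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.out  = nn.Conv2d(base, ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)                            # encoder features at full resolution
        e2 = self.mid(self.enc2(self.pool(e1)))      # bottleneck at half resolution
        d1 = self.up(e2)                             # back to full resolution
        d1 = self.dec1(torch.cat([d1, e1], dim=1))   # skip connection: concat encoder features
        return self.out(d1)                          # fully convolutional: no dense layers
                                                     # (spatial dims should be divisible by 2 here)
```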
Generative Trilemma
Fast sampling, mode coverage/diversity, and high-quality samples are difficult to achieve at the same time.
- GANs offer fast sampling and high-quality samples, but poor mode coverage/diversity.
- Likelihood-based models (Variational Autoencoders and Normalizing Flows) offer fast sampling and mode coverage/diversity, but lower sample quality.
- Denoising diffusion models offer mode coverage/diversity and high-quality samples, but slow sampling.
Diffusion GANs
Generative denoising diffusion models typically assume that the denoising distribution can be modeled by a Gaussian distribution. This assumption holds only for small denoising steps, which in practice translates to thousands of denoising steps in the synthesis process. In diffusion GANs, the denoising model is represented by multimodal and complex conditional GANs, enabling data generation in only a few steps.
Compared to a one-shot GAN generator:
- Both the generator and the discriminator solve a much simpler problem
- Stronger mode coverage
- Better training stability
Distillation
Distill a deterministic DDIM sampler into the same model architecture. At each stage, a "student" model is trained to distill two adjacent sampling steps of the "teacher" model into one sampling step. At the next stage, the "student" model from the previous stage serves as the new "teacher" model.
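A rough sketch of one distillation update, at the level of the idea only: `ddim_step(model, x, t_from, t_to)` is an assumed deterministic DDIM update (not defined here), and the published method uses a more careful parametrisation of the target than this plain MSE:
```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, ddim_step, x_t, t, optimizer):
    """One student update: a single student step should match two teacher DDIM steps."""
    with torch.no_grad():
        # Two deterministic teacher steps: t -> t-1 -> t-2.
        x_mid = ddim_step(teacher, x_t, t, t - 1)
        x_target = ddim_step(teacher, x_mid, t - 1, t - 2)
    # One student step tries to jump t -> t-2 directly.
    x_student = ddim_step(student, x_t, t, t - 2)
    loss = F.mse_loss(x_student, x_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```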
Latent-space diffusion models
The distribution of latent embeddings is close to a Normal distribution, which gives simpler denoising and faster synthesis. Latent-space models also allow an augmented latent space and tailored autoencoders (graphs, text, 3D data, etc.).
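A wiring sketch under the assumption of a pretrained autoencoder decoder (`decoder`) and a diffusion sampler (`sample_fn`, e.g. the sampling loop above) that operates on latents; all names are illustrative:
```python
import torch

@torch.no_grad()
def generate(decoder, eps_model, sample_fn, latent_shape):
    """Latent-space diffusion: sample in the autoencoder's latent space, then decode.
    Diffusion runs on small latents (e.g. (B, C', H/8, W/8)), which is cheaper than pixel space."""
    z = sample_fn(eps_model, latent_shape)   # denoised latent sample
    return decoder(z)                        # map the latent back to data space (e.g. pixels)
```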
Text-to-image
Jointly train a text encoder and an image encoder. Train by maximising the similarity between embeddings of (text, image) pairs. The resulting space has semantics for both images and text.
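A minimal sketch of a CLIP-style contrastive objective for aligning the two encoders; `img_emb` and `txt_emb` are assumed to be L2-normalised batch embeddings, and the temperature value is illustrative:
```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Maximise similarity of matching (image, text) pairs, minimise it for mismatched ones.
    img_emb, txt_emb: (N, D) L2-normalised embeddings of N paired images and captions."""
    logits = img_emb @ txt_emb.t() / temperature                     # (N, N) similarities
    labels = torch.arange(img_emb.shape[0], device=img_emb.device)   # pair i matches pair i
    loss_i = F.cross_entropy(logits, labels)                         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)                     # text -> image direction
    return (loss_i + loss_t) / 2
```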
Diffusion usages
- super resolution
- image-to-image (colour a black-and-white image, extend an image's borders)
- semantic segmentation
- image editing (add something to a portion of the image)
- video generation
- 3d shape generation