
Deep Generative Models

Definitions

Generative Models

Generative models are statistical models of the data distribution \(p_X\) or \(p_{XY}\), depending on the availability of target data, where \(Y\) is the label variable.

Discriminative models

Discriminative models are statistical models of the conditional distribution \(p_{Y|X}\).

Note that a discriminative model can be constructed from a generative model via Bayes' rule, but not vice versa:

\[p_{Y|X}(y|x)=\frac{p_{XY}(x, y)}{p_X(x)}=\frac{p_{XY}(x, y)}{\sum_{y'}p_{XY}(x, y')}\]
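As a concrete illustration, here is a minimal sketch (the small joint table is made up for the example) that recovers the discriminative conditional from a generative joint model:

```python
import numpy as np

# Hypothetical joint distribution p_XY over 2 inputs (rows) and 2 labels (columns).
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Marginal p_X(x) = sum over labels y' of p_XY(x, y').
p_x = p_xy.sum(axis=1, keepdims=True)

# Conditional p_{Y|X}(y|x) via Bayes' rule: joint divided by marginal.
p_y_given_x = p_xy / p_x

print(p_y_given_x)                # each row is a conditional distribution
print(p_y_given_x.sum(axis=1))    # rows sum to 1
```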

Density Estimation

Explicit problem: Find a probability distribution \(f\in \Delta (Z)\) that fits the data, where \(z\in Z\) is sampled from an unknown data distribution \(p_{data}\in \Delta (Z)\).

Implicit problem: Find a function \(f\in Z^{\Omega}\) that generates data \(f(\omega)\in Z\) from an input \(\omega\) sampled from some predefined distribution \(p_\omega \in \Delta (\Omega)\), such that the distribution of the generated samples fits the (unknown) data distribution \(p_{data}\in \Delta (Z)\).

In the case of supervised learning, \(Z=X\times Y\), while in the case of unsupervised learning, \(Z=X\). We will discuss the latter case.
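To make the implicit formulation concrete, here is a minimal sketch (the affine map and the Gaussian prior are arbitrary choices for illustration): we can only draw \(\omega\) from a predefined distribution and push it through a generator \(f\); we obtain samples \(f(\omega)\), not an explicit density.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(omega):
    # Toy generator: an arbitrary affine map from Omega = R^2 to Z = R^2.
    A = np.array([[2.0, 0.5],
                  [0.0, 1.0]])
    b = np.array([1.0, -1.0])
    return omega @ A.T + b

# omega is sampled from a predefined distribution p_omega (here a standard Gaussian).
omega = rng.standard_normal((1000, 2))
z = f(omega)   # generated samples; their distribution should fit p_data
print(z.mean(axis=0), np.cov(z.T))
```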

The objective is to define a hypothesis space \(H\subset \Delta (Z)\) of models that can represent probability distributions and a divergence measure \(d\in \mathbb{R}^{\Delta (Z)\times \Delta (Z)}\) between probability distributions in \(\Delta (Z)\), and then to find the hypothesis \(q^*\in H\) that best fits the data distributed according to \(p_{data}\):

\[q^*\in \arg\min_{q\in H}\ d(p_{data}, q)\]

KL-Divergence

Before continuing, it is worth taking some time to study the KL-divergence, since we will use it later. The Kullback–Leibler divergence, denoted \(d_{KL}(P\,\|\,Q)\), is a type of statistical distance that measures how much a model (or theoretical) probability distribution \(Q\) differs from a true probability distribution \(P\). It is defined as:

\[d_{KL}(P\,\|\,Q)=\sum_{x\in X}P(x)\log \frac{P(x)}{Q(x)}=\mathbb{E}_{x\sim P}\left[\log \frac{P(x)}{Q(x)}\right]\]

A simple interpretation of the KL divergence of \(P\) from \(Q\) is the expected excess surprise from using \(Q\) as a model instead of \(P\) when the actual distribution is \(P\). If \(P\) and \(Q\) are the same distribution, the KL-divergence is \(0\).
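A minimal numerical sketch of the definition above (the two discrete distributions are made up for the example):

```python
import numpy as np

# Two discrete distributions over the same support.
P = np.array([0.5, 0.3, 0.2])   # "true" distribution
Q = np.array([0.4, 0.4, 0.2])   # model distribution

kl_pq = np.sum(P * np.log(P / Q))   # d_KL(P || Q)
kl_qp = np.sum(Q * np.log(Q / P))   # d_KL(Q || P)

print(kl_pq, kl_qp)               # the two values differ: KL is not symmetric
print(np.sum(P * np.log(P / P)))  # identical distributions give 0.0
```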

Jensen's Inequality

Jensen's inequality generalizes the defining property of convex functions, namely that for every \(\theta \in [0, 1]\):

\[f(\theta x_1 + (1-\theta)x_2) \le \theta f(x_1)+(1-\theta)f(x_2)\]

In the context of probability theory, this is stated as:

\[\phi(\mathbb{E}[X])\le \mathbb{E}[\phi(X)]\]

where \(\phi\) is a convex function.
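A quick numerical check of the inequality with the convex function \(\phi(x)=e^x\) (the distribution of \(X\) is an arbitrary choice for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # samples of X

phi = np.exp               # a convex function
lhs = phi(x.mean())        # phi(E[X])
rhs = phi(x).mean()        # E[phi(X)]
print(lhs, rhs, lhs <= rhs)   # Jensen: phi(E[X]) <= E[phi(X)]
```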

Autoencoders

The basic idea behind autoencoders is to encode (as in compress) information automatically, hence the name. Autoencoders are neural networks with the smallest layer at the center and a symmetric structure around it. We call the part of the network from the input to the center the encoder, and the remaining part the decoder. The network can be trained with backpropagation by feeding it an input and setting the error to the difference between the input and the reconstruction produced at the output.

The encoder can also be used on its own, without the decoder, for dimensionality reduction or as part of a supervised model: it reduces the input to a much smaller latent space to which another neural network can be connected.
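A minimal sketch of this architecture in PyTorch (the layer sizes and the 784-dimensional, MNIST-like input are illustrative assumptions, not prescribed by these notes):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: input -> smaller hidden layer -> latent code (the "center").
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: mirror of the encoder, latent code -> reconstruction.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                  # a stand-in batch of inputs
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # error = difference between input and output
loss.backward()
opt.step()
```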

Variational AutoEncoders (VAE)

Variational Autoencoders add a "probabilistic spin" to traditional Autoencoders by learning a probability distribution over the latent space and using it to decode another probability distribution that best matches the distribution of the input.

Let \(\omega \in \Omega\) be the latent variable with prior \(p_\omega\), let \(q_\psi(\omega|x)\) be an encoder distribution, and let \(q_\theta(x|\omega)\) be a decoder distribution. Then the objective is:

\[\theta^* \in \arg\min_{\theta \in \Theta}\ d(p_{data}, q_{\theta})\] \[(1)\ \ q_\theta (x)=\mathbb{E}_{\omega \sim p_{\omega}}[q_\theta (x|\omega)]\]

The two formulas above are really important. We want to find the parameters that minimize the divergence between the model distribution \(q_\theta\) and the true data distribution \(p_{data}\). The marginal \(q_\theta(x)\) is obtained by taking the expected value of the decoder \(q_\theta(x|\omega)\) over latent variables \(\omega\) drawn from the prior \(p_\omega\).

As the divergence measure we will use the KL-Divergence:

\[d_{KL}(p_{data}, q_{\theta})=\mathbb{E}_{x\sim p_{data}}\left[\log \frac{p_{data}(x)}{q_{\theta}(x)}\right]=-\mathbb{E}_{x\sim p_{data}}[\log\ q_{\theta}(x)]+const\]

\[\overset{(1)}{=} -\mathbb{E}_{x\sim p_{data}}\left[\log\ \mathbb{E}_{\omega \sim p_{\omega}}[q_{\theta}(x|\omega)]\right]+const\]

Let's focus on the argument of the outer expected value. Let \(q_\psi(\omega | x)\in \Delta (\Omega)\) denote an encoding probability distribution with parameters \(\psi\). We can change the measure of the expectation to this distribution as follows:

\[\log\ \mathbb{E}_{\omega \sim p_{\omega}}[q_{\theta}(x|\omega)]=\log\ \mathbb{E}_{\omega \sim q_\psi (\cdot|x)}\left[q_{\theta}(x|\omega)\frac{p_{\omega}(\omega)}{q_{\psi}(\omega | x)}\right]\]

Thanks to Jensen's inequality (the logarithm is concave, so the inequality is flipped) we can write:

\[\ge \mathbb{E}_{\omega \sim q_\psi (\cdot|x)}\left[\log\left(q_{\theta}(x|\omega)\frac{p_{\omega}(\omega)}{q_{\psi}(\omega | x)}\right)\right] = \mathbb{E}_{\omega \sim q_\psi (\cdot|x)}[\log\ q_\theta (x|\omega)]-d_{KL}(q_\psi (\cdot|x), p_\omega)\]

These last two terms are the reconstruction term and the regularizer. The first is still intractable to compute exactly, but we can estimate the gradients with respect to the parameters; the regularizer may have a closed-form solution (for example, when using Gaussian distributions).
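For example, with a Gaussian encoder \(q_\psi(\cdot|x)=\mathcal{N}(\mu_\psi(x), \operatorname{diag}(\sigma^2_\psi(x)))\) and a standard normal prior \(p_\omega=\mathcal{N}(0, I)\), the regularizer has the well-known closed form:

\[d_{KL}(q_\psi (\cdot|x), p_\omega)=\frac{1}{2}\sum_{i}\left(\mu_i^2+\sigma_i^2-\log \sigma_i^2-1\right)\]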

In practice, training proceeds in several steps: first, data samples are fed to an encoder, which can be a standard feed-forward or convolutional neural network. The encoder reduces the dimensionality of the input into what is called a "latent space", composed of far fewer nodes than the input layer. We are interested in the distribution over this latent space, so we assume it is Gaussian and let the encoder output its mean and covariance. We then sample from this distribution and feed the sample to the decoder, which generates new data. Finally, we compare the original samples with the generated data via a loss function and update the weights of both the encoder and the decoder to minimize the difference.
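A minimal sketch of one VAE training step following these ideas (the layer sizes, the binary-cross-entropy reconstruction term, and the diagonal-Gaussian encoder are illustrative choices, not the only possible ones):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Linear(input_dim, 400)
        self.mu = nn.Linear(400, latent_dim)       # mean of q_psi(omega|x)
        self.logvar = nn.Linear(400, latent_dim)   # log-variance of q_psi(omega|x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 400), nn.ReLU(),
            nn.Linear(400, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: sample omega ~ q_psi(.|x) in a differentiable way.
        omega = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(omega), mu, logvar

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                                     # stand-in data batch
x_hat, mu, logvar = model(x)
recon = F.binary_cross_entropy(x_hat, x, reduction="sum")   # reconstruction term
kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1) # closed-form regularizer
loss = recon + kl                                           # negative ELBO
opt.zero_grad()
loss.backward()
opt.step()
```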

Issues with VAE

This approach suffers from two problems: underfitting, since in the initial stages of training the regularizer is too strong and tends to annihilate the model's capacity, and blurry generated outputs.

A modern approach to image generation is the Vector Quantized VAE (VQ-VAE-2), an image-synthesis model based on Variational Autoencoders. It produces high-quality images by leveraging a discrete latent space.

Conditional VAE

Variational Autoencoders do not need labels. Conditional VAEs are a variant of VAEs that accept labels.

Assume we have side information \(y\in Y\) (e.g. digit labels, attributes, etc) and we want to generate new data conditioned on the side information (e.g. generate digit 7, or generate a face with glasses).

Modify the encoder and decoder to take the side information as input, obtaining \(q_{\psi}(\omega | x, y)\) and \(q_{\theta}(x|\omega, y)\).

Define priors conditioned on side information \(p_{\omega}(\omega |y)\).
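A minimal sketch of how such conditioning is commonly implemented, by concatenating a one-hot label to the inputs of the encoder and decoder (the sizes and the one-hot encoding are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CVAEEncoder(nn.Module):
    def __init__(self, input_dim=784, num_classes=10, latent_dim=20):
        super().__init__()
        self.hidden = nn.Linear(input_dim + num_classes, 400)  # x and y together
        self.mu = nn.Linear(400, latent_dim)
        self.logvar = nn.Linear(400, latent_dim)

    def forward(self, x, y_onehot):
        h = torch.relu(self.hidden(torch.cat([x, y_onehot], dim=1)))
        return self.mu(h), self.logvar(h)       # parameters of q_psi(omega | x, y)

class CVAEDecoder(nn.Module):
    def __init__(self, output_dim=784, num_classes=10, latent_dim=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + num_classes, 400), nn.ReLU(),
            nn.Linear(400, output_dim), nn.Sigmoid(),
        )

    def forward(self, omega, y_onehot):
        return self.net(torch.cat([omega, y_onehot], dim=1))  # q_theta(x | omega, y)

# Generating conditioned on a label, e.g. "digit 7":
decoder = CVAEDecoder()
y = torch.nn.functional.one_hot(torch.tensor([7]), num_classes=10).float()
omega = torch.randn(1, 20)        # sample from the prior
x_new = decoder(omega, y)
```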

Generative Adversarial Networks (GAN)

VAEs are able to find explicit densities; GANs make it possible to find implicit ones.

GAN models are composed of two "adversarial" submodels: a generator and a discriminator. The term adversarial means that the two submodels are in competition with each other (a zero-sum game).

The generator is tasked with producing fake images, while the discriminator is tasked with recognizing whether an image is fake or not. The generator produces images for the discriminator to check, and these are mixed with real images. If the generator fools the discriminator, then the latter needs to update its weights; otherwise, the opposite happens.

GANs make it possible to estimate implicit densities. We assume that a prior density \(p_\omega \in \Delta (\Omega)\) is given, together with a generator (or decoder) \(g_\theta \in X^{\Omega}\) that generates data points in \(X\) given a random element from \(\Omega\).

The density induced by the prior \(p_{\omega}\) and the generator \(g_{\theta}\) is given by \(q_{\theta}(x)=\mathbb{E}_{\omega \sim p_{\omega}}[\delta (g_\theta (\omega)-x)]\), where \(\delta\) is the Dirac delta function: a generalized function on the real numbers whose value is zero everywhere except at zero and whose integral over the entire real line is equal to one.

The (original) GAN objective is to find \(\theta^*\) such that \(q_{\theta^*}\) best fits the data distribution \(p_{data}\) under the Jensen-Shannon divergence:

\[d_{JS}(P, Q)=\frac{1}{2}d_{KL}(P, M)+\frac{1}{2}d_{KL}(Q, M),\quad M=\frac{1}{2}(P+Q)\] \[\theta^* \in \arg\min_{\theta}\ d_{JS}(p_{data}, q_{\theta})\]

where

\[d_{JS}(p, q)=\frac{1}{2}d_{KL}\left(p, \frac{p+q}{2}\right)+\frac{1}{2}d_{KL}\left(q, \frac{p+q}{2}\right)\]
\[=\frac{1}{2}\mathbb{E}_{x\sim p}\left[\log\frac{2p(x)}{p(x)+q(x)}\right]+\frac{1}{2}\mathbb{E}_{x\sim q}\left[\log\frac{2q(x)}{p(x)+q(x)}\right]\]
\[=\frac{1}{2}\mathbb{E}_{x\sim p}\left[\log\frac{p(x)}{p(x)+q(x)}\right]+\frac{1}{2}\mathbb{E}_{x\sim q}\left[\log\frac{q(x)}{p(x)+q(x)}\right] + \log 2\]
\[=\log 2+\frac{1}{2}\max_t\ \left\{ \mathbb{E}_{x\sim p}[\log\ t(x)] + \mathbb{E}_{x\sim q}[\log(1-t(x))] \right\}\]

where the maximum is taken over functions \(t\in [0,1]^X\) and is attained at \(t(x)=\frac{p(x)}{p(x)+q(x)}\).

The interpretation of the JS divergence is that it compares each distribution against their mixture rather than one directly against the other; it is symmetric and useful when there is no single true reference distribution.
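A small numerical sketch of the JS divergence between two made-up discrete distributions, computed directly from its definition as an average of KL divergences against the mixture:

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

def js(p, q):
    m = 0.5 * (p + q)                 # mixture distribution M
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])

print(js(P, Q), js(Q, P))             # symmetric: both directions give the same value
print(js(P, Q) <= np.log(2))          # bounded above by log 2
```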

Let \(t_\phi (x)\) be a binary classifier (or discriminator) predicting whether a data point \(x\) came from \(p\) or \(q\). We then get the following lower bound on our objective:

\[d_{JS}(p_{data}, q_\theta)=\log 2+\frac{1}{2}\max_t\ \left\{ \mathbb{E}_{x\sim p_{data}}[\log\ t(x)] + \mathbb{E}_{x\sim q_\theta}[\log(1-t(x))] \right\}\] \[\ge \log 2+\frac{1}{2}\max_\phi \ \left\{ \mathbb{E}_{x\sim p_{data}}[\log\ t_\phi(x)] + \mathbb{E}_{x\sim q_\theta}[\log(1-t_\phi(x))] \right\}\]

This bound is then minimized over \(\theta\) to obtain the generator's parameters:

\[\theta^* \in \arg\min_\theta\ \max_\phi \left\{ \mathbb{E}_{x\sim p_{data}}[\log\ t_\phi(x)] + \mathbb{E}_{\omega\sim p_{\omega}}[\log(1-t_\phi(g_{\theta}(\omega)))] \right\}\]

In practice, during training both real data and generated data are passed to a classifier \(t_\phi(x)\) that estimates whether its input is real or generated. The generator and the discriminator are opponents, and this is modeled mathematically via the min-max objective: the generator adjusts its parameters \(\theta\) to minimize the objective, while the discriminator adjusts its parameters \(\phi\) to maximize it.
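A minimal sketch of this alternating min-max training in PyTorch (the network sizes, optimizers, and the non-saturating generator loss are common but illustrative choices, not the only formulation):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 32, 784

# Generator g_theta: maps a prior sample omega to a data point.
g = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Sigmoid())
# Discriminator t_phi: outputs the probability that its input is real.
t = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(g.parameters(), lr=2e-4)
opt_t = torch.optim.Adam(t.parameters(), lr=2e-4)
bce = nn.BCELoss()

x_real = torch.rand(64, data_dim)       # stand-in batch of real data
omega = torch.randn(64, latent_dim)     # samples from the prior p_omega

# Discriminator step: maximize log t(x_real) + log(1 - t(g(omega))).
x_fake = g(omega).detach()
loss_t = bce(t(x_real), torch.ones(64, 1)) + bce(t(x_fake), torch.zeros(64, 1))
opt_t.zero_grad(); loss_t.backward(); opt_t.step()

# Generator step: fool the discriminator (non-saturating loss: maximize log t(g(omega))).
loss_g = bce(t(g(omega)), torch.ones(64, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```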

