We hope to provide you all at KITP with an overview of one class of machine learning approaches, namely generative modeling. In this, we will discuss background on what generative models try to achieve and some recent advances in their approaches.
One of the major goals of machine learning is to understand the essential parameters explaining why a dataset is the way it is. Most commonly, this is seen as a problem of building a model that can learn a probability distribution that discriminates some data from other data. In the case of functional mapping in which a is associated to an input , this means learning a conditional probability , where the likelihood of the output value is predicted given some input condition. In applicable cases like classification or regression, the goal is to discriminatively ascertain likely values for a given . Other times, when a greater understanding of a functionally mapped space is desired beyond the distribution of some output conditioned on some input, the goal is to learn a joint probability -- to gain insight into the distribution that is responsible for generating the data.
Even more generally, there are many instances when there is no relational mapping or "labeling" behind the data distribution we are curious about. That is, we are only given some that are distributed according some probability distribution . In such case, the objective is to model the true distribution with some parameterized approximation so that we can generalize on i) estimating the likelihood of some in the domain and on ii) making novel samples that fall under the distribution. Approaches to this problem are detailed below.
Generative models can offer a variety of benefits or functional alternatives to discriminative approaches, some of which might be quite important for what physics want to accomplish:
These techniques could provide insight in validating experimental setups, expedite simulation tasks, and approve data abundance/quality, among other things.
Unsupervised learning with neural network architectures began with Hopfield and Hinton in the 1980s with the Hopfield Network and Restrictred Boltzmann Machine (RBM). We will not go into much detail about them here, but their details can be reviewed here and here. In general, these are traditionally binary, shallow neural network models that store an associated memory of an event or learn to reconstruct input. The energy minimization principles behind them lend themselves nicely to a physics perspective, but the advent of deep learning and its interplay with statistical learning theory over the past 15 years have ushered in a new era of deep generative modeling. The main approach that we will maintain our attention toward are methods that seek to maximize the likelihood of the data.
Each novel approach to generative modeling has its advantages and limitations. By their history and their approach to the objective of providing inference on a distribution, we can categorize generative models into three categories, as per Ian Goodfellow's NIPS 2016 GAN tutorial:
Explicit and Tractable Likelihood
Explicit and Approximate (bounded) Likelihood
Note: I might use the terms likelihood and density interchangeably here. Also note: I've included a few flow based models under the tractable likelihood category that I think are important to mention.
Pixel RNN / CNN
Pixel generative models are techniques that break the full likelihood estimate into the product of many sequential conditional probabilities. We can imagine that you perform inference pixel by pixel and use the chain rule of probability to amass an a measure of such that:
We can take a look at an example in the context of an RNN based on some recent work by Carrasquilla et al, "Reconstructing Quantum States with Generative Models."" Imagine that we have a 1D lattice of qubits that we want to perform measurements on to infer information about the distribution of measurement outcomes and relate that to reconstructing the quantum state.
Let's say we have a 50-qubit quantum system that is governed by the Transverse-Field Ising Model 1D Hamiltonian:
You want to be able to check on the fidelity of your setup to verify that you have properly prepared the quantum state. To do this, you want to perform inference on the density matrix of the mixed state -- to show that informationally complete measurement samples from your setup match the expected frequency associated with your intended density matrix . With the right choice of measurements -- namely informationally complete Postive-Operator Valued Measurements for each qubit , we can link the probability of measurement outcomes with the invertible expression given by Born's Rule: . Thus, if we can gain insight into the distribution of measurement ourcomes , we can gain insight into the density matrix . Here, each in can take on a discrete value to represent the measurement outcome for the qubit. The number of possible outcomes is decided by which POVM measurements are used. Let's say we have the Pauli-4 POVM and we then have 4 possible measurement outcomes . I will one-hot encode them to the vectors , , , . A given data example for our 50 qubits would be 50 of these one-hot vectors. We want to feed each of these sequentially into our RNN unit.
An RNN is made of a recurrent unit that takes in a hidden state (used for updating computation at different sequential states) and an input vector , which in our case is one of these one-hot vectors of measurement outcomes. The weight matrix is involved with some computation with and (which will vary depending on what type of RNN unit you use, like a GRU or LSTM).
At the first step , we feed in some fixed initial , of arbitrary value. The output of the RNN at the first time sequence is a probability distribution over measurement outcomes for the first qubit and an update to the hidden state . That is, the output of the RNN on the first go will give us and our new . During training, we then take the real first qubit measurement outcome from our training example and feed that in to the RNN unit as with to get our estimate . We do this for each sequential qubit in our model. We compute the loss by calculating the cross entropy between each and the sample from our training data for all in our qubit chain and sum.
When we want to sample the model, we alter this process a bit. After outputting a set of probabilities for the measurement outcome of the first qubit, we draw a sample outcome from this set of probabilities and feed that into the next step of the RNN rather than the training sample outcome. That is, say we compute the probabilities for measurement outcomes of the first qubit and get . We then sample from this and get , and this is what we feed as input into the RNN for the next pass estimating . Doing this for all qubits, you get a probaility distribution for each one that is dependent on the outcomes of those that preceded it. You can then start generating the measurements on 50 qubit lattices and using that to calculate expectations of observables for your density matrix :
You can compute the Classical Fidelity and the KL-divergence between and (ground truth):
For PixelCNNs rather than RNNs, imagine that the sequential steps come from 1D convolutions.
Flow Based Models
A recently introduced class of generative modes with explicit likelihood estimate is those that make use of normalizing flows. Normalizing flows help learn an invertible transformation (or set of invertible transformations for deeper models) such that performs a mapping between probability distributions. If you can make a sufficiently versatile representation of (say, with a neural network), then you should be able to transform most any probability distribution into another. Let's build the mathematical tools out for describing this. Say we have some and and we want to learn our generative mapping which takes some from the data space and transforms it to some in the latent space . We can use the change of variables formula to do this if is invertible:
Note: because we assume is multi-dimensional, we label as the Jacobian of the . This seems simple enough, so there must be a catch. And that catch is in coming up with a way to represent a highly flexible and complicated function that one can easily invert. A number of papers by Dinh et al. (NICE and RealNVP) as well as a recent paper by Kingma et al. (Glow) build up and optimize some clever methods to make this inversion possible, the basics of which we'll describe here.
In general, it is difficult to learn invertible functions. Additionally, the above equation is computational inefficient because calculating the Jacobian of a large matrix is intractable. This can circumvented if is constructed with what we'll call additive or affine coupling layers. NICE uses additive coupling layers, while RealNVP uses affine coupling layers. These layers are ways of splitting the invertible function into subparts so that the Jacobian calculation is tractable. I copy them from their respective papers:
Coupling + 1x1 Convolutions:
These are used in the Glow paper to help with image understanding. Details can be found in their paper and aren't essential for understanding the flow-based modeling framework.
The coupling allows for us to compute a Jacobian defined by a triangular matrix, which is significantly more efficient than would otherwise be possible. The Jacobian of these coupled functions looks like:
where is the identity matrix of dimension . We write it like such to show the familiar form of a 2x2 matrix. This is really convenient! We need to calculate the determinant of this matrix to do our transformations and inverse transformations between probability density functions. If you recall that for a 2x2 matrix , , we can see that the determinant of our Jacobian is just the determinant of the smaller sub matrix defined by . For as complex a transformation as the affine coupling, this only amounts to being able to compute:
If you stack at least two of these coupling layers together, you can transform all of your data. Moreover, one can see that computing neither the Jacobians, nor the inverses, require explicit sub-computations of the scaling in NICE (function ) or the scaling and transformation in RealNVP (functions and ). As such, these functions can be complicated and that won't impact the difficulty in calculating the inverse or the Jacobian.
With this flexibility of inversion, we can perform sampling and exact inference. If I draw some from and I know is an analytically known distribution like a Gaussian , I can pass it through to get a sample from . If I want to perform inference on a sample , I can pass it through to understand its likelihood in the known latent density:
In practical optimization contexts, we maximize the log likelihood of the data rather so that the multiplication in Equation can be split into additive terms whose optimization is more manageable to pursue:
Note: Flow-based techniques can be combined with other approaches! For example, normalizing flows were introduced to VAEs in 2015 with the paper BLANNNKKKK by BLANKKKKK, and autoregressive models can be hybridized with flow techniques, as per Inverse Autoregressive Flows and Masked Autoregressive Flows.
Under a similar vein as the flow based models in which we map our data to some known tractable variable , we can also approach this problem that the variables are actually a set of hidden or latent variables that drive processes from which the data are generated. That is, we can imagine that these latent variables drive and can be used to explain the observable data and how its distributed. This is called the latent variable model and has been used often in statistical modeling. By such, it is often used in research in the machine learning community, but it has a few limitations. We can describe the process of inference on as calculating
We run into the problem of having to marginalize out all of to exactly calculate . For high-dimensional systems that we often encounter in machine learning. In such case, we also can't access it through non-integral Bayes' Rule because the posterior density is also intractable. To overcome this, Kingma and Welling proposed a method of putting an optimizable lower bound on so that one can have a model that provides approximate, bounded measure of the likelihood of the data.
Let's take this latent model and run with it a bit. We want to generate some from and thereby need a model which tell us about . We can estimate this with a neural network defined by parameters as that will tell us about the probability of observing some given some to be decoded. We can choose our latent space to follow some simple distribution such as a Gaussian so . Because marginalizing out is intractable, we have to start by performing inference on . We can facilitate this by first encoding our data into a latent space by learning to model . We can estimate this with a neural network defined by parameters as , where will be used to represent the density of our samples that are encoded into the latent space.
I have used the words decode and encode to hint at the structure of the model. These words are used often to describe the two subparts of an autoencoder. A conventional autoencoder is a deterministic framework for learning to encode data into a dimensionally reduced space and then subsequently decode it for reconstruction. If the data can be reconstructed from the dimensionally reduced space, then the dimensionally reduced space captures the information behind the data well.
In our context, we want to put a probabilistic spin on this interpretation, where now the encoder represents our probability distribution and the decoder represents our . We can write out an approximation for the log-likelihood of the data with these terms, as is done here:
We expand the log-likelihood into terms from our model and end up with three. The expectation over equates the last two logarithmic fractions to KL-divergences. The first term is a posterior likelihood term that can be approximated on small minibatches with with a mean square error or a binary cross entropy. We can consider the reconstruction loss term as ensuring that real samples are embedded into the latent distribution and forcing the decoding function to be an approximate inverse of the encoding function. The second term is a measure of the similarity between the encoded distribution and our (generally) Gaussian prior, and the third term is intractable. However, , so the first two terms can still function as a differentiable lower bound on the log-likelihood -- also known as the also known as the evidence lower bound (ELBO).
Because it is differentiable, it can be optimized to maximize the lower bound of the likelihood. Using this variational principle, this has been dubbed a variational autoencoder (VAE). Note the value included the bound. The addenedum is known as the -VAE and helps control the conditional independence of the latent space (and thus helps with the adoption of conditional information) and improves the efficiency of the latent representation. It comes with a tradeoff though of poorer image reconstruction, as the first term is made comparatively less important in the loss.
Graphs of prototypical autoencoders and VAEs models are shown in the figure above to highlight their differences. An important distinction beyond the different objective function is in the encoding bottleneck. Instead of encoding down to some single latent representation, a VAE encodes to two bottlenecks -- a mean vector and a variance vector. The latent vector is then created by sampling the mean vector, the variance vector, and combining these samples with a noise term . This is to permit the VAE to generate new samples while still allowing the graph to be differentiable for backpropagation optimization and is known as the reparameterization trick. Autoencoders and VAEs are only cousins because of their encoding and decoding paradigm; otherwise, they are conceptually (Bayesian vs deterministic) and functionally (reductive vs generative) quite different.
You might be asking why the encoder maps to a and ? This is to make the calculation of that second KL-term analytically calculable. Those two parameters define the Gaussian for us, so we can compute a closed form representation of by treating and as functions of :
where is the dimensions of the latent space, and, in the last line, we switch to the exponent because one achieves more stable results computing than (as seen in the graphical representation above).
So what does the training process look like then? It's much like a balance optimization between being able to reconstruct images and ensuring that looks Gaussian for all . We can imagine that below, where we can envision 2D Gaussians for each getting near or close to the prior Gaussian :
As the KL term is more strongly enforced, data will be better spread around the Gaussian prior, but the generated image quality will be worse. If the KL term is not enforced as strongly (as might be seen above), posterior Gaussians can overlap, and multiple inputs might map to the same generated . This can be improved with the Wasserstein Autoencoder by Tolstikhin et al in which is forced to match rather than each on their own. Note: there are many many advances and perspectives on these models, and more keep coming. I just reference the Wasserstein Autoencoder case because it elucidates some of the behavior of the latent space visually well.
For example of some other advances in the understanding, one can take a look at ways of disentangling the latent space of these models so that each latent dimension has some more interpretable understanding of its representation. This can be done by enforcing independence between the latent dimensions, which becomes apparent if you break down the KL term into its newly discovered 3 subparts, one of which is the Total Correlation. This can be done adversarially, with d-Hilbert-Schmidt Independence Criteria, or with MMD and variations of adversarial training. One can also tailor their latent space to be a sampling on a manifold that corresponds to the symmetries/transformations that govern the true data.
Generative Adversarial Networks
A common dilemma in generative modeling is choosing the right evaluation metric in your training regime, as we have seen that likelihood calculations on latent variable models (as well as other MLE approaches using energy maximization techniques) can be quickly intractable. Moreover, these methods generally give a limited resolution on samples generated by the model. A novel training technique known as adversarial training replaces the traditional likelihood estimation with a trainable network, whose task is to discriminate generated samples that came from a implicitly learned by a model from those of the true . Generative adversarial networks (GANs) conceived in recent years follow this paradigm.
GANs function as a competition between two competing neural networks - a generator and a discriminator -- which are both represented by functions that are differentiable with respect to inputs and weight parameters. Let's say that the discriminator and generator have trainable network parameters and respectively. The generator takes in some latent noise vector and seeks to output a conditional signal . The discriminator takes in either the output of the generator which is a sample from the approximate distribution or a sample from the true distribution and outputs a value of the probability that the given sample is real or counterfeit. The two functions share a cost function -- one that the generator tries to minimize and the discriminator tries to maximize:
The two networks are thus jointly optimized in tandem with one another, and can be visualized like below:
Let's break down what this optimization process looks like a bit more and see what happens when we have an optimal discriminator . I will describe the above cost function with the full distributions so we can use integrals rather than expectation:
where is the probability that the generator produces . The optimal discriminator to maximize this (quick proof on continuous space provided in original GAN paper) is given by:
Using this, we rewrite our value function as:
when , . This is the optimal scenario for training, as our distributions would match. In this case, we have
is the lowest value of the cost. Factoring this value from the right-hand-side of will give us the minimum plus whatever needs to be minimized to 0 to achieve that minimum:
Thus, the training process on this value function with an optimal discriminator has been shown to be equivalent to minimizing the Jensen-Shannon divergence given by:
for KL divergence . The ultimate goal is for the generator to create samples that are coming from an approximate distribution so close to the real distribution that the discriminator can no longer tell them apart from real samples. It should be noted that because in real world applications the model does not have access to the full distribution , the generator can only best learn the training distribution .
While this approach is very powerful because of the flexible, parameterized replacement for the likelihood that allows for the generator to adeptly model multi-modal output, there are a number of limitations to the original GAN framework. These limitations could arise from a number of theoretical problems as detailed in a paper by Roth et al 2017 -- overfitting, density misspecification or dimensional misspecification -- and are often driven by the generator or discriminator outperforming the other. Two symptoms appear because of this problem one we'll call mode collapse and/or vanishing gradients as well as non-convergence, where the parameters of the model continue to oscillate back and forth. We can visualize why the model will not update significantly if the generated and empirical distribution produced are too far apart:
As you can see in the graph, when the distributions are far apart, the JS Divergence flattens, in which case only small gradients would be backpropagated.
There have been many proposed solutions and further analysis of GAN dynamics, but I'll highlight two here which work well in practice. I will refer to them as the Wasserstein GAN and the GAN with discriminator gradient penalty.
Wasserstein GAN (WGAN and WGAN-GP):
The idea behind the Wasserstein GAN is to replace the f-divergence with an integral probability metric. That is, the GAN is refrmed as a question of optimal transport (OT), in which our goal is to learn and minimize the best measure (minimum cost) of how much density of the model distribution need to be moved to make it equal to the true distribution. In this case, we replace the JS Divergence minimization problem with one of minimizing the Wasserstein distance between the distributions:
This is also known as the Earth-Mover distance. is the set of all joint distributions whose marginals are . The infimum is the greatest lower bound for any transport plan. We want to minimize the greatest lower bound to ensure we've definitely overlapped the distributions. In reality, this equation is intractable to estimate, but can be expressed under 1-Lipschitz constraint as:
1-Lipschitz functions are just functions for which their slopes/gradients are always less than or equal to 1: . This simplification can be derived via the Kantorovich-Rubinstein duality.
Moreover, this new metric gives us a new objective! The discriminator is no longer tasked with outputting a probability that the sample is real or fake, but rather is tasked with learning this supremal function f(x) that outputs a score of distributional distance. When the score is minimized, your distributions should overlap. As such, one should remove the sigmoid activation at the end of their original GAN discriminator because the score is not a probability.
One issue with this original proposal is that it is hard to enforce this Lipschitz constraint when learning . A way to enforce this constraint is proposed in Improved Training of Wasserstein GANs, in which a regularizer is applied to the objective that forces the gradients of the discriminator to be less than 1:
GAN w/ Discriminator Gradient Penalty (GAN-DP):
A gradient penalty can also be applied to the original GAN framework. This method was shown to be locally convergent whereas the WGAN-GP was not, though both work decently in practice under the right hyperparameters. I would say from my experience that GAN-DP is less fragile with respect to small changes in the model.
The penalty is on the gradient of the discriminator on real data, and helps keep the model headed toward and around Nash-equilibrium.
We can use these techniques to sample from very complex distributions, but for now here's a tutorial on sampling from some simple analytic distributions:
The tutorial can be found here.
This page will be continuously updated.