Restricted Boltzmann Machines
Contents
Restricted Boltzmann Machines#
In this section, we will learn how to build and train a restricted Boltzmann machine (RBM) to solve a simple engineering task.
The restricted Boltzmann machine is an example of a physics-inspired method that is useful for engineering problems. At a high-level, RBMs are probability distributions for binary data with many tunable parameters. When “training” an RBM, we tune its parameters to make the RBM’s probability distribution match the distribution of a training data set. Our goal in this unit is to make and train an RBM whose probability distribution matches the distribution of images in the MNIST handwritten digit data set.
Like a Hopfield network, a restricted Boltzmann machine is a particular type of Ising model. We saw that a Hopfield network can be thought of as a zero-temperature Ising model which equilibrates to a local energy minimum. Similarly, a restricted Boltzmann machine can be thought of as a finite-temperature Ising model with a particular connection structure between its spins. Because of their relation to statistical physics, RBMs are more interpretable than some other machine learning approaches. And because of their particular structure, they can be efficiently trained.
A restricted Boltzmann machine is a probability distribution over binary variables, which, like in the Hopfield network, can be interpreted as spins or neurons. In an RBM, the binary variables are separated into two types, “visible” \(v_i\) or “hidden” \(h_j\) variables. The visible and hidden variables do not interact with variables of the same type, but only interact with variables of the other type. This special “restricted” structure allows us to separate the binary variables into two layers. Described using the language of statistical physics, an RBM is simply a Boltzmann distribution of an Ising model for the spins \(v_i,h_j = \pm 1\):
where
is the energy of the visible and hidden spins, \(Z\) is a normalization constant, and the temperature is set to \(T=1\). Each entry of the weight matrix \(W_{ij}\) describes an interaction between a visible spin \(v_i\) and a hidden spin \(h_j\). The visible bias \(a_i\) (hidden bias \(b_j\)) is a magnetic field that encourages the visible spin \(v_i\) (hidden spin \(h_j\)) to turn on. Note that the energy \(E(v,h)\), and therefore the probability \(p(v,h)\), depends on the weights \(W_{ij}\) and biases \(a_i, b_j\). These parameters define the RBM and are what we will tune during training.
Our goal is to write an RBM code and make it “learn” a probability distribution from a training data set. Once trained on this data, the machine will be able to produce new samples from the data’s distribution (or its best estimate).
RBM as an Ising Model#
To summarize, an RBM is really just an Ising model.
Restricted Boltzmann Machine \(\longleftrightarrow\) Ising Model
Neuron \(\longleftrightarrow\) Ising Spin
\(+1\) = Spin Up
\(-1\) = Spin Down
\(W_{ij} \longleftrightarrow J_{ij}\)
\(a_i,b_j \longleftrightarrow h_i\)
The energy of the corresponding Ising model is
When we sample an RBM, we obtain the spin configuration \((v,h)\) with the \(T=1\) Boltzmann probability
There are \(N_v\) visible spins \(v_1,\ldots,v_{N_v}\) and \(N_h\) hidden spins \(h_1,\ldots,h_{N_h}\).
Understanding Probability Distributions#
There are a number of probability distributions that we need to be familiar with to understand RBMs. These probabilities characterize the frequency with which we observe configurations of visible and hidden spins when sampling an RBM.
The first and most important is the joint probability \(p(v,h)\). This is a probability distribution over multiple random variables that gives the probability of observing these variables simultaneously. In other words, it tells us probability that the visible spins are in configuration \(v\) and the hidden spins are in configuration \(h\). By definition, for an RBM, this is given by the Boltzmann distribution
This probability distribution contains all of the information about how variables \(v\) and \(h\) are correlated with one another. The remaining probability distributions that we discuss are derived from the joint distribution.
Next are the marginal probabilities \(p(v)\) and \(p(h)\). These are probability distributions for subsystems of the entire system. Consider \(p(v)\). Intuitively, \(p(v)\) is the probability of observing the visible spin configuration \(v\), averaged over all possible hidden spin configurations \(h\). This quantity can be computed from the joint distribution \(p(v) \propto \sum_h p(v,h)\), where the sum over \(h\) is a sum over all \(2^{N_h}\) spin confiurations of the hidden spins. So that this is a normalized probability distribution, with \(\sum_v p(v) = 1\), the marginal distribution for the visible spins is defined as
Likewise, the marginal distribution for the hidden spins is
Finally, we discuss the conditional probabilities \(p(v|h)\) and \(p(h|v)\), which are read as “probability of \(v\) given \(h\)” and “probability of \(h\) given \(v\)”. These are a bit subtle, so take some time to think about them and how they relate to the other distributions. A conditional probability is a probability distribution of a subsystem conditioned on the complementary subsystem taking a specific value. Consider \(p(v|h=\hat{h})\). This is the probability that \(v\) occurs given that \(h\) is fixed to the value \(\hat{h}\). Like the marginal distribution, this probability can be computed from the joint distribution \(p(v|h=\hat{h}) \propto p(v,h=\hat{h})\). Notice that \(p(v|h=\hat{h})\) depends on two variables \(v\) and \(\hat{h}\), but is only a probability distribution over the variable \(v\). This means that it is only normalized over \(v\) so that \(\sum_v p(v|h=\hat{h})=1\) is true for any \(\hat{h}\). This leads us to the normalized definition of \(p(v|h)\) (where we drop the \(h=\hat{h}\) notation) and its explicit form for an RBM
Likewise, \(p(h|v)\) is
Notice how closely related all of these probability distributions are. For example, the marginal and conditional probabilities are all directly related to the joint distribution \(p(v,h)\), but have different normalizations, which dramatically affect their interpretations.
In general, joint, marginal, and conditional probability distributions are related by a simple formula. Suppose that we have a system split into subsystem \(A\) and its complement \(B\). The probability of \(A\) conditioned on \(B\), denoted by \(p(A|B)\), relates to the marginal distribution of \(B\) and the joint distribution of \(A\) and \(B\) via
You should check why this formula is true from the definitions of conditional and marginal probabilities.
This relation forms the basis of Bayes’ theorem,
which holds whenever the joint distribution is symmetric, \(p(A,B) = p(B,A)\).
Why are marginal and conditional probabilities important? Why did we not consider them in the Ising model unit?
The variables in an RBM are by design split into two groups, visible and hidden variables. When training an RBM, the visible variables are associated with data, such as the pixels in images, while the hidden variables are simply extra variables in the model, sometimes called auxiliary or latent variables. Because of the special interpretation of the visible variables, it is natural to consider the probability of just the visible spins, i.e., the marginal distribution \(p(v)\). This marginal distribution is the probability distribution that we will want to match to a data set during training. This is why we care about marginal probabilities.
An important technical detail about RBMs is that, because of the bipartite structure of the RBM, there is an efficient way to sample spin configurations using conditional probabilities \(p(v|h)\) and \(p(h|v)\). This efficient sampling allows us to perform efficient training. This is why we care about conditional probabilities.
In the Ising model unit, there was only one type of spin and there was no reason to split the spins into different subsystems. So we never needed to consider probability distributions that related different subsystems.
Building and Sampling an RBM#
Your first goal is to set up the basic code structure for your RBM. You will want to store the states of your visible and hidden variables, as well as the values of your parameters, the weight matrix and the biases. Note that the weight matrix is generally rectangular, as there may be different numbers of visible \(N_v\) and hidden neurons \(N_h\). You should also set up a function that computes the energy, which will be useful for debugging.
Next, we want to learn how to sample configurations from our RBM, given the weights and biases, which we will need later in our training. Our goal is to produce spin configurations \((v,h)\) with probability \(p(v,h)\). Since the RBM is just an Ising model, one way we could perform this sampling is by using Metropolis Markov Chain Monte Carlo as we did in the Ising model unit, where we performed single spin flips randomly by using the energy function. It turns out there is a more efficient sampling method for RBMs, because of their special restricted structure, that allows us to “flip” many spins at once instead of one-at-a-time.
The efficient Monte Carlo method that we will use to sample the joint distribution \(p(v,h)\) is known as Gibbs sampling. Gibbs sampling involves the following steps:
Given the current visible layer \(v\), probabilistically choose a configuration for the hidden layer \(h\).
Given the chosen \(h\), probabilistically choose a configuration for \(v\).
Repeat steps 1 and 2 \(k\)-times.
This process is a Markov chain, and therefore has a stationary distribution. To produce the desired distribution, we must probabilistically choose new configurations of the hidden or visible layers in such a way that the stationary distribution is proportional to \(p(v,h) \propto \exp[-E(v,h)]\). This can be done by using conditional probabilities in each step of the sampling process. Specifically, the first two steps are:
Given \(v\), choose a new \(h\) with probability \(p(h|v)\).
Given \(h\), choose a new \(v\) with probability \(p(v|h)\).
Step 1. Because of the bipartite structure of the RBM, the conditional probability has a particularly nice, factorable form. For the hidden spins conditioned on the visible spins, it is
where \(m_j\) is an “effective magnetic field” felt by hidden spin \(h_j\), defined as
It is helpful to check this yourself, by using the definition of \(p(v,h)\) and \(p(h|v)\).
Because \(p(h|v)\) factors into a product of \(p(h_j|v)\), the \(h_j\) variables are distributed independently (for a given \(v\)) and we can sample them independently and at the same time. This is what makes Gibbs sampling efficient for RBMs. In practice, to sample \(p(h|v)\), we go through each hidden neuron \(h_j\) and choose to set it to \(+1\) with probability \(p(h_j=+1|v)=\exp(m_j)/(\exp(m_j) + \exp(-m_j))\) and set it to \(-1\) otherwise.
Step 2. The exact same logic applies. The conditional probability \(p(v|h)\) factors into
where \(m_i\) is an “effective magnetic field” felt by visible spin \(v_i\), defined as
To sample \(p(v|h)\), go through each visible spin \(v_i\) and set it to \(+1\) with probability \(p(v_i=+1|h)=\exp(m_i)/(\exp(m_i) + \exp(-m_i))\) and set it to \(-1\) otherwise.
One iteration of step 1 generates samples from \(p(h|v)\). One iteration of step 2 generates samples from \(p(v|h)\). However, \(k\) iterations of alternating steps 1 and 2, produces samples \((v,h)\) that are actually approximately distributed according to the joint distribution \(p(v,h)\). This approximation becomes better as \(k\) is increased, but works well even for modest \(k\). Making \(k\) larger leads to more accurate, but less efficient sampling. You can use \(k=10\).
Your goal is to implement Gibbs sampling. Since it is essential for the training algorithm, you need to make sure that it is working correctly by testing your generated samples against analytic results. We can do this check for a small RBM.
Testing
To check your Gibbs sampling, follow the steps below.
Setup an RBM with \(N_h=2\) hidden neurons and \(N_v=5\) visible neurons. Choose random initial states for the visible and hidden neurons. Choose random weights and biases.
For the given \(v\), sample the hidden layer spins 100,000 times and store a histogram of the resulting spin configurations. To do this I did something like (in C++):
map<string,double> probs;
//loop 100,000 times
string myString=rbm.toString(rbm.hidden);
if (probs.find(myString)==probs.end())
probs.insert(make_pair(myString,0));
probs[myString]+=1;
norm+=1;
Compare this histogram with the analytical result for \(p(h|v)\) obtained by brute-force calculation. My largest error was about 0.003.
For a given \(h\), sample the visible layer spins 100,000 times and store a histogram of the results.
Compare this with the analytical result for \(p(v|h)\).
Starting from a random initial spin configuration \((v_0,h_0)\), perform \(k=10\) iterations of Gibbs sampling and obtain the final spin configuration \((v_f, h_f)\). Repeat this 100,000 times and store the \((v_f,h_f)\) configurations into a histogram. Also make separate histograms of \(v_f\) and \(h_f\).
Compare the \((v_f,h_f)\)-histogram with the analytical result for \(p(v,h)\).
Compare the \(v_f\) histogram with the analytic result for \(p(v)\).
Compare the \(h_f\) histogram with the analytic result for \(p(h)\)
Run this a couple different times to make sure everything agrees.
Comment on implementation: When writing up your RBM, it is possible to perform all of the calculations using for
loops. However, if you are willing, it is worth trying to perform the calculations with matrix-vector multiplications. This can greatly simplify your code, since nested for loops can be replaced with a few lines of matrix-vector manipulations, and can speed up your code significantly (especially in python), since matrix-vector libraries are highly optimized. While the C++
standard library does not include matrix-vector functionality, there is a (relatively) easy-to-use library called Eigen that allows you to work with matrices in C++
. A handy reference for Eigen commands, and their equivalent commands in Matlab, is available here.
Note that you can represent \(v_i\) as a row vector and \(h_j\) as a column vector.
Grading
Write Gibbs sampling code that samples the hidden and visible layers of your RBM. For a small \(N_h=2\), \(N_v=5\) RBM, compute the probability distributions \(p(v,h), p(v), p(h), p(v|h)\) and \(p(h|v)\) using Gibbs sampling and by brute-force calculation using the explicit formulas for these probabilities that we gave above. Make sure that the approximate RBM probabilities and the exact brute-force probabilities agree within 0.01.
Unsupervised Learning and Training an RBM#
The task that restricted Boltzmann machines are typically used for is unsupervised learning. The goal of unsupervised learning is to train a machine to learn the probability distribution underlying a data set. To accomplish this goal, the machine is trained on unlabeled data, such as raw images without any captions or identifying information. By training on a large enough data set of high quality samples, the machine will hopefully be able to learn the correlations in the data set. Ultimately, we would like the machine to be able to generalize from the data it has seen. A well-trained unsupervised learning model should be able to generate new data samples that have the same features as the original training data. For this reason, such models are often called generative models.
Your next big step is to train your RBM to learn a probability distribution, called the data distribution, from a data set of samples of the distribution, called the training data. During training, the RBM’s parameters, the weights \(W_{ij}\) and biases \(a_i,b_j\), are tuned so that the marginal distribution of the visible spins \(p(v)\) matches the data distribution represented by the training data set. After successful training, sampling the visible layer of your RBM should (ideally) produces new samples that are also from the same underlying distribution of the training data.
In general it might be hard to exactly reproduce the data distribution, so instead we aim to get a probability distribution that is “close.” Let us define a measure of the “close-ness” of two distributions \(p(v)\) and \(q(v)\) as
In the information theory literature, \(O\) is called the Kullback–Leibler divergence, or the relative entropy of the distributions. In our case, \(p(v)\) is the marginal distribution of the RBM and \(q(v)\) is the data distribution. The function \(O\), which is a function of the RBM parameters since it depends on \(p(v)\), is the objective function that we seek to optimize to learn the data distribution.
We will minimize \(O\) locally via the method of gradient descent. Gradient descent finds local minima of an objective function using the function’s derivatives. Specifically, it starts at a particular point in the objective function’s parameter space, computes the objective function’s derivatives with respect to its parameters at that point, and then “steps” to a new point in the parameter space using the derivatives’ values. Because the derivatives decrease in magnitude as a minimum is approached, local minima are found by iterating the stepping procedure.
Gradient descent is performed by repeatedly updating the parameters as follows:
where \(\eta\) is an arbitrary parameter called the learning rate, or step size. The learning rate parameter affects the convergence of gradient descent and is generally chosen through trial and error.
Computing Gradients#
To perform gradient descent, we need to evaluate the gradients of the objective function \(O\) with respect to the parameters. For our particular objective function, the relative entropy between the data distribution and the RBM marginal \(p(v)\) distribution, the gradients turn out to have the following form
where \(\langle f(v,h) \rangle_{\textrm{RBM}|\textrm{data}} = \sum_{v,h} f(v,h) p(h|v) q(v)\) represents an average of the function \(f\) over the RBM, given that the visible variables are distributed according to the training data distribution \(q(v)\); and \(\langle f(v,h) \rangle_{\textrm{RBM}} = \sum_{v,h} f(v,h) p(v,h)\) represents an average over the RBM’s joint distribution \(p(v,h)\) without any conditioning on the visible variables. See here for notes deriving these derivatives.
More accurate derivatives
You can improve this derivative slightly by instead of evaluating terms like \(\langle v_i h_j\rangle\) by averaging over many samples of \(h\) (for the same value of \(v\)). In fact, this average can be taken analytically by just replacing \(h\) with \(p(h|v)\).
(We could have alternatively come to this conclusion by thinking of the derivatives as being made up of terms like \(q(v) \frac{\partial F(v)}{\partial W_{ij}}\) where \(F(v)\) is the free-energy.)
You are free to make this change in your code (or not). You can do this by adding a SampleHiddenProb
which returns the probabilities of the hidden spins and not a sample from them. Just be careful to only use this for the final sample of the hidden spins and not all of them!
Putting it together (and mini-batches)#
Training the RBM with gradient descent amounts to approximately computing these expectation values by performing Gibbs sampling and updating the RBM parameters using the approximate gradients.
A pseudocode of the gradient descent calculation is as follows:
Repeat many times:
Initialize the gradient variables:
dW=0; da=0; db=0
.Sample \(M\) configurations \(v^1, ...., v^M\) (called a mini-batch)
For each configuration:
Sample \(h\) from \(p(h|v)\) Update the gradient variables.
dW -= outerProduct(v,h)
da -= v
db -= h
Perform \(k\) iterations of Gibbs sampling: sample the visible and hidden variables back and forth \(v_0 \rightarrow h_1 \rightarrow v_1 \rightarrow h_2 \rightarrow \cdots \rightarrow h_k \rightarrow v_k\). Using the final spin configuration \((v_k, h_k)\), update the gradient variables:
dW += outerProduct(v,h)
da += v
db += h
Change the parameters by \(\eta\) dW/M; \(\eta\) da/M; \(\eta\) db/M
Beyond contrastive-divergence
Currently we are sampling back and forth in the Gibbs sampling \(k\)-times. This is called \(CD_k\). It is only in the large-k limit that we get the correct derivative (but in practice lower \(k\) often works better because of cancellation of errors). There is an alternative. Instead of Gibbs-sampling back and forth \(k\) times from the visible spins generated by data, for the second term in the derivative (step 4), we can sample back and forth from some other visible spins (visible_step4
) that has nothing to do with the actual data. We simply leave visible_step4
wherever it was in the previous step. This is somewhat more principled and is supposed to give better results but takes a lot longer to train (and you have to turn down the learning-rate). This approach goes under the name persistent chains. I don’t recommend using this, but if you want to explore it, I’d like to know where you found it helped and what you had to do to make it work.
Comments about mini-batches:
Part of the description requires us to choose \(M\) configurations from the visible data. Usually what people do if they have a dataset is to shuffle the dataset and then choose
data[0:M]
on the first mini-batch; choosedata[M:2M]
on the second mini-batch; etc. After they have looped over all the data they then reshuffle.The pseudo-code above essentially suggests a for-loop over the \(M\) configurations of the mini-batch. In practice, it is more efficient (especially in python) if you can compute the entire mini-batch with matrices (and no for-loops). This will be very important (in python) if you’re going to go on and do a MNIST or protein example below.
Testing#
Testing
To convince yourself that you have a working RBM code, you need to perform some tests. A good way to test a complicated method such as the RBM training algorithm is to run the algorithm on a small example that you can check by hand or with another method. For example, you can define an arbitrary probability distribution of three binary variables and train an RBM with \(N_v=3\) visible spins to match the distribution.
Here’s one way to go about producing an arbitrary probability distribution (in C++):
std::random_device rd;
std::mt19937 mt;
vector<int> ranInts;
std::uniform_int_distribution<> ran_int(0,100);
for (//loop over binary numbers in visible layers)
ranInts.push_back(ran_int(mt));
}
// At this point ranInts of [2,1,4,2] means there is a 2/8 chance of 00, a 1/8 chance of 01, a 4/8 chance of 10, and a 2/8 chance of 11
std::discrete_distribution<> d(ranInts.begin(),ranInts.end());
myVisibleLayer=binary_nums[d(mt)]; //assuming binary_nums is a vector with binary numbers in it
// this is the probability distribution you are training over.
and in python
prob_dist=np.random.ranf(32)
prob_dist=prob_dist/np.sum(prob_dist)
samples=np.random.choice(range(0,32),p=prob_dist,size=100000)
### Below just for plotting the distribution you got
plt.plot(prob_dist)
plt.hist(samples,density=True,bins=32)
plt.show()
You should be able to analytically compute the probability distribution and sample it and check that it matches the RBM \(p(v)\) distribution after the RBM is trained.
Some things you can check to debug your code:
Compute the value of the objective function at each iteration of the gradient descent. You should expect the objective function to decrease with each iteration.
Compute the gradients using the finite difference approximation \(\partial O/ W_{ij} \approx (O(W_{ij} + \delta) - O(W_{ij}))/\delta\) for small \(\delta\) and check that the finite difference gradients approximately match the estimated gradients obtained from Gibbs sampling when using a lot of samples.
For comparison, I used the following parameters for my RBM training:
\(k=1\) (this is a typical but non-trivial approximation)
\(\eta=0.1\) (note: this is the step for the average gradient over your mini-batch not the sum. If you’re summing, you need to divide by M)
\(M=64\)
I performed 30 epochs of a distribution of size 100000.
I used 3 visible and 5 hidden neurons.
Grading
Implement the RBM training algorithm. Using this algorithm, train your RBM and show that it can successfully learn a simple toy probability distribution.
Free Energy#
Recall that we are trying to optimize the KL-divergence: \( q \ln q - q \ln p\). The term \(q \ln q\) doesn’t change. Therefore, we want to optimize that second term. We can estimate how well we are doing by measuring the difference in free energy between the visible spins before sampling and after sampling.
Recall that the free energy is defined as
So that,
We can actually do this sum explicitly giving us (if you have spins that are -1 and 1)
or (if you have spins that are 1 and 0)
Grading
Write two functions that compute the free energy in both these ways. Do some tests to verify that they give the same result for some choice of \(v\).
Now, run your RBM for 10 epochs and measure the average change in the free energy as a function of epoch. Make a graph of this and paste it into your document (as a word of warning, this estimator is very noisy so it might not always look monotonic as it should be).
Applications of RBM#
In the next part of this assignment, you will choose an application to apply your RBM too. In most cases, this will involve making modifications to help improve the efficiency of your RBM. Choosing one of these applications will get you the extra 10 points you need for the assignment (90 up to here + 10 for the application). You can do additional applications for an extra 5 points per application.
MNIST 10 points
Protein Sequences (10 points extra credit)
Other interesting examples to do (but no instructions) include:
Ising models (This is sortof a meta-example. An RBM is an Ising model learning an ising model).
Quantum Circuits and Variational Wave-functions: We do some like this actively in my group. If you are interested, I’m happy to chat about it.