Restricted Boltzmann Machines
In this section, we will learn how to build and train a restricted Boltzmann machine (RBM) to solve a simple engineering task.
The restricted Boltzmann machine is an example of a physics-inspired method that is useful for engineering problems. At a high level, RBMs are probability distributions for binary data with many tunable parameters. When “training” an RBM, we tune its parameters to make the RBM’s probability distribution match the distribution of a training data set. Our goal in this unit is to make and train an RBM whose probability distribution matches the distribution of images in the MNIST handwritten digit data set.
Like a Hopfield network, a restricted Boltzmann machine is a particular type of Ising model. We saw that a Hopfield network can be thought of as a zero-temperature Ising model which equilibrates to a local energy minimum. Similarly, a restricted Boltzmann machine can be thought of as a finite-temperature Ising model with a particular connection structure between its spins. Because of their relation to statistical physics, RBMs are more interpretable than some other machine learning approaches. And because of their particular structure, they can be efficiently trained.
A restricted Boltzmann machine is a probability distribution over binary variables, which, like in the Hopfield network, can be interpreted as spins or neurons. In an RBM, the binary variables are separated into two types, “visible” spins $v = (v_1, \ldots, v_{N_v})$ and “hidden” spins $h = (h_1, \ldots, h_{N_h})$. The probability of a configuration of spins is given by a Boltzmann distribution,

$$P(v,h) = \frac{e^{-E(v,h)}}{Z},$$

where

$$E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{ij} v_i W_{ij} h_j$$

is the energy of the visible and hidden spins, $a_i$ and $b_j$ are the visible and hidden biases, $W_{ij}$ is the weight coupling visible spin $i$ to hidden spin $j$, and $Z = \sum_{v,h} e^{-E(v,h)}$ is the normalization (partition function).
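To make the notation concrete, here is a minimal sketch (in Python/NumPy, with made-up sizes and parameter values; the names `energy`, `W`, `a`, `b` are just illustrative choices) of how the energy and the unnormalized Boltzmann weight of one configuration could be computed:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # weights W_ij coupling visible spin i to hidden spin j
a = rng.normal(size=3)        # visible biases a_i
b = rng.normal(size=2)        # hidden biases b_j

v = np.array([1, 0, 1])       # one visible configuration (0/1 spins)
h = np.array([0, 1])          # one hidden configuration

def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v.W.h"""
    return -a @ v - b @ h - v @ W @ h

E = energy(v, h, W, a, b)
print("E(v,h) =", E, "  unnormalized weight exp(-E) =", np.exp(-E))
```

The normalization $Z$ would require summing $e^{-E(v,h)}$ over all $2^{N_v + N_h}$ configurations, which is only feasible for small RBMs.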
Our goal is to write an RBM code and make it “learn” a probability distribution from a training data set. Once trained on this data, the machine will be able to produce new samples from the data’s distribution (or its best estimate).
RBM as an Ising Model
To summarize, an RBM is really just an Ising model.
| Restricted Boltzmann Machine | Ising Model |
|---|---|
| Neuron | Ising spin |
| Neuron “on” | Spin up |
| Neuron “off” | Spin down |
The energy of the corresponding Ising model is

$$E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{ij} v_i W_{ij} h_j,$$

where the biases $a_i$ and $b_j$ play the role of local magnetic fields and the weights $W_{ij}$ are the couplings between visible and hidden spins.
When we sample an RBM, we obtain the spin configuration $(v,h)$ with probability $P(v,h) = e^{-E(v,h)}/Z$, just as when sampling a finite-temperature Ising model. There are no couplings within a layer: visible spins couple only to hidden spins and vice versa. This bipartite (“restricted”) connection structure is what makes efficient sampling and training possible.
Understanding Probability Distributions
There are a number of probability distributions that we need to be familiar with to understand RBMs. These probabilities characterize the frequency with which we observe configurations of visible and hidden spins when sampling an RBM.
The first and most important is the joint probability

$$P(v,h) = \frac{e^{-E(v,h)}}{Z},$$

the probability of observing the visible spins in configuration $v$ and, at the same time, the hidden spins in configuration $h$. This probability distribution contains all of the information about how the variables $v$ and $h$ are distributed and correlated.

Next are the marginal probabilities

$$P(v) = \sum_h P(v,h),$$

the probability of observing the visible spins in configuration $v$, regardless of what the hidden spins are doing. Likewise, the marginal distribution for the hidden spins is

$$P(h) = \sum_v P(v,h).$$

Finally, we discuss the conditional probabilities

$$P(h|v) = \frac{P(v,h)}{P(v)},$$

the probability of observing the hidden configuration $h$ given that the visible spins are fixed in configuration $v$. Likewise,

$$P(v|h) = \frac{P(v,h)}{P(h)}.$$

Notice how closely related all of these probability distributions are. For example, the marginal and conditional probabilities are all directly related to the joint distribution $P(v,h)$.

In general, joint, marginal, and conditional probability distributions are related by a simple formula. Suppose that we have a system split into subsystems $A$ and $B$. Then

$$P(A,B) = P(A|B)\,P(B) = P(B|A)\,P(A).$$
You should check why this formula is true from the definitions of conditional and marginal probabilities.
This relation forms the basis of Bayes’ theorem,

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)},$$

which follows directly because the joint distribution is symmetric, $P(A,B) = P(B,A)$.
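If you want to see these identities in action, here is a small numerical sketch (an illustrative check, not part of the assignment) that builds an arbitrary joint distribution over two binary variables and verifies the product rule and Bayes’ theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
P_AB = rng.random((2, 2))          # arbitrary joint distribution P(A,B); rows index A, columns index B
P_AB /= P_AB.sum()                 # normalize

P_A = P_AB.sum(axis=1)             # marginal P(A) = sum_B P(A,B)
P_B = P_AB.sum(axis=0)             # marginal P(B) = sum_A P(A,B)
P_A_given_B = P_AB / P_B           # conditional P(A|B) = P(A,B) / P(B)
P_B_given_A = P_AB / P_A[:, None]  # conditional P(B|A) = P(A,B) / P(A)

assert np.allclose(P_AB, P_A_given_B * P_B)                        # P(A,B) = P(A|B) P(B)
assert np.allclose(P_A_given_B, P_B_given_A * P_A[:, None] / P_B)  # Bayes' theorem
print("product rule and Bayes' theorem verified")
```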
Why are marginal and conditional probabilities important? Why did we not consider them in the Ising model unit?
The variables in an RBM are by design split into two groups, visible and hidden variables. When training an RBM, the visible variables are associated with data, such as the pixels in images, while the hidden variables are simply extra variables in the model, sometimes called auxiliary or latent variables. Because of the special interpretation of the visible variables, it is natural to consider the probability of just the visible spins, i.e., the marginal distribution $P(v)$, which is the distribution we want to match to the data.
An important technical detail about RBMs is that, because of the bipartite structure of the RBM, there is an efficient way to sample spin configurations using the conditional probabilities $P(h|v)$ and $P(v|h)$; this is the Gibbs sampling procedure described below.
In the Ising model unit, there was only one type of spin and there was no reason to split the spins into different subsystems. So we never needed to consider probability distributions that related different subsystems.
Building and Sampling an RBM
Your first goal is to set up the basic code structure for your RBM. You will want to store the states of your visible and hidden variables, as well as the values of your parameters, the weight matrix and the biases. Note that the weight matrix is generally rectangular, as there may be different numbers of visible and hidden neurons: with $N_v$ visible and $N_h$ hidden neurons, $W$ is an $N_v \times N_h$ matrix, the visible bias $a$ has length $N_v$, and the hidden bias $b$ has length $N_h$.
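As one possible starting point, here is a minimal Python/NumPy sketch of such a container (the class and attribute names are illustrative, not a required interface; 0/1 spins are assumed):

```python
import numpy as np

class RBM:
    """Holds the spin states and parameters of a restricted Boltzmann machine."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        # parameters
        self.W = 0.01 * self.rng.normal(size=(n_visible, n_hidden))  # rectangular weight matrix
        self.a = np.zeros(n_visible)                                 # visible biases
        self.b = np.zeros(n_hidden)                                  # hidden biases
        # current spin configuration
        self.visible = self.rng.integers(0, 2, size=n_visible)
        self.hidden = self.rng.integers(0, 2, size=n_hidden)

rbm = RBM(n_visible=3, n_hidden=5)
print(rbm.W.shape, rbm.a.shape, rbm.b.shape)   # (3, 5) (3,) (5,)
```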
Next, we want to learn how to sample configurations from our RBM, given the weights and biases, which we will need later in our training. Our goal is to produce spin configurations $(v,h)$ that occur with probability $P(v,h) = e^{-E(v,h)}/Z$.

The efficient Monte Carlo method that we will use to sample the joint distribution $P(v,h)$ is called (block) Gibbs sampling. It works as follows:
1. Given the current visible layer $v$, probabilistically choose a configuration for the hidden layer $h$.
2. Given the chosen $h$, probabilistically choose a configuration for $v$.
3. Repeat steps 1 and 2 $k$ times.
This process is a Markov chain, and therefore has a stationary distribution. To produce the desired distribution, we must probabilistically choose new configurations of the hidden or visible layers in such a way that the stationary distribution is proportional to $e^{-E(v,h)}$. This is accomplished by choosing new configurations according to the conditional probabilities:
1. Given $v$, choose a new $h$ with probability $P(h|v)$.
2. Given $h$, choose a new $v$ with probability $P(v|h)$.
Step 1. Because of the bipartite structure of the RBM, the conditional probability has a particularly nice, factorable form. For the hidden spins conditioned on the visible spins, it is

$$P(h|v) = \prod_j P(h_j|v),$$

where (for $0/1$ spins)

$$P(h_j = 1|v) = \sigma\!\left(b_j + \sum_i v_i W_{ij}\right), \qquad \sigma(x) \equiv \frac{1}{1+e^{-x}}.$$

(For $\pm 1$ spins, the argument of the sigmoid is multiplied by 2.) It is helpful to check this yourself, by using the definition of the conditional probability, $P(h|v) = P(v,h)/P(v)$, together with the form of the energy $E(v,h)$.

Because $P(h|v)$ factorizes, each hidden spin can be sampled independently: for each $j$, draw a uniform random number $r \in [0,1)$ and set $h_j = 1$ if $r < P(h_j = 1|v)$, and $h_j = 0$ otherwise.

Step 2. The exact same logic applies. The conditional probability factorizes as

$$P(v|h) = \prod_i P(v_i|h),$$

where

$$P(v_i = 1|h) = \sigma\!\left(a_i + \sum_j W_{ij} h_j\right).$$

To sample $v$ from $P(v|h)$, sample each visible spin independently in the same way as the hidden spins.

One iteration of step 1 generates samples from $P(h|v)$, one iteration of step 2 generates samples from $P(v|h)$, and alternating the two steps many times generates configurations $(v,h)$ from the joint distribution $P(v,h)$.
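For reference, here is a sketch of these two sampling steps in Python/NumPy (assuming 0/1 spins and parameter arrays `W`, `a`, `b` shaped as described above; the function names are just one possible choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    """Step 1: draw h from P(h|v); each h_j is an independent biased coin flip."""
    p_h = sigmoid(b + v @ W)                              # P(h_j = 1 | v)
    return (rng.random(p_h.shape) < p_h).astype(int), p_h

def sample_visible(h, W, a, rng):
    """Step 2: draw v from P(v|h)."""
    p_v = sigmoid(a + W @ h)                              # P(v_i = 1 | h)
    return (rng.random(p_v.shape) < p_v).astype(int), p_v

def gibbs(v, W, a, b, k, rng):
    """Alternate steps 1 and 2 k times; return the final configuration (v, h)."""
    for _ in range(k):
        h, _ = sample_hidden(v, W, b, rng)
        v, _ = sample_visible(h, W, a, rng)
    return v, h

# Example usage with small random parameters
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(3, 5)); a = np.zeros(3); b = np.zeros(5)
v, h = gibbs(rng.integers(0, 2, size=3), W, a, b, k=10, rng=rng)
```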
Your goal is to implement Gibbs sampling. Since it is essential for the training algorithm, you need to make sure that it is working correctly by testing your generated samples against analytic results. We can do this check for a small RBM.
Testing
To check your Gibbs sampling, follow the steps below.
1. Set up an RBM with a small number of hidden neurons and visible neurons (small enough that you can enumerate all configurations by brute force). Choose random initial states for the visible and hidden neurons. Choose random weights and biases.
2. For the given $v$, sample the hidden layer spins 100,000 times and store a histogram of the resulting spin configurations. To do this I did something like (in C++):
std::map<std::string,double> probs;
double norm = 0;
for (int step = 0; step < 100000; step++) {   // loop 100,000 times
  // ... sample the hidden layer from P(h|v) here ...
  std::string myString = rbm.toString(rbm.hidden);
  if (probs.find(myString) == probs.end())
    probs.insert(std::make_pair(myString, 0.0));
  probs[myString] += 1;
  norm += 1;
}
// dividing probs[s] by norm gives the sampled estimate of P(h = s | v)
3. Compare this histogram with the analytical result for $P(h|v)$ obtained by brute-force calculation (see the sketch after this list). My largest error was about 0.003.
4. For a given $h$, sample the visible layer spins 100,000 times and store a histogram of the results.
5. Compare this with the analytical result for $P(v|h)$.
6. Starting from a random initial spin configuration $(v,h)$, perform several iterations of Gibbs sampling and obtain the final spin configuration $(v,h)$. Repeat this 100,000 times and store the configurations into a histogram. Also make separate histograms of $v$ and $h$.
7. Compare the $(v,h)$ histogram with the analytical result for $P(v,h)$.
8. Compare the $v$ histogram with the analytic result for $P(v)$.
9. Compare the $h$ histogram with the analytic result for $P(h)$.
10. Run this a couple different times to make sure everything agrees.
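For the brute-force side of the comparison, one possible approach (a sketch for 0/1 spins, following the illustrative conventions above) is to enumerate every configuration of a small RBM explicitly:

```python
import itertools
import numpy as np

def brute_force_probs(W, a, b):
    """Exact P(v,h), P(v), P(h), and P(h|v) of a small RBM by full enumeration (0/1 spins)."""
    n_v, n_h = W.shape
    vs = [np.array(c) for c in itertools.product([0, 1], repeat=n_v)]
    hs = [np.array(c) for c in itertools.product([0, 1], repeat=n_h)]
    weights = np.array([[np.exp(a @ v + b @ h + v @ W @ h) for h in hs] for v in vs])
    P_vh = weights / weights.sum()      # joint P(v,h); rows = visible configs, columns = hidden configs
    P_v = P_vh.sum(axis=1)              # marginal P(v)
    P_h = P_vh.sum(axis=0)              # marginal P(h)
    P_h_given_v = P_vh / P_v[:, None]   # conditional P(h|v), one row per visible configuration
    return P_vh, P_v, P_h, P_h_given_v
```

The normalized histograms produced by your Gibbs sampler should agree with these exact values to within the statistical error of 100,000 samples.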
Comment on implementation: When writing up your RBM, it is possible to perform all of the calculations using for loops. However, if you are willing, it is worth trying to perform the calculations with matrix-vector multiplications. This can greatly simplify your code, since nested for loops can be replaced with a few lines of matrix-vector manipulations, and can speed up your code significantly (especially in python), since matrix-vector libraries are highly optimized. While the C++ standard library does not include matrix-vector functionality, there is a (relatively) easy-to-use library called Eigen that allows you to work with matrices in C++. A handy reference for Eigen commands, and their equivalent commands in Matlab, is available here.
Note that you can represent the visible and hidden layers as vectors and the weights as a matrix, so that, for example, the arguments of the conditional probabilities become the components of $W^T v + b$ (for the hidden layer) and $W h + a$ (for the visible layer).
Grading
Write Gibbs sampling code that samples the hidden and visible layers of your RBM.
For a small RBM, verify your sampler by comparing histograms of your sampled configurations against the exact probabilities $P(h|v)$, $P(v|h)$, $P(v,h)$, $P(v)$, and $P(h)$ computed by brute force, as described above.
Unsupervised Learning and Training an RBM
The task that restricted Boltzmann machines are typically used for is unsupervised learning. The goal of unsupervised learning is to train a machine to learn the probability distribution underlying a data set. To accomplish this goal, the machine is trained on unlabeled data, such as raw images without any captions or identifying information. By training on a large enough data set of high quality samples, the machine will hopefully be able to learn the correlations in the data set. Ultimately, we would like the machine to be able to generalize from the data it has seen. A well-trained unsupervised learning model should be able to generate new data samples that have the same features as the original training data. For this reason, such models are often called generative models.
Your next big step is to train your RBM to learn a probability distribution, called the data distribution, from a data set of samples of the distribution, called the training data. During training, the RBM’s parameters, the weights $W_{ij}$ and the biases $a_i$ and $b_j$, are adjusted so that the RBM’s marginal distribution $P(v)$ over the visible spins reproduces the data distribution as closely as possible.
In general it might be hard to exactly reproduce the data distribution, so instead we aim to get a probability distribution that is “close.” Let us define a measure of the “close-ness” of two distributions $P_{\rm data}$ and $P$:

$$D_{KL}(P_{\rm data} \,\|\, P) = \sum_v P_{\rm data}(v)\,\log\frac{P_{\rm data}(v)}{P(v)} \;\geq\; 0,$$

which is zero only when the two distributions are identical.

In the information theory literature, this quantity is known as the Kullback–Leibler (KL) divergence, or relative entropy.

We will minimize $D_{KL}(P_{\rm data} \,\|\, P)$ with respect to the RBM parameters $W$, $a$, and $b$ using gradient descent.

Gradient descent is performed by repeatedly updating the parameters as follows:

$$W_{ij} \to W_{ij} - \eta\,\frac{\partial D_{KL}}{\partial W_{ij}}, \qquad a_i \to a_i - \eta\,\frac{\partial D_{KL}}{\partial a_i}, \qquad b_j \to b_j - \eta\,\frac{\partial D_{KL}}{\partial b_j},$$

where $\eta$ is a small positive number called the learning rate (or step size).
Computing Gradients
To perform gradient descent, we need to evaluate the gradients of the objective function $D_{KL}$ with respect to the parameters. They take the form

$$\frac{\partial D_{KL}}{\partial W_{ij}} = -\left(\langle v_i h_j\rangle_{\rm data} - \langle v_i h_j\rangle_{\rm model}\right), \qquad
\frac{\partial D_{KL}}{\partial a_i} = -\left(\langle v_i\rangle_{\rm data} - \langle v_i\rangle_{\rm model}\right), \qquad
\frac{\partial D_{KL}}{\partial b_j} = -\left(\langle h_j\rangle_{\rm data} - \langle h_j\rangle_{\rm model}\right),$$

where $\langle\cdot\rangle_{\rm data}$ denotes an average over visible configurations $v$ drawn from the training data, with $h$ sampled from $P(h|v)$, and $\langle\cdot\rangle_{\rm model}$ denotes an average over configurations $(v,h)$ sampled from the RBM’s joint distribution $P(v,h)$.
More accurate derivatives
You can improve this derivative slightly: instead of evaluating terms like $\langle v_i h_j\rangle$ with the sampled binary values of the hidden spins, use the conditional probabilities $P(h_j = 1|v)$ in place of the sampled $h_j$ (i.e., replace each hidden spin by its conditional expectation).

(We could have alternatively come to this conclusion by thinking of the derivatives as being made up of terms like $\sum_v P(v)\, v_i\, \langle h_j\rangle_{P(h|v)}$, in which the average over the hidden spin can be done analytically.)

You are free to make this change in your code (or not). You can do this by adding a function SampleHiddenProb which returns the probabilities of the hidden spins rather than a binary sample of them. Just be careful to only use this for the final sample of the hidden spins and not all of them!
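In the NumPy sketch above, this amounts to a variant of `sample_hidden` that returns the probabilities rather than binary samples (again, the name is illustrative):

```python
import numpy as np

def sample_hidden_prob(v, W, b):
    """Return P(h_j = 1 | v) for every hidden spin, instead of a binary sample."""
    return 1.0 / (1.0 + np.exp(-(b + v @ W)))
```

Use it only when accumulating the gradient from the final hidden update; within the Gibbs chain itself, keep sampling binary spins.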
Putting it together (and mini-batches)
Training the RBM with gradient descent amounts to approximately computing these expectation values by performing Gibbs sampling and updating the RBM parameters using the approximate gradients.
A pseudocode of the gradient descent calculation is as follows:
Repeat many times:

1. Initialize the gradient variables: dW=0; da=0; db=0.
2. Sample $M$ configurations $v$ from the training data (called a mini-batch).
3. For each configuration:
   - Sample $h$ from $P(h|v)$. Update the gradient variables:
     dW -= outerProduct(v,h)
     da -= v
     db -= h
   - Perform $k$ iterations of Gibbs sampling: sample the visible and hidden variables back and forth ($v \to h \to v \to \cdots$). Using the final spin configuration $(v,h)$, update the gradient variables:
     dW += outerProduct(v,h)
     da += v
     db += h
4. Change the parameters by the learning rate times the average gradients:
   W -= eta*dW/M; a -= eta*da/M; b -= eta*db/M
Beyond contrastive-divergence
Currently we are sampling back and forth in the Gibbs sampling starting from the data in every update (this is contrastive divergence). An alternative is to instead use, for the model average, a visible configuration (visible_step4) that has nothing to do with the actual data. We simply leave visible_step4 wherever it was in the previous step, so the Gibbs chain persists across gradient-descent iterations. This is somewhat more principled and is supposed to give better results but takes a lot longer to train (and you have to turn down the learning rate). This approach goes under the name persistent chains. I don’t recommend using this, but if you want to explore it, I’d like to know where you found it helped and what you had to do to make it work.
Comments about mini-batches:
- Part of the description requires us to choose $M$ configurations from the visible data. Usually what people do if they have a dataset is to shuffle the dataset and then choose data[0:M] on the first mini-batch, data[M:2M] on the second mini-batch, etc. After they have looped over all the data (one epoch), they reshuffle.
- The pseudocode above essentially suggests a for-loop over the $M$ configurations of the mini-batch. In practice, it is more efficient (especially in python) if you can compute the entire mini-batch with matrices (and no for-loops); see the sketch below. This will be very important (in python) if you’re going to go on and do the MNIST or protein example below.
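To illustrate the matrix version, here is a sketch of one mini-batch update in Python/NumPy (0/1 spins; each row of `V_data` is one configuration of the mini-batch; the names and the use of hidden probabilities in the gradient are illustrative choices consistent with the notes above, not a required implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V_data, W, a, b, k, eta, rng):
    """One gradient-descent step on a mini-batch V_data of shape (M, n_visible)."""
    M = V_data.shape[0]

    # Data term: sample the hidden layer for the whole mini-batch at once
    pH_data = sigmoid(b + V_data @ W)                          # (M, n_hidden)
    H = (rng.random(pH_data.shape) < pH_data).astype(float)

    # Model term: k iterations of Gibbs sampling, starting from the data
    V = V_data.copy()
    for _ in range(k):
        pV = sigmoid(a + H @ W.T)
        V = (rng.random(pV.shape) < pV).astype(float)
        pH = sigmoid(b + V @ W)
        H = (rng.random(pH.shape) < pH).astype(float)

    # Average gradients over the mini-batch (hidden probabilities used for the final samples)
    dW = -(V_data.T @ pH_data - V.T @ pH) / M
    da = -(V_data - V).mean(axis=0)
    db = -(pH_data - pH).mean(axis=0)

    # Gradient-descent step
    W -= eta * dW
    a -= eta * da
    b -= eta * db
    return W, a, b
```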
Testing
Testing
To convince yourself that you have a working RBM code, you need to perform some tests.
A good way to test a complicated method such as the RBM training algorithm is to run the algorithm on a small example that you can check by hand or with another method. For example, you can define an arbitrary probability distribution of three binary variables and train an RBM with three visible neurons (and a handful of hidden neurons) on samples drawn from this distribution.
Here’s one way to go about producing an arbitrary probability distribution (in C++):
std::random_device rd;
std::mt19937 mt(rd());
std::vector<int> ranInts;
std::uniform_int_distribution<> ran_int(0,100);
for (size_t i = 0; i < binary_nums.size(); i++) { // loop over binary numbers in the visible layer
  ranInts.push_back(ran_int(mt));
}
// At this point ranInts of [2,1,4,2] means there is a 2/9 chance of 00, a 1/9 chance of 01,
// a 4/9 chance of 10, and a 2/9 chance of 11 (each weight is divided by the sum of the weights)
std::discrete_distribution<> d(ranInts.begin(),ranInts.end());
myVisibleLayer=binary_nums[d(mt)]; //assuming binary_nums is a vector with the binary numbers in it
// this is the probability distribution you are training over.
and in python
import numpy as np
import matplotlib.pyplot as plt

prob_dist=np.random.ranf(32)              # one random weight per visible configuration
prob_dist=prob_dist/np.sum(prob_dist)     # normalize to a probability distribution
samples=np.random.choice(range(0,32),p=prob_dist,size=100000)
### Below just for plotting the distribution you got
plt.plot(prob_dist)
plt.hist(samples,density=True,bins=32)
plt.show()
You should be able to analytically compute the probability distribution, sample it, and check that it matches the marginal distribution $P(v)$ of your trained RBM.
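One way to make this comparison (a sketch for 0/1 spins, using brute-force enumeration, which is fine at this size) is to compute the RBM’s marginal distribution $P(v)$ exactly and compare it entry by entry with the target distribution (e.g., `prob_dist` from the Python snippet above):

```python
import itertools
import numpy as np

def rbm_marginal(W, a, b):
    """Exact marginal P(v) of a small RBM, obtained by summing e^{-E(v,h)} over all hidden configurations."""
    n_v, n_h = W.shape
    vs = [np.array(c) for c in itertools.product([0, 1], repeat=n_v)]
    hs = [np.array(c) for c in itertools.product([0, 1], repeat=n_h)]
    weights = np.array([sum(np.exp(a @ v + b @ h + v @ W @ h) for h in hs) for v in vs])
    return weights / weights.sum()   # ordered the same way as `vs` (binary counting order)

# e.g. plt.plot(rbm_marginal(rbm.W, rbm.a, rbm.b)); plt.plot(prob_dist)
```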
Some things you can check to debug your code:
Compute the value of the objective function at each iteration of the gradient descent. You should expect the objective function to decrease with each iteration.
Compute the gradients using the finite-difference approximation

$$\frac{\partial D_{KL}}{\partial \theta} \approx \frac{D_{KL}(\theta + \epsilon) - D_{KL}(\theta)}{\epsilon}$$

for small $\epsilon$, and check that the finite-difference gradients approximately match the estimated gradients obtained from Gibbs sampling when using a lot of samples (see the sketch below).
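A sketch of this finite-difference check (assuming you already have a brute-force function `kl_divergence(W, a, b)` for the toy distribution and a sampled-gradient routine `estimate_gradients(W, a, b)`; both names are hypothetical placeholders for your own functions):

```python
import numpy as np

def finite_difference_dW(kl_divergence, W, a, b, i, j, eps=1e-4):
    """Finite-difference estimate of dD_KL/dW_ij."""
    W_plus = W.copy()
    W_plus[i, j] += eps
    return (kl_divergence(W_plus, a, b) - kl_divergence(W, a, b)) / eps

# Compare element by element against the Gibbs-sampled estimate, e.g.:
# dW_sampled, da_sampled, db_sampled = estimate_gradients(W, a, b)  # use many samples
# print(finite_difference_dW(kl_divergence, W, a, b, 0, 0), dW_sampled[0, 0])
```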
For comparison, I used the following parameters for my RBM training:

- A small number of Gibbs-sampling iterations $k$ per update (this is a typical but non-trivial approximation).
- A small learning rate $\eta$ (note: this is the step for the average gradient over your mini-batch, not the sum; if you’re summing, you need to divide by $M$).
- I performed 30 epochs over a data set of size 100,000.
- I used 3 visible and 5 hidden neurons.
Grading
Implement the RBM training algorithm. Using this algorithm, train your RBM and show that it can successfully learn a simple toy probability distribution.
Free Energy
Recall that we are trying to optimize the KL divergence:

$$D_{KL}(P_{\rm data} \,\|\, P) = \sum_v P_{\rm data}(v)\,\log\frac{P_{\rm data}(v)}{P(v)}.$$

Recall that the free energy is defined as

$$F(v) = -\log \sum_h e^{-E(v,h)},$$

so that

$$P(v) = \sum_h P(v,h) = \frac{e^{-F(v)}}{Z}.$$

We can actually do this sum explicitly, giving us (if you have spins that are -1 and 1)

$$F(v) = -\sum_i a_i v_i - \sum_j \log\left[2\cosh\!\left(b_j + \sum_i v_i W_{ij}\right)\right]$$

or (if you have spins that are 1 and 0)

$$F(v) = -\sum_i a_i v_i - \sum_j \log\left[1 + \exp\!\left(b_j + \sum_i v_i W_{ij}\right)\right].$$
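As a sketch of the two free-energy functions for 0/1 spins (for $\pm 1$ spins, replace the `log(1 + exp(...))` term by `log(2*cosh(...))`); the two should agree to numerical precision:

```python
import itertools
import numpy as np

def free_energy_brute_force(v, W, a, b):
    """F(v) = -log sum_h exp(-E(v,h)), summing explicitly over all hidden configurations."""
    n_h = W.shape[1]
    total = sum(np.exp(a @ v + b @ h + v @ W @ h)
                for h in (np.array(c) for c in itertools.product([0, 1], repeat=n_h)))
    return -np.log(total)

def free_energy_analytic(v, W, a, b):
    """F(v) = -a.v - sum_j log(1 + exp(b_j + (v W)_j)), valid for 0/1 spins."""
    return -a @ v - np.sum(np.logaddexp(0.0, b + v @ W))
```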
Grading
Write two functions that compute the free energy in both these ways. Do some tests to verify that they give the same result for some choice of weights, biases, and visible configuration $v$.
Now, run your RBM for 10 epochs and measure the average change in the free energy as a function of epoch. Make a graph of this and paste it into your document (as a word of warning, this estimator is very noisy so it might not always look monotonic as it should be).
Applications of RBM
In the next part of this assignment, you will choose an application to apply your RBM to. In most cases, this will involve making modifications to help improve the efficiency of your RBM. Choosing one of these applications will get you the extra 10 points you need for the assignment (90 up to here + 10 for the application). You can do additional applications for an extra 5 points per application.
MNIST (10 points)
Protein Sequences (10 points extra credit)
Other interesting examples to do (but no instructions) include:
Ising models (this is sort of a meta-example: an RBM is an Ising model learning an Ising model).
Quantum Circuits and Variational Wave-functions: We do work like this actively in my group. If you are interested, I’m happy to chat about it.