Please upload one PDF file to Gradescope (Entry code: ME62YG).
For this homework you will implement code (on PrairieLearn) for a pseudo-random number generator (RNG) and assess the quality of a stream of random numbers from a particular RNG. You will also study the technique of importance sampling. Random numbers and importance sampling are essential ingredients of Monte Carlo simulation techniques; in the next homework, you will modify your MD code to do Monte Carlo simulations.
On PrairieLearn, you are going to write (and test) an RNG that uses the linear congruential generator (LCG) algorithm. An LCG generates a random number \(X_{n+1}\) by taking the current random number \(X_n\) and performing the following arithmetic, \[X_{n+1}=(aX_n+c) \;\mod\; m,\] where \(m\) is the upper limit of possible values of \(X_n\), \(a<m\), and \(c<m\); \(\mod\) denotes the modulo operation. This algorithm requires choosing a starting number \(X_0\), called the seed, to initialize the stream.
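The recurrence above can be sketched in a few lines. The parameter values \(a\), \(c\), and \(m\) below are illustrative (they are the "Numerical Recipes" constants); the assignment may specify different ones, and the function names are placeholders, not part of the assignment.

```python
def lcg_stream(seed, n, a=1664525, c=1013904223, m=2**32):
    """Return n pseudo-random integers X_1..X_n generated from X_0 = seed
    via the LCG recurrence X_{k+1} = (a*X_k + c) mod m."""
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x)
    return out


def lcg_uniform(seed, n, a=1664525, c=1013904223, m=2**32):
    """Map the integer stream onto the unit interval [0, 1) by dividing by m."""
    return [x / m for x in lcg_stream(seed, n, a, c, m)]
```

Because the recurrence is deterministic, the same seed always reproduces the same stream, which is convenient for debugging.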
The quality of the LCG algorithm is very sensitive to the choice of these parameters. For instance, the period of the generator can never exceed \(m\).
If you have a stream of uniformly distributed random numbers, it is possible to apply a transformation to get a stream of random numbers that obeys some other distribution.
In simulations of physical systems at thermodynamic equilibrium, it is useful to have random numbers drawn from a Gaussian distribution. There are some standard methods to accomplish this, and you should use textbooks and the internet to learn about these algorithms.
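One standard method you may find in your reading is the Box-Muller transform, which converts two independent uniform variates on \((0,1)\) into two independent standard-normal variates. This is a minimal sketch, fed here by Python's built-in `random.random` for illustration; in the homework you would plug in your own LCG-based uniform stream, and the assignment leaves the choice of algorithm to you.

```python
import math
import random


def gaussian_pair(uniform=random.random):
    """Box-Muller transform: two uniform variates on (0, 1) become
    two independent samples from the standard normal distribution."""
    u1 = uniform()
    while u1 == 0.0:  # guard against log(0)
        u1 = uniform()
    u2 = uniform()
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)
```

Since each call consumes two uniform numbers and returns two Gaussian numbers, no input randomness is wasted.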
After implementing a Gaussian RNG, you will verify that it works by plotting a histogram. The correct normalization of the histogram is not obvious. If you have \(N\) random numbers, and your histogram contains \(N_\text{B}\) bins over a range \(\pm L\), you should multiply by the normalization factor \(\frac{N_\text{B}}{2L N}\).
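The normalization can be checked with a small sketch. The function below (an illustrative helper, not part of the assignment) bins \(N\) samples into \(N_\text{B}\) bins on \([-L, L)\) and applies the factor \(\frac{N_\text{B}}{2LN}\), so that the result approximates a probability density.

```python
def normalized_histogram(samples, nbins, L):
    """Histogram of samples on [-L, L), normalized by N_B / (2 L N)
    so bin heights approximate the underlying probability density."""
    counts = [0] * nbins
    width = 2 * L / nbins  # each bin spans 2L / N_B
    for x in samples:
        if -L <= x < L:
            counts[int((x + L) / width)] += 1
    norm = nbins / (2 * L * len(samples))
    return [c * norm for c in counts]
```

A quick sanity check: for uniform samples on \([-L, L)\) every bin height should approach \(\frac{1}{2L}\), and the heights times the bin width should sum to (approximately) one.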
To test an RNG that claims to produce uniformly distributed random numbers on the interval [0,1), we can check the uniformity of the distribution by creating a histogram of the stream of random numbers. We create a series of evenly-spaced bins between [0,1), generate a stream of random numbers, and bin each value appropriately. If the RNG stream is indeed uniformly distributed, we should observe the same number of random numbers in each bin.
Of course, if we actually try to do this empirically, we will find that not every bin gets the same number of counts. Instead, there will be fluctuations that cause some bins to get more counts than others. On the other hand, an RNG stream that is not truly uniform will also lead to a histogram with unequal counts in each bin. We want to distinguish between non-uniformity that occurs because of statistical fluctuations (i.e., having only a finite sample size) and non-uniformity due to a broken random number generator. We use the following metric to distinguish between these two cases: \[\chi^2 = \sum_i \frac{(N_i - n_i)^2}{n_i},\] where \(N_i\) is the number of counts in bin \(i\) and \(n_i\) is the expected number of counts in bin \(i\) (the total number of points divided by the number of bins \(N_\text{B}\) for our case of the uniform distribution).
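For the uniform case, where every bin shares the same expected count, the metric reduces to a one-liner; a sketch:

```python
def chi_squared(counts):
    """Chi^2 statistic for a histogram whose expected counts are uniform:
    n_i = (total counts) / N_B for every bin i."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)
```

A perfectly flat histogram gives \(\chi^2 = 0\); larger values signal larger deviations from uniformity.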
If \(n_i > 5\) holds for all \(i\), the \(\chi^2\) statistic approaches the eponymous chi-squared distribution with \(N_\text{B}-1\) degrees of freedom. The goodness-of-fit statement can be simplified to this: if the \(\chi^2\) value is greater than some critical value, then with some quantifiable confidence the generator of this stream is not what it is claimed to be (i.e., a uniform distribution of \(x\in [0,1)\)). We can solve for these critical values numerically by integrating the probability density function of the chi-squared distribution, or, for degrees of freedom up to 100, simply use the tabulated values provided by NIST. For this homework, let us choose the 90% confidence level for the hypothesis that the generator is not broken (random). Critical values of \(\chi^2\) should be taken from the column labeled “0.10”.
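The numerical-integration route can be sketched with only the standard library: integrate the chi-squared density to build the CDF, then bisect for the point where the upper tail equals the chosen significance. This sketch assumes \(k \ge 3\) degrees of freedom (so the density vanishes at the origin); the NIST table remains the easier option below 100.

```python
import math


def chi2_pdf(x, k):
    """Density of the chi-squared distribution with k degrees of freedom."""
    if x <= 0:
        return 0.0
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))


def chi2_cdf(x, k, steps=4000):
    """CDF via trapezoidal integration of the density on (0, x]; k >= 3."""
    if x <= 0:
        return 0.0
    h = x / steps
    s = 0.5 * (chi2_pdf(0.0, k) + chi2_pdf(x, k))
    for i in range(1, steps):
        s += chi2_pdf(i * h, k)
    return s * h


def chi2_critical(k, upper_tail=0.10):
    """Bisect for the critical value x with P(chi^2 > x) = upper_tail."""
    target = 1.0 - upper_tail
    lo, hi = 0.0, 10.0 * k + 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, k) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `chi2_critical(9, 0.10)` should reproduce the familiar table entry of about 14.68 for 9 degrees of freedom at the 0.10 upper tail.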
One can also bin numbers in more than one dimension. For example, in two dimensions one could take pairs of numbers and bin them onto a grid. This is useful because it will expose correlations between successive numbers in the series. Such correlations are a serious problem, because we generally use an RNG stream under the assumption that successive numbers are independent of one another.
You will write functions that compute the \(\chi^2\) metric for histograms in two and three dimensions as well. Note that the \(\chi^2\) test becomes increasingly stringent in higher dimensions, and RNGs judged suitable for research purposes are generally required to pass the \(\chi^2\) test in 10 dimensions. You may find yourself working with degrees of freedom greater than 100 (off the NIST table mentioned above), in which case you will have to calculate the values yourself by numerical integration or use the critical value calculator found at Chi Square Calculator.
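The two-dimensional case is a direct extension of the one-dimensional statistic; a sketch (the function name is illustrative), assuming the stream lies in \([0,1)\) so each coordinate maps cleanly onto a grid cell:

```python
def chi_squared_2d(pairs, nbins):
    """Bin (x, y) pairs from a [0, 1) stream onto an nbins x nbins grid
    and return the chi^2 statistic; degrees of freedom = nbins**2 - 1."""
    counts = [[0] * nbins for _ in range(nbins)]
    for x, y in pairs:
        counts[int(x * nbins)][int(y * nbins)] += 1
    expected = len(pairs) / nbins ** 2
    return sum((c - expected) ** 2 / expected
               for row in counts for c in row)
```

Note that the expected count per cell falls as \(N_\text{B}^d\) grows, so the \(n_i > 5\) condition forces much longer streams in higher dimensions.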
For the 90% confidence level, you should set the “probability” parameter to 0.9. (Note: do not set it to 0.10 as you did above, where it represented the upper-tail integral.)
You will work to compute the integral \(I = \int_{-\infty}^{\infty} dx \, \frac{\exp ( -x^2/2 )}{1+x^2}\) which cannot be done analytically.
First, you will estimate the value of the continuous integral by summing the function over a discrete set of grid points: \(I = \int_{-\infty}^\infty dx \, f(x) \approx \frac{1}{\Omega} \sum_{i=1}^N f(x_i) \;,\) where the infinite domain is truncated to \([-L, L]\), the \(x_i\) are \(N\) evenly spaced grid points, and \(\Omega\) is the appropriate normalization involving \(N\) and \(L\).
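One consistent way to set this up is a midpoint sum, in which the normalization \(1/\Omega\) is the grid spacing \(2L/N\). The cutoff \(L = 8\) below is an illustrative choice, generous here because the Gaussian factor makes the tail beyond \(|x| = 8\) negligible:

```python
import math


def f(x):
    """Integrand: exp(-x^2 / 2) / (1 + x^2)."""
    return math.exp(-x * x / 2) / (1 + x * x)


def grid_integral(L=8.0, N=100000):
    """Midpoint-rule estimate of the integral, truncated to [-L, L].
    The grid spacing dx = 2L / N plays the role of 1 / Omega."""
    dx = 2 * L / N
    return dx * sum(f(-L + (i + 0.5) * dx) for i in range(N))
```

Doubling \(N\) (or varying \(L\)) and checking that the answer is stable is a simple convergence test for the two parameters the text warns about.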
Second, you will compute this integral using Monte Carlo with importance sampling. Given an integral \(\int_{-\infty}^{\infty} f(x) dx\), in the grid-based approach above we had to choose a bound \(L\) and discretize the \(x\)-axis with \(N\) evenly spaced bins. There are two problems with this approach. First, a good \(L\) or \(N\) is not always known beforehand. Second, if the function \(f(x)\) has sharp peaks somewhere, one would naturally need more bins around those regions, but it would be expensive to increase the density of bins everywhere.
The idea of importance sampling is that if we want to evaluate \(I = \int_{-\infty}^{\infty} dx \, f(x)\) we can rewrite the integral as \(I = \int_{-\infty}^{\infty} dx \, p(x) \frac{f(x)}{p(x)} = \int_{-\infty}^{\infty} dx \, p(x) g(x)\) where \(p(x)\) is a probability distribution (i.e., positive for all x and normalized such that \(\int_{-\infty}^{\infty} dx \, p(x) = 1\)) and \(g(x) = \frac{f(x)}{p(x)}\) is the estimator.
Using statistics, one recognizes that \[\int_{-\infty}^{\infty} \frac{f(x)}{p(x)} p(x) dx = E\left[\frac{f(x)}{p(x)}\right],\] where \(E\left[\frac{f(x)}{p(x)}\right]\) is the expectation value of \(\frac{f(x)}{p(x)}\) under the probability distribution \(p(x)\). To actually calculate the expectation value, one uses the following estimate \[E\left[\frac{f(x)}{p(x)}\right] \approx \frac{1}{N} \sum_{i=1}^N \frac{f(x_i)}{p(x_i)},\] where the \(N\) random numbers \(\{x_i\}\) are sampled according to the probability distribution \(p(x)\).
Essentially, one turns an integration problem into a sampling problem. The method is called importance sampling because the variance is minimized when one chooses \(p(x) \propto |f(x)|\): we sample more points where the magnitude of \(f(x)\) is large and fewer points where it is small. In practice this ideal choice is not attainable (normalizing such a \(p(x)\) would require already knowing the integral), so the next best thing is to match the shape and the locations of the peaks of \(p(x)\) to those of \(f(x)\) as closely as possible.
For our integrand \(f(x)=\frac{\exp \left(-\frac{x^2}{2} \right)}{1+x^2}\), we will choose \(p(x)=K \exp\left(-\frac{x^2}{2\alpha}\right) \;,\) where \(K\) is chosen so that \(p(x)\) is a valid probability distribution (i.e., normalized correctly).
The utility of this choice is that we already know how to sample from a Gaussian distribution using our Gaussian random number generator.
Thus, for Monte Carlo integration with importance sampling, we will sample random configurations distributed according to the function \(p(x)\), and we will evaluate the integral \(I\) by computing the expectation value of the estimator \(g(x)\): \(I = \langle g(x) \rangle_{p(x)} \;\).
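The whole procedure fits in a few lines. This sketch uses Python's built-in `random.Random.gauss` in place of your own Gaussian RNG, and normalizes \(p(x)\) with \(K = 1/\sqrt{2\pi\alpha}\), which follows from the Gaussian form chosen above; the default \(\alpha = 1\) is an illustrative choice.

```python
import math
import random


def f(x):
    """Integrand: exp(-x^2 / 2) / (1 + x^2)."""
    return math.exp(-x * x / 2) / (1 + x * x)


def importance_sample(n=200000, alpha=1.0, seed=7):
    """Monte Carlo estimate of I using p(x) = K exp(-x^2 / (2 alpha)),
    a Gaussian with variance alpha and normalization K = 1/sqrt(2 pi alpha)."""
    rng = random.Random(seed)
    K = 1.0 / math.sqrt(2.0 * math.pi * alpha)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, math.sqrt(alpha))      # sample x_i from p(x)
        p = K * math.exp(-x * x / (2.0 * alpha))  # p(x_i)
        total += f(x) / p                         # estimator g(x_i) = f(x_i)/p(x_i)
    return total / n
```

No cutoff \(L\) or grid spacing appears anywhere: the Gaussian sampler automatically concentrates points where the integrand is largest, which is exactly the advantage importance sampling promises.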