Overview of MCMC Methods (I)

Why We Need MCMC Methods

Under the Bayesian problem setting, all too often, it is difficult to figure out the real posterior distribtuion. To understand this difficulty, we could firstly have a glimpse of the Bayesian equation:

It is always difficult to calculate the denominator, since it is always impossible to enumerate all of the possible $y$ and corresponding $x$ from my understanding. And that’s also why Geoffrey Hinton put forward Convergence Divergence to train Boltzmann Machine. Therefore we needs methods to help us approximate the posterior. Currently there are two methods, one is variational inference, the other is Monte Carlo Markov Chain sampling methods, which is the topic for this article.
Under ordinary inference problem setting, when we need to figure out certain parameters, we should calculate them in the following way:

In MCMC, after sampling adequate number of samples from the posterior distribution, we could approximate the original equation in the following way:

Sampling Methods

Simplest Sampling Methods: Inverse of CDF

1
2
u ~ U(0, 1)
x = CDF_inverse(u)

This sampling method is the one we see most in our undergraduate Probability class. There is a problem haunting it: all too often, it is difficult or even impossible to write out the function of CDF_inverse. This problem compels us to find a more feasible way to execute sampling.

Rejection Sampling

1
2
3
4
5
6
7
8
9
while not enough samples:
x_i ~ q(x)
u ~ U(0, 1)
if u < p(x_i)/(M * q(x_i)) then
accept x_i
else
reject x_i
end if
end while
reject_sampling.png

Rejection sampling does not require knowing the function of inverse of CDF. Instead, it use a sampling function q(x), which often has a comparatively simpler form. The influence of original posterior distribution p(x) is demonstrated in the accaption-rejection judgement. When the value of fraction p(x)/Mq(x) is big, it is easier to accept this samples. However, the major problem of this method is that the frequency of rejection is much higher than that of accpetance, especially when x is of high dimension. In this case, rejection sampling is not efficient, and its adaptation is needed.
There exists adaptive rejection sampling, where adaptation has been done with respect to the sampling function q(x). By making q(x) approximates p(x) as close as possible, the acception rate could be improved dramatically. The illustration for this adaptation is following:
adaptive_rejection.png
Note that in order to make sure that every tagent has value larger than p(x), p(x) or its transformation must be concave.

Importance Sampling

In importance sampling, we introduce q(x) which is comparatively easier to sample than p(x) as that in rejection sampling. Yet q(x) here has different usage. We could transform the original expectation equation into the following form:

$q(x)$ is a new distribution function, and $f(x)\frac{p(x)}{q(x)}$ is a new target function. If we could sample a series of $x^{(i)}$ from the distribution $q(x)$. Then the original integral could be approximated by the following sum:

There comes a question: why is this method named “Importance Sampling”?

That’s because the second part of the expression of $ E[f(x)] $, $p(x^{i})/q(x^{i})$ could be regarded as the weight of $ x^{i} $. Suppose for certain $ x^{a} $, $p(x^{a})$ is big yet the sampling function $q(x^{i})$ is small. It means that under original distribution $p(x)$, $ x^{a} $ is a point that has high probability to be visited. However, under current distribution $q(x)$, $ x^{a} $ has a comparatively lower visiting probability. In order to compensate this, we assign $ x^{a} $ with a higher weight, $p(x^{i})/q(x^{i})$. Although less likely to be sampled, once added to the sampling sets, this point will contribute greatly to the overall expectation of $f(x)$.

Metapolis-Hastings Algorithm

To understand MH sampling, first we should have a glimpse about detailed balance. From the definition of WiKiPedia, the principle of detailed balance is formulated for kinetic systems which are decomposed into elementary processes (collisions, or steps, or elementary reactions): at equilibrium, each elementary process should be equilibrated by its reverse process.

From mathematical perspective, we have

And we further could write it in the following form:

$ \pi_{t}(x^{*}) $ refers to the probability to be in the state $x^{*}$ or $p(y)$ in the former equation, and $K(x \to x^{*})$ denotes the probability to transfer from state $x$ to state $x^{*}$ or $p(y|x)$ in the former equation. We could prove that, if detailed balance is satisfied, the equation $ \pi_{t}(x^{*}) = \int_{x}\pi_{t-1}K(x \to x^{*})dx $ holds.

Then, we get

The following is the pseudo-code for MH sampling.

1
2
3
4
5
6
7
8
Initialize x_0
For i = 0 to N-1:
u ~ U(0, 1)
x* ~ q(x*|x_i)
if u < alpha(x*) = min(1, (pi(x*)q(x|x*))/(pi(x)q(x*|x)):
x_(i+1) = x*
else:
x_(i+1) = x_i

In order to prove that MH sampling satisfies detailed balance, we need to prove $ \pi(x)K(x^{*} \to x) = \pi(x^{*})K(x \to x^*) $