Variational Inference Basics (I)

Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed “data”) as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As is typical in Bayesian inference, the parameters and latent variables are grouped together as “unobserved variables”. Variational Bayesian methods are primarily used for two purposes:

  1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
  2. To derive a lower bound for the marginal likelihood (sometimes called the “evidence”) of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor article.)

Evidence Lower Bound

We have a random variable $X$ and a latent variable $Z$.

Note that the main idea behind variational methods is to pick a family of distributions over the latent variables $Z$, indexed by its own variational parameters. This distribution, $q(Z)$, is supposed to be comparatively simple. We then have:
\begin{equation}
\begin{aligned}
lnP(X) &= lnP(X,Z) - lnP(Z|X) \\
&= ln(\frac{P(X,Z)}{q(Z)}) - ln(\frac{P(Z|X)}{q(Z)}) \\
&= lnP(X,Z) - lnq(Z) - ln(\frac{P(Z|X)}{q(Z)})
\end{aligned}\tag{1}
\end{equation}
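A quick sanity check of identity (1) on a tiny discrete example (the two-by-two joint table below is an assumed toy distribution, not part of the derivation):

```python
import numpy as np

# Assumed toy joint distribution P(X, Z): X and Z each take two values.
P_XZ = np.array([[0.10, 0.25],   # rows index Z
                 [0.30, 0.35]])  # columns index X

P_X = P_XZ.sum(axis=0)           # marginal P(X)
P_Z_given_X = P_XZ / P_X         # posterior P(Z|X); each column sums to 1

x, z = 1, 0                      # any fixed pair of values
lhs = np.log(P_X[x])
rhs = np.log(P_XZ[z, x]) - np.log(P_Z_given_X[z, x])
print(np.isclose(lhs, rhs))      # True: lnP(X) = lnP(X,Z) - lnP(Z|X)
```
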
Taking the expectation with respect to the distribution $q(Z)$ on both sides of (1), we have:
\begin{equation}
\begin{aligned}
\int_{Z}lnP(X)q(Z)dZ =& \int_{Z}lnP(X,Z)q(Z)dZ - \int_{Z}lnq(Z)\cdot{q(Z)}dZ \\
&- \int_{Z}ln(\frac{P(Z|X)}{q(Z)})q(Z)dZ
\end{aligned}\tag{2}
\end{equation}
Since $\int_{Z}q(Z)dZ = 1$, the left-hand side of (2) simplifies to $lnP(X)$, and we have:
\begin{equation}
\begin{aligned}
lnP(X) =& \int_{Z}lnP(X,Z)q(Z)dZ - \int_{Z}lnq(Z)\cdot{q(Z)}dZ - \int_{Z}ln(\frac{P(Z|X)}{q(Z)})q(Z)dZ
\end{aligned}\tag{3}
\end{equation}
On the right-hand side of (3), the first two terms form the evidence lower bound (4), a.k.a. the ELBO or variational lower bound.
\begin{equation}
\mathcal{L}(q) = \int_{Z}lnP(X,Z)q(Z)dZ - \int_{Z}lnq(Z)\cdot{q(Z)}dZ\tag{4}
\end{equation}
And the last term of (3), minus sign included, equals the Kullback-Leibler divergence (5) between $q(Z)$ and $P(Z|X)$:
\begin{equation}
KL(q(Z), P(Z|X)) = \int_{Z}ln\frac{q(Z)}{P(Z|X)}\cdot{q(Z)}dZ\tag{5}
\end{equation}
Here, $\mathcal{L}(q)$ is a functional, i.e. a function of the function $q$. Recall that the left-hand side of (3) is a constant that does not depend on $q$; hence, since $KL(q(Z), P(Z|X))\geq{0}$, the maximal value (upper bound) of $\mathcal{L}(q)$ is $lnP(X)$. This upper bound is reached only when $KL(q(Z), P(Z|X)) = 0$, i.e. when $q(Z)$ and $P(Z|X)$ are the same distribution.
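
A minimal numerical sketch of this decomposition (the tiny discrete model below is an assumed toy example, not from the text): for an arbitrary $q(Z)$, the ELBO (4) and the KL divergence (5) sum to $lnP(X)$, and the ELBO never exceeds $lnP(X)$.

```python
import numpy as np

# Assumed toy model: a fixed observation x, latent Z with two values.
P_XZ = np.array([[0.10, 0.25],
                 [0.30, 0.35]])
x = 1
p_xz = P_XZ[:, x]                    # P(X=x, Z=z) for each z
p_x = p_xz.sum()                     # P(X=x)
post = p_xz / p_x                    # exact posterior P(Z|X=x)

q = np.array([0.7, 0.3])             # an arbitrary variational distribution q(Z)

elbo = np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))   # eq. (4)
kl = np.sum(q * np.log(q / post))                         # eq. (5)

print(np.isclose(elbo + kl, np.log(p_x)))   # True: lnP(X) = L(q) + KL
print(elbo <= np.log(p_x))                  # True: the ELBO is a lower bound
```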

Problem Statement

Our problem can now be stated as follows:

  • The posterior links the data and the model; it is used in all downstream analyses, such as computing the predictive distribution;
  • The posterior $P(Z|X)$ we are interested in is too complex to compute directly;
  • We therefore look for a distribution $q(Z)$ with a simpler mathematical form to approximate it;
  • Our goal is thus to minimize the KL divergence between $q(Z)$ and $P(Z|X)$;
  • According to (3), $lnP(X) = \mathcal{L}(q) + KL(q(Z),P(Z|X))$, and since $lnP(X)$ is a constant with respect to $q$, minimizing $KL(q(Z),P(Z|X))$ is equivalent to maximizing the ELBO $\mathcal{L}(q)$; a numerical sketch of this equivalence follows the list.
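
A small numerical sketch of the last point (the same assumed toy model as above, with a one-parameter family $q(Z) = (\theta, 1-\theta)$ introduced purely for illustration): scanning $\theta$ shows that maximizing the ELBO and minimizing the KL divergence pick out the same value, which coincides with the true posterior.

```python
import numpy as np

# Same assumed toy model: fixed observation x, latent Z with two values.
P_XZ = np.array([[0.10, 0.25],
                 [0.30, 0.35]])
x = 1
p_xz = P_XZ[:, x]
post = p_xz / p_xz.sum()             # exact posterior P(Z|X=x)

# One-parameter variational family q(Z) = (theta, 1 - theta).
thetas = np.linspace(0.01, 0.99, 99)
elbos, kls = [], []
for t in thetas:
    q = np.array([t, 1.0 - t])
    elbos.append(np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q)))  # eq. (4)
    kls.append(np.sum(q * np.log(q / post)))                        # eq. (5)

print(np.argmax(elbos) == np.argmin(kls))   # True: same optimizer
print(thetas[np.argmax(elbos)], post[0])    # best theta ~ P(Z=0|X=x), up to grid resolution
```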

Here, in most cases the dimensions of $Z = \{z_1, z_2, …, z_M\}$ in $P(Z|X)$ are dependent on each other. However, the problem becomes dramatically simpler (as is often the case in multivariable calculus) if these components are treated as independent. Therefore, although in general
\begin{equation}
P(Z|X) \neq P(z_1|X)\cdot{P(z_2|X)…\cdot{P(z_M|X)}}\tag{6}
\end{equation}
we could suppose that the variational distribution factorizes as
\begin{equation}
q(Z) = \prod_{i=1}^{M}q_{i}(z_i)\tag{7}
\end{equation}
This factorized form is commonly known as the mean-field assumption.

Solution

Under the assumption that the variables $z_1, z_2, …, z_M$ are independent, we substitute equation (7) into the definition of the ELBO (4):
\begin{equation}
\begin{aligned}
\mathcal{L}(q) &= \int_{Z}lnP(X,Z)\cdot{q(Z)}dZ - \int_{Z}lnq(Z)\cdot{q(Z)}dZ \\
&= \int_{Z}\prod_{i=1}^{M}q_{i}(z_i)ln(P(X,Z))dZ - \int_{Z}\prod_{i=1}^{M}q_{i}(z_i)\sum_{j=1}^{M}ln(q_j(z_j))dZ
\end{aligned}\tag{8}
\end{equation}
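Before splitting this expression, here is a small numerical check of (8) (the discrete toy table with $M = 2$ latent variables and the particular $q_1$, $q_2$ below are assumptions made only for illustration): the factorized form agrees with the generic definition (4) evaluated at $q(Z) = q_1(z_1)q_2(z_2)$.

```python
import numpy as np

# Assumed toy table of P(X=x, z1, z2) for a fixed observation x (2 x 3 values).
rng = np.random.default_rng(0)
P_XZ = 0.6 * rng.dirichlet(np.ones(2 * 3)).reshape(2, 3)

q1 = np.array([0.6, 0.4])            # q1(z1), assumed
q2 = np.array([0.2, 0.5, 0.3])       # q2(z2), assumed
q_joint = np.outer(q1, q2)           # factorized q(z1, z2) = q1(z1) * q2(z2)

# Generic ELBO, eq. (4): E_q[lnP(X,Z)] - E_q[ln q(Z)]
elbo_4 = np.sum(q_joint * np.log(P_XZ)) - np.sum(q_joint * np.log(q_joint))

# Factorized form, eq. (8): the entropy term splits into a sum over coordinates
elbo_8 = np.sum(q_joint * np.log(P_XZ)) \
         - (np.sum(q1 * np.log(q1)) + np.sum(q2 * np.log(q2)))

print(np.isclose(elbo_4, elbo_8))    # True
```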

We can then separate $\mathcal{L}$ in (8) into two parts. The first part is:
\begin{equation}
\begin{aligned}
\text{(Part 1)} &= \int_{Z}\prod_{i=1}^{M}q_{i}(z_i)ln(P(X,Z))dZ \\
&= \int_{z_1}\int_{z_2}\cdots\int_{z_M}\prod_{i=1}^{M}q_i(z_i)ln(P(X,Z))\,dz_1dz_2\cdots{dz_M}
\end{aligned}\tag{9}
\end{equation}
This expression is quite complex.
