Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed “data”) as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As is typical in Bayesian inference, the parameters and latent variables are grouped together as “unobserved variables”. Variational Bayesian methods are primarily used for two purposes:

1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
2. To derive a lower bound for the marginal likelihood (sometimes called the “evidence”) of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over unobserved variables). This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor.)

# Evidence Lower Bound

Let $X$ be the observed random variable and $Z$ a latent variable.

The main idea behind variational methods is to pick a family of distributions over the latent variable $Z$, indexed by its own variational parameters. The variational distribution $q(Z)$ is supposed to be comparatively simple. By the product rule $P(X,Z) = P(Z|X)P(X)$, we have:

\begin{aligned}
\ln P(X) &= \ln P(X,Z) - \ln P(Z|X) \\
&= \ln\frac{P(X,Z)}{q(Z)} - \ln\frac{P(Z|X)}{q(Z)} \\
&= \ln P(X,Z) - \ln q(Z) - \ln\frac{P(Z|X)}{q(Z)}
\end{aligned}\tag{1}

Taking the expectation of both sides of (1) with respect to the distribution $q(Z)$, we have:

\begin{aligned}
\int_{Z}\ln P(X)\,q(Z)dZ =& \int_{Z}\ln P(X,Z)\,q(Z)dZ - \int_{Z}\ln q(Z)\cdot{q(Z)}dZ \\
&- \int_{Z}\ln\frac{P(Z|X)}{q(Z)}\,q(Z)dZ
\end{aligned}\tag{2}

Since $\ln P(X)$ does not depend on $Z$ and $\int_{Z}q(Z)dZ = 1$, the left-hand side simplifies to $\ln P(X)$:

\begin{aligned}
\ln P(X) =& \int_{Z}\ln P(X,Z)\,q(Z)dZ - \int_{Z}\ln q(Z)\cdot{q(Z)}dZ - \int_{Z}\ln\frac{P(Z|X)}{q(Z)}\,q(Z)dZ
\end{aligned}\tag{3}

On the right-hand side of (3), the first two terms form the evidence lower bound (4), a.k.a. the ELBO or variational lower bound.

\mathcal{L}(q) = \int_{Z}\ln P(X,Z)\,q(Z)dZ - \int_{Z}\ln q(Z)\cdot{q(Z)}dZ\tag{4}

The remaining term, including its minus sign, is the Kullback-Leibler divergence (5) between $q(Z)$ and $P(Z|X)$.

KL(q(Z), P(Z|X)) = \int_{Z}\ln\frac{q(Z)}{P(Z|X)}\cdot{q(Z)}dZ = -\int_{Z}\ln\frac{P(Z|X)}{q(Z)}\cdot{q(Z)}dZ\tag{5}

Here, $\mathcal{L}(q)$ is a functional, i.e. a function of the function $q$. Recall that the left-hand side of (3), $\ln P(X)$, is a constant that does not depend on $q$, and that $KL(q(Z), P(Z|X))\geq{0}$. Hence $\mathcal{L}(q) \leq \ln P(X)$: the log evidence is an upper bound on $\mathcal{L}(q)$, which is why $\mathcal{L}(q)$ is called a lower bound on the evidence. This upper bound is reached only when $KL(q(Z), P(Z|X)) = 0$, i.e. when $q(Z)$ and $P(Z|X)$ are the same distribution (almost everywhere).
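
The decomposition $\ln P(X) = \mathcal{L}(q) + KL(q(Z), P(Z|X))$ can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the derivation: $Z$ is discrete with `K` states (so the integrals become sums), and the table `joint` is a made-up stand-in for $P(X, Z)$ at the observed value of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: Z takes K discrete values and X is fixed at its
# observed value, so P(X, Z) is just a vector of K positive numbers.
K = 5
joint = rng.random(K) * 0.1 + 1e-3    # P(X = x, Z = k)
evidence = joint.sum()                # P(X) = sum_k P(X, Z = k)
posterior = joint / evidence          # P(Z | X)

# An arbitrary variational distribution q(Z).
q = rng.random(K) + 1e-3
q /= q.sum()

elbo = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))   # equation (4)
kl = np.sum(q * np.log(q / posterior))                      # equation (5)

# ELBO + KL reproduces ln P(X); the ELBO alone stays below it.
print(np.log(evidence), elbo + kl, elbo)

# Choosing q equal to the posterior closes the gap.
elbo_at_posterior = np.sum(posterior * np.log(joint / posterior))
print(elbo_at_posterior)
```

Any other choice of `q` changes how $\ln P(X)$ is split between the two terms, but not their sum.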

# Problem Statement

Our problem can then be stated as follows:

• The posterior links the data and the model. It is used in all downstream analyses, such as computing the predictive distribution;
• The posterior $P(Z|X)$ is what we are interested in, but it is too complex to compute directly;
• We therefore look for a distribution $q(Z)$ that has a simpler mathematical form;
• Our goal is to minimize the KL divergence between $q(Z)$ and $P(Z|X)$;
• According to (3), $\ln P(X) = \mathcal{L}(q) + KL(q(Z),P(Z|X))$, and $\ln P(X)$ is a constant with respect to $q$, so minimizing $KL(q(Z),P(Z|X))$ is equivalent to maximizing the ELBO $\mathcal{L}(q)$.
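
The equivalence in the last point can be illustrated with a one-parameter family. In this sketch, which assumes a binary latent variable and made-up numbers for $P(X, Z)$, the candidate $q(Z)$ is parameterized by `theta`, the probability of $Z = 1$, and a simple grid search stands in for a real optimizer.

```python
import numpy as np

# Hypothetical joint P(X = x, Z) for a binary latent variable Z in {0, 1}.
joint = np.array([0.03, 0.12])
posterior = joint / joint.sum()       # P(Z | X)

thetas = np.linspace(0.01, 0.99, 99)  # candidate values of q(Z = 1)
elbos, kls = [], []
for theta in thetas:
    q = np.array([1.0 - theta, theta])
    elbos.append(np.sum(q * np.log(joint / q)))       # ELBO, equation (4)
    kls.append(np.sum(q * np.log(q / posterior)))     # KL, equation (5)

best_by_elbo = thetas[np.argmax(elbos)]
best_by_kl = thetas[np.argmin(kls)]

# Both criteria select the same theta, the grid point closest to P(Z = 1 | X).
print(best_by_elbo, best_by_kl, posterior[1])
```

The ELBO only needs the joint $P(X, Z)$, whereas the KL divergence needs the posterior itself, which is why the ELBO is the quantity optimized in practice.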

Here the components of $Z = \{z_1, z_2, \dots, z_M\}$ in $P(Z|X)$ are, in most cases, dependent on each other. However, the problem simplifies dramatically (much as in multivariable calculus) when these components are independent. Therefore, although in general we have

P(Z|X) \neq P(z_1|X)\cdot{P(z_2|X)}\cdots{P(z_M|X)}\tag{6}

we can assume that the approximating distribution factorizes:

q(Z) = \prod_{i=1}^{M}q_{i}(z_i)\tag{7}
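
To see what a fully factorized approximation, $q(Z) = \prod_{i}q_i(z_i)$, gives up, here is a small sketch with two binary latent variables. The posterior table is invented for illustration; it only serves to show that a dependent posterior is not the product of its marginals, while a factorized $q$ is always an outer product of its factors.

```python
import numpy as np

# Hypothetical posterior P(z_1, z_2 | X) over two binary latent variables.
# Most of the mass sits on z_1 == z_2, so the two variables are dependent.
posterior = np.array([[0.4, 0.1],
                      [0.1, 0.4]])

# Marginals P(z_1 | X) and P(z_2 | X).
p1 = posterior.sum(axis=1)
p2 = posterior.sum(axis=0)

# Product of the marginals: a fully factorized table of the same shape.
product_of_marginals = np.outer(p1, p2)
print(product_of_marginals)                            # every entry is 0.25
print(np.allclose(posterior, product_of_marginals))    # False, as in (6)

# A factorized q(Z) = q_1(z_1) * q_2(z_2) is built the same way from
# freely chosen factors; it has 2 free parameters here instead of 3.
q1 = np.array([0.7, 0.3])
q2 = np.array([0.6, 0.4])
q = np.outer(q1, q2)
print(q.sum())                                         # still a distribution
```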

# Solution

Under the assumption that the variables $z_1, z_2, \dots, z_M$ are independent under $q$, substitute equation (7) into the definition of the ELBO (4):

\begin{aligned}
\mathcal{L}(q) &= \int_{Z}\ln P(X,Z)\cdot{q(Z)}dZ - \int_{Z}\ln q(Z)\cdot{q(Z)}dZ \\
&= \int_{Z}\prod_{i=1}^{M}q_{i}(z_i)\ln P(X,Z)dZ - \int_{Z}\prod_{i=1}^{M}q_{i}(z_i)\sum_{k=1}^{M}\ln q_k(z_k)dZ
\end{aligned}\tag{8}

We can then split $\mathcal{L}(q)$ in (8) into two parts. The first part is:

\begin{aligned}
(Part 1) &= \int_{Z} \prod_{i=1}^{M}q_{i}(z_i)\ln P(X,Z)dZ \\
&= \int_{z_1}\int_{z_2}\cdots\int_{z_M}\prod_{i=1}^{M}q_i(z_i)\ln P(X,Z)\,dz_1dz_2\cdots dz_M
\end{aligned}\tag{9}

This expression is quite complex, so we need a trick to simplify it. We rearrange it by pulling one particular factor $q_{j}(z_j)$ out of the inner integrals:

\begin{aligned}
(Part1) = \int_{z_j}q_{j}(z_j)\left(\int \dots \int_{z_{i\neq j}}\prod_{i\neq j}^{M}q_{i}(z_i)\ln P(X, Z)\prod_{i\neq j}^{M}dz_i\right)dz_j
\end{aligned}\tag{10}

Grouping each $q_{i}(z_i)$ with its differential $dz_i$ gives:

\begin{aligned}
(Part1) = \int_{z_j}q_{j}(z_j)\left(\int \dots \int_{z_{i\neq j}}\ln P(X, Z)\prod_{i\neq j}^{M}q_{i}(z_i)dz_i\right)dz_j
\end{aligned}\tag{11}

To make this more meaningful, note that $\prod_{i\neq j}^{M}q_{i}(z_i)$ is a joint probability density over all $z_i$ with $i \neq j$, so the inner integral is an expectation with respect to that density:

\begin{aligned}
(Part1) = \int_{z_j}q_{j}(z_j)\,\mathbb{E}_{i\neq j}[\ln P(X, Z)]\,dz_j
\end{aligned}\tag{12}
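
The rearrangement of Part 1 into a nested expectation can be verified on a small discrete example. This is a sketch under assumptions not made in the text: two latent variables with 3 and 4 states, sums instead of integrals, $j = 1$, and a randomly generated table standing in for $P(X, Z)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint P(X = x, z_1, z_2) with 3 states for z_1 and 4 for z_2.
joint = (rng.random((3, 4)) + 0.1) / 20.0
log_joint = np.log(joint)

# Factors q_1(z_1) and q_2(z_2) of a factorized q(Z).
q1 = rng.random(3) + 0.1
q1 /= q1.sum()
q2 = rng.random(4) + 0.1
q2 /= q2.sum()

# Part 1 computed directly: sum over all (z_1, z_2) of q_1 q_2 ln P(X, Z).
part1_direct = np.sum(np.outer(q1, q2) * log_joint)

# Part 1 in the nested form with j = 1: average ln P(X, Z) over q_2 first
# (one value per state of z_1), then average the result over q_1.
inner_expectation = log_joint @ q2
part1_nested = np.sum(q1 * inner_expectation)

print(part1_direct, part1_nested)   # the two values agree
```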

For $Part 2$, first recall a simplification strategy (a common one, which is also used in the EM algorithm):

\begin{aligned}
& \int_{x_1}\int_{x_2}[f(x_1) + f(x_2)]P(x_1, x_2)dx_2dx_1 \\
&= \int_{x_1}\int_{x_2}f(x_1)P(x_1, x_2)dx_2dx_1 + \int_{x_1}\int_{x_2}f(x_2)P(x_1, x_2)dx_2dx_1 \\
&= \int_{x_1}f(x_1)\int_{x_2}P(x_1, x_2)dx_2dx_1 + \int_{x_2}f(x_2)\int_{x_1}P(x_1, x_2)dx_1dx_2 \\
&= \int_{x_1}f(x_1)P(x_1)dx_1 + \int_{x_2}f(x_2)P(x_2)dx_2 \\
&= \sum_{i = 1,2} \mathbb{E}[f(x_i)]
\end{aligned}
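
The identity above holds even when $x_1$ and $x_2$ are dependent, because each term only needs the marginal of its own variable. A minimal numerical sketch, assuming a made-up discrete joint and an arbitrary choice of $f$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical joint P(x_1, x_2) on a 3 x 4 grid; not a product of marginals.
P = rng.random((3, 4))
P /= P.sum()

x1_values = np.array([0.0, 1.0, 2.0])
x2_values = np.array([0.0, 1.0, 2.0, 3.0])
f = np.square                      # any f works; squaring is an arbitrary choice

# Left-hand side: sum over both variables of [f(x_1) + f(x_2)] P(x_1, x_2).
lhs = np.sum((f(x1_values)[:, None] + f(x2_values)[None, :]) * P)

# Right-hand side: each expectation uses only the corresponding marginal.
P1 = P.sum(axis=1)                 # P(x_1)
P2 = P.sum(axis=0)                 # P(x_2)
rhs = np.sum(f(x1_values) * P1) + np.sum(f(x2_values) * P2)

print(lhs, rhs)                    # the two values agree
```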

Applying the same idea to the $M$ factors of $q$, we can rewrite $Part 2$ as follows:

\begin{aligned}
(Part2) &= \int_{Z}\prod_{i=1}^{M}q_{i}(z_i)\sum_{k=1}^{M}\ln q_k(z_k)dZ \\
&= \sum_{i=1}^{M}\left(\int_{z_i}q_i(z_i)\ln q_i(z_i)dz_i\right)
\end{aligned}\tag{13}

If we are only interested in $z_j$, all terms of $(13)$ with $i \neq j$ are constants, and $(13)$ can be rewritten as

\begin{aligned}
(Part2) = \int_{z_j}q_j(z_j)\ln q_j(z_j)dz_j + const
\end{aligned}\tag{14}

Combining equations $(12)$ and $(14)$ and focusing only on $q_j$, the ELBO is

\begin{aligned}
\mathcal{L}(q) &= Part1 - Part2 \\
&= \int_{z_j}q_{j}(z_j)\,\mathbb{E}_{i\neq j}[\ln P(X, Z)]\,dz_j - \int_{z_j}q_j(z_j)\ln q_j(z_j)dz_j + const
\end{aligned}\tag{15}

We can define $\mathbb{E}_{i\neq j}[\ln P(X, Z)]$ as $\ln\tilde{P}_j(X, z_j)$, since all other $z_i$ are integrated out. The reason we write $\tilde{P}_j$ instead of $P_j$ is that the true marginal $P_j(X, z_j)$ is

P_j(X, z_j) = \int\dots\int_{z_{i\neq j}}P(X, z_1, \dots, z_M)\prod_{i\neq j}^{M}dz_i

whereas $\tilde{P}_j(X, z_j)$ is a pseudo-probability of $z_j$ (in general it is not normalized), which is used exclusively in variational inference:

\tilde{P}_j(X, z_j) = \exp\left(\int\dots\int_{z_{i\neq j}}\ln P(X, z_1, \dots, z_M)\prod_{i\neq j}^{M}q_{i}(z_i)dz_i\right)

After defining this pseudo-probability, the ELBO can be expressed as

\mathcal{L}(q) = \int_{z_j}q_j(z_j)\ln\frac{\tilde{P}_j(X, z_j)}{q_j(z_j)}dz_j + const\tag{16}

Up to an additive constant (the logarithm of the normalizer of $\tilde{P}_j$), this is the same as $-KL(q_j(z_j), \tilde{P}_j(X, z_j))$. In other words, to maximize the ELBO with respect to $q_j$, we need to minimize this KL divergence. The next section will introduce how to minimize the KL divergence between $q_j(z_j)$ and $\tilde{P}_j(X, z_j)$.
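
As a sanity check on the final expression, the sketch below compares the full ELBO with its $q_j$-dependent form on a discrete toy problem. Everything specific here is an assumption made for illustration: two latent variables with 3 and 4 states, $j = 1$, a random table standing in for $P(X, Z)$, and the helper names `normalised`, `elbo` and `local_term`.

```python
import numpy as np

rng = np.random.default_rng(3)

def normalised(v):
    """Scale a positive vector so that it sums to 1."""
    return v / v.sum()

# Hypothetical joint P(X = x, z_1, z_2): 3 states for z_1, 4 for z_2.
log_joint = np.log((rng.random((3, 4)) + 0.1) / 20.0)
q2 = normalised(rng.random(4) + 0.1)     # the factor held fixed (i != j)

def elbo(q1):
    """Full ELBO under the factorized q(Z) = q_1(z_1) q_2(z_2)."""
    q = np.outer(q1, q2)
    return np.sum(q * log_joint) - np.sum(q * np.log(q))

# Log of the pseudo-probability for j = 1: the expectation of ln P(X, Z)
# over q_2, giving one value per state of z_1.
log_p_tilde = log_joint @ q2

def local_term(q1):
    """The q_1-dependent part of the ELBO: sum over z_1 of q_1 ln(P~_1 / q_1)."""
    return np.sum(q1 * (log_p_tilde - np.log(q1)))

# The constant is minus the sum of q_2 ln q_2; it does not involve q_1.
const = -np.sum(q2 * np.log(q2))

q1_a = normalised(rng.random(3) + 0.1)
q1_b = normalised(rng.random(3) + 0.1)
print(elbo(q1_a), local_term(q1_a) + const)   # the two values agree
print(elbo(q1_b), local_term(q1_b) + const)   # and again for another q_1
```

Because `const` is the same for every choice of `q1`, comparing two candidates for $q_1$ only requires the local term.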

# Reference

1. Machine Learning course of Yida Xu, https://www.youtube.com/watch?v=arMoli91OZE&t=1s