Learning with probabilities often involves complex distributions in high-dimensional spaces, which are difficult to approach analytically. Nevertheless, by taking samples, almost any quantity of interest can be estimated empirically [Robert2005, Neal1993]. Most notably, these quantities of interest are often expressed as expectations:

$$\mathbb{E}_{x\sim p(x)}\left[h(x)\right]=\int_{\mathcal{X}}h(x)p(x)dx$$

where $x\in\mathcal{X}$, and $x\sim p(x)$ represents the fact that $x$ is a sample from the distribution $p$. Using iid samples $x_{1},x_{2},\dots,x_{N}$, the above expectation can be estimated with the average of $h(x)$:

$$\hat{\mathbb{E}}_{x\sim p(x)}\left[h(x)\right]=\frac{1}{N}\sum_{i=1}^{N}h(x_{i})$$
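For instance (a toy illustration using only Python's standard library, not an example from the text), with $p=\mathcal{N}(0,1)$ and $h(x)=x^{2}$, the estimator above approaches $\mathbb{E}[x^{2}]=1$:

```python
import random

random.seed(0)
N = 100_000

# iid samples from p = N(0, 1); the empirical average of h(x) = x^2
# estimates E[x^2] = Var(x) = 1
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
estimate = sum(x * x for x in samples) / N
```

With $N=100{,}000$ samples the standard error of this estimate is about $\sqrt{2/N}\approx 0.004$.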

However, obtaining the required samples from the distribution $p$ can be difficult. Even when the density $p(x)$ is tractable, there is often no easy way to sample from $p$ (in many cases, the practitioner faces the additional difficulty of knowing $p(x)$ only up to a normalizing constant). It is then interesting to temporarily circumvent the problem. A first approach is to take samples from a different distribution $q$ and use a selection or weighting scheme to ensure that the resulting samples reflect the target distribution $p$; this is the principle behind rejection sampling and importance sampling. A second possibility is to use a Markov chain Monte Carlo (MCMC) method, such as Metropolis-Hastings or Gibbs sampling, which relies on a Markov chain to produce samples.

Rejection sampling, importance sampling, the Metropolis-Hastings algorithm and Gibbs sampling are presented below.

## Rejection sampling

Let us suppose that the density $p$ is bounded above by $Mq$, where $M$ is a constant and $q$ is a distribution from which samples can be taken easily. We can then generate samples $(x,u)$ from the joint distribution $\mathcal{U}(\mathcal{X}\times[0;Mq(x)])$: take a sample $x$ from $q$, then take a sample $u$ uniformly in $[0;Mq(x)]$.

These samples correspond to uniformly distributed points below the graph of $Mq(x)$. The distribution $p$ can then be recovered by accepting only samples such that $u < p(x)$. This process is explained in Figure 1 and given in Algorithm 1.

**Figure 1:** In rejection sampling, taking a sample $x$ from $q$ and a sample $u$ from $\mathcal{U}[0;Mq(x)]$ results in a uniform distribution of points $(x,u)$ below the graph of $Mq(x)$. Samples from $p$ can then be obtained by accepting only the samples such that $u < p(x)$.

Inputs:
$p(x)$, the target distribution. $q(x)$, an instrumental distribution that is easy to sample from. $M$, a constant such that $\forall x\in\mathcal{X},\,p(x)\leq Mq(x)$.

Outputs:
$x$, a sample from the distribution $p$.

Variables:
$u$, an auxiliary random variable.

until a sample is accepted:

$x\sim q(x)$

$u\sim\mathcal{U}[0;Mq(x)]$

if $u < p(x)$: accept the sample $x$

else: reject the sample

end

If the envelope $Mq(x)$ is far above $p(x)$, a large number of samples from $q(x)$ will be needed before one is accepted. Thus the efficiency of rejection sampling depends on whether the acceptance rate $\frac{\text{number of accepted samples}}{\text{total number of samples}}$, which equals $1/M$ when $p$ and $q$ are normalized, can be kept sufficiently high.
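As a concrete sketch of Algorithm 1 (with a toy target not taken from the text), suppose the target density is $p(x)=2x$ on $[0;1]$, the instrumental distribution $q$ is uniform on $[0;1]$, and $M=2$ so that $p(x)\leq Mq(x)$ everywhere:

```python
import random

def p(x):
    """Target density on [0, 1]: p(x) = 2x (toy example)."""
    return 2.0 * x

def q(x):
    """Instrumental density: uniform on [0, 1]."""
    return 1.0

M = 2.0  # envelope constant: p(x) <= M * q(x) for all x in [0, 1]

def rejection_sample():
    """Repeat until a sample is accepted, as in Algorithm 1."""
    while True:
        x = random.random()                # x ~ q
        u = random.uniform(0.0, M * q(x))  # u ~ U[0; M*q(x)]
        if u < p(x):                       # (x, u) lies under the graph of p
            return x

random.seed(0)
samples = [rejection_sample() for _ in range(20_000)]
mean = sum(samples) / len(samples)  # E[x] under p(x) = 2x is 2/3
```

Here the acceptance rate is $1/M=0.5$, so about half of the proposed samples are rejected.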

Furthermore, rejection sampling does not behave well in high-dimensional spaces. Remember that in high dimensions, a hypercube of side $0.99$ contains only a small fraction of the volume of a slightly larger cube of side $1$. Similarly, in high-dimensional spaces, the density $p(x)$ tends to occupy only a small fraction of the envelope $Mq(x)$, leading to very low acceptance rates.

## Importance sampling

It is possible to estimate expectations under any distribution $p$ without rejecting samples, using a sampling distribution $q$ whose support covers that of $p$ and the importance sampling identity, i.e.

$$\int_{\mathcal{X}}h(x)p(x)dx=\int_{\mathcal{X}}h(x)\frac{p(x)}{q(x)}q(x)dx$$

Accordingly, an estimate for the above expectation can be obtained by taking iid samples $x_{1},x_{2},\dots,x_{N}$ from $q$ and computing:

$$\hat{\mathbb{E}}_{x\sim p(x)}\left[h(x)\right]=\frac{1}{N}\sum_{i=1}^{N}h(x_{i})\frac{p(x_{i})}{q(x_{i})}$$

where the ratios $\frac{p(x_{i})}{q(x_{i})}$ are the importance sampling weights.

Importance sampling has the advantage of not rejecting any sample, so all samples contribute to the estimate. However, if the sampling distribution $q$ does not fit $p$ closely enough, important regions of $p$ may receive almost no samples, and the resulting estimator can have very high, or even infinite, variance.
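As a minimal sketch of this estimator (with toy choices of $p$, $q$ and $h$, not taken from the text), take the standard normal as target, a wider normal $\mathcal{N}(0,2^{2})$ as instrumental distribution, and $h(x)=x^{2}$:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def p(x):
    """Target distribution: standard normal."""
    return normal_pdf(x, 0.0, 1.0)

def q(x):
    """Instrumental distribution: a wider normal, easy to sample from."""
    return normal_pdf(x, 0.0, 2.0)

def h(x):
    return x * x

random.seed(0)
N = 100_000
samples = [random.gauss(0.0, 2.0) for _ in range(N)]  # iid samples from q
# weighted average with importance weights p(x_i) / q(x_i)
estimate = sum(h(x) * p(x) / q(x) for x in samples) / N
```

Choosing $q$ wider than $p$ keeps the weights bounded here; the reverse choice (a $q$ narrower than $p$) would make the weights, and hence the variance, blow up in the tails.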

Although importance sampling can be very efficient, finding a suitable distribution $q$ which is both easy to sample from and close enough to the target distribution $p$ is sometimes impossible, which is why more complex sampling schemes are sometimes needed.

## Metropolis-Hastings algorithm

Let us consider Markov chains which admit a unique stationary distribution $\pi$. By designing a Markov chain such that $\pi(x)=p(x)$, we can start at a random position $\mathbf{x}^{(0)}$ and run the Markov chain until it converges to its stationary distribution, thereby producing a sample from $p(x)$.

To ensure that $\pi(x)=p(x)$, it is sufficient that the transition probabilities $T(x\rightarrow x')$ from one state to the next satisfy the condition of detailed balance, i.e.:

$$\forall x,\forall x',\quad p(x)T(x\rightarrow x')=p(x')T(x'\rightarrow x)$$

where $T(x\rightarrow x')$ is the transition probability of the Markov chain from state $x$ to state $x'$. A Markov chain which satisfies the above property and is ergodic can be proved to converge to the distribution $p(x)$ as time goes to infinity. The Metropolis-Hastings algorithm, which is based on this principle, is given in Algorithm 2.

Inputs:
$p(x)$, the target distribution (possibly known only up to a normalizing constant). $T(\mathbf{x}\rightarrow\mathbf{x}')$, the proposal probability of jumping from state $\mathbf{x}$ to state $\mathbf{x}'$. $\mathbf{x}^{(0)}$, the initial state of the Markov chain. $B$, a burn-in period. $K$, a number of iterations between collected samples.

Outputs:
$\mathcal{S}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{N}\}$, a set of samples from the distribution $p$.

Variables:
$\mathbf{x}^{(t)}$, the state of the Markov chain at time $t$. $\mathbf{x}'$, a proposed state. $a$, an auxiliary random variable.

$\mathcal{S}:=\emptyset$

for $t=0$ to $B+N\times K-1$:

$\mathbf{x}'\sim T(\mathbf{x}^{(t)}\rightarrow\mathbf{x}')$

$a\sim\mathcal{U}[0;1]$

if $a < \dfrac{p(\mathbf{x}')T(\mathbf{x}'\rightarrow\mathbf{x}^{(t)})}{p(\mathbf{x}^{(t)})T(\mathbf{x}^{(t)}\rightarrow\mathbf{x}')}$: accept, $\mathbf{x}^{(t+1)}:=\mathbf{x}'$

else: reject, $\mathbf{x}^{(t+1)}:=\mathbf{x}^{(t)}$ (the current state is repeated, not re-proposed)

if $t\geq B$ and $\left((t-B)\bmod K\right)=0$:

$\mathcal{S}:=\mathcal{S}\cup\{\mathbf{x}^{(t+1)}\}$

return $\mathcal{S}$

end
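As a consistency check (a standard argument, restated here in the notation above), the acceptance rule of Algorithm 2 indeed enforces detailed balance. Writing the acceptance probability as $\alpha(\mathbf{x}\rightarrow\mathbf{x}')=\min\left(1,\frac{p(\mathbf{x}')T(\mathbf{x}'\rightarrow\mathbf{x})}{p(\mathbf{x})T(\mathbf{x}\rightarrow\mathbf{x}')}\right)$, the probability of actually moving from $\mathbf{x}$ to $\mathbf{x}'\neq\mathbf{x}$ is $T(\mathbf{x}\rightarrow\mathbf{x}')\,\alpha(\mathbf{x}\rightarrow\mathbf{x}')$, and:

$$p(\mathbf{x})T(\mathbf{x}\rightarrow\mathbf{x}')\,\alpha(\mathbf{x}\rightarrow\mathbf{x}')=\min\bigl(p(\mathbf{x})T(\mathbf{x}\rightarrow\mathbf{x}'),\,p(\mathbf{x}')T(\mathbf{x}'\rightarrow\mathbf{x})\bigr)=p(\mathbf{x}')T(\mathbf{x}'\rightarrow\mathbf{x})\,\alpha(\mathbf{x}'\rightarrow\mathbf{x})$$

since the middle expression is symmetric in $\mathbf{x}$ and $\mathbf{x}'$. Note that only ratios of $p$ appear, which is why $p$ may be known up to a constant.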

In practice, convergence to $p$ only as time goes to infinity means that the Markov chain must run for many iterations (the burn-in period $B$) before a sample can reasonably be assumed to be distributed according to the target distribution. Moreover, even after convergence, nearby samples $x^{(t+1)},x^{(t+2)},\dots$ are strongly correlated. It is therefore necessary to wait a number of steps $K$ between collected samples, so that the Markov chain has enough time to produce approximately independent samples.

Despite the fact that only a fraction of the algorithm's iterations actually result in a collected sample, this computational cost is counterbalanced by the fact that the Metropolis-Hastings algorithm scales far better with dimension than rejection sampling. This makes the Metropolis-Hastings algorithm and its variants especially useful when confronted with complex distributions in high dimension.
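To make Algorithm 2 concrete, here is a minimal sketch in Python (standard library only) sampling from a standard normal known only up to its normalizing constant, using a symmetric Gaussian random-walk proposal so that the proposal terms cancel in the acceptance ratio. The step size, burn-in and thinning values below are illustrative choices, not prescriptions from the text:

```python
import math
import random

def p_unnorm(x):
    """Target density known only up to a constant: exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, burn_in=1000, thin=10, step=1.0, seed=0):
    rng = random.Random(seed)
    x = 0.0                                # initial state x^(0)
    samples = []
    for t in range(burn_in + n_samples * thin):
        x_prop = x + rng.gauss(0.0, step)  # symmetric random-walk proposal
        # T(x -> x') = T(x' -> x), so the ratio reduces to p(x') / p(x)
        if rng.random() < p_unnorm(x_prop) / p_unnorm(x):
            x = x_prop                     # accept the proposed state
        # on rejection the chain stays at x: the state is repeated
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append(x)              # collect every thin-th state
    return samples

samples = metropolis_hastings(n_samples=5000)
```

The collected samples should have mean close to $0$ and variance close to $1$, the moments of the (normalized) target $\mathcal{N}(0,1)$.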

## Gibbs sampling

When trying to sample from a joint distribution $p(\mathbf{x})=p(x_{1},x_{2},\dots,x_{D})$ it is sometimes easy to sample from the conditional distributions $p(x_{i}|x_{1},x_{2},\dots x_{i-1},x_{i+1},\dots,x_{D})$. The Gibbs sampling algorithm is a variant of the Metropolis-Hastings algorithm which proposes to repeatedly sample each variable given the others. This results in a Markov chain which converges to the joint distribution $p(x_{1},x_{2},\dots,x_{D})$.

Gibbs sampling is often a very efficient sampling algorithm when the conditional distributions are available.

A visualization of the Gibbs sampling algorithm is shown in Figure 2, and the full algorithm is given in Algorithm 3.

**Figure 2:** Visualization of the Gibbs sampling algorithm for a joint distribution $p(x,y)$. The algorithm starts at a random position $(x^{(0)},y^{(0)})$ and then alternately samples according to $p(y|x)$ and $p(x|y)$.

Inputs:
$p(x_{i}|x_{1},x_{2},\dots,x_{i-1},x_{i+1},\dots,x_{D})$, the conditional distributions of the target distribution $p(\mathbf{x})$. $\mathbf{x}^{(0)}=x_{1}^{(0)},x_{2}^{(0)},\dots,x_{D}^{(0)}$, the initial state of the Markov chain. $K$, a number of iterations before taking a sample.

Outputs:
$\mathbf{x}$, a sample from the distribution $p$.

Variables:
$\mathbf{x}^{(t)}=x_{1}^{(t)},x_{2}^{(t)},\dots,x_{D}^{(t)}$, the state of the Markov chain at time $t$.

for $t=1$ to $K$:

for $i=1$ to $D$:

$x_{i}^{(t)}\sim p(x_{i}|x_{1}^{(t)},x_{2}^{(t)},\dots,x_{i-1}^{(t)},x_{i+1}^{(t-1)},\dots,x_{D}^{(t-1)})$

$\mathbf{x}:=x_{1}^{(K)},x_{2}^{(K)},\dots,x_{D}^{(K)}$

return $\mathbf{x}$

end
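As an illustration (a toy example not taken from the text), consider a bivariate normal with zero means, unit variances and correlation $\rho$, whose conditionals are the one-dimensional normals $x\,|\,y\sim\mathcal{N}(\rho y,\,1-\rho^{2})$ and $y\,|\,x\sim\mathcal{N}(\rho x,\,1-\rho^{2})$; a minimal Gibbs sampler in Python then reads:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=100, seed=0):
    """Gibbs sampler for a bivariate normal with zero means, unit
    variances and correlation rho; both conditionals are 1-D normals."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0                  # arbitrary initial state
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    chain = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)   # sample x | y ~ N(rho*y, 1 - rho^2)
        y = rng.gauss(rho * x, sd)   # sample y | x ~ N(rho*x, 1 - rho^2)
        if t >= burn_in:
            chain.append((x, y))
    return chain

chain = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
```

The empirical correlation of the collected pairs should be close to $\rho$; note that successive states of the chain remain correlated, as with any MCMC method.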