Learning with a probabilistic approach often reduces to the problem of density estimation: trying to find a probability distribution $p(\mathbf{x})$ which is likely to have generated the dataset $\mathcal{D}$.

The search for a suitable $p(\mathbf{x})$ is usually limited to a specific family of probability distributions, such as the Gaussian family $\mathcal{N}(\mu,\sigma^{2})$ with mean $\mu$ and variance $\sigma^{2}$. In practice, the parameters can be grouped into a set $\theta$; for the Gaussian distribution, we have for instance $\theta=\{\mu,\sigma^{2}\}$. The problem is then to find likely parameters $\theta$ under the assumption that $p_{\theta}(\mathbf{x})$ generated the dataset $\mathcal{D}$.

One can easily understand that not all parameter values are equally likely. For instance, if the dataset consists of points $x$ between $100$ and $101$, the standard normal distribution $\mathcal{N}(0,1)$, centered on $0$, is a very unlikely candidate.

Although the above notations may seem to be specific to unsupervised learning, density estimation also applies to supervised learning. The goal for a dataset $\mathcal{D}=\{(\mathbf{x}_{1},\mathbf{y}_{1}),\dots,(\mathbf{x}_{N},\mathbf{y}_{N})\}$ is then to find a conditional distribution $p(\mathbf{y}|\mathbf{x})$ which is likely to have generated each $\mathbf{y}_{i}$ given $\mathbf{x}_{i}$.

We now look into several approaches which can be used to estimate distributions.

## KL-divergence and likelihood

In the previous chapter, we considered several loss functions, each adapted to a particular problem. In the context of density estimation, we can use the KL-divergence, which is given by:

$$d_{\text{KL}}(p,q)=\sum_{\mathbf{x}}p(\mathbf{x})\log\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right)$$

where the sum runs over all possible values of $\mathbf{x}$.

The KL-divergence can be used as a measure of the difference between two distributions, but it is not a distance: it is not symmetric (in the general case, $d_{\text{KL}}(p,q)\neq d_{\text{KL}}(q,p)$) and it does not satisfy the triangle inequality.
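Both the definition and the asymmetry are easy to check numerically for discrete distributions. In the sketch below, $p$ and $q$ are arbitrary example distributions over three outcomes:

```python
import math

def kl_divergence(p, q):
    """d_KL(p, q) = sum_x p(x) * log(p(x) / q(x)) over a discrete support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two arbitrary discrete distributions over the same three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.1, 0.4, 0.5]

print(kl_divergence(p, q))  # nonnegative, zero only when p == q
print(kl_divergence(q, p))  # generally different: the KL-divergence is asymmetric
```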

If we consider the empirical data distribution $p_{\mathcal{D}}$ defined by the training dataset $\mathcal{D}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{N}\}$, where each sample has a probability $p_{\mathcal{D}}(\mathbf{x}_{i})=\frac{1}{N}$, we can then try to fit a model $p_{\theta}$ to this data distribution by minimizing the KL-divergence, i.e. solving

$$\theta^{*}=\argmin_{\theta}d_{\text{KL}}(p_{\mathcal{D}},p_{\theta})$$

Note that the KL-divergence can be rewritten as $$d_{\text{KL}}(p_{\mathcal{D}},p_{\theta})=\sum_{\mathbf{x}}p_{\mathcal{D}}(\mathbf{x})\log p_{\mathcal{D}}(\mathbf{x})-\underbrace{\sum_{\mathbf{x}}p_{\mathcal{D}}(\mathbf{x})\log p_{\theta}(\mathbf{x})}_{\text{log-likelihood}}$$

where the first term does not depend on $\theta$ and the second term is referred to as the log-likelihood, a concept which will be reviewed thoroughly in the following sections. From the above equation, it follows that minimizing the KL-divergence is equivalent to maximizing the log-likelihood, i.e. :

$$\theta^{*}=\argmax_{\theta}\sum_{\mathbf{x}}p_{\mathcal{D}}(\mathbf{x})\log p_{\theta}(\mathbf{x})$$

or equivalently, using the definition of $p_{\mathcal{D}}$ (the constant factor $\frac{1}{N}$ does not affect the maximizer):

$$\theta^{*}=\argmax_{\theta}\sum_{\mathbf{x}\in\mathcal{D}}\log p_{\theta}(\mathbf{x})$$

The problem of density estimation can therefore be solved by minimizing the KL-divergence or, equivalently, by maximizing the log-likelihood.
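For a univariate Gaussian, the maximization over $\theta=\{\mu,\sigma^{2}\}$ has the well-known closed-form solution (sample mean and sample variance). The sketch below, on a hypothetical dataset, checks that this closed-form estimate attains a higher log-likelihood than nearby perturbed parameter values:

```python
import math

def gaussian_log_likelihood(data, mu, var):
    """Sum of log N(x | mu, var) over the dataset."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

data = [1.2, 0.8, 1.5, 0.9, 1.1]  # hypothetical samples

# Closed-form maximum-likelihood estimates for a univariate Gaussian.
mu_mle = sum(data) / len(data)
var_mle = sum((x - mu_mle) ** 2 for x in data) / len(data)

# The MLE attains a higher log-likelihood than nearby parameter settings.
best = gaussian_log_likelihood(data, mu_mle, var_mle)
for mu, var in [(mu_mle + 0.3, var_mle), (mu_mle, var_mle * 2.0)]:
    assert gaussian_log_likelihood(data, mu, var) < best
```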

## Bayes’ rule

In the previous section, we tried to find the best parameter $\theta$ to minimize the KL-divergence. From the optimization perspective, the parametrization of the distribution by $\theta$ is written $p_{\theta}(\mathbf{x})$. From a Bayesian perspective, however, $\theta$ is seen as a random variable, and the model then corresponds to the probability of $\mathbf{x}$ given $\theta$, i.e. $p(\mathbf{x}|\theta)$. Bayes' rule can then be used to find the probability of parameter values $\theta$ given a dataset $\mathcal{D}$:

$$\underbrace{p(\theta|\mathcal{D})}_{\text{posterior}}=\frac{\overbrace{p(\mathcal{D}|\theta)}^{\text{likelihood}}\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{\sum_{\theta}p(\mathcal{D}|\theta)p(\theta)}_{\text{evidence}}}.$$
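When $\theta$ ranges over a discrete set, as in the sum of the denominator, Bayes' rule can be evaluated directly. The sketch below places a uniform prior over a grid of candidate means $\theta=\mu$ for a Gaussian with known unit variance (the data and the grid are illustrative choices):

```python
import math

def likelihood(data, mu, sigma=1.0):
    """p(D | mu): i.i.d. Gaussian likelihood with known variance sigma^2."""
    return math.exp(sum(-0.5 * ((x - mu) / sigma) ** 2
                        - math.log(sigma * math.sqrt(2 * math.pi)) for x in data))

data = [0.9, 1.1, 1.3]                     # hypothetical observations
thetas = [i / 10 for i in range(-20, 41)]  # discrete grid of candidate means
prior = [1 / len(thetas)] * len(thetas)    # uniform prior over the grid

# Bayes' rule: posterior = likelihood * prior / evidence.
unnorm = [likelihood(data, t) * pr for t, pr in zip(thetas, prior)]
evidence = sum(unnorm)
posterior = [u / evidence for u in unnorm]

best = thetas[posterior.index(max(posterior))]
print(best)  # with a uniform prior, the posterior peaks at the sample mean 1.1
```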

## Likelihood

$p(\mathcal{D}|\theta)$ is the likelihood of the dataset $\mathcal{D}$ under the model: the probability of $\mathcal{D}$ under a specific model parametrized by $\theta$, i.e. the belief that the model parametrized by $\theta$ could have generated $\mathcal{D}$. If we assume that the points in the dataset are i.i.d., $p(\mathcal{D}|\theta)$ is equal to the product of the point-wise probabilities, i.e. $p(\mathcal{D}|\theta)=\prod_{\mathbf{x}\in\mathcal{D}}p(\mathbf{x}|\theta)$. With $\theta$ treated as a random variable, the likelihood of a single sample $p(\mathbf{x}|\theta)$ is in fact the model's probability distribution $p_{\theta}(\mathbf{x})$; for a $D$-dimensional Gaussian family, for example, we would have

$$p(\mathbf{x}|\theta)=\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{1}{(2\pi)^{D/2}\left|\boldsymbol{\Sigma}\right|^{1/2}}\exp\left\{ -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} $$

where $\boldsymbol{\mu}$ is the mean, $\boldsymbol{\Sigma}$ is the $D\times D$ covariance matrix, and $\left|\boldsymbol{\Sigma}\right|$ is the determinant of $\boldsymbol{\Sigma}$.
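The formula above translates directly into code. The sketch below evaluates the density at the mean of a $2$-dimensional standard Gaussian, where it equals $\frac{1}{2\pi}$:

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula above."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)

mu = np.zeros(2)
Sigma = np.eye(2)  # identity covariance

density_at_mean = multivariate_gaussian_pdf(np.zeros(2), mu, Sigma)
print(density_at_mean)  # 1 / (2 * pi), since the exponent vanishes at the mean
```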

## Posterior

$p(\theta|\mathcal{D})$ corresponds to the belief that $\theta$ is a likely parameter value of the distribution $p(\mathbf{x})$, given the dataset $\mathcal{D}$. When we are only interested in the best possible parameter value, maximizing the posterior leads to the most probable value of the parameter $\theta$ given the dataset $\mathcal{D}$. However, the posterior is a probability distribution and therefore assigns a probability to all possible values of $\theta$. This is especially useful for assessing the variance of an estimate.

## Evidence

$\sum_{\theta}p(\mathcal{D}|\theta)p(\theta)$ is of little practical importance here and can simply be seen as a normalization constant ensuring that the posterior probabilities sum up to $1$.

## Prior

$p(\theta)$ corresponds to the a priori probability of $\theta$, that is, the belief we have that $\theta$ is a reasonable parameter value before having seen the dataset. This can seem a bit paradoxical, which is why we will return to this question shortly.