Choosing a prior

In the limit of infinitely many observations, and for a prior that is non-zero everywhere, the posterior distribution is dominated by the likelihood and the influence of the prior vanishes. Conversely, when the dataset is empty, the posterior distribution is equal to the prior. In other words, the maximum a posteriori (MAP) estimate depends strongly on the prior when there are few observations and is close to the maximum likelihood estimate when there are many.

Let us consider the problem of estimating the probability that a coin toss results in “heads”. This amounts to estimating the parameter $\theta$ of a Bernoulli distribution:
$$P(\mathbf{x}=1) = \theta$$
$$P(\mathbf{x}=0) = 1-\theta,$$
where $\mathbf{x}$ is a Bernoulli random variable such that $1$ corresponds to heads and $0$ to tails. In this example, a reasonable prior could be a distribution peaked around $\theta=0.5$ and decreasing towards $0$ and $1$. Such a prior makes explicit our belief that most coins have an almost equal probability of coming up heads or tails, and that it would probably be very difficult to find a coin which always falls on the same side (we assume of course that the coin in question does not have a face on both sides, as is often the case when the problem occurs in practice). Figure 3 gives three examples of priors which may be suited to this problem.

Figure 3: Three possible choices of prior for a Bernoulli distribution. $\mathcal{B}eta(10,10)$ (left), $\mathcal{B}eta(2,2)$ (middle), $\mathcal{B}eta(1,1)$ or equivalently uniform distribution (right).
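
To get a feel for how differently these three priors weight values of $\theta$, their densities can be evaluated numerically. The following is a minimal sketch using only the Python standard library; the function name `beta_pdf` and the grid of $\theta$ values are choices made here for illustration.

```python
import math


def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta, for 0 < theta < 1."""
    # Normalizing constant 1 / B(a, b), written with the gamma function.
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)


# The three priors of Figure 3, as (a, b) parameter pairs.
priors = {"Beta(10,10)": (10, 10), "Beta(2,2)": (2, 2), "Beta(1,1)": (1, 1)}

# Illustrative grid of theta values (an arbitrary choice).
thetas = (0.1, 0.3, 0.5, 0.7, 0.9)

for name, (a, b) in priors.items():
    densities = [round(beta_pdf(t, a, b), 3) for t in thetas]
    print(name, densities)
```

The larger the two parameters, the more sharply the density concentrates around $\theta=0.5$; with both parameters equal to $1$ the density is constant and equal to $1$ on the whole interval.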

The Beta distribution in Figure 3 has a special relation to the Bernoulli distribution, namely it is a conjugate prior of the Bernoulli distribution. Conjugate priors have the interesting property of ensuring that the posterior is in the same family of distributions as the prior. If the prior is given by a Beta distribution and the likelihood is a Bernoulli distribution (as in our example) then the posterior is also a Beta distribution.
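
The conjugacy can be checked directly. As a sketch, write the prior as $\mathcal{B}eta(a,b)$ and suppose the dataset consists of independent tosses with $h$ heads and $t$ tails. The symbols $a$, $b$, $h$, $t$ and the density notation $p(\cdot)$ are introduced here for illustration only. Up to normalizing constants that do not depend on $\theta$:
$$p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}$$
$$P(\text{data}\mid\theta) = \theta^{h}(1-\theta)^{t}$$
$$p(\theta\mid\text{data}) \propto \theta^{a+h-1}(1-\theta)^{b+t-1},$$
which is the unnormalized density of $\mathcal{B}eta(a+h,\,b+t)$. Each observed toss simply increments one of the two parameters, so the parameters of the prior act much like counts of imaginary earlier tosses.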

It is important to realize that the choice of a prior is, by definition, a subjective one. If the practitioner does not want to make this choice, or if all values of $\theta$ appear equally plausible, it is common practice to choose the uniform distribution, whose density does not depend on the parameter $\theta$ and which therefore assigns equal probability density to all possible values.
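
The effect of this choice on the estimate can be illustrated numerically. The sketch below assumes the standard Beta-Bernoulli update, in which a $\mathcal{B}eta(a,b)$ prior combined with $h$ heads and $t$ tails gives a $\mathcal{B}eta(a+h,b+t)$ posterior, and takes the MAP estimate to be the mode of that posterior. The function names and the toss counts are hypothetical and chosen only for illustration.

```python
def posterior_params(a, b, heads, tails):
    """Parameters of the Beta posterior for a Beta(a, b) prior."""
    return a + heads, b + tails


def beta_mode(a, b):
    """Mode of Beta(a, b); this formula requires a >= 1, b >= 1 and a + b > 2."""
    return (a - 1) / (a + b - 2)


def mle(heads, tails):
    """Maximum likelihood estimate of theta: the observed fraction of heads."""
    return heads / (heads + tails)


# Hypothetical datasets with the same 3:1 ratio of heads to tails.
datasets = [(3, 1), (30, 10), (3000, 1000)]

for heads, tails in datasets:
    map_uniform = beta_mode(*posterior_params(1, 1, heads, tails))
    map_peaked = beta_mode(*posterior_params(10, 10, heads, tails))
    print(heads + tails, mle(heads, tails), map_uniform, map_peaked)
```

Under the uniform prior the MAP estimate coincides with the maximum likelihood estimate for every dataset. Under the $\mathcal{B}eta(10,10)$ prior, 3 heads and 1 tail give a MAP estimate of $12/22 \approx 0.55$, still close to the prior peak, and the estimate only approaches $0.75$ as the number of tosses grows.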

However, the uniform distribution is not a non-informative prior, because it carries information about the structure of the parameter space. Namely, if we have $\theta\in\mathbb{R}$, a uniform prior represents the belief that there is as much probability mass in the interval $]0,1[$ as in any other interval $]z,z+1[$, even though every such interval in fact contains as many real numbers as $\mathbb{R}$ itself.
