A probability can be seen as a number between $0$ and $1$ indicating a degree of belief. This is referred to as the Bayesian view, in contrast with the frequentist view, in which a probability represents the fraction of experiments in which an event occurs, in the limit of infinitely many experiments. In support of the Bayesian view, [Cox1946] argues that any system of beliefs consistent with common sense must satisfy the rules of probability, which we state informally in the table below.

| Basic notions | |
|---|---|
| $P(A)$ | represents the belief that some event $A$ will happen. |
| $P(\bar{A})$ | represents the belief that $A$ will not happen. |
| $P(A)=0$ | represents impossibility of $A$. |
| $P(A)=1$ | represents certainty of $A$. |
| $P(A\cap B)$ | represents the belief that both $A$ and $B$ will happen. |
| $P(A\cup B)$ | represents the belief that $A$ or $B$ (possibly both) will happen. |
| $P(A\mid B)$ | represents the belief that $A$ will happen given that $B$ has happened. |

| And now a few rules | |
|---|---|
| $P(\bar{A})=1-P(A)$ | Complementary event. |
| $P(A\cap B)=P(A)P(B)$ if and only if $A$ and $B$ are independent | Independent events. |
| $P(A\cup B)=P(A)+P(B)-P(A\cap B)$ | Union of events. |
| $P(A\cap B)=P(A\mid B)P(B)$ | Conditional probability. |

| Random variables | |
|---|---|
| $P(X=x)$ | represents the probability that a random variable $X$ will take the value $x$. |
| $P(x)$ | is a shorthand for $P(X=x)$ when there is no ambiguity on the random variable. |
| $P(X=x,Y=y)$ | is called the joint probability of $X$ and $Y$. It can be thought of as $P(X=x\cap Y=y)$. |
| $P(x,y)$ | is a shorthand for $P(X=x,Y=y)$ when there is no ambiguity on the random variables. |

| Rules on random variables | |
|---|---|
| $P(x,y)=P(x)P(y)$ if and only if $X$ and $Y$ are independent | Independent random variables. |
| $P(x,y)=P(x\mid y)P(y)=P(y\mid x)P(x)$ | Product rule. |
| $P(x)=\sum_{y}P(x,y)$ | Sum rule. |
| $P(y\mid x)=\dfrac{P(x\mid y)P(y)}{P(x)}$ | Bayes' rule. |

**Table 1:** Basic notions and rules of probability theory.
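These rules are easy to check numerically. The sketch below builds a small joint distribution over two binary random variables (the probability values are arbitrary, chosen only for illustration) and verifies that the sum rule, the product rule, and Bayes' rule are consistent with one another.

```python
# A toy joint distribution P(X, Y) over two binary variables.
# The probabilities are arbitrary illustrative values that sum to 1.
P_joint = {
    ("x0", "y0"): 0.10, ("x0", "y1"): 0.30,
    ("x1", "y0"): 0.20, ("x1", "y1"): 0.40,
}

def P_x(x):
    """Sum rule: P(x) = sum over y of P(x, y)."""
    return sum(p for (xi, _), p in P_joint.items() if xi == x)

def P_y(y):
    """Sum rule applied to the other variable."""
    return sum(p for (_, yi), p in P_joint.items() if yi == y)

def P_y_given_x(y, x):
    """Product rule rearranged: P(y | x) = P(x, y) / P(x)."""
    return P_joint[(x, y)] / P_x(x)

def P_x_given_y(x, y):
    """Product rule rearranged the other way: P(x | y) = P(x, y) / P(y)."""
    return P_joint[(x, y)] / P_y(y)

def bayes(y, x):
    """Bayes' rule: P(y | x) = P(x | y) P(y) / P(x)."""
    return P_x_given_y(x, y) * P_y(y) / P_x(x)

# Both routes to P(y | x) agree, as Bayes' rule promises.
assert abs(P_y_given_x("y1", "x0") - bayes("y1", "x0")) < 1e-12
```

Note that in this toy example $X$ and $Y$ are not independent: $P(x_0, y_0) = 0.10$ while $P(x_0)P(y_0) = 0.4 \times 0.3 = 0.12$, so the factorization rule for independent variables does not hold here.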

Note that the rules of frequentist and Bayesian probability are the same, but the Bayesian interpretation has the advantage of being much more widely applicable: it does not require complicated arguments to justify that the question concerns a repeatable experiment.

Consider for instance the event $A =$ "The world will end in 2012". It is quite natural to treat the belief we have in the realization of this event as a probability. We can then reason about $P(A)$ using the rules of probability to connect it to other events, perhaps through conditional probabilities such as $P(A|B)$ with $B =$ "The Doomsday Clock will move closer to midnight in 2012" (it did in fact move to 23:55). In the frequentist interpretation, it is difficult to see $A$ or $B$ as repeatable events, and therefore to study them from a probabilistic perspective, without an abstruse argument involving quantum theory and alternate universes (in which case it is even harder to assess what would happen in those alternate realities).

Importantly, the world did not end in 2012, and where the frequentist can only say that we observed one of two possible outcomes, the Bayesian can now assert with confidence that $P(A)=0$. In the Bayesian framework, it is natural to update our beliefs when confronted with experimental evidence, i.e. to learn from data. We now present how to leverage this possibility in the context of ML.