We have seen how a learning algorithm can be posed as an optimization problem. However, learning from data also benefits greatly from a probabilistic perspective. In particular, Bayesian probability theory provides a sound mathematical framework for updating models based on observed data.

In this chapter, we start with a quick review of basic notions in probability theory and then show how to estimate distributions in the Bayesian framework. This leads us to the concepts of maximum likelihood and maximum a posteriori estimation. We work through an example with the Gaussian family and give polynomial regression a probabilistic interpretation. This theoretical framework then allows us to introduce the possibility of learning representations with probabilistic models: we describe the Expectation Maximization (EM) algorithm and how it can be applied to Gaussian mixtures. Finally, we revisit optimization by considering the specific problem of maximizing the likelihood of a probabilistic model, and show how a suitable metric (the Fisher metric) leads to the natural gradient, which improves on ordinary gradient descent by making it invariant with respect to the parametrization.
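
As a rough preview, the central objects of the chapter can be sketched in a few formulas. The notation is an assumption made here for illustration and may differ from the one used in the lessons: $\theta$ denotes the model parameters, $\mathcal{D}$ the observed data, $\eta > 0$ a step size, and $F(\theta)$ the Fisher information matrix.

$$
\begin{aligned}
p(\theta \mid \mathcal{D}) &= \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}
&& \text{(Bayes' rule: prior updated by the data)} \\
\hat{\theta}_{\mathrm{ML}} &= \arg\max_{\theta}\; p(\mathcal{D} \mid \theta)
&& \text{(maximum likelihood)} \\
\hat{\theta}_{\mathrm{MAP}} &= \arg\max_{\theta}\; p(\mathcal{D} \mid \theta)\, p(\theta)
&& \text{(maximum a posteriori)} \\
\theta_{t+1} &= \theta_t + \eta\, F(\theta_t)^{-1}\, \nabla_{\theta} \log p(\mathcal{D} \mid \theta_t)
&& \text{(natural gradient ascent on the log-likelihood)} \\
F(\theta) &= \mathbb{E}_{x \sim p(\cdot \mid \theta)}\!\left[ \nabla_{\theta} \log p(x \mid \theta)\, \nabla_{\theta} \log p(x \mid \theta)^{\top} \right]
&& \text{(Fisher information matrix)}
\end{aligned}
$$

Maximum a posteriori estimation reduces to maximum likelihood when the prior $p(\theta)$ is constant in $\theta$, and the natural gradient step reduces to an ordinary gradient step when $F(\theta)$ is the identity.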

## Lessons:

- Notions in probability theory
- Sampling from complex distributions
- Density estimation
- Maximum a posteriori and maximum likelihood
- Choosing a prior
- Example: Maximum likelihood for the Gaussian
- Example: Probabilistic polynomial regression
- Latent variables and Expectation Maximization
- Example: Gaussian mixtures and EM
- Optimization revisited in the context of maximum likelihood
- Summary