- From a Bayesian probabilistic perspective, it is natural to update our beliefs with data.
- It is sometimes necessary to use methods such as rejection sampling, importance sampling, the Metropolis-Hastings algorithm or Gibbs sampling to sample from complex distributions.
- Probabilistic models can be trained by minimizing the KL-divergence between the empirical data distribution and the model distribution.
- Equivalently, a probabilistic model can be trained by maximizing the log-likelihood of a dataset under the model.
- Bayes’ formula gives a method for choosing the most probable parameters given data: maximum a posteriori (MAP) estimation.
- The prior distribution gives probabilities to model parameters before having seen a dataset.
- When the prior is considered uniform, maximum-a-posteriori is equivalent to maximum-likelihood.
- Probabilistic models can have latent variables which can be understood as unobserved explanatory factors.
- Models with latent variables can be trained with the EM algorithm, which alternates between computing the expected values of the latent variables given the current parameter estimate (E step), and maximizing the expected log-likelihood given these soft assignments of the latent variables (M step).
- Training Gaussian mixtures with EM can be seen as a probabilistic generalization of the K-means clustering algorithm.
- The gradient of the log-likelihood in the Euclidean metric depends on the parametrization of the model.
- The natural gradient, based on the Fisher metric, is invariant under re-parametrization and can introduce further invariances during optimization.
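
The equivalence between minimizing the KL-divergence and maximizing the log-likelihood can be checked in one line. The sketch below assumes discrete data and uses notation that does not appear in the summary above: a dataset $x_1, \dots, x_N$, its empirical distribution $\hat{p}(x) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}[x_n = x]$, its entropy $H(\hat{p})$, and a model $p_\theta$.

$$
\mathrm{KL}(\hat{p} \,\|\, p_\theta)
= \sum_x \hat{p}(x) \log \frac{\hat{p}(x)}{p_\theta(x)}
= -H(\hat{p}) - \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(x_n)
$$

The entropy term does not depend on $\theta$, so

$$
\arg\min_\theta \mathrm{KL}(\hat{p} \,\|\, p_\theta) = \arg\max_\theta \sum_{n=1}^{N} \log p_\theta(x_n).
$$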
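
As an illustration of the EM alternation for Gaussian mixtures, here is a minimal sketch in Python with NumPy, restricted to one-dimensional data. The function name `em_gmm_1d`, the quantile-based initialization, the small variance floor and the synthetic dataset are all assumptions made for this example, not part of the summary above.

```python
import numpy as np


def em_gmm_1d(x, k=2, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM (illustrative sketch)."""
    n = x.shape[0]
    # Initialization (an arbitrary choice): means spread over data quantiles,
    # shared initial variance, uniform mixing weights.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    log_liks = []

    for _ in range(n_iter):
        # E step: responsibilities resp[n, j] = p(z_n = j | x_n, current parameters),
        # computed in log space for numerical stability.
        sq_dist = (x[:, None] - means) ** 2
        log_pdf = -0.5 * (np.log(2.0 * np.pi * variances) + sq_dist / variances)
        log_joint = np.log(weights) + log_pdf
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_norm[:, None])
        # Log-likelihood of the data under the parameters used in this E step.
        log_liks.append(log_norm.sum())

        # M step: closed-form maximizers of the expected log-likelihood.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        # The 1e-6 floor is a practical safeguard against variance collapse.
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6

    return weights, means, variances, np.array(log_liks)


# Usage on synthetic data: two well-separated clusters centred at -3 and +3.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
weights, means, variances, log_liks = em_gmm_1d(x, k=2)
```

Replacing the soft responsibilities with hard assignments to the nearest mean, and keeping equal fixed variances and weights, turns these updates into the K-means updates. That is the sense in which EM on Gaussian mixtures generalizes K-means.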