The probabilistic models seen so far can give an interesting interpretation of a dataset, but we have not yet discussed how to leverage these algorithms to find new representations. This can be done in the context of estimating distributions with the help of latent variables [Ghahramani2004]. Latent variables are sometimes called hidden variables or unobserved variables by opposition to the observed variables $\mathbf{x}=x_{1},\dots,x_{D}$.

A simple way to introduce latent variables in a probability distribution is to use a joint distribution over both observed and hidden variables which can then be marginalized over the hidden variables, as in

$$p(\mathbf{x})=\sum_{\mathbf{h}}p(\mathbf{x},\mathbf{h}),$$

where the sum over $\mathbf{h}$ is to be understood as a sum over all possible values of $\mathbf{h}$. As expected, the resulting distribution is on the observed variable $\mathbf{x}$ alone.

When the model is trained, the latent variables can be seen as representing explanatory variables which best help to model the distribution on the observed variables.

A problem arises when trying to learn the joint distribution $p(\mathbf{x},\mathbf{h})$ because by definition, the dataset $\mathcal{D}$ only contains samples from the observed variable $\mathbf{x}$, and not samples $\mathbf{x},\mathbf{h}$ as would be required to learn a joint distribution $p(\mathbf{x}|\mathbf{h})$. As a result, it is often impossible to find a closed-form solution to the maximum likelihood of models with latent variables.

Nonetheless, latent variable models can usually be trained with a variant of the Expectation Maximization (EM) algorithm [Dempster1977, Borman2004] which alternates between two steps (in fact, we present here a point-estimate variant of EM called classification EM. See [Gupta2011] for a complete description of EM):

- (expectation) Compute the inference distribution $p(\mathbf{h}|\mathbf{x})$; construct samples $\mathbf{x},\mathbf{h}$ with $\mathbf{h}$ the most probable value of $\mathbf{h}$ according to $p(\mathbf{h}|\mathbf{x})$.
- (maximization) maximize the likelihood of $p(\mathbf{x},\mathbf{h})$ given the samples $\mathbf{x},\mathbf{h}$ obtained previously.

Although it is not easy to derive this algorithm, it can be shown to converge to a local minimum [Dempster1977, Wu1983].

Note than EM is not the only way to train models with latent variables and it is sometimes preferable to use an other optimization algorithm such as gradient descent.