In supervised learning, the objective is to approximate an unknown function $f^{*}$ from a number of observations $(\mathbf{x},\mathbf{y})$ with the assumption that $\mathbf{y}=f^{*}(\mathbf{x})$. These observations are called *learning examples* and are usually compiled into a set $\mathcal{D}=\{(\mathbf{x}_{1},\mathbf{y}_{1}),\dots,(\mathbf{x}_{N},\mathbf{y}_{N})\}$ called a *dataset*.

The problem of finding a function $\hat{f}$ in a class of functions $\mathcal{H}$ such that

$$\hat{f}(\mathbf{x}_{i})\approx\mathbf{y}_{i}$$

is called a *supervised learning problem*. The set $\mathcal{D}$ is referred to as the *training dataset*, or simply the training set. The class of functions $\mathcal{H}$ is called the *hypothesis space*. The expected output $\mathbf{y}$ is often referred to as the *label* or the *target*.

Supervised learning problems arise in many settings but can often be reduced either to a classification problem, when the label $\mathbf{y}$ is a natural number $\mathbf{y}\in\mathbb{N}$, or to a regression problem, when $\mathbf{y}$ is a real-valued vector $\mathbf{y}\in\mathbb{R}^{D}$. We now define these two problems formally as optimization problems.

The classification problem consists in minimizing the number of misclassified training examples:

$$\hat{f}=\argmin_{f\in\mathcal{H}}\sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}}\ind\{\mathbf{y}\neq f(\mathbf{x})\}$$

where $\ind$ is the indicator function equal to $1$ when the argument evaluates to true and $0$ otherwise.
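
As a toy illustration of this minimization (the integer inputs and the finite space of threshold classifiers are assumptions made for the example, not part of the text), the $\argmin$ can be found exhaustively when $\mathcal{H}$ is finite:

```python
# Minimal sketch: pick the hypothesis with the fewest misclassifications
# on a toy dataset, i.e. minimize the summed 0-1 loss over D.

def zero_one_loss(f, dataset):
    """Number of examples where f(x) != y."""
    return sum(1 for x, y in dataset if f(x) != y)

# Toy dataset: the label is 1 when x >= 3, else 0.
D = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1)]

# A small finite hypothesis space of threshold classifiers.
H = [lambda x, t=t: int(x >= t) for t in range(6)]

f_hat = min(H, key=lambda f: zero_one_loss(f, D))
print(zero_one_loss(f_hat, D))  # prints 0: the threshold t=3 makes no errors
```

With an infinite hypothesis space this exhaustive search is of course impossible, which is why practical classifiers minimize smooth surrogates of the 0-1 loss.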

The regression problem consists in minimizing the squared error:

$$\hat{f}=\argmin_{f\in\mathcal{H}}\sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}}\left\Vert \mathbf{y}-f(\mathbf{x})\right\Vert ^{2}$$
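
For instance, when $\mathcal{H}$ is restricted to affine functions $f(\mathbf{x})=\mathbf{w}^{\top}\mathbf{x}+b$, this least-squares problem has a closed-form solution. A sketch with NumPy, on synthetic data chosen only for illustration:

```python
import numpy as np

# Restrict H to affine functions f(x) = w @ x + b and solve the
# least-squares problem in closed form with np.linalg.lstsq.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # 50 inputs x_i in R^3
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ w_true + b_true                         # noiseless targets y_i

# Append a column of ones so the bias b is learned as an extra weight.
X1 = np.hstack([X, np.ones((50, 1))])
w_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(np.allclose(w_hat[:3], w_true))  # True: exact recovery (no noise)
```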

In unsupervised learning we consider a dataset $\mathcal{D}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{N}\}$ with no labels. The goal can be to better understand the structure of the dataset $\mathcal{D}$, or to learn new representations that improve performance in a supervised setting.

Clustering, dimensionality reduction, and density estimation are common unsupervised learning problems and are described below. Note that we do not give a complete and formal definition of these problems, but rather try to give examples of the corresponding optimization problems.

**Clustering**

The objective of a clustering algorithm is to group the observations such that points inside the same cluster are similar to each other. Formally, a clustering algorithm returns a partition of the observations into disjoint sets $\mathcal{C}_{1},\dots,\mathcal{C}_{K}$ called clusters. The model is often given by a set of points $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$ called exemplars or centroids, such that each $\mathbf{c}_{i}$ is representative of the elements of the cluster $\mathcal{C}_{i}$. The objective is then, e.g., to minimize the total distance from the points in each cluster to their representative centroid, i.e.:

$$\{\hat{\mathbf{c}}_{1},\hat{\mathbf{c}}_{2},\dots,\hat{\mathbf{c}}_{K}\}=\argmin_{\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}}\sum_{i=1}^{K}\sum_{\mathbf{x}\in\mathcal{C}_{i}}\left\Vert \mathbf{x}-\mathbf{c}_{i}\right\Vert .$$
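
As an illustrative sketch (the `kmeans` helper and the two-blob data are assumptions of the example, not from the text), Lloyd's algorithm heuristically decreases the squared-distance variant of this objective by alternating two steps: assign each point to its nearest centroid, then move each centroid to the mean of its cluster.

```python
import numpy as np

def kmeans(X, init_centroids, n_iter=20):
    """A minimal Lloyd's algorithm; a local heuristic, not a global minimizer."""
    centroids = init_centroids.astype(float).copy()
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

# Two well-separated blobs around (0, 0) and (10, 10); initialize with
# one point from each blob so convergence is unambiguous.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(10, 1, (30, 2))])
centroids, labels = kmeans(X, init_centroids=X[[0, -1]])
```

In practice the result depends on the initialization, which is why k-means is usually restarted several times or seeded carefully.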

**Dimensionality reduction**

The purpose of dimensionality reduction is to find a representation of lower dimensionality for the dataset $\mathcal{D}$. This can be done in several ways, for instance by assuming that the training samples are all on a sub-manifold of the input space or by trying to find a variable $\mathbf{y}$ with $\dim\mathbf{y}<\dim\mathbf{x}$ such that $\mathbf{x}$ can be reconstructed from $\mathbf{y}$, thus trying to solve the following optimization problem:
$$\{\hat{\mathbf{y}}_{1},\hat{\mathbf{y}}_{2},\dots,\hat{\mathbf{y}}_{N}\},\hat{f}=\argmin_{\{\mathbf{y}_{1},\mathbf{y}_{2},\dots,\mathbf{y}_{N}\},f\in\mathcal{H}}\sum_{i=1}^{N}\left\Vert \mathbf{x}_{i}-f(\mathbf{y}_{i})\right\Vert $$
Note that the optimization is then over both the function $f$ and the values $\mathbf{y}_{1},\mathbf{y}_{2},\dots,\mathbf{y}_{N}$, which are not provided.

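
As an example, when $\mathcal{H}$ is restricted to linear maps and the reconstruction error is measured in squared norm, this problem is solved by principal component analysis (PCA). A sketch computing it from the SVD of the centered data (the synthetic dataset, which lies exactly in a 2-D subspace, is an assumption of the example):

```python
import numpy as np

# Data generated from hidden 2-D coordinates through a linear map,
# so it lies in a 2-D subspace of R^5.
rng = np.random.default_rng(0)
Y_latent = rng.normal(size=(200, 2))
A_true = rng.normal(size=(5, 2))
X = Y_latent @ A_true.T

# PCA via SVD of the centered data: the top-2 right singular vectors
# give the optimal linear reconstruction (f(y) = V @ y + mean).
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
V = Vt[:2].T                  # top-2 principal directions
Y = (X - mean) @ V            # low-dimensional codes y_i
X_rec = Y @ V.T + mean        # reconstructions f(y_i)

print(np.allclose(X, X_rec))  # True: the data really is 2-dimensional
```
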
**Density estimation**

Given the training dataset $\mathcal{D}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{N}\}$, the goal of density estimation is to find a probability density that could have generated the dataset, often under the assumption that the training samples are i.i.d. This is usually done by maximizing the log-likelihood of $\mathcal{D}$ under a parametrized family of distributions $p_{\theta}(\mathbf{x})$, i.e., choosing $p_{\hat{\theta}}(\mathbf{x})$ such that

$$\hat{\theta}=\argmax_{\theta}\sum_{\mathbf{x}\in\mathcal{D}}\log p_{\theta}(\mathbf{x}).$$
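
For instance, with a univariate Gaussian family $p_{\theta}$, $\theta=(\mu,\sigma)$, the maximizer has a closed form: the sample mean and the (biased) sample standard deviation. A numerical sketch, on synthetic data chosen for illustration:

```python
import numpy as np

# Fit a univariate Gaussian by maximum likelihood.
rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=0.5, size=10_000)

mu_hat = D.mean()      # closed-form argmax over mu
sigma_hat = D.std()    # closed-form argmax over sigma (MLE uses 1/N)

def log_likelihood(mu, sigma):
    """Sum of log p_theta(x) over the dataset D."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (D - mu)**2 / (2 * sigma**2))

# The MLE attains at least the likelihood of any other parameter value.
print(log_likelihood(mu_hat, sigma_hat) >= log_likelihood(2.1, 0.6))  # True
```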

Density estimation will be reviewed in more detail in a later section.