Let us now consider a simple classification problem for the dataset in Figure 2. The dataset has two classes and the input $\mathbf{x}$ is in two dimensions. We use $\times$ to represent points belonging to the first class and $\circ$ to represent points in the second class.

**Figure 2:** A classification dataset with two classes. The graph shows two possible separating hyperplanes.

The objective in a classification problem is to find a *separating surface* which places points from one class on one side of this surface and points from the other class on the other side. Note that the above dataset is *linearly separable*, meaning that it is possible to find a hyperplane (in this case a line) which separates the target classes. Not all datasets are linearly separable, and Figure 3 below gives examples of separating surfaces for such datasets.

**Figure 3:** Three linearly non-separable classification datasets and a possible separating surface.

Because our dataset is linearly separable, we can resort to a linear model to perform classification (when confronted with a new dataset, it is often a good idea to check linear separability with a linear model before trying more complex models), e.g.:

$$f(\mathbf{x})=\mathbf{w}^{T}\mathbf{x}+a$$

where the parameters of the model are the weight vector $\mathbf{w}$ and the bias $a$. Note that even though this is a binary classification problem, $f(\mathbf{x})$ takes values in $\mathbb{R}$ rather than in $\{-1,+1\}$. In practice, with the two classes encoded as $y=-1$ and $y=+1$, the classification decision is made using $\operatorname{sign}(f(\mathbf{x}))$ instead of $f(\mathbf{x})$ itself; the real value can then be seen as a measure of confidence in the result. With the above model, it is common to optimize a proxy for the classification problem, i.e., to minimize the mean squared error (MSE), which is continuously differentiable, instead of the misclassification rate:

$$\hat{a},\hat{\mathbf{w}}=\operatorname*{arg\,min}_{a,\mathbf{w}}\sum_{(\mathbf{x},y)\in\mathcal{D}}\left[y-(\mathbf{w}^{T}\mathbf{x}+a)\right]^{2}$$

The problem can then be solved with gradient descent (see the Optimization course).
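
Differentiating the objective with respect to $\mathbf{w}$ and $a$ gives the gradients used by each descent step:

$$\nabla_{\mathbf{w}}=-2\sum_{(\mathbf{x},y)\in\mathcal{D}}\left[y-(\mathbf{w}^{T}\mathbf{x}+a)\right]\mathbf{x},\qquad\nabla_{a}=-2\sum_{(\mathbf{x},y)\in\mathcal{D}}\left[y-(\mathbf{w}^{T}\mathbf{x}+a)\right]$$

As a minimal sketch of this procedure (the function names, learning rate, and step count below are illustrative assumptions, not prescribed by the text), the objective can be minimized with plain gradient descent in NumPy:

```python
import numpy as np

def fit_linear_classifier(X, y, lr=0.01, n_steps=2000):
    """Minimize sum_i [y_i - (w^T x_i + a)]^2 by gradient descent.

    X is an (n, d) array of inputs; y is an (n,) array of labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    a = 0.0
    for _ in range(n_steps):
        residual = y - (X @ w + a)        # y_i - f(x_i) for every point
        grad_w = -2.0 * (X.T @ residual)  # gradient of the loss w.r.t. w
        grad_a = -2.0 * residual.sum()    # gradient of the loss w.r.t. a
        w -= lr * grad_w / n              # dividing by n keeps the step size
        a -= lr * grad_a / n              # independent of the dataset size
    return w, a

def predict(w, a, X):
    """Classify with sign(f(x)); the magnitude of f(x) measures confidence."""
    return np.sign(X @ w + a)
```

On a toy linearly separable dataset (two Gaussian blobs, again purely illustrative), the fitted model recovers a separating hyperplane:

```python
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
w_hat, a_hat = fit_linear_classifier(X, y)
print((predict(w_hat, a_hat, X) == y).mean())  # fraction correctly classified
```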
