# Supervised Example: Linear classification

$\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\sigm}{sigm} \DeclareMathOperator{\softmax}{softmax} \DeclareMathOperator{\sign}{sign}$

Let us now consider a simple classification problem for the dataset in Figure 1. The dataset has two classes and the input $\mathbf{x}$ is in two dimensions. We use $\times$ to represent points belonging to the first class and $\circ$ to represent points in the second class.

*Figure 1: A classification dataset with two classes. The graph shows two possible separating hyperplanes.*

The objective in a classification problem is to find a separating surface which places points from one class on one side of this surface, and points from the other class on the other side. Note that the above dataset is linearly separable, meaning that it is possible to find a hyperplane (in this case a line) which separates the target classes. Not all datasets are linearly separable, and Figure 2 below gives examples of separating surfaces for such datasets.

*Figure 2: Three linearly non-separable classification datasets and a possible separating surface.*

Because our dataset is linearly separable, we can use a linear model to perform the classification (when confronted with a new dataset, it is often a good idea to check linear separability with a linear model before trying more complex models), e.g.:
$$f(\mathbf{x})=\mathbf{w}^{T}\mathbf{x}+a$$
where the parameters of the model are the bias $a \in \mathbb{R}$ and the weight vector $\mathbf{w}$. Note that even though we are solving a binary classification problem, $f(\mathbf{x})$ is in $\mathbb{R}$ and not in $\{0,1\}$. In practice, the classification decision is made using $\sign(f(\mathbf{x}))$ instead of $f(\mathbf{x})$ itself, for instance with the labels encoded as $y \in \{-1,+1\}$. The real value $f(\mathbf{x})$ can then be seen as a measure of confidence in the result. With the above model, it is common to optimize a proxy of the classification problem, i.e., to minimize the mean squared error (MSE), which is continuously differentiable, instead of the misclassification rate:
$$\hat{a},\hat{\mathbf{w}}=\argmin_{a,\mathbf{w}}\sum_{(\mathbf{x},y)\in\mathcal{D}}\left[y-(\mathbf{w}^{T}\mathbf{x}+a)\right]^{2}$$
The problem can then be solved with gradient descent (see the Optimization course).
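As a minimal sketch (not part of the course materials), the procedure above can be implemented in a few lines of NumPy: we generate a toy linearly separable dataset, minimize the MSE proxy by gradient descent, and classify with $\sign(f(\mathbf{x}))$. The dataset, labels encoded as $\pm 1$, the learning rate, and the iteration count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable dataset: two Gaussian blobs in 2D,
# with labels encoded as y in {-1, +1}.
X = np.vstack([rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
N = len(y)

w = np.zeros(2)   # weight vector
a = 0.0           # bias
lr = 0.05         # learning rate (illustrative choice)

for _ in range(500):
    f = X @ w + a                       # model outputs f(x) = w^T x + a
    residual = y - f                    # y - (w^T x + a)
    # Gradients of the mean squared error w.r.t. w and a.
    grad_w = -2.0 / N * (X.T @ residual)
    grad_a = -2.0 / N * residual.sum()
    w -= lr * grad_w
    a -= lr * grad_a

# Classification decision: sign(f(x)); |f(x)| acts as a confidence measure.
pred = np.sign(X @ w + a)
accuracy = (pred == y).mean()
print(accuracy)
```

On a clearly separable dataset like this one, the learned line separates the two blobs and the training accuracy reaches 1.0; on non-separable data, the same model would necessarily misclassify some points.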


Next: Unsupervised Example – Clustering and K-means