Unsupervised Example: Clustering and K-means

Course completion

The $K$-means algorithm is a clustering algorithm in which centroids $\mathbf{c}_{i}$ are simply the arithmetic mean of points in the cluster $\mathcal{C}_{i}$. Accordingly, the loss to minimize is the average distance of a point $\mathbf{x}$ to the nearest centroid:
$$L(\mathcal{D})=\sum_{i=1}^{K}\sum_{\mathbf{x}\in\mathcal{C}_{i}}\left\Vert \mathbf{x}-\mathbf{c}_{i}\right\Vert .$$

The $K$-means algorithm only requires a number of clusters $K$ and random initial centroids as input. It then alternates between two steps:

  1. Place in cluster $\mathcal{C}_{i}$, the points nearest to the centroid $\mathbf{c}_{i}$.
  2. Set each centroid $\mathbf{c}_{i}$ to the arithmetic mean of the points in cluster $\mathcal{C}_{i}$.

Below we give the complete algorithm and show an example of convergence on a toy dataset.

Algorithm (K-means clustering)
Inputs: $K$, the number of clusters.
$\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$, initial set of centroids at step $0$.
Outputs: $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$, final set of centroids.
Variables: $C=\{\mathcal{C}_{1},\mathcal{C}_{2},\dots,\mathcal{C}_{K}\}$, the set of sets $\mathcal{C}_{i}$ containing the points closest to $\mathbf{c}_{i}$.
until satisfied:
    for $i$ from $1$ to $K$:
        $\mathcal{C}_{i}:=\{\mathbf{x}\in\mathcal{D}|\mathbf{c_{i}}$ is the centroid closest to $ \mathbf{x}\}$
    for $i$ from $1$ to $K$:
return $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$.

Figure 4: Convergence of the $K$-means algorithm on a toy $2$-dimensional clustering dataset with $K=3$. The dataset set is given in (a). The centroids (denoted by $\circ$) and the points closest to them (denoted by $\times$) are represented with the same color in steps 1 to 3.

Although we show an example of convergence, the $K$-means algorithm does not always converge to an appropriate solution as it can converge to a local minimum.

The $K$-means algorithm can be suited to extract features for classification, however probabilistic models can be much more powerful to capture complex structure in data.

Next: Supervised Example – Polynomial regression