The $K$-means algorithm is a clustering algorithm in which each centroid $\mathbf{c}_{i}$ is simply the arithmetic mean of the points in its cluster $\mathcal{C}_{i}$. Accordingly, the loss to minimize is the sum of squared distances from each point $\mathbf{x}$ to its nearest centroid (the arithmetic mean is precisely the minimizer of this squared objective within each cluster):

$$L(\mathcal{D})=\sum_{i=1}^{K}\sum_{\mathbf{x}\in\mathcal{C}_{i}}\left\Vert \mathbf{x}-\mathbf{c}_{i}\right\Vert^{2}.$$
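For concreteness, this loss can be evaluated directly with NumPy. The sketch below uses squared Euclidean distances, the objective that the arithmetic-mean update minimizes; `kmeans_loss` is a name chosen for this example:

```python
import numpy as np

def kmeans_loss(X, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    # Pairwise squared distances: entry (n, k) is ||x_n - c_k||^2.
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes its squared distance to the closest centroid.
    return sq_dists.min(axis=1).sum()
```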

The $K$-means algorithm requires only the number of clusters $K$ and a set of random initial centroids as input. It then alternates between two steps:

- Assign to cluster $\mathcal{C}_{i}$ the points nearest to the centroid $\mathbf{c}_{i}$.
- Set each centroid $\mathbf{c}_{i}$ to the arithmetic mean of the points in cluster $\mathcal{C}_{i}$.

Below we give the complete algorithm and show an example of convergence on a toy dataset.

Inputs: $K$, the number of clusters; $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$, the initial set of centroids at step $0$.

Outputs: $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$, the final set of centroids.

Variables: $C=\{\mathcal{C}_{1},\mathcal{C}_{2},\dots,\mathcal{C}_{K}\}$, the set of sets $\mathcal{C}_{i}$ containing the points closest to $\mathbf{c}_{i}$.

- until satisfied:
    - for $i$ from $1$ to $K$:
        - $\mathcal{C}_{i}:=\{\mathbf{x}\in\mathcal{D}\mid\mathbf{c}_{i}$ is the centroid closest to $\mathbf{x}\}$
    - for $i$ from $1$ to $K$:
        - $\mathbf{c}_{i}:=\frac{1}{\left|\mathcal{C}_{i}\right|}\sum_{\mathbf{x}\in\mathcal{C}_{i}}\mathbf{x}$
- return $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$
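The algorithm above can be sketched in NumPy as follows. The stopping criterion ("until satisfied") is chosen here, for illustration, to be that the centroids no longer move:

```python
import numpy as np

def kmeans(X, centroids, n_iters=100):
    """Lloyd's iteration: alternate assignment and mean-update steps.

    X: (N, d) array of points; centroids: (K, d) array of initial centroids.
    """
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iters):
        # Assignment step: squared distance from every point to every centroid,
        # then each point joins the cluster of its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the arithmetic mean of its
        # assigned points (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):  # "until satisfied"
            break
        centroids = new_centroids
    return centroids
```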

**Figure 4:** Convergence of the $K$-means algorithm on a toy $2$-dimensional clustering dataset with $K=3$. The dataset is given in (a). The centroids (denoted by $\circ$) and the points closest to them (denoted by $\times$) are represented with the same color in steps 1 to 3.

Although we show an example of convergence, the $K$-means algorithm is not guaranteed to find an appropriate solution: it can converge to a poor local minimum of the loss, and the result depends on the initial centroids. In practice, the algorithm is therefore often run several times with different random initializations, keeping the solution with the lowest loss.
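One common mitigation for poor local minima is to restart the algorithm from several random initializations and keep the run with the lowest loss. A self-contained NumPy sketch (`kmeans_with_restarts` is a name chosen for this example; each restart draws $K$ distinct data points as initial centroids):

```python
import numpy as np

def kmeans_with_restarts(X, K, n_restarts=10, n_iters=100, seed=0):
    """Run K-means from several random initializations and keep the best."""
    rng = np.random.default_rng(seed)
    best_loss, best_centroids = np.inf, None
    for _ in range(n_restarts):
        # Initialize centroids by picking K distinct data points at random.
        centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        for _ in range(n_iters):
            # Assignment step.
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Update step (empty clusters keep their previous centroid).
            new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centroids[k] for k in range(K)])
            if np.allclose(new, centroids):
                break
            centroids = new
        # Loss of this run: sum of squared distances to nearest centroids.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        loss = dists.min(axis=1).sum()
        if loss < best_loss:
            best_loss, best_centroids = loss, centroids
    return best_centroids, best_loss
```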

The $K$-means algorithm can be used to extract features for classification; however, probabilistic models can be much more powerful at capturing complex structure in data.