A learning algorithm consists of an optimization algorithm applied to the parameters of a model in order to minimize a given loss. In practice, however, the learning algorithm itself often depends on parameters that must be set beforehand, such as the complexity of the model (the degree $K$ in the previous example) or the learning rate of a gradient descent procedure.

Such parameters, which lie outside the main optimization procedure, are called *hyper-parameters*. Accordingly, the problem of choosing suitable values for them is called *hyper-parameter selection* or *model selection*; it is itself an optimization problem, in which training the model appears as a sub-problem.

Although the hyper-parameters must be optimized with respect to performance on some dataset, we cannot choose the model complexity according to the training set, because doing so leads to poor generalization. In our example, choosing the best degree $K$ according to the training set would inevitably favor ever higher polynomial degrees, even though they do not yield a better fit on the test set.
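This effect is easy to reproduce. The sketch below fits polynomials of increasing degree to a small hypothetical dataset (noisy samples of a sine curve, standing in for "the previous example" in the text, which is an assumption of ours) and records the training error: it can only decrease as the degree grows, since every degree-$K$ polynomial is also a degree-$(K+1)$ polynomial with a zero leading coefficient, so selection on the training set always prefers the largest degree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 30 noisy samples of a sine curve.
x = rng.uniform(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 30)

def training_mse(degree):
    """Fit a polynomial of the given degree on (x, y); return its training MSE."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((pred - y) ** 2))

errors = [training_mse(k) for k in range(10)]

# Training error is (up to numerical noise) non-increasing in the degree,
# so the training set alone would always pick the highest degree tried.
assert errors[-1] < errors[0]
```

Had we measured the same errors on held-out data instead, they would eventually rise again as the degree grows, which is precisely the overfitting phenomenon described above.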

The test set, however, is meant for evaluating the performance of a model on unseen data, and therefore cannot be used during hyper-parameter selection either. If it were, the optimization process would pick values particularly suited to the test set, artificially inflating the measured test performance.

The usual solution is to hold out a third dataset, the *validation set*, on which the hyper-parameters are optimized. The test error can then safely be used to evaluate the final model.
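The full procedure can be sketched as follows, again on a hypothetical noisy-sine dataset (an assumption of ours): the inner problem fits the polynomial coefficients on the training set, the outer problem selects the degree $K$ on the validation set, and the test set is touched exactly once at the end.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: noisy sine samples, split into train / validation / test.
x = rng.uniform(0.0, 1.0, 90)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 90)
x_train, x_val, x_test = x[:30], x[30:60], x[60:]
y_train, y_val, y_test = y[:30], y[30:60], y[60:]

def mse(coeffs, xs, ys):
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Inner problem: fit the parameters on the training set.
# Outer problem: pick the degree K minimizing the validation error.
best_K, best_err, best_coeffs = None, np.inf, None
for K in range(10):
    coeffs = np.polyfit(x_train, y_train, K)
    err = mse(coeffs, x_val, y_val)
    if err < best_err:
        best_K, best_err, best_coeffs = K, err, coeffs

# Only after selection is the test set used, once, for an unbiased estimate.
test_error = mse(best_coeffs, x_test, y_test)
print(best_K, test_error)
```

In practice one would often refit the selected model on the union of the training and validation sets before the final test evaluation, or use cross-validation when data is scarce; the fixed three-way split shown here is the simplest variant.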