In the case of polynomial regression, an interesting possibility is to consider $y=f(x)$ as a conditional probability $p(y|x)$. This density is then given by the polynomial $\sum_{i=0}^{k}a_{i}x^{i}$ to which we add a normally distributed error term $\mathcal{N}(0,\sigma^{2})$:

$$p(y|x) = \sum_{i=0}^{k}a_{i}x^{i}+\mathcal{N}(0,\sigma^{2})$$

$$ = \mathcal{N}(\sum_{i=0}^{k}a_{i}x^{i},\sigma^{2}).$$

The model can be seen as the probability of measuring some value $y$ given $x$. For each $x$, the measurement $y$ is distributed according to a normal law of mean $\sum_{i=0}^{k}a_{i}x^{i}$ and of variance $\sigma^{2}$.

In this example, the quantity to maximize can be computed analytically by replacing $p(y|x)$ by the definition of a normal law.

$$\log p(\mathcal{D}) = \sum_{i=1}^{N}\log p(y_{i}|x_{i})$$

$$ = \sum_{i=1}^{N}\log\left[\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{(y_{i}-\sum_{i=0}^{k}a_{i}x^{i})^{2}}{2\sigma^{2}})\right]$$

$$ = N\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right)-\sum_{i=1}^{N}\frac{(y_{i}-\sum_{i=0}^{k}a_{i}x^{i})^{2}}{2\sigma^{2}}.$$

Adding and multiplying by the appropriate constant terms (here we consider $\sigma^{2}$ to be a constant, the problem being on $\mu$) which do not change the maximization problem, and taking into account the minus sign which transforms the maximization problem into a minimization problem, the quantity to minimize is

$$\sum_{i=1}^{N}\left[y_{i}-\sum_{i=0}^{k}a_{i}x^{i}\right]^{2} =\sum_{i=1}^{N}\left[y_{i}-\hat{f}(x_{i})\right]{}^{2}$$

which is exactly the MSE seen in the Optimization to ML course. Therefore, minimizing the mean squared error when fitting polynomials consists in learning the above probabilistic model.