Estimation of Distribution Algorithms (EDAs)


EDAs are mathematically principled evolutionary algorithms which do away with the biological analogy. As the name suggests, EDAs are based on an unsupervised learning topic, density estimation, which is reviewed thoroughly in the machine learning section. EDAs achieve state-of-the-art performance in the black-box optimization setting, where the goal is to optimize a function $f$ without any knowledge of how $f$ computes its values.

EDAs propose to represent individuals as samples from a probability distribution: the so-called proposal distribution. A population is then a set of i.i.d. samples from this distribution. In the preceding section, we saw that the mutation and cross-over operators served to define small possible movements around a current population. The EDA approach has the advantage of transforming the problem of moving towards better populations in the input space (which may not be well behaved) into a proxy problem which is usually well behaved: moving towards better proposal distributions.

Formally, the algorithm generates a new population by taking $\mu$ samples from a proposal distribution $P_{\theta^{t}}(\mathbf{x})$. These $\mu$ individuals are then ranked according to their $f$-values, and the $\sigma$ best individuals are used to update the parameters of the proposal distribution with a log-likelihood gradient ascent step, i.e.
$$\theta^{t+1}=\theta^{t}+\delta\hspace{-1pt}t\,\nabla_{\theta}\sum_{i=1}^{\sigma}\log P_{\theta^{t}}(\mathbf{x}^{(i)})$$
where $\delta\hspace{-1pt}t$ is the step size of the gradient ascent update and $\mathbf{x}^{(1)},\dots,\mathbf{x}^{(\sigma)}$ are the selected individuals. The algorithm can then run for a number of steps, until a satisfactory solution is found. The figure below gives an example of an EDA update for a Gaussian proposal distribution.
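The sample-select-update loop can be sketched for a Gaussian proposal distribution. The example below is a minimal illustration, not a definitive implementation: it minimizes (rather than maximizes) an assumed test function, and it replaces the explicit gradient step with the equivalent closed-form maximum-likelihood refit of the Gaussian to the selected samples. All names and parameter values are illustrative.

```python
import numpy as np

def gaussian_eda(f, dim, mu=100, sigma=20, steps=200, seed=0):
    """Minimal EDA sketch with an axis-aligned Gaussian proposal.

    mu    -- number of samples per generation
    sigma -- number of best samples kept for the update
    """
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.full(dim, 2.0)
    for _ in range(steps):
        pop = rng.normal(mean, std, size=(mu, dim))          # sample population
        best = pop[np.argsort([f(x) for x in pop])[:sigma]]  # keep sigma best
        # Refitting mean/std to the selected samples is the maximum-likelihood
        # fit, i.e. a full-length log-likelihood ascent step in closed form.
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-12
    return mean

# Hypothetical test function: shifted sphere, minimized at x = (3, 3).
sphere = lambda x: float(np.sum((x - 3.0) ** 2))
x_hat = gaussian_eda(sphere, dim=2)
```

The proposal distribution quickly concentrates around the optimum: the mean tracks the selected samples while the standard deviation shrinks as selection narrows the population.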

Although the purpose of the algorithm is to maximize $\mathbb{E}\left[f(\mathbf{x})\right]$, it actually maximizes a surrogate objective $\mathbb{E}\left[w(\mathbf{x})\right]$, where the weight function $w(\mathbf{x})$ is equal to $1$ for the $\sigma$ best individuals, and $0$ otherwise. This has the advantage of making the approach invariant w.r.t. monotone transformations of the objective function $f$.
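The invariance claim can be made concrete with a small sketch. The helper below is hypothetical (not from the text): it assigns weight $1$ to the $\sigma$ largest $f$-values and $0$ elsewhere, and the weights are unchanged under a monotone transformation such as $\exp$.

```python
import numpy as np

# Hypothetical helper: truncation weights w(x) = 1 for the sigma best
# (here: largest-f) individuals of a population, 0 for all others.
def truncation_weights(f_values, sigma):
    w = np.zeros(len(f_values))
    w[np.argsort(f_values)[-sigma:]] = 1.0
    return w

f_vals = np.array([3.0, 0.5, 2.0, 1.0])
w1 = truncation_weights(f_vals, sigma=2)          # -> [1, 0, 1, 0]
w2 = truncation_weights(np.exp(f_vals), sigma=2)  # monotone transform of f
# w1 == w2: the weights depend only on the ranking, not on the f-values.
```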

The maximization of the surrogate objective $\mathbb{E}\left[w(\mathbf{x})\right]$ is done with gradient ascent:
$$\nabla_{\theta}\,\mathbb{E}\left[w(\mathbf{x})\right] = \nabla_{\theta}\int_{\dom f}w(\mathbf{x})\,P_{\theta^{t}}(\mathbf{x})\,d\mathbf{x}$$
$$ = \int_{\dom f}w(\mathbf{x})\,\nabla_{\theta}P_{\theta^{t}}(\mathbf{x})\,d\mathbf{x}$$
$$ = \int_{\dom f}\nabla_{\theta}\log P_{\theta^{t}}(\mathbf{x})\,w(\mathbf{x})\,P_{\theta^{t}}(\mathbf{x})\,d\mathbf{x}$$
where taking $\sigma$ samples from $w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})$ (up to normalization) can be done by taking samples from $P_{\theta^{t}}(\mathbf{x})$ and keeping only the $\sigma$ best according to $f$.
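The identity above, often called the log-derivative trick, can be checked numerically. The Monte Carlo estimate below uses illustrative choices not taken from the text: a one-dimensional Gaussian proposal $P_{\theta}=\mathcal{N}(\theta,1)$ and the weight $w(x)=\mathbb{1}\{x>0\}$, for which the exact gradient is known in closed form.

```python
import numpy as np

# Numerical check of  d/dtheta E[w(x)] = E[w(x) * d/dtheta log P_theta(x)]
# for the illustrative choice P_theta = N(theta, 1) and w(x) = 1{x > 0}.
rng = np.random.default_rng(0)
theta, n = 0.5, 1_000_000
x = rng.normal(theta, 1.0, n)
w = (x > 0).astype(float)

# Score of a unit-variance Gaussian: d/dtheta log P_theta(x) = x - theta.
mc_grad = float(np.mean(w * (x - theta)))

# Ground truth: E[w(x)] = Phi(theta), whose derivative is the density phi(theta).
true_grad = float(np.exp(-theta**2 / 2) / np.sqrt(2 * np.pi))
```

With a million samples the two quantities agree to within a few decimal places, confirming that the score-weighted average estimates the gradient of the surrogate objective.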

This general framework can be adapted to optimize functions in $\mathbb{R}^{D}$, e.g. with CMA-ES [Hansen2008], or over discrete spaces such as $\{0,1\}^{D}$, e.g. with PBIL [Baluja1994]. It has the advantage of allowing a move towards several candidate solutions at once, unlike methods such as hill climbing. The PBIL algorithm for optimization over $\{0,1\}^{D}$ is given below.

Figure 5: The steps of an EDA given a fitness function (a) and a Gaussian proposal distribution at time $t$ (b). First, samples are drawn from the proposal distribution (c). The $\sigma$ best samples are then selected according to their $f$-values (d). Finally, the likelihood of the selected samples w.r.t. the parameters (e) can be increased with a step of log-likelihood gradient ascent leading to a new proposal distribution at time $t+1$ (f).

Algorithm (PBIL)
Inputs: $f$, a function with $\dom f=\{0,1\}^{D}$.
$N$, number of proposal samples.
$N_{s}$, number of samples to select at each step.
$\delta\hspace{-1pt}t$, the step-size.
$m$, the probability of a mutation.
$\alpha$, the mutation shift.
Outputs: $\hat{\mathbf{x}}$, approximation of a global minimum.
Variables: $(p_{t,1},p_{t,2},\dots,p_{t,D})$, the parameters of an independent Bernoulli proposal distribution at time $t$: $P_{\theta^{t}}(\mathbf{x})=p_{t,1}^{x_{1}}(1-p_{t,1})^{(1-x_{1})}\times\dots\times p_{t,D}^{x_{D}}(1-p_{t,D})^{(1-x_{D})}$ for $p_{t,i}$ the probability that $x_{i}=1$ at time $t$.
$\mathbf{x}^{(1)},\dots,\mathbf{x}^{(N)}$, the proposal samples at time $t$, not to be confused with $x_{1},\dots,x_{D}$ the components of a vector $\mathbf{x}$.
initialize $p_{0,j}:=0.5$ for $j$ from $1$ to $D$
until satisfied:
    $\mathbf{x}^{(1)}\sim P_{\theta^{t}}(\mathbf{x}),\dots,\mathbf{x}^{(N)}\sim P_{\theta^{t}}(\mathbf{x})$
    rank samples ensuring $f(\mathbf{x}^{(1)})\leq\dots\leq f(\mathbf{x}^{(N)})$
    // update probability vector towards the $N_{s}$ best samples
    $p_{t+1,j}:=p_{t,j}$ for $j$ from $1$ to $D$
    for $i$ from $1$ to $N_{s}$:
        for $j$ from $1$ to $D$:
            $p_{t+1,j}:=p_{t+1,j}\times(1-\delta\hspace{-1pt}t)+x^{(i)}_{j}\times\delta\hspace{-1pt}t$
    // mutate probability vector
    for $j$ from $1$ to $D$:
        if $\mathcal{U}([0;1]) < m$:
        then $p_{t+1,j}:=p_{t+1,j}\times(1-\alpha)+\mathcal{U}(\{0,1\})\times\alpha$
return best solution $\hat{\mathbf{x}}:=\mathbf{x}^{(1)}$.
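A runnable sketch of PBIL under the conventions above can look as follows. Parameter names mirror the pseudocode; the default values, the test function, and the best-ever-solution tracking are illustrative assumptions rather than part of the original algorithm statement.

```python
import numpy as np

def pbil(f, D, N=100, N_s=10, dt=0.1, m=0.02, alpha=0.05, steps=200, seed=0):
    """Sketch of PBIL minimizing f over {0,1}^D (illustrative defaults)."""
    rng = np.random.default_rng(seed)
    p = np.full(D, 0.5)                      # independent Bernoulli parameters
    best_x, best_f = None, float("inf")
    for _ in range(steps):
        pop = (rng.random((N, D)) < p).astype(int)   # sample N proposals
        order = np.argsort([f(x) for x in pop])      # rank by f-value
        if f(pop[order[0]]) < best_f:                # track best-ever solution
            best_x, best_f = pop[order[0]].copy(), f(pop[order[0]])
        for i in order[:N_s]:                # pull p towards the N_s best
            p = p * (1 - dt) + pop[i] * dt
        mutate = rng.random(D) < m           # mutate the probability vector
        p[mutate] = p[mutate] * (1 - alpha) + rng.integers(0, 2, int(mutate.sum())) * alpha
    return best_x

# Example objective: number of zero bits, minimized at the all-ones string.
f = lambda x: int((1 - x).sum())
x_hat = pbil(f, D=20)
```

On this separable objective the probability vector drifts component-wise towards $1$, so the all-ones string is sampled well before the step budget is exhausted.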
