# Evolution of Distribution Algorithms (EDAs)

Course completion
88%
$\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\sigm}{sigm} \DeclareMathOperator{\softmax}{softmax} \DeclareMathOperator{\sign}{sign}$

EDAs are mathematically principled evolutionary algorithm which do away with the biological analogy. As the name suggests EDAs are based on an unsupervised learning topic: density estimation, which will be reviewed thoroughly in the machine learning section. EDAs achieve state of the art performance in the black-box optimization setting where the goal is to optimize a function $f$ without any knowledge of how $f$ computes its values.

EDAs propose to represent individuals as samples from a probability distribution: the so called proposal distribution. A population is then a set of iid samples from this distribution. In the preceding section, we saw that the mutation and cross-over operators served to define small possible movements around a current population. The EDA approach has the advantage of transforming the problem of moving towards better populations in the input space (which may not be well behaved) to a proxy problem which is usually well behaved: moving towards better proposal distributions.

Formally the algorithm generates a new population by taking $\mu$ samples from a proposal distribution $P_{\theta^{t}}(\mathbf{x})$. These $\mu$ individuals are then ranked according to their $f$-values and the $\sigma$ best individuals are used to update the proposal distribution with a log-likelihood gradient ascent step, i.e.
$$P_{\theta^{t+1}}(\mathbf{x})=P_{\theta^{t}}(\mathbf{x})+\delta\hspace{-1pt}t\nabla\log P_{\theta^{t}}(\mathbf{x})$$
where $\delta\hspace{-1pt}t$ is the step size of the gradient ascent update. The algorithm can then run for a number of steps, until a satisfactory solution is found. The figure below gives an example of EDA update for a Gaussian proposal distribution.

Although the purpose of the algorithm is to maximize $\mathbb{E}\left[f(\mathbf{x})\right]$, it consists in the maximization of a surrogate objective: $\mathbb{E}\left[w(\mathbf{x})\right]$, where the weight function $w(\mathbf{x})$ is equal to $1$ for the $\sigma$ best individuals, and $0$ otherwise. This has the advantage of making the approach invariant w.r.t. monotone transformations of the objective function $f$.

The maximization of the surrogate objective $\mathbb{E}\left[w(\mathbf{x})\right]$ is done with gradient ascent:
$$\nabla\mathbb{E}\left[w(\mathbf{x})\right] = \nabla\int_{\dom f}w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})$$
$$= \int_{\dom f}w(\mathbf{x})\nabla P_{\theta^{t}}(\mathbf{x})$$
$$= \int_{\dom f}\nabla\log P_{\theta^{t}}(\mathbf{x})w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})$$
where taking $\sigma$ samples from $w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})$ can be done by taking samples from $P_{\theta^{t}}(\mathbf{x})$ and keeping only the $\sigma$ best according to $f$.

This general framework can be adapted to optimize functions in $\mathbb{R}^{D}$ e.g. with CMA-ES [Hansen2008], or in discrete spaces such as $\{0,1\}^{D}$ with PBIL [Baluja1994]. It has the advantage of allowing a move towards several proposal solutions at once, contrary to methods such as hill climbing. The PBIL algorithm for optimization over $\{0,1\}^{D}$ is given below. Figure 5: The steps of an EDA given a fitness function (a) and a Gaussian proposal distribution at time $t$ (b). First, samples are drawn from the proposal distribution (c). The $\sigma$ best samples are then selected according to their $f$-values (d). Finally, the likelihood of the selected samples w.r.t. the parameters (e) can be increased with a step of log-likelihood gradient ascent leading to a new proposal distribution at time $t+1$ (f).

Algorithm (PBIL)
 Inputs: $f$, a function with $\dom f=\{0,1\}^{D}$. $N$, number of proposal samples. $N_{s}$, number of samples to select at each step. $\delta\hspace{-1pt}t$, the step-size. $m$, the probability of a mutation. $\alpha$, the mutation shift. Outputs: $\hat{\mathbf{x}}$, approximation of a global minimum. Variables: $(p_{t,1},p_{t,1},\dots,p_{t,D})$, the parameters of an independent Bernoulli proposal distribution at time $t$: $P_{\theta^{t}}(\mathbf{x})=p_{t,1}^{x_{1}}(1-p_{t,1})^{(1-x_{1})}\times\dots\times p_{t,D}^{x_{D}}(1-p_{t,D})^{(1-x_{D})}$ for $p_{t,i}$ the probability that $x_{i}=1$ at time $t$. $\mathbf{x}^{(1)},\dots,\mathbf{x}^{(N)}$, the proposal samples at time $t$, not to be confused with $x_{1},\dots,x_{D}$ the components of a vector $\mathbf{x}$.
begin
$p_{0,1},\dots,p_{0,D}:=\frac{1}{2},\dots,\frac{1}{2}$
until satisfied:
$\mathbf{x}^{(1)}\sim P_{\theta^{t}}(\mathbf{x}),\dots,\mathbf{x}^{(N)}\sim P_{\theta^{t}}(\mathbf{x})$
rank samples ensuring $\mathbf{x}^{(1)}\leq\dots\leq\mathbf{x}^{(N)}$
// update probability vector
for $i$ from $1$ to $N_{s}$:
for $j$ from $1$ to $D$:
$p_{t+1,i}:=p_{t,i}\times(1-\delta\hspace{-1pt}t)+x_{j}^{(i)}\times\delta\hspace{-1pt}t$
// mutate probability vector
for $j$ from $1$ to $D$:
if $\mathcal{U}([0;1]) < m$:
then $p_{t+1,i}:=p_{t+1,i}\times(1-\alpha)+\mathcal{U}(\{0,1\})\times\alpha$
return best solution $\hat{\mathbf{x}}:=\mathbf{x}^{(1)}$.
end

Next: Summary