EDAs are mathematically principled evolutionary algorithms that do away with the biological analogy. As the name suggests, EDAs are based on an unsupervised learning topic, *density estimation*, which is reviewed thoroughly in the machine learning section. EDAs achieve state-of-the-art performance in the black-box optimization setting, where the goal is to optimize a function $f$ without any knowledge of how $f$ computes its values.

EDAs represent individuals as samples from a probability distribution: the so-called proposal distribution. A population is then a set of i.i.d. samples from this distribution. In the preceding section, we saw that the mutation and cross-over operators served to define small possible movements around a current population. The EDA approach has the advantage of transforming the problem of moving towards better populations in the input space (which may not be well behaved) into a proxy problem that usually is well behaved: moving towards better proposal distributions.

Formally, the algorithm generates a new population by drawing $\mu$ samples from a proposal distribution $P_{\theta^{t}}(\mathbf{x})$. These $\mu$ individuals are then ranked according to their $f$-values, and the $\sigma$ best individuals $\mathbf{x}^{(1)},\dots,\mathbf{x}^{(\sigma)}$ are used to update the parameters of the proposal distribution with a log-likelihood gradient ascent step, i.e.

$$\theta^{t+1}=\theta^{t}+\delta\hspace{-1pt}t\,\nabla_{\theta}\sum_{i=1}^{\sigma}\log P_{\theta^{t}}(\mathbf{x}^{(i)})$$

where $\delta\hspace{-1pt}t$ is the step size of the gradient ascent update. The algorithm then runs for a number of steps, until a satisfactory solution is found. The figure below gives an example of an EDA update for a Gaussian proposal distribution.
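For concreteness, one iteration of this scheme can be sketched in Python for an isotropic Gaussian proposal whose mean is the learned parameter. This is a minimal sketch: the name `eda_step`, the isotropic choice, and all default values are illustrative, not part of the algorithm's specification.

```python
import numpy as np

def eda_step(f, mean, std, mu=100, sigma=25, dt=0.1,
             rng=np.random.default_rng(0)):
    """One EDA iteration with an isotropic Gaussian proposal (sketch).

    f: objective to minimize; (mean, std): proposal parameters theta^t;
    mu: population size; sigma: number of selected individuals;
    dt: step size of the gradient ascent update.
    """
    # Draw a population of mu i.i.d. samples from the proposal.
    pop = rng.normal(mean, std, size=(mu, mean.shape[0]))
    # Rank by f-value and keep the sigma best (lowest f, for minimization).
    best = pop[np.argsort([f(x) for x in pop])[:sigma]]
    # For a Gaussian N(mean, std^2), grad_mean log P(x) = (x - mean) / std^2.
    # Average it over the selected individuals and take one ascent step.
    grad_mean = (best - mean).mean(axis=0) / std**2
    return mean + dt * grad_mean
```

Iterating `eda_step` on a fitness function such as $f(\mathbf{x})=\|\mathbf{x}\|^{2}$ drives the proposal mean towards the minimizer.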

Although the purpose of the algorithm is to maximize $\mathbb{E}\left[f(\mathbf{x})\right]$, it actually maximizes a surrogate objective, $\mathbb{E}\left[w(\mathbf{x})\right]$, where the weight function $w(\mathbf{x})$ is equal to $1$ for the $\sigma$ best individuals and $0$ otherwise. This has the advantage of making the approach invariant w.r.t. monotone transformations of the objective function $f$, since $w$ depends only on the ranking of the $f$-values.
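This invariance is easy to check numerically: composing $f$ with any strictly increasing transform leaves the set of selected individuals unchanged. A minimal sketch, where the population, the transform $g=\exp$, and $\sigma=25$ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
pop = rng.normal(size=(100, 2))               # a population of 100 individuals
f_vals = np.sum(pop**2, axis=1)               # f(x) = ||x||^2, to be minimized
sigma = 25

# w(x) depends only on the *ranking* of the f-values, so composing f with
# any strictly increasing g (here g = exp) selects the same individuals.
sel_f = set(np.argsort(f_vals)[:sigma])           # sigma best under f
sel_g = set(np.argsort(np.exp(f_vals))[:sigma])   # sigma best under g(f)
assert sel_f == sel_g
```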

The maximization of the surrogate objective $\mathbb{E}\left[w(\mathbf{x})\right]$ is done with gradient ascent:

$$\nabla\mathbb{E}\left[w(\mathbf{x})\right] = \nabla\int_{\dom f}w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})\,\mathrm{d}\mathbf{x}$$

$$ = \int_{\dom f}w(\mathbf{x})\nabla P_{\theta^{t}}(\mathbf{x})\,\mathrm{d}\mathbf{x}$$

$$ = \int_{\dom f}\nabla\log P_{\theta^{t}}(\mathbf{x})\,w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})\,\mathrm{d}\mathbf{x}$$

where the last equality uses the identity $\nabla P_{\theta^{t}}=P_{\theta^{t}}\nabla\log P_{\theta^{t}}$. The resulting integral can be estimated by Monte Carlo: drawing samples from $P_{\theta^{t}}(\mathbf{x})$ and keeping only the $\sigma$ best according to $f$ amounts to taking $\sigma$ samples from $w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})$ (up to a normalizing constant).
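A minimal Monte Carlo sketch of this estimator, assuming a one-dimensional Gaussian proposal $\mathcal{N}(\theta,1)$ (so that $\nabla_{\theta}\log P_{\theta}(x)=x-\theta$) and illustrative values of $\mu$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, mu, sigma = 2.0, 1000, 250             # illustrative values

# mu samples from the proposal P_theta = N(theta, 1).
samples = rng.normal(theta, 1.0, size=mu)
# Select the sigma best according to f(x) = x^2: exactly the samples
# where w(x) = 1, i.e. samples from w(x) P_theta(x) up to normalization.
selected = samples[np.argsort(samples**2)[:sigma]]
# Monte Carlo estimate of E[w(x) grad log P(x)], grad log P(x) = x - theta.
grad_est = np.sum(selected - theta) / mu
```

Since the selected samples lie closer to $0$ than $\theta=2$, the estimated gradient is negative, so a gradient ascent step moves $\theta$ towards the minimizer of $f(x)=x^{2}$.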

This general framework can be adapted to optimize functions over $\mathbb{R}^{D}$, e.g. with CMA-ES [Hansen2008], or over discrete spaces such as $\{0,1\}^{D}$, e.g. with PBIL [Baluja1994]. It has the advantage of moving towards several candidate solutions at once, contrary to methods such as hill climbing, which track a single solution. The PBIL algorithm for optimization over $\{0,1\}^{D}$ is given below.

**Figure 5:** The steps of an EDA given a fitness function (a) and a Gaussian proposal distribution at time $t$ (b). First, samples are drawn from the proposal distribution (c). The $\sigma$ best samples are then selected according to their $f$-values (d). Finally, the likelihood of the selected samples w.r.t. the parameters (e) can be increased with a step of log-likelihood gradient ascent, leading to a new proposal distribution at time $t+1$ (f).

**Inputs:**

- $f$, a function with $\dom f=\{0,1\}^{D}$
- $N$, the number of proposal samples
- $N_{s}$, the number of samples to select at each step
- $\delta\hspace{-1pt}t$, the step size
- $m$, the probability of a mutation
- $\alpha$, the mutation shift

**Outputs:**

- $\hat{\mathbf{x}}$, an approximation of a global minimum

**Variables:**

- $(p_{t,1},p_{t,2},\dots,p_{t,D})$, the parameters of an independent Bernoulli proposal distribution at time $t$: $P_{\theta^{t}}(\mathbf{x})=p_{t,1}^{x_{1}}(1-p_{t,1})^{(1-x_{1})}\times\dots\times p_{t,D}^{x_{D}}(1-p_{t,D})^{(1-x_{D})}$, where $p_{t,i}$ is the probability that $x_{i}=1$ at time $t$
- $\mathbf{x}^{(1)},\dots,\mathbf{x}^{(N)}$, the proposal samples at time $t$ (not to be confused with $x_{1},\dots,x_{D}$, the components of a vector $\mathbf{x}$)

$p_{0,1},\dots,p_{0,D}:=\frac{1}{2},\dots,\frac{1}{2}$

until satisfied:

$\mathbf{x}^{(1)}\sim P_{\theta^{t}}(\mathbf{x}),\dots,\mathbf{x}^{(N)}\sim P_{\theta^{t}}(\mathbf{x})$

rank samples so that $f(\mathbf{x}^{(1)})\leq\dots\leq f(\mathbf{x}^{(N)})$

// update probability vector

$p_{t+1,1},\dots,p_{t+1,D}:=p_{t,1},\dots,p_{t,D}$

for $i$ from $1$ to $N_{s}$:

for $j$ from $1$ to $D$:

$p_{t+1,j}:=p_{t+1,j}\times(1-\delta\hspace{-1pt}t)+x_{j}^{(i)}\times\delta\hspace{-1pt}t$

// mutate probability vector

for $j$ from $1$ to $D$:

if $\mathcal{U}([0;1]) < m$:

then $p_{t+1,j}:=p_{t+1,j}\times(1-\alpha)+\mathcal{U}(\{0,1\})\times\alpha$

return best solution $\hat{\mathbf{x}}:=\mathbf{x}^{(1)}$.

end
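The pseudocode above translates directly into a short runnable sketch. Parameter names mirror the pseudocode; the `steps` argument replaces the informal "until satisfied" stopping rule, and the best solution found so far is tracked explicitly (both are implementation choices, not part of the original description).

```python
import numpy as np

def pbil(f, D, N=50, Ns=10, dt=0.1, m=0.02, alpha=0.05, steps=200, rng=None):
    """PBIL for minimizing f over {0,1}^D (sketch mirroring the pseudocode)."""
    rng = rng or np.random.default_rng(0)
    p = np.full(D, 0.5)                          # p_{0,j} = 1/2
    best_x, best_f = None, np.inf
    for _ in range(steps):
        # Draw N samples from the independent Bernoulli proposal.
        pop = (rng.random((N, D)) < p).astype(int)
        # Rank so that f(x^(1)) <= ... <= f(x^(N)).
        pop = pop[np.argsort([f(x) for x in pop])]
        if f(pop[0]) < best_f:                   # track the best solution so far
            best_x, best_f = pop[0].copy(), f(pop[0])
        # Update the probability vector towards each of the Ns best samples.
        for i in range(Ns):
            p = p * (1 - dt) + pop[i] * dt
        # Mutate each component with probability m.
        for j in range(D):
            if rng.random() < m:
                p[j] = p[j] * (1 - alpha) + rng.integers(2) * alpha
    return best_x
```

For instance, `pbil(lambda x: int(np.sum(x)), D=10)` minimizes the number of ones in a 10-bit string and typically returns the all-zeros vector.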