In this post we'll be looking at a notion of optimality for manifolds from Chigirev and Bialek.
Problem setting
Suppose that we have data \(x\) that follows the distribution \(p(x)\) and we want to find an encoding of the data, \(z\), that is lower dimensional than \(x\) but still captures the important information in \(x\). We will want to find an encoding distribution, \(q(z|x)\), and a function \(f(z)\) that maps the encoding back to the data space, such that \(f(z)\) stays close to \(x\) (measured by distortion) while the capacity needed to transmit the encoding (measured by mutual information) stays fixed.
Distortion
There will be some loss involved in this process, which we will measure with the concept of distortion. Suppose that we have a function that measures how far apart two points are, \(d(x,f(z))\). Then the distortion is defined as:
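\[
D = \int p(x) \int q(z|x)\, d(x, f(z))\, dz\, dx .
\]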
In order for the math to work out nicely, we can choose \(d=\frac{1}{2}\|x-f(z)\|^2\).
Mutual information
We also want to ensure that the encoding is efficient, i.e. that the mutual information between the encoding and the data, which measures the capacity needed to transmit the encoding, is kept fixed. This is defined as:
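\[
I = \int p(x) \int q(z|x) \log\frac{q(z|x)}{q(z)}\, dz\, dx ,
\]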
where \(q(z) = \int p(x)q(z|x)dx\) is the marginal under the model.
Optimal manifold
We want to find the optimal \(f\) and \(q(z|x)\) so that the distortion is minimized while the mutual information is held fixed. Chigirev and Bialek pose this as the solution of the optimization problem
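\[
\min_{f,\; q(z|x)}\; F[f, q] = D + \lambda I = \int p(x)\int q(z|x)\left[d(x, f(z)) + \lambda\log\frac{q(z|x)}{q(z)}\right] dz\, dx ,
\]
where \(\lambda\) is a Lagrange multiplier that enforces the constraint on the mutual information.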
We can solve this problem without using functional calculus by parametrizing \(f\) and \(q\) with the parameters \(\theta\) and \(\phi\) respectively and then setting the gradients to \(0\):
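\[
\nabla_\theta F(\theta, \phi) = 0 , \qquad \nabla_\phi F(\theta, \phi) = 0 ,
\]
where \(F(\theta, \phi)\) is the objective above with \(f = f_\theta\) and \(q(z|x) = q_\phi(z|x)\).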
Optimal decoder function
Let's derive the gradient of \(F\) with respect to \(\theta\) first.
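Only the distortion term depends on \(\theta\), so with \(d = \frac{1}{2}\|x - f(z)\|^2\),
\[
\nabla_\theta F = \int p(x)\int q_\phi(z|x)\, \nabla_\theta\, d(x, f_\theta(z))\, dz\, dx
= -\int\!\!\int p(x)\, q_\phi(z|x)\, \big(x - f_\theta(z)\big)^\top \nabla_\theta f_\theta(z)\, dz\, dx .
\]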
Therefore, in order to get a gradient of \(0\), we would need
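\[
\int p(x)\, q(z|x)\, \big(x - f(z)\big)\, dx = 0 \quad \text{for every } z ,
\qquad \text{i.e.} \qquad
f(z) = \frac{\int p(x)\, q(z|x)\, x\, dx}{\int p(x)\, q(z|x)\, dx} = \int p(x|z)\, x\, dx ,
\]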
where \(p(x|z) = \frac{p(x)q(z|x)}{q(z)}\) is the conditional under the model.
Optimal encoder distribution
Next, let's look at the gradient with respect to \(\phi\):
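\[
\nabla_\phi F = \int p(x)\int \nabla_\phi\, q_\phi(z|x) \left[d(x, f(z)) + \lambda\log\frac{q_\phi(z|x)}{q_\phi(z)}\right] dz\, dx
+ \lambda\int p(x)\int q_\phi(z|x) \Big[\nabla_\phi \log q_\phi(z|x) - \nabla_\phi \log q_\phi(z)\Big]\, dz\, dx .
\]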
The last part of the equation is 0 because
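\[
\int p(x)\int q_\phi(z|x)\, \nabla_\phi \log q_\phi(z|x)\, dz\, dx
= \int p(x)\int \nabla_\phi\, q_\phi(z|x)\, dz\, dx
= \int p(x)\, \nabla_\phi (1)\, dx = 0
\]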
and
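\[
\int p(x)\int q_\phi(z|x)\, \nabla_\phi \log q_\phi(z)\, dz\, dx
= \int q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz
= \int \nabla_\phi\, q_\phi(z)\, dz = \nabla_\phi (1) = 0 .
\]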
So in order to get a gradient of \(0\), we would need
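\[
\log q(z|x) = \log q(z) - \frac{1}{\lambda}\, d(x, f(z)) - \log Z(x)
= \log q(z) - \frac{1}{2\lambda}\|x - f(z)\|^2 - \log Z(x) ,
\]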
where \(\log Z(x)\) is a normalization constant (any term that doesn't depend on \(z\) integrates to \(0\) against \(\nabla_\phi q_\phi(z|x)\), so it can be added without changing the gradient). Note that \(-\frac{1}{2\lambda}\|x - f(z)\|^2 = \log N(x|f(z),\lambda I) + \text{const}\), so \(\frac{1}{\lambda}d(x,f(z))\) can be interpreted, up to an additive constant, as the negative log likelihood of \(x\). Under this interpretation, the conditional \(p(x|z)\) that appears in the expression for \(f\) is \(N(x|f(z),\lambda I)\).
So in summary, we have that the optimal \(f\) and \(q\) are given by
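\[
f(z) = \int p(x|z)\, x\, dx , \qquad
q(z|x) = \frac{q(z)\, N(x|f(z), \lambda I)}{\int q(z')\, N(x|f(z'), \lambda I)\, dz'} .
\]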
A way to interpret this is that the generating process for the data is \(z\sim q(z)\) and then \(x\sim N(x|f(z),\lambda I)\).
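To make the self-consistent equations concrete, here is a minimal sketch (not the authors' code) that iterates them on a finite sample: \(p(x)\) is taken to be the empirical distribution over the data, \(z\) is discretized into \(K\) values, and the function name, initialization, and update order are my own choices.

```python
import numpy as np
from scipy.special import logsumexp

def fit_optimal_manifold(X, K=20, lam=0.1, n_iters=200, seed=0):
    """Alternate the self-consistent updates for q(z|x), q(z) and f(z).

    X: (N, D) array of samples; p(x) is the empirical distribution over the rows.
    z is discretized into K values, so f is a (K, D) array and q(z) a length-K vector.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    f = X[rng.choice(N, size=K, replace=False)].copy()  # initialize f(z) at random data points
    log_qz = np.full(K, -np.log(K))                      # q(z), initialized uniform

    for _ in range(n_iters):
        # Encoder update: q(z|x) is proportional to q(z) N(x | f(z), lam * I)
        sq_dists = ((X[:, None, :] - f[None, :, :]) ** 2).sum(-1)  # (N, K)
        log_q_zx = log_qz[None, :] - sq_dists / (2.0 * lam)
        log_q_zx -= logsumexp(log_q_zx, axis=1, keepdims=True)     # normalize over z
        q_zx = np.exp(log_q_zx)

        # Marginal update: q(z) = integral of p(x) q(z|x) dx, an average over the sample
        log_qz = np.log(q_zx.mean(axis=0) + 1e-12)

        # Decoder update: f(z) = E_{p(x|z)}[x], with p(x|z) proportional to p(x) q(z|x)
        p_xz = q_zx / (q_zx.sum(axis=0, keepdims=True) + 1e-12)    # columns sum to 1
        f = p_xz.T @ X                                             # (K, D)

    return f, np.exp(log_qz)
```

Each pass alternates the encoder update \(q(z|x) \propto q(z)N(x|f(z),\lambda I)\), the marginal update \(q(z) = \int p(x)q(z|x)dx\), and the decoder update \(f(z) = \int p(x|z)\,x\,dx\); smaller \(\lambda\) puts less weight on the information term, giving a finer, lower-distortion manifold.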
General distance function
We can keep the interpretation that the optimal encoding distribution is a posterior distribution by identifying \(\frac{1}{\lambda}d(x,f(z))\), up to an additive constant, with the negative log likelihood of a data generating process, \(-\log p_\lambda(x|f(z))\). The optimal \(q(z|x)\) is still the posterior:
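\[
q(z|x) = \frac{q(z)\, p_\lambda(x|f(z))}{\int q(z')\, p_\lambda(x|f(z'))\, dz'} .
\]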
However, the optimal manifold might not be something we can solve for analytically. \(f(z)\) will be a function that satisfies
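\[
\int p(x|z)\, \frac{\partial}{\partial f(z)}\, d(x, f(z))\, dx = 0 \quad \text{for every } z ,
\]
which reduces to the conditional mean \(f(z) = \int p(x|z)\, x\, dx\) when \(d\) is the squared error.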