In this post we'll be looking at a notion of optimality for manifolds from Chigirev and Bialek.
Problem setting
Suppose that we have data \(x\) that follows the distribution \(p(x)\) and we want to find an encoding of the data, \(z\), that is lower dimensional than \(x\) but still captures the important information in \(x\). We will want to find an encoding distribution, \(q(z|x)\), and a function \(f(z)\) that maps the encoding back to the data space, such that \(f(z)\) stays close to \(x\) (measured by distortion) while the capacity needed to transmit the encoding (measured by mutual information) stays fixed.
Distortion
There will be some loss involved in this process, which we will measure with the concept of distortion. Suppose that we have a function that measures how far apart two points are, \(d(x,f(z))\). Then the distortion is defined as:
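\[
D = \int p(x) \int q(z|x)\, d(x, f(z))\, dz\, dx .
\]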
In order for the math to work out nicely, we can choose \(d=\frac{1}{2}\|x-f(z)\|^2\).
Mutual information
We also want to ensure that the encoding is efficient, i.e. that the mutual information between the encoding and the data, which measures the capacity needed to transmit the encoding, is kept fixed. This is defined as:
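\[
I = \int p(x) \int q(z|x) \log\frac{q(z|x)}{q(z)}\, dz\, dx ,
\]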
where \(q(z) = \int p(x)q(z|x)dx\) is the marginal under the model.
Optimal manifold
We want to find the optimal \(f\) and \(q(z|x)\) so that the distortion is minimized while the mutual information is held fixed. Chigirev and Bialek pose this as the solution of the optimization problem
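\[
\min_{f,\; q(z|x)}\; F[f, q] = D + \lambda I = \int p(x)\int q(z|x)\left[d(x, f(z)) + \lambda\log\frac{q(z|x)}{q(z)}\right] dz\, dx ,
\]
where \(\lambda\) is a Lagrange multiplier that enforces the constraint on the mutual information.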
We can solve this problem without using functional calculus by parametrizing \(f\) and \(q\) with the parameters \(\theta\) and \(\phi\) respectively and then setting the gradients to \(0\):
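\[
\nabla_\theta F(\theta, \phi) = 0 , \qquad \nabla_\phi F(\theta, \phi) = 0 ,
\]
where \(F(\theta, \phi)\) is the objective above with \(f = f_\theta\) and \(q(z|x) = q_\phi(z|x)\).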
Optimal decoder function
Let's derive the gradient of \(F\) with respect to \(\theta\) first.
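Only the distortion term depends on \(\theta\), so with \(d = \frac{1}{2}\|x - f(z)\|^2\),
\[
\nabla_\theta F = \int p(x)\int q_\phi(z|x)\, \nabla_\theta\, d(x, f_\theta(z))\, dz\, dx
= -\int\!\!\int p(x)\, q_\phi(z|x)\, \big(x - f_\theta(z)\big)^\top \nabla_\theta f_\theta(z)\, dz\, dx .
\]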
Therefore, in order to get a gradient of \(0\), we would need
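\[
\int p(x)\, q(z|x)\, \big(x - f(z)\big)\, dx = 0 \quad \text{for every } z ,
\qquad \text{i.e.} \qquad
f(z) = \frac{\int p(x)\, q(z|x)\, x\, dx}{\int p(x)\, q(z|x)\, dx} = \int p(x|z)\, x\, dx ,
\]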
where \(p(x|z) = \frac{p(x)q(z|x)}{q(z)}\) is the conditional under the model.
Optimal encoder distribution
Next, let's look at the gradient with respect to \(\phi\):
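\[
\nabla_\phi F = \int p(x)\int \nabla_\phi\, q_\phi(z|x) \left[d(x, f(z)) + \lambda\log\frac{q_\phi(z|x)}{q_\phi(z)}\right] dz\, dx
+ \lambda\int p(x)\int q_\phi(z|x) \Big[\nabla_\phi \log q_\phi(z|x) - \nabla_\phi \log q_\phi(z)\Big]\, dz\, dx .
\]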
The last part of the equation is 0 because
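\[
\int p(x)\int q_\phi(z|x)\, \nabla_\phi \log q_\phi(z|x)\, dz\, dx
= \int p(x)\int \nabla_\phi\, q_\phi(z|x)\, dz\, dx
= \int p(x)\, \nabla_\phi (1)\, dx = 0
\]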
and
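\[
\int p(x)\int q_\phi(z|x)\, \nabla_\phi \log q_\phi(z)\, dz\, dx
= \int q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz
= \int \nabla_\phi\, q_\phi(z)\, dz = \nabla_\phi (1) = 0 .
\]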
So in order to get a gradient of \(0\), we would need
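\[
\log q(z|x) = \log q(z) - \frac{1}{\lambda}\, d(x, f(z)) - \log Z(x)
= \log q(z) - \frac{1}{2\lambda}\|x - f(z)\|^2 - \log Z(x) ,
\]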
where \(\log Z(x)\) is a normalization constant (any term that doesn't depend on \(z\) integrates to \(0\) against \(\nabla_\phi q_\phi(z|x)\), so it can be added without changing the gradient). Note that \(-\frac{1}{2\lambda}\|x - f(z)\|^2 = \log N(x|f(z),\lambda I) + \text{const}\), so \(\frac{1}{\lambda}d(x,f(z))\) can be interpreted, up to an additive constant, as the negative log likelihood of \(x\). Under this interpretation, the conditional \(p(x|z)\) that appears in the expression for \(f\) is \(N(x|f(z),\lambda I)\).
So in summary, we have that the optimal \(f\) and \(q\) are given by
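\[
f(z) = \int p(x|z)\, x\, dx , \qquad
q(z|x) = \frac{q(z)\, N(x|f(z), \lambda I)}{\int q(z')\, N(x|f(z'), \lambda I)\, dz'} .
\]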
A way to interpret this is that the generating process for the data is \(z\sim q(z)\) and then \(x\sim N(x|f(z),\lambda I)\).
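To make the self-consistent equations concrete, here is a minimal sketch (not the authors' code) that iterates them on a finite sample: \(p(x)\) is taken to be the empirical distribution over the data, \(z\) is discretized into \(K\) values, and the function name, initialization, and update order are my own choices.

```python
import numpy as np
from scipy.special import logsumexp

def fit_optimal_manifold(X, K=20, lam=0.1, n_iters=200, seed=0):
    """Alternate the self-consistent updates for q(z|x), q(z) and f(z).

    X: (N, D) array of samples; p(x) is the empirical distribution over the rows.
    z is discretized into K values, so f is a (K, D) array and q(z) a length-K vector.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    f = X[rng.choice(N, size=K, replace=False)].copy()  # initialize f(z) at random data points
    log_qz = np.full(K, -np.log(K))                      # q(z), initialized uniform

    for _ in range(n_iters):
        # Encoder update: q(z|x) is proportional to q(z) N(x | f(z), lam * I)
        sq_dists = ((X[:, None, :] - f[None, :, :]) ** 2).sum(-1)  # (N, K)
        log_q_zx = log_qz[None, :] - sq_dists / (2.0 * lam)
        log_q_zx -= logsumexp(log_q_zx, axis=1, keepdims=True)     # normalize over z
        q_zx = np.exp(log_q_zx)

        # Marginal update: q(z) = integral of p(x) q(z|x) dx, an average over the sample
        log_qz = np.log(q_zx.mean(axis=0) + 1e-12)

        # Decoder update: f(z) = E_{p(x|z)}[x], with p(x|z) proportional to p(x) q(z|x)
        p_xz = q_zx / (q_zx.sum(axis=0, keepdims=True) + 1e-12)    # columns sum to 1
        f = p_xz.T @ X                                             # (K, D)

    return f, np.exp(log_qz)
```

Each pass alternates the encoder update \(q(z|x) \propto q(z)N(x|f(z),\lambda I)\), the marginal update \(q(z) = \int p(x)q(z|x)dx\), and the decoder update \(f(z) = \int p(x|z)\,x\,dx\); smaller \(\lambda\) puts less weight on the information term, giving a finer, lower-distortion manifold.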
General distance function
We can keep the interpretation that the optimal encoding distribution is a posterior distribution by identifying \(\frac{1}{\lambda}d(x,f(z))\), up to an additive constant, with the negative log likelihood of a data generating process, \(-\log p_\lambda(x|f(z))\). The optimal \(q(z|x)\) is still the posterior:
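\[
q(z|x) = \frac{q(z)\, p_\lambda(x|f(z))}{\int q(z')\, p_\lambda(x|f(z'))\, dz'} .
\]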
However, the optimal manifold might not be something we can solve for analytically. \(f(z)\) will be a function that satisfies
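\[
\int p(x|z)\, \frac{\partial}{\partial f(z)}\, d(x, f(z))\, dx = 0 \quad \text{for every } z ,
\]
which reduces to the conditional mean \(f(z) = \int p(x|z)\, x\, dx\) when \(d\) is the squared error.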