Representation learning, generative models, and the manifold hypothesis

Suppose you are given a dataset and want to find its best representation. What does “best” even mean? This question sits at the center of my research and connects generative modeling, differential geometry, and information theory in ways that I think are underappreciated. In this post, I want to lay out the thread that connects these areas and explain why I believe coordinate systems are the right lens for thinking about representation learning.

The manifold hypothesis and dimensionality reduction

The manifold hypothesis states that high-dimensional data tends to concentrate near a low-dimensional manifold embedded in ambient space. If true, a good representation should capture the manifold’s intrinsic coordinates while discarding directions that are pure noise. PCA is the simplest example: it finds linear directions of maximum variance and discards the rest. But what is PCA really doing from a geometric perspective?

PCA can be interpreted as a linear normalizing flow trained by maximum likelihood, whose Jacobian is constrained to have orthogonal columns. That constraint is precisely what makes the flow the solution to a dimensionality reduction problem: the coordinate with maximum variance captures the most information about the data, and the orthogonality between coordinates ensures that the information decomposes cleanly. An important property of PCA, often taken for granted, is that the total information in the linear flow decomposes exactly into the information of its individual coordinates.
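This decomposition is easy to verify numerically. The sketch below (plain NumPy, on an illustrative random dataset of my own choosing) fits PCA via an SVD and checks that the log-determinant of the data covariance, which measures the total Gaussian information up to constants, equals the sum of the per-coordinate terms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative correlated 3-D Gaussian data (centered).
A = rng.normal(size=(3, 3))
X = rng.normal(size=(2000, 3)) @ A.T
X -= X.mean(axis=0)

# PCA via SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                       # data expressed in PCA coordinates
cov_Z = Z.T @ Z / len(Z)           # diagonal: the coordinates are uncorrelated

# Total information (logdet of the covariance, up to Gaussian constants)
# versus the sum of per-coordinate variances' logs.
total_info = np.linalg.slogdet(X.T @ X / len(X))[1]
per_coord = np.sum(np.log(np.diag(cov_Z)))
print(np.allclose(total_info, per_coord))   # True: the decomposition is exact
```

Because the PCA coordinates are uncorrelated, the off-diagonal covariance terms vanish and the log-determinant factors into a sum, which is exactly the coordinate-wise information decomposition described above.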

The question that drives my research is whether there is a way to generalize this property beyond linear flows.

From PCA to nonlinear coordinate systems

Consider data from a noisy circle. Intuitively, a good one-dimensional representation is the angle around the circle because it identifies points up to noise. Now consider data whose distribution looks like a tube in \(\mathbb{R}^3\) with an elliptical cross section. There is still an obvious one-dimensional representation (distance along the tube), but we also need to decide how to decompose the noise dimensions. For the elliptical cross section, aligning with the principal axes seems natural.
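A quick numerical illustration of the circle case, using a hypothetical noisy dataset of my own construction: the angular coordinate recovers the signal exactly, while the radial coordinate carries only the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy circle: a 1-D signal (the angle) embedded in R^2 with radial noise.
theta = rng.uniform(0.0, 2.0 * np.pi, size=1000)
radius = 1.0 + 0.05 * rng.normal(size=1000)
X = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)

# The angular coordinate recovers the signal; the radial one is pure noise.
theta_hat = np.arctan2(X[:, 1], X[:, 0]) % (2.0 * np.pi)
r_hat = np.linalg.norm(X, axis=1)

# Circular distance guards against wrap-around at 0 / 2*pi.
d = np.abs(theta_hat - theta)
err = np.minimum(d, 2.0 * np.pi - d)
print(err.max())     # essentially zero: the angle identifies points up to noise
print(r_hat.std())   # close to 0.05: the residual variation is the radial noise
```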

What makes these choices good? In both cases, we can decompose the data into independent factors: the signal coordinate and the noise coordinates. The key observation is that this decomposition is possible precisely because the coordinate curves are orthogonal with respect to a metric that reflects the data distribution.

But for more complicated distributions, orthogonal coordinates may not exist. This is where the geometry becomes essential.

Normalizing flows as coordinate systems

A normalizing flow \(F: \mathcal{Z} \to \mathcal{X}\) trained via maximum likelihood learns a diffeomorphism from a simple latent space to the data space. The inverse \(F^{-1}\) gives coordinates on the data space. These coordinates are not arbitrary. They are chosen (implicitly, through training) to make the pushforward of the data distribution as close to the prior as possible.

The pullback metric of the Euclidean metric on \(\mathcal{X}\) through \(F\) is \(G = J^\top J\) where \(J\) is the Jacobian of \(F\). This metric tells us how the flow distorts space. The eigenvalues of \(G\) at a point \(z\) describe how much each latent direction stretches or compresses data space at that point.
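Here is a minimal sketch of the pullback metric for a toy volume-preserving flow \(F(z_1, z_2) = (z_1, z_2 + z_1^2)\) (my own illustrative choice, not an example from any particular library):

```python
import numpy as np

# Toy nonlinear flow F(z) = (z1, z2 + z1^2), a "banana" warp.
def F(z):
    return np.array([z[0], z[1] + z[0] ** 2])

def jacobian(z):
    # Analytic Jacobian of F at z.
    return np.array([[1.0, 0.0],
                     [2.0 * z[0], 1.0]])

def pullback_metric(z):
    # G = J^T J, the pullback of the Euclidean metric through F.
    J = jacobian(z)
    return J.T @ J

z = np.array([1.0, 0.0])
G = pullback_metric(z)
eigvals = np.linalg.eigvalsh(G)
print(G)        # nonzero off-diagonals: the latent coordinates are not orthogonal here
print(eigvals)  # stretch factors of the latent directions at z
```

Note that \(\det J = 1\) everywhere for this flow, so the eigenvalues of \(G\) multiply to one: the flow reallocates volume between directions without creating or destroying it, which is exactly what the point-dependent eigenvalues quantify.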

A natural question is: can we choose \(F\) so that the pullback metric is diagonal? If so, the coordinates given by \(F^{-1}\) would be orthogonal in data space, and (as in PCA) the information would decompose exactly into the information of each coordinate.

Curvature as an obstruction to factored representations

This is where Riemannian geometry provides a sharp answer. A metric can be diagonalized by a coordinate transformation if and only if certain components of the Riemann curvature tensor vanish. Specifically, the off-diagonal sectional curvatures of the pullback metric must be zero.

For a metric derived from a probability density function (through the Fisher information metric or through the pullback of a normalizing flow), this curvature is in general nonzero. This means that for most data distributions, there is no coordinate system that simultaneously makes all factors independent. Any factored representation will necessarily lose some information.
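For reference, the curvature in question is built from the metric in the standard way, via the Christoffel symbols:

\[\Gamma^k_{ij} = \tfrac{1}{2} g^{kl}\left(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij}\right), \qquad R^l{}_{kij} = \partial_i \Gamma^l_{jk} - \partial_j \Gamma^l_{ik} + \Gamma^l_{im}\Gamma^m_{jk} - \Gamma^l_{jm}\Gamma^m_{ik}.\]

When the relevant components of \(R\) vanish identically, the metric admits orthogonal coordinates; when they do not, no diffeomorphism can diagonalize it.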

This is not just a theoretical nuisance. It tells us something fundamental about the structure of the representation learning problem: the geometry of the data determines an upper bound on how well any factored representation can perform.

Dimensionality reduction through optimal coordinate dropping

Given these constraints, how should we choose which coordinates to drop? I have been developing a framework that formulates dimensionality reduction as an optimal control problem. The idea is as follows.

Given a diffeomorphism \(F: \mathcal{Z} \to \mathcal{X}\), we can “drop” a coordinate by projecting onto a coordinate subspace in \(\mathcal{Z}\) and mapping back to \(\mathcal{X}\). Rather than doing this instantaneously, we consider a time-indexed family of maps that smoothly transports a point along the coordinate curve to the submanifold where that coordinate is zero. The energy of this transport defines a cost function:

\[J_t[\gamma](x_t) = \int_0^t \left\| \frac{dP_s(x_s)}{ds} \right\|^2 ds\]

where \(x_s = \gamma(s)\), \(P_s = F \circ \Pi_s \circ F^{-1}\), and \(\Pi_s\) is the time-indexed coordinate-dropping operator. Taking the infimum over all transport paths \(\gamma\) gives a value function that measures the minimum energy needed to project a point onto the coordinate submanifold.
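To make the cost concrete, here is a small numerical sketch. The toy flow and the schedule for \(\Pi_s\) (linearly scaling the dropped coordinate to zero) are my own illustrative assumptions, not the framework's actual choices:

```python
import numpy as np

# Toy invertible flow and its exact inverse.
def F(z):
    return np.array([z[0], z[1] + z[0] ** 2])

def F_inv(x):
    return np.array([x[0], x[1] - x[0] ** 2])

def P(s, x, k=1):
    # Assumed coordinate-dropping schedule: Pi_s scales coordinate k by (1 - s).
    z = F_inv(x)
    z[k] *= (1.0 - s)
    return F(z)

def transport_energy(x, k=1, n_steps=200):
    # Approximate int_0^1 ||dP_s(x)/ds||^2 ds with finite differences.
    s_grid = np.linspace(0.0, 1.0, n_steps + 1)
    path = np.array([P(s, x, k) for s in s_grid])
    velocities = np.diff(path, axis=0) * n_steps   # dP_s/ds along the path
    return np.sum(np.sum(velocities ** 2, axis=1)) / n_steps

x = F(np.array([1.0, 0.5]))
print(transport_energy(x))   # energy of collapsing coordinate 1 starting from x
```

Averaging this quantity over samples from the data distribution gives the expected projection cost discussed next; different choices of \(F\) change the paths \(s \mapsto P_s(x)\) and hence the energy.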

The expected value of this cost over the data distribution measures how much information is lost by dropping coordinate \(k\). The optimal diffeomorphism \(F\) that minimizes this expected cost produces coordinate systems where the dropped direction truly captures the least important variation. Remarkably, the optimality conditions for this problem recover the requirement that coordinate curves be geodesics of the pullback metric, connecting back to the differential geometry machinery.
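For completeness, the geodesic condition on a coordinate curve \(x(s)\), written with the Christoffel symbols \(\Gamma^k_{ij}\) of the pullback metric, is the standard one:

\[\ddot{x}^k + \Gamma^k_{ij}\,\dot{x}^i\,\dot{x}^j = 0.\]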

The bigger picture

The thread connecting these ideas is that good representations are good coordinate systems, and the quality of a coordinate system is determined by its geometry. Normalizing flows provide a natural class of coordinate systems through their learned diffeomorphisms. The pullback metric captures how these coordinates interact with data space geometry. And the curvature of this metric determines fundamental limits on how well we can factorize the representation.

This perspective also explains why different generative model architectures succeed in different settings. Models that implicitly learn metrics with low curvature (relative to the data manifold) will find representations that factor more cleanly. Models that operate on data with high intrinsic curvature will necessarily learn more entangled representations, and no amount of architectural engineering can overcome this geometric obstruction.

The software libraries I have built (generax for flow-based generative models, linsdex for linear SDE inference, and local_coordinates for Riemannian geometry computations) are tools for investigating these questions computationally. Together, they let us build generative models, analyze the geometry of the resulting representations, and understand the fundamental limits of what those representations can achieve.



