The linear SDE viewpoint on flow matching and diffusion

Flow matching is arguably the most important contribution to the field of normalizing flows in the last few years. It allows us to train continuous normalizing flows in a simulation-free way while also sidestepping the pitfalls of likelihood-based training. In this post, we will go over how to construct a continuous normalizing flow that generates a target distribution from any user-specified prior, how to train it using flow matching, and how the resulting framework is naturally expressed through linear SDEs. This last point is important because it shows that flow matching and diffusion models are two perspectives on the same underlying mathematical object.

Constructing a CNF from conditional paths

Say that we are trying to learn a parametric approximation of an unknown probability distribution that we can sample from. The first flow matching paper showed how to construct a continuous normalizing flow that generates this target from any user-specified prior. Let \(p_1:=p_\text{data}\) be the target distribution and let \(p_0\) be a user-specified prior. We assume that there is a probability path between \(p_0\) and \(p_1\), denoted by \(p_t\), that is generated by the flow of a time-dependent vector field \(V_t\). Our goal is to find an equation for \(V_t\).

To do this, we assume that \(p_t\) is a marginal probability distribution over \(x_t\) and some random variable \(y\):

\[p_t(x_t) = \int p_y(y)p_{t|y}(x_t|y) dy\]

where \(p_y(y)\) and \(p_{t\vert y}(x_t\vert y)\) are chosen so that \(p_{t=0} = p_0\) and \(p_{t=1} = p_\text{data}\). Next, we assume that there is a CNF that generates \(p_{t\vert y}(x_t\vert y)\) using a vector field \(\tilde{V}_t(x_t\vert y)\) that takes \(y\) as a parameter. The key insight is that we can write \(V_t(x_t)\) as the expected value of \(\tilde{V}_t(x_t\vert y)\) over the posterior distribution of \(y\) given \(x_t\):

\[V_t(x_t) = \int \frac{p_y(y)p_{t|y}(x_t|y)}{p_t(x_t)}\tilde{V}_t(x_t|y) dy = \mathbb{E}_{p_{t|y}(y|x_t)}[\tilde{V}_t(x_t|y)]\]

We can verify this by checking the continuity equation:

\[\begin{align} \frac{\partial p_t}{\partial t} &= \frac{d}{dt}\int p_y(y)p_{t|y}(x_t|y) dy \\ &= -\int p_y(y)\text{Div}(p_{t|y}(x_t|y)\tilde{V}_t(x_t|y))dy \\ &= -\text{Div}(p_t \underbrace{\int \frac{1}{p_t}p_y(y)p_{t|y}(x_t|y)\tilde{V}_t(x_t|y)dy}_{V_t}) \end{align}\]
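As a concrete toy illustration (a two-point data distribution, assumed purely for the sake of example), the marginal field is a posterior-weighted average of conditional fields, so it always lies between them:

```python
import numpy as np

def normal_pdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# toy data distribution: y in {-1, +1} with equal weights, and the
# Gaussian conditional path p_{t|y} = N(t * y, (1 - t)^2)
t, x = 0.6, np.linspace(-2, 2, 9)
ys = np.array([-1.0, 1.0])

w = np.stack([normal_pdf(x, t * y, 1 - t) for y in ys])  # p_y(y) p_{t|y}(x|y)
w /= w.sum(axis=0)                                       # posterior p(y | x_t)
v_cond = np.stack([(y - x) / (1 - t) for y in ys])       # conditional fields
v = (w * v_cond).sum(axis=0)                             # marginal field V_t

# V_t is a convex combination, so it lies between the conditional fields
print(np.all((v >= v_cond.min(axis=0)) & (v <= v_cond.max(axis=0))))  # True
```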

The flow matching objective

Now that we have a way to construct a target vector field \(V_t\), we can train a parametric model \(W_t(x_t;\theta)\) to match it. The flow matching loss is:

\[\mathcal{L}_{\text{FM}}(\theta) = \int_0^1 \mathbb{E}_{p_t(x_t)}\left[\left\|V_t(x_t) - W_t(x_t;\theta)\right\|^2\right] dt\]

The crucial observation is that this loss can be decomposed as:

\[\mathcal{L}_{\text{FM}}(\theta) = C_1 - C_2 + \underbrace{\int_0^1 \mathbb{E}_{p_y(y)p_{t|y}(x_t|y)}\left[\|\tilde{V}_t(x_t|y) - W_t(x_t;\theta)\|^2\right]dt}_{\mathcal{L}_{\text{CFM}}(\theta)}\]

where \(C_1\) and \(C_2\) are constants that do not depend on \(\theta\). So we can minimize the flow matching loss by minimizing the conditional flow matching loss \(\mathcal{L}_{\text{CFM}}(\theta)\), which only requires samples from the conditional path \(p_{t\vert y}(x_t\vert y)\) and its vector field \(\tilde{V}_t(x_t\vert y)\).
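To see where the constants come from, expand the square and use the posterior-expectation formula for \(V_t\) (a standard step, sketched here):

\[\mathbb{E}_{p_t(x_t)}\left[\|V_t - W_t\|^2\right] = \mathbb{E}_{p_t}\left[\|V_t\|^2\right] - 2\,\mathbb{E}_{p_t}\left[\langle V_t, W_t\rangle\right] + \mathbb{E}_{p_t}\left[\|W_t\|^2\right]\]

and, since \(V_t(x_t) = \mathbb{E}_{p_{t|y}(y|x_t)}[\tilde{V}_t(x_t|y)]\),

\[\mathbb{E}_{p_t}\left[\langle V_t, W_t\rangle\right] = \mathbb{E}_{p_y(y)p_{t|y}(x_t|y)}\left[\langle \tilde{V}_t(x_t|y), W_t(x_t)\rangle\right]\]

The cross term and the \(\|W_t\|^2\) term therefore agree between the two losses; they differ only in the \(\theta\)-independent terms \(C_1 = \int_0^1 \mathbb{E}_{p_t}\big[\|V_t\|^2\big]dt\) and \(C_2 = \int_0^1 \mathbb{E}_{p_y p_{t|y}}\big[\|\tilde{V}_t\|^2\big]dt\), so their gradients with respect to \(\theta\) coincide.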

Gaussian conditional probability paths

If we choose \(p_0 = \mathcal{N}(0,I)\), then we can set \(p_y(y) = p_\text{data}(y)\) and \(p_{t\vert y}(x_t\vert y) = \mathcal{N}(x_t; \mu_t(y), \Sigma_t(y))\) where \(\mu_t\) and \(\Sigma_t\) are differentiable functions satisfying:

\[\mu_{t=0}(y) = 0,\quad \Sigma_{t=0}(y) = I, \quad \mu_{t=1}(y) = y,\quad \Sigma_{t=1}(y) = 0\]

Since we can sample \(x_t \sim p_{t\vert y}(x_t\vert y)\) using the reparameterization \(x_t = \mu_t(y) + \Sigma_t(y)^{1/2}x_0\) where \(x_0 \sim \mathcal{N}(0,I)\), the conditional vector field is:

\[\tilde{V}_t(x_t|y) = \frac{d \mu_t(y)}{dt} + \frac{d \Sigma_t(y)^{1/2}}{dt}\Sigma_t^{-1/2}(x_t - \mu_t(y))\]

The simplest choice satisfying the boundary conditions is \(\mu_t(y) = ty\) and \(\Sigma_t(y) = (1-t)^2I\), which gives the optimal transport conditional path:

\[x_t = (1-t)x_0 + ty, \quad \tilde{V}_t(x_t|y) = y - x_0\]

This leads to an extremely simple training algorithm:

import torch

def flow_matching_objective(model, data_batch):
  batch_size = data_batch.shape[0]
  t = torch.rand(batch_size, 1)        # t ~ U[0, 1], broadcasts over features
  x0 = torch.randn_like(data_batch)    # prior sample from N(0, I)
  xt = x0 + t * (data_batch - x0)      # x_t = (1 - t) x_0 + t y
  vt = data_batch - x0                 # conditional field y - x_0
  wt = model(t, xt)
  return ((vt - wt) ** 2).mean()
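Once \(W_t\) is trained, sampling means integrating the ODE \(dx/dt = W_t(x_t)\) from \(t=0\) to \(t=1\). As a sanity check with a hypothetical single-point dataset \(\{y\}\) (an assumption made so the exact field \((y - x)/(1-t)\) can stand in for the model), Euler integration transports prior samples onto the data:

```python
import numpy as np

# Sampling sketch: integrate dx/dt = V_t(x) with Euler steps. For a
# single-point dataset {y}, the exact field (y - x)/(1 - t) plays the
# role of the trained model W_t.
rng = np.random.default_rng(0)
y, steps = 3.0, 1000
x = rng.standard_normal(10_000)      # samples from the prior N(0, 1)

for k in range(steps):
    t = k / steps
    x += (y - x) / (1 - t) / steps   # Euler step with dt = 1 / steps

# the flow transports every prior sample onto the data point
print(np.abs(x - y).max() < 1e-8)  # True
```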

The connection to diffusion models through SDEs

This is where things get interesting. The vector field \(V_t\) that generates the probability path is related to the score function \(\nabla \log p_t\), and this relationship reveals that flow matching and diffusion models are two views of the same object.

From SDEs to vector fields

If we have an SDE of the form \(dx = f_t dt + g_t dW\), its probability path satisfies the Fokker-Planck equation:

\[\frac{\partial p_t}{\partial t} = -\text{Div}\left(\underbrace{\left(f_t - \frac{g_t^2}{2}\nabla \log p_t\right)}_{V_t}p_t\right)\]

So every SDE gives rise to a deterministic vector field \(V_t = f_t - \frac{g_t^2}{2}\nabla \log p_t\) that generates the same probability path. This is the probability flow ODE.
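A small numerical illustration (an Ornstein-Uhlenbeck toy example, assumed for this sketch): for \(dx = -\tfrac{x}{2}dt + dW\) started at its stationary distribution \(\mathcal{N}(0,1)\), the score is \(-x\), so the probability flow ODE drift vanishes and both dynamics should preserve the unit-variance marginal:

```python
import numpy as np

# OU toy example: dx = -x/2 dt + dW has stationary density N(0, 1),
# so grad log p_t(x) = -x and the probability flow ODE drift is
# f - (g^2 / 2) * score = -x/2 + x/2 = 0.
rng = np.random.default_rng(0)
n, steps, dt = 100_000, 200, 0.01
x_sde = rng.standard_normal(n)       # start at the stationary N(0, 1)
x_ode = x_sde.copy()                 # ODE drift is zero: samples stay put

for _ in range(steps):
    # Euler-Maruyama step for the SDE
    x_sde += -0.5 * x_sde * dt + np.sqrt(dt) * rng.standard_normal(n)

# both descriptions preserve the unit-variance marginal
print(round(float(x_sde.var()), 2), round(float(x_ode.var()), 2))
```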

From vector fields to the score

Going the other direction, for our Gaussian conditional probability paths, the score function of the conditional Gaussian is:

\[\nabla \log \mathcal{N}(x_t|\mu_t,\Sigma_t) = -\Sigma_t^{-1}(x_t - \mu_t)\]

Because both the score function and the vector field satisfy the same posterior expectation property:

\[\nabla \log p_t(x_t) = \int p_t(y|x_t)\nabla \log p_t(x_t|y)dy, \quad V_t(x_t) = \int p_t(y|x_t)\tilde{V}_t(x_t|y)dy\]

we can relate the marginal vector field to the marginal score. For the optimal transport path with \(\mu_t(y) = ty\) and \(\Sigma_t = (1-t)^2I\), the full derivation yields:

\[V_t(x_t) = \frac{1}{t}\left(x_t + (1-t)\nabla \log p_t(x_t)\right)\]

This is the key equation that unifies the two perspectives. Any model that predicts the vector field implicitly predicts the score, and vice versa. Training with the flow matching objective is equivalent to learning the score function up to a known, time-dependent transformation.
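A quick numerical check of this identity in a toy case (a single data point \(y\), assumed so that both sides are available in closed form):

```python
import numpy as np

# Single data point y: p_t = N(t*y, (1 - t)^2), so the score and the
# vector field are both known in closed form and must satisfy
# V_t(x) = (x + (1 - t) * score(x)) / t.
y, t = 2.0, 0.3
x = np.linspace(-3, 3, 7)

score = -(x - t * y) / (1 - t) ** 2        # grad log p_t(x)
v_from_score = (x + (1 - t) * score) / t   # via the unifying identity
v_direct = (y - x) / (1 - t)               # OT conditional field

print(np.allclose(v_from_score, v_direct))  # True
```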

Linear SDEs as the unifying abstraction

The Gaussian conditional probability paths that underlie both flow matching and diffusion models are exactly the transition distributions of linear SDEs. A linear SDE of the form:

\[dx_t = \alpha_t x_t \, dt + \sigma_t \, dW_t\]

has Gaussian transitions \(p(x_t \vert x_s) = \mathcal{N}(\mu_{t\vert s}, \gamma_{t\vert s}^2 I)\) where:

\[\mu_{t|s} = \psi_{t|s} x_s, \quad \gamma_{t|s}^2 = \psi_{t|s}^2 \int_s^t \sigma_r^2\, (\psi_{r|s})^{-2}\, dr\]

and \(\psi_{t\vert s}\) is the transition kernel satisfying \(\frac{d\psi_{t\vert s}}{dt} = \psi_{t\vert s}\alpha_t\) with \(\psi_{t\vert t} = 1\).

Different choices of \(\alpha_t\) and \(\sigma_t\) recover different well-known models:

  • Variance preserving (DDPM/score matching): \(\alpha_t = -\frac{1}{2}\beta_t\), \(\sigma_t = \sqrt{\beta_t}\)
  • Variance exploding: \(\alpha_t = 0\), \(\sigma_t = \sigma(t)\)
  • Optimal transport (flow matching): recovered in the limit \(\sigma_t \to 0\)
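As a sketch (assuming a constant \(\beta\) for simplicity), the variance-preserving choice can be verified numerically against the closed-form DDPM variance \(\gamma_{t|s}^2 = 1 - e^{-\beta(t-s)}\):

```python
import numpy as np

# Variance-preserving check with constant beta: alpha_t = -beta/2 and
# sigma_t = sqrt(beta) should give gamma^2_{t|s} = 1 - exp(-beta (t - s)).
beta, s, t = 2.0, 0.0, 1.0
r = np.linspace(s, t, 100_001)
dr = r[1] - r[0]

psi = np.exp(-0.5 * beta * (r - s))    # solves dpsi/dt = alpha_t psi
integrand = beta / psi ** 2            # sigma_r^2 (psi_{r|s})^{-2}
integral = 0.5 * np.sum(integrand[1:] + integrand[:-1]) * dr  # trapezoid rule
gamma2 = psi[-1] ** 2 * integral

print(round(float(gamma2), 4))         # 0.8647, i.e. 1 - exp(-2)
```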

The linear SDE framework also naturally provides the machinery for conditioning on endpoints through the Doob h-transform. Given a linear SDE conditioned on reaching \(x_T\) at time \(T\), the conditioned drift becomes:

\[dx_{t|T} = \left(\alpha_t x_t + \sigma_t^2 \nabla \log p(x_T | x_t)\right) dt + \sigma_t \, dW_t\]

where \(\nabla \log p(x_T \vert x_t) = \frac{\psi_{T\vert t}}{\gamma_{T\vert t}^2}\left(x_T - \psi_{T\vert t} x_t\right)\) is available in closed form because the transitions are Gaussian; the extra factor of \(\psi_{T\vert t}\) comes from differentiating through the mean \(\mu_{T\vert t} = \psi_{T\vert t} x_t\). This is exactly the bridge process used in diffusion model sampling.
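A minimal sanity check, assuming \(\alpha_t = 0\) and \(\sigma_t = 1\): then \(\psi_{T\vert t} = 1\), \(\gamma_{T\vert t}^2 = T - t\), and the conditioned drift reduces to the familiar Brownian bridge drift \((x_T - x_t)/(T - t)\), which pins simulated paths to \(x_T\):

```python
import numpy as np

# Brownian bridge (alpha_t = 0, sigma_t = 1): psi = 1, gamma^2_{T|t} = T - t,
# so the h-transform drift is (x_T - x_t) / (T - t).
rng = np.random.default_rng(1)
T, steps, n = 1.0, 1000, 10_000
dt = T / steps
x = np.zeros(n)
x_T = 1.5                            # conditioning endpoint

for k in range(steps - 1):           # stop one step before t = T
    t = k * dt
    x += (x_T - x) / (T - t) * dt + np.sqrt(dt) * rng.standard_normal(n)

# paths concentrate tightly around x_T as t -> T
print(float(np.abs(x - x_T).mean()) < 0.1)  # True
```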

The punchline is that flow matching and diffusion models are not competing approaches. They are different parameterizations of inference in the same family of linear Gaussian stochastic processes. The choice between predicting the vector field, the score, or the clean data \(y\) is a choice of neural network output parameterization, not a choice of model class.
