One-Step Diffusion Generation via Curriculum Trajectory Matching

The problem with one-step generation

Diffusion models are powerful but slow. Starting from pure Gaussian noise \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), a standard DDPM traces a reverse Markov chain across $T \approx 1000$ timesteps to recover a clean image $\mathbf{x}_0$. Each step is a full neural network evaluation, and the cost adds up.

The natural desire is to collapse this into a single step — one forward pass from noise to image. Recent approaches like consistency models and progressive distillation have made real progress here. But I keep coming back to a nagging question: why do one-step models still fall short on fine detail and diversity?

My intuition is that the problem is geometric. The diffusion trajectory traces a specific path through probability space. Standard distillation asks a student model to map one endpoint to the other, but never explicitly teaches it anything about the path in between. The model is expected to invent a shortcut, and the shortcut tends to be blurry.


A different framing

What if instead of skipping the trajectory, we taught the model to internalize it first?

The idea is to train the model through a curriculum that mirrors the structure of the forward diffusion process — but in reverse:

  1. Stage 1: teach the model to reproduce the marginal distribution at $t = 998$.
  2. Stage 2: teach it to reproduce the marginal at $t = 997$.
  3. Continue down to $t = 0$, adding one timestep per stage.

Each stage adds only one step of denoising responsibility. The model is never asked to jump farther than it has learned. By the time training is complete, the full trajectory geometry has been absorbed — and at inference the model can traverse it in a single forward pass.
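The stage structure can be sketched as a plain loop. This is a minimal sketch: the stage trainer is a placeholder, the schedule follows $t_k = T - k\lfloor T/K \rfloor$ with $K = T$ (defined formally below), and the exact off-by-one depends on how $t$ is indexed.

```python
def train_curriculum(train_stage, T=1000):
    """Stage k targets timestep t_k = T - k: one new denoising step per stage,
    so the model is never asked to jump farther than it has already learned."""
    results = []
    for k in range(1, T + 1):
        t_k = T - k                 # stage 1 sits just below pure noise
        results.append(train_stage(k, t_k))
    return results

# Placeholder stage trainer: just record which timestep each stage targets.
targets = train_curriculum(lambda k, t: t, T=5)
```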


The math

Background: the forward process

Under the DDPM noise schedule $\{\beta_t\}_{t=1}^T$, let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. The forward marginal is available in closed form at any timestep:

\[q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1-\bar{\alpha}_t)\mathbf{I}\right)\]

This is the key fact that makes the curriculum tractable: we can always compute the target distribution at any stage without running the full forward chain.
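As a sketch, once the schedule is precomputed, drawing a stage target from the closed-form marginal takes one line. The linear $\beta$ schedule is the common DDPM default, assumed here rather than specified in the text:

```python
import numpy as np

# Standard linear beta schedule (a common DDPM default, assumed here).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)      # \bar{alpha}_t = prod of alpha_s up to t

def forward_marginal_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
```

No simulation of the chain is needed: the target at any curriculum stage is one reparameterized Gaussian draw.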

What one-step generation means, formally

A one-step generator is a map $G_\theta : \mathbb{R}^d \to \mathbb{R}^d$ satisfying

\[(G_\theta)_\# \,\mathcal{N}(\mathbf{0}, \mathbf{I}) = p_\text{data}\]

That is, $G_\theta$ must push forward the Gaussian prior to the data distribution. The standard distillation objective tries to achieve this by minimizing

\[\mathcal{L}_\text{distill}(\theta) = \mathbb{E}_{\mathbf{z},\,\mathbf{x}_0 \sim p_\text{teacher}(\cdot|\mathbf{z})}\!\left[\,\| G_\theta(\mathbf{z}) - \mathbf{x}_0 \|^2\right]\]

This is geometrically blind — the student only ever sees the endpoints.
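For contrast, the endpoint objective is just a mean-squared error between the student's one-step output and the teacher's sample. A sketch with placeholder callables:

```python
import numpy as np

def distill_loss(G_theta, z, x0_teacher):
    """Endpoint-only distillation: E[ || G(z) - x0 ||^2 ].
    Nothing about the intermediate trajectory enters the loss."""
    diff = G_theta(z) - x0_teacher
    return np.mean(np.sum(diff ** 2, axis=-1))
```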

Curriculum matching objective

At curriculum stage $k$, with target timestep $t_k = T - k\lfloor T/K \rfloor$, the loss I have in mind is:

\[\mathcal{L}_k(\theta) = \mathbb{E}_{\mathbf{z},\,\mathbf{x}_0}\!\left[\, D_\text{KL}\!\left( q(\mathbf{x}_{t_k} \mid \mathbf{x}_0) \;\Big\|\; p_\theta^{(k)}(\,\cdot\,|\,\mathbf{z}) \right) \right]\]

Since the target is Gaussian, and reading the stage-$k$ output as a Gaussian centered at $G_\theta^{(k)}(\mathbf{z})$ with the forward-process variance $(1-\bar{\alpha}_{t_k})\mathbf{I}$, the KL collapses to a weighted $\ell_2$:

\[\mathcal{L}_k(\theta) = \mathbb{E}\!\left[\, \frac{1}{2(1-\bar{\alpha}_{t_k})} \Bigl\| G_\theta^{(k)}(\mathbf{z}) - \sqrt{\bar{\alpha}_{t_k}}\,\mathbf{x}_0 \Bigr\|^2 \right]\]

The weight $\frac{1}{2(1-\bar{\alpha}_{t_k})}$ is important: as $t_k \to 0$ we have $\bar{\alpha}_{t_k} \to 1$, so the weight grows large, putting more pressure on precision near the data manifold where it matters most.
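Written out, the stage loss is a weighted regression toward the scaled clean image. A sketch, with the generator callable and the precomputed $\bar{\alpha}$ table as stand-ins:

```python
import numpy as np

def stage_loss(G_k, z, x0, t_k, alpha_bar):
    """L_k = E[ || G_k(z) - sqrt(abar_{t_k}) * x0 ||^2 / (2 * (1 - abar_{t_k})) ]."""
    abar = alpha_bar[t_k]
    target = np.sqrt(abar) * x0     # mean of the forward marginal q(x_t | x_0)
    diff = G_k(z) - target
    return np.mean(np.sum(diff ** 2, axis=-1)) / (2.0 * (1.0 - abar))
```

A generator that hits the scaled target exactly drives the loss to zero; the $1/(1-\bar{\alpha}_{t_k})$ factor sharpens the penalty at late stages.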

Consistency across stages

A curriculum alone is not enough. Without any coupling between stages, the model might learn each one in isolation and lose what it learned before. I want to add a regularizer that forces stage $k$’s output to be geometrically coherent with stage $k-1$’s.

Concretely, take the model’s output at stage $k-1$ and apply one teacher denoising step to get a target for stage $k$:

\[\hat{\mathbf{x}}_{t_k} = \frac{1}{\sqrt{\alpha_{t_{k-1}}}}\!\left( G_\theta^{(k-1)}(\mathbf{z}) - \frac{\beta_{t_{k-1}}}{\sqrt{1-\bar{\alpha}_{t_{k-1}}}}\; \boldsymbol{\epsilon}_\phi\!\left(G_\theta^{(k-1)}(\mathbf{z}),\, t_{k-1}\right) \right)\]

where $\boldsymbol{\epsilon}_\phi$ is the frozen teacher. The regularizer is then simply

\[\mathcal{R}_k(\theta) = \mathbb{E}\!\left[\,\bigl\| G_\theta^{(k)}(\mathbf{z}) - \hat{\mathbf{x}}_{t_k} \bigr\|^2\right]\]

and the full loss at stage $k$ is

\[\boxed{ \mathcal{L}_k^\text{total}(\theta) = \underbrace{\mathcal{L}_k(\theta)}_{\text{marginal matching}} + \lambda\;\underbrace{\mathcal{R}_k(\theta)}_{\text{trajectory consistency}} }\]
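The two pieces can be put together as follows. This is a sketch: `eps_teacher` stands in for the frozen teacher $\boldsymbol{\epsilon}_\phi$, and the default `lam` value is an arbitrary placeholder, not something the derivation fixes.

```python
import numpy as np

def teacher_step_target(x_prev, t_prev, eps_teacher, alphas, alpha_bar, betas):
    """One frozen-teacher DDPM step applied to the stage-(k-1) output:
    x_hat = (x - beta_t / sqrt(1 - abar_t) * eps(x, t)) / sqrt(alpha_t)."""
    eps = eps_teacher(x_prev, t_prev)
    coef = betas[t_prev] / np.sqrt(1.0 - alpha_bar[t_prev])
    return (x_prev - coef * eps) / np.sqrt(alphas[t_prev])

def total_stage_loss(G_k, z, x0, x_hat, t_k, alpha_bar, lam=0.1):
    """L_k^total = marginal matching + lam * trajectory consistency."""
    abar = alpha_bar[t_k]
    out = G_k(z)
    marginal = np.mean(np.sum((out - np.sqrt(abar) * x0) ** 2, axis=-1)) \
        / (2.0 * (1.0 - abar))
    consistency = np.mean(np.sum((out - x_hat) ** 2, axis=-1))
    return marginal + lam * consistency
```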

Why I think this could work

The diffusion trajectory is not an arbitrary path — it has structure. At each $t$, the marginal $q(\mathbf{x}_t)$ sits at a precise location in the interpolation between $\mathcal{N}(\mathbf{0},\mathbf{I})$ and $p_\text{data}$. The curriculum teaches this interpolation in order, rather than demanding the model discover it implicitly.

An analogy I find useful: learning to write. You don’t learn a letterform by being shown the final result and told to reproduce it. You learn the strokes, in sequence. Once the motor program is internalized, execution becomes fast. The trajectory is the motor program here.

There is also a connection to optimal transport. The sequence $\{q(\mathbf{x}_t)\}_{t=T}^{0}$ is a displacement interpolation between the prior and the data distribution. Teaching this interpolation in natural order may yield a one-step map with lower transport cost — and lower transport cost tends to mean less blur and better coverage of modes.


How it relates to prior work

This sits in a space between a few existing ideas.

Consistency models (Song et al., 2023) enforce a self-consistency condition: $f_\theta(\mathbf{x}_t, t) = f_\theta(\mathbf{x}_{t'}, t')$ for any two points on the same trajectory. My approach is complementary — I focus on matching marginal distributions at each timestep rather than enforcing self-consistency. In principle the two constraints could be combined.

Progressive distillation (Salimans & Ho, 2022) halves the number of sampling steps in each round by training a student to match two teacher steps in one. The curriculum here is more granular: rather than folding steps together, it adds them one at a time. This should provide more stable gradient signal in early stages.

Rectified flow (Liu et al., 2022) forces the trajectory to be a straight line from noise to data. My approach does not constrain the path’s shape — it follows the natural curvature of the diffusion marginals, which may be important for capturing the geometry of complex data distributions.


Open questions I’m sitting with

Does the order actually matter? I’m training from high $t$ to low $t$, following the causal direction of the forward process. But would a reversed curriculum, or a randomized one, do just as well? My intuition says the natural order matters, but I don’t have a proof.

Is $\ell_2$ the right loss at low $t$? Near the data manifold, the marginal $q(\mathbf{x}_t)$ is no longer well-approximated as Gaussian, and $\ell_2$ supervision is known to produce blurry outputs. A perceptual or adversarial loss at late stages seems worth exploring.

How coarse can the curriculum be? In principle one stage per timestep is ideal, but $K = 100$ or even $K = 50$ coarse stages are far more practical. It is not obvious how much of the trajectory structure is lost by coarsening, and where the quality cliff lies.
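The stage schedule from the loss above, $t_k = T - k\lfloor T/K \rfloor$, makes the coarsening concrete:

```python
def coarse_timesteps(T=1000, K=100):
    """Stage target timesteps t_k = T - k * floor(T/K), for k = 1..K."""
    step = T // K
    return [T - k * step for k in range(1, K + 1)]

# K = 100 jumps 10 timesteps per stage; K = 50 jumps 20 per stage.
```

One consequence worth noting: with a coarse schedule, the single teacher step in the consistency regularizer no longer lands exactly on $t_k$, so the regularizer would need either multiple teacher steps or a tolerance for that gap.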

Does the consistency regularizer need to be global? The current form only ties adjacent stages. A stronger version might enforce that the model’s entire learned trajectory up to stage $k$ is globally coherent — closer in spirit to the consistency constraint in Song et al. — though this would be significantly more expensive to compute.


Where this leaves me

The honest answer is: this is a thought, not yet a result. I find the framing compelling because it gives the one-step objective a clear geometric grounding — something that most distillation approaches lack. Whether the compounding gains from curriculum training actually outweigh the additional complexity is an empirical question I haven’t answered yet.

But I think the directional intuition is right. A model that has been taught to respect the geometry of the diffusion trajectory should generalize better to one-step generation than one that has only ever seen endpoints. That seems worth testing.