Diffusion Drifting - from a score point of view
The Drifting Model introduces a clever training objective for one-step generative models, grounded in the idea of a drifting field that pushes generated samples toward the data distribution. In this post, I want to derive this objective from a more classical angle: starting from the KL divergence, passing through score matching, and showing how kernel density estimation (KDE) naturally leads to the mean-shift field that appears in the paper.
KL Divergence
We want \(q_\theta \approx p_r\), where \(p_r\) is the distribution of real images and \(q_\theta\) is the pushforward of the prior through the model, \(q_{\theta} = {f_{\theta}}_{\#} p_{\text{prior}}\), with \(p_{\text{prior}} = \mathcal{N}(0,I)\). The natural objective is:
\[D_{\mathrm{KL}}(q_\theta \| p_r) = \mathbb{E}_{\boldsymbol{\epsilon} \sim p_\text{prior}}\!\left[\log q_\theta(f_\theta(\boldsymbol{\epsilon})) - \log p_r(f_\theta(\boldsymbol{\epsilon}))\right]\]Taking a Monte Carlo estimate of the expectation:
\[D_{\mathrm{KL}}(q_\theta \| p_r) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\log q_\theta(\mathbf{x}_i) - \log p_r(\mathbf{x}_i)\right], \quad \mathbf{x}_i = f_\theta(\boldsymbol{\epsilon}_i)\]So for each sample \(\mathbf{x}_i\), the KL loss contribution is:
\[\ell(\mathbf{x}_i) = \log q_\theta(\mathbf{x}_i) - \log p_r(\mathbf{x}_i)\]Taking the gradient w.r.t. each \(\mathbf{x}_i\):
\[\nabla_{\mathbf{x}_i}\left[\log q_\theta(\mathbf{x}_i) - \log p_r(\mathbf{x}_i)\right] = \mathbf{s}_q(\mathbf{x}_i) - \mathbf{s}_r(\mathbf{x}_i)\]To reduce the KL, each \(\mathbf{x}_i\) performs gradient descent:
\[\mathbf{x}_i \leftarrow \mathbf{x}_i - \eta\bigl(\mathbf{s}_q(\mathbf{x}_i) - \mathbf{s}_r(\mathbf{x}_i)\bigr) = \mathbf{x}_i + \eta\underbrace{\bigl(\mathbf{s}_r(\mathbf{x}_i) - \mathbf{s}_q(\mathbf{x}_i)\bigr)}_{V_{p,q}(\mathbf{x}_i)}\]This is exactly the drifting field \(V_{p,q}(\mathbf{x}) = \mathbf{s}_r(\mathbf{x}) - \mathbf{s}_q(\mathbf{x})\), which is zero iff \(p_r = q_\theta\).
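To make the per-sample update concrete, here is a small self-contained sketch (my own toy example, not from the paper) with two 1-D Gaussians whose scores are known in closed form; \(q\) is re-fit to the moving samples at each step, and the cloud drifts onto \(p_r\):

```python
# A toy sketch (my own example, not from the paper): two 1-D Gaussians whose scores
# are known in closed form. Each sample does x <- x + eta * (s_r(x) - s_q(x)),
# with q re-fit to the current samples so the repulsive score tracks the moving cloud.
import torch

torch.manual_seed(0)
mu_r, sig_r = 2.0, 0.5          # "data" distribution  p_r = N(2, 0.5^2)
x = torch.randn(2000)           # samples from the initial q = N(0, 1)
eta = 0.05

def gauss_score(x, mu, sig):
    # score of a Gaussian: d/dx log N(x; mu, sig^2) = -(x - mu) / sig^2
    return -(x - mu) / sig**2

for _ in range(300):
    mu_q, sig_q = x.mean(), x.std()                                   # crude re-fit of q
    V = gauss_score(x, mu_r, sig_r) - gauss_score(x, mu_q, sig_q)     # drifting field
    x = x + eta * V                                                   # per-sample gradient descent

print(float(x.mean()), float(x.std()))   # ~2.0 and ~0.5: the samples now follow p_r
```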
KDE Makes Scores Sample-Based
However, we only have samples, not densities. For a generated sample \(\mathbf{x} \sim q_\theta\), we approximate \(q_\theta\) using other generated samples \(\{y^-_i\} \sim q_\theta\) as negatives, and the data distribution \(p_r\) using real samples \(\{y^+_j\} \sim p_r\) as positives:
\[q_{\theta}(\mathbf{x}) \approx \frac{1}{N}\sum_i k(\mathbf{x}, y^-_i), \qquad p_r(\mathbf{x}) \approx \frac{1}{M}\sum_j k(\mathbf{x}, y^+_j), \qquad k(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\tau^2}\right)\](The kernel's normalizing constant is dropped since it cancels in the score.) The KDE score is:
\[\nabla_\mathbf{x} \log q_{\theta}(\mathbf{x}) = \frac{\sum_i \nabla_\mathbf{x}\, k(\mathbf{x}, y^-_i)}{\sum_i k(\mathbf{x}, y^-_i)}\]Since \(\nabla_\mathbf{x}\, k(\mathbf{x}, y^-_i) = -\dfrac{\mathbf{x}-y^-_i}{\tau^2} k(\mathbf{x}, y^-_i)\), letting \(\tilde{k}(\mathbf{x}, y^-_i) = \dfrac{k(\mathbf{x}, y^-_i)}{\sum_i k(\mathbf{x}, y^-_i)}\) be the softmax-normalized weights:
\[\nabla_\mathbf{x} \log q_{\theta}(\mathbf{x}) = \frac{-\frac{1}{\tau^2}\sum_i k(\mathbf{x}, y^-_i)(\mathbf{x} - y^-_i)}{\sum_i k(\mathbf{x}, y^-_i)} = \frac{1}{\tau^2}\sum_i \tilde{k}(\mathbf{x}, y^-_i)(y^-_i - \mathbf{x}) =: \frac{1}{\tau^2} V^-_q(\mathbf{x})\]This is the mean-shift vector — a weighted average pulling \(\mathbf{x}\) toward nearby generated samples \(\{y^-_i\} \sim q_\theta\). Identically for \(p_r\) with real samples \(\{y^+_j\} \sim p_r\):
\[\nabla_\mathbf{x} \log p_r(\mathbf{x}) = \frac{1}{\tau^2}\sum_j \tilde{k}(\mathbf{x}, y^+_j)(y^+_j - \mathbf{x}) =: \frac{1}{\tau^2} V^+_p(\mathbf{x})\]
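As a concrete reading of these formulas, here is a minimal PyTorch sketch (my own, not the paper's implementation) of the mean-shift vector; the weights \(\tilde{k}\) are exactly a softmax over \(-\|\mathbf{x}-\mathbf{y}_i\|^2/(2\tau^2)\):

```python
# Minimal sketch of the softmax-weighted mean shift; tau and the batch shapes are
# illustrative choices, not values from the paper.
import torch

def mean_shift(x, y, tau):
    """V(x) = sum_i tilde_k(x, y_i) (y_i - x) for a batch of query points.

    x: (B, D) query points, y: (N, D) reference samples (negatives or positives).
    The corresponding KDE score estimate is V / tau**2.
    """
    sq_dist = torch.cdist(x, y) ** 2                    # (B, N) pairwise ||x - y_i||^2
    w = torch.softmax(-sq_dist / (2 * tau**2), dim=1)   # tilde_k: each row sums to 1
    return w @ y - x                                    # sum_i w_i y_i  -  x

# V_minus = mean_shift(x, y_neg, tau)   # pulls x toward nearby generated samples
# V_plus  = mean_shift(x, y_pos, tau)   # pulls x toward nearby real samples
```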
The Drifting Field
Substituting into the score difference:
\[V_{p,q}(\mathbf{x}) = \mathbf{s}_r(\mathbf{x}) - \mathbf{s}_q(\mathbf{x}) \approx \frac{1}{\tau^2}\!\left(V^+_p(\mathbf{x}) - V^-_q(\mathbf{x})\right)\]This is Eq. (10) of the paper, with \(V^+_p\) attracting \(\mathbf{x}\) toward real data and \(V^-_q\) repelling from generated samples. Anti-symmetry \(V_{p,q} = -V_{q,p}\) follows directly from the score difference structure, guaranteeing \(V_{p,q} = \mathbf{0} \Leftrightarrow p_r = q_\theta\).
The two normalized kernels combine into a joint weight (Eq. 11):
\[V_{p,q}(\mathbf{x}) = \mathbb{E}_{y^+\sim p_r,\, y^-\sim q_\theta}\!\left[\tilde{k}(\mathbf{x}, y^+)\,\tilde{k}(\mathbf{x}, y^-)\,(y^+ - y^-)\right]\]
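Interpreting the expectation as the empirical double sum over the same batches, this joint-weight form is algebraically identical to \(V^+_p - V^-_q\): the \(\mathbf{x}\) terms cancel because each set of weights sums to one. A quick numerical check of that identity, with made-up batches:

```python
# Numerical check (toy tensors, not the paper's data): the joint-weight double sum
# equals V^+ - V^-, since each row of softmax weights sums to one.
import torch

torch.manual_seed(0)
tau = 0.7
x     = torch.randn(4, 2)          # query points
y_pos = torch.randn(64, 2) + 1.0   # stand-in "real" samples
y_neg = torch.randn(64, 2) - 1.0   # stand-in "generated" samples

w_pos = torch.softmax(-torch.cdist(x, y_pos) ** 2 / (2 * tau**2), dim=1)  # (B, M)
w_neg = torch.softmax(-torch.cdist(x, y_neg) ** 2 / (2 * tau**2), dim=1)  # (B, N)

# V^+_p(x) - V^-_q(x), each a softmax-weighted mean shift
lhs = (w_pos @ y_pos - x) - (w_neg @ y_neg - x)

# explicit double sum: sum_{j,i} w^+_j w^-_i (y^+_j - y^-_i)
diff  = y_pos[None, :, None, :] - y_neg[None, None, :, :]        # (1, M, N, D)
joint = w_pos[:, :, None, None] * w_neg[:, None, :, None]        # (B, M, N, 1)
rhs = (joint * diff).sum(dim=(1, 2))                             # (B, D)

print(torch.allclose(lhs, rhs, atol=1e-5))   # True
```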
Training Loss
If the model accurately pushes the prior distribution onto the image distribution, then at equilibrium \(V_{p,q}(\mathbf{x}) = \mathbf{0}\), so \(\mathbf{x} + V_{p,q}(\mathbf{x}) = \mathbf{x}\). This motivates chasing a frozen drifted target at each training step:
\[\mathcal{L} = \mathbb{E}_{\boldsymbol{\epsilon}}\left\|f_\theta(\boldsymbol{\epsilon}) - \operatorname{sg}\!\left(f_\theta(\boldsymbol{\epsilon}) + V_{p,q}(f_\theta(\boldsymbol{\epsilon}))\right)\right\|^2 = \mathbb{E}\!\left[\|V_{p,q}\|^2\right]\]Minimizing \(\mathcal{L}\) drives \(V_{p,q} \to \mathbf{0}\), which by the score equivalence above drives \(q_\theta \to p_r\).
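For completeness, a minimal PyTorch training-loop sketch under my own toy assumptions (a tiny 2-D generator and Gaussian-blob "real" data; none of the architecture or hyperparameters come from the paper). The stop-gradient \(\operatorname{sg}(\cdot)\) is realized by computing the drifted target under torch.no_grad():

```python
# Sketch only: a tiny 2-D generator chasing the frozen drifted target.
# f_theta, tau, and the data are illustrative stand-ins, not the paper's setup.
import torch
import torch.nn as nn

def drift(x, y_pos, y_neg, tau):
    """V_{p,q}(x) = (V^+_p(x) - V^-_q(x)) / tau^2 via softmax-weighted mean shifts."""
    w_pos = torch.softmax(-torch.cdist(x, y_pos) ** 2 / (2 * tau**2), dim=1)
    w_neg = torch.softmax(-torch.cdist(x, y_neg) ** 2 / (2 * tau**2), dim=1)
    return ((w_pos @ y_pos - x) - (w_neg @ y_neg - x)) / tau**2

f_theta = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 2))  # one-step generator
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)
real = torch.randn(512, 2) * 0.3 + 1.0   # stand-in for real data p_r
tau = 0.5

for step in range(1000):
    eps = torch.randn(256, 2)                 # eps ~ p_prior
    x = f_theta(eps)                          # x ~ q_theta
    with torch.no_grad():                     # sg(.): freeze the drifted target
        # the generated batch itself serves as the negatives here (self-term included)
        target = x + drift(x, real, x, tau)
    loss = ((x - target) ** 2).sum(dim=1).mean()   # E || f_theta(eps) - sg(...) ||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the target is detached, the gradient of the loss w.r.t. each generated sample is simply \(-2 V_{p,q}(f_\theta(\boldsymbol{\epsilon}))\), i.e. backpropagation pushes every sample along the drifting field, exactly the per-sample update derived above.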