Intro

As a second-year undergraduate in mathematics, I’ve recently started self-studying stochastic differential equations (SDEs), since they are interesting and applicable to machine learning. Most resources dive quickly into measure-theoretic probability, and I struggled to find concise, technically accurate summaries that bridge undergraduate calculus and probability with the core machinery of Itô calculus. This post is my attempt to fill that gap. It’s written primarily for fellow undergraduates who know basic probability but haven’t yet taken a full course in stochastic calculus.

Brownian Motion

Before introducing stochastic differential equations, it is helpful to recall the deterministic case.

An ordinary differential equation (ODE) has the form

\[\frac{df}{dt} = b(f,t), \quad f(t_0) = f_0,\]

or in differential notation,

\[df = b(f,t)\, dt.\]

Here $b(\cdot,t)$ is the deterministic drift coefficient, and under suitable conditions on $b$ (for example, Lipschitz continuity in its first argument) there exists a unique deterministic solution $f(t)$.

Ordinary differential equations often provide an effective framework for modeling many real-world systems, such as population growth. However, these models are fully deterministic. Given the same initial conditions, they always produce the same trajectory.

In reality, most systems are subject to random fluctuations due to environmental noise, measurement errors, or other unpredictable influences. For instance, stock prices react to market events that nobody can forecast exactly.

To capture such randomness in a continuous-time setting, we can extend the deterministic model by adding a stochastic noise term:

\[dX_t = \mu(X_t, t)\, dt + \sigma(X_t, t)\, dW_t, \quad X_0 = x_0.\]

The noise term is $\sigma(X_t, t)\, dW_t$, which is driven by increments of a stochastic process $W_t$. In principle, many different processes could be used (for example, increments of a compensated Poisson process to model jump discontinuities). In applications, however, we often choose standard Brownian motion because it has several convenient properties that make the resulting Itô calculus tractable and powerful.

These key properties include:

Gaussian increments: $W_t - W_s \sim \mathcal{N}(0, t-s)$ for $t > s$,
Independent increments: $W_{t_{k+1}} - W_{t_k}$ are mutually independent for disjoint intervals,
Almost surely continuous paths: $t \mapsto W_t(\omega)$ is continuous for almost every $\omega$.
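To make these three properties concrete, here is a minimal Python sketch (standard library only) that builds a discretised Brownian path from independent Gaussian increments and estimates the mean and variance of $W_t - W_s$ across many paths. The names `brownian_path` and `increment_stats`, the grid size, and the seed are my own illustrative choices, and a grid-based path is only an approximation of the continuous object.

```python
import math
import random


def brownian_path(T, n_steps, rng):
    """Return [W_0, W_dt, ..., W_T] on a uniform grid, with W_0 = 0."""
    dt = T / n_steps
    path = [0.0]
    for _ in range(n_steps):
        # Each increment is an independent N(0, dt) draw.
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path


def increment_stats(n_paths=20000, s=0.5, t=2.0, seed=0):
    """Empirical mean and variance of W_t - W_s across many simulated paths."""
    rng = random.Random(seed)
    n_steps = 20
    dt = t / n_steps
    i_s = round(s / dt)  # grid index corresponding to time s
    increments = []
    for _ in range(n_paths):
        w = brownian_path(t, n_steps, rng)
        increments.append(w[-1] - w[i_s])
    mean = sum(increments) / n_paths
    var = sum((x - mean) ** 2 for x in increments) / (n_paths - 1)
    return mean, var


if __name__ == "__main__":
    print(increment_stats())
```

With the defaults $s = 0.5$ and $t = 2$, the sample mean should land near $0$ and the sample variance near $t - s = 1.5$, up to Monte Carlo error.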

A crucial consequence of the Gaussian increments property is the scaling of the noise in small time intervals. Over an infinitesimal interval $dt$, we have

\[dW_t = W_{t+dt} - W_t \sim \mathcal{N}(0, dt)\]

Thus, $dW_t$ has mean zero and variance $dt$, so its typical magnitude (standard deviation) is $\sqrt{dt}$.

This gives rise to a useful informal rule in Itô calculus, which we will use shortly: $(dW_t)^2 \approx dt$, while higher-order terms such as $(dt)^2$ and $dt \cdot dW_t$ are negligible.

These informal rules, namely $(dW_t)^2 \approx dt$, $dt \cdot dW_t \approx 0$, and $(dt)^2 \approx 0$, will be made rigorous later, but they already hint at why stochastic calculus is so different from ordinary calculus.
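One way to get a feel for these rules before seeing the rigorous version is to add up the three kinds of products over a fine grid on $[0, T]$. The sketch below is a rough numerical illustration, not a proof. `increment_sums` is a hypothetical helper name of my own, and it uses only the Python standard library.

```python
import math
import random


def increment_sums(T, n_steps, rng):
    """Sum (dW)^2, dt*dW and (dt)^2 over a uniform grid on [0, T]."""
    dt = T / n_steps
    sum_dW2 = 0.0
    sum_dt_dW = 0.0
    for _ in range(n_steps):
        dW = rng.gauss(0.0, math.sqrt(dt))  # N(0, dt) increment
        sum_dW2 += dW * dW
        sum_dt_dW += dt * dW
    sum_dt2 = n_steps * dt * dt  # equals T * dt
    return sum_dW2, sum_dt_dW, sum_dt2


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 1000, 100000):
        print(n, increment_sums(1.0, n, rng))
```

As the grid is refined, the first sum should concentrate near $T$, while the other two shrink towards $0$, which is the discrete picture behind $(dW_t)^2 \approx dt$.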

Itô’s Formula

Suppose $f(t,x)$ is a smooth function and we evaluate it along a stochastic process $X_t$, giving $f(t,X_t)$.

To compute $df(t,X_t)$, we apply a second-order Taylor expansion in the two variables $t$ and $x$:

\[df = \frac{\partial f}{\partial t} \, dt + \frac{\partial f}{\partial x} \, dX + \frac{1}{2} \frac{\partial^2 f}{\partial x^2} (dX)^2 + \frac{\partial^2 f}{\partial t \partial x} \, dt \, dX + \frac{1}{2} \frac{\partial^2 f}{\partial t^2} (dt)^2 + o((dt)^2 + (dX)^2)\]

Recall the multiplication table: $dt \cdot dt = dt \cdot dW_t = 0$ and $dW_t \cdot dW_t = dt$.

Then, for an Itô process $X_t$, the only second-order term that survives is the one in $(dX_t)^2$, and we obtain Itô’s formula

\[df(t,X_t) = \frac{\partial f}{\partial t}(t,X_t) \, dt + \frac{\partial f}{\partial x}(t,X_t) \, dX_t + \frac{1}{2} \frac{\partial^2 f}{\partial x^2}(t,X_t) \, (dX_t)^2,\]

where $dX_t = \mu \, dt + \sigma \, dW_t$, so that $(dX_t)^2 = \sigma^2 \, dt$ by the multiplication table. Substituting both yields

\[df(t,X_t) = \left( f_t + \mu f_x + \frac{1}{2} \sigma^2 f_{xx} \right) dt + \sigma f_x \, dW_t.\]
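As a sanity check, take $f(x) = x^2$ and $X_t = W_t$ (so $\mu = 0$ and $\sigma = 1$). Itô's formula then gives $d(W_t^2) = dt + 2 W_t \, dW_t$, whereas the ordinary chain rule would drop the $dt$ term. The sketch below compares both predictions on one simulated path, approximating $\int_0^T W_t \, dW_t$ by a left-endpoint sum. `ito_check` is a hypothetical helper name, and the grid size is an arbitrary choice.

```python
import math
import random


def ito_check(T, n_steps, rng):
    """Return (W_T^2, Ito prediction, classical chain-rule prediction)."""
    dt = T / n_steps
    w = 0.0
    ito_integral = 0.0  # left-endpoint sum approximating int_0^T W dW
    for _ in range(n_steps):
        dW = rng.gauss(0.0, math.sqrt(dt))
        ito_integral += w * dW  # integrand evaluated before the increment
        w += dW
    lhs = w * w
    ito_rhs = T + 2.0 * ito_integral   # from d(W^2) = dt + 2 W dW
    classical_rhs = 2.0 * ito_integral  # what d(W^2) = 2 W dW would give
    return lhs, ito_rhs, classical_rhs


if __name__ == "__main__":
    print(ito_check(1.0, 100000, random.Random(0)))
```

On a fine grid the first two numbers should nearly agree, while the classical prediction is off by roughly $T$. The gap is exactly the sum of squared increments, which is where the extra $\frac{1}{2} \sigma^2 f_{xx}$ term comes from.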

Example: Geometric Brownian Motion

Consider the SDE for asset prices

\[dS_t = \mu S_t \, dt + \sigma S_t \, dW_t, \quad S_0 > 0.\]

Apply Itô’s formula to $f(s) = \log s$:

\[f_s = \frac{1}{s}, \quad f_{ss} = -\frac{1}{s^2}.\]

Then, with drift $\mu S_t$ and diffusion $\sigma S_t$ in Itô’s formula, the factors of $S_t$ cancel:

\[d(\log S_t) = \left( \mu - \frac{1}{2} \sigma^2 \right) dt + \sigma \, dW_t.\]

Integrating,

\[\log S_t = \log S_0 + \left( \mu - \frac{1}{2} \sigma^2 \right) t + \sigma W_t.\]

Thus

\[S_t = S_0 \exp\left( \left( \mu - \frac{1}{2} \sigma^2 \right) t + \sigma W_t \right)\]

$S_t$ is log-normally distributed with

\[\mathbb{E}[S_t] = S_0 e^{\mu t}\]
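The closed-form solution can be checked numerically in two ways. The sketch below (standard library only) first steps the SDE forward with the Euler-Maruyama discretisation, a standard scheme that I am bringing in here rather than something derived above, and compares it with the exact formula driven by the same Brownian increments. It then estimates $\mathbb{E}[S_t]$ by Monte Carlo. All function names and parameter values are illustrative assumptions.

```python
import math
import random


def gbm_exact(s0, mu, sigma, t, w_t):
    """Closed-form S_t for a given value w_t of the Brownian motion at time t."""
    return s0 * math.exp((mu - 0.5 * sigma ** 2) * t + sigma * w_t)


def gbm_euler_vs_exact(s0, mu, sigma, T, n_steps, rng):
    """Return (Euler-Maruyama S_T, exact S_T) driven by the same increments."""
    dt = T / n_steps
    s, w = s0, 0.0
    for _ in range(n_steps):
        dW = rng.gauss(0.0, math.sqrt(dt))
        s += mu * s * dt + sigma * s * dW  # discretised dS = mu S dt + sigma S dW
        w += dW
    return s, gbm_exact(s0, mu, sigma, T, w)


def gbm_sample_mean(s0, mu, sigma, t, n_samples, rng):
    """Monte Carlo estimate of E[S_t], sampling W_t ~ N(0, t) directly."""
    total = 0.0
    for _ in range(n_samples):
        total += gbm_exact(s0, mu, sigma, t, rng.gauss(0.0, math.sqrt(t)))
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print(gbm_euler_vs_exact(1.0, 0.05, 0.2, 1.0, 10000, rng))
    print(gbm_sample_mean(1.0, 0.05, 0.2, 1.0, 100000, rng), math.exp(0.05))
```

With a small step size the two terminal values should be close, and the sample mean should approach $S_0 e^{\mu t}$. Note that the mean grows at rate $\mu$ even though the typical path grows at rate $\mu - \frac{1}{2} \sigma^2$.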

Reverse SDE

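Consider again the SDE $dX_t = \mu(X_t, t)\, dt + \sigma(X_t, t)\, dW_t, \quad X_0 = x_0.$

This is called the forward process. Over a small time step $\Delta t$ (written $dt$ inside the formulas below), it can be rewritten as a Gaussian transition density centred at $\mathbf{x}_t + \mu(\mathbf{x}_t,t)\, dt$:

\[p(\mathbf{x}_{t+\Delta t} \mid \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t+\Delta t};\ \mathbf{x}_t + \mu(\mathbf{x}_t,t)\, dt,\ \sigma(\mathbf{x}_t, t)^2\, dt\right) \propto \exp\left(-\frac{(\mathbf{x}_{t+\Delta t} - \mathbf{x}_t - \mu(\mathbf{x}_t,t)\, dt)^2}{2\sigma(\mathbf{x}_t, t)^2\, dt}\right)\]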
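In code, one forward step and its log-density might look like the following scalar sketch. It follows the exponent in the expression above, so the Gaussian is centred at $x_t + \mu \, dt$ with variance $\sigma^2 \, dt$. The names `forward_step` and `forward_log_density` are hypothetical, and `mu` and `sigma` are assumed to be plain Python callables of `(x, t)`.

```python
import math
import random


def forward_step(x, t, mu, sigma, dt, rng):
    """Sample x_{t+dt} given x_t = x from the Gaussian transition."""
    z = rng.gauss(0.0, 1.0)  # standard normal noise
    return x + mu(x, t) * dt + sigma(x, t) * math.sqrt(dt) * z


def forward_log_density(x_next, x, t, mu, sigma, dt):
    """log p(x_{t+dt} = x_next | x_t = x), including the normalising constant."""
    var = sigma(x, t) ** 2 * dt
    resid = x_next - x - mu(x, t) * dt
    return -0.5 * math.log(2.0 * math.pi * var) - resid ** 2 / (2.0 * var)


if __name__ == "__main__":
    rng = random.Random(0)
    drift = lambda x, t: -x       # illustrative drift
    diffusion = lambda x, t: 1.0  # illustrative constant noise level
    x1 = forward_step(1.0, 0.0, drift, diffusion, 0.01, rng)
    print(x1, forward_log_density(x1, 1.0, 0.0, drift, diffusion, 0.01))
```

The quadratic term in `forward_log_density` is the same expression that appears in the exponent of the transition density.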

Now we want the reverse transition, which Bayes’ rule expresses as $p(\mathbf{x}_t \mid \mathbf{x}_{t+\Delta t}) = \frac{p(\mathbf{x}_{t+\Delta t} \mid \mathbf{x}_t) p(\mathbf{x}_t)}{p(\mathbf{x}_{t+\Delta t})}$.

First take the logarithm:

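\[\log p(\mathbf{x}_t \mid \mathbf{x}_{t+\Delta t}) = \log p(\mathbf{x}_{t+\Delta t} \mid \mathbf{x}_t) + \log p(\mathbf{x}_t) - \log p(\mathbf{x}_{t+\Delta t})\]

Equivalently, rearranging for the forward transition,

\[\log p(\mathbf{x}_{t+\Delta t} \mid \mathbf{x}_t) = \log p(\mathbf{x}_t \mid \mathbf{x}_{t+\Delta t}) + \log p(\mathbf{x}_{t+\Delta t}) - \log p(\mathbf{x}_t)\]

To handle the difference of the two marginal log-densities, expand $\log p(\mathbf{x}_{t+\Delta t})$ to first order around $\mathbf{x}_t$:

\[\log p(\mathbf{x}_{t+\Delta t}) = \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}} \log p(\mathbf{x}_t) \cdot (\mathbf{x}_{t+\Delta t} - \mathbf{x}_t) + O(dt)\]

Writing the forward increment as $\mathbf{x}_{t+\Delta t} - \mathbf{x}_t = \mu(\mathbf{x}_t,t)\, dt + \sigma(\mathbf{x}_t,t) \sqrt{dt} \, \mathbf{z}$ with $\mathbf{z}$ a standard normal vector, this becomes

\[\log p(\mathbf{x}_{t+\Delta t}) - \log p(\mathbf{x}_t) = \nabla_{\mathbf{x}} \log p(\mathbf{x}_t) \cdot (\mu(\mathbf{x}_t,t)\, dt + \sigma(\mathbf{x}_t,t) \sqrt{dt} \, \mathbf{z}) + O(dt)\]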

Substituting back:

\[\log p(\mathbf{x}_t \mid \mathbf{x}_{t+\Delta t}) = -\frac{(\mathbf{x}_{t+\Delta t} - \mathbf{x}_t - \mu(\mathbf{x}_t,t)\, dt)^2}{2\sigma(\mathbf{x}_t, t)^2\, dt} - \nabla_{\mathbf{x}} \log p(\mathbf{x}_t) \cdot (\mathbf{x}_{t+\Delta t} - \mathbf{x}_t) + \text{const}\]

Writing $\Delta \mathbf{x} = \mathbf{x}_{t+\Delta t} - \mathbf{x}_t$ and suppressing the arguments of $\mu$ and $\sigma$, this reads

\[\log p(\mathbf{x}_t \mid \mathbf{x}_{t+\Delta t}) = -\frac{(\Delta \mathbf{x} - \mu\, dt)^2}{2\sigma^2\, dt} - \nabla_{\mathbf{x}} \log p(\mathbf{x}_t) \cdot \Delta \mathbf{x} + \text{const}\]