CUDA quantitative finance Monte Carlo GPU

Building a GPU Monte Carlo Options Pricer from Scratch

NVIDIA A10G · custom CUDA kernels · variance reduction · Greeks · Heston · Sobol QMC

I built this because GPU Monte Carlo is one of the few places where "embarrassingly parallel" is not just something you say to sound smart. Every simulated price path is completely independent. There is no communication between paths during simulation. That is the textbook GPU workload, and it is genuinely how production pricing engines at hedge funds and banks work.

I wanted to build the whole thing myself: the CUDA kernels, the RNG, the reductions, the variance reduction, the Greeks, the exotic option types, stochastic volatility. No high-level libraries doing the hard parts. The result is 1611x over CPU at 10M paths, 129 passing tests, and a validation chain that proves correctness before any speed number gets claimed.

1611x

GPU vs CPU at 10M paths

11B

paths/sec on A10G

1380x

variance reduction, control variate

10,918x

delta variance reduction, pathwise

150x

lower RMSE, Sobol vs plain MC

129

tests passing

Why Monte Carlo, and why it is hard to do right

The case for simulation over formulas, and where naive implementations silently go wrong.

For a European call option there is a closed-form answer: Black-Scholes. You plug in five numbers and you get a price. Monte Carlo on a European option proves nothing interesting because you already have the exact answer for free.

The reason to build an MC engine is everything else. Asian options pay based on the average price of the underlying over the life of the contract. Barrier options knock in or out if the price crosses a level at any point along the path. Basket options reference a weighted portfolio of correlated assets. None of these have a clean general closed form. Monte Carlo handles all of them with the same loop, which is why it is the standard tool in any real exotic pricing desk.

The estimator is straightforward:

price = exp(-r * T) * (1/N) * sum_i payoff(path_i)

Statistical error shrinks as 1/sqrt(N). To halve the error you need four times the paths. That slow convergence is exactly why variance reduction is the interesting engineering challenge here, and it is what I spent the most time on.

The asset model

Geometric Brownian Motion with an exact discrete step and zero discretization error in the baseline.

The baseline model assumes the asset follows Geometric Brownian Motion. The exact discrete step from time t to t + dt is:

S_{t+dt} = S_t * exp( (r - 0.5*sigma^2)*dt + sigma*sqrt(dt)*Z )

where r is the risk-free rate, sigma is volatility, and Z ~ N(0,1). This is an exact solution to the GBM stochastic differential equation, not a numerical approximation. There is no discretization error in the baseline. For a European option, a single draw of Z takes you straight to maturity. For path-dependent options, you walk step by step and carry only the running aggregate the payoff requires.

CUDA kernel design

Grid-stride loop, a never-store-paths memory strategy, and a two-level warp-shuffle reduction.

Grid-stride loop

Each thread owns a batch of paths via a grid-stride loop rather than exactly one path. This keeps occupancy high regardless of N and amortizes launch overhead across the batch:

for (int path = blockIdx.x * blockDim.x + threadIdx.x;
         path < N;
         path += gridDim.x * blockDim.x) {
    simulate_and_accumulate(path);
}

Never store full paths

This is the most consequential design decision in the whole engine. Storing N x steps floats would be enormous and the simulation would become memory-bound. Instead, each thread keeps only the running aggregate it needs and discards each time step the moment it is done with it:

European: only the final price S_T, one draw gets you there
Asian: a running sum of S_t to form the path average at the end
Barrier: a single boolean flag updated at each step
Basket: running sums for both correlated assets carried in parallel

This makes the simulation compute-bound instead of memory-bound. On a GPU, compute-bound is where you actually win. Device memory bandwidth never becomes the bottleneck.

Warp-shuffle reduction

After simulation, each thread holds a local payoff accumulator. Collapsing across threads happens in two levels. First, a warp-shuffle folds 32 threads into one entirely in registers with no shared memory involved at all:

for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xffffffff, val, offset);

Each warp's lane 0 then writes to shared memory. A second pass reduces across warps within the block. Each block writes one partial sum to global memory and a final pass collapses the partials. The kernel accumulates both sum(payoff) and sum(payoff^2) in double precision in the same pass, so price and standard error both come out without a second kernel launch.

Random number generation

The part most people get wrong. Correlated streams across threads silently bias prices with no crash and no error message.

Monte Carlo correctness depends entirely on the statistical independence of the random streams across threads. The failure mode is completely silent: correlated streams produce biased prices that look totally plausible. There is no crash, no assertion, no warning. The bias is often small enough that you would not notice by eye but large enough to matter in a real trade.

Naive seeding like seed = thread_id with a sequential PRNG produces correlated streams. Prices look reasonable and are wrong.

I used Philox 4x32-10 via cuRAND. It is a counter-based generator. Each thread gets its stream addressed by (seed, thread_id) and the generator maps that pair to a uniform draw via a keyed bijection. Independence is guaranteed structurally, not by hoping that different initial states diverge far enough.

The reproducibility is a side effect I use directly throughout the project. A fixed seed gives identical results bit-for-bit across runs. That is not just useful for debugging. It is the exact property that makes Common Random Numbers work for the Greeks, where two separate pricing runs need to share path noise so it cancels in the finite difference.

Variance reduction

Four techniques that each shrink estimator noise. This is what separates a pricing engine from a tutorial.

Antithetic variates

For every normal draw Z, I also simulate the path using -Z and average the two payoffs. Because the paths are negatively correlated, the average has lower variance than either one alone. The cost is nearly zero: one extra simulation per thread, no new RNG calls. Works best for monotone payoffs like European and Asian calls.

Control variate: 1380x variance reduction

The idea is to correct the MC estimate using a correlated quantity whose true expectation is known analytically. The showcase here is the arithmetic Asian option. The arithmetic average of a GBM path has no closed form. The geometric average does. The two are highly correlated because they are computed on the same simulated path.

The corrected estimator:

price_cv = price_mc + beta * (geometric_analytic - geometric_mc)

where beta is estimated from the sample covariance. When the correlation is high, the correction nearly eliminates the Monte Carlo noise. At 1M paths the variance reduction is 1380x, meaning you reach the same standard error with roughly 1/1380th the paths.

Sobol quasi-Monte Carlo: 150x lower RMSE

Plain Monte Carlo uses pseudo-random draws that cluster and leave gaps by chance. Sobol sequences fill the sample space more evenly by construction. The convergence rate improves from O(1/sqrt(N)) toward O(1/N). On a log-log error-vs-N plot the slope gets noticeably steeper.

method	measured slope	theory
plain MC	-0.472	-0.500
Sobol QMC	-0.900	-1.000

At 1M paths, Sobol RMSE is 150x lower than plain MC. cuRAND provides scrambled Sobol32 generators that slot into the same kernel interface as Philox with minimal changes.

Common Random Numbers

CRN is not a standalone pricing method. It is used inside the Greek computation. When you need the difference of two MC prices (bumped minus base), running both with the same Philox seed makes the path noise cancel in the difference rather than add. The delta standard error with CRN is 771x lower than naive bump-and-revalue where the two runs have independent noise and the difference is dominated by it.

Greeks

Five risk sensitivities, three computation methods, variance reductions up to 10,918x.

Greeks are the partial derivatives of the option price with respect to its inputs. They are what a trader actually hedges with day to day, and getting them right via simulation is harder than getting the price right. A noisy delta estimate is essentially useless for hedging.

greek	derivative	meaning
Delta	`dP/dS0`	how many shares to hold as a hedge
Gamma	`d2P/dS0^2`	rate of change of delta
Vega	`dP/dsigma`	sensitivity to volatility
Theta	`dP/dT`	time decay per day
Rho	`dP/dr`	sensitivity to interest rates

Bump-and-revalue with CRN

Price at S0 and at S0 + h using the same Philox seed. The path noise cancels in the finite difference. Without CRN, both runs have independent noise and the difference is swamped by it. With CRN the delta standard error is 771x lower. All five Greeks computed this way match the Black-Scholes analytic Greeks for European options at the tested tolerances.

Pathwise delta: 10,918x variance reduction

Instead of pricing twice and differencing, I differentiate through the GBM step directly. For a European call the pathwise delta is:

delta = exp(-r*T) * indicator(S_T > K) * (S_T / S0)

Single GPU pass, no second simulation. Variance reduction over naive finite difference is 10,918x. The tradeoff is that the method requires the payoff to be differentiable almost everywhere. It fails for discontinuous payoffs like digitals, where bump-and-revalue is the fallback.

Path-dependent options

Asian, barrier, and basket options: the cases that actually justify building an MC engine in the first place.

Arithmetic Asian

The payoff is max(avg(S_t) - K, 0) where the average runs over the full simulated path. No closed form for arithmetic averaging under GBM. Each thread accumulates the running sum of S_t at each time step and computes the average at path end. The geometric Asian, which does have a closed form, is computed in parallel on the same path and used directly as the control variate.

Barrier knock-in and knock-out

A barrier option pays a European-style payoff only if the underlying does or does not cross a barrier level during the contract. Each thread carries a boolean and updates it at each step:

touched = touched || (S_t >= barrier);

Knock-out zeroes the payoff when touched == true. Knock-in zeroes it when touched == false. There is a useful identity: knock-in price plus knock-out price equals the vanilla European price for the same parameters. The engine checks this identity as a correctness test, and it passes.

Basket (two-asset correlated)

A basket option pays on a weighted sum of two correlated assets: max(w1*S1 + w2*S2 - K, 0). Correlated normals come from Cholesky decomposition:

Z1 = N(0,1)
Z2 = rho*Z1 + sqrt(1 - rho^2)*N(0,1)

Each thread carries running state for both assets simultaneously. The engine sweeps correlation from -0.9 to 0.9 in one shot. At every level, the arithmetic basket price must sit above the geometric basket price by Jensen's inequality. The engine validates this at four correlation levels and it holds every time.

Heston stochastic volatility

Constant vol is wrong in every real market. Heston makes vol itself a random process and produces a realistic implied vol skew.

The GBM assumption of constant volatility is violated in practice. Real implied vol surfaces are not flat: out-of-the-money puts trade richer than at-the-money options in equity markets. The Heston model fixes this by making variance itself a stochastic mean-reverting process:

dS = r*S*dt + sqrt(V)*S*dW_S
dV = kappa*(theta - V)*dt + xi*sqrt(V)*dW_V
corr(dW_S, dW_V) = rho

V is the instantaneous variance, mean-reverting toward long-run level theta at speed kappa. xi is vol-of-vol, controlling how much variance itself moves. rho is the spot-vol correlation and is the parameter that generates the skew. Negative rho means when the stock falls, volatility spikes, which is the observed equity behavior.

The GPU kernel simulates both processes simultaneously using the full-truncation Euler scheme to prevent variance going negative:

V_next = max(V,0) + kappa*(theta - max(V,0))*dt + xi*sqrt(max(V,0)*dt)*Z_V

Validation against the characteristic function

The Heston model has a semi-analytic solution via the characteristic function: a Fourier integral that gives the exact price under Heston dynamics. I validate the GPU MC prices against this formula at five strikes. The CF formula itself is cross-checked against QuantLib's Heston engine first. The full chain passes at all five strikes. The implied vol smile from negative rho shows up clearly in the output: OTM puts are richer than OTM calls, exactly as observed in real equity markets.

Validation

Correctness is proved, not assumed. No speed number gets reported until the full chain is green.

I made a rule at the start: no benchmark number gets written down until the correctness suite passes. A fast engine that prices wrong is worse than a slow one that prices right.

QuantLib validates the semi-analytic formulas. Those formulas validate every GPU MC price. Mathematical identities provide independent cross-checks at each layer.

Black-Scholes analytic price validates all European MC prices within a few standard errors
Put-call parity (call - put = S0 - K*exp(-r*T)) checked independently of any model
KO + KI = European validates the barrier kernel
Geometric Asian closed form validates the Asian kernel and anchors the control variate
Heston CF formula, itself cross-checked against QuantLib, validates Heston simulation at five strikes
Jensen's inequality (arithmetic basket above geometric basket) validates the basket kernel at four correlation levels
Convergence slope test: MC error vs N on log-log axes must have slope close to -0.500. Measured: -0.503
Fixed-seed reproducibility: identical results bit-for-bit across runs with the same seed

Results

Everything measured on NVIDIA A10G via Modal. All figures validated before being reported.

metric	result	note
MC convergence slope	-0.503	theory: -0.500
Sobol convergence slope	-0.900	vs MC -0.472
Heston CF validation	PASS	5 strikes, matches QuantLib
Basket Jensen identity	PASS	4 correlation levels
Put-call parity	PASS	all option types
KO + KI = European	PASS	barrier identity
Tests passing	129

The 1611x is measured against a single-threaded NumPy baseline at 10M paths on a plain European call with Philox MC on an A10G. Speedup grows with path count, that's where GPU parallelism pays off.