Building a GPU Monte Carlo Options Pricer from Scratch
NVIDIA A10G · custom CUDA kernels · variance reduction · Greeks · Heston · Sobol QMC
I built this because GPU Monte Carlo is one of the few places where "embarrassingly parallel" is not just something you say to sound smart. Every simulated price path is completely independent. There is no communication between paths during simulation. That is the textbook GPU workload, and it is genuinely how production pricing engines at hedge funds and banks work.
I wanted to build the whole thing myself: the CUDA kernels, the RNG, the reductions, the variance reduction, the Greeks, the exotic option types, stochastic volatility. No high-level libraries doing the hard parts. The result is 1611x over CPU at 10M paths, 129 passing tests, and a validation chain that proves correctness before any speed number gets claimed.
Why Monte Carlo, and why it is hard to do right
The case for simulation over formulas, and where naive implementations silently go wrong.
For a European call option there is a closed-form answer: Black-Scholes. You plug in five numbers and you get a price. Monte Carlo on a European option proves nothing interesting because you already have the exact answer for free.
The reason to build an MC engine is everything else. Asian options pay based on the average price of the underlying over the life of the contract. Barrier options knock in or out if the price crosses a level at any point along the path. Basket options reference a weighted portfolio of correlated assets. None of these have a clean general closed form. Monte Carlo handles all of them with the same loop, which is why it is the standard tool in any real exotic pricing desk.
The estimator is straightforward:
price = exp(-r * T) * (1/N) * sum_i payoff(path_i)
Statistical error shrinks as 1/sqrt(N). To halve the error you need four times the paths. That slow convergence is exactly why variance reduction is the interesting engineering challenge here, and it is what I spent the most time on.
The asset model
Geometric Brownian Motion with an exact discrete step and zero discretization error in the baseline.
The baseline model assumes the asset follows Geometric Brownian Motion. The exact discrete step from time t to t + dt is:
S_{t+dt} = S_t * exp( (r - 0.5*sigma^2)*dt + sigma*sqrt(dt)*Z )
where r is the risk-free rate, sigma is volatility, and Z ~ N(0,1). This is an exact solution to the GBM stochastic differential equation, not a numerical approximation. There is no discretization error in the baseline. For a European option, a single draw of Z takes you straight to maturity. For path-dependent options, you walk step by step and carry only the running aggregate the payoff requires.
CUDA kernel design
Grid-stride loop, a never-store-paths memory strategy, and a two-level warp-shuffle reduction.
Grid-stride loop
Each thread owns a batch of paths via a grid-stride loop rather than exactly one path. This keeps occupancy high regardless of N and amortizes launch overhead across the batch:
for (int path = blockIdx.x * blockDim.x + threadIdx.x;
path < N;
path += gridDim.x * blockDim.x) {
simulate_and_accumulate(path);
}
Never store full paths
This is the most consequential design decision in the whole engine. Storing N x steps floats would be enormous and the simulation would become memory-bound. Instead, each thread keeps only the running aggregate it needs and discards each time step the moment it is done with it:
- European: only the final price S_T, one draw gets you there
- Asian: a running sum of S_t to form the path average at the end
- Barrier: a single boolean flag updated at each step
- Basket: running sums for both correlated assets carried in parallel
This makes the simulation compute-bound instead of memory-bound. On a GPU, compute-bound is where you actually win. Device memory bandwidth never becomes the bottleneck.
Warp-shuffle reduction
After simulation, each thread holds a local payoff accumulator. Collapsing across threads happens in two levels. First, a warp-shuffle folds 32 threads into one entirely in registers with no shared memory involved at all:
for (int offset = 16; offset > 0; offset >>= 1)
val += __shfl_down_sync(0xffffffff, val, offset);
Each warp's lane 0 then writes to shared memory. A second pass reduces across warps within the block. Each block writes one partial sum to global memory and a final pass collapses the partials. The kernel accumulates both sum(payoff) and sum(payoff^2) in double precision in the same pass, so price and standard error both come out without a second kernel launch.
Random number generation
The part most people get wrong. Correlated streams across threads silently bias prices with no crash and no error message.
Monte Carlo correctness depends entirely on the statistical independence of the random streams across threads. The failure mode is completely silent: correlated streams produce biased prices that look totally plausible. There is no crash, no assertion, no warning. The bias is often small enough that you would not notice by eye but large enough to matter in a real trade.
Naive seeding like seed = thread_id with a sequential PRNG produces correlated streams. Prices look reasonable and are wrong.
I used Philox 4x32-10 via cuRAND. It is a counter-based generator. Each thread gets its stream addressed by (seed, thread_id) and the generator maps that pair to a uniform draw via a keyed bijection. Independence is guaranteed structurally, not by hoping that different initial states diverge far enough.
The reproducibility is a side effect I use directly throughout the project. A fixed seed gives identical results bit-for-bit across runs. That is not just useful for debugging. It is the exact property that makes Common Random Numbers work for the Greeks, where two separate pricing runs need to share path noise so it cancels in the finite difference.
Variance reduction
Four techniques that each shrink estimator noise. This is what separates a pricing engine from a tutorial.
Antithetic variates
For every normal draw Z, I also simulate the path using -Z and average the two payoffs. Because the paths are negatively correlated, the average has lower variance than either one alone. The cost is nearly zero: one extra simulation per thread, no new RNG calls. Works best for monotone payoffs like European and Asian calls.
Control variate: 1380x variance reduction
The idea is to correct the MC estimate using a correlated quantity whose true expectation is known analytically. The showcase here is the arithmetic Asian option. The arithmetic average of a GBM path has no closed form. The geometric average does. The two are highly correlated because they are computed on the same simulated path.
The corrected estimator:
price_cv = price_mc + beta * (geometric_analytic - geometric_mc)
where beta is estimated from the sample covariance. When the correlation is high, the correction nearly eliminates the Monte Carlo noise. At 1M paths the variance reduction is 1380x, meaning you reach the same standard error with roughly 1/1380th the paths.
Sobol quasi-Monte Carlo: 150x lower RMSE
Plain Monte Carlo uses pseudo-random draws that cluster and leave gaps by chance. Sobol sequences fill the sample space more evenly by construction. The convergence rate improves from O(1/sqrt(N)) toward O(1/N). On a log-log error-vs-N plot the slope gets noticeably steeper.
| method | measured slope | theory |
|---|---|---|
| plain MC | -0.472 | -0.500 |
| Sobol QMC | -0.900 | -1.000 |
At 1M paths, Sobol RMSE is 150x lower than plain MC. cuRAND provides scrambled Sobol32 generators that slot into the same kernel interface as Philox with minimal changes.
Common Random Numbers
CRN is not a standalone pricing method. It is used inside the Greek computation. When you need the difference of two MC prices (bumped minus base), running both with the same Philox seed makes the path noise cancel in the difference rather than add. The delta standard error with CRN is 771x lower than naive bump-and-revalue where the two runs have independent noise and the difference is dominated by it.
Greeks
Five risk sensitivities, three computation methods, variance reductions up to 10,918x.
Greeks are the partial derivatives of the option price with respect to its inputs. They are what a trader actually hedges with day to day, and getting them right via simulation is harder than getting the price right. A noisy delta estimate is essentially useless for hedging.
| greek | derivative | meaning |
|---|---|---|
| Delta | dP/dS0 | how many shares to hold as a hedge |
| Gamma | d2P/dS0^2 | rate of change of delta |
| Vega | dP/dsigma | sensitivity to volatility |
| Theta | dP/dT | time decay per day |
| Rho | dP/dr | sensitivity to interest rates |
Bump-and-revalue with CRN
Price at S0 and at S0 + h using the same Philox seed. The path noise cancels in the finite difference. Without CRN, both runs have independent noise and the difference is swamped by it. With CRN the delta standard error is 771x lower. All five Greeks computed this way match the Black-Scholes analytic Greeks for European options at the tested tolerances.
Pathwise delta: 10,918x variance reduction
Instead of pricing twice and differencing, I differentiate through the GBM step directly. For a European call the pathwise delta is:
delta = exp(-r*T) * indicator(S_T > K) * (S_T / S0)
Single GPU pass, no second simulation. Variance reduction over naive finite difference is 10,918x. The tradeoff is that the method requires the payoff to be differentiable almost everywhere. It fails for discontinuous payoffs like digitals, where bump-and-revalue is the fallback.
Path-dependent options
Asian, barrier, and basket options: the cases that actually justify building an MC engine in the first place.
Arithmetic Asian
The payoff is max(avg(S_t) - K, 0) where the average runs over the full simulated path. No closed form for arithmetic averaging under GBM. Each thread accumulates the running sum of S_t at each time step and computes the average at path end. The geometric Asian, which does have a closed form, is computed in parallel on the same path and used directly as the control variate.
Barrier knock-in and knock-out
A barrier option pays a European-style payoff only if the underlying does or does not cross a barrier level during the contract. Each thread carries a boolean and updates it at each step:
touched = touched || (S_t >= barrier);
Knock-out zeroes the payoff when touched == true. Knock-in zeroes it when touched == false. There is a useful identity: knock-in price plus knock-out price equals the vanilla European price for the same parameters. The engine checks this identity as a correctness test, and it passes.
Basket (two-asset correlated)
A basket option pays on a weighted sum of two correlated assets: max(w1*S1 + w2*S2 - K, 0). Correlated normals come from Cholesky decomposition:
Z1 = N(0,1) Z2 = rho*Z1 + sqrt(1 - rho^2)*N(0,1)
Each thread carries running state for both assets simultaneously. The engine sweeps correlation from -0.9 to 0.9 in one shot. At every level, the arithmetic basket price must sit above the geometric basket price by Jensen's inequality. The engine validates this at four correlation levels and it holds every time.
Heston stochastic volatility
Constant vol is wrong in every real market. Heston makes vol itself a random process and produces a realistic implied vol skew.
The GBM assumption of constant volatility is violated in practice. Real implied vol surfaces are not flat: out-of-the-money puts trade richer than at-the-money options in equity markets. The Heston model fixes this by making variance itself a stochastic mean-reverting process:
dS = r*S*dt + sqrt(V)*S*dW_S dV = kappa*(theta - V)*dt + xi*sqrt(V)*dW_V corr(dW_S, dW_V) = rho
V is the instantaneous variance, mean-reverting toward long-run level theta at speed kappa. xi is vol-of-vol, controlling how much variance itself moves. rho is the spot-vol correlation and is the parameter that generates the skew. Negative rho means when the stock falls, volatility spikes, which is the observed equity behavior.
The GPU kernel simulates both processes simultaneously using the full-truncation Euler scheme to prevent variance going negative:
V_next = max(V,0) + kappa*(theta - max(V,0))*dt + xi*sqrt(max(V,0)*dt)*Z_V
Validation against the characteristic function
The Heston model has a semi-analytic solution via the characteristic function: a Fourier integral that gives the exact price under Heston dynamics. I validate the GPU MC prices against this formula at five strikes. The CF formula itself is cross-checked against QuantLib's Heston engine first. The full chain passes at all five strikes. The implied vol smile from negative rho shows up clearly in the output: OTM puts are richer than OTM calls, exactly as observed in real equity markets.
Validation
Correctness is proved, not assumed. No speed number gets reported until the full chain is green.
I made a rule at the start: no benchmark number gets written down until the correctness suite passes. A fast engine that prices wrong is worse than a slow one that prices right.
QuantLib validates the semi-analytic formulas. Those formulas validate every GPU MC price. Mathematical identities provide independent cross-checks at each layer.
- Black-Scholes analytic price validates all European MC prices within a few standard errors
- Put-call parity (
call - put = S0 - K*exp(-r*T)) checked independently of any model - KO + KI = European validates the barrier kernel
- Geometric Asian closed form validates the Asian kernel and anchors the control variate
- Heston CF formula, itself cross-checked against QuantLib, validates Heston simulation at five strikes
- Jensen's inequality (arithmetic basket above geometric basket) validates the basket kernel at four correlation levels
- Convergence slope test: MC error vs N on log-log axes must have slope close to -0.500. Measured: -0.503
- Fixed-seed reproducibility: identical results bit-for-bit across runs with the same seed
Results
Everything measured on NVIDIA A10G via Modal. All figures validated before being reported.
| metric | result | note |
|---|---|---|
| MC convergence slope | -0.503 | theory: -0.500 |
| Sobol convergence slope | -0.900 | vs MC -0.472 |
| Heston CF validation | PASS | 5 strikes, matches QuantLib |
| Basket Jensen identity | PASS | 4 correlation levels |
| Put-call parity | PASS | all option types |
| KO + KI = European | PASS | barrier identity |
| Tests passing | 129 |
The 1611x is measured against a single-threaded NumPy baseline at 10M paths on a plain European call with Philox MC on an A10G. Speedup grows with path count, that's where GPU parallelism pays off.