Wave

Transcranial focused ultrasound (TFUS) planning is a 3D wave simulation problem. Given a patient's head and a transducer placement, the goal is to estimate the 3D pressure field. The skull is spatially inhomogeneous, so it bends and attenuates the wavefront.

A full-wave solver such as k-Wave can simulate this field, but takes minutes to run.

DeepTFUS is a learned surrogate for this simulation. It takes a $256^3$ pseudo-CT volume (CT-like skull density derived from MRI) and the transducer geometry, then predicts the 3D pressure field in seconds. The TFUScapes dataset is created from k-Wave rollouts

Baseline architecture details

The transducer enters DeepTFUS as a point cloud, is Fourier-encoded, and is pooled to a conditioning vector. That vector conditions the U-Net through dynamic convolutions in the encoder, bi-directional cross-attention at encoder levels, and FiLM modulation in the decoder.

Starting from Mason Wang's public reproduction² of the original paper and keeping the same dataset/loss, I replaced the U-Net with an NSNO⁴ operator chain. The new backbone improves field shape and focal-volume overlap, but the model still smooths over the finest wave-scale texture.

Three held-out examples:

NSNO prediction (right) vs ground truth (left). The number on each button is NSNO's focal_mm error on that sample, chosen to span easy / medium / hard cases.

Why try NSNO

Field update. DeepTFUS already conditions the U-Net on the transducer, but the pressure field itself is still produced by a convolutional encoder-decoder. I wanted to test a neural-operator⁵ update instead. NSNO implements that idea with spectral convolutions that apply Fourier-domain weights to modes across the whole volume.

Frequency content. Frequency space shows one failure more directly. TFUScapes is generated by k-Wave at 500 kHz, so the acoustic wavelength is 3 mm and the time-averaged amplitude has a $\lambda/2 = 1.5\text{ mm} = 3$ voxel scale on the 0.5 mm grid. That band is visible near the focus but small in total field energy, so the loss can improve broad shape while leaving it weak. Standard neural networks also tend to learn low frequencies first³.

Axial pressure-amplitude slice through the GT focal voxel: ground truth. — Top: axial slices through the GT focal voxel for GT (left) and U-Net reproduction (right). Bottom: radial 3D power spectra of the same two fields.

Axial pressure-amplitude slice through the GT focal voxel: U-Net reproduction. — Top: axial slices through the GT focal voxel for GT (left) and U-Net reproduction (right). Bottom: radial 3D power spectra of the same two fields.

This plot is a diagnostic for intuition, not the focus. In the slice, GT has sharper wave-scale texture and the U-Net output is wider and smoother. In the spectrum, the corresponding $\lambda/2$ band is weaker. Below $x = 0.5$ cycles/voxel, the radial average is meaningful but above it, the Fourier cube starts to clip directions, so treat that region cautiously.

The NSNO backbone

The replacement is a 3D adaptation of an NSNO⁴ operator chain.

Neumann-series intuition. In scattering theory, a wave in an inhomogeneous medium can be expanded as a background wave plus first-order, second-order, and higher-order scattering corrections. NSNO is biased towards this pattern with a short chain of learned correction operators whose outputs are summed, rather than physical scattering integrals.

Learned implementation. This implementation keeps three terms. The first operator, $G_0$, receives the source field $f$ (the transducer points splatted onto the volume). The later operators receive fields gated by $q$, a per-voxel medium-contrast coefficient derived from the pseudo-CT. The outputs $v_0$, $v_1$, and $v_2$ sum to the final pressure field.

Diagram below: the tiles are real activation slices from a held-out subject. All 2D activations/images in this post are slices through a 3D volume.

Each $G_i$ is a U-shaped operator. Its main spatial-mixing blocks are spectral convolutions: FFT the activation, keep the low-frequency coefficients inside a $12\times12\times12$ Fourier box, multiply those modes by learned complex weights, then IFFT back. By the convolution theorem, that is a band-limited convolution with full-volume reach. The spectral path is low-pass because it only acts on retained Fourier modes. Each block also adds a parallel $1\times1\times1$ pointwise convolution, so local channel mixing does not have to pass through the truncated Fourier branch.

Inside one learned operator block

Forward passes

Each row is a held-out subject. Stages read left to right, with ground truth at the far right.

In these examples, $v_0$ is broad and signed, $v_1$ adds the focal lobe, and $v_2$ is a small negative refinement.

Results

NSNO was trained from scratch on TFUScapes on a B200 via Modal, using only the paper's weighted MSE + gradient-consistency loss. Evaluation uses the 597-sample held-out test split. Latency was measured on a B200 via Modal and a 4090 via RunPod.

Metric definitions

rel_l2 · is the whole field shape right? (normalized L2 error)
focal_mm · where does the focus land? (distance in mm from predicted to GT peak voxel)
max_p_err · how strong is the peak? (relative peak-amplitude error)
focal_iou · is the focal volume the right shape? (3D overlap of half-max iso-volumes)

Reported values are mean ± std except focal_iou, which uses the median because the distribution is skewed.

model	rel_l2	focal_mm	max_p_err	focal_iou	B200	4090
DeepTFUS (paper)¹	0.414 ± 0.086	2.89 ± 2.14	0.199 ± 0.158	n/a	n/a	11.4 s
reproduction base²	0.384 ± 0.079	6.45 ± 4.57	0.225 ± 0.116	0.144	96 ms	211 ms
reproduction var. D	0.422 ± 0.082	4.19 ± 2.93	0.283 ± 0.105	0.121	96 ms	211 ms
reproduction var. E	0.401 ± 0.081	5.27 ± 3.41	0.130 ± 0.095	0.152	96 ms	211 ms
NSNO	0.296 ± 0.060	3.07 ± 2.76	0.079 ± 0.067	0.503	68 ms	179 ms

The DeepTFUS row uses the original paper's reported numbers (~38M params). The reproduction rows are Mason's public 3.4M-param variants, which are smaller by design. Base uses the paper's loss; variants D and E use different loss tweaks. All three were re-evaluated on the same B200 with the same eval code as NSNO. Their published metrics² match within rounding.

The qualitative slice view matches the table: the reproduction over-expands the focal region, while NSNO is closer to the GT lobe.

GT / reproduction / NSNO slice comparison on a held-out subject — Axial slices through one held-out subject: ground truth, reproduction U-Net, and NSNO. Visual reference for the table above.

Against Mason's reproduction variants, NSNO improves all four accuracy metrics. Against the original DeepTFUS paper, it improves rel_l2 and max_p_err, but not focal_mm. The clearest change is focal volume: focal_iou, the 3D overlap between predicted and GT half-max iso-volumes, rises from 0.15 to 0.50.

focal_iou visualization: GT half-max contour vs predicted half-max contour, reproduction vs NSNO on a held-out sample — Predicted and GT half-max iso-contours overlaid on the axial slice through the focal voxel. Left: reproduction U-Net (contour balloons outside GT). Right: NSNO (contour tracks GT closely).

Cost and speed. NSNO has 56M params: ~16× the reproduction U-Net (3.4M) and ~1.5× the paper architecture (~38M). It still runs ~30% faster than the reproduction rows on B200 and ~15% faster on 4090. In this implementation, the spectral-conv backbone is faster than the 3D U-Net stack at $256^3$, despite the larger parameter count.

What the backbone swap doesn't fix

Radial 3D power spectra on a held-out subject: GT vs reproduction vs NSNO. — Radial spectrum plot with NSNO added to the GT and reproduction curves.

Below $x \approx 0.5$ cycles/voxel, where the radial spectrum is cleanest, NSNO mostly tracks the reproduction and both miss the GT shoulder at $\lambda/2 = 3$ voxels. The apparent NSNO advantage at higher frequencies is harder to interpret because that region is where the radial average is clipped by the Fourier cube.

For this task, the architecture swap improves field and focal geometry; it does not solve the $\lambda/2$ texture failure. The next step is to target that failure directly, with operator design and losses that measure wave-scale structure rather than only aggregate field error.

Notes

Reading the radial spectrum

The plot radially averages $|\mathcal{F}(\text{amplitude})|^2$ by spatial-frequency magnitude, then normalizes the curve so DC sits at 1. The x-axis is cycles per voxel. Along each Fourier axis, Nyquist is 0.5 cycles/voxel. In 3D, the Fourier domain is a cube, so the corner Nyquist is $\sqrt{3}/2 \approx 0.866$.

The bin at magnitude $k$ is a thin spherical shell of radius $k$. Up to $k = 0.5$, the shell fits entirely inside the cube, so every direction in Fourier space is represented evenly. Past $k = 0.5$, the shell intersects the cube faces and fewer directions remain. By $k \to \sqrt{3}/2$, only the eight corners contribute. Beyond $x \approx 0.5$, the curve mixes true spectral content with this clipping geometry, so it stops being a clean spectrum estimate.

Spectral-conv implementation details

Each layer has $O(k^3 \cdot C_\text{in} \cdot C_\text{out})$ parameters, where $k = 12$ is the truncation. This depends on the retained Fourier box, not the spatial grid size. The parallel $1\times1\times1$ pointwise convolution is the direct path for high-frequency content across the block; the spectral path discards modes above 12. The two paths are summed before the GELU.

Why $v_0$ is not literally a Green's function

The Neumann-series analogy would make $v_0$ the source wave propagating through a homogeneous reference medium. The trained model does not enforce that. $G_0$ has no access to $q$, so it learns whatever q-independent field helps the final sum. It's not a literal decomposition.

How this relates to the NSNO paper

Chen et al.⁴ build and benchmark NSNO on 2D Helmholtz problems with a UNO backbone. This project borrows the three-term Neumann/Born-series chain and adapts it to 3D pressure-amplitude prediction on TFUScapes. So this is an adaptation of the idea but not line-by-line faithful.

References

Srivastav et al. A Skull-Adaptive Framework for AI-Based 3D Transcranial Focused Ultrasound Simulation. arXiv:2505.12998, 2025.
Mason Wang. Reproducing DeepTFUS. masonjwang.com, 2026.
Rahaman et al. On the Spectral Bias of Neural Networks. arXiv:1806.08734, 2018.
Chen et al. NSNO: Neumann Series Neural Operator for Solving Helmholtz Equations in Inhomogeneous Medium. arXiv:2401.13494, 2024.
Kovachki et al. Neural Operator: Learning Maps Between Function Spaces. arXiv:2108.08481, 2021.