Wave

Replacing the U-Net in DeepTFUS with a neural operator built for wave physics. 24 hour project.
May 2026

Transcranial focused ultrasound (TFUS) planning is a 3D wave simulation problem. Given a patient's head and a transducer placement, the goal is to estimate the 3D pressure field. The skull is spatially inhomogeneous, so it bends and attenuates the wavefront.

A full-wave solver such as k-Wave can simulate this field, but takes minutes to run.

DeepTFUS is a learned surrogate for this simulation. It takes a $256^3$ pseudo-CT volume (CT-like skull density derived from MRI) and the transducer geometry, then predicts the 3D pressure field in seconds. The TFUScapes dataset is created from k-Wave rollouts

Baseline architecture details

The transducer enters DeepTFUS as a point cloud, is Fourier-encoded, and is pooled to a conditioning vector. That vector conditions the U-Net through dynamic convolutions in the encoder, bi-directional cross-attention at encoder levels, and FiLM modulation in the decoder.

Starting from Mason Wang's public reproduction2 of the original paper and keeping the same dataset/loss, I replaced the U-Net with an NSNO4 operator chain. The new backbone improves field shape and focal-volume overlap, but the model still smooths over the finest wave-scale texture.

Three held-out examples:

NSNO prediction (right) vs ground truth (left). The number on each button is NSNO's focal_mm error on that sample, chosen to span easy / medium / hard cases.

Why try NSNO

Field update. DeepTFUS already conditions the U-Net on the transducer, but the pressure field itself is still produced by a convolutional encoder-decoder. I wanted to test a neural-operator5 update instead. NSNO implements that idea with spectral convolutions that apply Fourier-domain weights to modes across the whole volume.

Frequency content. Frequency space shows one failure more directly. TFUScapes is generated by k-Wave at 500 kHz, so the acoustic wavelength is 3 mm and the time-averaged amplitude has a $\lambda/2 = 1.5\text{ mm} = 3$ voxel scale on the 0.5 mm grid. That band is visible near the focus but small in total field energy, so the loss can improve broad shape while leaving it weak. Standard neural networks also tend to learn low frequencies first3.

Axial pressure-amplitude slice through the GT focal voxel: ground truth. Axial pressure-amplitude slice through the GT focal voxel: U-Net reproduction.
Radial 3D power spectra of pressure-amplitude on a held-out subject: GT vs reproduction.
Top: axial slices through the GT focal voxel for GT (left) and U-Net reproduction (right). Bottom: radial 3D power spectra of the same two fields.

This plot is a diagnostic for intuition, not the focus. In the slice, GT has sharper wave-scale texture and the U-Net output is wider and smoother. In the spectrum, the corresponding $\lambda/2$ band is weaker. Below $x = 0.5$ cycles/voxel, the radial average is meaningful but above it, the Fourier cube starts to clip directions, so treat that region cautiously.


The NSNO backbone

The replacement is a 3D adaptation of an NSNO4 operator chain.

Neumann-series intuition. In scattering theory, a wave in an inhomogeneous medium can be expanded as a background wave plus first-order, second-order, and higher-order scattering corrections. NSNO is biased towards this pattern with a short chain of learned correction operators whose outputs are summed, rather than physical scattering integrals.

Learned implementation. This implementation keeps three terms. The first operator, $G_0$, receives the source field $f$ (the transducer points splatted onto the volume). The later operators receive fields gated by $q$, a per-voxel medium-contrast coefficient derived from the pseudo-CT. The outputs $v_0$, $v_1$, and $v_2$ sum to the final pressure field.

Diagram below: the tiles are real activation slices from a held-out subject. All 2D activations/images in this post are slices through a 3D volume.

pseudo-CT (1, 256³) conv q transducer (512, 3) splat f G₀ UNO v₀ × −k²·q·v₀ G₁ UNO v₁ × −k²·q·v₁ G₂ UNO v₂ + + Σ vᵢ · output

Each $G_i$ is a U-shaped operator. Its main spatial-mixing blocks are spectral convolutions: FFT the activation, keep the low-frequency coefficients inside a $12\times12\times12$ Fourier box, multiply those modes by learned complex weights, then IFFT back. By the convolution theorem, that is a band-limited convolution with full-volume reach. The spectral path is low-pass because it only acts on retained Fourier modes. Each block also adds a parallel $1\times1\times1$ pointwise convolution, so local channel mixing does not have to pass through the truncated Fourier branch.

Inside one learned operator block

G₁ · one UNO block x · input Enc 1 256³ · 8 ch f₁ Dec 1 256³ · 8 ch y · vᵢ output Enc 2 128³ · 16 ch f₂ Dec 2 128³ · 16 ch Bottleneck 64³ · 32 ch f₃
Zoom into f₂ x 128³ · 16 ch FFT |x̂| log spectrum keep 12³ x̂_low · ŷ_low IFFT y_FFT + s₂ W · learned

Forward passes

Each row is a held-out subject. Stages read left to right, with ground truth at the far right.

+ + G₀ ·q G₁ ·q G₂
A000439980.0 mm
f v₀ −k²·q·v₀ v₁ −k²·q·v₁ v₂ Σ vᵢ
target
A000393910.7 mm
f v₀ −k²·q·v₀ v₁ −k²·q·v₁ v₂ Σ vᵢ
target
A000337475.4 mm
f v₀ −k²·q·v₀ v₁ −k²·q·v₁ v₂ Σ vᵢ
target
f v₀ −k²·q·v₀ v₁ −k²·q·v₁ v₂ Σ vᵢ target

In these examples, $v_0$ is broad and signed, $v_1$ adds the focal lobe, and $v_2$ is a small negative refinement.


Results

NSNO was trained from scratch on TFUScapes on a B200 via Modal, using only the paper's weighted MSE + gradient-consistency loss. Evaluation uses the 597-sample held-out test split. Latency was measured on a B200 via Modal and a 4090 via RunPod.

Metric definitions
  • rel_l2 · is the whole field shape right? (normalized L2 error)
  • focal_mm · where does the focus land? (distance in mm from predicted to GT peak voxel)
  • max_p_err · how strong is the peak? (relative peak-amplitude error)
  • focal_iou · is the focal volume the right shape? (3D overlap of half-max iso-volumes)

Reported values are mean ± std except focal_iou, which uses the median because the distribution is skewed.

model rel_l2 focal_mm max_p_err focal_iou B200 4090
DeepTFUS (paper)10.414 ± 0.0862.89 ± 2.140.199 ± 0.158n/an/a11.4 s
reproduction base20.384 ± 0.0796.45 ± 4.570.225 ± 0.1160.14496 ms211 ms
reproduction var. D0.422 ± 0.0824.19 ± 2.930.283 ± 0.1050.12196 ms211 ms
reproduction var. E0.401 ± 0.0815.27 ± 3.410.130 ± 0.0950.15296 ms211 ms
NSNO0.296 ± 0.0603.07 ± 2.760.079 ± 0.0670.50368 ms179 ms

The DeepTFUS row uses the original paper's reported numbers (~38M params). The reproduction rows are Mason's public 3.4M-param variants, which are smaller by design. Base uses the paper's loss; variants D and E use different loss tweaks. All three were re-evaluated on the same B200 with the same eval code as NSNO. Their published metrics2 match within rounding.

The qualitative slice view matches the table: the reproduction over-expands the focal region, while NSNO is closer to the GT lobe.

GT / reproduction / NSNO slice comparison on a held-out subject
Axial slices through one held-out subject: ground truth, reproduction U-Net, and NSNO. Visual reference for the table above.

Against Mason's reproduction variants, NSNO improves all four accuracy metrics. Against the original DeepTFUS paper, it improves rel_l2 and max_p_err, but not focal_mm. The clearest change is focal volume: focal_iou, the 3D overlap between predicted and GT half-max iso-volumes, rises from 0.15 to 0.50.

focal_iou visualization: GT half-max contour vs predicted half-max contour, reproduction vs NSNO on a held-out sample
Predicted and GT half-max iso-contours overlaid on the axial slice through the focal voxel. Left: reproduction U-Net (contour balloons outside GT). Right: NSNO (contour tracks GT closely).

Cost and speed. NSNO has 56M params: ~16× the reproduction U-Net (3.4M) and ~1.5× the paper architecture (~38M). It still runs ~30% faster than the reproduction rows on B200 and ~15% faster on 4090. In this implementation, the spectral-conv backbone is faster than the 3D U-Net stack at $256^3$, despite the larger parameter count.


What the backbone swap doesn't fix

Radial 3D power spectra on a held-out subject: GT vs reproduction vs NSNO.
Radial spectrum plot with NSNO added to the GT and reproduction curves.

Below $x \approx 0.5$ cycles/voxel, where the radial spectrum is cleanest, NSNO mostly tracks the reproduction and both miss the GT shoulder at $\lambda/2 = 3$ voxels. The apparent NSNO advantage at higher frequencies is harder to interpret because that region is where the radial average is clipped by the Fourier cube.

For this task, the architecture swap improves field and focal geometry; it does not solve the $\lambda/2$ texture failure. The next step is to target that failure directly, with operator design and losses that measure wave-scale structure rather than only aggregate field error.


Notes

Reading the radial spectrum

The plot radially averages $|\mathcal{F}(\text{amplitude})|^2$ by spatial-frequency magnitude, then normalizes the curve so DC sits at 1. The x-axis is cycles per voxel. Along each Fourier axis, Nyquist is 0.5 cycles/voxel. In 3D, the Fourier domain is a cube, so the corner Nyquist is $\sqrt{3}/2 \approx 0.866$.

The bin at magnitude $k$ is a thin spherical shell of radius $k$. Up to $k = 0.5$, the shell fits entirely inside the cube, so every direction in Fourier space is represented evenly. Past $k = 0.5$, the shell intersects the cube faces and fewer directions remain. By $k \to \sqrt{3}/2$, only the eight corners contribute. Beyond $x \approx 0.5$, the curve mixes true spectral content with this clipping geometry, so it stops being a clean spectrum estimate.

Spectral-conv implementation details

Each layer has $O(k^3 \cdot C_\text{in} \cdot C_\text{out})$ parameters, where $k = 12$ is the truncation. This depends on the retained Fourier box, not the spatial grid size. The parallel $1\times1\times1$ pointwise convolution is the direct path for high-frequency content across the block; the spectral path discards modes above 12. The two paths are summed before the GELU.

Why $v_0$ is not literally a Green's function

The Neumann-series analogy would make $v_0$ the source wave propagating through a homogeneous reference medium. The trained model does not enforce that. $G_0$ has no access to $q$, so it learns whatever q-independent field helps the final sum. It's not a literal decomposition.

How this relates to the NSNO paper

Chen et al.4 build and benchmark NSNO on 2D Helmholtz problems with a UNO backbone. This project borrows the three-term Neumann/Born-series chain and adapts it to 3D pressure-amplitude prediction on TFUScapes. So this is an adaptation of the idea but not line-by-line faithful.

References

  1. Srivastav et al. A Skull-Adaptive Framework for AI-Based 3D Transcranial Focused Ultrasound Simulation. arXiv:2505.12998, 2025.
  2. Mason Wang. Reproducing DeepTFUS. masonjwang.com, 2026.
  3. Rahaman et al. On the Spectral Bias of Neural Networks. arXiv:1806.08734, 2018.
  4. Chen et al. NSNO: Neumann Series Neural Operator for Solving Helmholtz Equations in Inhomogeneous Medium. arXiv:2401.13494, 2024.
  5. Kovachki et al. Neural Operator: Learning Maps Between Function Spaces. arXiv:2108.08481, 2021.