We built period, an end-to-end parking model with 11M parameters that runs at 120 Hz on a MacBook. It was trained from scratch on 7 hours of general driving data and 1 hour of task-specific trajectories, and it can park a car in an unseen parking lot!
The core inspiration behind our system is Rhoda's DVA, which uses a causal video model pretrained on web-scale video to predict what the robot should see next, plus a small inverse dynamics model (IDM) to translate that prediction into motor commands. We were also inspired by Standard Intelligence's FDM-1, which instead trains an IDM to label millions of hours of screen recordings with actions, then trains a forward model on next-action prediction from that data, requiring no world model or IDM at inference time.
For our final model architecture, we started from DIAMOND, which trains a diffusion UNet to causally predict future frames conditioned on an action input. We removed the action conditioning and added a small (17K-parameter) action head coming out of the bottleneck.
During training, the full model sees 8 context frames (every other frame over 0.8 seconds at 20 Hz) and a noisy next frame, and optimizes two objectives: denoising the next frame (diffusion loss) and predicting the driver's curvature and acceleration (action loss). Both losses flow through the same shared encoder module.
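As a rough sketch of the two-objective setup, assuming a simple MSE form for both terms and a hypothetical weighting `lambda_action` (the actual loss weighting isn't specified here):

```python
import numpy as np

def joint_loss(pred_noise, true_noise, pred_action, true_action, lambda_action=1.0):
    """Combine the diffusion (denoising) loss with the action loss.

    pred_noise/true_noise: predicted vs. actual noise on the next frame.
    pred_action/true_action: (curvature, acceleration) pairs.
    lambda_action is a hypothetical weighting; in the real model both
    terms backpropagate through the shared encoder.
    """
    diffusion_loss = np.mean((pred_noise - true_noise) ** 2)
    action_loss = np.mean((pred_action - true_action) ** 2)
    return diffusion_loss + lambda_action * action_loss

# toy example: perfect denoising, slightly-off action prediction
rng = np.random.default_rng(0)
noise = rng.normal(size=(8, 8))
loss = joint_loss(noise, noise, np.array([0.01, 0.5]), np.array([0.0, 0.4]))
```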
At inference time, the image decoder is the most expensive piece because of the diffusion sampling loop. Instead of running the decoder to produce the next frame, we feed noise where the next frame should be and run only the encoder and action head. This matched the accuracy of full 5-step EDM sampling while running almost an order of magnitude faster, since it skips the image decoder's sampling loop entirely.
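A minimal numpy sketch of the fast path, with hypothetical `encoder` and `action_head` stand-ins for the trained modules (shapes are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(context, next_frame):
    # Stand-in for the shared encoder over context + (noisy) next frame.
    return np.tanh(context.mean(axis=0) + next_frame)

def action_head(bottleneck):
    # Stand-in for the small head: bottleneck -> (curvature, acceleration).
    return np.array([bottleneck.mean(), bottleneck.std()])

def predict_action_fast(context):
    """Skip the diffusion decoder entirely: feed pure noise where the
    next frame would go and read the action straight off the bottleneck."""
    noise = rng.normal(size=context.shape[1:])
    return action_head(encoder(context, noise))

context = rng.normal(size=(8, 16, 16))   # 8 context frames
action = predict_action_fast(context)
```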
Curriculum Learning
We train the model on general data first, then on progressively more specific data. We start with a fraction of Comma's broad driving footage, filtered for speeds under 15 mph. This matters because at the low speeds seen in parking, the car is better approximated by the kinematic bicycle model (no slip) than by the dynamic bicycle model (tire-slip modeling). With more data and training time, the filtering becomes unnecessary.
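The kinematic bicycle model can be rolled out directly from the quantities the action head predicts, curvature and tangential acceleration; a minimal Euler-integration sketch:

```python
import numpy as np

def rollout_kinematic(x, y, theta, v, curvatures, accels, dt=0.05):
    """Euler-integrate the no-slip kinematic bicycle model.

    At low (parking) speeds the yaw rate is simply v * kappa, so a
    curvature + tangential-acceleration sequence fully determines motion.
    """
    xs, ys = [x], [y]
    for kappa, a in zip(curvatures, accels):
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        theta += v * kappa * dt   # no-slip: yaw rate = speed * curvature
        v += a * dt
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# drive straight at a constant 2 m/s for one second: y stays 0, x advances 2 m
xs, ys = rollout_kinematic(0.0, 0.0, 0.0, 2.0, np.zeros(20), np.zeros(20))
```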
After pre-training on this broad data, we train on segments of driving around San Diego in our car and segments of driving around random parking lots. Finally, we train on parking trajectories, segmented from turn-signal activation to the gear switch (reversing out). We also time-reverse our trajectories of exiting parking spots, doubling the dataset, since the maneuver is kinematically reversible.
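Time-reversing an exit trajectory is a simple transform, sketched below under assumed sign conventions (tangential acceleration negates under time reversal since the speed profile runs backwards; curvature only flips in order):

```python
import numpy as np

def time_reverse(frames, curvatures, accels):
    """Time-reverse a parking trajectory to turn an exit into an entry.

    Frames are played backwards. The speed profile reverses, so the
    acceleration sequence is both reversed and negated (a(t) -> -a(T - t)).
    Curvature is a property of the path at each point, so it is only
    reversed in order (exact signs depend on the curvature convention).
    """
    return frames[::-1], curvatures[::-1], -accels[::-1]

# toy trajectory: 5 frames, 3 action samples
frames = np.arange(5)
curv = np.array([0.1, 0.2, 0.3])
acc = np.array([1.0, 0.5, 0.0])
rev_frames, rev_curv, rev_acc = time_reverse(frames, curv, acc)
```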
Splatting & Synthetic Trajectories
The video model gets better with more data, so we explored scalable methods of generating realistic driving videos. In particular, we took a keen interest in NVIDIA's NuRec pipeline for realistic scene reconstruction, which seemed to yield pretty promising results.
The goal was to generate smooth random trajectories that could be used to pretrain the video model. This would also train the action head since it predicts geometric and kinematic features (e.g., curvature and tangential acceleration) rather than car-specific controls (e.g., steering and throttle).
We recorded several empty-ish parking lots with a DJI drone, which produced stable videos of each lot along with GPS metadata for each video frame. From each video, we extract every 3rd or 5th frame and use the frames for feature extraction, sequential feature matching, and sparse reconstruction, before creating a Gaussian splat of the recorded scene.
Explored optimizations
Vanilla COLMAP ships with strong but relatively slow algorithms for Structure-from-Motion (SfM). We wanted to iterate on our methods as quickly as possible, so we explored a variety of alternatives that yielded faster, and sometimes even better, results, whether through GPU utilization or faster algorithms in general.
We replaced SIFT with SuperPoint and LightGlue, both learned, GPU-native methods for feature extraction and matching. In low-texture environments like parking lots especially, SuperPoint does a much better job of finding reliable keypoints, and LightGlue's attention mechanism uses scene-wide geometric context to match points rather than just local appearance as SIFT does, noticeably improving the quality of scene reconstruction.
One of the longest stages in geometric reconstruction is the iterative bundle adjustment during mapping, which refines camera poses as new frames are registered. Periodically, a Ceres solver minimizes the reprojection error over every observed 3D point; the primary bottleneck is the Cholesky factorization of a camera-only system (the Schur-reduced normal equations) during pose refinement until convergence. Fortunately, NVIDIA has its own GPU-accelerated sparse solver, cuDSS, which we used to parallelize this factorization, speeding up the overall mapping phase. Unified memory on the GH200 superchip architecture accelerates this further.
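The camera-only system comes from eliminating the point parameters via a Schur complement of the normal equations; a dense numpy sketch of one Gauss-Newton step (real solvers exploit the block-sparse structure that cuDSS factorizes):

```python
import numpy as np

def solve_schur(Jc, Jp, r):
    """Solve the bundle-adjustment normal equations by eliminating points.

    Jc: residual Jacobian w.r.t. camera parameters (m x nc)
    Jp: residual Jacobian w.r.t. point parameters  (m x np_)
    r:  residual vector (m,)
    Returns the camera and point updates for one Gauss-Newton step.
    """
    A = Jc.T @ Jc                       # camera block
    B = Jc.T @ Jp                       # cross term
    D = Jp.T @ Jp                       # point block (block-diagonal in practice)
    bc, bp = -Jc.T @ r, -Jp.T @ r
    D_inv = np.linalg.inv(D)
    S = A - B @ D_inv @ B.T             # Schur complement: camera-only system
    L = np.linalg.cholesky(S)           # the factorization the GPU accelerates
    dc = np.linalg.solve(L.T, np.linalg.solve(L, bc - B @ D_inv @ bp))
    dp = D_inv @ (bp - B.T @ dc)        # back-substitute for the point updates
    return dc, dp

# toy problem: 50 residuals, 4 camera params, 6 point params
rng = np.random.default_rng(1)
Jc, Jp, r = rng.normal(size=(50, 4)), rng.normal(size=(50, 6)), rng.normal(size=50)
dc, dp = solve_schur(Jc, Jp, r)
```

The point block `D` is block-diagonal in real problems, so inverting it is cheap; the expensive part is exactly the Cholesky factorization of the reduced camera system `S`.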
Finally, we explored GLOMAP, which replaces COLMAP's incremental reconstruction with a global approach: it estimates all camera rotations via rotation averaging, then jointly solves for all camera positions and 3D structure at once. This was an order of magnitude faster than incremental bundle adjustment!
Once sparse reconstruction is done, its outputs are fed into 3DGRUT, which builds an explicit representation of the scanned environment by minimizing a weighted combination of per-pixel error and patch-level structural dissimilarity. This gives us a Gaussian splat in the form of a PLY file.
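A sketch of that objective in the standard 3DGS form, assuming a simplified single-window SSIM and a hypothetical weight `lam` (3DGRUT's exact weighting may differ):

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified single-window SSIM between two images in [0, 1]
    (real implementations use local sliding windows)."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2))

def splat_loss(render, target, lam=0.2):
    """(1 - lam) * L1 pixel error + lam * D-SSIM (structural dissimilarity)."""
    l1 = np.abs(render - target).mean()
    dssim = (1.0 - ssim_global(render, target)) / 2.0
    return (1.0 - lam) * l1 + lam * dssim

img = np.random.default_rng(0).random((32, 32))
```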
To generate trajectories, we first defined the valid configuration space in which they could exist (after all, we wouldn't want trajectories of ourselves running into obstacles). To do this, we created two PLY files: one untouched, and one with every Gaussian removed except those that define the drivable space. Extracting the planar coordinates of the stripped splat, we compute a 2D convex hull that defines the legal boundary trajectories must stay inside.
Once the hull is defined, we simply sample random points inside it, fit a spline through them, and use gsplat to render a video navigating the splat at the desired FOV.
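A numpy-only sketch of the sampling step, using a hand-rolled point-in-hull test and a Catmull-Rom spline as stand-ins for whatever geometry library the real pipeline uses (the result would then be rendered with gsplat):

```python
import numpy as np

def inside_hull(p, hull):
    """Point-in-convex-polygon test; hull vertices in counter-clockwise order."""
    n = len(hull)
    for i in range(n):
        e = hull[(i + 1) % n] - hull[i]
        d = p - hull[i]
        if e[0] * d[1] - e[1] * d[0] < 0:   # point lies right of an edge
            return False
    return True

def sample_waypoints(hull, k, rng):
    """Rejection-sample k waypoints from the hull's bounding box."""
    lo, hi = hull.min(axis=0), hull.max(axis=0)
    pts = []
    while len(pts) < k:
        p = rng.uniform(lo, hi)
        if inside_hull(p, hull):
            pts.append(p)
    return np.array(pts)

def catmull_rom(pts, samples_per_seg=20):
    """Smooth curve through the waypoints (Catmull-Rom, numpy-only stand-in)."""
    p = np.vstack([pts[0], pts, pts[-1]])   # pad endpoints
    out = []
    for i in range(len(pts) - 1):
        p0, p1, p2, p3 = p[i], p[i + 1], p[i + 2], p[i + 3]
        for t in np.linspace(0, 1, samples_per_seg, endpoint=False):
            out.append(0.5 * ((2 * p1) + (-p0 + p2) * t
                       + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t**2
                       + (-p0 + 3 * p1 - 3 * p2 + p3) * t**3))
    return np.array(out)

rng = np.random.default_rng(0)
square = np.array([[0., 0.], [10., 0.], [10., 10.], [0., 10.]])  # CCW hull
path = catmull_rom(sample_waypoints(square, 5, rng))
```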
There are a lot of immediate issues here, from the unrealistic dynamics to the artifacts cluttering most of the view. Some of this comes down to poor splat quality from scanning the parking lot at a higher altitude instead of at realistic vehicle height. Following the NuRec pipeline, we tried NVIDIA Difix3D+, which prunes these messy artifacts using a single-step diffusion model:
There are some clearly awesome frames, but overall it's not a perfect solution. We stopped pursuing synthetic trajectory generation in the interest of time, but smarter scene scanning (at vehicle height) should make this viable.
Results
Throughout training, we evaluated checkpoints on specific clips of interesting scenarios (not expecting much given the small dataset and short training time).
During these evals, we found some incredibly cool emergent behaviors in our model. For example, in this clip, our model accelerates before the ground truth, on the precise frame the light turns green:
The rollout diverges but clearly shows the ego curving to enter the center of the lane.
For real-world testing, we use a Comma 4 plugged into the vehicle's CAN bus. Live video streams from the front camera over ZMQ to the laptop running our model; we transmit curvature and longitudinal acceleration back as testJoystick messages over ZMQ to control the car. [code]
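A sketch of the send side's message packing, using stdlib JSON framing; the field names and wire format below are assumptions for illustration, not the actual testJoystick schema:

```python
import json
import time

def pack_control(curvature, accel):
    """Serialize one control command for the ZMQ link to the car.

    NOTE: this framing is hypothetical; the real system sends
    openpilot-style testJoystick messages with their own schema.
    """
    return json.dumps({
        "type": "testJoystick",
        "t": time.time(),
        "curvature": curvature,   # 1/m, from the action head
        "accel": accel,           # m/s^2, longitudinal
    }).encode()

msg = pack_control(0.02, -0.5)
# e.g. zmq_socket.send(msg) in the actual control loop
```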
Now finally, here is the model parking in an unseen lot: