We built period, an end-to-end parking model with 11M parameters that runs at 120 Hz on a MacBook. It was trained from scratch on 7 hours of general driving data and 1 hour of task-specific trajectories, and it can park a car in an unseen parking lot!
The core inspiration behind our system is Rhoda's DVA, which uses a causal video model pretrained on web-scale video to predict what the robot should see next, plus a small inverse dynamics model (IDM) to translate that prediction into motor commands. We were also inspired by Standard Intelligence's FDM-1, which instead trains an IDM to label millions of hours of screen recordings with actions, then trains a forward model on next-action prediction from that data, requiring no world model or IDM at inference time.
For our final model architecture, we started from DIAMOND, which trains a diffusion UNet to causally predict future frames conditioned on an action input. We removed the action conditioning and added a small (17K-parameter) action head coming out of the bottleneck.
During training, the full model sees 8 context frames (every other frame over 0.8 seconds at 20 Hz) and a noisy next frame, and optimizes two objectives: denoising the next frame (diffusion loss) and predicting the driver's curvature and acceleration (action loss). Both losses flow through the same shared encoder module.
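As a rough sketch of the two-objective setup, assuming a simple MSE form for both terms and a hypothetical weighting `lambda_action` (the actual loss weighting isn't specified here):

```python
import numpy as np

def joint_loss(pred_noise, true_noise, pred_action, true_action, lambda_action=1.0):
    """Combine the diffusion (denoising) loss with the action loss.

    pred_noise/true_noise: predicted vs. actual noise on the next frame.
    pred_action/true_action: (curvature, acceleration) pairs.
    lambda_action is a hypothetical weighting; in the real model both
    terms backpropagate through the shared encoder.
    """
    diffusion_loss = np.mean((pred_noise - true_noise) ** 2)
    action_loss = np.mean((pred_action - true_action) ** 2)
    return diffusion_loss + lambda_action * action_loss

# toy example: perfect denoising, slightly-off action prediction
rng = np.random.default_rng(0)
noise = rng.normal(size=(8, 8))
loss = joint_loss(noise, noise, np.array([0.01, 0.5]), np.array([0.0, 0.4]))
```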
At inference time, the image decoder is the most expensive piece because of the diffusion sampling loop. Instead of running the decoder to produce the next frame, we feed noise where the next frame should be and run only the encoder and action head. This matched the accuracy of full 5-step EDM sampling while running almost an order of magnitude faster, since it skips the image decoder's sampling loop entirely.
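A minimal numpy sketch of the fast path, with hypothetical `encoder` and `action_head` stand-ins for the trained modules (shapes are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(context, next_frame):
    # Stand-in for the shared encoder over context + (noisy) next frame.
    return np.tanh(context.mean(axis=0) + next_frame)

def action_head(bottleneck):
    # Stand-in for the small head: bottleneck -> (curvature, acceleration).
    return np.array([bottleneck.mean(), bottleneck.std()])

def predict_action_fast(context):
    """Skip the diffusion decoder entirely: feed pure noise where the
    next frame would go and read the action straight off the bottleneck."""
    noise = rng.normal(size=context.shape[1:])
    return action_head(encoder(context, noise))

context = rng.normal(size=(8, 16, 16))   # 8 context frames
action = predict_action_fast(context)
```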
Curriculum Learning
We train the model on general data first, then on progressively more specific data. We start with a fraction of Comma's broad driving footage, filtered for speeds under 15 mph. This matters because at the low speeds seen in parking, the car is better approximated by the kinematic bicycle model (no slip) than by the dynamic bicycle model (tire-slip modeling). With more data and training time, the filtering becomes unnecessary.
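The kinematic bicycle model can be rolled out directly from the quantities the action head predicts, curvature and tangential acceleration; a minimal Euler-integration sketch:

```python
import numpy as np

def rollout_kinematic(x, y, theta, v, curvatures, accels, dt=0.05):
    """Euler-integrate the no-slip kinematic bicycle model.

    At low (parking) speeds the yaw rate is simply v * kappa, so a
    curvature + tangential-acceleration sequence fully determines motion.
    """
    xs, ys = [x], [y]
    for kappa, a in zip(curvatures, accels):
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        theta += v * kappa * dt   # no-slip: yaw rate = speed * curvature
        v += a * dt
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# drive straight at a constant 2 m/s for one second: y stays 0, x advances 2 m
xs, ys = rollout_kinematic(0.0, 0.0, 0.0, 2.0, np.zeros(20), np.zeros(20))
```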
After pre-training on this broad data, we train on segments of driving around San Diego in our car and segments of driving around random parking lots. Finally, we train on parking trajectories, segmented from turn-signal activation to the gear switch (reversing out). We also time-reverse our trajectories of exiting parking spots, doubling the dataset, since the maneuver is kinematically reversible.
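Time-reversing an exit trajectory is a simple transform, sketched below under assumed sign conventions (tangential acceleration negates under time reversal since the speed profile runs backwards; curvature only flips in order):

```python
import numpy as np

def time_reverse(frames, curvatures, accels):
    """Time-reverse a parking trajectory to turn an exit into an entry.

    Frames are played backwards. The speed profile reverses, so the
    acceleration sequence is both reversed and negated (a(t) -> -a(T - t)).
    Curvature is a property of the path at each point, so it is only
    reversed in order (exact signs depend on the curvature convention).
    """
    return frames[::-1], curvatures[::-1], -accels[::-1]

# toy trajectory: 5 frames, 3 action samples
frames = np.arange(5)
curv = np.array([0.1, 0.2, 0.3])
acc = np.array([1.0, 0.5, 0.0])
rev_frames, rev_curv, rev_acc = time_reverse(frames, curv, acc)
```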
Splatting & Synthetic Trajectories
The video model gets better with more data, so we explored scalable methods of generating realistic driving videos. In particular, we took a keen interest in NVIDIA's NuRec pipeline for realistic scene reconstruction, which seemed to yield pretty promising results.
The goal was to generate smooth random trajectories that could be used to pretrain the video model. This would also train the action head since it predicts geometric and kinematic features (e.g., curvature and tangential acceleration) rather than car-specific controls (e.g., steering and throttle).
We recorded several empty-ish parking lots with a DJI drone, which produced stable videos of each lot along with GPS metadata for each video frame. From each video, we extract every 3rd or 5th frame and use the frames for feature extraction, sequential feature matching, and sparse reconstruction, before creating a Gaussian splat of the recorded scene.
Explored optimizations
Vanilla COLMAP ships with strong but relatively slow algorithms for Structure-from-Motion (SfM). We wanted to iterate on our methods as quickly as possible, so we explored a variety of alternatives that yielded faster, and sometimes even better, results, whether through GPU utilization or faster algorithms in general.
We replaced SIFT with SuperPoint and LightGlue, both learned, GPU-native methods for feature extraction and matching. In low-texture environments like parking lots especially, SuperPoint does a much better job of finding reliable keypoints, and LightGlue's attention mechanism uses scene-wide geometric context to match points rather than just local appearance as SIFT does, noticeably improving the quality of scene reconstruction.
One of the longest stages in geometric reconstruction is the iterative bundle adjustment during mapping, which refines camera poses as new frames are registered. Periodically, a Ceres solver minimizes the reprojection error over every observed 3D point; the primary bottleneck is the Cholesky factorization of a camera-only system (the Schur-reduced normal equations) during pose refinement until convergence. Fortunately, NVIDIA has its own GPU-accelerated sparse solver, cuDSS, which we used to parallelize this factorization, speeding up the overall mapping phase. Unified memory on the GH200 superchip architecture accelerates this further.
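The camera-only system comes from eliminating the point parameters via a Schur complement of the normal equations; a dense numpy sketch of one Gauss-Newton step (real solvers exploit the block-sparse structure that cuDSS factorizes):

```python
import numpy as np

def solve_schur(Jc, Jp, r):
    """Solve the bundle-adjustment normal equations by eliminating points.

    Jc: residual Jacobian w.r.t. camera parameters (m x nc)
    Jp: residual Jacobian w.r.t. point parameters  (m x np_)
    r:  residual vector (m,)
    Returns the camera and point updates for one Gauss-Newton step.
    """
    A = Jc.T @ Jc                       # camera block
    B = Jc.T @ Jp                       # cross term
    D = Jp.T @ Jp                       # point block (block-diagonal in practice)
    bc, bp = -Jc.T @ r, -Jp.T @ r
    D_inv = np.linalg.inv(D)
    S = A - B @ D_inv @ B.T             # Schur complement: camera-only system
    L = np.linalg.cholesky(S)           # the factorization the GPU accelerates
    dc = np.linalg.solve(L.T, np.linalg.solve(L, bc - B @ D_inv @ bp))
    dp = D_inv @ (bp - B.T @ dc)        # back-substitute for the point updates
    return dc, dp

# toy problem: 50 residuals, 4 camera params, 6 point params
rng = np.random.default_rng(1)
Jc, Jp, r = rng.normal(size=(50, 4)), rng.normal(size=(50, 6)), rng.normal(size=50)
dc, dp = solve_schur(Jc, Jp, r)
```

The point block `D` is block-diagonal in real problems, so inverting it is cheap; the expensive part is exactly the Cholesky factorization of the reduced camera system `S`.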
Finally, we explored GLOMAP, which replaces COLMAP's incremental reconstruction with a global approach: it estimates all camera rotations via rotation averaging, then jointly solves for all camera positions and 3D structure at once. This was an order of magnitude faster than incremental bundle adjustment!
Once sparse reconstruction is done, its outputs are fed into 3DGRUT, which builds an explicit representation of the scanned environment by minimizing a weighted combination of per-pixel error and patch-level structural dissimilarity. This gives us a Gaussian splat in the form of a PLY file.
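A sketch of that objective in the standard 3DGS form, assuming a simplified single-window SSIM and a hypothetical weight `lam` (3DGRUT's exact weighting may differ):

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified single-window SSIM between two images in [0, 1]
    (real implementations use local sliding windows)."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2))

def splat_loss(render, target, lam=0.2):
    """(1 - lam) * L1 pixel error + lam * D-SSIM (structural dissimilarity)."""
    l1 = np.abs(render - target).mean()
    dssim = (1.0 - ssim_global(render, target)) / 2.0
    return (1.0 - lam) * l1 + lam * dssim

img = np.random.default_rng(0).random((32, 32))
```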
To generate trajectories, we first defined the valid configuration space in which they could exist (after all, we wouldn't want trajectories of ourselves running into obstacles). To do this, we created two PLY files: one untouched, and one with every Gaussian removed except those that define the drivable space. Extracting the planar coordinates of the stripped splat, we compute a 2D convex hull that defines the legal boundary trajectories must stay inside.
Once the hull is defined, we simply sample random points inside it, fit a spline through them, and use gsplat to render a video navigating the splat at the desired FOV.
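A numpy-only sketch of the sampling step, using a hand-rolled point-in-hull test and a Catmull-Rom spline as stand-ins for whatever geometry library the real pipeline uses (the result would then be rendered with gsplat):

```python
import numpy as np

def inside_hull(p, hull):
    """Point-in-convex-polygon test; hull vertices in counter-clockwise order."""
    n = len(hull)
    for i in range(n):
        e = hull[(i + 1) % n] - hull[i]
        d = p - hull[i]
        if e[0] * d[1] - e[1] * d[0] < 0:   # point lies right of an edge
            return False
    return True

def sample_waypoints(hull, k, rng):
    """Rejection-sample k waypoints from the hull's bounding box."""
    lo, hi = hull.min(axis=0), hull.max(axis=0)
    pts = []
    while len(pts) < k:
        p = rng.uniform(lo, hi)
        if inside_hull(p, hull):
            pts.append(p)
    return np.array(pts)

def catmull_rom(pts, samples_per_seg=20):
    """Smooth curve through the waypoints (Catmull-Rom, numpy-only stand-in)."""
    p = np.vstack([pts[0], pts, pts[-1]])   # pad endpoints
    out = []
    for i in range(len(pts) - 1):
        p0, p1, p2, p3 = p[i], p[i + 1], p[i + 2], p[i + 3]
        for t in np.linspace(0, 1, samples_per_seg, endpoint=False):
            out.append(0.5 * ((2 * p1) + (-p0 + p2) * t
                       + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t**2
                       + (-p0 + 3 * p1 - 3 * p2 + p3) * t**3))
    return np.array(out)

rng = np.random.default_rng(0)
square = np.array([[0., 0.], [10., 0.], [10., 10.], [0., 10.]])  # CCW hull
path = catmull_rom(sample_waypoints(square, 5, rng))
```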
There are a lot of immediate issues here, from the unrealistic dynamics to the artifacts cluttering most of the view. Some of this comes down to poor splat quality from scanning the parking lot at a higher altitude instead of at realistic vehicle height. Following the NuRec pipeline, we tried NVIDIA Difix3D+, which prunes these messy artifacts using a single-step diffusion model:
There are some clearly awesome frames, but overall it's not a perfect solution. We stopped pursuing synthetic trajectory generation in the interest of time, but smarter scene scanning (at vehicle height) should make this viable.
Results
Throughout training, we evaluated checkpoints on specific clips of interesting scenarios (not expecting much given the small dataset and short training time).
During these evals, we found some incredibly cool emergent behaviors in our model. For example, in this clip, our model accelerates before the ground truth, on the precise frame the light turns green:
The rollout diverges but clearly shows the ego curving to enter the center of the lane.
For real-world testing, we use a Comma 4 plugged into the vehicle's CAN bus. Live video streams from the front camera over ZMQ to the laptop running our model; we transmit curvature and longitudinal acceleration back as testJoystick messages over ZMQ to control the car. [code]
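A sketch of the send side's message packing, using stdlib JSON framing; the field names and wire format below are assumptions for illustration, not the actual testJoystick schema:

```python
import json
import time

def pack_control(curvature, accel):
    """Serialize one control command for the ZMQ link to the car.

    NOTE: this framing is hypothetical; the real system sends
    openpilot-style testJoystick messages with their own schema.
    """
    return json.dumps({
        "type": "testJoystick",
        "t": time.time(),
        "curvature": curvature,   # 1/m, from the action head
        "accel": accel,           # m/s^2, longitudinal
    }).encode()

msg = pack_control(0.02, -0.5)
# e.g. zmq_socket.send(msg) in the actual control loop
```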
Now finally, here is the model parking in an unseen lot: