Abstract
While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce **VideoGPA** (**Video** **G**eometric **P**reference **A**lignment), a data-efficient, self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals, which then guide VDMs via Direct Preference Optimization (DPO). This approach steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using only a small number of preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
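To make the preference-alignment idea concrete, the following is a minimal sketch of the generic DPO objective applied to one preference pair. It is not the authors' implementation. The helper `make_pair`, the scalar "geometry error" scores, and the value of `beta` are illustrative assumptions. For diffusion models the log-likelihood terms are usually approximated by denoising errors, and that detail is omitted here.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def make_pair(sample_a, err_a: float, sample_b, err_b: float):
    """Hypothetical helper: order two samples as (preferred, rejected).

    Assumes each sample comes with a scalar geometry error where lower
    means more 3D-consistent (an assumption for this sketch).
    """
    return (sample_a, sample_b) if err_a <= err_b else (sample_b, sample_a)


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (preferred w, rejected l) pair.

    logp_*     : log-likelihood under the model being trained
    ref_logp_* : log-likelihood under the frozen reference model
    beta       : strength of the implicit KL regularization (assumed value)
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    winner, loser = make_pair("video_a", 0.64, "video_b", 0.69)
    print(winner, loser)
    print(dpo_loss(logp_w=1.0, logp_l=0.0, ref_logp_w=0.0, ref_logp_l=0.0))
```

The loss falls below log 2 as soon as the trained model assigns relatively more probability to the preferred sample than the reference model does, which is the mechanism that pushes the generator toward the geometrically consistent sample.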
Quantitative Benchmarks
Comprehensive results on 3D Reconstruction Error, 3D Geometric Consistency Metrics, and VideoReward Benchmark.
Column groups: 3D Reconstruction Error (PSNR, SSIM, LPIPS); 3D Consistency (MVCS, 3DCS, Epipolar); VideoReward win rate in % (VQ, MQ, TA, OVL). ↑ means higher is better and ↓ means lower is better.

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | MVCS ↑ | 3DCS ↓ | Epipolar ↓ | VQ | MQ | TA | OVL |
|---|---|---|---|---|---|---|---|---|---|---|
| **Image-to-Video (I2V) Tasks** | | | | | | | | | | |
| Baseline-I2V | 14.57 | 0.455 | 0.653 | 0.976 | 0.687 | 0.706 | - | - | - | - |
| SFT | 15.23 | 0.509 | 0.639 | 0.982 | 0.665 | 0.628 | 44.67 | 33.00 | 52.67 | 35.00 |
| Epipolar-DPO | 15.07 | 0.479 | 0.615 | 0.984 | 0.646 | 0.571 | 67.33 | 51.33 | 56.67 | 66.00 |
| VideoGPA (Ours) | 15.19 | 0.510 | 0.608 | 0.986 | 0.638 | 0.564 | 74.00 | 56.00 | 57.67 | 76.00 |
| **Text-to-Video (T2V) Tasks** | | | | | | | | | | |
| Baseline-T2V | 17.53 | 0.614 | 0.508 | 0.967 | 0.533 | 0.584 | - | - | - | - |
| SFT | 17.07 | 0.573 | 0.563 | 0.968 | 0.586 | 0.719 | 14.67 | 23.67 | 39.33 | 15.33 |
| Epipolar-DPO | 17.69 | 0.618 | 0.507 | 0.971 | 0.528 | 0.579 | 45.00 | 53.67 | 49.00 | 48.67 |
| VideoGPA (Ours) | 17.31 | 0.621 | 0.495 | 0.974 | 0.519 | 0.548 | 62.67 | 67.00 | 42.67 | 60.33 |
| **Comparison with State-of-the-Art** | | | | | | | | | | |
| Baseline-T2V15 | 15.57 | 0.480 | 0.555 | 0.976 | 0.588 | 0.685 | - | - | - | - |
| GeoVideo | 15.90 | 0.510 | 0.643 | 0.852 | 0.673 | 0.840 | 17.36 | 44.44 | 30.56 | 18.06 |
| VideoGPA (Ours) | 14.88 | 0.503 | 0.520 | 0.982 | 0.556 | 0.567 | 60.42 | 54.17 | 52.08 | 57.64 |
Note: 3DCS denotes our proposed 3D Consistency Score. VQ, MQ, TA, and OVL denote Visual Quality, Motion Quality, Text Alignment, and Overall, respectively. Static videos are filtered out, as in the training stage.
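As a reading aid for two of the column types, here is a small sketch of how a PSNR value and a win-rate percentage are typically computed. PSNR follows its standard definition. The win-rate helper assumes the percentage is the share of pairwise comparisons that a method wins, which this page does not spell out, and the function names are illustrative only.

```python
import math


def psnr(reference, rendered, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Pixel values are assumed to lie in [0, max_value].
    """
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def win_rate(outcomes) -> float:
    """Percentage of comparisons won, given one boolean per comparison.

    Assumption for this sketch: ties are not modelled.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one comparison")
    return 100.0 * sum(1 for won in outcomes if won) / len(outcomes)


if __name__ == "__main__":
    print(psnr([0.0, 0.0], [0.1, 0.1]))
    print(win_rate([True, True, False, True]))
```

Because PSNR is logarithmic in the mean squared error, small differences in dB between rows can still correspond to noticeable differences in reconstruction error.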
Citation
Incoming