Researchers from Tencent have developed DepthCrafter, a novel method for generating temporally consistent long depth sequences for open-world videos using video diffusion models.
It leverages a pre-trained image-to-video diffusion model (SVD) as the foundation and uses a three-stage training strategy on paired video-depth datasets (a sketch of the stage-wise layer selection follows the list):
1. Train on a large realistic dataset (1-25 frames)
2. Fine-tune temporal layers on realistic data (1-110 frames)
3. Fine-tune spatial layers on synthetic data (45 frames)
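For intuition, here's a minimal PyTorch sketch of that stage-wise layer selection. It is not the authors' training code; it assumes temporal blocks can be picked out by "temporal" in their parameter names, which roughly holds for the diffusers SVD UNet but is only an approximation.

```python
# Minimal sketch of the stage-wise layer selection, NOT the official DepthCrafter training code.
# Assumption: temporal blocks are identifiable by "temporal" in their parameter names.
import torch
from diffusers import UNetSpatioTemporalConditionModel

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="unet"
)

def set_trainable(model: torch.nn.Module, stage: int) -> None:
    """Stage 1: train all layers; stage 2: temporal layers only; stage 3: spatial layers only."""
    for name, param in model.named_parameters():
        is_temporal = "temporal" in name
        if stage == 1:
            param.requires_grad = True
        elif stage == 2:
            param.requires_grad = is_temporal
        else:
            param.requires_grad = not is_temporal

set_trainable(unet, stage=2)  # e.g. stage 2: fine-tune temporal layers on long realistic clips
```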
It adapts SVD's conditioning mechanism so the full input video is fed frame by frame, and it runs latent diffusion in the VAE's latent space for efficiency.
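A hedged sketch of what that latent-space setup can look like: both the conditioning video and the predicted depth live in SVD's VAE latent space, and every frame (not just the first, as in vanilla SVD) conditions the denoiser. Names like `encode_frames` and the dummy tensors are illustrative, not DepthCrafter's actual API.

```python
# Sketch of frame-by-frame conditioning in VAE latent space (illustrative, not the official code).
import torch
from diffusers import AutoencoderKLTemporalDecoder

vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="vae"
)

@torch.no_grad()
def encode_frames(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) in [-1, 1] -> latents: (T, 4, H/8, W/8)."""
    return vae.encode(frames).latent_dist.mode() * vae.config.scaling_factor

video_frames  = torch.rand(14, 3, 256, 256) * 2 - 1   # dummy clip of 14 RGB frames
video_latents = encode_frames(video_frames)            # per-frame conditioning latents
noisy_depth   = torch.randn_like(video_latents)        # depth latents start from pure noise
# The denoiser input concatenates the noisy depth latents with the video latents channel-wise,
# mirroring how SVD concatenates its (single) conditioning image latent:
unet_input = torch.cat([noisy_depth, video_latents], dim=1)   # (T, 8, H/8, W/8)
```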
Sprinkle in an intelligent inference strategy for extremely long videos (a sketch follows the list):
- Segment-wise processing (up to 110 frames)
- Noise initialization to anchor depth distributions
- Latent interpolation for seamless stitching
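Here's a rough sketch of how that segment-wise inference could be wired up: the previous segment's tail anchors the next segment's noise initialization, and the overlapping latents are linearly blended for a seamless stitch. The `denoise` callable, the overlap length, and the re-noising factor are assumptions for illustration, not the paper's exact settings.

```python
# Illustrative segment-wise inference with anchored noise init and latent-interpolation stitching.
import torch

def infer_long_video(video_latents, denoise, seg_len=110, overlap=25):
    """video_latents: (T, C, H, W); denoise(init, cond) -> denoised depth latents."""
    T = video_latents.shape[0]
    depth, start = None, 0
    while start < T:
        end = min(start + seg_len, T)
        cond = video_latents[start:end]
        init = torch.randn_like(cond)
        if depth is not None:
            # Anchor the depth distribution: reuse the already-denoised overlap,
            # lightly re-noised, as this segment's initialization.
            init[:overlap] = depth[-overlap:] + 0.1 * torch.randn_like(depth[-overlap:])
        seg = denoise(init, cond)
        if depth is None:
            depth = seg
        else:
            # Latent interpolation across the overlap for seamless stitching.
            w = torch.linspace(0, 1, overlap).view(-1, 1, 1, 1)
            depth = torch.cat(
                [depth[:-overlap], (1 - w) * depth[-overlap:] + w * seg[:overlap], seg[overlap:]],
                dim=0,
            )
        start = end if end == T else end - overlap
    return depth
```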
It outperforms SOTA methods across multiple benchmarks (Sintel, ScanNet, KITTI, Bonn).
Read here: https://depthcrafter.github.io