Abstract
With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: the model must re-compute all the conditional frames that overlap between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrate that Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at https://github.com/Dawn-LX/CausalCache-VDM
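The quadratic-versus-linear cost described in the abstract can be illustrated with a toy count of how many frames the model must process over an autoregressive rollout. This is only a sketch under stated assumptions, not the authors' implementation: the function name `frames_processed` and the parameters `num_steps` and `clip_len` are hypothetical, the context is assumed to grow by one full clip per step, and per-frame attention cost and the number of denoising steps are ignored.

```python
def frames_processed(num_steps: int, clip_len: int, use_cache: bool) -> int:
    """Count frames passed through the model over an autoregressive rollout.

    Assumption: every previously generated frame is kept as a conditional
    frame, so the context grows by `clip_len` at each autoregression step.
    """
    total = 0
    context = 0  # conditional frames generated in earlier steps
    for _ in range(num_steps):
        if use_cache:
            # Cached features of conditional frames are reused,
            # so only the new clip is computed.
            total += clip_len
        else:
            # Conditional frames are re-computed together with the new clip.
            total += context + clip_len
        context += clip_len
    return total


if __name__ == "__main__":
    for steps in (1, 2, 4, 8):
        without_cache = frames_processed(steps, 16, use_cache=False)
        with_cache = frames_processed(steps, 16, use_cache=True)
        print(steps, without_cache, with_cache)
```

Under these assumptions the uncached count grows as `clip_len * K * (K + 1) / 2` for `K` autoregression steps, while the cached count grows as `clip_len * K`.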
| Original language | English |
|---|---|
| Pages (from-to) | 18550-18565 |
| Number of pages | 16 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 267 |
| Publication status | Published - 2025 |
| Event | 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada (Duration: 13 Jul 2025 → 19 Jul 2025) |
Bibliographical note
Publisher Copyright: © 2025 by the author(s).