Text-to-Video: The Task, Challenges and the Current State

Video samples generated with ModelScope.

Text-to-video is next in line in the long list of incredible advances in generative models. As self-descriptive as it is, text-to-video is a fairly new computer vision task that involves generating a sequence of temporally and spatially consistent images from a text description. While the task might seem extremely similar to text-to-image, how do text-to-video models actually differ, and what kind of performance can we expect from them?

We will start by reviewing the differences between the text-to-video and text-to-image tasks. Then, we will cover the most recent developments in text-to-video models, exploring how these methods work and what they are capable of. Finally, we will talk about what we are working on at Hugging Face to facilitate the integration and use of these models, and share some cool demos and resources both on and outside of the Hugging Face Hub. #Text-to-Video
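As a taste of where this integration is headed, here is a minimal sketch of running the ModelScope text-to-video model through diffusers. The checkpoint name (damo-vilab/text-to-video-ms-1.7b) and the exact output format of the pipeline are assumptions based on the model's Hub page and may differ across diffusers versions.

```python
# Minimal sketch: text-to-video with the ModelScope checkpoint via diffusers.
# Assumes the "damo-vilab/text-to-video-ms-1.7b" checkpoint and a CUDA GPU;
# the shape/type of `.frames` may vary with your diffusers version.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()  # trade some speed for lower VRAM usage

prompt = "An astronaut riding a horse on Mars"
frames = pipe(prompt, num_inference_steps=25).frames  # generated video frames

# Stitch the generated frames into an .mp4 file
video_path = export_to_video(frames)
print(video_path)
```

Note that the model generates a short clip (on the order of a couple of seconds); the hard part the post discusses, keeping frames consistent over time, is exactly what separates this from calling a text-to-image pipeline in a loop.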
Why another text-to-image model?
Well, this one is pretty fast and efficient. Würstchen's biggest benefits come from the fact that it can generate images much faster than models like Stable Diffusion XL, while using a lot less memory! So for all of us who don't have A100s lying around, this will come in handy.

Speed comparison with SDXL over different batch sizes. #text-to-image
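If you want to try it yourself, here is a minimal sketch using Würstchen's diffusers integration. The warp-ai/wuerstchen checkpoint name and the pipeline parameters shown are assumptions drawn from the model's Hub documentation and may change between diffusers versions.

```python
# Minimal sketch: text-to-image with Würstchen via diffusers.
# Assumes the "warp-ai/wuerstchen" checkpoint and a CUDA GPU;
# parameter names may vary with your diffusers version.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=torch.float16
).to("cuda")

caption = "An anthropomorphic cat dressed as a firefighter"
images = pipe(
    caption,
    width=1024,
    height=1024,
    prior_guidance_scale=4.0,  # guidance for the compressed-latent prior stage
    num_images_per_prompt=2,
).images

for i, image in enumerate(images):
    image.save(f"wuerstchen_{i}.png")
```

The speedup comes from Würstchen's two-stage design: a prior generates images in a very highly compressed latent space, and a decoder then expands them to full resolution, so the expensive diffusion steps run on far fewer pixels than in SDXL.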