Text-to-Video: The Task, Challenges and the Current State

Video samples generated with ModelScope.

Text-to-video is next in line in the long list of incredible advances in generative models. As self-descriptive as it is, text-to-video is a fairly new computer vision task that involves generating, from a text description, a sequence of images that is both temporally and spatially consistent. This task might seem extremely similar to text-to-image, so how do text-to-video models differ from text-to-image models, and what kind of performance can we expect from them?

We will start by reviewing the differences between the text-to-video and text-to-image tasks. Then, we will cover the most recent developments in text-to-video models, exploring how these methods work and what they are capable of. Finally, we will talk about what we are working on at Hugging Face to facilitate the integration and use of these models, and share some cool demos and resources both on and outside of the Hugging Face Hub.