Yeah, it's about 5x slower than real-time with the current configuration. The good news is that diffusion models and transformers are constantly benefiting from new acceleration techniques. This was a big reason we wanted to bet on those architectures.
Edit: If we generate videos at a lower resolution and with fewer diffusion steps than the public configuration uses, we are able to generate videos at 20-23 fps, which is just about real-time. Here is an example: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...
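To give a rough sense of why those two knobs get you from 5x-slower-than-real-time to ~20+ fps: a minimal back-of-envelope sketch, assuming (this is my assumption, not something stated above) that per-frame cost scales roughly linearly with the number of diffusion steps and with the pixel count. All the concrete numbers below (50 steps, 512x512, etc.) are hypothetical placeholders, not the actual configuration.

```python
# Back-of-envelope throughput model for diffusion video generation.
# Assumption: per-frame latency scales ~linearly with diffusion steps
# and pixel count. All specific numbers here are hypothetical.

def frames_per_second(base_fps, base_steps, base_pixels, steps, pixels):
    """Estimate fps after changing the step count and resolution."""
    speedup = (base_steps / steps) * (base_pixels / pixels)
    return base_fps * speedup

# Hypothetical baseline: ~5x slower than a 24 fps target, i.e. ~4.8 fps,
# at 50 steps and 512x512.
base = frames_per_second(4.8, 50, 512 * 512, 50, 512 * 512)
# Hypothetical "fast" config: 20 steps at 384x384.
fast = frames_per_second(4.8, 50, 512 * 512, 20, 384 * 384)
print(round(base, 1), round(fast, 1))  # → 4.8 21.3
```

Under these (made-up) numbers, dropping from 50 to 20 steps and shrinking the frame by one resolution tier lands right in the 20-23 fps range, which is the kind of tradeoff the edit above describes.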
It's something we're thinking about. Our main focus right now is to make the model as good as it can be. There are still many edge cases and failure modes.