Better models are coming out that are already pretrained on a significant amount of data, so the model has already learned a lot about what is common to all examples of video generation (keeping edges aligned coherently at every frame, keeping texture and lighting consistent, etc.) and does not need to re-learn that for every target.
Initially, deepfake models were trained from scratch for every single target, so you had to provide a lot of data from the person you wanted to target in order for the model to learn both what is common and what is specific.
Now you can get decent performance with much less data, since the model only needs to learn the target-specific details.
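To make that concrete, here is a minimal sketch of the pretrain-then-fine-tune idea in PyTorch. Everything in it is a toy placeholder of my own (FaceGenerator, the layer sizes, the reconstruction loss), not a real deepfake pipeline: the point is only that the generic, pretrained part gets frozen and only a small target-specific part is trained on the few clips you have.

```python
# Toy illustration of "pretrain once, fine-tune per target".
# All names here (FaceGenerator, target_frames, the file path) are hypothetical.
import torch
import torch.nn as nn

class FaceGenerator(nn.Module):
    """Stand-in for a face/video generation model."""
    def __init__(self):
        super().__init__()
        # Pretrained part: what is common to all faces/videos
        # (coherent edges, texture, lighting across frames).
        self.backbone = nn.Sequential(
            nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 512)
        )
        # Target-specific part: what is unique to one person.
        self.identity_head = nn.Linear(512, 256)

    def forward(self, x):
        return self.identity_head(self.backbone(x))

model = FaceGenerator()
# In practice you would load weights from large-scale pretraining here, e.g.:
# model.backbone.load_state_dict(torch.load("pretrained_backbone.pt"))

# Freeze the generic knowledge so it is not re-learned for every target.
for p in model.backbone.parameters():
    p.requires_grad = False

# Only the small identity-specific head is trained on the few target clips.
optimizer = torch.optim.Adam(model.identity_head.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Stand-in for a handful of frames of the target person (the "much less data").
target_frames = torch.randn(32, 256)

for _ in range(100):
    optimizer.zero_grad()
    out = model(target_frames)
    loss = loss_fn(out, target_frames)  # toy reconstruction objective
    loss.backward()
    optimizer.step()
```

Because the frozen backbone already encodes the generic behaviour, the fine-tuned head can only reproduce what the few target frames actually contain, which is exactly the limitation described next.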
However, this only helps if you need a limited deepfake: the model cannot infer the exact facial expression of the target when they are, for example, laughing, unless you provided an example of that in the training data (assuming there is no way to infer someone's laughing expression from the other provided expressions). It will instead generate a generic laugh. All missing information is substituted with what was seen, on average, during the pre-training phase.
That wouldn't work for a long, complex deepfake meant to be sent to someone reasonably close to the target.
But for the kinds of deepfakes that target a personality we all know, just not very well, much less data is needed than before for a similar result.
At least in my experience, audio is much harder to fake convincingly than video. If you have heard the real person speaking, they have very specific and distinguishable patterns of speech.
You can fake it reasonably, but you need to have a very large collection of audio clips to do so, and if you do a bad job it literally jumps out at the viewer.
Video might be off, but it requires close attention and large screens to notice - much easier to miss if you're viewing on a phone.