This has been happening for at least a year, probably more. I always assumed it was something the creator had to manually mark, never considered that it might be automatic.
My TV's YT app is useless and buggy. It doesn't have automatically generated or automatically translated subtitles. The YT app on my mi-box does. So it wouldn't be impossible to have the app prescan the VTT track: "if the VTT has the phrase so-and-so, then display animation lalala.gif".
(I use the VTT in Firefox, and I use IDM to download the auto-generated English-language scripts for lengthy videos.)
I remember listening to Chomsky say that he prefers reading transcripts to listening to or watching speeches.
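For what it's worth, the "if the VTT has the phrase so-and-so" idea is easy to sketch once you have the downloaded .vtt text. This is only a rough illustration, not how YT actually does it (I have no idea what they run). The function names, the sample captions, and the trigger table are all made up for the example; the only thing assumed from the format is the standard WebVTT timing line.

```python
import re

# Timing line of a WebVTT cue, e.g. "00:00:01.000 --> 00:00:04.000" (hours are optional).
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(h, m, s, ms):
    """Convert timestamp parts (strings, hours may be None) to seconds."""
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_cues(vtt_text):
    """Return a list of (start_seconds, end_seconds, text), one per cue."""
    cues = []
    # Cues are separated by blank lines; the "WEBVTT" header block has no timing line.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = TIMING.match(line.strip())
            if m:
                g = m.groups()
                text = " ".join(lines[i + 1:])
                text = re.sub(r"<[^>]+>", "", text)  # drop inline tags such as <c>
                cues.append((_seconds(*g[:4]), _seconds(*g[4:]), text.strip()))
                break
    return cues


def find_triggers(cues, triggers):
    """Return (start_seconds, animation) for every cue containing a trigger phrase."""
    hits = []
    for start, _end, text in cues:
        lowered = text.lower()
        for phrase, animation in triggers.items():
            if phrase.lower() in lowered:
                hits.append((start, animation))
    return hits


# Made-up sample captions and a hypothetical phrase -> animation table.
SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:04.000
welcome back to the channel

00:00:04.500 --> 00:00:07.250
don't forget to <c>hit the like button</c>
"""
TRIGGERS = {"like button": "lalala.gif"}

print(find_triggers(parse_cues(SAMPLE), TRIGGERS))  # [(4.5, 'lalala.gif')]
```

A plain substring match like this will miss phrases that are split across two cues, which auto-generated captions do a lot, so a real version would probably need to join neighbouring cues before matching.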
Why would these auto-generated subs be the app's purview? Wouldn't the YT platform be doing all of this work and providing a CC button if the data is available? I can't imagine this being done in real time rather than as a post-process run against new content during upload processing.