A related interesting problem would be: how to search by text keywords in the audio stream of the video.

I guess part of one solution could be to convert the speech to text, which we could then grep. But how would we correlate the matches back to the time positions in the video where those keywords occurred?
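One possible answer, as a rough sketch only: if the speech-to-text step can emit a timestamped transcript rather than plain text, the correlation comes for free. The code below assumes (my assumption, not something established above) that the transcript is in SRT subtitle format, where each cue carries a start and end time. The names `parse_srt`, `find_keyword`, and `SAMPLE` are made up for illustration.

```python
import re

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:05,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert the captured hour/minute/second/millisecond strings to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Yield (start_seconds, end_seconds, cue_text) for each cue in SRT text."""
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                # Everything after the timing line is the spoken text.
                yield to_seconds(*g[:4]), to_seconds(*g[4:]), " ".join(lines[i + 1:])
                break


def find_keyword(srt_text, keyword):
    """Return the start time (seconds) of every cue containing the keyword."""
    needle = keyword.lower()
    return [start for start, _end, cue in parse_srt(srt_text) if needle in cue.lower()]


# Hypothetical transcript, assumed to come from a speech-to-text tool.
SAMPLE = """1
00:00:01,000 --> 00:00:04,000
Welcome to the talk on video search.

2
00:01:02,500 --> 00:01:05,000
Now we search the audio stream.
"""

print(find_keyword(SAMPLE, "audio"))  # [62.5]
```

The returned values are offsets in seconds from the start of the video, so they could be handed to whatever seek option the player offers. The granularity is only as fine as the cues: a match tells you which cue the keyword falls in, not the exact moment the word is spoken.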