> it knows what is being said in those videos at the exact moment it is said. Ho...

slg · on Aug 20, 2024

I don't know why you quoted this snippet to ask this question when I addressed it literally in the next sentence of my original comment, but I guess I need to reiterate it. Automatic transcripts/ closed captioning suggest Google heard what was said. This feature suggests Google understands what was said.

Think of it like reading a foreign language that uses the same alphabet. You can give me some French text and I can read it well enough for a French speaker to understand it. That doesn't mean that I myself understand what I read. Transcripts are the former and this feature is the latter.

johnfn · on Aug 20, 2024

The reason I didn't respond to the rest of your post is because I can't understand its relationship to your argument. There is no need for deeper understanding here. It's just a substring match.

slg · on Aug 20, 2024

Ironically you seem to be focusing too much on the exact words and phrases I used rather than the deeper meaning. So let's just get completely away from words like "knows" and "understanding" which seem to be tripping multiple people up.

> It's just a substring match.

Let's just say this is true. That is a super simply process, but what would it look like?

Step 1: Transcribe the audio into text

Step 2: Run substring match on text

The transcribing/close captioning feature only does step 1. This shows that a step 2 is possible. I think you would have to be naive to think the capability to do this type of analysis on the transcribed text was designed for only this feature and would never be used for anything else. This feature is announcing that Youtube isn't merely creating transcripts of the audio in videos, it is running some unknown amount of analysis on that data.

As I said in my original comment, "it isn't that I didn't know Google had this ability", but this is literally a glowing sign pointing to this fact. I think the danger of Google reminding people of this has potential to outweigh the benefit of the "that's cute" reaction that this is designed to elicit.

johnfn · on Aug 20, 2024

> I think you would have to be naive to think the capability to do this type of analysis on the transcribed text was designed for only this feature and would never be used for anything else.

Why would you have to be naive to believe this? The subtitles, with timings, are available on the client-side already. You seem to allude that doing this work would require some sort of deep analysis work. I think it's really more like 5 lines of JS, and 4 of them are producing the fun animated gradient :P

slg · on Aug 20, 2024

It isn’t about this requiring “deep analysis work”. It is the difference between some analysis and no analysis. That increase from 0 to 1 is always the biggest hurdle to clear when it comes to any corporate behavior like this.

This feature is like walking into your kitchen one day to find a dead cockroach. I’m saying that is an indication of a cockroach problem while you’re effectively responding with “it’s fine, the one cockroach is dead and there is no reason to believe there are any others.”

johnfn · on Aug 20, 2024

In a previous post, you said this:

> This feature suggests Google understands what was said.

Running .includes() on the client does not imply that Google has any "understanding" of what was said. It only implies that they ran .includes() on the client. includes() does not "understand" anything.

The thing I really don't understand is this: the fact that Google has closed captions at all implies they do an enormous more "analysis" than this minor feature could possibly require. If you understand how Google does CCs and what that means, this shouldn't have bothered you at all.

In your analogy, it's like you see a mound of a hundred thousand cockroaches, but you're are worried about a dust speck in another room.

slg · on Aug 21, 2024

>Running .includes() on the client does not imply that Google has any "understanding" of what was said. It only implies that they ran .includes() on the client. includes() does not "understand" anything.

Do you want to have a good faith conversation about this? Because going back to debating the meaning of "understanding" after I already said this was misleading is not a good way to have a conversation.

>The thing I really don't understand is this: the fact that Google has closed captions at all implies they do an enormous more "analysis" than this minor feature could possibly require. If you understand how Google does CCs and what that means, this shouldn't have bothered you at all.

Can we set up a baseline that there is a difference between content agnostic analysis and content aware analysis? Transcripts are content agnostic in that they can be produced without any comprehension of the words said. This feature is content aware in that it is looking for specific meaning in the words said. Do you not see any difference between these two?

johnfn · on Aug 21, 2024

> Do you want to have a good faith conversation about this? Because going back to debating the meaning of "understanding" after I already said this was misleading is not a good way to have a conversation.

Call it "understanding", call it "content aware analysis". I guarantee that their closed captioning service has much more of that quality than this new feature does.

> Can we set up a baseline that there is a difference between content agnostic analysis and content aware analysis? Transcripts are content agnostic in that they can be produced without any comprehension of the words said. This feature is content aware in that it is looking for specific meaning in the words said. Do you not see any difference between these two?

Again, I don't see it. CCs are not content agnostic: they have to have semantic understanding of the words said in order to produce accurate results. How do you think CCs differentiate between the words "to", "too" and "two" without looking at the surrounding words and having some idea of contextual usage? How do you think CCs can tell between "there" and "they're" without understanding if the speaker is referring to a person or a location? This is only the tip of the iceberg as to how CCs actually work, and more "content aware analysis" will always lead to more accurate CCs.

slg · on Aug 21, 2024

>Again, I don't see it. CCs are not content agnostic: they have to have semantic understanding of the words said in order to produce accurate results. How do you think CCs differentiate between the words "to", "too" and "two" without looking at the surrounding words and having some idea of contextual usage? How do you think CCs can tell between "there" and "they're" without understanding if the speaker is referring to a person or a location? This is only the tip of the iceberg as to how CCs actually work, and more "content aware analysis" will always lead to more accurate CCs.

Still can't get away from that "understanding" debate. You're also now equating an understanding of context with an understanding of meaning. An understanding of meaning isn't needed to differentiate between "to", "two", and "too" because they're all used differently in sentences. When the system encounters those, I don't think it goes to the definitions and tries to find which word makes the most meaningful sentence. Most times the specific homophone can be inferred based on things like part of speech and the part of speech can often be inferred from a sentence without knowing any meaning.

For example, would the system be able to properly handle homophones that are grammatically similar? Could it consistently transcribe sentences like "I have Celiac disease and enjoy the taste of rose water, so I prefer flower to flour in my deserts." That is an easy sentence to understand for anyone who knows the meaning of those words, but there are no grammatical or structural indications on which flower/flour to use.

But either way, that is getting way too deep in the weeds compared to where my point started. This feature calls attention to an analysis of meaning because the user sees the software reacting to the meaning of the content of the video. A transcript does not call attention to an analysis of meaning because the behavior of the software does not change based on the content of the video.

johnfn · on Aug 22, 2024

> But either way, that is getting way too deep in the weeds compared to where my point started.

Your first comment - the one that started all this - was, as far as I can understand, arguing that this feature indicated that Google had the capabilities to do more advanced - understanding? processing? meaning analysis? - than it had done in the past. If I keep coming back to that, well, it's because it appears to be your main point. If it's not, please correct me.

> Most times the specific homophone can be inferred based on things like part of speech and the part of speech can often be inferred from a sentence without knowing any meaning.

This is not true. I don't think I have enough more responses on HN to fully explain why homophones can not be inferred without understanding meaning, but I encourage you to go and read about how transcription works!

> For example, would the system be able to properly handle homophones that are grammatically similar?

I mean, this is easy enough for you to check. Here's some videos about flour / flower - notice how the CCs correctly determine if the word is flour or flower with almost 100% accuracy.

https://www.youtube.com/watch?v=y8vLjPctrcU https://www.youtube.com/watch?v=xdaRvErv2Kc

> This feature calls attention to an analysis of meaning because the user sees the software reacting to the meaning of the content of the video.

Are you saying you specifically think that YT is analyzing meaning from this feature, or just some generic user? I think you are smart enough to know that it's not true, but perhaps my mom might not understand that CCs require infinitely more processing power and this feature is just a drop in the bucket. (If you really still don't think it's true, definitely go read more about how CCs are made!)

slg · on Aug 22, 2024

>Your first comment - the one that started all this - was, as far as I can understand, arguing that this feature indicated that Google had the capabilities to do more advanced - understanding? processing? meaning analysis? - than it had done in the past. If I keep coming back to that, well, it's because it appears to be your main point. If it's not, please correct me.

Here is what I said. "It highlights how much Google analyses the content of its videos... It isn't that I didn't know Google had this ability...". My point was not that I learned about Google's capability from this feature or that this capability was new, it is that this calls attention to Google looking for meaning in the content of the video. A transcript does not call attention to Google looking for meaning regardless of how the transcripts are prepared.

>I mean, this is easy enough for you to check. Here's some videos about flour / flower - notice how the CCs correctly determine if the word is flour or flower with almost 100% accuracy.

Both those videos include the correct homophone in the title and description of the video. Choosing the correct one is not an indication of the system using the meaning of those words, it is pattern recognition. Every use of "flower" means the next usage is less likely to be "flour". The specificity of the example I used was important because it used both "flower" and "flour" in a way that can only be distinguished by the meaning of the words.

>Are you saying you specifically think that YT is analyzing meaning from this feature, or just some generic user? I think you are smart enough to know that it's not true, but perhaps my mom might not understand that CCs require infinitely more processing power and this feature is just a drop in the bucket. (If you really still don't think it's true, definitely go read more about how CCs are made!)

This feature is a glowing sign that Youtube as a company analyses the content of the videos for the meaning of what is said in those videos. You are too deep into the technical details trying to assign credit for what aspect of Youtube does the "understanding" or which "require[s] infinitely more processing power".

Think of this feature like receiving mail and you see one of the letters has already been opened. That could make you feel like your privacy was invaded in a way you wouldn't feel after receiving a postcard. And now we have spent several comments debating whether a torn envelope indicates whether anyone read the letter and whether a postcard is private.

FrequentLurker · on Aug 21, 2024

Google also censors all swear words in the auto transcriptions so is it content agnostic anymore?

slg · on Aug 21, 2024

That depends on how the transcription software is written. Are swear words filtered out or are they just never in the system's vocabulary in the first place? I assumed the latter, but fair enough point. It is possible my categorization needs more thought.

Regardless, there is in my opinion a clear distinction in sophistication between a filter and something that triggers a timed action. And that was really what my original comment was about, this feature's elevated sophistication is a conscious reminder of Google's capabilities. Normally that is out of sight and out of mind which is probably better for Google.

jncfhnb · on Aug 21, 2024

0 to 1? Pfft.

Try 1 to 2.

jrvieira · on Aug 20, 2024

so you're not so worried that they do this analysis (this is very tame comparing to what they really do), but rather that they are transparent about it?

zamadatix · on Aug 20, 2024

You've overlooked that automatic transcription also includes the filtering of certain words based on match lists as well.

8note · on Aug 20, 2024

Does this feature suggest they understand what was said?

I'd expect the button to light up at random times in a video about "smash the like button" causing the button to light up, rather than in the end call to action for the video where the video author actually wants the watcher to click the button.

Similarly, I doubt it can properly understand when the video presenter implies the term "smash the like button" while instead say, leaving an empty space where they would usually say such a thing.

If I pointed you at some french text and asked you to point at the term "le Chateau" you'd be able to do that without understanding french.

slg · on Aug 20, 2024

Now this is just a semantic debate about whether a computer can even "understand" anything. The system is conditionally responding based upon the content of what was said. That indicates a level of sophistication in analyzing the content of the video that a transcript does not. I don't really care if you define that as "understanding" or not.

pb7 · on Aug 20, 2024

Does it understand it or does it have a regex for "smash that subscribe button"?

hi-v-rocknroll · on Aug 21, 2024

It doesn't work, at least not on my channel or on an unlisted video. I tried "Press the 'Subscribe' button", "Smash the 'Subscribe' button", and "Smash the 'Like' button."

skeaker · on Aug 21, 2024

It isn't rolled out to every channel yet. Back when the chapters feature was added to Youtube not every video would have it even if they had the correct syntax in their description. I happened to upload a video around that time and happened to put that syntax in my description and got it to work despite having a channel with almost 0 subscribers, making me think feature rollouts like these are essentially just random. Anecdotally, I suspect videos with these new features are also given an algorithm boost: My video where I had this feature randomly trigger early got over a million views despite that not being my intent and despite my channel being otherwise near non-existent in terms of activity.

dylan604 · on Aug 20, 2024

do you really think that Googs has implemented a regex for this little easter egg without parsing the rest of the content in a much more meaningful manner as the original comment is suggesting? i'm in agreement that the levels of what these platforms can infer about the content you consume is beyond creepy.

root_axis · on Aug 20, 2024

Yes, because a regex match is cheap, easy and predictable. I don't see a reason why it needs to be any more complicated than that.

dylan604 · on Aug 20, 2024

but that's not how Googs makes money

root_axis · on Aug 20, 2024

What does that have to do with anything? The text transcript for the video already exists on the client, implementing this behavior is trivial with that in mind. Whatever creepy analysis YouTube is doing (which I don't doubt they are), this feature isn't evidence of it.

dylan604 · on Aug 21, 2024

you honestly think that Googs spend the time and money to develop automatic transcription/translation for anything other than figuring out ad sales to videos? anything else they've done with it is just bonus features from the work on how to direct ads. how is that difficult to grok?

SirMaster · on Aug 21, 2024

So you consider an if statement checking whether "like button" was present at that time in the CC to be the computer "understanding"?