
LLMs can take audio as input, e.g. Gemini 2.0 Flash accepts audio and is very fast and cheap (a second of audio costs 25 audio tokens, priced at $0.10 per 1 million input tokens).
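A quick back-of-envelope check of those figures (25 tokens per second of audio, $0.10 per million input tokens; treat both as the comment's assumptions, not current official pricing):

```python
# Assumed figures from the comment above -- verify against current pricing.
TOKENS_PER_SECOND = 25
PRICE_PER_MILLION_TOKENS = 0.10

def audio_input_cost(seconds: float) -> float:
    """Dollar cost to feed `seconds` of audio as input tokens."""
    tokens = seconds * TOKENS_PER_SECOND
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# One full hour of audio is 3600 s * 25 = 90,000 tokens:
print(f"${audio_input_cost(3600):.4f}")  # → $0.0090
```

So under these numbers, transcribing an hour of speech costs under a cent in input tokens, which is why "cheap" is fair.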

But I get your point that e.g. Whisper is also very good and can run fast locally on edge devices. We also recently got Phi-4-multimodal, an LLM that is 5.6B params and good enough to run inference locally on beefy smartphones. The problem is that those smart speakers probably have really weak CPUs/GPUs. Also, an LLM with audio input can probably do better on WER and take your mood/emotion into account when doing inference.



Is the specific component that takes the audio in Gemini an LLM? If so, how does audio data become tokens?

But my point is not just about the ease of running Whisper; it's that the device is already doing transcription. Even if it's not optimal, it works.


I don't know exactly how it works under the hood for the LLM. Gemini provides a dedicated Live API where you stream audio via WebSocket — I'd guess they use some audio-specific tokenizer.
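To make that guess concrete: Gemini's actual internals aren't public, but published neural audio tokenizers (SoundStream/EnCodec style) typically frame the waveform, encode each frame, and vector-quantize it against a learned codebook — the codebook indices are the "audio tokens". A toy sketch of just the quantization step (the real systems use a learned neural encoder and much larger codebooks):

```python
# Toy illustration of VQ-style audio tokenization. NOT Gemini's actual
# pipeline -- a simplified sketch of the general codec-tokenizer idea.
import math

def frame(signal, frame_len):
    """Split a waveform into fixed-length frames, zero-padding the last."""
    frames = []
    for i in range(0, len(signal), frame_len):
        chunk = signal[i:i + frame_len]
        chunk = chunk + [0.0] * (frame_len - len(chunk))
        frames.append(chunk)
    return frames

def nearest_code(vec, codebook):
    """Vector-quantize: return the index of the closest codebook entry."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda k: dist(vec, codebook[k]))

def tokenize(signal, codebook, frame_len=4):
    # Real tokenizers pass frames through a learned encoder first;
    # here the raw frame stands in for its own embedding.
    return [nearest_code(f, codebook) for f in frame(signal, frame_len)]

# Toy 2-entry codebook: a "quiet" frame and a "loud" frame.
codebook = [[0.0] * 4, [1.0] * 4]
signal = [0.1, 0.0, 0.1, 0.0, 0.9, 1.0, 0.8, 0.9]
print(tokenize(signal, codebook))  # → [0, 1]
```

The resulting integer sequence can then be fed to an LLM alongside text tokens, which would explain the tokens-per-second pricing.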

If I had to guess why Amazon is moving to the cloud, it's:

1) They support only 8 languages right now; a cloud LLM, or even Whisper, can support something like 50 languages pretty well. I was always disappointed that I couldn't buy a Google Mini, Alexa, or Apple Home for my mum because none of them speak Polish.

2) They want to provide good support for those less beefy smart speakers that don't have much power, and those still sell well.

3) They want to move people onto the new Alexa subscription they recently announced, or get more people to subscribe to Prime.

4) They want to gather more voice samples so they can train multilingual TTS as good as ElevenLabs's.



