
LLMs can take audio as input, e.g. Gemini 2.0 Flash accepts audio and is very fast and cheap (a second of audio costs 25 audio tokens, priced at $0.10 per 1 million input tokens).
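A quick back-of-envelope check of those figures (25 tokens per second of audio, $0.10 per million input tokens; treat both as the comment's assumptions, not current official pricing):

```python
# Assumed figures from the comment above -- verify against current pricing.
TOKENS_PER_SECOND = 25
PRICE_PER_MILLION_TOKENS = 0.10

def audio_input_cost(seconds: float) -> float:
    """Dollar cost to feed `seconds` of audio as input tokens."""
    tokens = seconds * TOKENS_PER_SECOND
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# One full hour of audio is 3600 s * 25 = 90,000 tokens:
print(f"${audio_input_cost(3600):.4f}")  # → $0.0090
```

So under these numbers, transcribing an hour of speech costs under a cent in input tokens, which is why "cheap" is fair.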

But I get your point that e.g. Whisper is also very good and can run fast locally on edge devices. We also recently got Phi-4-multimodal, an LLM that is 5.6B params and good enough to run inference locally on beefy smartphones. The problem is that those smart speakers probably have really weak CPUs/GPUs. Also, an LLM with audio input can probably do better on WER and take your mood/emotion into account when doing inference.



Is the specific component that takes the audio in Gemini an LLM? If so, how does audio data become tokens?

But my point is not just about the ease of running Whisper; it's that the device is already doing transcription. Even if it's not optimal, it works.


I don't know exactly how it works under the hood for the LLM. Gemini provides a dedicated Live API where you stream audio via WebSocket — I'd guess they use some audio-specific tokenizer.
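To make that guess concrete: Gemini's actual internals aren't public, but published neural audio tokenizers (SoundStream/EnCodec style) typically frame the waveform, encode each frame, and vector-quantize it against a learned codebook — the codebook indices are the "audio tokens". A toy sketch of just the quantization step (the real systems use a learned neural encoder and much larger codebooks):

```python
# Toy illustration of VQ-style audio tokenization. NOT Gemini's actual
# pipeline -- a simplified sketch of the general codec-tokenizer idea.
import math

def frame(signal, frame_len):
    """Split a waveform into fixed-length frames, zero-padding the last."""
    frames = []
    for i in range(0, len(signal), frame_len):
        chunk = signal[i:i + frame_len]
        chunk = chunk + [0.0] * (frame_len - len(chunk))
        frames.append(chunk)
    return frames

def nearest_code(vec, codebook):
    """Vector-quantize: return the index of the closest codebook entry."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda k: dist(vec, codebook[k]))

def tokenize(signal, codebook, frame_len=4):
    # Real tokenizers pass frames through a learned encoder first;
    # here the raw frame stands in for its own embedding.
    return [nearest_code(f, codebook) for f in frame(signal, frame_len)]

# Toy 2-entry codebook: a "quiet" frame and a "loud" frame.
codebook = [[0.0] * 4, [1.0] * 4]
signal = [0.1, 0.0, 0.1, 0.0, 0.9, 1.0, 0.8, 0.9]
print(tokenize(signal, codebook))  # → [0, 1]
```

The resulting integer sequence can then be fed to an LLM alongside text tokens, which would explain the tokens-per-second pricing.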

If I had to guess why Amazon is moving to the cloud, it's:

1) They support only 8 languages right now; a cloud LLM, or even Whisper, can support something like 50 languages pretty well. I was always disappointed that I couldn't buy a Google Mini, Alexa, or Apple Home for my mum because none of them speak Polish.

2) They want to provide good support for those less beefy smart speakers that don't have much power, and those still sell well.

3) They want to move people onto the new Alexa subscription they recently announced, or get more people to subscribe to Prime.

4) They want to gather more voice samples so they can train multilingual TTS as good as ElevenLabs's.



