Cable TV begs to differ. I grew up working poor, and plenty of people around me dumped a lot of money into cable TV subscriptions; $120 back in the late '90s is about $240 now.
Computer costs keep collapsing. Image and audio generation turned out to be less compute-intensive than text (lol).
First company to launch 24/7 customized streaming AI slop wins!
I think the poster was saying giving away the models for $200 isn't sustainable for the provider, not that a user won't pay $200 for the latest and greatest models.
$10K should be enough to pay for a 512GB RAM machine, which, combined with partial SSD offload for the remaining memory requirements, should be able to run SOTA models like DS4-Pro or Kimi 2.6 at workable speed. It depends on whether MoE weights have enough locality over time that the SSD offload part is ultimately a minor factor.
(If you are willing to let the machine work mostly overnight/unattended, with only incidental and sporadic human intervention, you could even decrease that memory requirement a bit.)
As a typical example DeepSeek v4-pro has 59B active params at mostly FP4 size, so it needs to "find" around 30GB worth of params in RAM per inferred token. On a 512GB total RAM machine, most of those params will actually be cached in RAM (model size on disk is around 862GB), so assuming for the sake of argument that MoE expert selection is completely random and unpredictable, around 15GB in total have to be fetched from storage per token. If MoE selection is not completely random and there's enough locality, that figure actually improves quite a bit and inference becomes quite workable.
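To put numbers on it, here's the back-of-envelope as a quick Python sketch. The usable-RAM and SSD-bandwidth figures are assumptions on my part; the rest are the numbers above.

```python
# Rough estimate of SSD traffic per token for MoE inference with
# partial offload. Model figures are the ones quoted above; the
# usable-RAM and SSD-bandwidth numbers are assumptions.

model_on_disk_gb    = 862   # quoted on-disk size of the quantized model
ram_for_weights_gb  = 480   # assumption: 512GB total minus OS/KV-cache overhead
active_per_token_gb = 30    # ~59B active params at ~4 bits per param

# Pessimistic case: expert selection is uniformly random, so the chance
# a needed weight is already resident in RAM is just ram / model_size.
miss_fraction = 1.0 - ram_for_weights_gb / model_on_disk_gb
ssd_read_per_token_gb = active_per_token_gb * miss_fraction

ssd_bandwidth_gbps = 7.0    # assumption: a single PCIe 4.0 NVMe drive
seconds_per_token = ssd_read_per_token_gb / ssd_bandwidth_gbps

print(f"SSD read per token: ~{ssd_read_per_token_gb:.1f} GB")
print(f"I/O-bound floor:    ~{seconds_per_token:.2f} s/token "
      f"(~{1.0 / seconds_per_token:.2f} tok/s)")
```

Under those assumptions the purely random case lands around 13GB per token and roughly half a token per second as an I/O-bound floor. Any locality in expert selection raises the effective cache hit rate above that baseline, which is exactly the open question.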
I've never seen reports of this kind of setup being able to deliver more than low single-digit tokens per second. That's certainly not usable interactively, and only of limited utility for "leave it to think overnight" tasks. Am I missing something?
Also, I don't know of a general solution to streaming models from disk. Is there an inference engine that has this built in, in a way that applies to any model? I've seen people say (I haven't tried it myself) that you can use swap memory with CPU offloading in llama.cpp, and I can imagine that would work, but slowly. I don't know whether it automatically puts the most important routing layers on the GPU before offloading everything else to system RAM/swap. System RAM would, over time, come to hold the hottest selection of layers most of the time, since that's how swap works. Some people seem to be manually splitting up the layers and distributing them across GPU and system RAM.
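For what it's worth, here's roughly what the basic setup looks like through the llama-cpp-python bindings. This is a sketch I haven't benchmarked, the model path is a placeholder, and it relies on mmap rather than swap: llama.cpp memory-maps the GGUF by default, so the OS page cache (not swapfile logic) is what keeps the hot weights in RAM.

```python
# Sketch: partial GPU offload plus mmap-backed weight streaming via
# llama-cpp-python. Untested by me; model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4.gguf",  # placeholder
    n_gpu_layers=20,    # offload 20 transformer layers to the GPU; this is a
                        # plain layer count, it does not single out routing
                        # tensors or otherwise prioritize "important" weights
    use_mmap=True,      # default: weights are memory-mapped, so the OS page
                        # cache keeps the hottest pages in RAM and faults the
                        # rest in from disk on demand
    use_mlock=False,    # don't pin weights; pinning would prevent the page
                        # cache from evicting cold experts
    n_ctx=8192,
)

out = llm("Explain MoE expert offloading in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```

So the "hottest layers end up resident" behavior comes for free from the page cache, but as far as I know nothing here automatically pins the router layers to the GPU; that's what the people manually splitting tensors are doing by hand.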
Have you actually done this? On what hardware? With what inference engine?
Not really. The hardware requirements remain indefinitely out of reach.
Yes, it's possible to run tiny quantized models, but you're working with extremely small context windows and tons of hallucinations. It's fun to play with them, but they're not at all practical.
The memory requirements aren't that intense. You can run useful (not frontier) models on a $2-5K machine at reasonable speeds. The capabilities of Qwen3.6 27B or 35B-A3B are dramatically better than what was available even a few months ago.
Practical? Maybe not (unless you highly value privacy) because you can get better models and better performance with cheap API access or even cheaper subscriptions. As you said, this may indefinitely be the case.
Maybe 20 years ago; today it's no better than anything else - well designed in some aspects, total trash in others. The stewards of Xcode, Spotlight, and Siri (among many other stinkers) are disqualified from the category of "best".
Exactly my thoughts. It also raises major questions about organizational and executive leadership; it seems crazy to put the reins of such a massive ship - integral to the business of huge swaths of the economy - into the hands of an ambitious flash-in-the-pan startup.
It's been on the agenda for the Republicans and Israel for decades. It's been reported that Israel tried to encourage Obama and then Biden to do the same. Back during the "normal" political era, even respected Republicans like John McCain would joyously sing about bombing Iran.
Trump agreed with Israel that it was a good idea so now here we are.