Hacker News | new | past | comments | ask | show | jobs | submit | root_axis's comments | login

Yeah, but $200 a month is not a sustainable price.

Seems they are growing and the model is overloaded. I suspect they'll raise prices.

For a lot of developers here, even $1k would be totally worth it.


Cable TV begs to differ. I grew up working poor and plenty of people around me dumped a lot of money into cable TV subscriptions, and $120 back in the late 90s is $240 now.

Compute costs keep collapsing. Image and audio generation turned out to be less compute-intensive than text (lol).

First company to launch 24/7 customized streaming AI slop wins!


I think the poster was saying giving away the models for $200 isn't sustainable for the provider, not that a user won't pay $200 for the latest and greatest models.

> new hardware runs $4k-$10k last I checked

Starting closer to $40k if you want something practical. $10k can't run anything worthwhile for the SDLC at useful speeds.


$10K should be enough to pay for a 512GB RAM machine which, in combination with partial SSD offload for the remaining memory requirements, should be able to run SOTA models like DS4-Pro or Kimi 2.6 at workable speed. It depends on whether MoE weights have enough locality over time that the SSD-offload portion ends up being a minor factor.

(If you are willing to let the machine work mostly overnight/unattended, with only incidental and sporadic human intervention, you could even decrease that memory requirement a bit.)


You can't put "SSD offload" and "workable speed" in the same sentence.

As a typical example, DeepSeek v4-pro has 59B active params at mostly FP4 size, so it needs to "find" around 30GB worth of params in RAM per inferred token. On a 512GB total RAM machine, most of those params will actually be cached in RAM (model size on disk is around 862GB), so assuming for the sake of argument that MoE expert selection is completely random and unpredictable, around 15GB in total have to be fetched from storage per token. If MoE selection is not completely random and there's enough locality, that figure actually improves quite a bit and inference becomes quite workable.
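The arithmetic above can be written down as a back-of-envelope estimate. A minimal sketch, using the figures from this thread (59B active params at FP4, an 862GB on-disk model) plus two hypothetical assumptions of my own: roughly 480GB of RAM actually usable as a weight cache, and ~7GB/s of NVMe read bandwidth:

```python
# Back-of-envelope estimate of SSD-offload inference speed for a large MoE
# model. All figures are assumptions from the thread, not measurements.

GB = 1e9

active_params   = 59e9   # active params per token (assumed)
bytes_per_param = 0.5    # FP4 is ~4 bits per parameter
model_size_gb   = 862    # on-disk model size (assumed)
ram_cache_gb    = 480    # RAM usable as a weight cache (hypothetical)
ssd_gbps        = 7      # NVMe sequential read rate (hypothetical)

active_gb = active_params * bytes_per_param / GB   # bytes touched per token
hit_rate  = ram_cache_gb / model_size_gb           # assumes random expert selection
miss_gb   = active_gb * (1 - hit_rate)             # fetched from SSD per token
tok_per_s = ssd_gbps / miss_gb                     # SSD-bandwidth-bound rate

print(f"{active_gb:.1f} GB active, {miss_gb:.1f} GB/token from SSD, "
      f"{tok_per_s:.2f} tok/s")
```

Under these assumptions the SSD-bound rate lands well under one token per second, which is consistent with the "low single-digit tokens per second" reports mentioned below; any real locality in expert selection raises the hit rate and the speed.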

I've never seen reports of this kind of setup being able to deliver more than low single-digit tokens per second. That's certainly not usable interactively, and only of limited utility for "leave it to think overnight" tasks. Am I missing something?

Also, I don't know of a general solution to streaming models from disk. Is there an inference engine that has this built in, in a way that applies to any model? I know (I mean, I've seen people say it; I haven't tried it) that you can use swap memory with CPU offloading in llama.cpp, and I can imagine that would probably work... but definitely slowly.

I don't know whether it automatically puts the most important routing layers on the GPU before offloading the rest to system RAM/swap, though. System RAM would, over time, come to hold the hottest selection of layers most of the time, since that's how swap works. Some people seem to be manually splitting up the layers and distributing them across GPU and system RAM.
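For what it's worth, the manual splitting people do with llama.cpp usually looks something like the sketch below. This is a hedged example, assuming a recent llama.cpp build; the model path and the tensor-override regex are illustrative only, and the exact regex depends on the model's tensor names:

```shell
# Hypothetical invocation (model path and regex are illustrative only):
#   -ngl 99  -> offload as many layers as fit onto the GPU
#   -ot ...  -> tensor override: pin MoE expert tensors to CPU/system RAM,
#               keeping attention and routing tensors on the GPU
#   -c 8192  -> context size
llama-cli -m ./model-q4.gguf -ngl 99 -ot "ffn_.*_exps.*=CPU" -c 8192
```

This doesn't stream from disk per se, but combined with mmap'd weights and OS page caching it approximates the "hottest layers end up in RAM" behavior described above.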

Have you actually done this? On what hardware? With what inference engine?


Not really. The hardware requirements remain indefinitely out of reach.

Yes, it's possible to run tiny quantized models, but you're working with extremely small context windows and tons of hallucinations. It's fun to play with them, but they're not at all practical.


The memory requirements aren't that intense. You can run useful (not frontier) models on a $2-5K machine at reasonable speeds. The capabilities of Qwen3.6 27B or 35B-A3B are dramatically better than what was available even a few months ago.

Practical? Maybe not (unless you highly value privacy) because you can get better models and better performance with cheap API access or even cheaper subscriptions. As you said, this may indefinitely be the case.


> The capabilities of Qwen3.6 27B or 35B-A3B are dramatically better than what was available even a few months ago.

Yes, a lot better, but still terribly unreliable and far less capable than the big unquantized models.


When they say "doing the opposite" they are referring to Anthropic's hyperbolic marketing strategy.

Though, I don't think that justifies spreading FUD in the opposite direction. I also don't think the comment the GP was replying to contains FUD.


Great post. Looking forward to the followup article about lights and shadows :)

Maybe 20 years ago; today it's no better than anything else: well designed in some aspects, total trash in others. The stewards of Xcode, Spotlight, and Siri (among many other stinkers) are disqualified from the category of "best".

Exactly my thoughts. It also raises major questions about organizational and executive leadership; it seems crazy to put the reins of such a massive ship - integral to the business of huge swaths of the economy - into the hands of an ambitious flash-in-the-pan startup.

Obviously, the fun part is delivering value for the shareholders.

I have also found LLMs are a great tool for understanding a new code base, but it's not clear to me what your comment has to do with skill atrophy.

Well, ultimately the skills I care about are understanding software, changing it, and making more of it. And clearly those aren't atrophying.

My syntax writing skills may well be atrophying, but I'll just do a leetcode by hand once in a while.


It's been an agenda for the Republicans and Israel for decades. It's been reported that Israel tried to encourage Obama and then Biden to do the same. Back during the "normal" political era, even respected Republicans like John McCain would joyously sing about bombing Iran.

Trump agreed with Israel that it was a good idea so now here we are.

