The easiest way would be to quantize the model, and serve different quants based on the current demand. Higher volumes == worse quant == more customers served per GPU
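The load-based routing idea could be sketched roughly like this. A minimal illustration, not any provider's actual logic; the tier names, thresholds, and variant labels are all made up:

```python
# Hypothetical sketch: pick a model quantization tier based on current load.
# All names and thresholds below are illustrative, not a real serving config.

QUANT_TIERS = [
    # (max utilization, model variant) -- lighter quants serve more users per GPU
    (0.50, "model-fp16"),   # plenty of headroom: full quality
    (0.80, "model-int8"),   # getting busy: mild quantization
    (1.00, "model-int4"),   # near capacity: aggressive quantization
]

def pick_variant(active_requests: int, capacity: int) -> str:
    """Return the model variant to serve, degrading quality as load rises."""
    utilization = active_requests / capacity
    for threshold, variant in QUANT_TIERS:
        if utilization <= threshold:
            return variant
    return QUANT_TIERS[-1][1]  # overloaded: serve the cheapest quant
```

At low load everyone gets the full-precision model; as utilization climbs, new requests silently get routed to cheaper quants instead of being rejected.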
Robbing Peter to pay Paul. They are probably resource-constrained, and have determined that it's better to supply a worse answer to more people than to supply a good answer to some while refusing others. Especially knowing that most people probably don't need the best answer 100% of the time.
> Especially knowing that most people probably don't need the best answer 100% of the time.
More to the point: most people probably don't know whether they've got a good answer 100% of the time.
It is interesting to note that this trickery only works where even the best answers are sufficiently poor. Imagine they ran almost any other kind of online service, such as email, stock prices, or internet banking. Occasionally delivering only half the emails would trigger a customer exodus. But if the normal service already lost a quarter of all emails, the only customers left would be those unlikely to notice half going missing.
Well, there's this other project that recently secured funding from a company that has a proven track record of supporting great open-source projects like Astro, TanStack, and Hono without trying to capture or lock anything down.