> but did not dispute that they store the entire model on SRAM.
No idea what they did or did not do for that specific test (which was about delivering 1800 tokens/sec, not simply running Qwen-3) since they didn't provide any detail. I don't think there is any point in storing everything in SRAM, even if you do happen to have $100M worth of chips lying around in a test cluster at the office, because WSE-3 is designed from the ground up for data parallelism (see [1] section 3.2) and inference is sequential both within a single token generation (you need to go through layer 1 before you can go through layer 2, etc.) and between tokens (autoregressive, so token 1 before token 2). This means most of the weights loaded in SRAM would just sit unused most of the time, and when a layer is needed it has to be broadcast to all chips from the SRAM of the chip holding it. That broadcast is extremely fast, but external memory is certainly fast enough to do the same job if you fetch the layer in advance. So the way to get the best ROI on such a system is to pack in the biggest batch you can (i.e., many users' queries) and process it all in parallel, streaming the weights as needed. The more your SRAM is occupied by batch activations rather than parameters, the better the compute density and thus the $/FLOP.
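To make that concrete, here's a back-of-the-envelope sketch: streaming a layer's weights costs a fixed amount of time, while the compute on that layer scales with batch size, so past some batch the streaming cost is fully hidden. Every number below is an illustrative assumption, not a Cerebras spec.

```python
# Back-of-the-envelope: when does per-layer compute time exceed the time to
# stream that layer's weights from external memory? All numbers are made up.

layer_params    = 2e9     # parameters in one layer (assumption)
bytes_per_param = 2       # fp16
stream_bw       = 2.4e12  # external-memory bandwidth, bytes/s (assumption)
peak_flops      = 125e15  # sustained FLOP/s on the wafer (assumption)

def layer_times(batch):
    t_stream = layer_params * bytes_per_param / stream_bw  # fixed, batch-independent
    t_compute = 2 * layer_params * batch / peak_flops      # ~2 FLOPs per param per sample
    return t_stream, t_compute

for batch in (1, 1024, 65536, 262144):
    t_s, t_c = layer_times(batch)
    verdict = "compute-bound: streaming fully hidden" if t_c >= t_s else "bandwidth-bound"
    print(f"batch={batch:6d}  stream={t_s*1e3:6.2f}ms  compute={t_c*1e3:6.2f}ms  {verdict}")
```

With these made-up numbers the crossover sits around a batch of ~50k samples, which is exactly why you'd pack as many users' queries as possible rather than spend SRAM on idle weights.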
You can check the Cerebras docs to see how weight streaming works [2]. From the start, one of Cerebras's selling points has been the ability to scale memory independently of compute, and they have developed an entire system specifically for streaming weights from that decoupled memory. Their docs keep things fairly simple by assuming you can only fit one layer in SRAM, so they fetch layers sequentially, but if you can store at least two layers in those 44GB of SRAM then you can simply fetch layer l+1 as layer l starts computing, completely masking the latency cost. It's possible they already mask the latency even within a single layer by streaming tiles for the matmul, though that's unclear from their docs; they mention it in passing in [3] section 6.3.
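A minimal sketch of that double-buffering, assuming SRAM holds two layers at once; `fetch_layer` and `compute_layer` are hypothetical stand-ins for whatever the real runtime exposes:

```python
from concurrent.futures import ThreadPoolExecutor

def run_layers(num_layers, fetch_layer, compute_layer, activations):
    # One background I/O worker prefetches layer l+1 while layer l computes.
    with ThreadPoolExecutor(max_workers=1) as io:
        next_weights = io.submit(fetch_layer, 0)  # prefetch the first layer
        for l in range(num_layers):
            weights = next_weights.result()       # blocks only if fetch is slower than compute
            if l + 1 < num_layers:
                next_weights = io.submit(fetch_layer, l + 1)  # overlap fetch of l+1 with compute of l
            activations = compute_layer(weights, activations)
    return activations
```

As long as one layer's compute takes at least as long as one layer's fetch, the `.result()` call never blocks and the external-memory latency disappears entirely.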
All of their docs cover training, since for their inference play they seem to have pivoted to selling API access rather than chips, but inference is really the same thing, just without the backprop (especially in their case, where they aren't doing pipeline parallelism, under which you could argue that fwd+bwd together give you better compute density). At the end of the day, whether you are doing training or inference, all you care about is that your cores have the data they need in their registers at the moment they are free to compute, so streaming to SRAM works the same way in both cases. The same techniques apply.
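Continuing the sketch above (same hypothetical callbacks, and glossing over activation storage for backprop), the only difference between the two modes is a second, reversed pass through the identical streaming loop:

```python
def infer(num_layers, fetch_layer, compute_layer, x):
    # forward pass only
    return run_layers(num_layers, fetch_layer, compute_layer, x)

def train_step(num_layers, fetch_layer, forward_layer, backward_layer, x):
    acts = run_layers(num_layers, fetch_layer, forward_layer, x)
    # backward pass streams the same weights again, last layer first
    return run_layers(num_layers, lambda l: fetch_layer(num_layers - 1 - l),
                      backward_layer, acts)
```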
Ultimately I can't tell you how much it costs to run Qwen-3; you can certainly do it on a single chip plus weight streaming, but their specs are just too light on the exact FLOPs and bandwidth to know what the memory-movement cost would be in this case (if any), and we don't even know the price of a single chip (everyone says $3M, regardless of that comment in the other thread). But I can tell you that the math of `model_size/sram_per_chip * chip_cost` just isn't the right way to think about this, so the $100M figure doesn't make sense.
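To see why that formula overshoots, compare it against the streaming setup with loudly made-up numbers (model size, chip price, and external-memory cost are all assumptions, not a real price list):

```python
import math

model_bytes   = 470e9  # e.g. a ~235B-param model at fp16 (assumption)
sram_per_chip = 44e9   # 44 GB of on-wafer SRAM
chip_cost     = 3e6    # the $3M/chip figure people quote (assumption)

# Naive math: every weight must live in SRAM simultaneously.
naive_chips = math.ceil(model_bytes / sram_per_chip)
print(f"naive: {naive_chips} chips -> ${naive_chips * chip_cost / 1e6:.0f}M")

# Streaming: one chip, model parked in external memory (cost per byte is a
# made-up placeholder, but it's orders of magnitude below SRAM-per-chip).
ext_mem_cost_per_byte = 1e-8  # ~$10/GB, assumption
streaming_cost = chip_cost + model_bytes * ext_mem_cost_per_byte
print(f"streaming: 1 chip + external memory -> ${streaming_cost / 1e6:.1f}M")
```

The naive formula prices SRAM as if it were the only place weights can live; once external memory holds the parameters, the chip count stops scaling with model size.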
[1]: https://arxiv.org/html/2503.11698v1#S3.
[2]: https://training-api.cerebras.ai/en/2.1.0/wsc/cerebras-basic....
[3]: https://8968533.fs1.hubspotusercontent-na2.net/hubfs/8968533...