
You'd still need a fairly large amount of compute power to be able to run DeepSeek R1 locally, no?


Well yes, but not so large that it's completely prohibitive. People have been running the full model on machines costing as little as $6,000: https://x.com/carrigmat/status/1884244369907278106

Of course this is for a personal instance, you'd need a much more expensive setup to handle concurrent users. And that's to run it, not train it.


Sort of a letdown that after 24 32GB RAM sticks you only get 6-8 tokens per second.


But a token is not just a character.

"hello how are you today?" - 7 tokens.

And this is so much better than I could have imagined in a very short span of time.
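
If you want to sanity-check the characters-vs-tokens ratio yourself, here's a minimal Python sketch using tiktoken's cl100k_base encoding (an approximation - DeepSeek ships its own tokenizer, so the exact count will differ a little, e.g. 6 vs 7 for that sentence):

    # Rough token count for a sentence, using tiktoken's cl100k_base
    # encoding as a stand-in (not DeepSeek's actual tokenizer).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "hello how are you today?"
    tokens = enc.encode(text)
    print(len(text), "characters ->", len(tokens), "tokens")
    # 24 characters -> 6 tokens with this encoding; other tokenizers
    # may give 7, as in the comment above.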


And you only get to use a 20k context length before it OOMs.


I have a used workstation I got for $2k (with 768GB of RAM) - using the Q4 model, I can get about 1.5 tokens/sec and use very large contexts. It's pretty awesome to be able to run it at home.


For me, where electricity is $0.45/kWh, assuming 1 kW consumption, it would be around $80 USD per million tokens!


I think you might have to show your math on that one.


They said 1.5 tokens/second. One million tokens at that rate is ~667k seconds, or ~185 hours. 1 kW * 185 hr * $0.45/kWh = ~$80 per million tokens. Again, assuming 1 kW, which may be high (or low). The cost of the physical computation is essentially just the electricity.
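
If anyone wants to plug in their own rate, a minimal sketch of that arithmetic (the wattage figures are the thread's assumptions, not measurements):

    # Electricity cost per million generated tokens.
    def usd_per_million_tokens(tokens_per_sec, watts, usd_per_kwh):
        hours = 1_000_000 / tokens_per_sec / 3600   # time to generate 1M tokens
        kwh = watts / 1000 * hours                  # energy used in that time
        return kwh * usd_per_kwh

    print(usd_per_million_tokens(1.5, 1000, 0.45))  # ~83 -> "about $80/million"
    print(usd_per_million_tokens(1.5, 500, 0.45))   # ~42 -> "about $40/million" (see below)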


They said it has a crappy GPU; the whole computer probably only uses 200-250 watts.


No way. 768GB of RAM will have significant power draw. DDR4 (which this probably is) is something like 3W per 8GB. That's >250W for the memory alone.

So, say 500W. That's, for me in my expensive-electricity city, about $40/million tokens, with the pretty severe rate limit of ~5,400 tokens/hour (1.5 tokens/sec * 3,600 s).

If you're in Texas, that would be closer to $10/million tokens! Now you're at the same price as GPT-4o.


But you can run and experiment with any model of your liking. And your data does not leave your desktop environment. You can build services. I don't think anybody doing this is doing it to save $20 a month.


Yes. I was only making a monetary comparison.

Related, you can get a whole lot of cloud computing for $2k, for those same experiments, on much faster hardware.

But yes, the data stays local. And, it's fun.

This comment chain is pretty funny.


Would love to know more info & specs of your workstation.


It's an HP Z8 G4 (dual-socket 18-core, 3 GHz Xeons, 24x32GB of DDR4-2666, and then a crappy GPU, 8TB HDD, 1TB SSD). It can accommodate 3 dual-slot GPUs, but I was mostly interested in playing with frontier models where holding all the weights in VRAM requires a ~$500k machine. It can run the full Deepseek R1, Llama3-405B, etc, usually around 1-2 tokens/sec.


A better approach is to split the model, with the MoE experts running on the CPU and the MLA attention running on the GPU. See the ktransformers project: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...

This takes advantage of the sparsity of MoE and the efficient KV cache of MLA.
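
To illustrate why the MoE half is tolerable on the CPU: only a few experts fire per token (R1 activates roughly 37B of its 671B parameters per step), so most expert weights are never touched on a given forward pass. A toy sketch of top-k routing - purely illustrative, not ktransformers' code, and the sizes are made up:

    # Toy top-k MoE routing: the router scores N experts and only the
    # top k run per token, so weight traffic scales with k/N.
    # (Illustrative only; not ktransformers' implementation.)
    import numpy as np

    d_model, n_experts, top_k = 64, 8, 2
    rng = np.random.default_rng(0)
    router = rng.standard_normal((d_model, n_experts))
    experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

    def moe_forward(x):
        scores = x @ router                     # router logits, one per expert
        chosen = np.argsort(scores)[-top_k:]    # indices of the top-k experts
        w = np.exp(scores[chosen])
        w /= w.sum()                            # softmax over the chosen experts
        # only the chosen experts' weights are read for this token
        return sum(wi * (x @ experts[i]) for wi, i in zip(w, chosen))

    print(moe_forward(rng.standard_normal(d_model)).shape)  # (64,)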


You perhaps forgot to mention that for their AMX optimizations to even be feasible, you'd need to spend ~$10k for a single CPU, let alone the whole system, which is probably ~$100k.


Granite Rapids-W (Workstation) is coming out soon for likely much less than half that per CPU. (Xeon W-3500/2500 launched at $609 to $5,889 per CPU less than a year ago and also have AMX.)


Point being? Workstations that are fresh on the market and have performance comparable to their server counterparts still easily cost anywhere between $20k and $40k. At least that's what Dell workstation pricing looked like last time I checked.


Supermicro X13SWA-TF motherboard (16 DIMM slots, Xeon W-3500) = ~$1,000

E-ATX case = ~$300

Power supply = ~$300

Xeon W-3500 (8-channel memory) = $1,339 - $5,889

Memory = $300-$500 per 64GB DDR5 RDIMM

Memory will be the major cost. The rest will be around $5,000. A lot less than "$100,000"!
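
Summing the non-memory parts above (the CPU spans the quoted Xeon W-3500 price range):

    # Rough total of the non-memory parts listed above.
    base = 1_000 + 300 + 300          # motherboard + case + PSU
    cpu_low, cpu_high = 1_339, 5_889  # quoted Xeon W-3500 range
    print(base + cpu_low, "-", base + cpu_high)  # 2939 - 7489, i.e. "around $5,000" mid-range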


I acknowledged in my last comment that the cost doesn't have to be $100k, but that it would still be very high if you opted for the workstation design. You're going to need to add one more CPU to your design, another 8 memory channels, a beefier PSU, and a new motherboard that can accommodate all of this. So, ~$8k (memory) + ~$10k (CPUs) + the rest. As I said, not less than $20k.


Why does it have to be a dual CPU design? 8 channels of DDR5 4800 will still get you something like 300 GB per second bandwidth. Not amazing, but OK. Granite Rapids-W will likely be something like 50% better (cores and bandwidth).

And the original message you were responding to was using a CPU with AMX and mixing it with a GPU like an Nvidia 4090/5090. That way the large part of the model sits in the larger, slower memory, and the active part sits in the GPU's faster memory. Very cost effective and fast. (Something like generating 16 tokens/s with the 671B DeepSeek R1 at a total hardware cost of $10k-$20k.) They tried both single and dual CPU, with the latter about 30% faster... not necessarily worth it.

https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...


> 8 channels of DDR5 4800 will still get you something like 300 GB per second bandwidth.

That's the theory. In practice, Sapphire Rapids needs 24-28 cores to hit the 200 GB/s mark, and it doesn't go much further than that. Intel's CPU designs generally have a hard time saturating memory bandwidth, so it remains to be seen whether they've managed to fix this, but I wouldn't hold my breath. 200 GB/s is not much. My dual-socket Skylake system hits ~140 GB/s and it's quite slow for larger LLMs.

> Why does it have to be a dual CPU design?

Because memory bandwidth is one of the most important limiting factors for large-model inference. With a dual-socket design you're essentially doubling the available bandwidth.
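
A rough way to see it: every generated token has to stream the active weights out of memory, so an upper bound on decode speed is bandwidth divided by active-weight bytes. A sketch using R1's ~37B active parameters and 4-bit weights (assumptions - real throughput is lower because of KV cache traffic and overhead):

    # Crude upper bound: tokens/s <= bandwidth / bytes of weights read per token.
    # Assumes ~37B active params at 4 bits (~18.5 GB per token); real numbers are lower.
    def max_tokens_per_sec(bandwidth_gb_s, active_params_b=37, bits=4):
        return bandwidth_gb_s / (active_params_b * bits / 8)

    print(max_tokens_per_sec(140))  # ~7.6  (the dual-socket Skylake figure above)
    print(max_tokens_per_sec(200))  # ~10.8 (practical Sapphire Rapids figure)
    print(max_tokens_per_sec(300))  # ~16.2 (theoretical 8ch DDR5-4800)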

> And the original message you were responding to was using a CPU with AMX and mixing it with a GPU like Nvidia 4900/5900.

A dual-socket CPU setup that costs $10k, in a server that probably costs a few times more. Now, you claimed that it doesn't have to be that expensive, but I beg to differ - you still need $20k-$30k worth of equipment to run it. That's a lot, and not quite "cost effective".


The proof of the pudding is in the eating. Read the link above. It's one or two mid-range[1] Sapphire Rapids CPUs and a 4090. Dual CPU is faster (partially because of 32->64 cores, not just bandwidth) but also hits data locality issues, limiting the increase to about 30%.

(Dual Socket Skylake? Do you mean Cascade Lake?)

If you price it out, it's basically the most cost-effective setup with reasonable speed for large (more than 300 GB) models. Dual socket basically doubles the motherboard[2] and CPU cost, so maybe another $3k-$6k for a 30% uplift.

[1] https://www.intel.com/content/www/us/en/products/sku/231733/... $3,157

[2] https://www.serversupply.com/MOTHERBOARD/SYSTEM%20BOARD/LGA-... $1,800


Yes, dual socket Skylake. What's strange about that?

Please price it out for us, because I still don't see what's cost-effective about a system that costs well over $10k and runs at 8 tok/s vs. the dual Zen 4 system for $6k running at the same tok/s.


Sorry. Didn't realize you meant Skylake-SP.

I am not sure what your point is. There are some nice dual-socket EPYC examples floating around as well that claim 6-8 tokens/s. (I think some of those are actually distilled versions with very small context sizes... I don't see any as thoroughly documented/benchmarked as the above.) This is a dual-socket Sapphire Rapids example with similarly sized CPUs and a consumer graphics card that gives about 16 tokens/second. The Sapphire Rapids CPU and motherboard are a bit more expensive, and a 4090 was $1,500 until recently. So for a few thousand more you can double the speed. The prompt processing speed is also way faster. (Something like 10x faster than the EPYC versions.)

In any case, these are all vastly cheaper approaches than trying to get enough H100s to fit the full R1 model in VRAM! A single H100 80 GB is more than $20k, and you would need many of them + server just to run R1.


I'm not arguing against their idea, which is sound; I'm arguing that the cost needed to achieve the claimed performance is not "a few thousand more," as you stubbornly continue to claim.

The math is clear: single-socket ktransformers performance is 8.73 tok/s, and it costs ~$12k to build such a rig. You get the same performance from a $6k dual-EPYC system. And it is the full-blown version of R1, not a distilled one as you suggest.

Your claim of 16 tok/s is also misleading. That figure is for 6 experts, while we are comparing R1 with 8 experts against llama.cpp with 8 experts. With 8 experts on a dual-socket system, the ktransformers benchmarks run at 12.2-13.4 tok/s, not 16 tok/s.

So, ktransformers in a dual-socket configuration achieves roughly 50% more than the dual-EPYC system. That is not double, as you say. And finally, such a dual-socket system costs ~$20k, so it isn't the "most cost-effective" solution: it's ~3.5x more expensive for ~50% better output.
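
Putting the thread's own figures side by side as dollars per token/s (prices and speeds as quoted above, not independent benchmarks):

    # $ per (token/s) for the setups discussed in this thread.
    configs = {
        "dual-EPYC (llama.cpp)":             (6_000,  8.0),
        "single-socket ktransformers":       (12_000, 8.73),
        "dual-socket ktransformers, 8 exp.": (20_000, 13.0),
    }
    for name, (usd, tok_s) in configs.items():
        print(f"{name}: ${usd / tok_s:,.0f} per token/s")  # 750, 1374, 1538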

And tbh, llama.cpp is not really optimized for pure CPU inference workloads. It has this strange "compute graph" framework whose purpose I don't understand; it appears completely unnecessary to me. I also profiled a couple of small-, mid- and large-sized models, and the interesting thing was that the majority of them turned out to be bottlenecked by CPU compute on a system with 44 physical cores and 192GB of RAM. I think it could do a much better job there.


Are we doing this?

Cheapest 32-core latest-gen EPYC (9335) x 2 = $3,079 x 2

Intel 32-core CPU used above x 2 = $3,157 x 2 (I would choose the Intel Xeon Gold 6530, which is going for around $2k now, with higher clock speeds and 100 MB more cache)

AMD EPYC dual-socket motherboard, Supermicro H13DSH = $1,899

Intel Supermicro X13DEG-QT = $1,800

Memory, PSU, case = same

4090 GPU = $1,599 - $3,000 (temporary?)

Besides the GPU cost, the rest is about the same price. You only get a deep discount with AMD setups if you use EPYCs a few years old with cheaper (and slower) DDR4.

And again, if you go single CPU, you save over $4,000, but lose around 30% in token generation.

The "$6,000" AMD examples I've seen are pretty vague on exactly what parts were used and exactly what R1 settings including context length they were run at, making true apple to apple comparisons difficult. Plus the Sapphire Rapids + GPU example is about 10x faster in prompt processing. (53 seconds to 6 seconds is no joke!)


> Are we doing this?

Yes, you're blatantly misrepresenting information and moving goalposts. At this point it has become clear that you're doing this because you're obviously affiliated with the ktransformers project.

$6k for 8 tok/s or $20k for 12 tok/s. People are not stupid. I rest my case here.


$6k is not that bad, considering that a top-of-the-line Apple laptop costs as much. However, I don't have X, so unfortunately I can't read the details.


You can read the whole thread through nitter:

https://xcancel.com/carrigmat/status/1884244369907278106



