Hacker News | firefly2000's comments

Are there plans to elucidate implicit GC costs as well?


Great question! I actually just touched on this in another thread that went up right around the same time you asked this. It is clearly the next big frontier!

The short answer is: It's something I'm actively thinking about, but instrumenting micro-level events (like ZGC's load barriers or G1's write barriers) directly inside application threads without destroying throughput (or creating observer effects invalidating the measurements) is incredibly difficult.


> instrumenting micro-level events (like ZGC's load barriers or G1's write barriers) directly inside application threads without destroying throughput (or creating observer effects invalidating the measurements) is incredibly difficult

I've used a sampling profiler with success to find lock contention in heavily multithreaded code, but I guess there are some details that makes it not viable for this?


Do you think it can be done by adjusting GC aggressiveness (or even disabling it for short periods of time) and correlating it with execution time?


That is spot on. Effectively disabling GC to establish a baseline is exactly the methodology used in the Blackburn & Hosking paper [1] I referenced.

In general, for a production JVM like HotSpot, the implicit cost comes largely from the barriers (instructions baked directly into the application code). So even if we disable GC cycles, those barriers are still executing.

If we were to remove barriers during execution, maintaining correctness becomes the bottleneck. We would need a way to ensure we don't mark a live (reachable) object as dead the moment we re-enable the collector.
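To make the "baked directly into the application code" point concrete, here is a toy Python sketch of a G1-style card-marking write barrier. This is purely illustrative (it is not HotSpot's actual mechanism, and the card size and table size are made up): the extra store to the card table piggybacks on every reference write in application code, so the cost is paid whether or not a GC cycle ever runs.

```python
# Toy sketch (NOT HotSpot's real code) of a card-marking write barrier.
# The hypothetical constants below are for illustration only.
CARD_SHIFT = 9                    # assume one card per 512-byte region
card_table = bytearray(1 << 16)   # assumed card table covering the heap

class Obj:
    def __init__(self, addr):
        self.addr = addr          # pretend heap address of this object
        self.field = None

def write_ref(obj, value):
    """Store a reference, then dirty the card covering `obj`.

    The card store is the 'implicit' GC cost: it executes on every
    reference write, even if the collector never runs a single cycle.
    """
    obj.field = value
    card_table[obj.addr >> CARD_SHIFT] = 1

a = Obj(addr=0x12345)
write_ref(a, Obj(addr=0x999))
assert card_table[0x12345 >> CARD_SHIFT] == 1  # card is now dirty
```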

[1] https://dl.acm.org/doi/pdf/10.1145/1029873.1029891


Would running an application with a chosen GC, subtracting the GC time reported by the methods you introduced, and then comparing with an Epsilon-based run be a good estimate of barrier overhead?

Thank you for the well written article!


That is a creative idea, but unfortunately, Epsilon changes the execution profile too much to act as a clean baseline for barrier costs.

One huge issue is spatial locality. Epsilon never reclaims, whereas other GCs reclaim and reuse memory blocks. This means their L2/L3 cache hit rates will be fundamentally different.

If you compare them, the delta wouldn't just be the barrier overhead; it would be the barrier overhead mixed with completely different CPU cache behavior, memory layout, and so on. The GC is a complex feedback loop, so results from Epsilon are rarely directly transferable to a "real" system.


If the workload were perfectly parallelizable, your claim would be true. However, if it has serial dependency chains, it is absolutely worth it to compute the result quickly and unreliably, then verify in parallel.


This is exactly what speculative decoding for LLMs does, and it can yield a nice boost.

A small, hence fast, model predicts next tokens serially. Then a batch of tokens is validated by the main model in parallel. If there is a mismatch, you reject the speculated token at that position and all subsequent speculated tokens, take the correct token from the main model, and restart speculation from there.

If the predictions are good and the batch parallelism efficiency is high, you can get a significant boost.
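The control flow above can be sketched in a few lines of Python. This is a toy: `draft` and `main_model_next` are made-up deterministic stand-ins for the small and large models (the real thing compares sampled distributions, not exact tokens), but the speculate-then-verify structure is the same.

```python
def draft(prefix):
    # Toy small model: next token is last token + 1.
    return prefix[-1] + 1

def main_model_next(prefix):
    # Toy main model: same rule, but it "disagrees" whenever the draft
    # rule would produce a multiple of 5, to force occasional rejection.
    nxt = prefix[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_step(tokens, k=4):
    # 1) Draft k tokens serially with the small model.
    spec, p = [], list(tokens)
    for _ in range(k):
        t = draft(p)
        spec.append(t)
        p.append(t)
    # 2) In one parallel batch, the main model scores every position,
    #    each conditioned on the prefix plus the drafted tokens so far.
    main = [main_model_next(tokens + spec[:i]) for i in range(k + 1)]
    # 3) Keep the longest agreeing run of drafts, then append the main
    #    model's token at the first mismatch (or its bonus k+1th token).
    n = 0
    while n < k and spec[n] == main[n]:
        n += 1
    return tokens + spec[:n] + [main[n]]
```

With these toy models, `speculative_step([1])` drafts `[2, 3, 4, 5]`, the main model rejects at the `5`, and the step returns `[1, 2, 3, 4, 6]`: four tokens gained from one batched verification.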


I have a question about what "validation" means exactly. Does this process work by having the main model compute the "probability" that it would generate the draft sequence, then probabilistically accepting the draft? Wondering if there is a better method that preserves the distribution of the main model.


> Does this process work by having the main model compute the "probability" that it would generate the draft sequence, then probabilistically accepting the draft?

It does the generation as normal using the draft model, thus sampling from the draft model's distribution for a given prefix to get the next (speculated) token. But it then uses the draft model's distribution and the main model's distribution for the given prefix to probabilistically accept or reject the speculated token, in a way which guarantees the distribution used to sample each token is identical to that of the main model.
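Based on the description above (and assuming it matches section 2.3 of the paper), the acceptance rule can be sketched like this: accept draft token x with probability min(1, p(x)/q(x)), where q is the draft model's distribution and p the main model's; on rejection, resample from the residual max(0, p − q), normalized. This is what makes each emitted token exactly p-distributed.

```python
import random

def accept_or_resample(x, p, q, rng=random.random):
    """Accept or correct a speculated token x.

    p, q: dicts mapping token -> probability under the main and draft
    models for the current prefix. Returns (token, was_accepted).
    """
    # Accept x with probability min(1, p(x)/q(x)).
    if rng() < min(1.0, p.get(x, 0.0) / q[x]):
        return x, True
    # Rejected: sample from the normalized residual max(0, p - q),
    # which exactly compensates for the bias of sampling x from q.
    resid = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in p}
    z = sum(resid.values())
    r, acc = rng() * z, 0.0
    for t, w in resid.items():
        acc += w
        if r <= acc:
            return t, False
    return t, False  # numerical-noise fallback: last residual token
```

Sampling x from q and then applying this rule yields tokens distributed exactly according to p, which you can confirm empirically with a two-token distribution.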

The paper has the details[1] in section 2.3.

The inspiration for the method was indeed speculative execution as found in CPUs.

[1]: https://arxiv.org/abs/2211.17192 Fast Inference from Transformers via Speculative Decoding


Is this Nvidia-only or does it work on other architectures?


Currently NVIDIA-only, we're cooking up some Vulkan stuff in rust-gpu though.


I don't have anything to offer but my encouragement, but there are _dozens_ of ROCm enjoyers out there.

In years prior I wouldn't have even bothered, but it's 2026 and AMD's drivers actually come with a recent version of torch that 'just works' on Windows. Anything is possible :)


Thank you! We're small so have to focus. If anyone from AMD wants to reach out, happy to chat.


Anush is who you want to ping, he's motivated and will take care of you.

https://x.com/AnushElangovan


yes lmk how i can help. at the minimum i can get you hw and help with PRs etc. firstname at amd.com to reach me.


Does the lack of forward progress guarantees (ITS) on other architectures pose challenges for async/await?

