LegNeato's comments (Hacker News)

Doing things at compile time / AOT is almost always better for perf. We believe async/await and futures enable more complex programs and doing things you couldn't easily do on the GPU before. It's less about performance and more about capability (though we believe async/await perf will be better in some cases; time will tell).


> Doing things at compile time / AOT is almost always better for perf

https://devblogs.microsoft.com/dotnet/bing-on-dotnet-8-the-i...


Currently NVIDIA-only, we're cooking up some Vulkan stuff in rust-gpu though.


I don't have anything to offer but my encouragement, but there are _dozens_ of ROCm enjoyers out there.

In years prior I wouldn't have even bothered, but it's 2026 and AMD's drivers actually come with a recent version of torch that 'just works' on Windows. Anything is possible :)


Thank you! We're small, so we have to focus. If anyone from AMD wants to reach out, happy to chat.


Anush is who you want to ping, he's motivated and will take care of you.

https://x.com/AnushElangovan


Yes, lmk how I can help. At the minimum I can get you hw and help with PRs etc. firstname at amd.com to reach me.


Does the lack of forward progress guarantees (ITS) on other architectures pose challenges for async/await?


We aren't focused on performance yet (it is often workload- and executor-dependent, and as the post says we currently do some inefficient polling), but Rust futures compile down to state machines, so they are a zero-cost abstraction.

The anticipated benefits are similar to the benefits of async/await on CPU: better ergonomics for the developer writing concurrent code, better utilization of shared/limited resources, fewer concurrency bugs.


Warp divergence is expensive - essentially the hardware runs 'don't run this code' predication to maintain SIMT.

GPUs are still not practically-Turing-complete in the sense that there are strict restrictions on loops/goto/IO/waiting (there are a bunch of band-aids to make it pretend it's not a functional programming model).

So I am not sure retrofitting a Ferrari to cosplay as an Amazon delivery van is useful other than as a tech showcase?

Good tech showcase though :)


I think you're conflating GPU 'threads' and 'warps'. GPU 'threads' are SIMD lanes that all run the exact same instructions and control flow (differing only in filtering/predication), whereas GPU warps are hardware-level threads that run on a single compute unit. There's no issue with adding extra "don't run code" when using warps, unlike GPU threads.


My understanding of warp (https://docs.nvidia.com/cuda/cuda-programming-guide/01-intro...) is that you are essentially paying the cost of taking both branches.

I understand that with newer GPUs you have clever partitioning/pipelining such that block A takes branch A while block B takes branch B, with syncs/barriers, essentially relying on some smart 'oracle' to schedule these in a way that still fits the SIMT model.

It still doesn't feel Turing complete to me. Is there an NVIDIA doc you can refer me to?


That applies inside a single warp, notice the wording:

> In SIMT, all threads in the warp are executing the same kernel code, but each thread may follow different branches through the code. That is, though all threads of the program execute the same code, threads do not need to follow the same execution path.

This doesn't say anything about dependencies of multiple warps.


It's definitely possible, I am not arguing against that.

I am just saying it's not as flexible/cost-free as it would be on a 'normal' von Neumann-style CPU.

I would love to see Rust-based code that obviates the need to write CUDA kernels (including compiling to different architectures). It feels icky to use/introduce things like async/await in the context of a GPU programming model which is very different from a traditional Rust programming model.

You still have to worry about different architectures and the streaming nature at the end of the day.

I am very interested in this topic, so I am curious to learn how the latest GPUs help manage this divergence problem.


Yes, that's the idea.

GPU-wide memory is not quite as scarce on datacenter cards or systems with unified memory. One could also have local executors with local futures that are `!Send` and place them in a faster address space.


We use the CUDA device allocator for allocations on the GPU via Rust's default allocator.


Have you considered “allocating” out of shared memory instead?


Flip on the pedantic switch. We have std::fs, std::time, some of std::io, and std::net(!). While the `libc` calls go to the host, all the `std` code in-between runs on the GPU.


Author here! Flip on the pedantic switch, we agree ;-)


You might be interested in a previous blog post where we showed one codebase running on many types of GPUs: https://rust-gpu.github.io/blog/2025/07/25/rust-on-every-gpu...


One of the founders here, feel free to ask whatever. We purposefully didn't put much technical detail in the post as it is an announcement post (other people posted it here, we didn't).


1. What does it mean to be a GPU-native process?

2. Can modern GPU hardware efficiently make system calls? (if you can do this, you can eventually build just about anything, treating the CPU as just another subordinate processor).

3. At what order-of-magnitude size might being GPU-native break down? (Can CUDA dynamically load new code modules into an existing process? That used to be problematic years ago)

Thinking about what's possible, this looks like an exceptionally fun project. Congrats on working on an idea that seems crazy at first glance but seems more and more possible the more you think about it. Still it's all a gamble of whether it'll perform well enough to be worth writing applications this way.


1. The GPU owns the control loop and only sparingly kicks out to the CPU when it can't do something.

2. Yes

3. We're still investigating the limitations. A lot of them are hardware-dependent; obviously data center cards have higher limits and more capability than desktop cards.

Thanks! It is super fun trailblazing and realizing more of the pieces are there than everybody expects.


Pedantic note: rust-cuda was created by https://github.com/RDambrosio016 and he is not currently involved in VectorWare. rust-gpu was created by the folks at Embark Studios. We are the current maintainers of both.

We didn't post this or the title, we would never claim we created the projects from scratch.


My bad! "contributors" is more accurate, but HN doesn't allow editing titles, sadly :(


HN allows the submitter to edit the title, at least it did last time I checked.


It still does, but only for a window of minutes after submission.

I routinely have to fix the autoformatting done by HN.


No worries, just wanted to correct it for folks. Thanks for posting!


> folks at Embark Studios

seems like embark has disembarked from Rust and support for it altogether

