That makes sense. I don't know anything about embedded programming really, but I thought that it really fundamentally requires async (in the conceptual sense), so you have to structure your program as an event loop no matter what. Wasn't the alleged goal of Rust async to be zero-cost in the sense that the program transformation of a future ends up being roughly what you would write by hand if you had to hand-roll a state machine? Of course async itself requires a runtime, and I get why something like Tokio would be a non-starter in embedded environments, but you can still hand-roll the core runtime and structure the rest of the code with async/await, right? Or are you saying that the generated code, even without the runtime, is too heavy for an embedded environment?
> fundamentally requires async (in the conceptual sense)
Sometimes, kind of. For some counter-examples, consider a security camera or a thermostat. In the former you run in a hot loop because that's more efficient when you constantly have work to do, and in the latter you run in a simple polling loop (the details vary for power-efficiency reasons, but none of them are substantially improved by async) since the timing constraints are loose enough that async buys you nothing. One might argue that those are still "conceptually" async, but I think that misses the mark. For the camera, for example, a mental model of "process all the frames, maybe pausing for a bit if you must" is going to give you much better results when modeling that domain and figuring out how to add other features (between those two code models, the async one buys you less "optionality" and is more likely to hamstring your business).
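To be concrete about the shape I mean, something like this, where `Camera`, `capture_blocking`, and `process` are hypothetical stand-ins:

```rust
struct Frame([u8; 64]);
struct Camera;

impl Camera {
    /// Stub: a real driver would block on the sensor/DMA for the next frame.
    fn capture_blocking(&mut self) -> Frame {
        Frame([0; 64])
    }
}

fn process(_frame: &Frame) { /* encode, detect motion, etc. */ }

// The whole "architecture": no executor, no wakers, no task graph.
fn run(mut camera: Camera) -> ! {
    loop {
        let frame = camera.capture_blocking();
        process(&frame);
    }
}
```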
> zero-cost
IMO this is a big misnomer, especially when applied to abstractions like async. I'll come back to async below, looking first at simpler abstractions.
The "big" observation is that optimization is hard, especially as information gets stripped away. Doing it perfectly seemingly has an exponential cost (active research problem to reduce those bounds, or even to reduce constant factors). Doing it approximately isn't "zero"-cost.
With perfect optimization being impossible for all intents and purposes, you're left with a world where equivalent units of code don't produce the same generated instructions. I.e., the initial shape of your code biases the generated instructions one way or another. One way of writing high-performance code, then, is to choose initial representations closer to what the optimizer wants to work with. Basically, you're doing some of the optimization yourself and relying on the compiler not to screw it up too much -- which it mostly won't (there be dragons here, but it's a decent rule of thumb), because it can't search too far from the initial state you present to it.
Another framing of that: any given representation of the code you want to write has a low probability of handing the compiler the information it needs to actually optimize it.
Let's look at iterators for a second. The thing being eliminated with "zero-cost" iterators is logical instructions. Suppose you're applying a set of maps to an initial sequence. A purely runtime solution (if "greedy" and not using any sort of builder pattern), like you would normally see in JS or Python, would have explicit "end of data" checks for every single map you're applying, inflating the runtime with extra operations that exist only to support the iterator API for each of those maps.
Contrast that with Rust's implementation (or the equivalent in many other languages, including Zig -- "zero-cost" iterators are a fun thing a lot of programmers like to write even when the language doesn't provide them natively). Rust's compiler recognizes that applying a set of maps to a sequence can be rewritten as `for x in input: f0(f1(f2(...(x))))`. The `for x in input` part is the only thing that actually handles bounds/termination checking. From there all the maps are inlined, yielding tight assembly. The overhead of the iteration machinery is removed, so the abstraction of iteration is zero-cost.
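To make that concrete, here's a minimal sketch (the transforms are made up for illustration) of a chained-map pipeline and the single fused loop the optimizer effectively reduces it to:

```rust
/// The abstraction: three chained maps over a slice.
fn pipeline(input: &[u32]) -> Vec<u32> {
    input
        .iter()
        .map(|&x| x.wrapping_mul(3))
        .map(|x| x ^ 0xA5)
        .map(|x| x.rotate_left(7))
        .collect()
}

/// Roughly what it compiles down to: one termination check per element,
/// with all three closures inlined into a single loop body.
fn pipeline_fused(input: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(input.len());
    for &x in input {
        out.push((x.wrapping_mul(3) ^ 0xA5).rotate_left(7));
    }
    out
}
```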
Except it's not, at least not for a definition of "zero-cost" the programmer likely cares about. (I have similar qualms about safe Rust being "free of data races", but those are more esoteric and less likely to come up in your normal day-to-day.) It's almost always strictly better than nested, dynamic "end of iterator" checks, but it's not actually zero-cost.
Taking as an example something that came up somewhat recently for me, math over fields like GF(2^16) can be ... interesting. It's not that complicated, but it takes a reasonable number of instructions (and/or memory accesses). I understand that's not an every-day concern for most people, but the result illustrates a more general point which does apply. Your CPU's resources (execution units, instruction cache, branch predictors at several hierarchical layers, etc.) are bounded. Details vary, but when iterating over an array of data and applying a bunch of functions, even when none of it is vectorizable, you very often don't want codegen with that fused shape. You instead want to pop a few elements, apply the first function to those elements, apply the second function to those results, and so on, proceeding to the next batch once you've finished the first; a sketch of that shape follows. The problems you're avoiding include data dependencies (it's common for an instruction's throughput to be 1-2/cycle but its latency to be 2-4 cycles, meaning that if one instruction depends on another's output it has to wait 2-4 cycles when the CPU could otherwise be processing data every 0.5-1 cycles) and blowing past your pipeline depth (the CPU can automagically resolve those data dependencies if you don't have too many instructions per loop iteration, but writing the batched code explicitly guarantees that the CPU will _always_ be happy).
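Here's a rough sketch of that batched shape. `f0`/`f1`/`f2` are made-up stand-ins; imagine each is a multi-instruction step like a GF(2^16) multiply:

```rust
const BATCH: usize = 8;

// Stand-in transforms, each pretending to be several dependent instructions.
fn f0(x: u16) -> u16 { x.wrapping_mul(0x1001) }
fn f1(x: u16) -> u16 { x ^ 0x5A5A }
fn f2(x: u16) -> u16 { x.rotate_left(3) }

/// Batched layout: each inner loop's iterations are independent of one
/// another, so the CPU can overlap them instead of stalling on a full
/// f0 -> f1 -> f2 dependency chain per element.
fn process_batched(data: &mut [u16]) {
    for chunk in data.chunks_exact_mut(BATCH) {
        let mut tmp = [0u16; BATCH];
        for (t, &x) in tmp.iter_mut().zip(chunk.iter()) {
            *t = f0(x);
        }
        for t in tmp.iter_mut() {
            *t = f1(*t);
        }
        for (x, &t) in chunk.iter_mut().zip(tmp.iter()) {
            *x = f2(t);
        }
    }
    // Remainder handling (len % BATCH) omitted for brevity.
}
```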
BUT, your compiler often won't do that sort of analysis and fix your code's shortcomings. If that approximate layout of instructions doesn't exist in your code explicitly, the optimizer won't discover it. The difference in performance is absolutely massive when those scenarios crop up (often 4-8x). The "zero-cost" iterator API won't yield that better codegen, since its output isn't something the optimizer can effectively turn into the better solution. (Yet -- polyhedral models solve some similar problems and might eventually be incorporated into mainstream optimizers -- but that doesn't exist today, it's very hard, and it's illustrative of the idea that optimizers can't solve all your woes; when that one is fixed there will still be plenty more.)
> zero-cost async
Another pitfall of "zero-cost" is that all it promises is that the generated code is the same as what you would have written by hand. We saw with iterators that "would have written" doesn't quite align between the programmer and the compiler, and it's even more obvious with the async abstraction. Internally, Rust models each async fn as a state machine. More importantly, the machine's current state is stored and dispatched on at runtime.
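A hedged sketch of that shape (the real compiler-generated type is anonymous and also handles pinning, captured locals, and drop order; this hand-rolled version is simplified but runnable):

```rust
use core::future::Future;
use core::pin::Pin;
use core::task::{Context, Poll};

/// A toy leaf future that yields once before completing.
struct Step { polled: bool }
fn step() -> Step { Step { polled: false } }

impl Future for Step {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.polled {
            Poll::Ready(())
        } else {
            self.polled = true;
            cx.waker().wake_by_ref(); // ask to be polled again
            Poll::Pending
        }
    }
}

// What the programmer writes:
async fn two_steps() -> u32 {
    step().await;
    step().await;
    42
}

// Roughly what the compiler generates: an enum whose discriminant is
// stored in memory and matched on every poll.
enum TwoSteps {
    Start,
    FirstPending(Step),
    SecondPending(Step),
}

impl Future for TwoSteps {
    type Output = u32;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        loop {
            match &mut *self {
                TwoSteps::Start => *self = TwoSteps::FirstPending(step()),
                TwoSteps::FirstPending(fut) => match Pin::new(fut).poll(cx) {
                    Poll::Ready(()) => *self = TwoSteps::SecondPending(step()),
                    Poll::Pending => return Poll::Pending,
                },
                TwoSteps::SecondPending(fut) => match Pin::new(fut).poll(cx) {
                    Poll::Ready(()) => return Poll::Ready(42),
                    Poll::Pending => return Poll::Pending,
                },
            }
        }
    }
}
```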
You asked about hand-rolling the runtime to avoid Tokio in an embedded environment. That's a good start, but it's not enough. (It _might_ be -- "embedded" nowadays includes machines faster than some desktops from the 90s -- but let's assume we're working in one of the more resource-constrained subsets of "embedded" programming.) The problem is that the abstraction the compiler assumes we'll need is much more complicated than an optimal solution for the requirements we actually have. Moreover, the compiler doesn't know those requirements, and it almost certainly couldn't codegen its assumptions into our optimal solution even if it did. If you use Rust async/await, with very few exceptions you'll end up with both a nontrivial runtime (maybe very light, but still nontrivial in an embedded sense) and a lot of bloat in all your async definitions, plus runtime costs (RAM+CPU) as you navigate that unnecessary abstraction layer.
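For scale, here's a minimal sketch of the kind of hand-rolled "runtime" in question, assuming a single task and a no-op waker (a real embedded executor would at least add wake flags and `wfi`-style sleeping):

```rust
use core::future::Future;
use core::pin::pin;
use core::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker that does nothing: fine for a single-task busy-poll loop.
fn noop_raw_waker() -> RawWaker {
    fn no_op(_: *const ()) {}
    fn clone(_: *const ()) -> RawWaker { noop_raw_waker() }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, no_op, no_op, no_op);
    RawWaker::new(core::ptr::null(), &VTABLE)
}

/// Polls a single future to completion by spinning.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        // On real hardware you'd sleep/`wfi` here instead of spinning.
    }
}
```

Something like `block_on(two_steps())` from the earlier sketch would drive that state machine to completion. The loop itself is tiny; the costs the rest of this comment is about live in the state machines it polls.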
The compiler definitely can't strip away the runtime completely, at least for nontrivial programs. For sufficiently simple programs it does a pretty good job (you still might not be able to afford the explicit state machines it leaves behind, but whatever, most machines aren't _that_ small). Past a certain complexity level, though, we're back to zero-cost abstractions not being real because perfect optimization is impossible: use most of the features you'd actually want with async/await and you'll find the compiler can't fully desugar even very simple programs, and fully dynamic async (by definition) obviously can't exist without a runtime.
So, answering your question a bit more directly: you usually can't fix the issue by hand-rolling the core runtime, since the async machinery won't be abstracted away (resulting in high RAM/ROM/CPU costs), and even in code simple and carefully constructed enough that it _is_ abstracted away, you're still left with full runtime state machines, which are themselves overkill for most simple async problems. The space and time they take up can be prohibitive.