
The architecture-specific backends, including their architecture-specific optimizations, are another significant part of the compiler.

I suppose one of the reasons you might think of the LLVM project as monstrous is that it's feature-rich.



What if I only care about, like, 4 archs: x86/x86-64, arm64, and PTX?


You might be underestimating the intricacy of the CPU models LLVM uses.

If you want to see them in action, the same data drives llvm-mca[1], which, given a loop body, can tell you the throughput, latency, and microarchitectural bottlenecks (decoding, ports, dependencies, store forwarding, etc.). It's not always precise, but on average it does as well as, say, x86's IACA, the tool written at Intel by people who presumably knew how those CPUs actually work, unlike LLVM contributors and the rest of us, who can only guess and measure. And it does all this separately for Haswell, Sandy Bridge, Skylake, etc., not for a generic "x86".
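To make this concrete, here's a minimal sketch of running llvm-mca on a tiny loop body, assuming an LLVM install puts llvm-mca on your PATH (the two-instruction snippet is an illustrative assembly fragment, not taken from any particular program):

```shell
# Write a hypothetical x86-64 loop body to a file.
cat > loop.s <<'EOF'
vmulps  (%rsi), %xmm0, %xmm1
vaddps  %xmm1, %xmm2, %xmm2
EOF

# Analyze it against a specific CPU's scheduling model; swap -mcpu
# for skylake, sandybridge, etc. to see per-uarch differences.
llvm-mca -mcpu=haswell -iterations=100 loop.s
```

The output includes a resource-pressure table and per-instruction latency/throughput estimates derived from the same scheduling models the backend uses.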

Now, is this the best model you can get? Not exactly[2], but it’s close enough to not matter. Do we often need machine code optimized that finely? Perhaps not[3], and if you’re using generic distro binaries, you’re not getting it, either. (Unlike Facebook, Google, etc., who know precisely what their servers have inside, and who fund or contribute sizable portions of this optimization work.)

With that in mind, you can shave at least a factor of ten off LLVM's considerable bulk at the cost of 20-30% of performance[4,5]. But if you do want that last bit as well, the complexity of LLVM seems a fair price, or at least has the right order of magnitude.

(Frontend not included, C++ frontend required to bootstrap sold separately, at a similar markup compared to a C-only frontend with somewhat worse ergonomics.)

[1] https://llvm.org/docs/CommandGuide/llvm-mca.html

[2] https://www.uops.info/

[3] https://briancallahan.net/blog/20211010.html

[4] https://c9x.me/compile/

[5] https://drewdevault.com/talks/qbe.html


How much benefit do the models actually provide, other than for ISel? My processor now has an enormous ROB and a huge number of execution units; I've never really noticed anything super dramatic from compiler instruction scheduling.

MCA is basically useless for most programmers because it can't model memory accesses, i.e., cache performance.


Then you still have to do 4x the amount of work you otherwise would. And when the next iteration of those chips becomes available, you'll have to update your tooling yourself instead of getting updated pipeline models for the cost of a source upgrade.


Abstracting over x64 and aarch64 will go reasonably well until you hit the backend. Throwing PTX (or SASS, or AMDGPU) into the mix will make life much more difficult.


Do code size, instruction scheduling and register allocation matter to you?

Do you have an alternative to LLVM that only provides those four architectures? If not, you can constrain the scope of your LLVM build using -DLLVM_TARGETS_TO_BUILD.
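For reference, a sketch of such a trimmed configure step, assuming a checkout of the llvm-project monorepo (paths and generator are placeholders; the target names X86, AArch64, and NVPTX cover the four architectures mentioned, since X86 includes both 32- and 64-bit):

```shell
# Configure LLVM to build only the X86, AArch64, and NVPTX backends.
cmake -S llvm -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_TARGETS_TO_BUILD="X86;AArch64;NVPTX"
```

This substantially cuts build time and binary size compared to the default, which builds all in-tree targets.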


Your backend will probably be much worse than the one LLVM provides.



