It's really not enough to say that a GC handed you more distinct pointers and therefore has worse cache locality. A compacting GC almost always has better cache utilization than malloc, because heap fragmentation in long-running programs wastes TLB entries and leaves slack space between objects. A bump allocator backed by a compacting GC will give you a new pointer for each allocation because free doesn't reclaim the memory, but those allocations will be sequential, and if you're in the case where you're churning through your heap and only ever touching the most recent object, it will still be in cache. Benchmarking the knock-on effects of allocators and GCs is insanely difficult, and I'm very skeptical of basically any synthetic benchmark like this.
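For intuition, here's a minimal bump-allocation sketch (the names are made up, not taken from any particular runtime): every allocation just advances a cursor, so consecutive allocations are adjacent in memory.

    /* Minimal bump allocator: each allocation returns a fresh, strictly
       increasing pointer, so back-to-back allocations share cache lines
       even though nothing is individually freed before the GC compacts. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t *base;  /* start of the region owned by the GC */
        size_t   used;  /* bytes handed out so far */
        size_t   cap;   /* total region size */
    } bump_arena;

    static void *bump_alloc(bump_arena *a, size_t n) {
        n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment */
        if (a->used + n > a->cap) return NULL; /* a real runtime would collect/compact here */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }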
I think the fact that it is complicated to reason about is precisely why systems developers don't trust GCs.
It's far easier to write a thread-safe bump (slab) allocator than to locate and diagnose the code that someone wrote two years ago which, as of last week, started causing the GC to blow up the cache, contend on a lock, fragment the heap, add unbounded pauses, burn an extra CPU, etc.
(Though, at this point, most mallocs are so good that the slab allocator loses anyway and there’s no need to bother.)
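For what it's worth, the thread-safe version really is just an atomic bump of the cursor; a rough sketch under those assumptions (hypothetical names, no per-thread slabs, error handling omitted):

    /* Thread-safe bump allocation: one atomic fetch_add on the cursor.
       Contrast the effort here with diagnosing a GC regression buried in
       two-year-old code. */
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t        *base;
        _Atomic size_t  used;
        size_t          cap;
    } shared_arena;

    static void *arena_alloc(shared_arena *a, size_t n) {
        n = (n + 15) & ~(size_t)15;
        size_t off = atomic_fetch_add_explicit(&a->used, n, memory_order_relaxed);
        if (off + n > a->cap) return NULL;  /* out of space; caller decides what to do */
        return a->base + off;
    }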
FWIW, that synthetic benchmark was reflective of some real world code we were deploying.
Using malloc/free for one function led to something like a 2x performance improvement of the whole program.
I think it's important to differentiate between malloc implementations/algorithms, just like it's important to differentiate between GCs.
E.g., mimalloc "shards" size classes into pages, with a separate free list per page. This way, subsequent allocations all come from the same page. Freeing is not eager; only when the entire page becomes free, or when a new allocation finds the page's free list empty, does it take a periodic slow path to do the deferred work.
https://www.microsoft.com/en-us/research/uploads/prod/2019/0...
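A heavily simplified illustration of the per-page free-list idea (this is not mimalloc's actual data structure or API, just the shape of it): each size class owns pages, each page owns its own free list, so consecutive allocations of one size come from one page.

    /* Per-page free lists, simplified: allocation and free both stay on the
       block's owning page; only an empty free list or an empty page triggers
       slow-path work in a real allocator. */
    #include <stddef.h>

    typedef struct block { struct block *next; } block;

    typedef struct page {
        block       *free_list;   /* blocks freed back to this page */
        size_t       used;        /* live blocks on this page */
        struct page *next_page;   /* other pages in the same size class */
    } page;

    static void *page_alloc(page *pg) {
        block *b = pg->free_list;
        if (b == NULL) return NULL;   /* real allocators take a slow path here:
                                         collect deferred frees or grab a new page */
        pg->free_list = b->next;
        pg->used++;
        return b;
    }

    static void page_free(page *pg, void *p) {
        block *b = p;
        b->next = pg->free_list;      /* push back onto the owning page's list */
        pg->free_list = b;
        if (--pg->used == 0) {
            /* whole page is empty: a real allocator may now retire it */
        }
    }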
Good malloc implementations can also employ techniques to avoid fragmentation.
It's unfortunate that the defaults are bad.
But I confess, compacting GCs and profiling the effects of heap fragmentation (especially over time in long running programs) are both things I lack experience in. Microbenchmarks are unlikely to capture that accurately.
If we are talking about long-running multithreaded processes, then libc malloc has issues that aren't even fragmentation. There are many workloads where it seems to forget to return whole empty pages and just accumulates allocated areas despite them being totally empty.
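On glibc specifically, you can sometimes mitigate this by capping arenas and trimming explicitly; a hedged sketch (glibc-specific knobs, behavior varies by version and workload):

    /* glibc-specific workarounds for memory that is free but never returned
       to the OS.  How much this actually helps depends on the workload. */
    #include <malloc.h>  /* glibc: mallopt, M_ARENA_MAX, malloc_trim */

    int main(void) {
        /* Cap the number of per-thread arenas so empty ones are less likely
           to pile up in a heavily threaded process. */
        mallopt(M_ARENA_MAX, 2);

        /* ... run the workload ... */

        /* Ask glibc to return fully free pages back to the kernel. */
        malloc_trim(0);
        return 0;
    }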
Compaction really does help runtimes a lot, but I'm not sure how much it really has to do with cache-line-level locality. In general we don't try to batch related objects together except in a coarse form, by generation.
I think the measurable benefit comes from page-level savings: both reducing the number of trips to the kernel to get zeroed pages, and reducing pressure on the TLB.
But I have definitely seen numbers like 20% on some workloads just from turning on compaction.
> in general we don't try to batch related objects together except in a coarse form by generation.
It greatly depends on the GC algorithm, right? Copying collectors naturally bunch up related objects (though of course the order is less well defined if one object has pointers to several others).
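For reference, a classic Cheney-style copying collector is essentially a breadth-first traversal, which is why an object's children end up packed next to each other while deep chains get interleaved with their cousins. A rough sketch with a made-up object model (field names and layout are assumptions for illustration; sizes are assumed already aligned):

    /* Cheney-style copying collection, heavily simplified: reachable objects
       are copied into to-space in breadth-first order, so siblings end up
       adjacent, but depth-first order is not preserved. */
    #include <stddef.h>
    #include <string.h>

    typedef struct obj {
        struct obj *forward;          /* set once copied to to-space */
        size_t      size;             /* total object size in bytes */
        size_t      nfields;
        struct obj *fields[];         /* pointers to other objects */
    } obj;

    static unsigned char *alloc_ptr;  /* bump pointer into to-space */

    static obj *copy(obj *o) {
        if (o == NULL) return NULL;
        if (o->forward) return o->forward;    /* already moved */
        obj *new_o = (obj *)alloc_ptr;
        memcpy(new_o, o, o->size);
        alloc_ptr += o->size;
        o->forward = new_o;
        return new_o;
    }

    static void collect(obj **roots, size_t nroots, unsigned char *to_space) {
        alloc_ptr = to_space;
        unsigned char *scan = to_space;
        for (size_t i = 0; i < nroots; i++) roots[i] = copy(roots[i]);
        while (scan < alloc_ptr) {            /* breadth-first scan of the copies */
            obj *o = (obj *)scan;
            for (size_t i = 0; i < o->nfields; i++) o->fields[i] = copy(o->fields[i]);
            scan += o->size;
        }
    }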
> Compacting GC almost always has better cache utilization than malloc
...only if you put each object into its own heap allocation, which is a dumb thing to do in manual memory management. Typically you'd group many related objects of the same type tightly packed into the same allocation and loop over them in sequence to make proper use of prefetching.
The programmer will always have a better idea of how objects are going to be used than an automatic memory management solution does, and even with a GC it makes a lot of sense to use arrays of value types for better control over memory layout.
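As a concrete sketch of that last point (names made up): one tightly packed allocation iterated linearly, versus the GC-shaped alternative of one heap object per element reached through pointers.

    /* One packed allocation, iterated in sequence: the hardware prefetcher
       sees a predictable stride and there is no per-object header overhead. */
    #include <stdlib.h>

    typedef struct { float x, y, z, mass; } particle;

    int main(void) {
        size_t n = 1u << 20;
        particle *ps = calloc(n, sizeof *ps);  /* one allocation, not n */
        if (!ps) return 1;

        for (size_t i = 0; i < n; i++)         /* sequential, prefetch-friendly */
            ps[i].x += ps[i].mass * 0.5f;

        /* The alternative (an array of n pointers, each to its own heap
           object) turns this loop into n potential cache misses. */
        free(ps);
        return 0;
    }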