Hey there! Article author here -- just found out this was posted here and going through the comments :-)
One of the earliest test versions of my patch actually did inode reuse using slabs just like you're suggesting, but there are a few practical issues:
1. Performance implications. We use tmpfs internally within the kernel in a lock-free manner as part of some latency-sensitive operations, and using slabs complicates that somewhat. The fact that the kernel itself makes use of tmpfs internally makes this situation quite different from other filesystems.
2. Back when I was writing the patch, each memory cgroup had its own set of slabs, which greatly complicated reusing inode slab objects between different services (since they run in different memcgs).
After it became clear that slab recycling wouldn't work, I wrote a test patch that uses IDA instead, but I found that the performance implications were also untenable. There are other alternative solutions, but they increase code complexity/maintenance non-trivially and aren't really worth it.
A 64-bit per-superblock inode space resolves this issue without introducing any of these problems -- before you go through 2^64-1 inodes, you're going to hit other practical constraints anyway, at least for the time being :-)
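For a sense of scale, here's a quick back-of-the-envelope calculation; the one-million-creations-per-second rate is just an assumed, deliberately generous figure:

    # How long a 64-bit per-superblock inode space lasts at an assumed
    # (deliberately generous) rate of one million inode creations per second.
    rate = 1_000_000                        # inodes created per second (assumption)
    seconds = (2**64 - 1) / rate
    years = seconds / (60 * 60 * 24 * 365)
    print(f"{years:,.0f} years")            # roughly 584,942 years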
Oh that's interesting, thank you! Note that when I said "should", 1) it carries no weight and 2) I was referring to the old implementation, not what the fix should be. Going 64-bit sounds like a good option; hopefully it can become the default.
Hey! Facebook engineer here. If you have it, can you send me the User-Agent for these requests? That would definitely help speed up narrowing down what's happening here. If you can provide me the hostname being requested in the Host header, that would be great too.
I just sent you an e-mail, you can also reply to that instead if you prefer not to share those details here. :-)
I'm not sure I'd publicly post my email like that, if I worked at FB. But congratulations on your promotion to "official technical contact for all facebook issues forever".
Don't think I used my email for anything important during my time at FB. If it gets out of hand he could just make a request to have a new primary email made and use the above one for "spam".
Feels odd to read that Facebook uses Office 365/Exchange for email. They haven't built their F Suite yet; I thought they would simply promote Facebook Messenger internally. I'm only half joking.
Most communication is via Workplace (group posts and chat). Emails aren't very common any more - mainly for communication with external people and alerting.
But at least for the email/calendar backend, it's Exchange.
The internal replacement clients for calendar and other things are killer... I have yet to find replacements.
For the most part, though, they use Facebook internally for messaging and regular communication (technically it's now Workplace, but before it was just Facebook).
I don't want to share my website for personal reasons, but here is some data from the Cloudflare dashboard (a request made on 11 Jun 2020 at 21:30:55 from Ireland; I have 3 requests in the same second from 2 different IPs):
user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
ip 1: 2a03:2880:22ff:3::face:b00c (1 request)
ip 2: 2a03:2880:22ff:b::face:b00c (2 requests)
ASN: AS32934 FACEBOOK
I can, by the way, confirm this issue. I work at a large newspaper in Norway, and around a year ago we saw the same issue: thousands of requests per second until we blocked it. And after we blocked it, traffic to our Facebook page also plummeted. I assume Facebook considered our website down and thus wouldn't give users content from our Facebook page either, as that would serve them content that would give a bad user experience. The Facebook traffic did not normalize until the attack stopped AND we had told Facebook to reindex all our content.
If you want more info, send me an email and I'll dig out some logs etc. thu at db.no
I had no idea anyone would find this so fast, so I thought it would be safe to change the url to match the final title before posting anywhere -- how wrong I was.
I mean, how non-trivial the performance improvement will be really depends on your application, but this statement isn't theoretical -- memory-bound systems are a major case where being able to transfer cold pages out to swap can be a big win. In such systems, optimal efficiency is a balancing act: making full use of memory overall without causing excessive memory pressure. Swap can not only reduce pressure, but often allows reclaiming enough pages that we can increase application performance when memory is the constraining factor.
The real question is why those pages are being held in RAM. If they're needed, swapping them out will induce latency. If they're a leak or not needed, the application should be fixed to not allocate swathes of RAM it does not use.
There are some systems which are memory-bound by nature, not as a consequence of poor optimisation, so it's not really as simple as "needed" or "not needed". As a basic example, in compression, more memory available means that we can use a larger window size, and therefore have the opportunity to achieve higher compression ratios. There are plenty of more complex examples -- a lot of mapreduce work can be made more efficient with more memory available, for example.
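As a toy illustration of the window-size point (this just uses DEFLATE's wbits parameter on synthetic data, so it says nothing about any particular production workload):

    import os
    import zlib

    # A 4 KiB random block repeated: the repeats are too far apart for a
    # 512-byte window to see, but well within a 32 KiB window.
    block = os.urandom(4096)
    data = block * 64

    def deflate_size(payload, wbits):
        c = zlib.compressobj(9, zlib.DEFLATED, wbits)
        return len(c.compress(payload) + c.flush())

    print("512 B window: ", deflate_size(data, 9))   # barely compresses
    print("32 KiB window:", deflate_size(data, 15))  # finds the repeated block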
Indeed. None of the above are typically used (as in most of the time) on desktop systems where swap is the most problematic.
As for compression, the only engines I know of that want more than 128 MB of RAM are lrzip and other rzip derivatives.
Common offenders that bog down the system in swap for me as a developer are the web browser, the JVM (Android), and Electron-based apps (messengers, too).
I would also like a source that substantiates the claim that using swap in map-reduce workloads actually helps.
Or perhaps in database workloads. Or on any machine with a relatively fixed workload.
> how much time did I spend waiting to swap-in a page
You can do this with eBPF/BCC by using funclatency (https://github.com/iovisor/bcc/blob/master/tools/funclatency...) to trace swap-related kernel calls. It depends on exactly what you want, but take a look at mm/swap.c and you'll probably find a function that gives you the semantics you want.
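For instance, you can point funclatency at a single symbol (e.g. `funclatency -u swap_readpage`). If it helps, the sketch below is roughly what that does under the hood via the BCC Python bindings -- swap_readpage (which lives in mm/page_io.c on kernels of this era) is just one candidate symbol, and the exact function name may differ on your kernel:

    from time import sleep
    from bcc import BPF

    # BPF program: record an entry timestamp per-thread, then log2-bucket the
    # entry->return latency of the traced function.
    prog = """
    #include <uapi/linux/ptrace.h>

    BPF_HASH(start, u32, u64);
    BPF_HISTOGRAM(dist);

    int trace_entry(struct pt_regs *ctx) {
        u32 tid = bpf_get_current_pid_tgid();
        u64 ts = bpf_ktime_get_ns();
        start.update(&tid, &ts);
        return 0;
    }

    int trace_return(struct pt_regs *ctx) {
        u32 tid = bpf_get_current_pid_tgid();
        u64 *tsp = start.lookup(&tid);
        if (tsp == 0)
            return 0;
        dist.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
        start.delete(&tid);
        return 0;
    }
    """

    b = BPF(text=prog)
    # swap_readpage is an assumption -- check your kernel for the right symbol.
    b.attach_kprobe(event="swap_readpage", fn_name="trace_entry")
    b.attach_kretprobe(event="swap_readpage", fn_name="trace_return")

    print("Tracing swap_readpage latency... Ctrl-C to end.")
    try:
        sleep(999999)
    except KeyboardInterrupt:
        pass
    b["dist"].print_log2_hist("usecs")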
> Once desktop systems and applications support required APIs to handle saving state before being shut down.
SIGTERM and friends? :-)
If your application is just dropping state on the floor as a result of having an intentionally trappable signal being sent to it or its children, that seems like a bug.
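To be concrete, here's a minimal sketch of what I mean -- the state file path and its contents are obviously placeholders:

    import json
    import signal
    import sys

    STATE_FILE = "/tmp/app-state.json"  # placeholder path

    def save_state_and_exit(signum, frame):
        # Persist whatever actually matters for your application here.
        with open(STATE_FILE, "w") as f:
            json.dump({"clean_shutdown": True}, f)
        sys.exit(0)

    signal.signal(signal.SIGTERM, save_state_and_exit)
    signal.pause()  # a real application would run its main loop here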
> I never saw programs I would actually want to use try to allocate that much
What about applications that have memory-bound performance characteristics? In these cases, saving a bit of memory often directly translates into throughput, which translates into $$$.
This isn't theoretical: a bunch of services I've run in the past (and run currently) literally make more money because of swap. By using memory more efficiently and monitoring memory pressure metrics instead of just "freeness" (which is not really measurable anyway), we allow more efficient use of the machine overall.
One problem with relying on the OOM killer in general is that the OOM killer is only invoked in moments of extreme starvation of memory. We really have no ability currently in Linux (or basically any operating system using overcommit) to determine when we're truly "out of memory", so the main metric used is our success or failure to reclaim enough pages to meet new demands.
As for the analogy -- there are metrics you can use today to bat away grandma before she starts hoarding too much. We have metrics for how much grandma is putting in the house (memory.stat), for the rate at which we kick our own stuff out of the house just to appease grandma only to realise we removed stuff we actually need (memory.stat -> workingset_refault), and similar. Using this and Johannes' recent work on memdelay (see https://patchwork.kernel.org/patch/10027103/ for some recent discussion), it's possible to see memory pressure before it actually impacts the system and drives things into swap.
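For example, something like the below tracks refaults over a window (cgroup v2 assumed; the cgroup path is a placeholder, and on newer kernels the counter is split into workingset_refault_anon/_file):

    import time

    CGROUP = "/sys/fs/cgroup/system.slice/myservice.service"  # placeholder

    def memory_stat(cgroup):
        stats = {}
        with open(cgroup + "/memory.stat") as f:
            for line in f:
                key, value = line.split()
                stats[key] = int(value)
        return stats

    before = memory_stat(CGROUP)
    time.sleep(60)
    after = memory_stat(CGROUP)

    # Refaults: pages we reclaimed to "make room" but then needed again.
    print("workingset refaults in the last minute:",
          after["workingset_refault"] - before["workingset_refault"])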
> One problem with relying on the OOM killer in general is that the OOM killer is only invoked in moments of extreme starvation of memory. We really have no ability currently in Linux (or basically any operating system using overcommit) to determine when we're truly "out of memory", so the main metric used is our success or failure to reclaim enough pages to meet new demands.
The problem with relying on swap instead of the OOM killer is that it's the user who gets invoked in moments of extreme memory starvation, and the whole machine gets rebooted. The OOM killer is far gentler; it only kills processes until the extreme starvation is resolved.
> In my experience, a misbehaving linux system that's out of RAM and has swap to spare will be unusably slow.
Yeah, this is basically the main drawback of swap. I tried to address this somewhat in the article and the conclusion:
> Swap can make a system slower to OOM kill, since it provides another, slower source of memory to thrash on in out of memory situations – the OOM killer is only used by the kernel as a last resort, after things have already become monumentally screwed. The solutions here depend on your system:
> - You can opportunistically change the system workload depending on cgroup-local or global memory pressure. This prevents getting into these situations in the first place, but solid memory pressure metrics are lacking throughout the history of Unix. Hopefully this should be better soon with the addition of refault detection.
> - You can bias reclaiming (and thus swapping) away from certain processes per-cgroup using memory.low, allowing you to protect critical daemons without disabling swap entirely.
Have a go at setting a reasonable memory.low on applications that require low latency/high responsiveness and see what the results are -- in this case, that's probably Xorg, your WM, and dbus.
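Something like the below (cgroup v2; the cgroup path and the 2 GiB figure are purely illustrative -- with systemd you'd normally set this via MemoryLow= in the unit instead):

    # Protect a latency-sensitive cgroup from reclaim with memory.low.
    CGROUP = "/sys/fs/cgroup/user.slice"  # placeholder: pick the cgroup Xorg/your WM live in
    PROTECT_BYTES = 2 * 1024**3           # 2 GiB, purely illustrative

    with open(CGROUP + "/memory.low", "w") as f:
        f.write(str(PROTECT_BYTES))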