This is exactly the kind of post that makes me want to actually apply, mostly because I can relate to it: it's not magical, doesn't require gazillions of TPUs or petabytes of storage; it's plain old, excellent engineering.
I wonder how often these issues creep up in practice, though, and how long it took the author to sort this one out! I've had my share of compiler and kernel bugs, and they're usually quite expensive to understand; it takes a long time to convince yourself that the bug is not in your own code (granted, there's an oops here!).
I had a similar experience recently when we were trying to get the AMDGPU Linux kernel driver to run without panicking.
./scripts/decodecode is what produces the disassembly of the code trace from the panic. (Seeing its output converted to Intel syntax in this post is... heresy.)
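For anyone who hasn't used it: from inside a kernel tree you paste the whole oops, including the "Code:" line, on stdin (the file name here is just a placeholder):

    ./scripts/decodecode < oops.txt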
For AMDGPU, the issue was that the x86 Linux kernel doesn't use 16-byte stack alignment (it uses 8-byte alignment), yet the AMDGPU driver alone was forcing 16-byte stack alignment back on for itself, because it uses SSE2 instructions that require a 16-byte-aligned stack. In the trace, seeing RSP as a multiple of 8 but not of 16 was the smoking gun (i.e. a single register).
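The check itself is trivial once you know to look for it; a tiny illustrative snippet (the RSP value is made up):

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical RSP value copied out of an oops */
        unsigned long rsp = 0xffffa0c940c07d28UL;

        /* SSE instructions like movaps fault when their stack operand
         * is not 16-byte aligned; an RSP that is a multiple of 8 but
         * not of 16 means caller and callee disagree about alignment */
        if (rsp % 16 != 0 && rsp % 8 == 0)
            printf("stack is 8-byte but not 16-byte aligned here\n");
        return 0;
    }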
Superb article:
- shows how to debug a kernel oops
- shows use of extra tools (bpf, scapy, KASAN)
- dives deep into the more esoteric bits of networking, explaining everything from the basics up to how the kernel implementation works
- proves the theory in multiple ways
Yes, it is a big "come work here on interesting stuff with fantastic people" pitch (and a chance to show off your own skills and learn something neat) - and it's done without bragging, and I'm sure it will help others debug their next oops :)
Nobody should have to decode register values, look up source files versions, or reconstruct a stack trace by hand. Crashes, both in user- and kernel-space, should produce neat, tidy, self-contained dumps that include not only the entire machine state, but globally unique build IDs for all binaries involved in the crash. And the debugger ought to be able to load one of these crash dumps and find the debug symbols and source files automatically.
Windows has been able to do this for decades. Why, in Unix-land, are we still reading text reports about crashes and puzzling over specific register values?
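The pieces do exist on the ELF side, at least: toolchains already stamp binaries with a GNU build ID that a debugger could key symbol lookup on (assuming the binary was linked with a build-id note):

    readelf -n /path/to/vmlinux | grep 'Build ID'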
I don't know why this is downvoted. A crashdump would have made this 10x easier to debug.
The answer is that "it's Linux" and crashdumps (except on a few distros) are not part of the culture. These days Linux actually kexec's another kernel, with reserved physical memory, to handle producing the dumps. This was absurdly hard to set up the last time I needed to do Linux kernel debugging (years ago).
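For reference, the moving parts are roughly as follows (distro packaging differs and the paths here are illustrative): reserve memory with the crashkernel= boot parameter, load a capture kernel with kexec -p (usually done by a kdump service), and after a panic read the dump out of /proc/vmcore in the capture environment.

    # kernel command line of the production kernel
    crashkernel=256M

    # load the panic/capture kernel (illustrative paths)
    kexec -p /boot/vmlinuz-$(uname -r) \
        --initrd=/boot/initramfs-$(uname -r).img \
        --append="root=/dev/sda1 irqpoll maxcpus=1"

    # after a panic the capture kernel boots; the dump is available
    # as /proc/vmcore and can be saved with cp or makedumpfile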
The Linux kernel is generally compiled without frame pointers, and with other optimizations that are hostile to debugging.
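A few Kconfig knobs do help when you control the build (an illustrative subset, not a complete recipe):

    CONFIG_DEBUG_INFO=y        # DWARF, so a debugger or the crash utility can make sense of a dump
    CONFIG_FRAME_POINTER=y     # or CONFIG_UNWINDER_ORC=y on x86_64 for reliable stack traces
    CONFIG_KASAN=y             # the sanitizer mentioned in the article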
We have kernel and userspace crashdumps enabled on our fleet of FreeBSD CDN servers, and having a real crashdump and not just a stack trace has been super helpful.
I agree with you, but we should leave frame pointers out of this - they have almost nothing to do with it; libunwind knows how to read DWARF anyway. It's more of an implementation detail.
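To make that concrete: DWARF-based unwinding works fine without frame pointers. A minimal local-unwind sketch with libunwind (link with -lunwind):

    #define UNW_LOCAL_ONLY
    #include <libunwind.h>
    #include <stdio.h>

    /* walk the current stack using DWARF CFI; no frame pointers needed */
    static void dump_stack(void)
    {
        unw_context_t ctx;
        unw_cursor_t cursor;
        unw_word_t ip, off;
        char name[128];

        unw_getcontext(&ctx);
        unw_init_local(&cursor, &ctx);
        while (unw_step(&cursor) > 0) {
            unw_get_reg(&cursor, UNW_REG_IP, &ip);
            if (unw_get_proc_name(&cursor, name, sizeof(name), &off) != 0)
                name[0] = '\0';
            printf("0x%lx: %s+0x%lx\n", (unsigned long)ip, name, (unsigned long)off);
        }
    }

    int main(void)
    {
        dump_stack();
        return 0;
    }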
Microsoft publishes symbols that you can use to decode kernel crashes. The availability of source code isn't my point though: the point is that parsing text debug dumps is stone age crap and we should move to a model where crashes produce self-describing crash dumps that users can load into debuggers that find all relevant metadata automatically.
You shouldn't have to puzzle over the meaning of each x86_64 register to figure out what value a function parameter had at crash time.
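For the record, the mental lookup table being complained about is the SysV AMD64 calling convention; a trivial reminder program (and even then, the compiler may have moved or clobbered the values by the time of the fault):

    #include <stdio.h>

    /* under the SysV AMD64 ABI the first six integer/pointer arguments
     * are passed in these registers, in order */
    static const char *arg_regs[] = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };

    int main(void)
    {
        for (int i = 0; i < 6; i++)
            printf("arg%d -> %%%s\n", i + 1, arg_regs[i]);
        return 0;
    }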
Debugging tools in Linux are in a really sad state indeed, certainly for collecting intel on production machines when crashes happen. Ironically, the kernel is almost advanced here, because at least you can rely on the stack trace working.
In userland, there is ancient stuff like libunwind - the last time I tried it on ARM, it was too dumb to follow the instruction pointer across a NULL function call. Not to mention there is some sort of deadlock when you use an external crashdump handler: the crashed process hangs in the kernel trying to feed your handler the dump, and that interferes with using ptrace to find out what happened. The only reliable way is to put the crash handling hooks directly into libc, like Android does - but that is not a thing at all with musl/glibc.
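A sketch of the "hooks in the crashing process itself" approach the parent describes (simplified: a real handler would stick to async-signal-safe calls and write to a pre-opened fd):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <ucontext.h>
    #include <unistd.h>

    /* catch the fault in-process and dump the machine state ourselves,
     * instead of waiting for an external ptrace-based handler */
    static void on_crash(int sig, siginfo_t *info, void *uc_void)
    {
        ucontext_t *uc = uc_void;
        fprintf(stderr, "signal %d at address %p\n", sig, info->si_addr);
    #ifdef __x86_64__
        fprintf(stderr, "rip=%llx rsp=%llx\n",
                (unsigned long long)uc->uc_mcontext.gregs[REG_RIP],
                (unsigned long long)uc->uc_mcontext.gregs[REG_RSP]);
    #endif
        _exit(128 + sig);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_crash;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        *(volatile int *)0 = 42;   /* deliberate NULL dereference */
        return 0;
    }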
Yes, excellent article but clickbaity title.
Honestly, all memory access errors eventually boil down to "one register value", and probably many other error types do too. I mean, what else is there if we go down far enough, really?
Wow, that was a journey and a very well-written technical article! I didn't understand a good half of it, since it's been 20 years since my single class in assembly language. But it makes me feel good that people like this exist!
When was this bug introduced? Does anyone maintain a list of when all the known Linux kernel bugs were introduced? I'd love to know how many bugs are added to the kernel each year, and if the rate is changing.
I'm not trying to troll. I think the Linux kernel is an amazing piece of software engineering. I just think this would be an interesting metric.
The kernel folks do a pretty good job of keeping track of which past commits a new commit "fixes", which they put in the commit message. For example, the patch linked in the article carries exactly such a Fixes: tag.
That said, using this to track how many bugs are introduced each year is problematic. It's often the case that commit A introduces a bug, commit B aims to fix it and says "Fixes: A" but turns out to be only a partial fix, and then commit C completes the fix and says "Fixes: B". Naively, based on these annotations, you would say "B introduced a bug", but as my example shows, this isn't always the case.
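If you want a crude version of the metric anyway, the trailers are machine-readable; something like this in a kernel tree counts fixes that landed in a given year (which, per the caveat above, is not the same as bugs introduced):

    git log --no-merges --since=2021-01-01 --until=2022-01-01 \
        --grep='^Fixes: ' --oneline | wc -l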
Brilliant!! What's really cool is that Jakub approached the crash systematically. There are hard bugs in life, e.g. cache inconsistencies and kernel bugs. What's more important than fixing them with a one-off solution is coming up with a systematic approach to them. Great job!
What a fun read! Was the bug fixed as a result of this investigation, or was the fix already in a patch that these machines didn't have? I wasn't able to figure that out from reading the article.
It was fixed as a result of this investigation. Jakub opens with "About a year ago..." and closes with a link to the fix thread where he gets it accepted into the kernel back in July 2021.
Microsoft is working on a project to formalize various parts of the web stack, and it would be interesting if their work also carried over to the lower parts of the networking stack, like in this article. [1] I suspect this bug would have been caught if the segment handling logic had been implemented in a language with a formal specification for the segment headers.
Is Cloudflare working on any formalization efforts like Microsoft's Project Everest?
Oh you think Google is cool? Bro CloudFlare hackers can debug a kernel panic through just a single register. I'm going to install CloudFlare on my website because everyone's doing it!