This is exactly the kind of post that makes me want to actually apply, mostly because I can relate to it: it's not magical, doesn't require gazillions of TPUs or petabytes of storage; it's plain old, excellent engineering.
I wonder how often these issues creep up in practice, though, and how long it took the author to sort this one out! I've had my share of compiler and kernel bugs, and they're usually quite expensive to understand; it takes a long time to convince yourself that the bug is not in your own code (granted, there's an oops here!).
I had a similar experience recently when we were trying to get the AMDGPU Linux kernel driver to run without panicking.
./scripts/decodecode is what produces the disassembly of the code trace from the panic. (Seeing its output converted to Intel syntax in this post is... heresy.)
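For anyone who hasn't used it: from inside a kernel tree you paste the whole oops, including the "Code:" line, on stdin (the file name here is just a placeholder):

    ./scripts/decodecode < oops.txt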
For AMDGPU, the issue was that the x86 Linux kernel doesn't use 16-byte stack alignment (it uses 8-byte alignment), yet the AMDGPU driver alone was forcing 16-byte stack alignment back on for itself, because it uses SSE2 instructions that require a 16-byte-aligned stack. In the trace, seeing RSP as a multiple of 8 but not of 16 was the smoking gun (i.e. a single register).
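The check itself is trivial once you know to look for it; a tiny illustrative snippet (the RSP value is made up):

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical RSP value copied out of an oops */
        unsigned long rsp = 0xffffa0c940c07d28UL;

        /* SSE instructions like movaps fault when their stack operand
         * is not 16-byte aligned; an RSP that is a multiple of 8 but
         * not of 16 means caller and callee disagree about alignment */
        if (rsp % 16 != 0 && rsp % 8 == 0)
            printf("stack is 8-byte but not 16-byte aligned here\n");
        return 0;
    }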
Superb article:
- shows how to debug a kernel oops
- shows use of extra tools (bpf, scapy, KASAN)
- dives deep into the more esoteric bits of networking, explaining everything from the basics up to how the kernel implementation works
- proves the theory in multiple ways
Yes, it is a big "come work here on interesting stuff with fantastic people" pitch (and a chance to show off your own skills and learn something neat) - and it's done without bragging, and I'm sure it will help others debug their next oops :)
Nobody should have to decode register values, look up source files versions, or reconstruct a stack trace by hand. Crashes, both in user- and kernel-space, should produce neat, tidy, self-contained dumps that include not only the entire machine state, but globally unique build IDs for all binaries involved in the crash. And the debugger ought to be able to load one of these crash dumps and find the debug symbols and source files automatically.
Windows has been able to do this for decades. Why, in Unix-land, are we still reading text reports about crashes and puzzling over specific register values?
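The pieces do exist on the ELF side, at least: toolchains already stamp binaries with a GNU build ID that a debugger could key symbol lookup on (assuming the binary was linked with a build-id note):

    readelf -n /path/to/vmlinux | grep 'Build ID'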
I don't know why this is downvoted. A crashdump would have made this 10x easier to debug.
The answer is that "it's Linux" and crashdumps (except on a few distros) are not part of the culture. These days Linux actually kexec's another kernel, with reserved physical memory, to handle producing the dumps. This was absurdly hard to set up the last time I needed to do Linux kernel debugging (years ago).
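For reference, the moving parts are roughly as follows (distro packaging differs and the paths here are illustrative): reserve memory with the crashkernel= boot parameter, load a capture kernel with kexec -p (usually done by a kdump service), and after a panic read the dump out of /proc/vmcore in the capture environment.

    # kernel command line of the production kernel
    crashkernel=256M

    # load the panic/capture kernel (illustrative paths)
    kexec -p /boot/vmlinuz-$(uname -r) \
        --initrd=/boot/initramfs-$(uname -r).img \
        --append="root=/dev/sda1 irqpoll maxcpus=1"

    # after a panic the capture kernel boots; the dump is available
    # as /proc/vmcore and can be saved with cp or makedumpfile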
The Linux kernel is generally compiled without frame pointers, and with other optimizations that are hostile to debugging.
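A few Kconfig knobs do help when you control the build (an illustrative subset, not a complete recipe):

    CONFIG_DEBUG_INFO=y        # DWARF, so a debugger or the crash utility can make sense of a dump
    CONFIG_FRAME_POINTER=y     # or CONFIG_UNWINDER_ORC=y on x86_64 for reliable stack traces
    CONFIG_KASAN=y             # the sanitizer mentioned in the article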
We have kernel and userspace crashdumps enabled on our fleet of FreeBSD CDN servers, and having a real crashdump and not just a stack trace has been super helpful.
I agree with you, but we should leave frame pointers out of this - they have almost nothing to do with it; libunwind knows how to read DWARF anyway. It's more of an implementation detail.
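To make that concrete: DWARF-based unwinding works fine without frame pointers. A minimal local-unwind sketch with libunwind (link with -lunwind):

    #define UNW_LOCAL_ONLY
    #include <libunwind.h>
    #include <stdio.h>

    /* walk the current stack using DWARF CFI; no frame pointers needed */
    static void dump_stack(void)
    {
        unw_context_t ctx;
        unw_cursor_t cursor;
        unw_word_t ip, off;
        char name[128];

        unw_getcontext(&ctx);
        unw_init_local(&cursor, &ctx);
        while (unw_step(&cursor) > 0) {
            unw_get_reg(&cursor, UNW_REG_IP, &ip);
            if (unw_get_proc_name(&cursor, name, sizeof(name), &off) != 0)
                name[0] = '\0';
            printf("0x%lx: %s+0x%lx\n", (unsigned long)ip, name, (unsigned long)off);
        }
    }

    int main(void)
    {
        dump_stack();
        return 0;
    }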
Microsoft publishes symbols that you can use to decode kernel crashes. The availability of source code isn't my point though: the point is that parsing text debug dumps is stone age crap and we should move to a model where crashes produce self-describing crash dumps that users can load into debuggers that find all relevant metadata automatically.
You shouldn't have to puzzle over the meaning of each x86_64 register to figure out what value a function parameter had at crash time.
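For the record, the mental lookup table being complained about is the SysV AMD64 calling convention; a trivial reminder program (and even then, the compiler may have moved or clobbered the values by the time of the fault):

    #include <stdio.h>

    /* under the SysV AMD64 ABI the first six integer/pointer arguments
     * are passed in these registers, in order */
    static const char *arg_regs[] = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };

    int main(void)
    {
        for (int i = 0; i < 6; i++)
            printf("arg%d -> %%%s\n", i + 1, arg_regs[i]);
        return 0;
    }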
Debugging tools in Linux are in a really sad state indeed, certainly for collecting intel on production machines when crashes happen. Ironically, the kernel is almost advanced here, because at least you can rely on the stack trace working.
In userland, there is ancient stuff like libunwind - the last time I tried it on ARM, it was too dumb to follow the instruction pointer across a NULL function call. Not to mention there is some sort of deadlock when you use an external crashdump handler: the crashed process hangs in the kernel trying to feed your handler the dump, and that interferes with using ptrace to find out what happened. The only reliable way is to put the crash handling hooks directly into libc, like Android does - but that is not a thing at all with musl/glibc.
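A sketch of the "hooks in the crashing process itself" approach the parent describes (simplified: a real handler would stick to async-signal-safe calls and write to a pre-opened fd):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <ucontext.h>
    #include <unistd.h>

    /* catch the fault in-process and dump the machine state ourselves,
     * instead of waiting for an external ptrace-based handler */
    static void on_crash(int sig, siginfo_t *info, void *uc_void)
    {
        ucontext_t *uc = uc_void;
        fprintf(stderr, "signal %d at address %p\n", sig, info->si_addr);
    #ifdef __x86_64__
        fprintf(stderr, "rip=%llx rsp=%llx\n",
                (unsigned long long)uc->uc_mcontext.gregs[REG_RIP],
                (unsigned long long)uc->uc_mcontext.gregs[REG_RSP]);
    #endif
        _exit(128 + sig);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_crash;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        *(volatile int *)0 = 42;   /* deliberate NULL dereference */
        return 0;
    }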
Yes, excellent article but clickbaity title.
Honestly, all memory access errors eventually boil down to "one register value", and probably many other error types do too. I mean, what else is there if we go down far enough, really?
Wow, that was a journey and a very well-written technical article! I didn't understand a good half of it, since it's been 20 years since my single class in assembly language. But it makes me feel good that people like this exist!
When was this bug introduced? Does anyone maintain a list of when all the known Linux kernel bugs were introduced? I'd love to know how many bugs are added to the kernel each year, and if the rate is changing.
I'm not trying to troll. I think the Linux kernel is an amazing piece of software engineering. I just think this would be an interesting metric.
The kernel folks do a pretty good job of keeping track of which past commits a new commit "fixes", which they put in the commit message. For example, the patch linked in the article carries exactly such a Fixes: tag.
That said, using this to track how many bugs are introduced each year is problematic. It's often the case that commit A introduces a bug, commit B aims to fix it and says "Fixes: A" but turns out to be only a partial fix, and then commit C completes the fix and says "Fixes: B". Naively, based on these annotations, you would say "B introduced a bug", but as my example shows, this isn't always the case.
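If you want a crude version of the metric anyway, the trailers are machine-readable; something like this in a kernel tree counts fixes that landed in a given year (which, per the caveat above, is not the same as bugs introduced):

    git log --no-merges --since=2021-01-01 --until=2022-01-01 \
        --grep='^Fixes: ' --oneline | wc -l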
Brilliant!! What's really cool is that Jakub approached the crash systematically. There are hard bugs in life, e.g. cache inconsistencies and kernel bugs. What's more important than fixing them with a one-off solution is coming up with a systematic approach to them. Great job!
What a fun read! Was the bug fixed as a result of this investigation, or was the fix already in a patch that these machines didn't have? I wasn't able to figure that out from reading the article.
It was fixed as a result of this investigation. Jakub opens with "About a year ago..." and closes with a link to the fix thread where he gets it accepted into the kernel back in July 2021.
Microsoft is working on a project to formalize various parts of the web stack, and it would be interesting if their work also carried over to the lower parts of the networking stack, like in this article. [1] I suspect this bug would have been caught if the segment handling logic had been implemented in a language with a formal specification for the segment headers.
Is Cloudflare working on any formalization efforts like Microsoft's Project Everest?
Oh you think Google is cool? Bro CloudFlare hackers can debug a kernel panic through just a single register. I'm going to install CloudFlare on my website because everyone's doing it!