You're neglecting the fact that which distro you choose has a large influence on the kernel version you get to run
That's nonsense. Most EC2 AMIs are linked to the Amazon AKIs, which are unrelated to whatever distro the AMI contains. Most of my Debian instances run on a kernel tagged "fc8xen".
The ability to chainload a self-compiled kernel on EC2 is a relatively recent invention (mid-2010) and I have yet to see a good reason to do that for Linux.
Unfortunately, the article does not mention which AKI(s) are affected, but it seems likely this bug was introduced because someone figured "newer is better" and went with the latest Ubuntu kernel instead of sticking to a proven Amazon AKI.
This bug affected several "supported" AMIs running 2.6.32 series kernels that we tested at SimpleGeo, including the official AMI released by Canonical. After we ran out of patience debugging this stuff we contacted Amazon and worked on the issue with a guy from their kernel team (who was really helpful, fwiw). He agreed that the behavior was bizarre and opened an upstream bug with Canonical [1].
You're sort of contradicting yourself here. You suggest that the distro you're running is independent of the kernel version you're running. But then you go on to claim that this bug was introduced by someone who was not running the default supported kernel. Are you saying that people should run the supported kernel, and be tied to whatever's supported upstream, or are you saying they should risk building their own? Clearly there are benefits and drawbacks either way.
Amazon's official linux build, at this time, is a custom distribution that uses RPM and is different from the Ubuntu/Debian world in several significant ways (e.g., a new libc implementation). Migrating your WordPress blog to a new platform might be easy, but when you're managing hundreds of machines running thousands of packages that sort of change is not trivial.
Running "time-tested" kernels is not really the best advice in this case either. Xen is a fairly new environment, and EC2's implementation has some quirks, so there's a pretty regular stream of bug fixes and other improvements in recent kernels that are often worth picking up. If I went to Canonical with a "time-tested" kernel bug they'd tell me to upgrade before they'd give any real support.
When we talked to Amazon about switching to their AMIs they advised us that it was probably _not_ worth switching, that switching might not fix the problem, and that the AMIs we were running were widely used and supported. They made it clear that they work closely with Canonical and other providers to get high quality AMIs into their ecosystem. Long story short, the people who you admit know the most about the EC2 environment advised us that they weren't necessarily the best option, or at least not the only option, for good AMIs (sort of like how hardware manufacturers aren't the best option for an operating system).
So the answers aren't really cut-and-dried here. Every time Amazon changes their dom0 there's a chance your "time-tested" kernel will stop working. And just because Amazon runs the infrastructure doesn't mean they're the best choice for a Linux distribution.
You keep lumping the Linux distribution in with the Linux kernel. They are separate things. You can run Ubuntu on an Amazon AKI. I run Debian on Amazon AKIs. And if the latest Ubuntu depends on a particular kernel feature that the tested kernels don't have, then it's probably not a good idea to run the latest Ubuntu.
Canonical maintains and updates their own AKIs for their official Ubuntu AMIs, which is what we're running in production. I suggest deferring to their judgement. Here's a full release history of the Ubuntu 10.04 server AMIs; note the changing AKIs, and note that none of them are the default ones provided by Amazon:
Looks like their judgement didn't work out so well this time. I'd be wary of running the latest untested kernel for no reason other than "because we can".
So Canonical, a distribution maker, is not to be trusted for their kernel suggestions, but Amazon is infallible? On what basis?
We asked the Amazon kernel team if we should try switching to one of their kernels/distros, and they said "No, just upgrade to Maverick and the accompanying kernel." It's been pointed out that Maverick has its own set of Xen bugs. I guess Amazon doesn't know everything.
The horse you're getting on about using the "proven" Amazon kernels is a bit high. Turns out this whole virtualization thing is somewhat new, and the kinks are still being worked out. Old kernel builds don't work particularly well because a lot of their assumptions are broken by virtualization; new kernels are what they are - new.
(Edit, forgot initially): Finally, we ran 10.04 - the Long Term Support release of Ubuntu from a year ago. There was no "because we can."
Frankly, I'm a bit amazed at your disdain for people sharing their findings from practical experience running into these issues in high-load production environments.
So Canonical, a distribution maker, is not to be trusted for their kernel suggestions, but Amazon is infallible?
Neither is infallible. But Amazon probably knows the intricacies of their platform better than Canonical. And they likely run some of their own stuff on these kernels for a while before releasing them to the public.
Old kernel builds don't work particularly well
Don't work in what way? This is the first time I've heard about a kernel problem on EC2.
disdain
I don't see where I voiced disdain. I merely responded to the guy who claimed your EC2 kernel is linked to the distro you run. That's simply not true.
If this is the first time you've heard about a kernel problem on EC2, you're probably not managing a very large EC2 infrastructure [1, 2]. Even in non-virtualized environments, at scale, it's common to run into Linux kernel bugs, or at least peculiarities. Which is why large tech organizations invariably employ kernel dev teams.
The guy who claimed EC2 kernels are linked to the distro you run was simply claiming that, unless you want to go it on your own, you're tied to the kernel provided by a supported AMI. As you've suggested multiple times, there are benefits to running an environment that is supported and that other people have operational experience with. Honestly, I'm not even sure what you're arguing anymore... seems like you're just being antagonistic.
There are plenty of AMIs based on stable AKIs out there. Moreover, if you manage a "very large EC2 infrastructure" then you don't rely on third-party AMIs, do you?
Finally, your links point to... Ubuntu bugs.
If I missed one that was traced back to an Amazon AKI then a deep link would be appreciated.