I can't read something like this without feeling really bad for everyone involved and taking a quick mental inventory of things I've screwed up in the past or potentially might in the future. Pressing the enter key on anything that affects a big-dollar production system is (and should be) slightly terrifying.
I have the same fear, but I wonder if that fear stems from a lack of training, documentation, and/or time.
When I ask myself why I am afraid of deploying to production servers, it's always because I don't fully understand what the deployment process does. If I had to deploy manually I would be lost at sea.
Is it just naivete to think that enough documentation and training makes that fear go away?
Computer systems are rarely useful on their own. They need to be attached to some other process in order to derive any value from them. Sometimes it's a network, sometimes it's a factory, sometimes it's just a process run by humans.
In most domains that you attach computers to, it's possible for one person to understand both the computer side of it and the other side of it. Web development is like this: you can easily understand both the web and computer systems, as one is really just another version of the other.
If you're attaching the computer system to another very hard-to-understand system, like, say, a surgical robot, then the person that understands both domains well enough to avoid problems like this is a unicorn; there might well be only one person in the world that's built up expertise in both fields.
To get around this you need careful, effective management of the two pools of labor, and robust dialogue between the two teams. The second this starts breaking down is the second you start marching down the road to disaster.
In the aforementioned example, everyone would be well aware of the failure modes, so it's a bit easier to manage. In finance, failure modes can be so subtle, particularly as the system grows more complex, that they can escape detection by both teams unless they're both checking each other's work and keeping each other honest.
Institutional knowledge transfer has to be happening constantly. Areas of ignorance on both sides have to be appraised constantly, and plans undertaken to reduce said ignorance. The more everyone knows about the entire system, the more likely it will be that critical defects like this can be discovered before they strike.
This kind of effective interaction of those at the bottom can only be organized and directed at the top. It's very much a "captains winning the war"-type situation, but captains can't lead without support from the generals.
Do you have any advice with regard to institutional knowledge transfer, or have you seen any examples of when this was done exceptionally well?
Knowledge transfer problems have been a running theme at many of my previous workplaces, and I'm interested in what I can do to help. I'm a documentation proponent, but there must be more.
I've always thought that it's beneficial to wear multiple hats and sit with multiple teams/people, either throughout your employment or at the beginning. I think the more hands-on you are with every aspect of a business, the less likely you are to insulate yourself or create silos and barriers.
It's a really hard problem. From a personal standpoint as a coder, the problem with documentation is that it has to be maintained same as the rest of the application it's documenting. That's why you see the push towards self-documenting code in the Ruby world, where you can just look at a code file and know exactly what it's doing because convention. Every tool you add on to your workload doesn't just impose an initial dev cost, but also an ongoing maintenance cost.
When you're also dealing with humans, you have to pay a management cost too. So you have some documentation; where are you going to put it? Is there a repository somewhere where you could put it where anyone who wants to work on the program later will know to look? Often there isn't. So you have to make one. It will need the use of company resources. You need to educate people that there's this place for documentation and everyone needs to use it, because it won't make any sense for a documentation repository to have just your own documentation in it.
The size and scope of what it takes to be effective at this makes it a management problem: resources need to be allocated, and directions have to be given. Someone has to drive the project, to make it happen above and beyond his job duties.
Most of the time this happens when something big fails. Today, we found out that an important scheduled email had not gone out for the last eight weeks. The stakeholders did not notice that the email was not hitting their desks every MWF as usual. The problem only got fixed because another process, one that was less important but actually had someone on the ball enough to notice it was failing, drew a complaint. When the important email started coming in again, that prompted a giant WTF. My reaction is a giant shrug: if it's important to you, you need to be monitoring it. I'm not omniscient. That pushes the responsibility back onto management.
So the knowledge transfer in this situation goes as follows. First, I need to know which business processes are important; "all of them" is not an acceptable answer. Second, other teams need to be aware that when systems are automated, they're not really being monitored; that's what automation means. Whoever is in charge needs to delegate a human to do that monitoring manually. That person can't be me, my time is too valuable for that shit.
Eventually, I can build a system for getting the kind of feedback that wasn't built into the system in the first place, maybe some kind of job verification system that, after enough tweaking, makes the system as a whole more reliable. But that still won't remove the necessity for some human to have the job at the end point of the system to ensure that the information is flowing on time and on target. No matter how robust the system is, there will still be silent failure modes that can go for months or years unnoticed.
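As a rough illustration of what that kind of job verification could look like, here's a minimal sketch in Python. It assumes every scheduled job records the time of its last successful run somewhere the checker can read. The names (`Job`, `max_silence`, `find_stale_jobs`, `mwf_report_email`) are made up for the example and don't come from any real system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    name: str               # hypothetical job identifier
    max_silence: timedelta  # longest acceptable gap between successful runs


def find_stale_jobs(jobs, last_success, now):
    """Return the names of jobs whose last recorded success is too old.

    last_success maps job name -> datetime of the last confirmed success.
    A job that has never reported at all is treated as stale.
    """
    stale = []
    for job in jobs:
        seen = last_success.get(job.name)
        if seen is None or now - seen > job.max_silence:
            stale.append(job.name)
    return stale


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 9, 0)
    jobs = [
        # A Mon/Wed/Fri job is silent for at most 3 days (Fri to Mon);
        # 4 days leaves some slack before complaining.
        Job("mwf_report_email", timedelta(days=4)),
        Job("nightly_cleanup", timedelta(days=1)),
    ]
    last_success = {
        "mwf_report_email": now - timedelta(weeks=8),
        "nightly_cleanup": now - timedelta(hours=5),
    }
    print(find_stale_jobs(jobs, last_success, now))  # ['mwf_report_email']
```

The point of checking for silence rather than for errors is that a job which never runs never raises anything. This still needs a person or a pager on the receiving end of the stale list, and the checker itself is one more thing that can fail quietly.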
Wow, thanks for your thoughtful response. You've helped me realize I already have it rather good by comparison. E.g., we already have a designated One Source Of Truth for documentation.
Having been a coder, I do understand that documentation is just something else to maintain. And this is a sentiment I hear from many people in development. "Read the code" is a common response.
As someone who shares part of the on-call rotation, though, I've grown to see it differently. While that documentation is extra overhead to maintain, having it ready will save me from having to call you at 3am when your project breaks in production. If your documentation was written without consideration for on-call, I have no choice and you're getting woken up at 3am. I've found that when cast in this light, I have yet to meet a developer who wasn't eager to avoid that early morning phone call.
Thank you for your comment, it's good to know the color of the grass on the other side of the fence sometimes :)