So now let's talk about the management problem of improving system availability. One plausible idea I've heard is to base the approach on the observation, attributed by some to Gartner, that 80% of unplanned outages are caused by "change". As an aside, I'm not myself sure that this is really what Gartner says. I did some Internet searches, and the actual statement, by Donna Scott, seems to be this: "80% of unplanned downtime is caused by people and process issues, including poor change management practices, while the remainder is caused by technology failures and disasters." I read that as including change but possibly also including other things like ill-conceived processes, carelessness, prioritizing firefighting to such an extent that root cause analyses are neglected, etc. (Though the article I just linked to would seem to agree that change is the primary cause.) Anyway, I'll just take it as a given that change is an important cause of outages, partly because of what Gartner says and partly because it just jibes well with my personal experience on the matter. I've pulled the botched-release all-nighter a time or two in my career. But does that really mean that change management is the silver bullet for availability woes?
Changes to the operational environment certainly cause outages, but so do a lot of other things. Poor capacity planning causes outages. Poor response processes (front-line and escalation) unnecessarily prolong outages and hence contribute to unavailability. Buggy software causes outages. So do poor monitoring, outdated distribution lists, inattention to root cause analyses and problem resolution, fault-intolerant system design, and a lot of other things. The key is to figure out which of those factors is causing the most grief and to hit that first. (In this respect, it's very much like working on system performance issues: you look for the bottleneck, fix it, and find the next bottleneck if you're still having issues.) It may well be that we discover that change is responsible for our problems. In that case we'll reevaluate our change management processes and work with our teams to improve them.
Many environments are sufficiently complex that it's not always feasible to get to root cause for every issue. But it's not necessary to do so. As long as we can get at root cause for a reasonably representative sample of the issues that occur, we have enough information to steer us in the right direction. Sometimes we may have to just make a judgment call about the root cause, based on the available evidence, and that's probably unavoidable until we've established a certain capability and rhythm.
Here are the suggestions that I would offer:
Track your outages in some central location. It's a lot harder to figure out what's causing outages if you don't have a list of actual outages in front of you. If tech support has one list and operations has another, try to get some consolidation or at least consolidated reporting around those.
Don't immediately jump to conclusions about what's causing your outages, even if Gartner says that on average 80% of unplanned outages are caused by change (and again I'm not even sure that's what Gartner is saying). That's just an industry average, and like any distribution, there's variance around that average. Your IT shop may be struggling so much with capacity planning, for example, that the percentage of issues caused by change skews downward from the industry average. If you allocate 500GB of SAN for an application that requires 1TB, you will eventually have an outage. If you set your monitoring thresholds at 97% disk utilization, there's a good chance that you won't be able to respond quickly enough when that alert actually trips.
Identify a reasonably representative sample of outages from your list and try to get to root cause on them, even if you have to make judgment calls along the way. Root cause analysis will always involve judgment calls anyway, because you always have to decide when to stop asking "why did that happen?" The point is to get a good feel for what the likely causes are. It's not important to be 100% correct. And you may very well find that poor change management is causing a lot of your outages.
Prioritize the root causes and implement the appropriate fixes. It may help to matrix it out so you can see which problems are high-impact, low-effort and fix those first. If the root causes are technical in nature, implement the technical fixes. If they're process-related, work with the people who actually use the processes on a daily basis to understand how those processes might be improved. It's nearly certain that they will have important insights about the specific process that you as a manager need to understand.
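To make the last two suggestions a bit more concrete, here's a minimal sketch in Python of how you might tally downtime by root cause from a consolidated outage list and then rank the causes by impact relative to effort. Everything in it is an assumption for illustration: the `Outage` record shape, the cause categories, the 1-to-5 effort estimates, and the sample data are all made up, and dividing downtime by effort is just one simple way to approximate the "high-impact, low-effort" matrix.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Outage:
    # Hypothetical record shape for a consolidated outage list.
    system: str
    minutes_down: int
    root_cause: str  # a judgment call is fine, e.g. "change" or "capacity"


def downtime_by_cause(outages):
    """Sum downtime minutes per root-cause category, largest total first."""
    totals = defaultdict(int)
    for outage in outages:
        totals[outage.root_cause] += outage.minutes_down
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def rank_fixes(totals, effort):
    """Order causes by downtime per unit of estimated effort, highest first.

    `effort` maps each cause to a rough 1-5 estimate (an assumed scale).
    """
    return sorted(totals, key=lambda item: item[1] / effort[item[0]], reverse=True)


# Made-up sample data, not real measurements.
outages = [
    Outage("billing", 120, "change"),
    Outage("billing", 45, "capacity"),
    Outage("portal", 200, "capacity"),
    Outage("portal", 50, "monitoring"),
    Outage("email", 60, "change"),
]
effort = {"change": 4, "capacity": 2, "monitoring": 1}

totals = downtime_by_cause(outages)
ranked = rank_fixes(totals, effort)

for cause, minutes in ranked:
    print(f"{cause}: {minutes} min down, estimated effort {effort[cause]}")
```

In this made-up sample, change accounts for more downtime than monitoring, but the monitoring fix ranks ahead of it because it's estimated to be much cheaper. That's the kind of thing the matrix is meant to surface, and it's only as good as the judgment calls that went into the causes and the effort estimates.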