Most of us would like to spend less time firefighting and more time doing proactive work. Reactive firefighting gets in the way of the proactive things we really want to do because it happens on its own schedule. In other words, while we get to plan when we evaluate that new cloud service or upgrade the SAN, we generally do not get to decide when we respond to a service outage or a mission-critical hardware failure. While Incident Management is (by definition) a purely reactive endeavor, Problem Management is the key to this shift from reactive to proactive.
So why do so many organizations struggle with implementing Problem Management? It certainly isn’t for lack of effort or good intentions, and tools to support Problem Management are generally available as well. Fortunately, Problem Management is actually one of the easiest processes to get off the ground – provided you stick to a few key principles.
Identify a Problem Manager
The first and most important step to implementing Problem Management is to… wait for it… Identify a Problem Manager. Did I just blow your mind? Thought so. Seriously, though – it is absolutely crucial to appoint someone who will be the single point of accountability for managing problems throughout their lifecycle. This person is the Problem Management process owner – the “A” in the RACI model – for a lot of Problem Management activities.
Now, I know some of you are thinking you already have a Problem Manager, and that it’s the Service Desk Manager (or whoever owns Incident Management in your organization). At this point I want you to stop reading, walk over to the corner of the room, and slap yourself in the face. I’ll wait.
Ok. Sorry I had to ask you to do that, but it had to be done. The Problem Management process owner should NEVER be the same as the person who owns Incident Management. I get emails from recruiters now and then looking for an “Incident and Problem Manager”, which makes me shake my fist at the ceiling and silently rage “THAT’S TWO PEOPLE!!!” as I sit at my desk. My coworkers think I’m extremely well adjusted. (Here’s more on Incidents vs. Problems vs. Service Requests.)
The Incident Manager can never be effective as a part-time Problem Manager, because the two roles have fundamentally opposing goals. The primary objective of Incident Management is to restore normal service operation as quickly as possible (often through a workaround or temporary solution). Problem Management, on the other hand, has the mandate to identify the root cause(s) of the underlying problem and identify a permanent resolution (often implemented via the Change Management or Release Management processes). So, when the server is down, Incident Management is working to restore it as soon as possible, whereas Problem Management might want 20 minutes to look at the log files and figure out why it failed. In actuality, Problem Management does not have restoration of service as an objective – that’s Incident Management.
However, the biggest argument for separating the two roles is simpler: combining the roles means Problem Management doesn’t actually happen. The Incident Manager will always be waiting for that magical time just over the horizon when there are no incidents to deal with. “Maybe next week we can work on some of the proactive things, once the firefighting is done.” Sound familiar? Thought so. Without appointing a different person to that role, it’s next to impossible to make meaningful progress in Problem Management.
The Problem Manager does not have to be additional headcount in your organization – this role is most successfully performed by someone who is already a senior technical manager (for example, the manager of the System Administration or Network Administration team). This person actually acts as a sort of mini-project manager – keeping a backlog of all open problems, prioritizing the backlog, and driving problems to a successful resolution so they don’t sit out there forever.
What does the Problem Manager do?
Ok. You’ve appointed a Problem Manager. What’s next? Whereas Incident Management is purely reactive (and we have a Service Desk team that exists to react quickly and efficiently to these incidents), Problem Management does not have the same setup. In other words, most organizations will not have – or need – a dedicated Problem Management team. Rather, the people who are performing the other roles in the Problem Management process are mostly going to be people from across the various engineering/support/administration teams you already have in place. In fact, these people are already doing a lot of the things that fall under Problem Management; they’re just doing these things informally and in a reactive manner.
If you think about the last BHNP (Big Hairy Nasty Problem) you had in your organization, it probably went something like this: The problem existed for some period of time, until a VIP customer or someone in senior management was impacted, at which point it became the number one issue for the day. Conference calls were convened, meetings were held, fingers were pointed, vendors were blamed. Lots of teams worked together to eliminate possibilities and ultimately identify the root cause. Someone used the phrase “It was just a perfect storm”, because it turns out this issue could only have occurred on the second Tuesday of a month that does not end in an R when Mercury was in retrograde and someone mistyped their password twice between 3:22pm and low tide. Nice job, everyone.
The thing is, the next problem will be different, and it will require different individuals and teams to work through it. The Problem Manager is the one who drives and coordinates all of the necessary activities across the various functions and teams in order to get to the resolution. Once the root cause is known, it’s the Problem Manager who communicates the workarounds to the Service Desk, conducts the major Problem Review (if it was in fact a BHNP), and submits the Change Request if needed to implement the resolution.
How to Monitor and Measure Problem Management
Finally, in order to make sure our newly implemented process is successful, we have to make sure we are tracking the right Key Performance Indicators (KPIs). Your mileage may vary, but the following KPIs are a good place to start in terms of Problem Management Metrics:
- Number of problems in backlog, sorted by priority
- Average time to resolve a problem, based on priority
- Number and percentage of “recurring” incidents (should trend downward as Problem Management improves)
- Number and percentage of incidents resolved via workaround
Obviously this isn’t an exhaustive list (and one should always be careful to use a balanced set of KPIs to avoid inadvertently driving the wrong behaviors), but this should be a good place to start!
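If your ticketing tool can export problem and incident records, the four KPIs above are simple to compute. Here is a minimal Python sketch of how that might look. The record layout (`Problem`, `Incident`, and their fields), the "1 = highest priority" convention, and the sample data are all assumptions made up for illustration, not the schema of any particular tool, so treat it as a starting point rather than a finished report.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Problem:
    priority: int                       # assumed convention: 1 = highest
    opened: datetime
    resolved: Optional[datetime] = None  # None means still in the backlog


@dataclass
class Incident:
    recurring: bool                     # hypothetical flag: seen before
    resolved_via_workaround: bool       # hypothetical flag


def backlog_by_priority(problems):
    """Count open (unresolved) problems per priority, highest first."""
    counts = Counter(p.priority for p in problems if p.resolved is None)
    return dict(sorted(counts.items()))


def avg_days_to_resolve(problems):
    """Average resolution time in days for resolved problems, per priority."""
    durations = {}
    for p in problems:
        if p.resolved is not None:
            days = (p.resolved - p.opened).total_seconds() / 86400
            durations.setdefault(p.priority, []).append(days)
    return {prio: sum(d) / len(d) for prio, d in sorted(durations.items())}


def count_and_percent(incidents, flag):
    """Return (count, percentage) of incidents where the named flag is set."""
    if not incidents:
        return 0, 0.0
    count = sum(1 for i in incidents if getattr(i, flag))
    return count, 100.0 * count / len(incidents)


# Made-up sample data, purely to show the calls.
problems = [
    Problem(1, datetime(2024, 1, 1), datetime(2024, 1, 11)),
    Problem(1, datetime(2024, 1, 5)),
    Problem(2, datetime(2024, 1, 2), datetime(2024, 1, 6)),
    Problem(2, datetime(2024, 1, 3)),
    Problem(2, datetime(2024, 1, 4)),
]
incidents = [
    Incident(True, True),
    Incident(False, True),
    Incident(False, False),
    Incident(False, False),
]

print(backlog_by_priority(problems))                            # {1: 1, 2: 2}
print(avg_days_to_resolve(problems))                            # {1: 10.0, 2: 4.0}
print(count_and_percent(incidents, "recurring"))                # (1, 25.0)
print(count_and_percent(incidents, "resolved_via_workaround"))  # (2, 50.0)
```

Run something like this on a regular schedule and watch the trend rather than any single number: the recurring-incident percentage drifting downward is the signal that Problem Management is doing its job.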