In my last piece, I talked about Problem Management and how it is often confused with Major Incident Management. While the two are linked, they are very different beasts. However strong your Problem Management, you must be ready for business-critical incidents.
Do you know a major incident when you see it?
Often the signs are quite obvious: the desk gets flooded with similar tickets from many users, critical business functions are affected and steam is pouring from the ears of the management team. While these are obvious indicators, it isn't very helpful to use panic levels as your definition. ITIL requires that the IT team and the business agree on what constitutes a major incident, based on severity, urgency and impact. ISO 20000 requires the following for major incident management:
- Agreement on what constitutes a major incident
- A distinct and separate procedure for ‘major’ vs other incidents
- An outline of responsibilities and responsible parties
- A defined review process
Let’s take a closer look at each of these requirements to help you create your Major Incident plan before you need it.
Agreeing on Major Incident definitions
What constitutes a major incident for one business might not for another. The same goes for business units within one business. At Plan-Net, we have SLAs set up to manage Incident classification that is specific to each client. If you run an in-house team, SLAs are just as important. Each department will likely need its own set of resolution times, resources and communication lines according to its needs and business function.
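To make the idea of per-department definitions concrete, here is a minimal sketch in Python of how an agreed definition could be written down. Everything in it is an assumption for illustration: the 1 to 3 scales, the business unit names, the threshold values and the `is_major` helper are hypothetical, and it simplifies the classification to impact and urgency only. Your own SLAs should supply the real criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    business_unit: str
    impact: int   # hypothetical scale: 1 = whole unit affected, 3 = single user
    urgency: int  # hypothetical scale: 1 = work has stopped, 3 = workaround exists


# Illustrative thresholds agreed with each business unit (made-up values).
# An incident is treated as major when impact + urgency is at or below the threshold.
MAJOR_THRESHOLDS = {
    "trading": 3,
    "back_office": 2,
}
DEFAULT_THRESHOLD = 2  # used for units without a specific agreement


def is_major(incident: Incident) -> bool:
    """Return True if the incident meets the agreed major incident definition."""
    score = incident.impact + incident.urgency
    threshold = MAJOR_THRESHOLDS.get(incident.business_unit, DEFAULT_THRESHOLD)
    return score <= threshold
```

With these made-up numbers, the same incident (impact 1, urgency 2) would count as major for the trading unit but not for the back office, which is exactly the kind of difference the SLA discussion should settle in advance.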
Major incident roles & responsibilities
Running around like a headless chicken during a Major Incident is not a good look. Roles and processes should be strictly defined before a Major Incident strikes. When kicking off your MI process, nothing quite beats a war-room-style meeting that gets all relevant parties together and reminds them of their roles.
1) Major Incident Manager
He or she will be responsible for overseeing the major incident process, ensuring that the appropriate resources are engaged and that the users and management team are kept informed of progress. Depending on the size of the IT team, it could be a Service Desk analyst or a more senior technical manager with knowledge specific to the incident type.
2) Problem Manager
While this resource will need to be involved, it should be a different person to the Major Incident Manager. A Problem Manager will be most useful after the resolution to help with root cause analysis, but this can take time. The Major Incident Manager, by contrast, will be pushing for an immediate fix so that normal business can resume ASAP.
3) Service Desk
It goes without saying perhaps, but it must be decided how much of your Service Desk should be allocated to the Major Incident. In serious cases, it might be decided that it should be all hands on deck for the Major Incident and everything seemingly unrelated should go on hold.
4) Change Manager
If major changes had to be implemented in order to restore service, your Change Manager will need to be involved.
5) SLA Manager
Someone needs to be recording downtime and SLA misses so that this can be reported internally and to the customer or management teams.
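As a rough illustration of the SLA record-keeping described above, the following Python sketch computes downtime and whether an agreed resolution target was missed. The `downtime_report` function, its field names and the idea of a single resolution target per incident are assumptions made for this example, not part of any standard.

```python
from datetime import datetime, timedelta


def downtime_report(started: datetime, restored: datetime, sla_target: timedelta) -> dict:
    """Summarise downtime for one incident against a hypothetical SLA target."""
    if restored < started:
        raise ValueError("restored must not be earlier than started")
    downtime = restored - started
    overrun = downtime - sla_target
    return {
        "downtime_minutes": int(downtime.total_seconds() // 60),
        "sla_missed": downtime > sla_target,
        # Minutes beyond the target; zero when the target was met.
        "overrun_minutes": max(0, int(overrun.total_seconds() // 60)),
    }
```

A record like this, kept for every Major Incident, gives the SLA Manager something consistent to report internally and to the customer.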
There is one vital role missing from the above list...the customer!
Whether your customer is within your own business or a paying client, they need to be kept in the loop as much as possible. Your Incident Manager should be providing a quick and concise summary at least every hour - more frequently if possible.
The update to the customer needn’t be War and Peace, but it should include the following:
- Short description of the cause of the downtime
- Impact of the downtime
- Estimated time of resolution
Creating a template in advance will help the MI manager keep to the point and ensure timely delivery of updates.
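One possible shape for such a template, sketched in Python with the standard library's `string.Template`. The wording of the template, the `render_update` helper and the extra timestamp field are assumptions for illustration; the three core fields are the ones listed above.

```python
from string import Template

# Hypothetical update template covering cause, impact and estimated resolution.
UPDATE_TEMPLATE = Template(
    "Major Incident update ($time)\n"
    "Cause: $cause\n"
    "Impact: $impact\n"
    "Estimated resolution: $eta\n"
)


def render_update(time: str, cause: str, impact: str, eta: str) -> str:
    """Fill in the template; substitute() raises KeyError if a field is missing."""
    return UPDATE_TEMPLATE.substitute(time=time, cause=cause, impact=impact, eta=eta)
```

For example, `render_update("10:00", "Failed storage controller", "Email unavailable for all users", "12:30")` produces a four-line update that can be pasted straight into an email or status page.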
Root Cause Analysis
Once the incident has been resolved, you will need to produce a report on how it happened, why it happened and how to prevent it happening again. This is where your Problem Manager steps in. By working through the tickets you can perform an RCA. This should be checked against the solution used to get up and running again so that any loose ends are dealt with. A patched-together temporary solution may not be sufficient.
If you are experiencing repeated Major Incidents, go back to your Problem Management processes and look for root causes regularly.
If you haven’t had a chance to read my tips on Problem Management yet, you can find them here.