Interruptions will happen. Systems will freeze – and no software or hardware is 100% failsafe. These things are facts of life. But it’s not the individual technical incident itself that poses the biggest risk to your organisation: it’s what happens next that really matters.
Downtime costs UK businesses more than £10 billion a year – and the loss to a business of each hour of downtime is estimated at more than £250,000 on average. The smoother and swifter your incident management journey, the lower the overall impact on your organisation.
Here’s how to design your management journey for speedy and effective incident resolution.
The Start Point: Define an ‘Incident’
An "incident" is defined as an interruption to an IT service that is both:
- UNPLANNED, and
- Affecting the QUALITY of that service.
The first element raises an important point concerning communication. If your IT team has scheduled a major patch or upgrade that will necessitate service interruption, make sure you let the end users know what is happening in advance. This means those users can plan for it – and it helps prevent your service desk being overwhelmed by a flurry of calls.
On the second point, most quality issues will be self-evident to the end user – e.g. the platform that suddenly cannot be accessed or the printer that has blown. But other issues – especially those relating to complex systems – may be subtler: the system may still be usable, albeit with reduced functionality. These issues might seem trivial to the end user, but could be indicative of a problem in need of attention.
The IT team should consult with the wider business to define the criteria that have to be met before an event ought to be classed as an actionable incident. You should also have agreed criteria in place to determine the level of resources to deploy to tackle the problem. Escalation tiers could be along these lines:
- Tier 1: Normal Incidents. Based on the information capture and subsequent severity assessment, sufficient staff are deployed to respond to the incident. If more support is needed, the incident can be subsequently escalated further.
- Tier 2: Major Incidents. This is the ‘all hands on deck’ approach. A senior team member takes the lead and all team members required to resolve the issue are instructed to focus on this incident immediately.
The remainder of this guide focuses on the resolution of major (Tier 2) incidents.
Stage 1: Information Capture
This is essential for the initial assessment of the nature, scope and urgency of the incident, and of the steps likely to be required for resolution. It sets the scene for the entire management journey.
Your initial response team needs to ascertain and record the following:
- Name, location and contact details of the individual who discovered the incident.
- Date and time of the incident first becoming apparent.
- The nature of the incident and the business processes that are being impacted.
- The systems, endpoints, drives, software/hardware, persons and locations affected.
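As a rough illustration, the capture list above could be held in a simple structured record so that nothing is forgotten at the first point of contact. This is only a sketch: the class name, field names and sample values below are hypothetical and are not taken from any particular ticketing product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical capture form mirroring the four items listed above."""
    reporter_name: str
    reporter_location: str
    reporter_contact: str
    first_observed: datetime  # date and time the incident first became apparent
    description: str          # the nature of the incident
    impacted_processes: List[str] = field(default_factory=list)
    affected_assets: List[str] = field(default_factory=list)  # systems, endpoints, people, sites

    def missing_fields(self) -> List[str]:
        """Return the names of fields left empty, so the gaps can be chased up."""
        return [name for name, value in vars(self).items() if value in ("", [], None)]


# Example with invented values: location and affected assets are not yet known.
record = IncidentRecord(
    reporter_name="A. Example",
    reporter_location="",
    reporter_contact="ext. 0000",
    first_observed=datetime(2024, 1, 15, 9, 30),
    description="Order platform cannot be accessed",
    impacted_processes=["Order processing"],
)
print(record.missing_fields())
```

A check like `missing_fields()` gives the initial response team a prompt to go back for the details they still need before the incident is assessed.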
Stage 2: Identification & assessment
It’s natural for an end user to view their particular problem as “urgent” – no matter what. You need a consistently applied procedure for prioritising incidents and for establishing which resources to deploy, and how quickly.
Broadly, there are two parts to this:
Severity. Here, it’s useful to sub-categorise into multiple levels. These could be ‘Everyday’ (minimal impact on business processes), ‘Critical’ (significant disruption to workflows and processes) and ‘Major’ (preventing core service delivery).
Business priorities. This becomes especially relevant where your incident management team receives multiple requests from different areas of the business. For instance, you may want to set a policy whereby relatively minor incidents affecting Customer Service take priority over significant incidents confined to HR.
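The two parts above can be combined into a single, repeatable ordering rule. The sketch below assumes one possible policy, in which business area is considered first and severity second, which matches the Customer Service versus HR example. The area ranking and the fallback rank are illustrative assumptions, and your own policy may weigh the two factors differently.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The three example severity levels described above."""
    EVERYDAY = 1
    CRITICAL = 2
    MAJOR = 3


# Assumed business policy: a lower rank is handled first.
AREA_RANK = {"Customer Service": 0, "HR": 1}
UNRANKED = 99  # hypothetical fallback for areas the policy does not mention


def triage_key(incident):
    """Sort by business area first, then by severity (most severe first)."""
    area, severity = incident
    return (AREA_RANK.get(area, UNRANKED), -severity)


queue = [
    ("HR", Severity.CRITICAL),
    ("Customer Service", Severity.EVERYDAY),
    ("HR", Severity.MAJOR),
]
ordered = sorted(queue, key=triage_key)

for area, severity in ordered:
    print(area, severity.name)
```

Under this assumed policy the everyday Customer Service incident is handled first, followed by the two HR incidents in order of severity.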
Stage 3: Escalation
Effective and swift escalation demands the ability to reel in precisely the right resources when you need them. For this, you need:
- A clear definition of roles & responsibilities for response team members.
- Points of Contact: who needs to be contacted in the event of a major incident? Should the primary POC vary depending on the incident type? What is the procedure for contact – including contingency arrangements if the primary POC is unavailable?
- Provision of key information. This may include a standard ticketing template, giving the escalation team the relevant information on the incident that’s available to date.
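The point-of-contact questions above lend themselves to a simple lookup: an ordered chain of contacts per incident type, with a default chain as the contingency. The sketch below is illustrative only. The roster contents, role names and the `is_available` check are all hypothetical; in practice availability might come from an on-call rota or a paging tool.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical roster: primary contact first, then contingency contacts in order.
ESCALATION_ROSTER: Dict[str, List[str]] = {
    "network": ["network-lead", "network-deputy", "it-duty-manager"],
    "default": ["it-duty-manager", "head-of-it"],
}


def pick_contact(incident_type: str, is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first available contact for this incident type, or None if nobody is."""
    chain = ESCALATION_ROSTER.get(incident_type, ESCALATION_ROSTER["default"])
    return next((person for person in chain if is_available(person)), None)


# Example: the primary network contact is unavailable.
unavailable = {"network-lead"}


def available(person: str) -> bool:
    return person not in unavailable


print(pick_contact("network", available))   # falls through to the deputy
print(pick_contact("database", available))  # no specific chain, so the default is used
```

Returning `None` when the whole chain is unavailable makes the gap explicit, so the procedure can define what happens next rather than failing silently.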
Stage 4: Communicating with the business
Unless you proactively inform the business of the incident and advise them that it is in hand, you could be hit by a stream of user contacts – all concerning the exact same thing. To stem this, you could consider the following:
- An email/SMS alert to relevant users
- A custom voice response on your phone system
- A pop-up message for users logging in.
Whatever you decide, make sure you have a plan.
Stage 5: Keep the business updated
Rectification of major incidents can be complex – and may lead to different parts of your IT architecture going off-line at different times. The people affected need to know what to expect.
Stage 6: Notification of resolution
Notify all business users once the issue has been fixed. If appropriate, explain what happened.
Stage 7: Perform an investigation
Major incidents should be formally investigated, usually by the group responsible for the resolution. The findings should be sent to IT, including the Service Desk team. These findings can help inform and improve your response to similar incidents in the future.
Stage 8: Add to the incident tracker
Incidents should be comprehensively logged in your tracker, including details of what happened, the impact, the cause, who was part of the resolver team – and any relevant commentary on how the management of similar future incidents might be improved. This knowledge base of all incidents should help promote more effective, targeted response management in the future.