Keeping up with Klaxon

Keeping up with Klaxon - Episode 1

by: Neil Conchie

on: December 7, 2020 4:17 PM

The 5 Major Pitfalls of Major Incidents with John Sansbury

Welcome to Episode 1 of Keeping up with Klaxon, where I met up with John Sansbury of Infrassistance. John is a bestselling author and a recognised ITIL adviser with over 50 years’ industry experience. He has also set up over 200 service desks and is currently ranked number one worldwide for the number of people he has trained in ITIL Foundations. John joins us to talk about the 5 major pitfalls companies and organisations fall into when it comes to major incidents.

Here are some notes from the interview with John on the 5 major pitfalls of major incidents.

Pitfall 1: Inadequate incident categories

Tagging an incident with the right category ensures that, if the service desk needs to escalate it, it is assigned to the right team to resolve. It can also help with prioritising incidents.

70% of initial incident categorisation is incorrect, which can mean that the incident is assigned to the wrong team and bounces between teams.

Make sure your service desk knows how to categorise incidents through training or with an effective configuration management system.

Correct categorisation of incidents can also help problem management teams carry out root cause analysis. The best practice for reviewing incident categorisation is to verify it once the incident is closed (record closure).

The most common issue with incident categories is having too many or too few. Too few categories can lead to not enough information being recorded, and too many categories can lead to confusion and mislabelling of incidents. The example John gives of too many categories is a client of his with over 400 categories. When auditing, he found that 92% of incidents were in the ‘other’ category, as the service desk team had found the list overwhelming. Beyond simply having categories, it is important that they are multi-level, based on incident impact (which escalates with severity) and verified on record closure.

Pitfall 2: Lack of incident costing

Most organisations don’t realise how much incidents cost them.

There are 4 major costs of incidents –

Internal rework – the time, effort and money spent on incidents that the service desk escalates to second line support teams. This cost is often understated. Second line support teams are not primarily incident managers, and incident management tasks distract them from their main work. John has previously found that second line teams can spend up to 80% of their time on incidents. In his example, if you have a second line support team of 100 people and 80 of them are managing incidents, that is 80 FTEs (full-time equivalents) at an average cost of £50,000 per year, meaning an overall incident management cost of £4m per year.
Lost User Time – this relies on recording how many people are affected by an incident, which is not usually well reported. Teams tend to use brackets to estimate the number of people affected, which can mean the true impact is understated. If 1,000 people are impacted by an incident that lasts 4 hours, that is 4,000 lost user hours.
Revenue Lost – What would your users have been doing to generate revenue, and how would an incident affect them? John’s example is a logistics company with a contract to deliver parts to Ford on a ‘just-in-time’ basis, where a major IT incident could mean missing the 1-hour delivery window, which would cost them a fine of 1 million euros.
Reputation Damage – John talks about three well-documented examples of this from recent years. The first was when BlackBerry had a major IT incident 11 years ago, which they reported as minor, already fixed and not affecting everyone. This was not true: the incident actually lasted 4 days, and all of this happened just a few months before the iPhone launch. The second is Royal Bank of Scotland, who had a major incident with their scheduling software which wiped out their live and test databases simultaneously and led to a 4-month recovery time. The last example is TSB, who three years ago decided to build independent systems away from their parent company, Lloyds Bank. Their development was 7 months behind and they went live before full testing had taken place, leading to a systems crash. It is estimated that they lost £300 million through customers who left as a result of the incident, customers who would otherwise have switched to them, and development costs.
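
The arithmetic behind the first two costs can be sketched in a few lines of Python. This is only an illustration using the figures from John's examples; the function names and parameters are made up for this sketch and do not come from any particular tool.

```python
def internal_rework_cost(team_size, percent_time_on_incidents, avg_annual_cost):
    """Estimate the yearly cost of second line staff time spent on incidents.

    All names here are hypothetical and chosen for illustration only.
    Returns (full-time equivalents tied up in incidents, yearly cost).
    """
    ftes = team_size * percent_time_on_incidents / 100
    return ftes, ftes * avg_annual_cost


def lost_user_hours(users_affected, duration_hours):
    """Lost user time is simply people affected multiplied by incident duration."""
    return users_affected * duration_hours


# John's example: a team of 100, 80% of effort on incidents, £50,000 per person
ftes, yearly_cost = internal_rework_cost(100, 80, 50_000)
print(f"{ftes:.0f} FTEs, £{yearly_cost:,.0f} per year")

# John's example: 1,000 people affected for 4 hours
print(f"{lost_user_hours(1000, 4):,} lost user hours")
```

Swapping in your own team size, percentage and average cost gives a rough first estimate. Revenue lost and reputation damage are much harder to reduce to a formula, so they are left out of the sketch.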

Pitfall 3: No ‘Shift Left’ Policy

When the service desk receives an incident that it cannot rectify itself, the incident gets passed up to second or third line support or to a third-party support provider. Shift Left is defined as moving things back downstream to the service desk or even to self-service.

There are often three main reasons the service desk cannot rectify the issue:

No Time – This can be resolved by providing more resource.
Expertise – More training can be provided to service desk staff.
Authority – A system admin may be required or more system permissions may be allocated to the service desk team.

Giving more power to the service desk and putting more information on resolved incident tickets can mean that fewer incidents need to be escalated. As the service desk is often cheaper to run, and has more capacity, than second or third line support, this can provide a cost-effective solution.

Pitfall 4: Difficulty Measuring Incident Impact

According to leading frameworks such as ITIL, incident priority is based on impact and urgency, but impact is hard to measure. Most systems use the number of users affected to assess impact, but a better way to measure it could be based on the application itself and how critical it is. If an application has few users but is central to the business, an incident affecting it is potentially more urgent than one affecting an application with hundreds of users that isn’t central to business function. Another factor to consider is the potential reputational damage to the company if even a small incident is noticed by customers, as reputation damage is arguably the most damaging impact of an incident. The last factor John talks about is regulatory impact (such as in finance), where reports not submitted accurately or on time could have financial repercussions.
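
One way to picture this is a small Python sketch that scores impact from the factors John mentions and then combines impact with urgency to give a priority. The field names, the 100-user threshold and the "impact + urgency - 1" rule are all assumptions made for this illustration; they are not taken from ITIL or from John's interview, and a real service desk would agree its own rules.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields, chosen only to mirror the factors discussed above
    users_affected: int
    business_critical: bool   # is the application central to the business?
    customer_visible: bool    # could customers notice (reputational risk)?
    regulatory_risk: bool     # could a regulatory report be late or inaccurate?
    urgency: int              # 1 = high, 2 = medium, 3 = low


def impact_level(incident):
    """Return 1 (high), 2 (medium) or 3 (low). The thresholds are assumptions."""
    if (incident.business_critical
            or incident.customer_visible
            or incident.regulatory_risk):
        return 1
    if incident.users_affected >= 100:  # assumed cut-off, not from the source
        return 2
    return 3


def priority(incident):
    """Combine impact and urgency into a priority from 1 (highest) to 5 (lowest)."""
    return impact_level(incident) + incident.urgency - 1


# Few users, but central to the business: treated as top priority
payroll = Incident(users_affected=5, business_critical=True,
                   customer_visible=False, regulatory_risk=False, urgency=1)

# Hundreds of users, but not central to business function
intranet = Incident(users_affected=300, business_critical=False,
                    customer_visible=False, regulatory_risk=False, urgency=2)

print(priority(payroll), priority(intranet))
```

In this sketch, user count only matters once criticality, customer visibility and regulatory risk have been ruled out, so a small but business-critical application is never outranked by a large, non-critical one.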

Pitfall 5: Lack of Negotiated Service Levels

Quite often, service desks are providing the service they believe the business or customers need without validating it via a service level agreement (SLA). Outsourced IT services will almost always have an SLA, but internal services often don’t. Users reporting an incident might therefore believe that the service desk is working only on their incident, when in reality it might be working on 200-300 open incidents. They may also believe the incident will be fixed within half an hour. SLAs often include a priority system, and as a general rule high priority incidents are resolved within 4 hours, with the lowest priority incidents taking up to 5 working days. Where no set time-frames have been published, expectations are not managed, and users will probably call back within about an hour. John has found that up to 70% of calls to the service desk are ‘chase-calls’ to check on the status of an incident.
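
To show how published time-frames could work in practice, here is a minimal Python sketch that turns the two rule-of-thumb targets above into a resolution deadline a user could be told when logging an incident. The function names are hypothetical, only the "high" and "low" targets mentioned above are included, and the 4 hours are counted as clock hours rather than business hours, which is a simplifying assumption.

```python
from datetime import datetime, timedelta


def add_working_days(start, days):
    """Move forward by a number of working days, skipping Saturdays and Sundays.

    Public holidays are ignored in this simplified sketch.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def resolution_target(reported_at, priority):
    """Return the latest expected resolution time for an incident.

    Only the two targets from the rule of thumb above are modelled;
    intermediate priorities would come from the agreed SLA.
    """
    if priority == "high":
        return reported_at + timedelta(hours=4)
    if priority == "low":
        return add_working_days(reported_at, 5)
    raise ValueError(f"No target defined for priority {priority!r}")


reported = datetime(2020, 12, 7, 9, 0)  # a Monday morning
print(resolution_target(reported, "high"))  # same day, four hours later
print(resolution_target(reported, "low"))   # the following Monday
```

Sharing a target time like this with the user when the incident is logged is one way to manage expectations and reduce chase-calls.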

We were thrilled to have John join us for this interview and are very grateful to him for sharing his experience with us.

To get in touch with Neil email neil.conchie@klaxon.io

To get in touch with John email john.sansbury@infrassistance.com

Find us on Anchor to subscribe on your favourite podcasting platform!
