ITIL Major Incident Management – How to handle it (2024)

Various authors have discussed Incident Management here on several occasions. Being one of the most elaborated key functions, there are a number of issues we could address in depth. Major incident management is one of them, and due to its significant impact and visibility, it deserves a few more words.

ITIL Incident Management Overview

According to ITIL, any unplanned interruption or degradation of a service is considered an incident. Once an incident happens, and incidents will happen, the primary goal of ITIL Incident Management is to restore service as quickly as possible in order to minimize the business impact. Any event that disrupts, or could disrupt, a service is within the scope of incident management. For example, a single failed disk drive in a mirrored array does not cause an interruption, but the service is degraded in the sense that the risk of data loss has increased, which is why such an event is also considered an incident.

ITIL Incident Management, as part of ITIL Service Management, is responsible for incident identification, logging and categorization. Reports about incidents may come from the Service Desk (by call, e-mail, web), from event management or directly from technical staff, but all of them have to be recorded, time-stamped and contain sufficient data to be managed properly.

In order to manage incidents effectively, we need a means of prioritizing them, because they rarely appear one at a time. We prioritize incidents using an Impact vs. Urgency matrix. Impact is the effect an incident has on the business, and Urgency defines how long the business (or customer) is prepared to wait for resolution. For example, we may have a high-impact incident (high level – 1) affecting the whole finance department, but with low urgency (low level – 3) because the department uses that service only at the end of the fiscal year, which is 6 months away. In such a scenario, the incident is categorized as moderate priority – 3. The time frames in which each priority level is expected to be resolved are part of the Service Level Agreement (SLA). Read this blog post for more information: All About Incident Classification.
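The impact/urgency lookup described above can be sketched in a few lines of Python. This is only an illustration: the text confirms a single cell (impact 1 with urgency 3 gives priority 3), and the remaining cells are an assumption based on a commonly used 3x3 layout. The names `PRIORITY_MATRIX` and `priority` are hypothetical, and your own matrix should come from what IT and the business have agreed.

```python
# Assumed 3x3 matrix: impact and urgency each range from 1 (high) to 3 (low).
# Only the cell (impact=1, urgency=3) -> 3 comes from the article; the other
# cells follow a commonly used layout and are illustrative, not prescriptive.
PRIORITY_MATRIX = {
    (1, 1): 1, (1, 2): 2, (1, 3): 3,
    (2, 1): 2, (2, 2): 3, (2, 3): 4,
    (3, 1): 3, (3, 2): 4, (3, 3): 5,
}


def priority(impact: int, urgency: int) -> int:
    """Look up the priority for an impact/urgency pair (1 = highest priority)."""
    try:
        return PRIORITY_MATRIX[(impact, urgency)]
    except KeyError:
        raise ValueError(
            f"impact and urgency must each be 1, 2 or 3, got {impact} and {urgency}"
        ) from None


# Finance department example: high impact (1), low urgency (3).
print(priority(impact=1, urgency=3))
```

Keeping the matrix as data rather than as a formula makes it easy to adjust individual cells when the business wants a particular combination treated differently.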

Incidents are generally the result of errors or malfunctions within IT equipment. In such cases, the root cause is apparent, and resolution is as simple as repairing the faulty part or applying a workaround. But when the incident is serious, or an avalanche of similar incidents is recorded, the Problem Management process steps in and takes over the search for the root cause. Once the root cause has been identified, the problem is referred to as a known error and is registered in the Known Error Database (KEDB). The Service Desk, as a function of Incident Management, relies on the known error database and the workarounds it provides. If you need a tool that will help you manage incidents, here is a list of Free tools for ITSM that you may try and use for free.

What is a major incident?

In theory, a major incident is a highest-impact, highest-urgency incident. It affects a large number of users, depriving the business of one or more crucial services. Business and IT have to agree on what constitutes a major incident. It is one of the rare occasions where ITIL is strict in terms of definition: it MUST be agreed on. ISO 20000 requirements on major incident management are short, but demanding: agreement, separate procedure, responsibility and review.

In practice, you know a major incident when you see it: a large number of Service Desk calls, customer impatience, rage of the management, panic. All the more reason to get it straight before it happens. In most cases, it will simply be the highest-priority incident in the impact/urgency matrix. You might have a look at my Incident Classification article. In some cases, IT and the business can decide that only special types of high-priority incidents will be marked as major incidents. This can be due to different SLA parameters with various businesses. For example, when you support a chain of pharmacies or tobacco shops, they will want their cash register service malfunction to be marked as priority one, with strict resolution times defined. If you support another organization, say finance or marketing departments in the same corporation, their SLA will tend to address different issues, different response and resolution times, and probably a different amount of resources for the resolution.


Who should be involved?

When a major incident occurs, roles and the process should be strictly defined. Mind you, we are talking about the roles here, not the actual day-to-day jobs. Roles will differ according to the size of the IT service management organization and the scope of its service management. Smaller organizations will tend to aggregate a few roles into one job definition, while larger organizations will elaborate sub-roles for each major incident type, customer or technical expertise field.

Major incident manager. Accountable for the general procedure management, taking care that the required resources for incident resolution are engaged and the customer is informed appropriately about the progress. He shall also have basic technical knowledge about the outage. In smaller service management organizations with a lower frequency of major incidents, this role will be taken by the Service Desk manager, who also acts as the Incident Manager. In larger organizations, the appointment of major incident manager will depend on the particular expertise area. It could be the technical account manager best acquainted with the respective business organization specifics, someone from the Technical management function or the Application management function.

Problem manager. This role will often have to be involved, since major incident resolution usually requires finding the underlying cause (root cause analysis) of the major incident. This role can’t be combined with the incident management role, due to the well-known conflict of interests between the incident management and problem management processes. The major incident team will be struggling to restore the service, and problem management tends to take its time finding the root cause.

Change manager. Involved in case some urgent changes have to be implemented to restore the service.

SLA manager. Must be informed in order to keep a record of the downtime and to inform the customer if the procedure requires this.

Service Desk. Responsible for keeping incident records up to date and for primary customer communication.

Communication

We mentioned major roles in the process. Guess whose is the most important role, and who is often omitted from the loop? The customer! It is the most common mistake for growing service management organizations – to get involved so deeply in incident resolution that the communication with the customer is neglected.

The moment you receive a call from the customer inquiring about resolution progress, you should know that something is wrong. The frequency, form and scope of communication with the customer should be clearly stated in the SLA. The customer should always know what to expect. His vital business process is endangered; he is bound to be on edge. A short, concise update every half hour, or at least every hour, should contain info about:

  • Start of downtime
  • Short description of the known cause of the downtime
  • The impact of the downtime
  • Estimated time for restoration
  • Next scheduled information

The major incident team should maximize its resources in service restoration, so the Service Desk should regularly ping them to receive a quick update about the process, which they will forward formally to the customer.
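As a rough illustration, the five items of the customer update listed above could be captured in a small structure that the Service Desk fills in and renders as a message. This is a sketch, not a prescribed format: the class name `StatusUpdate`, its field names and the 30-minute default interval are assumptions chosen to mirror the list and the "every half hour" guidance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StatusUpdate:
    """One customer-facing update during a major incident (hypothetical layout)."""
    downtime_start: datetime
    known_cause: str
    impact: str
    estimated_restoration: datetime
    sent_at: datetime
    # Assumed default; the real interval should come from the SLA.
    interval: timedelta = timedelta(minutes=30)

    @property
    def next_update(self) -> datetime:
        """Time at which the customer should expect the next update."""
        return self.sent_at + self.interval

    def render(self) -> str:
        """Format the update as a short plain-text message."""
        fmt = "%Y-%m-%d %H:%M"
        return "\n".join([
            f"Start of downtime: {self.downtime_start.strftime(fmt)}",
            f"Known cause: {self.known_cause}",
            f"Impact: {self.impact}",
            f"Estimated restoration: {self.estimated_restoration.strftime(fmt)}",
            f"Next update: {self.next_update.strftime(fmt)}",
        ])


update = StatusUpdate(
    downtime_start=datetime(2024, 1, 15, 9, 0),
    known_cause="Storage controller failure (under investigation)",
    impact="Cash register service unavailable in all shops",
    estimated_restoration=datetime(2024, 1, 15, 11, 0),
    sent_at=datetime(2024, 1, 15, 9, 30),
)
print(update.render())
```

Computing the next update time from the send time, rather than typing it by hand, keeps the promise to the customer consistent from one message to the next.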

The after party

The incident is resolved, the service is restored, and the customer returns to his day-to-day business. The aftertaste remains. Why did it happen? What is to be expected going forward – have we done anything to prevent these downtimes in the future? How do we deal with these questions?

In short, the best practice is to resolve the incident and to continue working on a related problem ticket. This will produce a so-called problem report, or at least a root cause analysis (RCA) report, which is delivered to the customer within a brief, SLA-defined period of time. The report should contain at least the following:

  • Short description of the incident
  • Downtime duration
  • SLA impact
  • Short incident history
  • How we resolved the incident
  • What is the root cause
  • A set of activities scheduled in order to prevent this kind of downtime
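Two items in the report above, downtime duration and SLA impact, are simple arithmetic. The sketch below assumes that SLA impact is expressed as availability over a reporting period, which is one common convention rather than something the article specifies. The function names, the 30-day period and the 99.9% target are all hypothetical.

```python
from datetime import datetime, timedelta


def downtime_duration(start: datetime, restored: datetime) -> timedelta:
    """Length of the outage, from start of downtime to service restoration."""
    if restored < start:
        raise ValueError("restoration time precedes start of downtime")
    return restored - start


def availability_percent(downtime: timedelta, period: timedelta) -> float:
    """Share of the reporting period during which the service was available."""
    return 100.0 * (1 - downtime / period)


# Hypothetical outage: 90 minutes within a 30-day reporting period.
outage = downtime_duration(datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 10, 30))
achieved = availability_percent(outage, timedelta(days=30))
target = 99.9  # assumed SLA availability target, in percent

print(f"Downtime: {outage}")
print(f"Availability: {achieved:.2f}% (target {target}%)")
print(f"SLA met: {achieved >= target}")
```

If several outages fall within the same period, their durations would be summed before calculating availability.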

This report will soothe the customer’s concerns, and let him know that he is dealing with mature service management which understands his business needs, and is doing its best to protect his core business.

If you were the customer, what more would you expect?

Download this free sample of an Incident Management Process template to learn more.

FAQs

How do you handle a major incident in ITIL? ›

The primary goal of ITIL Incident Management is to restore service as quickly as possible in order to minimize the business impact. Any event that disrupts, or could disrupt, a service is within the scope of incident management.

What are the 3 main steps to follow in case of major incident? ›

The 3 Phases of a Major Incident
  • The initial 15 minutes (of major incident identification)
  • The post 15 minutes (n.b. this can last hours or sometimes days)
  • The resolution (and closure of the major incident)

What is the most important question to focus on when resolving an incident? ›

When starting out with incident management, it's recommended that the focus is on asking the most critical questions so that the fix effort can get under way as soon as possible. An example question is: What's happening?

What is P1 P2 P3 incidents? ›

  • P1 – Priority 1 incident tickets (Critical)
  • P2 – Priority 2 incident tickets (High)
  • P3 – Priority 3 incident tickets (Moderate)
  • P4 – Priority 4 incident tickets (Low)

The SLA success rate is given as a percentage.

What are the 4 main stages of a major incident? ›

Most major incidents can be considered to have four stages:
  • the initial response;
  • the consolidation phase;
  • the recovery phase; and
  • the restoration of normality.

What is MIM process in ITIL? ›

What Is Major Incident Management? The goal of the overall Incident Management process is to effectively manage the lifecycle of all incidents and to restore IT services for users or customers as quickly as possible when an interruption takes place.

What makes a good major incident manager? ›

An Incident Manager must be adept at finding solutions to problems and trialling different ways to find a resolution. Talking of the skills needed to be an Incident Manager, Orla O'Brien, Incident Manager at Vodafone, says a 'good technical grounding' is key to having a 'perspective on how to approach problems'.

When finishing the Major incident Report What are the steps you should take? ›

5 Steps to Take After a Safety Incident
  1. Step 1: Get Medical Attention and Care Immediately. ...
  2. Step 2: File an Incident Report As Soon As Possible. ...
  3. Step 3: Inform All Necessary Parties. ...
  4. Step 4: Review of Safety Procedures. ...
  5. Step 5: Be Alert but Remain Courteous.

What are the SLA for major incident? ›

SLA management and escalation

An SLA is the acceptable time within which an incident needs response (response SLA) or resolution (resolution SLA). SLAs can be assigned to incidents based on their parameters like category, requester, impact, urgency etc.
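A minimal sketch of how response and resolution deadlines might be derived from an incident's priority follows. The target times in `SLA_TARGETS` are invented for illustration, since real values come from the SLA agreed with the customer, and the names `Deadlines`, `sla_deadlines` and `is_breached` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Deadlines(NamedTuple):
    response_due: datetime
    resolution_due: datetime


# Hypothetical targets per priority: (response SLA, resolution SLA).
# Real values are whatever the SLA with the customer defines.
SLA_TARGETS = {
    1: (timedelta(minutes=15), timedelta(hours=4)),
    2: (timedelta(minutes=30), timedelta(hours=8)),
    3: (timedelta(hours=4), timedelta(days=3)),
}


def sla_deadlines(priority: int, logged_at: datetime) -> Deadlines:
    """Compute when response and resolution are due for a logged incident."""
    response, resolution = SLA_TARGETS[priority]
    return Deadlines(logged_at + response, logged_at + resolution)


def is_breached(due: datetime, now: datetime) -> bool:
    """An SLA is breached once the current time has passed its deadline."""
    return now > due


deadlines = sla_deadlines(priority=1, logged_at=datetime(2024, 5, 6, 14, 0))
print(deadlines.response_due, deadlines.resolution_due)
print(is_breached(deadlines.response_due, now=datetime(2024, 5, 6, 14, 20)))
```

In practice the lookup key would include more than priority (category, requester and so on, as the answer above notes), and the clock would usually respect agreed service hours rather than running around the clock as it does here.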

What is P1 and P2 incident? ›

Depending on the impact and urgency, a major incident will be categorized as a P1 or P2. Incident Coordinators utilize a priority matrix to determine the appropriate impact and urgency. All P1 tickets are considered major incidents. P2 tickets are considered major if the impact is "multiple groups" or "campus."
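The rule quoted above (every P1 is major, and a P2 is major only for certain impact values) can be written as a small predicate. This is a sketch of that one organization's rule, not a general ITIL definition, and the function name and string values are assumptions about how the ticket fields might be encoded.

```python
# Impact values that promote a P2 ticket to a major incident, per the rule above.
MAJOR_P2_IMPACTS = {"multiple groups", "campus"}


def is_major_incident(priority: str, impact: str) -> bool:
    """Apply the quoted rule: all P1 tickets, plus P2 tickets with wide impact."""
    if priority == "P1":
        return True
    if priority == "P2":
        return impact.strip().lower() in MAJOR_P2_IMPACTS
    return False


print(is_major_incident("P1", "single user"))
print(is_major_incident("P2", "Campus"))
print(is_major_incident("P2", "single group"))
```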

What are the 5 stages of the incident management process? ›

6 Steps to Incident Management
  • Incident Detection. You need to be able to detect an incident even before the customer spots it. ...
  • Prioritization and Support. ...
  • Investigation and Diagnosis. ...
  • Resolution. ...
  • Incident Closure.

What are KPIs in incident management? ›

KPIs (Key Performance Indicators) are metrics that help businesses determine whether they're meeting specific goals. For incident management, these metrics could be number of incidents, average time to resolve, or average time between incidents.
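The three metrics named above can be computed from nothing more than the opened and resolved timestamps of each incident. The sketch below uses invented sample data and assumes that "time between incidents" means the gap between the start times of consecutive incidents; other definitions exist, so treat the function names and that choice as illustrative.

```python
from datetime import datetime, timedelta
from statistics import mean

# Invented sample data: (opened, resolved) timestamps per incident.
incidents = [
    (datetime(2024, 4, 1, 9, 0), datetime(2024, 4, 1, 10, 0)),
    (datetime(2024, 4, 3, 9, 0), datetime(2024, 4, 3, 11, 0)),
    (datetime(2024, 4, 7, 9, 0), datetime(2024, 4, 7, 9, 30)),
]


def mean_time_to_resolve(records) -> timedelta:
    """Average of (resolved - opened) across all incidents."""
    seconds = mean((resolved - opened).total_seconds() for opened, resolved in records)
    return timedelta(seconds=seconds)


def mean_time_between_incidents(records) -> timedelta:
    """Average gap between the start times of consecutive incidents."""
    starts = sorted(opened for opened, _ in records)
    if len(starts) < 2:
        raise ValueError("need at least two incidents to measure a gap")
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(starts, starts[1:])]
    return timedelta(seconds=mean(gaps))


print("Number of incidents:", len(incidents))
print("Average time to resolve:", mean_time_to_resolve(incidents))
print("Average time between incidents:", mean_time_between_incidents(incidents))
```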

What are 3 types of incidents? ›

3 Types of Incidents You Must Be Prepared to Deal With
  • Major Incidents. Large-scale incidents may not come up too often, but when they do hit, organizations need to be prepared to deal with them quickly and efficiently. ...
  • Repetitive Incidents. ...
  • Complex Incidents.

How do you prioritize incidents? ›

Definition: An Incident's priority is usually determined by assessing its impact and urgency: 'Urgency' is a measure of how quickly a resolution of the Incident is required. 'Impact' is a measure of the extent of the Incident and of the potential damage caused by the Incident before it can be resolved.

What is SLA priority? ›

Priority is the importance or attention given to a ticket based on the SLA. By default, there are four types of priority: Low, Medium, High and Urgent. Tickets with low priority are the least important and do not need to be solved immediately, while urgent tickets should be dealt with as soon as possible.

What is SLA time? ›

SLAs in customer support service are time-based deadlines agreed upon by the customer and outlined in contracts or in the terms of service. They define the specific amount of time the company has to respond and resolve different types of incoming inquiries from customers.

What is a Priority 1 issue? ›

Priority 1 (P1): These issues are usually business-critical. They represent an issue for which no workarounds exist, or there is a severe outage.

What is the criteria for a major incident? ›

A major incident is beyond the scope of business-as-usual operations, and is likely to involve serious harm, damage, disruption or risk to human life or welfare, essential services, the environment or national security.

How is a major incident declared? ›

A major incident can be defined as any incident where the location, number, severity or type of live casualties requires extraordinary resources.

What is incident management interview questions and answers? ›

Interview Questions for Incident Managers:
  • How would you go about leading an incident investigation? ...
  • How would you manage a large team of technical staff? ...
  • How do you keep up to date with the changing IT industry and new software programs? ...
  • Which incident management software systems do you enjoy working with?

What are the 6 stages in the incident management life cycle? ›

The NIST incident response lifecycle breaks incident response down into four main phases: Preparation; Detection and Analysis; Containment, Eradication, and Recovery; and Post-Incident Activity.

How do you classify incidents in ITIL? ›

According to ITIL, the goal of Incident classification and Initial support is to:
  1. Specify the service with which the Incident is related.
  2. Associate the incident with a Service Level Agreement (SLA).
  3. Identify the priority based upon the business impact.
  4. Define what questions should be asked or information checked.

What is the difference between incident and major incident? ›

Failure of a configuration item that has not yet impacted one or more services is also an incident. For example, the failure of one disk from a mirror set. Major Incident – An event which significantly affects a business or organization, and which demands a response beyond the routine incident management process.

How can I improve my incident management skills? ›

Best Practices to Improve Incident Management
  1. Clearly Define Incident. ...
  2. Create Robust Workflows. ...
  3. Execute the Right Resources. ...
  4. Provide Training to Employees and Equip them with the Right Tools. ...
  5. Keep Your Stakeholders Informed. ...
  6. Tie Major Incidents with Other ITIL Processes. ...
  7. Report on Significant Incidents.

What skills should an incident manager have? ›

Incident Manager Skills

In order to successfully complete all tasks, an Incident Manager needs to possess strong problem solving, analytical and time management skills. They should also be able to apply organizational, critical thinking and oral and written communication skills.

What are the skills of incident manager? ›

Skills needed for this role level
  • Asset and configuration management. ...
  • Availability and capacity management. ...
  • Change management. ...
  • Community collaboration. ...
  • Continual service improvement. ...
  • Continuity management. ...
  • Incident management. ...
  • Ownership and initiative.

What are the 5 elements of a good incident report? ›

Facts related to the incident include:
  • The Basics. Identify the specific location, time and date of the incident. ...
  • The Affected. Collect details of those involved and/or affected by the incident. ...
  • The Witnesses. ...
  • The Context. ...
  • The Actions. ...
  • The Environment. ...
  • The Injuries. ...
  • The Treatment.

What are the 4 steps to an investigation? ›

Investigate the incident, collect data. Analyze the data, identify the root causes. Report the findings and recommendations.

What six points should be included in an incident report? ›

8 Items to Include in Incident Reports
  • The time and date the incident occurred. ...
  • Where the incident occurred. ...
  • A concise but complete description of the incident. ...
  • A description of the damages that resulted. ...
  • The names and contact information of all involved parties and witnesses. ...
  • Pictures of the area and any property damage.

What is the response SLA time for Priority 3 incidents? ›

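4 hours in the example quoted here; the actual response target for Priority 3 incidents is whatever the SLA defines.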

What is P1 incident response time? ›

P1, Critical Priority. "P1" or "System Down" Your reports are not showing data, or the interface is unavailable to multiple users. Initial target response: Two (2) hours after ticket submission. Target resolution or workaround: Priority reduced to P2 within 12 hours.

What is impact name for P1 issue? ›

(P1) Complete Outage / Significant Traffic Impact

“Emergency situation; critical impact”

What is a P1 P2 P3 P4? ›

In incident management, P1, P2, P3 and P4 are ticket priority levels: Priority 1 (Critical), Priority 2 (High), Priority 3 (Moderate) and Priority 4 (Low). The same labels are also used for the P visa types issued to foreign athletes, artists, members of entertainment groups, coaches and their family members, which is unrelated to ITIL.

How do you link incident with problem ticket? ›

Within the incident ticket, click the Problems tab. Search for the problem ticket that you want to make the parent of the incident. In the Action column, click Link to make the selected ticket the parent of the current ticket. Save your changes.

What is the difference between impact and urgency? ›

Impact is generally based on how your quality of service is affected. Urgency is a measure of the time for an incident to significantly impact your business. For example, a high impact incident may have low urgency if the impact will not affect the business until the end of the financial year.

What are the 7 steps in incident response? ›

In the event of a cybersecurity incident, best practice incident response guidelines follow a well-established seven step process: Prepare; Identify; Contain; Eradicate; Restore; Learn; Test and Repeat: Preparation matters: The key word in an incident plan is not 'incident'; preparation is everything.

What's the first step in handling an incident? ›

The Five Steps of Incident Response
  1. Preparation. Preparation is the key to effective incident response. ...
  2. Detection and Reporting. ...
  3. Triage and Analysis. ...
  4. Containment and Neutralization. ...
  5. Post-Incident Activity.

What is incident life cycle in ITIL? ›

Objective: Incident Management aims to manage the lifecycle of all Incidents (unplanned interruptions or reductions in quality of IT services). The primary objective of this ITIL process is to return the IT service to users as quickly as possible. Part of: Service Operation.

What are the 6 steps of incident response? ›

The incident response phases are:
  • Preparation.
  • Identification.
  • Containment.
  • Eradication.
  • Recovery.
  • Lessons Learned.

Which priority will a major incident be? ›

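An Incident's priority is usually determined by assessing its impact and urgency, where 'Urgency' is a measure of how quickly a resolution of the Incident is required. In most cases, a major incident will simply be the highest-priority incident in the impact/urgency matrix.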

What are the 2 SLA's for an incident? ›

An SLA is the acceptable time within which an incident needs response (response SLA) or resolution (resolution SLA).

What is the first priority when responding to a major security incident? ›

The first priority in responding to a security incident is to contain it to limit the impact. Documentation, monitoring and restoration are all important, but they should follow containment.

Which is the most difficult phase in incident response? ›

Phase 2: Detection and Analysis

Accurately detecting and assessing incidents is often the most difficult part of incident response for many organizations, according to NIST.

Who decides if the incident is of major incident type? ›

Once a major incident is escalated by 1st- or 2nd-level technical staff, the Incident Manager should determine what resources and expertise are required to resolve the incident and set about forming a Major Incident Team that can resolve the issue as quickly as possible.

What is the SLA for P1 ticket? ›

What are Calibre One's normal SLA definitions? Calibre One defines our ticket PRIORITY levels as follows: Priority 1 (P1) – A complete business down situation or single critical system down with high financial impact. The client is unable to operate.

Article information

Author: Virgilio Hermann JD
