Overview of Incident Lifecycle in SRE

Incidents that disrupt services are unavoidable, but every breakdown is an opportunity to learn and improve. This post is a deep dive into best practices to follow across the lifecycle of an incident, helping teams build a sustainable and reliable product the SRE way.

As the saying goes, “Every problem we face is a blessing in disguise”. Along similar lines, every incident in system infrastructure helps product development and engineering teams better understand the capabilities of the system architecture. This, in turn, helps organizations build a sustainable and reliable product.

In this blog, we lay out the complexities of handling an incident in a well-structured format, with the intent of helping you handle every incident effectively.

ITIL 2011 defines an incident as:

an unplanned interruption to an IT service or reduction in the quality of an IT service or a failure of a Configuration Item that has not yet impacted an IT service [but has potential to do so]

Clearly, in order to maintain acceptable service levels, it is important to resolve incidents and restore normal services as quickly as possible.

ITIL defines a standard lifecycle of an incident. While the actual activities that occur during each phase have changed over time, it is still a good starting point for a detailed description of incidents.

Incidents are identified through reports from monitoring systems or by manual identification. Once an incident is identified, it is logged. An incident log can be used to validate that all incidents have been addressed and to identify trends. At this point, the incident is categorized by adding information such as severity, functional area, and ownership. These three activities (identification, logging, and categorization) were once the responsibility of a first-level monitoring technician; nowadays they are normally automated.
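The identification, logging, and categorization steps can be sketched in a few lines of Python. This is only an illustration, not a description of any particular tool: the `OWNERS` routing table, the error-rate thresholds, and the fallback owner `sre-oncall` are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Incident:
    summary: str
    source: str  # "monitoring" or "manual"
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    severity: Optional[Severity] = None
    functional_area: Optional[str] = None
    owner: Optional[str] = None


# Hypothetical routing table: functional area -> owning team.
OWNERS = {"payments": "payments-oncall", "search": "search-oncall"}

# The incident log: every identified incident is appended here,
# so it can later be audited and mined for trends.
incident_log: List[Incident] = []


def log_incident(summary: str, source: str) -> Incident:
    """Record a newly identified incident in the log."""
    incident = Incident(summary=summary, source=source)
    incident_log.append(incident)
    return incident


def categorize(incident: Incident, area: str, error_rate: float) -> Incident:
    """Attach severity, functional area, and ownership.

    The thresholds below are illustrative assumptions only.
    """
    incident.functional_area = area
    incident.owner = OWNERS.get(area, "sre-oncall")
    if error_rate >= 0.5:
        incident.severity = Severity.HIGH
    elif error_rate >= 0.1:
        incident.severity = Severity.MEDIUM
    else:
        incident.severity = Severity.LOW
    return incident
```

Keeping the log as a plain list of structured records makes the later uses mentioned above (checking that nothing was dropped, spotting trends) a matter of simple filtering.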

The responder team applies the fix proposed in the previous step and, typically, observes the system for a little while to confirm that the incident has been resolved. It often takes several iterations of trial and error before an incident is resolved. Each trial provides more information with which to refine the hypothesis and formulate better fixes.
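The apply, observe, and iterate cycle can be pictured as a small loop. The sketch below is a simplification under stated assumptions: `resolve_by_iteration`, the `observations` count, and the toy fixes in the demo are hypothetical names invented for illustration, and a real health check would query monitoring rather than a dictionary.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

# A candidate fix is a label plus a function that applies it.
Fix = Tuple[str, Callable[[], None]]


def resolve_by_iteration(
    candidate_fixes: Iterable[Fix],
    is_healthy: Callable[[], bool],
    observations: int = 3,
    pause_seconds: float = 0.0,
) -> Tuple[Optional[str], List[Tuple[str, bool]]]:
    """Try fixes in order; accept one only if every observation is healthy."""
    attempts: List[Tuple[str, bool]] = []
    for name, apply_fix in candidate_fixes:
        apply_fix()
        healthy = True
        # Observe the system for a while instead of trusting a single reading.
        for _ in range(observations):
            time.sleep(pause_seconds)
            if not is_healthy():
                healthy = False
                break
        attempts.append((name, healthy))
        if healthy:
            return name, attempts
    return None, attempts


# Toy demo: only the second fix actually restores health.
state = {"cache_cleared": False}
fixes: List[Fix] = [
    ("restart worker", lambda: None),
    ("clear cache", lambda: state.update(cache_cleared=True)),
]
winner, attempts = resolve_by_iteration(fixes, lambda: state["cache_cleared"])
```

The returned `attempts` list is the useful by-product: each failed trial is recorded, which is the information that feeds the next hypothesis.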

Note: The OODA Loop

The description of the phases of an incident gives the impression of a structured, systematic engineering process that is calmly applied by experts. However, reality is rarely so neat and clean. Incidents, particularly major ones, are more akin to a battle than an engineering process. Everyone is under pressure, failure has catastrophic consequences and there is always insufficient information to understand what is really happening.

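The OODA loop (observe, orient, decide, act) requires the responder to observe what is happening, orient by putting those observations in context, decide on a course of action, act on that decision, and then repeat the cycle as new information arrives.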

The incident is marked closed when confirmation is received that normal services have resumed. The definition of confirmation varies but it is often wise to use multiple independent confirmations, for example:
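As a rough illustration of requiring several independent confirmations before closure, the sketch below closes an incident only when a minimum number of separate checks agree. The check names (`monitoring_green`, `synthetic_probe`, `reporter_confirms`) and the `required` threshold are placeholders assumed for the example, not a list taken from ITIL or from this article.

```python
from typing import Callable, Dict, List, Tuple


def can_close(
    confirmations: Dict[str, Callable[[], bool]],
    required: int = 2,
) -> Tuple[bool, List[str]]:
    """Return whether enough independent checks passed, and which ones did."""
    passed = [name for name, check in confirmations.items() if check()]
    return len(passed) >= required, passed


# Hypothetical confirmation sources; real ones would call monitoring,
# run a probe against the service, or ask the original reporter.
checks = {
    "monitoring_green": lambda: True,
    "synthetic_probe": lambda: True,
    "reporter_confirms": lambda: False,
}

closable, passed_checks = can_close(checks, required=2)
```

Returning the names of the checks that passed, and not just a boolean, leaves a record in the incident log of what the closure decision was based on.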

The incident lifecycle gives a clear picture of the activities an incident management team carries out when it encounters an incident. Now let’s look at the best practices a team should adopt to make incident management a less stressful activity.

The ITIL incident lifecycle provides a way to handle an incident, but best practice comes only with extensive hands-on experience of managing incidents. This section is about keeping an incident management team productive by giving its work a clear structure. The practices below help a team work efficiently and avoid burnout.

The first step is to delegate the work involved among all team members. Handling incidents requires everyone to know who is responsible for which work. Clear information about each individual’s roles and responsibilities helps them make key decisions independently. The basic roles in handling an incident are:

The framework of incident response revolves around the three C’s, the goals of effective incident management. They are:

This is about delegation of roles among an incident management team.

This stage is about setting up a designated war room, a centralized space where team members can coordinate with each other to resolve an incident faster. Here, the team can use Slack, telephone, or video conferencing to maintain and record a communication log of incident-related traffic and alerts.

In this stage, the incident commander maintains a live incident document in which all details of the incident are recorded diligently as it unfolds. This live document can be hosted on a wiki and must be accessible to every team member, enabling them to contribute data about the incident. This practice ensures transparency among team members and stakeholders.

This happens when the incident responders need to change during an ongoing incident, whether because their shift has ended or because they are exhausted. When the team changes, whatever work each responder was doing must be seamlessly handed over to the new team. This includes the overall status, the progress of the investigation or corrective action, and more. A real-time incident state document is invaluable for this.
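A live incident state document can be as simple as a structured record that also knows how to render a handoff note. The following is a minimal sketch; the `IncidentState` class, its field names, and the example people and findings are assumptions made for illustration, covering only the items mentioned above (overall status, investigation progress, and corrective action).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentState:
    title: str
    commander: str
    status: str = "investigating"
    findings: List[str] = field(default_factory=list)  # investigation progress
    actions: List[str] = field(default_factory=list)   # corrective actions under way

    def hand_over(self, new_commander: str) -> str:
        """Transfer command and return a plain-text handoff summary."""
        previous = self.commander
        self.commander = new_commander
        lines = [
            f"Incident: {self.title}",
            f"Status: {self.status}",
            f"Handover: {previous} -> {new_commander}",
            "Investigation so far:",
            *[f"  - {finding}" for finding in self.findings],
            "Corrective actions in progress:",
            *[f"  - {action}" for action in self.actions],
        ]
        return "\n".join(lines)


# Hypothetical example data.
doc = IncidentState(title="Checkout latency spike", commander="alice")
doc.findings.append("Latency correlates with cache misses")
doc.actions.append("Rolling back cache configuration")
note = doc.hand_over("bob")
```

Because the summary is generated from the same record the team updates during the incident, the incoming responders read exactly what the outgoing team last knew.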

After every non-trivial incident, it is important to run a postmortem. There are some important outcomes of a good postmortem:

The outcomes are achieved by reviewing the incident and identifying its root cause.

Blameless postmortems

When postmortems are focused on assigning responsibility (i.e., blame), most participants will be primarily concerned with not being blamed. Conversely, a focus on what went wrong allows participants to be more objective and less worried about protecting themselves. It also recognizes that humans make mistakes and that it is more effective to address the circumstances that contribute to errors than to seek humans who don’t make mistakes.

Track and Reward Outcomes

There is no value in postmortems if they do not generate results. Track and reward postmortem outcomes:

Postmortems without outcomes or action items are usually a sign that they’re ineffective.
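Tracking outcomes can start with something very small: record each postmortem's action items, then report which postmortems produced none and how many items were actually completed. The sketch below assumes hypothetical `Postmortem` and `ActionItem` records and invented example data; it is not tied to any specific tracking tool.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ActionItem:
    description: str
    owner: str
    done: bool = False


@dataclass
class Postmortem:
    incident_id: str
    action_items: List[ActionItem] = field(default_factory=list)


def closure_rate(postmortem: Postmortem) -> float:
    """Fraction of action items completed (0.0 when there are none)."""
    if not postmortem.action_items:
        return 0.0
    done = sum(1 for item in postmortem.action_items if item.done)
    return done / len(postmortem.action_items)


def without_outcomes(postmortems: Iterable[Postmortem]) -> List[str]:
    """Incident IDs whose postmortem produced no action items at all."""
    return [pm.incident_id for pm in postmortems if not pm.action_items]


# Hypothetical example data.
reviews = [
    Postmortem("INC-1", [ActionItem("Add cache alert", "alice", done=True),
                         ActionItem("Document rollback", "bob")]),
    Postmortem("INC-2"),
]
empty_reviews = without_outcomes(reviews)
rate = closure_rate(reviews[0])
```

A report built from these two numbers gives a concrete basis for the "track and reward" step: completed items can be recognized, and empty postmortems can be revisited.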

Encourage Transparency

The lessons from a postmortem are wasted if they are not applied to all systems and teams organization-wide. Sharing and transparency help ensure that lessons learned percolate throughout the organization. Some steps to encourage transparency:

Address Postmortem Culture Failures

Signs of a failing postmortem culture must be immediately addressed. Culture is not a set of principles in a document but behavior that is rewarded or penalized. Some failings are:

Incidents are common events that should be handled in a standard pattern. ITIL defines a good template to follow. A few good practices can really help improve the effectiveness of an incident management process:

We hope this blog gives you a better and deeper understanding of the best practices to follow during the lifecycle of an incident, enabling you to handle critical incidents in your organization without much hassle or burnout.
