Handling incidents

How the CKI team tracks and resolves incidents

This page has an internal companion page which might contain additional information.

General idea

GitLab issues are used to track incidents. They not only cover the immediate incident response, but also further work such as documentation improvements or root cause remediation to prevent similar incidents.

Life cycle of an incident issue

Incidents go through the following phases visible on the incident issue board:

CWF::Incident::Active: The incident is significantly affecting the production environment.
CWF::Incident::Mitigated: The incident is affecting the production environment in a limited way.
CWF::Incident::Resolved: The incident is no longer affecting the production environment. Some work remains to be done, e.g. further monitoring, documentation or root cause remediation.
Closed: The issue is closed when all outstanding work items are completed. Any remaining CWF::Incident::* labels are removed.

To identify the issues that are used for incident tracking, the label CWF::Type::Incident is applied to them. This allows the incident board to select issues related to incidents. The label stays on the issues even after they are closed.

Consistent application of the labels is enforced by the sprinter webhook.
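The label rules above can be written down as a small check. The following is only a sketch of the rules as described on this page, not the sprinter webhook's actual implementation. The function name `check_incident_labels` is hypothetical, and the rule that an open incident carries exactly one phase label is an assumption inferred from the phase list.

```python
TYPE_LABEL = "CWF::Type::Incident"
PHASE_PREFIX = "CWF::Incident::"


def check_incident_labels(labels, closed):
    """Return a list of problems found in the labels of a single issue."""
    problems = []
    phases = [label for label in labels if label.startswith(PHASE_PREFIX)]
    is_incident = TYPE_LABEL in labels

    # A phase label only makes sense on an issue marked as an incident.
    if phases and not is_incident:
        problems.append(f"phase label without {TYPE_LABEL}")
    # Closed issues keep the type label but drop all phase labels.
    if closed and phases:
        problems.append("closed issue still carries a phase label")
    # Assumption: an open incident is in exactly one phase at a time.
    if not closed and is_incident and len(phases) != 1:
        problems.append("open incident needs exactly one phase label")
    return problems
```

An open issue labeled CWF::Type::Incident and CWF::Incident::Active passes the check, as does a closed issue that only carries CWF::Type::Incident.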

Creating a new incident issue

Create a new GitLab issue, e.g.

on the top bar on a project page, select the plus sign (+) and then, under This project, select New issue
on the left sidebar on a project page, select Issues and then, in the upper-right corner, select New issue
on a project page, press the i shortcut
on the incident issue board, select the appropriate list menu (⋮) and then Create new issue

Unless the issue was created through the incident issue board, make sure to apply at least the CWF::Type::Incident label to the issue.
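For scripted use, an incident issue can also be created through the GitLab REST API (POST /projects/:id/issues, which accepts a title and a comma-separated labels string). The sketch below only builds the request and does not send it. The helper name `new_incident_request`, the example host and the optional `phase` parameter are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request


def new_incident_request(base_url, project, token, title, phase=None):
    """Build (but do not send) a request that creates an incident issue."""
    labels = ["CWF::Type::Incident"]
    if phase is not None:
        # Optional: start the incident directly in a given phase.
        labels.append(f"CWF::Incident::{phase}")
    # A project path such as "group/project" must be URL-encoded.
    project_id = urllib.parse.quote(str(project), safe="")
    payload = {"title": title, "labels": ",".join(labels)}
    return urllib.request.Request(
        f"{base_url}/api/v4/projects/{project_id}/issues",
        data=json.dumps(payload).encode(),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the request. The token needs permission to create issues in the target project.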

Converting between an incident issue and a normal issue

To convert a normal issue into an incident issue, add the CWF::Type::Incident label to it, e.g.

on the right sidebar on an issue page, select Edit next to Labels, and then select the CWF::Type::Incident label
in the comment box on an issue page (e shortcut), write /label ~"CWF::Type::Incident" and submit the comment

To convert an incident issue into a normal issue, remove the CWF::Type::Incident label from it, e.g.

on the right sidebar on an issue page, select Edit next to Labels, and then deselect the CWF::Type::Incident label
in the comment box on an issue page (e shortcut), write /unlabel ~"CWF::Type::Incident" and submit the comment

Transitioning an incident issue between the different phases

To transition an incident issue to a different phase (<PHASE>), apply the corresponding CWF::Incident::<PHASE> label to it, e.g.

on the right sidebar on an issue page, select Edit next to Labels, and then select the appropriate CWF::Incident::<PHASE> label
in the comment box on an issue page (e shortcut), write /label ~"CWF::Incident::<PHASE>" and submit the comment
on the incident issue board, drag the issue card to the appropriate list
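The same transition can be expressed as the body of an issue update through the GitLab REST API (PUT /projects/:id/issues/:issue_iid, which accepts add_labels and remove_labels). This is a minimal sketch: the helper name `phase_transition_payload` is hypothetical, and removing the other phase labels explicitly is a design choice made here so that the result does not depend on how the instance handles scoped labels.

```python
PHASES = ("Active", "Mitigated", "Resolved")


def phase_transition_payload(new_phase):
    """Return the issue-update fields that move an incident to new_phase."""
    if new_phase not in PHASES:
        raise ValueError(f"unknown incident phase: {new_phase}")
    # Drop the labels of all other phases so only one phase label remains.
    others = [f"CWF::Incident::{phase}" for phase in PHASES if phase != new_phase]
    return {
        "add_labels": f"CWF::Incident::{new_phase}",
        "remove_labels": ",".join(others),
    }
```

For example, `phase_transition_payload("Mitigated")` adds CWF::Incident::Mitigated and removes the Active and Resolved phase labels.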

Closing an incident issue

Close the GitLab issue, e.g.

at the top of an issue page, select Close issue
in the comment box on an issue page (e shortcut), write /close and submit the comment
on the incident issue board, drag the issue card to the Closed list

Weekly review meetings

In addition to short-term work such as the immediate mitigation and resolution of the incident itself, the incident response also has to include strategic improvements to prevent recurrence. However, once an incident is mitigated or resolved, the motivation to work on its root cause drops sharply.

A dedicated weekly incident review meeting is scheduled to ensure consistent progress in the handling of incidents in all phases:

CWF::Incident::Active: reduce the impact on the production environment
CWF::Incident::Mitigated: resolve the direct cause of the incident
CWF::Incident::Resolved: improve on the root cause of the incident, e.g. by improving
- monitoring/alerting of the conditions that led to the incident
- monitoring/alerting for a similar incident
- logging to aid in faster detection/recovery for a similar incident
- documentation
- the underlying code and/or architecture
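As an illustration of how such a review could be prepared, the sketch below groups incident issues by phase and attaches the goal for each phase from the list above. It assumes issues are given as dictionaries with `title` and `labels` fields, as returned by the GitLab issues API. The function name `review_agenda` is hypothetical and not part of any existing CKI tooling.

```python
PHASE_PREFIX = "CWF::Incident::"

# Review goal per phase, in the order the phases are discussed.
REVIEW_GOALS = {
    "Active": "reduce the impact on the production environment",
    "Mitigated": "resolve the direct cause of the incident",
    "Resolved": "improve on the root cause of the incident",
}


def review_agenda(issues):
    """Group incident issues by phase and pair each phase with its goal."""
    agenda = {phase: [] for phase in REVIEW_GOALS}
    for issue in issues:
        for label in issue["labels"]:
            if label.startswith(PHASE_PREFIX):
                phase = label[len(PHASE_PREFIX):]
                if phase in agenda:
                    agenda[phase].append(issue["title"])
    # Skip phases without incidents to keep the agenda short.
    return [
        (phase, REVIEW_GOALS[phase], titles)
        for phase, titles in agenda.items()
        if titles
    ]
```

Each entry of the result is a tuple of phase, goal and issue titles, so the meeting can walk through the phases from most to least urgent.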

Further ideas

In the future, it would be neat to automatically create incidents for persistent Prometheus alerts and Sentry exceptions.