Incident handling
General idea
GitLab issues are used to track incidents. They not only cover the immediate incident response, but also further work such as documentation improvements or root cause remediation to prevent similar incidents.
Life cycle of an incident issue
Incidents go through the following phases visible on the incident issue board:
CWF::Incident::Active
: The incident is significantly affecting the production environment.
CWF::Incident::Mitigated
: The incident is affecting the production environment in a limited way.
CWF::Incident::Resolved
: The incident is no longer affecting the production environment. Some work remains to be done, e.g. further monitoring, documentation or root cause remediation.
Closed
: The issue is closed when all outstanding work items are completed. Any remaining CWF::Incident::* labels are removed.
To identify the issues used for incident tracking, the label CWF::Type::Incident is applied to them. This allows the incident board to select the issues related to incidents. The label stays on an issue even after it is closed.
Consistent application of the labels is enforced by the sprinter webhook.
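The label rules described above can be sketched as a small consistency check. This is only an illustration of the rules as stated in this section, not the sprinter webhook's actual implementation. The function name label_problems is hypothetical, and two of the checks are assumptions: that an issue carries at most one phase label at a time, and that a phase label always goes together with CWF::Type::Incident.

```python
PHASE_PREFIX = "CWF::Incident::"
TYPE_LABEL = "CWF::Type::Incident"


def label_problems(labels, closed):
    """Return a list of rule violations for one issue (empty if consistent)."""
    phases = [label for label in labels if label.startswith(PHASE_PREFIX)]
    problems = []
    # Assumption: a phase label only makes sense on an incident issue.
    if phases and TYPE_LABEL not in labels:
        problems.append("phase label without " + TYPE_LABEL)
    # Assumption: an incident is in exactly one phase at a time.
    if len(phases) > 1:
        problems.append("more than one phase label")
    # From the text: remaining phase labels are removed on close.
    if closed and phases:
        problems.append("closed issue still carries a phase label")
    return problems
```

A closed issue that only carries CWF::Type::Incident passes the check, which matches the rule that this label stays on the issue after closing.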
Creating a new incident issue
Create a new GitLab issue, e.g.
- on the top bar on a project page, select the plus sign (+) and then, under This project, select New issue
- on the left sidebar on a project page, select Issues and then, in the upper-right corner, select New issue
- on a project page, press the i shortcut
- on the incident issue board, select the appropriate list menu (⋮) and then Create new issue
Unless the issue was created through the incident issue board, make sure to label the issue with at least CWF::Type::Incident.
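For scripted creation, the same result can probably be reached through the GitLab REST API (POST /projects/:id/issues with the title and labels parameters). The sketch below only builds the request and does not send it. The function name, the base URL and the optional phase argument are assumptions made for illustration; this section does not prescribe an initial phase label.

```python
import json
import urllib.parse
import urllib.request

TYPE_LABEL = "CWF::Type::Incident"


def build_create_incident_request(base_url, project, token, title, phase=None):
    """Build (but do not send) a request that creates an incident issue."""
    labels = [TYPE_LABEL]
    if phase is not None:
        # Hypothetical convenience: also set an initial phase label.
        labels.append(f"CWF::Incident::{phase}")
    # The project may be a numeric ID or a URL-encoded path such as group%2Fproject.
    project_ref = urllib.parse.quote(str(project), safe="")
    url = f"{base_url}/api/v4/projects/{project_ref}/issues"
    payload = json.dumps({"title": title, "labels": ",".join(labels)})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        method="POST",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )
```

Passing the returned object to urllib.request.urlopen would send the request; a token with API access to the project is assumed.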
Converting between an incident issue and a normal issue
To convert a normal issue into an incident issue, add the CWF::Type::Incident label to it, e.g.
- on the right sidebar on an issue page, select Edit next to Labels, and then select the CWF::Type::Incident label
- in the comment box on an issue page (e shortcut), write /label ~"CWF::Type::Incident" and submit the comment
To convert an incident issue into a normal issue, remove the CWF::Type::Incident label from it, e.g.
- on the right sidebar on an issue page, select Edit next to Labels, and then deselect the CWF::Type::Incident label
- in the comment box on an issue page (e shortcut), write /unlabel ~"CWF::Type::Incident" and submit the comment
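The two quick actions used above differ only in the command name. A minimal sketch of a helper that produces the comment text, for example for use in a script or a text snippet, could look like this. The helper names are hypothetical.

```python
TYPE_LABEL = "CWF::Type::Incident"


def quick_action(action, label):
    """Return the quick action line that adds or removes a label."""
    if action not in ("label", "unlabel"):
        raise ValueError(f"unsupported quick action: {action}")
    # Labels containing special characters are quoted after the ~ marker.
    return f'/{action} ~"{label}"'


def convert_comment(to_incident):
    """Comment text converting an issue to (True) or from (False) an incident issue."""
    return quick_action("label" if to_incident else "unlabel", TYPE_LABEL)
```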
Transitioning an incident issue between the different phases
To transition an incident issue to a different phase (<PHASE>), apply the corresponding CWF::Incident::<PHASE> label, e.g.
- on the right sidebar on an issue page, select Edit next to Labels, and then select the appropriate CWF::Incident::<PHASE> label
- in the comment box on an issue page (e shortcut), write /label ~"CWF::Incident::<PHASE>" and submit the comment
- on the incident issue board, drag the issue card to the appropriate list
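A phase transition can also be expressed as parameters for the GitLab edit issue endpoint (PUT /projects/:id/issues/:issue_iid), which accepts add_labels and remove_labels. The sketch below only computes those parameters. Removing the other phase labels explicitly is an assumption on the safe side; scoped labels or the sprinter webhook may already take care of it. The function name is hypothetical.

```python
PHASES = ("Active", "Mitigated", "Resolved")


def transition_params(phase):
    """Return edit-issue parameters that move an incident issue to `phase`."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    target = f"CWF::Incident::{phase}"
    # Drop every other phase label so the issue ends up in exactly one phase.
    others = [f"CWF::Incident::{p}" for p in PHASES if p != phase]
    return {"add_labels": target, "remove_labels": ",".join(others)}
```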
Closing an incident issue
Close the GitLab issue, e.g.
- at the top of an issue page, select Close issue
- in the comment box on an issue page (e shortcut), write /close and submit the comment
- on the incident issue board, drag the issue card to the Closed list
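When closing through the GitLab API instead, the edit issue endpoint takes state_event set to close. The following sketch computes the parameters and also strips any phase labels still present, mirroring the life cycle rule for closed issues. The explicit removal may be redundant if the sprinter webhook already handles it, and the function name is hypothetical.

```python
PHASE_PREFIX = "CWF::Incident::"


def close_params(current_labels):
    """Return edit-issue parameters that close an incident issue."""
    params = {"state_event": "close"}
    # Only phase labels are removed; CWF::Type::Incident stays on the issue.
    stale = [label for label in current_labels if label.startswith(PHASE_PREFIX)]
    if stale:
        params["remove_labels"] = ",".join(stale)
    return params
```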
Weekly review meetings
Besides the short-term work of mitigating and resolving the incident itself, the incident response also has to include strategic improvements that prevent recurrence. However, once an incident is mitigated or resolved, the motivation to address its root cause drops sharply.
A dedicated weekly incident review meeting is scheduled to ensure consistent progress in the handling of incidents in all phases:
CWF::Incident::Active
: reduce the impact on the production environment
CWF::Incident::Mitigated
: resolve the direct cause of the incident
CWF::Incident::Resolved
: improve on the root cause of the incident, e.g. by improving
- monitoring/alerting of the conditions that led to the incident
- monitoring/alerting for a similar incident
- logging to aid in faster detection/recovery for a similar incident
- documentation
- the underlying code and/or architecture
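To prepare such a meeting, the open incident issues could be grouped by phase together with the goal for that phase. The sketch below assumes each issue is a dictionary with title and labels keys, as in the issue objects returned by the GitLab API. The function name and the agenda layout are hypothetical.

```python
# Goal per phase, taken from the list above.
PHASE_GOALS = {
    "Active": "reduce the impact on the production environment",
    "Mitigated": "resolve the direct cause of the incident",
    "Resolved": "improve on the root cause of the incident",
}


def review_agenda(issues):
    """Group issue titles by incident phase, in life cycle order."""
    agenda = {phase: [] for phase in PHASE_GOALS}
    for issue in issues:
        for phase in PHASE_GOALS:
            if f"CWF::Incident::{phase}" in issue["labels"]:
                agenda[phase].append(issue["title"])
                break  # assumption: the first matching phase wins
    return agenda
```

Issues without a phase label, such as closed or non-incident issues, are left out of the agenda.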
Further ideas
In the future, it would be neat to automatically create incidents for persistent Prometheus alerts and Sentry exceptions.
Last modified March 23, 2023: Document working the GitLab issues (04cec5d)