CKI-004: Service Level Objectives and error budgets

How reliable do we want our service to be, and how do we get there?


We’re providing a service, and it should be pretty reliable. To manage our customers’ and our own expectations, we need to capture these needs and formalize what that “pretty” means. If the service is not reliable enough, we need to focus on increasing its stability instead of on new features.

The approach of Service Level Objectives and error budgets is explained in detail in the Site Reliability Engineering book in chapters 3 and 4.
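To illustrate the error-budget idea from those chapters: an SLO target directly implies how many failures, or how much downtime, per period is acceptable. A minimal sketch (the figures below are illustrative examples, not our production numbers):

```python
# Illustrative error-budget arithmetic; the counts are made up for the example.

def error_budget_events(slo_target: float, total_events: int) -> int:
    """How many events per period may fail while still meeting the SLO."""
    return round(total_events * (1 - slo_target))

def error_budget_minutes(slo_target: float, days: int = 30) -> float:
    """How many minutes of downtime per period an availability SLO allows."""
    return days * 24 * 60 * (1 - slo_target)

# 99.9% of a hypothetical 10,000 monthly pipelines -> 10 pipelines may fail
print(error_budget_events(0.999, 10_000))     # -> 10
# 99.9% availability over a 30-day month -> ~43.2 minutes of downtime
print(round(error_budget_minutes(0.999), 1))  # -> 43.2
```

Once the budget for a period is spent, the book’s recommendation is exactly what we propose below: stop feature work and spend the time on reliability instead.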


The infrastructure on top of which CKI runs is often unreliable. This leads to growing frustration for both our customers and the CKI team itself. We have to drop our work and focus on working around the problem, or retry/resubmit the testing manually. We’re unable to continue development, as our development workflow relies on the same infrastructure setup.

With growing customer frustration, we’re losing their trust in the service. With our own frustration, team morale and excitement are in shambles, leading to worse performance. We lose valuable time, which we could spend on shiny new features and improvements, to fixing up and working around broken infrastructure. This has to stop, and we need clear rules in place on how to streamline things.


Gathering expectations from customers

For upstream customers, we utilized last year’s Kernel CI community survey, as our customer base is identical. We focused on the developer and maintainer roles. For internal customers, we checked with kernel engineers who belong to the same two groups and are also active in the new workflow transition. Most of the answers we got were consistent across our upstream and internal customers. Here’s a summary of the feedback:

  • Testing should start right away after changes are submitted, though some delay (e.g. due to outages) is acceptable
  • Build results should be available within 1 hour
  • Testing should complete and results should be delivered within 24 hours. A decent chunk of upstream leans towards a 6-hour mark, though most people are still fine with 24 hours. Internally, longer times (24-48 hours) are acceptable as reviews don’t happen right away.
  • Webhooks shouldn’t lose any messages
  • Reliability of the hosts used for testing and the queues are a concern

These expectations match the data we had from previous conversations years ago, and it’s good to have them confirmed. We cannot gather expectations for the new GitLab workflow, as it is not used in production yet. We expect feedback and expectations about it to come up in the future.
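Expectations like “90% within 24 hours” translate into percentile measurements over the observed latencies. A hedged sketch of how such a check could look, assuming we have the submission-to-report durations in hours (the data is made up for the example):

```python
# Illustrative sketch: check a latency expectation against measured durations.

def percentile(values, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hours from revision submission to results being reported (hypothetical data).
report_latencies = [2, 3, 5, 6, 8, 12, 18, 20, 22, 30]

# "90% of revisions have results reported within 24 hours"
print(percentile(report_latencies, 90))        # -> 22
print(percentile(report_latencies, 90) <= 24)  # -> True
```

The same calculation works for the 1-hour build-result expectation; only the measured durations and the threshold change.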

Gathering expectations from CKI team

These are some of the problems we wrote up or talked about previously with the team:

  • CI jobs randomly fail due to infrastructure problems and need to be manually restarted
  • CI is slow, especially if the runners are busy or testing (Beaker) is needed
  • Any problem visible in production impacts the development environment as well and has the potential to put testing and development to a halt, depending on the severity of the problem

Created SLOs based on the feedback

All numbers are assumed to be per-month averages.

  • 99.9% of revisions have a pipeline
  • 95% of pipelines start within 10 minutes of the revision being available
  • 99% of pipelines finish without manual interference
  • 95% of CI pipelines do not fail for reasons not introduced by the tested changes
  • 95% of all CI pipelines don’t pick a pipeline older than 2 weeks to resubmit
  • 80% of the tests are not flaky at the 95% level
  • 90% of testing has build results available within 1 hour of submission
  • 90% of revisions have results reported within 24 hours of submission
  • 99.99% of webhook messages are reliably delivered/retried (not lost)
  • 60% of test runs don’t fail due to machine/lab flakiness
  • 50% of the alerts (monit/alertmanager/IRC) are meaningful and actionable
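With the SLOs in place, checking compliance reduces to comparing a measured ratio against the target and tracking how much error budget remains for the month. A minimal sketch with made-up counts:

```python
# Illustrative monthly SLO check; the pipeline counts are hypothetical.

def slo_status(good: int, total: int, target: float):
    """Return (measured ratio, remaining error budget in events)."""
    measured = good / total
    budget = round(total * (1 - target))       # allowed bad events this period
    remaining = budget - (total - good)        # negative -> SLO violated
    return measured, remaining

# "99% of pipelines finish without manual interference":
# say 4,975 of 5,000 pipelines were fine this month.
measured, remaining = slo_status(good=4_975, total=5_000, target=0.99)
print(f"measured {measured:.2%}, budget remaining: {remaining} pipelines")
# -> measured 99.50%, budget remaining: 25 pipelines
```

A negative `remaining` value is the signal to invoke the rule proposed below: pause feature work and spend the time on stability.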




  • Finding out how reliable our service actually is, compared to our perception
  • Having the information and guarantees available to our customers if needed
  • Being able to put a blocker on new development if stability and reliability need to be our main focus, so the team is not overloaded with trying to do both at the same time


N/A. Better ideas welcome though!


Continue doing what we do now

The current situation is not good, as any team member having to deal with the constant infrastructure outages can confirm. We really don’t want to continue with the current process of juggling stability fixups, manual workarounds and our actual workload.

Stop new development completely until the service is 300% reliable

Insert any number instead of 300%. This is a very drastic solution which doesn’t make management happy, as they expect shiny new things to be delivered 😜 The proposed solution attempts to make both management happy (with clear rules about delivery and where the cutoff happens) and us happy (by giving us the power to say “stop” and focus on stability as needed).