CKI-004: Service Level Objectives and error budgets
We’re providing a service, and it should be pretty reliable. To manage our customers' and our own expectations, we need to capture those needs and formalize what “pretty reliable” means. If the service is not reliable enough, we need to focus on increasing its stability instead of on new features.
The approach of Service Level Objectives and error budgets is explained in detail in the Site Reliability Engineering book in chapters 3 and 4.
The infrastructure on top of which CKI runs is often unreliable. This leads to growing frustration among both our customers and the CKI team itself. We have to drop our work and focus on working around the problem, or retry/resubmit the testing manually. We’re unable to continue development because our development workflow relies on the same infrastructure setup.
With growing customer frustration, we’re losing their trust in the service. With our own frustration, team morale and excitement are in shambles, leading to worse performance. We lose valuable time that we could spend on shiny new features and improvements by fixing up and working around broken infrastructure. This has to stop, and we need some clear rules in place on how to streamline things.
Gathering expectations from customers
For upstream customers, we utilized the Kernel CI community survey from last year, as our customer base is identical. We focused on developer and maintainer roles. For internal customers, we checked with kernel engineers who belong to the same two groups and are also active in the new workflow transition. Most of the answers we got were consistent across our upstream and internal customers. Here’s the summary of the feedback:
- Testing should start right away after changes are submitted, though some delay (e.g. due to outages) is acceptable
- Build results should be available within 1 hour
- Testing should complete and results should be delivered within 24 hours. A decent chunk of upstream leans towards a 6-hour mark, though most people are still fine with 24 hours. Internally, longer times (24-48 hours) are acceptable as reviews don’t happen right away.
- Webhooks shouldn’t lose any messages
- Reliability of the hosts used for testing and of the queues is a concern
These expectations match the data we had from previous conversations years ago, and it’s good to have them confirmed. We cannot gather expectations for the new GitLab workflow, as that is not used in production yet. We are expecting some feedback and expectations about that to come up in the future.
Gathering expectations from CKI team
These are some of the problems we wrote up or talked about previously with the team:
- CI jobs randomly fail due to infrastructure problems and need to be manually restarted
- CI is slow, especially if the runners are busy or testing (Beaker) is needed
- Any problem visible in production impacts the development environment as well and has the potential to put testing and development to a halt, depending on the severity of the problem
Created SLOs based on the feedback
Assume per-month averages.
- 99.9% of revisions have a pipeline
- 95% of pipelines start within 10 minutes of revisions being available
- 99% of pipelines finish without manual intervention
- 95% of CI pipelines do not fail for reasons unrelated to the tested changes
- 95% of all CI pipelines don’t pick a pipeline older than 2 weeks to resubmit
- 80% of the tests are not flaky at 95%
- 90% of testing has build results available within 1 hour of submission
- 90% of revisions have results reported within 24 hours of submission
- 99.99% of webhook messages are reliably delivered or retried (not lost)
- 60% of test runs don’t fail due to machine/lab flakiness
- 50% of the alerts (monit/alertmanager/IRC) are meaningful and actionable
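As an illustration of how one of these objectives could be checked, here is a minimal sketch for the "95% of pipelines start within 10 minutes" target. The function name `fraction_within` and the sample delays are hypothetical; real data would have to come from wherever pipeline timestamps are recorded, which this document does not specify.

```python
from datetime import timedelta


def fraction_within(delays, limit):
    """Return the share of delays that are at most `limit`."""
    if not delays:
        raise ValueError("no samples in the measurement window")
    return sum(1 for d in delays if d <= limit) / len(delays)


# Hypothetical sample for one month: time between a revision becoming
# available and its pipeline starting. Nine are fast, one took 30 minutes.
delays = [timedelta(minutes=m) for m in (1, 2, 3, 4, 5, 6, 7, 8, 9, 30)]

target = 0.95  # SLO: 95% of pipelines start within 10 minutes
sli = fraction_within(delays, timedelta(minutes=10))
slo_met = sli >= target

print(f"measured: {sli:.1%}, target: {target:.0%}, met: {slo_met}")
```

With this made-up sample, 9 of 10 pipelines start in time, so the measured value is 90% and the objective is missed for the month.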
- Find out how reliable our service actually is compared to our perception
- Have the information and guarantees available for our customers if needed
- Be able to put a blocker on new development if stability and reliability need to be our main focus, so the team is not overloaded with trying to do both at the same time
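The blocker on new development can be tied to an error budget. The sketch below assumes the simplest possible rule, namely that feature work pauses once the monthly budget is overspent; the actual cutoff is a policy decision not fixed here. The function names and the pipeline counts are hypothetical.

```python
from fractions import Fraction


def error_budget(total, bad, target=Fraction("0.99")):
    """Return (allowed_bad, remaining) event counts for an SLO target.

    `target` is the share of events that must be good, e.g. 0.99 for
    "99% of pipelines finish without manual intervention".
    Fraction keeps the arithmetic exact.
    """
    allowed = total * (1 - target)
    return allowed, allowed - bad


def should_freeze_features(total, bad, target=Fraction("0.99")):
    """Assumed policy: stop feature work once the budget is overspent."""
    _, remaining = error_budget(total, bad, target)
    return remaining < 0


# Hypothetical month: 2000 pipelines, 25 needed manual intervention.
allowed, remaining = error_budget(2000, 25)
print(f"allowed: {allowed}, remaining: {remaining}")
print("freeze new development:", should_freeze_features(2000, 25))
```

For 2000 pipelines, a 99% target allows 20 bad ones, so 25 manual interventions overspend the budget by 5 and the sketch reports that stability work should take priority.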
N/A. Better ideas welcome though!
Continue doing what we do now
The current situation is not good, as any team member having to deal with the constant infrastructure outages can confirm. We really don’t want to continue with the current process of juggling stability fixups, manual workarounds and our actual workload.
Stop new development completely until the service is 300% reliable
Substitute any number for 300%. This is a very drastic solution that doesn’t make management happy, as they expect shiny new things to be delivered 😜 The proposed solution attempts to make both management happy (with clear rules about delivery and where the cutoff happens) and us (by giving us the power to say “stop” and focus on stability as needed).