Reliability engineering
CKI provides a kernel testing service to Red Hat and the Linux kernel community, collectively referred to as the customers. Having a reliable service is important in general, and especially so for the sensitive kernel community. CKI can only fulfill its mission if the rate of false positives caused by internal infrastructure issues is kept to a minimum.
Reliability
The design of the CKI testing system is based on the assumption of an underlying unreliable infrastructure. A pipeline, once started for a revision under test (RUT), might fail because of various infrastructure issues. These might be related to gitlab.com, storage, container provisioning or other factors.
Component types
Testing system components can be categorized according to the effect a failure has on the system as a whole.
Essential
components are required for the system to run. Any failure in one
of those components will directly lead to a failure of the testing system if a
pipeline is scheduled or running. Most external components fall into this
category, as do all the software components that are used for triggering and
running a pipeline.
Necessary
components are required for the system to work correctly, but they do not have to be available all the time: a temporary failure will not lead to a failure of the testing system. These include any components that come into play after a pipeline is finished.
Optional
components are nice to have, but testing would continue if they were
broken. A failure in one of them will not affect testing and reporting itself,
but might reduce the reliability or observability of the system.
Essential components
Outside dependencies:
- Kubernetes clusters and AWS EC2 machines
- S3 storage
- registry.gitlab.com and quay.io
- gitlab.com source code hosting
- Beaker labs
Internal components:
- pipeline-trigger
- gitlab-runner
- pipeline-definition, kpet, skt, upt
Necessary components
Necessary components have to work at least some of the time to keep the pipelines working. They are implemented either as Kubernetes CronJobs or as microservices connected to the AMQP message bus.
Some CronJob examples are:
- lookaside updates
- git-cache-updater
After the pipelines are finished:
- DataWarehouse and associated modules
- reporter(-ng)
- kernel workflow webhooks
Optional components
These components provide observability and increase reliability:
- pipeline-herder
- orphan-hunter
- beaker-reaper
- Prometheus stack
- slack-bot
Retries
Failures caused by unreliable infrastructure can be almost completely mitigated by retries. These are currently implemented at three different levels:
Pipeline jobs
GitLab provides automatic retries for jobs that fail because of system failures. The pipeline-herder extends this approach by retrying jobs based on custom matchers.
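A custom matcher of this kind can be sketched as a set of log patterns that distinguish infrastructure failures (worth retrying) from genuine test failures. The patterns and the should_retry helper below are hypothetical illustrations, not the pipeline-herder's actual configuration:

```python
import re

# Hypothetical patterns for transient infrastructure failures; the real
# pipeline-herder ships its own curated set of matchers.
INFRA_PATTERNS = [
    re.compile(r"Connection reset by peer"),
    re.compile(r"ERROR: Job failed \(system failure\)"),
    re.compile(r"no space left on device", re.IGNORECASE),
]


def should_retry(job_log: str, attempts: int, max_attempts: int = 3) -> bool:
    """Retry only infrastructure failures, and only a limited number of times."""
    if attempts >= max_attempts:
        return False
    return any(pattern.search(job_log) for pattern in INFRA_PATTERNS)
```

Capping the attempt count matters: a genuine infrastructure outage would otherwise keep a job retrying indefinitely.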
Automatic redelivery of AMQP messages that failed processing
All webhooks are immediately converted into messages on the CKI message bus. Failures to process these messages result in automatic redelivery after a certain time period.
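The redelivery pattern can be illustrated with a small in-memory sketch: a message whose handler fails goes back on the queue and is delivered again after a delay, up to a limit. This only simulates the behavior; in CKI, the message broker itself performs the redelivery:

```python
import queue
import time


def process_with_redelivery(messages, handler, redelivery_delay=0.0,
                            max_deliveries=3):
    """Deliver each message to handler; re-enqueue failed messages.

    In-memory stand-in for broker-side redelivery (e.g. a broker requeueing
    a message after the consumer rejects it). Returns the messages that were
    eventually processed and those dropped after max_deliveries attempts.
    """
    pending = queue.Queue()
    for message in messages:
        pending.put((message, 1))
    processed, dropped = [], []
    while not pending.empty():
        message, delivery = pending.get()
        try:
            handler(message)
            processed.append(message)
        except Exception:
            if delivery < max_deliveries:
                time.sleep(redelivery_delay)  # back off before redelivery
                pending.put((message, delivery + 1))
            else:
                dropped.append(message)
    return processed, dropped
```

A message that fails transiently is processed on a later delivery, while other messages continue to flow in the meantime.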
Automatic retries of REST API calls
The session module in cki-lib provides a get_session helper to obtain a requests session that is configured for automatic retries. This is used by all CKI code that uses requests, even indirectly, e.g. via the GitLab API wrapper.
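Such a session is typically built by mounting an HTTPAdapter with a urllib3 Retry policy. The sketch below shows the general pattern; the exact defaults used by cki-lib's get_session may differ:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_retrying_session(retries=3, backoff_factor=0.5):
    """Return a requests session that retries transient HTTP failures.

    Illustrative defaults: the status codes and backoff factor here are
    assumptions, not cki-lib's actual configuration.
    """
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,  # exponential backoff between attempts
        status_forcelist=(500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Because the retry logic lives in the mounted adapter, callers use the session exactly like a plain requests session and get retries transparently.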
Observability
The CKI testing system is monitored to detect infrastructure failures:
- all exceptions are reported to Sentry
- all microservices expose Prometheus endpoints
- all logs are sent to Loki
- metrics are visualized via Grafana
- alerts are sent to the CKI Slack channel via the slack-bot
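A Prometheus endpoint is simply an HTTP handler that serves current metric values in the text exposition format. The stdlib-only sketch below, with a hypothetical cki_failures_total counter, shows what a scrape of such an endpoint returns (real services would normally use the prometheus_client library instead):

```python
import http.server
import threading
import urllib.request

# Hypothetical counter; a real service would track actual failures.
FAILURES = {"count": 0}


class MetricsHandler(http.server.BaseHTTPRequestHandler):
    """Serve a counter in the Prometheus text exposition format."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP cki_failures_total Hypothetical failure counter\n"
            "# TYPE cki_failures_total counter\n"
            f"cki_failures_total {FAILURES['count']}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


# Scrape the endpoint once, as a Prometheus server would.
server = http.server.HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/metrics"
metrics = urllib.request.urlopen(url).read().decode()
server.shutdown()
```

Prometheus periodically scrapes such endpoints, stores the samples, and feeds them to Grafana dashboards and alerting rules.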