Reliability engineering
CKI provides a kernel testing service to Red Hat and the Linux kernel community, collectively referred to as the customers. Having a reliable service is important in general, and especially so for the sensitive kernel community. CKI can only fulfill its mission if the rate of false positives caused by internal infrastructure issues is kept to a minimum.
Reliability
The design of the CKI testing system is based on the assumption of an underlying unreliable infrastructure. A pipeline, once started for a revision under test (RUT), might fail because of various infrastructure issues. These might be related to gitlab.com, storage, container provisioning or other factors.
Component types
Testing system components can be categorized according to the effect a failure has on the system as a whole.
Essential
components are required for the system to run. Any failure in one
of those components will directly lead to a failure of the testing system if a
pipeline is scheduled or running. Most external components fall into this
category, as do all the software components that are used for triggering and
running a pipeline.
Necessary
components are required for the system to work correctly, but they do not have to be available all the time: a temporary failure will not lead to a failure of the testing system. These include any components that come into play after a pipeline is finished.
Optional
components are nice to have, but testing would continue if they were
broken. A failure in one of them will not affect testing and reporting itself,
but might reduce the reliability or observability of the system.
Essential components
Outside dependencies:
- Kubernetes clusters and AWS EC2 machines
- S3 storage
- registry.gitlab.com and quay.io
- gitlab.com source code hosting
- Beaker labs
Internal components:
- pipeline-trigger
- gitlab-runner
- pipeline-definition, kpet, skt, upt
Necessary components
Necessary components have to work at least some of the time to keep the pipelines working. They are implemented either as Kubernetes CronJobs or as microservices connected to the AMQP message bus.
Some CronJob examples are:
- lookaside updates
- git-cache-updater
After the pipelines are finished:
- DataWarehouse and associated modules
- reporter(-ng)
- kernel workflow webhooks
Optional components
These components provide observability and increase reliability:
- pipeline-herder
- orphan-hunter
- beaker-reaper
- Prometheus stack
- slack-bot
Retries
Failures caused by unreliable infrastructure can be almost completely mitigated by retries. These are currently implemented at three different levels:
Pipeline jobs
GitLab provides automatic retries for jobs that fail because of system failures. The pipeline-herder extends this approach by retrying jobs based on custom matchers.
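A custom matcher of this kind can be sketched as a set of log patterns that distinguish infrastructure failures (worth retrying) from genuine test failures. The patterns and the should_retry helper below are hypothetical illustrations, not the pipeline-herder's actual configuration:

```python
import re

# Hypothetical patterns for transient infrastructure failures; the real
# pipeline-herder ships its own curated set of matchers.
INFRA_PATTERNS = [
    re.compile(r"Connection reset by peer"),
    re.compile(r"ERROR: Job failed \(system failure\)"),
    re.compile(r"no space left on device", re.IGNORECASE),
]


def should_retry(job_log: str, attempts: int, max_attempts: int = 3) -> bool:
    """Retry only infrastructure failures, and only a limited number of times."""
    if attempts >= max_attempts:
        return False
    return any(pattern.search(job_log) for pattern in INFRA_PATTERNS)
```

Capping the attempt count matters: a genuine infrastructure outage would otherwise keep a job retrying indefinitely.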
Automatic redelivery of AMQP messages that failed processing
All webhooks are immediately converted into messages on the CKI message bus. Failures to process these messages result in automatic redelivery after a certain time period.
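The redelivery pattern can be illustrated with a small in-memory sketch: a message whose handler fails goes back on the queue and is delivered again after a delay, up to a limit. This only simulates the behavior; in CKI, the message broker itself performs the redelivery:

```python
import queue
import time


def process_with_redelivery(messages, handler, redelivery_delay=0.0,
                            max_deliveries=3):
    """Deliver each message to handler; re-enqueue failed messages.

    In-memory stand-in for broker-side redelivery (e.g. a broker requeueing
    a message after the consumer rejects it). Returns the messages that were
    eventually processed and those dropped after max_deliveries attempts.
    """
    pending = queue.Queue()
    for message in messages:
        pending.put((message, 1))
    processed, dropped = [], []
    while not pending.empty():
        message, delivery = pending.get()
        try:
            handler(message)
            processed.append(message)
        except Exception:
            if delivery < max_deliveries:
                time.sleep(redelivery_delay)  # back off before redelivery
                pending.put((message, delivery + 1))
            else:
                dropped.append(message)
    return processed, dropped
```

A message that fails transiently is processed on a later delivery, while other messages continue to flow in the meantime.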
Automatic retries of REST API calls
The session module in cki-lib provides a get_session helper to obtain a requests session that is configured for automatic retries. This is used by all CKI code that uses requests, even indirectly, e.g. via the GitLab API wrapper.
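Such a session is typically built by mounting an HTTPAdapter with a urllib3 Retry policy. The sketch below shows the general pattern; the exact defaults used by cki-lib's get_session may differ:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_retrying_session(retries=3, backoff_factor=0.5):
    """Return a requests session that retries transient HTTP failures.

    Illustrative defaults: the status codes and backoff factor here are
    assumptions, not cki-lib's actual configuration.
    """
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,  # exponential backoff between attempts
        status_forcelist=(500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Because the retry logic lives in the mounted adapter, callers use the session exactly like a plain requests session and get retries transparently.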
Observability
The CKI testing system is monitored to detect infrastructure failures:
- all exceptions are reported to Sentry
- all microservices expose Prometheus endpoints
- all logs are sent to Loki
- metrics are visualized via Grafana
- alerts are sent to the CKI Slack channel via the slack-bot
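A Prometheus endpoint is simply an HTTP handler that serves current metric values in the text exposition format. The stdlib-only sketch below, with a hypothetical cki_failures_total counter, shows what a scrape of such an endpoint returns (real services would normally use the prometheus_client library instead):

```python
import http.server
import threading
import urllib.request

# Hypothetical counter; a real service would track actual failures.
FAILURES = {"count": 0}


class MetricsHandler(http.server.BaseHTTPRequestHandler):
    """Serve a counter in the Prometheus text exposition format."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP cki_failures_total Hypothetical failure counter\n"
            "# TYPE cki_failures_total counter\n"
            f"cki_failures_total {FAILURES['count']}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


# Scrape the endpoint once, as a Prometheus server would.
server = http.server.HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/metrics"
metrics = urllib.request.urlopen(url).read().decode()
server.shutdown()
```

Prometheus periodically scrapes such endpoints, stores the samples, and feeds them to Grafana dashboards and alerting rules.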