CKI provides a kernel testing service to its customers: Red Hat and the Linux kernel community. A reliable service is important in general, and especially so for the sensitive kernel community. CKI can only fulfill its mission if false positives caused by internal infrastructure issues are kept to a minimum.
The design of the CKI testing system is based on the assumption of an underlying unreliable infrastructure. A pipeline, once started for a revision under test (RUT), might fail because of various infrastructure issues. These might be related to gitlab.com, storage, container provisioning or other factors.
Testing system components can be categorized according to the effect a failure has on the system as a whole.
Essential components are required for the system to run. Any failure in one
of those components will directly lead to a failure of the testing system if a
pipeline is scheduled or running. Most external components fall into this
category, as do all the software components that are used for triggering and
running a pipeline.
Necessary components are required for the system to run, but a temporary
failure will not immediately lead to a failure of the testing system. These
include any components that come into play after a pipeline is finished.
Optional components are nice to have, but testing would continue if they were
broken. A failure in one of them will not affect testing and reporting itself,
but might reduce the reliability or observability of the system.
Essential components include:
- Kubernetes clusters and AWS EC2 machines
- S3 storage
- registry.gitlab.com and quay.io
- gitlab.com source code hosting
- Beaker labs
- pipeline-definition, kpet, skt, upt
Necessary components have to work at least sometimes to keep the pipelines working. They are implemented either as Kubernetes CronJobs or microservices connected to the AMQP message bus.
Some CronJob examples are:
- lookaside updates
After the pipelines are finished:
- DataWarehouse and associated modules
- kernel workflow webhooks
Optional components provide observability and increase reliability:
- Prometheus stack
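The categorization above can be written down as a small lookup table. The following is only an illustrative sketch: the `Impact` enum, the `COMPONENTS` mapping and the `blocks_running_pipelines` helper are hypothetical names and not part of any CKI code base. Only the component names are taken from the lists above.

```python
from enum import Enum


class Impact(Enum):
    """Effect that a component failure has on the testing system."""
    ESSENTIAL = "essential"   # failure breaks scheduled or running pipelines
    NECESSARY = "necessary"   # has to work at least sometimes
    OPTIONAL = "optional"     # only reduces reliability or observability


# Hypothetical mapping built from a subset of the lists above.
COMPONENTS = {
    "S3 storage": Impact.ESSENTIAL,
    "gitlab.com source code hosting": Impact.ESSENTIAL,
    "Beaker labs": Impact.ESSENTIAL,
    "kpet": Impact.ESSENTIAL,
    "lookaside updates": Impact.NECESSARY,
    "DataWarehouse": Impact.NECESSARY,
    "kernel workflow webhooks": Impact.NECESSARY,
    "Prometheus stack": Impact.OPTIONAL,
}


def blocks_running_pipelines(failed_components):
    """Return True if any of the failed components is essential."""
    return any(COMPONENTS[name] is Impact.ESSENTIAL for name in failed_components)
```

With this table, an outage of the Prometheus stack alone would not be flagged as blocking, while an outage of S3 storage would.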
Failures caused by unreliable infrastructure can be nearly completely mitigated by retries. Those are currently implemented at three different levels:
Automatic retries of GitLab jobs
GitLab provides automatic retries for jobs that failed because of system failures. The pipeline-herder extends that approach to retry jobs based on custom matchers.
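To illustrate what retrying jobs based on custom matchers could look like, here is a minimal sketch. It is not the pipeline-herder implementation: the `Matcher` class, the example patterns, the retry limits and the `should_retry` function are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Matcher:
    """Hypothetical matcher: a regex on the job log plus a retry limit."""
    name: str
    pattern: str
    max_retries: int = 3


# Example patterns only; real matchers would be tuned to observed failures.
MATCHERS = [
    Matcher("registry-timeout", r"error pulling image .*: (timeout|TLS handshake)"),
    Matcher("no-space", r"No space left on device", max_retries=1),
]


def should_retry(job_log, previous_retries, matchers=MATCHERS):
    """Return the name of the first matcher that asks for a retry, or None."""
    for matcher in matchers:
        if re.search(matcher.pattern, job_log) and previous_retries < matcher.max_retries:
            return matcher.name
    return None
```

Limiting the number of retries per matcher keeps a persistent failure from being retried forever, so that it eventually surfaces as a real failure.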
Automatic redelivery of AMQP messages that failed processing
All webhooks are immediately converted into messages on the CKI message bus. Failure to process one of these messages results in automatic redelivery after a certain time period.
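The redelivery behavior can be modeled without a message broker. The sketch below is an in-memory simulation only and does not use AMQP. The `RedeliveringQueue` class, its `retry_delay` parameter and the explicit `now` timestamps are assumptions chosen to keep the example deterministic.

```python
import heapq


class RedeliveringQueue:
    """In-memory model of a queue that redelivers failed messages later."""

    def __init__(self, retry_delay=60.0):
        self.retry_delay = retry_delay
        self._heap = []   # entries are (due_time, sequence_number, body)
        self._seq = 0     # tie-breaker so message bodies are never compared

    def publish(self, body, now=0.0):
        """Queue a message that becomes deliverable at time `now`."""
        heapq.heappush(self._heap, (now, self._seq, body))
        self._seq += 1

    def process_due(self, handler, now):
        """Deliver all due messages; requeue those whose handler raised."""
        handled = []
        while self._heap and self._heap[0][0] <= now:
            _, _, body = heapq.heappop(self._heap)
            try:
                handler(body)
            except Exception:
                # Processing failed: schedule a redelivery after the delay.
                self.publish(body, now + self.retry_delay)
            else:
                handled.append(body)
        return handled
```

A handler that fails during a temporary outage simply sees the same message again once the delay has passed, so no webhook event is lost.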
Automatic retries of REST API calls
The session module in cki-lib provides a
get_session helper to obtain a
requests session that is configured for automatic retries. This is used by
all CKI code that uses requests, even indirectly, e.g. via the GitLab API.
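A requests session with automatic retries is typically built by mounting an adapter with a urllib3 `Retry` policy. The following sketch shows that general technique. It is not the code of get_session in cki-lib: the function name `get_retrying_session`, the default values and the list of retried status codes are assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_retrying_session(total=5, backoff_factor=1.0):
    """Return a requests session that retries failed calls with backoff."""
    retry = Retry(
        total=total,                             # maximum number of retries
        backoff_factor=backoff_factor,           # growing delay between attempts
        status_forcelist=(500, 502, 503, 504),   # also retry on these responses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Use the retrying adapter for all HTTP and HTTPS requests.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Code that receives such a session, for example through a library that accepts a custom session object, gets the retry behavior without any changes of its own.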
The CKI testing system is monitored to detect infrastructure failures:
- all exceptions are reported to Sentry
- all microservices expose Prometheus endpoints
- all logs are sent to Loki
- metrics are visualized via Grafana
- alerts are sent to the CKI IRC channel via the irc-bot
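As an illustration of what exposing a Prometheus endpoint involves, the sketch below serves counters in the Prometheus text exposition format using only the Python standard library. It is an assumption-laden example, not CKI code: the metric names, the `render_metrics` and `serve` functions and the port are hypothetical, and a real microservice would more likely rely on a Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counters that a microservice might keep.
COUNTERS = {"messages_processed_total": 0, "messages_failed_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counter values on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever while serving metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Prometheus would then scrape `/metrics` periodically, and alerts or Grafana dashboards can be built on the collected counters.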