Staging environment

Which components of the CKI setup are available in a staging version

The CKI setup contains a partial staging environment. To reduce the maintenance effort and maximize usefulness, components are only duplicated into a staging version if doing so provides benefits for stability and testing.

In general, the staging environment is meant to be production-like:

  • no additional debug options should be enabled
  • features should only be disabled if they would otherwise interfere with the production environment

Kubernetes/OpenShift

Staging versions of services running on Kubernetes/OpenShift are deployed into separate staging namespaces (cki-staging or preprod). As far as possible, these services are configured identically in production and staging. Where necessary, they can use is_production, is_staging or is_production_or_staging from cki-lib to differentiate between the production, staging and development environments.
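
For example, a service can gate optional behavior on these helpers. A minimal sketch, assuming the helpers live in cki_lib.misc and are callable; the function names come from the text above, everything else is illustrative:

    from cki_lib import misc

    def report_to_sentry() -> bool:
        # Exceptions should be reported from production and staging alike,
        # but not from local development environments.
        return misc.is_production_or_staging()

    def send_user_notifications() -> bool:
        # Hypothetical feature that is disabled in staging because it
        # would interfere with the production environment.
        return misc.is_production()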

Service monitoring is similar to that in the production namespaces:

  • logs can be found in Loki
  • the monitoring stack is federated into AppleCrumble
  • all alerting rules also apply to staging services
  • exceptions are logged in Sentry

RabbitMQ

The RabbitMQ cluster contains staging versions of the / (a.k.a. cki) and /datawarehouse-celery virtual hosts in /cki-staging and /datawarehouse-celery-staging, respectively.

Global resources that are visible across environments (virtual hosts, users, password variable names) are named differently, while resources that only exist within a single virtual host keep their name to increase portability between the environments:

resource      naming     production                 staging
virtual host  different                             -staging suffix
users         different  cki. prefix                cki-staging. prefix
passwords     different  RABBITMQ_CKI prefix        RABBITMQ_CKI_STAGING prefix
exchanges     same       cki.exchange. prefix       cki.exchange. prefix
queues        same       cki.queue. prefix          cki.queue. prefix
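
To illustrate the naming scheme, a staging-aware consumer only needs to switch the global resources (virtual host, user, password variable), while queue and exchange names stay the same. A minimal sketch using pika; the host name, the user suffix, the exact password variable names and the environment switch are assumptions:

    import os

    import pika

    staging = os.environ.get('CKI_DEPLOYMENT_ENVIRONMENT') == 'staging'  # assumed switch

    # Global resources are named differently per environment ...
    virtual_host = '/cki-staging' if staging else '/'
    user = ('cki-staging.' if staging else 'cki.') + 'example-service'  # hypothetical user
    password = os.environ['RABBITMQ_CKI_STAGING_PASSWORD' if staging
                          else 'RABBITMQ_CKI_PASSWORD']  # hypothetical variable names

    # ... while resources inside one virtual host keep their name.
    queue_name = 'cki.queue.example'  # hypothetical queue

    connection = pika.BlockingConnection(pika.ConnectionParameters(
        host='rabbitmq.example.com',  # placeholder
        virtual_host=virtual_host,
        credentials=pika.PlainCredentials(user, password),
    ))
    channel = connection.channel()
    channel.queue_declare(queue=queue_name, durable=True)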

The following external message sources are available in the environments:

message type          production  staging
gitlab                yes         yes
amqp-bridge (UMB)     yes         yes
amqp-bridge (fedmsg)  yes         yes
sentry                yes         no
jira                  yes         no

Retriggered pipelines

Retriggered pipelines share most of their infrastructure with production pipelines; how much is shared depends on the pipeline type (internal, public, ofa). Only some components are split into production and staging versions, with retriggered pipelines using the staging version.

component                         internal  public  ofa
DC GitLab runner configurations   shared    shared  shared
AWS GitLab runner configurations  split     split   split
launch templates                  split     shared  shared
GitLab runner machines            split     shared  shared
VPC subnets                       shared    shared  shared
S3 buckets                        shared    shared  shared

Infrastructure has been split where necessary to allow testing launch template changes via retriggered pipelines.

DC GitLab runner configurations

Currently, all Docker-based GitLab runner configurations hosted on static machines in the data center are shared between retriggered and production pipelines. In practice, this means that e.g. the pipeline-test-runner and staging-pipeline-test-runner tags are served by the same GitLab runner configuration.

These configurations could be split, e.g. to allow experimentation with the Docker configuration. This would require additional changes to the gitlab-runner-config script in deployment-all to allow deploying staging and production configurations separately.

AWS GitLab runner configurations

All docker-machine-based GitLab runner configurations hosted on AWS EC2 machines are split between retriggered and production pipelines. In practice, this means that e.g. the pipeline-createrepo-runner and staging-pipeline-createrepo-runner tags are served by different GitLab runner configurations.

Launch templates

The properties of the workers launched by the docker-machine-based GitLab runner configurations are determined by the associated launch templates. For internal pipelines, separate launch templates are used for retriggered and production pipelines. In practice, this means that e.g. the pipeline-createrepo-runner tag will spawn workers based on the arr-cki.prod.lt.internal-general-worker launch template, while the staging-pipeline-createrepo-runner tag will use the arr-cki.staging.lt.internal-general-worker launch template.
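
As an illustration, the two launch templates can be compared side by side via the EC2 API. A minimal sketch using boto3, with credentials and region handling omitted; the template names are taken from the text above:

    import boto3

    ec2 = boto3.client('ec2')

    # Look up the production and staging launch templates for the
    # internal general workers.
    response = ec2.describe_launch_templates(LaunchTemplateNames=[
        'arr-cki.prod.lt.internal-general-worker',
        'arr-cki.staging.lt.internal-general-worker',
    ])
    for template in response['LaunchTemplates']:
        print(template['LaunchTemplateName'],
              'default version:', template['DefaultVersionNumber'],
              'latest version:', template['LatestVersionNumber'])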

The current setup allows testing changes to the launch templates by retriggering internal pipelines. Most of these changes should apply equally well to the other pipeline types. Nevertheless, the launch templates could also be split for the other pipeline types.

GitLab runner machines

The AWS EC2 machines hosting the docker-machine-based GitLab runner configurations are only split for internal pipelines. In practice, this means that the pipeline-createrepo-runner and staging-pipeline-createrepo-runner tags are handled by GitLab runners on different AWS EC2 machines.

The current setup allows testing changes to the EC2 machine setup by retriggering internal pipelines. Most of these changes should apply equally well to the machines for the other pipeline types.

Nevertheless, the AWS EC2 machines hosting the docker-machine-based GitLab runners could also be split for the other pipeline types. For ofa pipelines, this would require two additional service accounts for the VPN connections as these cannot be shared across machines.

VPC subnets

Currently, the same VPC subnets are used for the dynamically spawned workers of retriggered and production pipelines. In practice, this means that e.g. the pipeline-createrepo-runner and staging-pipeline-createrepo-runner tags result in workers that share the same VPC subnets.

The subnets could also be split to further separate the workers for production pipelines from the workers for retriggered pipelines. This would avoid interference, e.g. when subnets run out of IP addresses.
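
As an example of monitoring that kind of interference, the remaining subnet capacity can be queried via the EC2 API. A minimal sketch using boto3, with a placeholder subnet ID:

    import boto3

    ec2 = boto3.client('ec2')

    # Check how many free IP addresses are left in the subnets shared by
    # production and retriggered pipelines (placeholder subnet ID).
    response = ec2.describe_subnets(SubnetIds=['subnet-0123456789abcdef0'])
    for subnet in response['Subnets']:
        print(subnet['SubnetId'], subnet['CidrBlock'],
              'available IPs:', subnet['AvailableIpAddressCount'])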

S3 buckets

Currently, the same S3 buckets are used for retriggered and production pipelines. In practice, this means that e.g. retriggered pipelines share their ccache with the production pipelines.

The S3 buckets could also be split to further separate production pipelines from retriggered pipelines. This might require bot or pipeline changes to keep short pipelines (tests_only=true) working.
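
If the buckets were split, pipeline code would need to pick the bucket per environment along these lines. The bucket names, the environment switch and the helper are hypothetical, and the reasoning about short pipelines is an assumption based on the paragraph above:

    import os

    # Hypothetical naming scheme for split ccache buckets; the actual
    # bucket names and environment detection would look different.
    STAGING = os.environ.get('CKI_DEPLOYMENT_ENVIRONMENT') == 'staging'
    CCACHE_BUCKET = 'cki-ccache-staging' if STAGING else 'cki-ccache'

    def ccache_url(path):
        # Short pipelines (tests_only=true) presumably reuse artifacts from
        # an earlier full pipeline, so they would need to read from the
        # bucket that pipeline wrote to; this is the part that would
        # require bot or pipeline changes.
        return f's3://{CCACHE_BUCKET}/{path}'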