Upgrading machines to a newer Fedora version
According to the official Fedora Linux Release Life Cycle, Fedora Linux has releases every six months, with releases being supported for about 13 months. The Fedora Project schedules contain the details for the individual releases.
Upgrading the container images
- Create a tracking ticket in the containers repository to keep track of which container images have already been upgraded
- File a merge request for the `containers` repository that sets `includes/setup-from-fedora` to the new version
- Check that all container images build successfully and fix the image builds if necessary
- If the upstream Beaker harness repository is missing, file a ticket similar to beaker#167 to get that resolved
Upgrading the python pipeline image
- For the pipeline, only the `python` image is affected by the upgrade: trigger a bot run in a comment with something like `@cki-ci-bot please test [centos/c9s][skip_beaker=false]`
- If successful, deploy the MR into the `production/python` environment; this tags the `python` container image as `python:production` so that it is used in the pipeline
Upgrading the buildah image
- Deploy into the `production/buildah` environment, and trigger a new pipeline in the MR
- Check that container images still build successfully
Upgrading the base image
In the child pipeline for the `base` image, manually trigger the `test-base` child pipelines. This will make sure that derived images are at least buildable.
From the `test-base` child pipelines above, follow the one for `cki-tools`. Manually trigger the `cki-tools-integration-tests` child pipelines. This will make sure that the derived `cki-tools` image can run most of the CI jobs across CKI.
Where necessary, start to file fix-up merge requests in all dependent projects to test and fix problems caused by the new versions of Python, Ansible and the various linters; in these merge requests, temporarily add something like

```yaml
variables:
  cki_tools_image_tag: p-12345
```

to the `.gitlab-ci.yml` file to use the new version of the `cki-tools` image from the `cki-tools` child pipeline above.
Repositories that are known to cause trouble are the
Once everything seems under control, deploy into the `production/base` environment in the `containers` repository merge request, and merge the merge request.
Remove the temporary changes to the `.gitlab-ci.yml` files in the other repositories and merge any fixes as needed.
In repositories where no fixes were needed, trigger a new pipeline to get new container images built and deployed.
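For repositories without pending changes, a new pipeline can also be triggered from the command line; a sketch assuming the `glab` CLI is installed and authenticated for the project:

```bash
# Trigger a fresh pipeline on the default branch to rebuild and redeploy
# the container images without pushing any code change.
glab ci run --branch main
```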
Upgrading the machines
In general, machines can be moved to a newer Fedora release either by reprovisioning or in-place upgrading. While reprovisioning is preferred, in-place upgrades are documented here as well for cases where reprovisioning is temporarily broken.
Regardless of the upgrade method, the individual machines comprising a service should be upgraded one by one, with the corresponding service staying available, in a degraded state, at all times.
The Machine kernel and OS versions dashboard in Grafana contains an overview of the kernel and OS versions currently running on CKI machines.
Create a tracking ticket similar to infrastructure#140.
Familiarize yourself with the machines in the various deployment environments and how to access them.
Follow the steps in the README file in the deployment-all repository to get access to the production environments and verify you can access all machines via `./ansible_ssh.sh`.
Set `FEDORA_CORE_VERSION` to the target version via

```bash
TARGET_VERSION=37
sed -i "/^FEDORA_CORE_VERSION/s/.*/FEDORA_CORE_VERSION: $TARGET_VERSION/" vars.yml
```
From the Fedora Cloud Base Images, determine the AMI IDs for the needed architectures in `US East (N. Virginia)` (`us-east-1`) for Amazon Public Cloud. Update `vars.yml` with the new IDs.
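As an alternative to the web page, the AMI IDs can be looked up from the command line; a sketch assuming a configured AWS CLI (the Fedora account ID and image name pattern below are assumptions to verify):

```bash
# Find the newest Fedora 37 Cloud Base AMI for x86_64 in us-east-1;
# 125523088429 is assumed to be the Fedora project's AWS account ID.
aws ec2 describe-images \
    --region us-east-1 \
    --owners 125523088429 \
    --filters 'Name=name,Values=Fedora-Cloud-Base-37-*' 'Name=architecture,Values=x86_64' \
    --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]'
```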
If upgrading machines underlying the RabbitMQ cluster, provision an additional cluster node during the upgrade process.
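Joining the additional node to the cluster follows the usual RabbitMQ procedure; a minimal sketch, assuming the new node already shares the cluster's Erlang cookie (the node name is a placeholder):

```bash
# On the freshly provisioned node: stop the app, join an existing
# cluster member, and start the app again.
sudo rabbitmqctl stop_app
sudo rabbitmqctl join_cluster rabbit@<EXISTING-NODE>
sudo rabbitmqctl start_app
```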
Before shutting down an individual machine, gracefully stop all running services on it by logging into the machine via `./ansible_ssh.sh`.
For RabbitMQ machines, drain the node via

```bash
sudo rabbitmq-upgrade drain
```
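If a drained node needs to return to service without being reprovisioned (for example after an aborted upgrade), RabbitMQ offers the inverse operation; a minimal sketch:

```bash
# Bring a previously drained node back into service.
sudo rabbitmq-upgrade revive
```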
For GitLab runners, stop the gitlab-runner service via

```bash
sudo systemctl stop gitlab-runner
```
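Before powering the machine off, it can be worth confirming the service went down cleanly; a quick check using standard systemd tooling:

```bash
# Verify the runner service is inactive and inspect its last log lines.
sudo systemctl is-active gitlab-runner
sudo journalctl -u gitlab-runner --no-pager | tail
```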
If necessary, GitLab runners can be disabled on the GitLab side as well. Determine the corresponding runners for a machine in the `deployment-all` repository checkout via the output of

```bash
gitlab-runner-config/deploy.sh activations generate
```

and disable them via

```bash
gitlab-runner-config/deploy.sh activations apply --deactivate REGEX
```
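For example, to deactivate all runners belonging to a single machine (the regular expression below is a hypothetical naming pattern; adjust it to the actual runner names):

```bash
# Deactivate every runner whose name matches the machine's FQDN.
gitlab-runner-config/deploy.sh activations apply --deactivate 'runner-01\.example\.com'
```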
File a merge request with the changes, but do not merge it!
Changing the AMI IDs of dynamically spawned machines on AWS
Machines spawned dynamically by gitlab-runner use the changed AMI IDs via launch templates. Deploy new staging versions via

```bash
CKI_DEPLOYMENT_ENVIRONMENT=staging \
    PLAYBOOK_NAME=ansible/playbooks/aws-arr-launch-templates.yml \
    ./ansible_deploy.sh
```
In a dummy MR in any CKI repository with the bot hooked up, retrigger a pipeline via something like `@cki-ci-bot please test [centos/c9s]`. Verify that the machines spawned for the pipeline jobs via the staging launch templates are working correctly.
Submit a dummy MR in the `containers` repository. Verify that the machines spawned for the `buildah` container image integration tests via the staging launch templates are working correctly.
When reprovisioning, the currently running machines first need to be removed and, in the case of machines controlled by Beaker, reprovisioned with a clean operating system.
For a Beaker-based machine, reprovision the machine in the `deployment-all` repository checkout via

```bash
podman run \
    --env ENVPASSWORD \
    --interactive \
    --rm \
    --tty \
    --volume .:/data \
    --workdir /data \
    --pull always \
    quay.io/cki/cki-tools:production \
    ./beaker_provision.sh <FQDN>
```
For an OpenStack-based machine, navigate to `Project -> Compute -> Instances` and select `Delete Instance` from the context menu for the existing machine.
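Equivalently, the instance can be removed with the `openstack` command-line client, assuming it is configured for the right project (the instance name is a placeholder):

```bash
# Delete the existing instance without going through the web UI.
openstack server delete <INSTANCE-NAME>
```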
For an AWS-based machine, navigate to `EC2 -> Instances`, and disable the termination protection for the instance via `Actions -> Instance settings -> Change termination protection`. Terminate the instance via `Instance state -> Terminate instance`. Click on the small edit icon next to the name of the instance and replace it by `terminated`. On the `Tags` tab, modify the
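The same steps can also be scripted; a sketch with the AWS CLI (the instance ID is a placeholder):

```bash
# Lift termination protection, terminate the instance, and rename it.
aws ec2 modify-instance-attribute --instance-id <INSTANCE-ID> --no-disable-api-termination
aws ec2 terminate-instances --instance-ids <INSTANCE-ID>
aws ec2 create-tags --resources <INSTANCE-ID> --tags Key=Name,Value=terminated
```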
For RabbitMQ machines, also remove the node from the RabbitMQ cluster. In the `deployment-all` repository checkout, log into any of the remaining RabbitMQ cluster nodes via `./ansible_ssh.sh`, and get the list of cluster nodes via

```bash
sudo rabbitmqctl cluster_status
```

Compare the `Disk Nodes` and `Running Nodes` lists to find the name of the terminated node, and remove it from the cluster via

```bash
sudo rabbitmqctl forget_cluster_node <NODENAME>
```
After that, new machines can be configured in the `deployment-all` repository checkout via the playbook given in the table in the deployment environments documentation, replacing `<INSTANCE-PLAYBOOK>` by the appropriate playbook name.
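Judging from the launch-template deployment above, the invocation presumably follows the same `ansible_deploy.sh` pattern; a sketch, not verified against the repository:

```bash
# <INSTANCE-PLAYBOOK> is the playbook path from the deployment
# environments table.
PLAYBOOK_NAME=<INSTANCE-PLAYBOOK> ./ansible_deploy.sh
```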
Finally, newly configured GitLab runner machines need to get the correct gitlab-runner configuration in the `deployment-all` repository checkout via

```bash
gitlab-runner-config/deploy.sh configurations apply
```
For an in-place upgrade, log into the machines via `./ansible_ssh.sh`.
For a Beaker-based machine, manually update the Beaker repository files in `/etc/yum.repos.d/beaker-*.repo` on the machine itself to the target version via

```bash
source /etc/os-release
TARGET_VERSION=37
sed -Ei "s/F-([0-9]+)/F-$TARGET_VERSION/" /etc/yum.repos.d/beaker-*.repo
```
Download updates via

```bash
TARGET_VERSION=37
dnf system-upgrade download --releasever=$TARGET_VERSION
```
Trigger the upgrade process via

```bash
dnf system-upgrade reboot
```
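Once the machine has rebooted into the new release, it is worth verifying the result; optional checks, not part of the original procedure:

```bash
# Confirm the OS is now on the target release and drop the cached
# upgrade packages.
grep VERSION_ID /etc/os-release
sudo dnf system-upgrade clean
```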
If GitLab runners were disabled on the GitLab side, reactivate them again in the `deployment-all` repository checkout via

```bash
gitlab-runner-config/deploy.sh activations apply
```
Remove any additional node added to the RabbitMQ cluster during the upgrade process.