Upgrading machines to a newer Fedora version

How to reprovision machines with a newer Fedora version without causing an outage

According to the official Fedora Linux Release Life Cycle, Fedora Linux is released every six months, and each release is supported for about 13 months. The Fedora Project schedules contain the details for the individual releases.

Upgrading the container images

Upgrade preparations

  1. Create a tracking ticket in the containers repository to keep track of which container images have already been upgraded
  2. File a merge request for the containers repository that sets the __BASE_IMAGE_TAG in includes/setup-from-fedora to the new version (see the sketch after this list)
  3. Check that all container images build successfully and fix the image builds if necessary
  4. If the upstream Beaker harness repository for the new Fedora release is missing, file a ticket similar to beaker#167 to get that resolved
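
For step 2, the exact contents of includes/setup-from-fedora are not reproduced here; assuming the tag is assigned as __BASE_IMAGE_TAG=<version>, the change is a one-line bump that could be scripted like this:

# illustrative only: adjust the pattern if includes/setup-from-fedora
# assigns the tag in a different format
sed -i 's/^__BASE_IMAGE_TAG=.*/__BASE_IMAGE_TAG=37/' includes/setup-from-fedora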

Upgrading the python pipeline image

  1. For the pipeline, only the python image is affected by the upgrade: trigger a bot run by commenting something like @cki-ci-bot please test [centos/c9s][skip_beaker=false]
  2. If successful, deploy the MR into the production/python environment; this will tag the python container image as python:production so that it is used in the pipeline

Upgrading the buildah image

  1. Deploy into the production/buildah environment, and trigger a new pipeline in the MR
  2. Check that container images still build successfully

Upgrading the base image

  1. In the child pipeline for the base image, manually trigger the test-base child pipelines. This will make sure that derived images are at least buildable.

  2. From the test-base child pipelines above, follow the one for cki-tools. Manually trigger the cki-tools-integration-tests child pipelines. This will make sure that the derived cki-tools image can run most of the CI jobs across CKI.

  3. Where necessary, start to file fix-up merge requests in all dependent projects to test and fix problems caused by the new versions of Python, Ansible and the various linters; in these merge requests, temporarily add something like

    variables:
      cki_tools_image_tag: p-12345
    

    to the .gitlab-ci.yml file to use the new version of the cki-tools image from the cki-tools child pipeline above.

    Repositories that are known to cause trouble are the cki-tools, cki-lib, kernel-workflow, kernel-tests and deployment-all repositories.

  4. Once everything seems under control, deploy the merge request in the containers repository into the production/base environment and merge it.

  5. Remove the temporary changes to the .gitlab-ci.yml files in the other repositories and merge any fixes as needed.

  6. In repositories where no fixes were needed, trigger a new pipeline to get new container images built and deployed.
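
    One way to trigger such a pipeline, besides the Run pipeline button in the GitLab web UI, is the glab CLI, assuming it is installed and authenticated for the repository (the branch name main below is an assumption, use the repository's default branch):

    # create a fresh pipeline on the default branch; replace main if the
    # repository uses a different default branch
    glab ci run --branch main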

Upgrading the machines

In general, machines can be moved to a newer Fedora release either by reprovisioning or in-place upgrading. While reprovisioning is preferred, in-place upgrades are documented here as well for cases where reprovisioning is temporarily broken.

Regardless of the upgrade method, the individual machines comprising a service should be upgraded one by one, with the corresponding service staying available in a degraded state at all times.

The Machine kernel and OS versions dashboard in Grafana contains an overview of the kernel and OS versions currently running on CKI machines.

Preparations

  1. Create a tracking ticket similar to infrastructure#140.

  2. Familiarize yourself with the machines in the various deployment environments and how to access them.

  3. Follow the steps in the README file in the deployment-all repository to get access to the production environments and verify you can access all machines via ./ansible_ssh.sh.
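
    For example, assuming the script takes the machine's inventory name or FQDN as its argument (check the README for the exact invocation):

    ./ansible_ssh.sh <FQDN>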

  4. In the deployment-all repository in vars.yml, change FEDORA_CORE_VERSION to the target version via

    TARGET_VERSION=37
    sed -i "/^FEDORA_CORE_VERSION/s/.*/FEDORA_CORE_VERSION: $TARGET_VERSION/" vars.yml
    
  5. From the Fedora Cloud Base Images, determine the AMI IDs for the required architectures in the US East (N. Virginia) (us-east-1) region of the Amazon Public Cloud. Update the fedora_ami_ids variable in vars.yml with the new IDs.
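
    The exact structure of the variable is not reproduced here; as a sketch, assuming the AMIs are keyed by architecture, the updated entry in vars.yml could look like this (the AMI IDs are placeholders):

    fedora_ami_ids:
      x86_64: ami-0123456789abcdef0   # placeholder
      aarch64: ami-0fedcba9876543210  # placeholder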

  6. If upgrading machines underlying the RabbitMQ cluster, provision an additional cluster node during the upgrade process.

  7. Before shutting down an individual machine, gracefully stop all running services on it by logging into the machine via ./ansible_ssh.sh in the deployment-all repository checkout.

    For RabbitMQ machines, drain the node via

    sudo rabbitmq-upgrade drain
    

    For GitLab runners, stop the gitlab-runner service via

    sudo systemctl stop gitlab-runner
    

    If necessary, GitLab runners can be disabled on the GitLab side as well: determine the corresponding runners for a machine in the deployment-all repository checkout via the output of

    gitlab-runner-config/deploy.sh activations generate
    

    and disable them via

    gitlab-runner-config/deploy.sh activations apply --deactivate REGEX
    
  8. File a merge request with the changes, but do not merge it!

Changing the AMI IDs of dynamically spawned machines on AWS

  1. Machines spawned dynamically by gitlab-runner use the changed AMI IDs via launch templates. Deploy new staging versions via

    CKI_DEPLOYMENT_ENVIRONMENT=staging \
        PLAYBOOK_NAME=ansible/playbooks/aws-arr-launch-templates.yml \
        ./ansible_deploy.sh
    
  2. In a dummy MR in any CKI repository with the bot hooked up, retrigger a pipeline via something like @cki-ci-bot please test [centos/c9s]. Verify that the machines spawned for the pipeline jobs via the staging launch templates are working correctly.

  3. Submit a dummy MR in the containers repository. Verify that the machines spawned for the buildah container image integration tests via the staging launch templates are working correctly.

Reprovisioning machines

As a first step, the currently running machines need to be removed or, in the case of machines controlled by Beaker, reprovisioned with a clean operating system.

  • For a Beaker-based machine, reprovision the machine in the deployment-all repository checkout via

    podman run \
        --env ENVPASSWORD \
        --interactive \
        --rm \
        --tty \
        --volume .:/data \
        --workdir /data \
        --pull always \
        quay.io/cki/cki-tools:production \
        ./beaker_provision.sh <FQDN>
    
  • For an OpenStack-based machine, navigate to Project -> Compute -> Instances and select Delete Instance from the context menu for the existing machine.
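
    Alternatively, with the OpenStack CLI installed and the project credentials sourced, the same can be done from the command line (the instance name is a placeholder):

    openstack server delete <INSTANCE-NAME>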

  • For an AWS-based machine, navigate to EC2 -> Instances, and disable the termination protection for the instance via Actions -> Instance settings -> Change termination protection. Terminate the instance via Instance state -> Terminate instance. Click on the small edit icon next to the name of the instance and replace it with terminated. On the Tags tab, modify the CkiAnsibleGroup value to terminated as well.
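
    If the AWS CLI is configured for the account, the same steps can be performed from the command line (the instance ID is a placeholder):

    # lift the termination protection, terminate the instance, and re-tag it
    aws ec2 modify-instance-attribute --instance-id <INSTANCE-ID> --no-disable-api-termination
    aws ec2 terminate-instances --instance-ids <INSTANCE-ID>
    aws ec2 create-tags --resources <INSTANCE-ID> \
        --tags Key=Name,Value=terminated Key=CkiAnsibleGroup,Value=terminated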

  • For RabbitMQ machines, also remove the node from the RabbitMQ cluster. In the deployment-all repository checkout, log into any of the remaining RabbitMQ cluster nodes via ./ansible_ssh.sh, and get the list of cluster nodes via

    sudo rabbitmqctl cluster_status
    

    Compare the Disk Nodes and Running Nodes lists to find the name of the terminated node, and remove it from the cluster via

    sudo rabbitmqctl forget_cluster_node <NODENAME>
    

After that, new machines can be configured in the deployment-all repository checkout via the playbook given in the table in the deployment environments documentation:

PLAYBOOK_NAME=<INSTANCE-PLAYBOOK> ./ansible_deploy.sh

Replace <INSTANCE-PLAYBOOK> with the appropriate playbook name.
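
For example, for a hypothetical GitLab runner instance playbook (the playbook name below is illustrative only, take the real one from the table):

# illustrative playbook name, see the deployment environments documentation
PLAYBOOK_NAME=ansible/playbooks/gitlab-runner-instances.yml ./ansible_deploy.sh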

Finally, newly configured GitLab runner machines need to get the correct gitlab-runner configuration in the deployment-all repository checkout via

gitlab-runner-config/deploy.sh configurations apply

In-place upgrades

  1. Log into the machines via ./ansible_ssh.sh in the deployment-all repository checkout.

  2. For a Beaker-based machine, manually update the Beaker repository files in /etc/yum.repos.d/beaker-*.repo on the machine itself to the target version via

    source /etc/os-release
    TARGET_VERSION=37
    sudo sed -i "s/F-$VERSION_ID/F-$TARGET_VERSION/g" /etc/yum.repos.d/beaker-*.repo
    
  3. Download updates via

    TARGET_VERSION=37
    sudo dnf system-upgrade download --releasever=$TARGET_VERSION
    
  4. Trigger the upgrade process via

    sudo dnf system-upgrade reboot
    
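    Once the machine has rebooted into the new release, the result can be verified, for example via

    grep -E '^(VERSION_ID|PRETTY_NAME)=' /etc/os-release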

Cleanup

  1. If GitLab runners were disabled on the GitLab side, reactivate them in the deployment-all repository checkout via

    gitlab-runner-config/deploy.sh activations apply
    
  2. Remove any additional node added to the RabbitMQ cluster during the upgrade process.
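
    Assuming the additional node is still running, one way to do this is to stop the RabbitMQ application on that node and then remove it from any remaining cluster node, mirroring the forget_cluster_node step above (the node name is a placeholder):

    # on the node that is being removed
    sudo rabbitmqctl stop_app

    # on any remaining cluster node
    sudo rabbitmqctl forget_cluster_node <NODENAME>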