# Upgrading machines to a newer Fedora version
According to the official Fedora Linux Release Life Cycle, a new Fedora Linux version is released every six months, and each release is supported for about 13 months. The Fedora Project schedules contain the details for the individual releases.
## Upgrading the container images
### Upgrade preparations
- Create a tracking ticket in the `containers` repository to keep track of which container images have already been upgraded
- File a merge request for the `containers` repository that sets the `__BASE_IMAGE_TAG` in `includes/setup-from-fedora` to the new version (see the sketch below this list)
- Check that all container images build successfully and fix the image builds if necessary
- If the upstream Beaker harness repository is missing, file a ticket similar to beaker#167 to get that resolved
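The exact contents of `includes/setup-from-fedora` are not reproduced here; as a minimal, hypothetical sketch, assuming the tag is stored as a plain `__BASE_IMAGE_TAG=<version>` assignment, the bump could look like this:

```shell
# Hypothetical sketch only: bump the Fedora base image tag for the containers
# repository merge request. The real format of includes/setup-from-fedora may
# differ, so verify the result with `git diff` before pushing.
TARGET_VERSION=37
sed -i "s/^__BASE_IMAGE_TAG=.*/__BASE_IMAGE_TAG=$TARGET_VERSION/" includes/setup-from-fedora
```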
### Upgrading the python pipeline image
- For the pipeline, only the `python` image is affected by the upgrade: trigger a bot run in a comment with something like `@cki-ci-bot please test [centos/c9s][skip_beaker=false]`
- If successful, deploy the MR into the `production/python` environment; this will tag the `python` container image as `python:production` and get it used in the pipeline
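After the deployment, a quick sanity check is possible. A minimal sketch, assuming the image is published as `quay.io/cki/python` (mirroring the `quay.io/cki/cki-tools` path used later in this document; the actual registry path may differ):

```shell
# Hedged check: confirm that the production tag of the pipeline python image is now
# based on the new Fedora release. The quay.io/cki/python path is an assumption.
podman run --rm quay.io/cki/python:production cat /etc/os-release
```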
### Upgrading the buildah image
- Deploy into the `production/buildah` environment, and trigger a new pipeline in the MR
- Check that container images still build successfully
### Upgrading cki-tools and other derived images
- File a temporary merge request for the `cki-tools` repository that sets the `BASE_IMAGE_TAG` variable for the `publish` job in `.gitlab-ci.yml` to `mr-123`, corresponding to the merge request ID in the `containers` repository
- If the `cki-tools` image built correctly, start to file fix-up merge requests in all dependent projects to test and fix problems caused by the new versions of Python, Ansible and the various linters; in these merge requests, temporarily add something like

  ```yaml
  .cki_tools:
    image: quay.io/cki/cki-tools:mr-234
  ```

  to the `.gitlab-ci.yml` file to use the new version of the `cki-tools` image from the `cki-tools` merge request (a quick local check of that image is sketched after this list). Repositories that are known to cause trouble are the `cki-tools`, `cki-lib`, `kernel-workflow`, `kernel-tests` and `deployment-all` repositories.
- Once everything seems under control, deploy into the `production/base` environment in the `containers` repository merge request and merge the merge request in the `containers` repository.
- Remove the temporary changes to the `.gitlab-ci.yml` file in the `cki-tools` repository and merge the merge request in the `cki-tools` repository.
- Remove the temporary changes to the `.gitlab-ci.yml` files in the other repositories and merge any fixes as needed.
- In repositories where no fixes were needed, trigger a new pipeline to get new container images built and deployed.
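When filing the fix-up merge requests above, it can help to first inspect the freshly built image locally. A minimal sketch, assuming the merge-request image was pushed with the `mr-<ID>` tag referenced above:

```shell
# Hedged sanity check of the cki-tools image built from the merge request: confirm
# the Fedora release and the Python version shipped in the image.
podman run --rm quay.io/cki/cki-tools:mr-234 cat /etc/os-release
podman run --rm quay.io/cki/cki-tools:mr-234 python3 --version
```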
## Upgrading the machines
In general, machines can be moved to a newer Fedora release either by reprovisioning or in-place upgrading. While reprovisioning is preferred, in-place upgrades are documented here as well for cases where reprovisioning is temporarily broken.
Regardless of the upgrade method, the individual machines comprising a service should be upgraded one by one, so that the corresponding service stays available, if in a degraded state, at all times.
The Machine kernel and OS versions dashboard in Grafana contains an overview of the kernel and OS versions currently running on CKI machines.
### Preparations
- Create a tracking ticket similar to infrastructure#140.
- Familiarize yourself with the machines in the various deployment environments and how to access them.
- Follow the steps in the README file in the deployment-all repository to get access to the production environments and verify you can access all machines via `./ansible_ssh.sh`.
- In the `deployment-all` repository in `vars.yml`, change `FEDORA_CORE_VERSION` to the target version via

  ```shell
  TARGET_VERSION=37
  sed -i "/^FEDORA_CORE_VERSION/s/.*/FEDORA_CORE_VERSION: $TARGET_VERSION/" vars.yml
  ```

- From the Fedora Cloud Base Images, determine the AMI IDs for the needed architectures in `US East (N. Virginia)` (us-east-1) for Amazon Public Cloud. Update the `fedora_ami_ids` variable in `vars.yml` with the new IDs (an AWS CLI lookup is sketched after this list).
- If upgrading machines underlying the RabbitMQ cluster, provision an additional cluster node during the upgrade process.
- Before shutting down an individual machine, gracefully stop all running services on it by logging into the machine via `./ansible_ssh.sh` in the `deployment-all` repository checkout.
  - For RabbitMQ machines, drain the node via `sudo rabbitmq-upgrade drain`
  - For GitLab runners, stop the gitlab-runner service via `sudo systemctl stop gitlab-runner`
  - If necessary, GitLab runners can also be disabled on the GitLab side: determine the corresponding runners for a machine in the `deployment-all` repository checkout via the output of `gitlab-runner-config/deploy.sh activations generate` and disable them via `gitlab-runner-config/deploy.sh activations apply --deactivate REGEX`
- File a merge request with the changes, but do not merge it!
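The AMI IDs can also be looked up from the command line instead of the website. A hedged sketch using the AWS CLI; the owner account ID and the image name pattern below are assumptions and should be cross-checked against the official Fedora Cloud Base Images page:

```shell
# Hedged sketch: find the newest Fedora 37 Cloud Base AMI in us-east-1 for one
# architecture. The owner ID 125523088429 is believed to be the Fedora project's
# AWS account; verify it and the name pattern before trusting the result.
aws ec2 describe-images \
    --region us-east-1 \
    --owners 125523088429 \
    --filters "Name=name,Values=Fedora-Cloud-Base-37-*" \
              "Name=architecture,Values=x86_64" \
    --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
    --output text
```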
### Changing the AMI IDs of dynamically spawned machines on AWS
- To use the changed AMI IDs from above for machines spawned dynamically by gitlab-runner, deploy the updated gitlab-runner configurations in the `deployment-all` repository checkout via `gitlab-runner-config/deploy.sh configurations apply`
### Reprovisioning machines
As a first step, the currently running machines need to be removed and, in the case of machines controlled by Beaker, reprovisioned with a clean operating system.
- For a Beaker-based machine, reprovision the machine in the `deployment-all` repository checkout via

  ```shell
  podman run \
      --env ENVPASSWORD \
      --interactive \
      --rm \
      --tty \
      --volume .:/data \
      --workdir /data \
      --pull always \
      quay.io/cki/cki-tools:production \
      ./beaker_provision.sh <FQDN>
  ```

- For an OpenStack-based machine, navigate to `Project -> Compute -> Instances` and select `Delete Instance` from the context menu for the existing machine.
- For an AWS-based machine, navigate to `EC2 -> Instances` and disable the termination protection for the instance via `Actions -> Instance settings -> Change termination protection`. Terminate the instance via `Instance state -> Terminate instance`. Click on the small edit icon next to the name of the instance and replace it by `terminated`. On the `Tags` tab, modify the `CkiAnsibleGroup` value to `terminated` as well.
- For RabbitMQ machines, also remove the node from the RabbitMQ cluster. In the `deployment-all` repository checkout, log into any of the remaining RabbitMQ cluster nodes via `./ansible_ssh.sh` and get the list of cluster nodes via `sudo rabbitmqctl cluster_status`. Compare the `Disk Nodes` and `Running Nodes` lists to find the name of the terminated node, and remove it from the cluster via `sudo rabbitmqctl forget_cluster_node <NODENAME>`
After that, new machines can be configured in the `deployment-all` repository checkout via the playbook given in the table in the deployment environments documentation:

```shell
PLAYBOOK_NAME=<INSTANCE-PLAYBOOK> ./ansible_deploy.sh
```

Replace `<INSTANCE-PLAYBOOK>` by the appropriate playbook name.
Finally, newly configured GitLab runner machines need to get the correct gitlab-runner configuration, which is deployed in the `deployment-all` repository checkout via `gitlab-runner-config/deploy.sh configurations apply`.
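To confirm a reprovisioned machine actually ended up on the target release, a quick check from the `deployment-all` repository checkout can help; a minimal sketch (the exact `./ansible_ssh.sh` invocation may differ):

```shell
# Hedged verification sketch; <FQDN> is a placeholder for the machine name.
./ansible_ssh.sh <FQDN>
# Then, on the machine itself:
cat /etc/os-release   # should show the target Fedora release
uname -r              # should show a kernel built for the new release
```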
### In-place upgrades
- Log into the machines via `./ansible_ssh.sh` in the `deployment-all` repository checkout.
- For a Beaker-based machine, manually update the Beaker repository files in `/etc/yum.repos.d/beaker-*.repo` on the machine itself to the target version via

  ```shell
  source /etc/os-release
  TARGET_VERSION=37
  sed -Ei "s/F-([0-9]+)/F-$TARGET_VERSION/" /etc/yum.repos.d/beaker-*.repo
  ```

- Download updates via

  ```shell
  TARGET_VERSION=37
  dnf system-upgrade download --releasever=$TARGET_VERSION
  ```

- Trigger the upgrade process via `dnf system-upgrade reboot`
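After the machine comes back up from the upgrade reboot, an optional post-upgrade check and cleanup can be done; a minimal sketch (`dnf system-upgrade clean` only removes the packages downloaded for the upgrade):

```shell
# Optional post-upgrade steps after the reboot: confirm the new release and drop the
# cached packages downloaded by `dnf system-upgrade download`.
cat /etc/os-release
dnf system-upgrade clean
```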
### Cleanup
- If GitLab runners were disabled on the GitLab side, reactivate them in the `deployment-all` repository checkout via `gitlab-runner-config/deploy.sh activations apply`
- Remove any additional node added to the RabbitMQ cluster during the upgrade process, as sketched below.
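Retiring the temporary node follows the same pattern as the reprovisioning steps above. A minimal sketch; the node name is the one reported by `sudo rabbitmqctl cluster_status`:

```shell
# On the temporary cluster node: stop the RabbitMQ application so it leaves cleanly.
sudo rabbitmqctl stop_app
# On one of the permanent nodes: drop the temporary node from the cluster.
sudo rabbitmqctl forget_cluster_node <NODENAME>
```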