Upgrading machines to a newer Fedora version
According to the official Fedora Linux Release Life Cycle, a new Fedora Linux release is published every six months, and each release is supported for about 13 months. The Fedora Project schedules contain the details for the individual releases.
Upgrading the container images
Upgrade preparations
- Create a tracking ticket in the `containers` repository to keep track of which container images have already been upgraded
- File a merge request for the `containers` repository that sets the `__BASE_IMAGE_TAG` in `includes/setup-from-fedora` to the new version (see the sketch after this list)
- Check that all container images build successfully and fix the image builds if necessary
- If the upstream Beaker harness repository is missing, file a ticket similar to beaker#167 to get that resolved
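If `includes/setup-from-fedora` assigns the tag on a single line, preparing the merge request change can be scripted; this is only a sketch, as the exact format of the `__BASE_IMAGE_TAG` assignment should be checked in the repository first:

```shell
# Hypothetical sketch: bump the base image tag for the containers repository merge request.
# Assumes includes/setup-from-fedora assigns __BASE_IMAGE_TAG on a single line; verify the
# actual file format before running this.
TARGET_VERSION=37
sed -i -E "s/^(\s*__BASE_IMAGE_TAG\s*[:=]\s*).*/\1$TARGET_VERSION/" includes/setup-from-fedora
```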
Upgrading the python pipeline image
- For the pipeline, only the `python` image is affected by the upgrade: trigger a bot run in a comment with something like `@cki-ci-bot please test [centos/c9s][skip_beaker=false]`
- If successful, deploy the MR into the `production/python` environment; this will tag the `python` container image as `python:production` and get it used in the pipeline
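To double-check that the freshly tagged image is really based on the target Fedora release, something like the following can be used; the `quay.io/cki/python` image path is an assumption and needs to be adjusted to wherever the pipeline image is actually published:

```shell
# Hypothetical sketch: confirm the production-tagged pipeline image runs on the new
# Fedora release; the quay.io/cki/python image path is an assumption.
podman run --rm --pull always quay.io/cki/python:production cat /etc/os-release
```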
Upgrading the buildah image
- Deploy into the `production/buildah` environment, and trigger a new pipeline in the MR
- Check that container images still build successfully
Upgrading the base image
- In the child pipeline for the `base` image, manually trigger the `test-base` child pipelines. This will make sure that derived images are at least buildable.
- From the `test-base` child pipelines above, follow the one for `cki-tools`. Manually trigger the `cki-tools-integration-tests` child pipelines. This will make sure that the derived `cki-tools` image can run most of the CI jobs across CKI.
- Where necessary, start to file fix-up merge requests in all dependent projects to test and fix problems caused by the new versions of Python, Ansible and the various linters; in these merge requests, temporarily add something like

  ```yaml
  variables:
    cki_tools_image_tag: p-12345
  ```

  to the `.gitlab-ci.yml` file to use the new version of the `cki-tools` image from the `cki-tools` child pipeline above. Repositories that are known to cause trouble are the `cki-tools`, `cki-lib`, `kernel-workflow`, `kernel-tests` and `deployment-all` repositories.
- Once everything seems under control, deploy into the `production/base` environment in the `containers` repository merge request and merge the merge request in the `containers` repository.
- Remove the temporary changes to the `.gitlab-ci.yml` files in the other repositories and merge any fixes as needed.
- In repositories where no fixes were needed, trigger a new pipeline to get new container images built and deployed.
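For that last step, a new pipeline can be started from the GitLab web UI (CI/CD -> Pipelines -> Run pipeline) or, as a sketch, via the GitLab API; the project ID, token and branch name below are placeholders:

```shell
# Hypothetical sketch: start a fresh pipeline on the default branch via the GitLab API.
# <PROJECT_ID> and GITLAB_TOKEN are placeholders; adjust ref= to the default branch.
# The web UI works just as well.
curl --request POST \
     --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
     "https://gitlab.com/api/v4/projects/<PROJECT_ID>/pipeline?ref=main"
```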
Upgrading the machines
In general, machines can be moved to a newer Fedora release either by reprovisioning or in-place upgrading. While reprovisioning is preferred, in-place upgrades are documented here as well for cases where reprovisioning is temporarily broken.
Regardless of the upgrade method, the individual machines comprising a service should be upgraded one by one, so that the corresponding service stays available, possibly in a degraded state, at all times.
The Machine kernel and OS versions dashboard in Grafana contains an overview of the kernel and OS versions currently running on CKI machines.
Preparations
- Create a tracking ticket similar to infrastructure#140.
- Familiarize yourself with the machines in the various deployment environments and how to access them.
- Follow the steps in the README file in the `deployment-all` repository to get access to the production environments and verify you can access all machines via `./ansible_ssh.sh`.
- In the `deployment-all` repository in `vars.yml`, change `FEDORA_CORE_VERSION` to the target version via

  ```shell
  TARGET_VERSION=37
  sed -i "/^FEDORA_CORE_VERSION/s/.*/FEDORA_CORE_VERSION: $TARGET_VERSION/" vars.yml
  ```

- From the Fedora Cloud Base Images, determine the AMI IDs for the needed architectures in `US East (N. Virginia)` (us-east-1) for Amazon Public Cloud, and update the `fedora_ami_ids` variable in `vars.yml` with the new IDs (see the sketch after this list).
- If upgrading machines underlying the RabbitMQ cluster, provision an additional cluster node during the upgrade process.
- Before shutting down an individual machine, gracefully stop all running services on it by logging into the machine via `./ansible_ssh.sh` in the `deployment-all` repository checkout.
  - For RabbitMQ machines, drain the node via `sudo rabbitmq-upgrade drain`.
  - For GitLab runners, stop the gitlab-runner service via `sudo systemctl stop gitlab-runner`.
  - If necessary, GitLab runners can be disabled on the GitLab side as well. If needed, determine the corresponding runners for a machine in the `deployment-all` repository checkout via the output of `gitlab-runner-config/deploy.sh activations generate` and disable them via `gitlab-runner-config/deploy.sh activations apply --deactivate REGEX`.
- File a merge request with the changes, but do not merge it!
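As a sketch for the AMI lookup mentioned above, the newest Fedora Cloud Base AMIs in us-east-1 can also be queried with the AWS CLI; the Fedora owner account ID and image name pattern are assumptions, so cross-check the results against the official Fedora Cloud Base Images page:

```shell
# Hypothetical sketch: look up the newest Fedora Cloud Base AMI per architecture in
# us-east-1; the owner account ID and name pattern are assumptions.
for ARCH in x86_64 aarch64; do
    aws ec2 describe-images \
        --region us-east-1 \
        --owners 125523088429 \
        --filters "Name=name,Values=Fedora-Cloud-Base-37-*" "Name=architecture,Values=$ARCH" \
        --query 'sort_by(Images, &CreationDate)[-1].[Name,ImageId]' \
        --output text
done
```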
Changing the AMI IDs of dynamically spawned machines on AWS
- Machines spawned dynamically by gitlab-runner use the changed AMI IDs via launch templates. Deploy new staging versions via

  ```shell
  CKI_DEPLOYMENT_ENVIRONMENT=staging \
      PLAYBOOK_NAME=ansible/playbooks/aws-arr-launch-templates.yml \
      ./ansible_deploy.sh
  ```

  A sketch for inspecting the resulting launch template versions follows after this list.
- In a dummy MR in any CKI repository with the bot hooked up, retrigger a pipeline via something like `@cki-ci-bot please test [centos/c9s]`. Verify that the machines spawned for the pipeline jobs via the staging launch templates are working correctly.
- Submit a dummy MR in the `containers` repository. Verify that the machines spawned for the `buildah` container image integration tests via the staging launch templates are working correctly.
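To verify what the staging deploy produced, the latest launch template versions can be inspected with the AWS CLI as sketched below; `<TEMPLATE_NAME>` is a placeholder for the actual gitlab-runner launch template name:

```shell
# Hypothetical sketch: confirm the latest launch template version references the new AMI.
# <TEMPLATE_NAME> is a placeholder for the actual gitlab-runner launch template name.
aws ec2 describe-launch-template-versions \
    --region us-east-1 \
    --launch-template-name <TEMPLATE_NAME> \
    --versions '$Latest' \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.ImageId'
```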
Reprovisioning machines
In a first step, the currently running machines need to be removed, and, in the case of machines controlled by Beaker, reprovisioned with a clean operating system.
- For a Beaker-based machine, reprovision the machine in the `deployment-all` repository checkout via

  ```shell
  podman run \
      --env ENVPASSWORD \
      --interactive \
      --rm \
      --tty \
      --volume .:/data \
      --workdir /data \
      --pull always \
      quay.io/cki/cki-tools:production \
      ./beaker_provision.sh <FQDN>
  ```

- For an OpenStack-based machine, navigate to `Project -> Compute -> Instances` and select `Delete Instance` from the context menu for the existing machine.
- For an AWS-based machine, navigate to `EC2 -> Instances`, and disable the termination protection for the instance via `Actions -> Instance settings -> Change termination protection`. Terminate the instance via `Instance state -> Terminate instance`. Click on the small edit icon next to the name of the instance and replace it with `terminated`. On the `Tags` tab, modify the `CkiAnsibleGroup` value to `terminated` as well.
- For RabbitMQ machines, also remove the node from the RabbitMQ cluster. In the `deployment-all` repository checkout, log into any of the remaining RabbitMQ cluster nodes via `./ansible_ssh.sh`, and get the list of cluster nodes via `sudo rabbitmqctl cluster_status`. Compare the `Disk Nodes` and `Running Nodes` lists to find the name of the terminated node, and remove it from the cluster via `sudo rabbitmqctl forget_cluster_node <NODENAME>`.
After that, new machines can be configured in the `deployment-all` repository checkout with the playbook given in the table in the deployment environments documentation:

```shell
PLAYBOOK_NAME=<INSTANCE-PLAYBOOK> ./ansible_deploy.sh
```

Replace `<INSTANCE-PLAYBOOK>` with the appropriate playbook name.

Finally, newly configured GitLab runner machines need to get the correct gitlab-runner configuration applied in the `deployment-all` repository checkout via

```shell
gitlab-runner-config/deploy.sh configurations apply
```
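As a quick sanity check afterwards (a sketch, assuming the standard gitlab-runner service and configuration paths), log into the new machine via `./ansible_ssh.sh` and verify the runner service:

```shell
# Hypothetical sketch: run on the newly configured runner machine; assumes the standard
# gitlab-runner systemd service and /etc/gitlab-runner/config.toml configuration.
sudo systemctl is-active gitlab-runner
sudo gitlab-runner list
```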
In-place upgrades
- Log into the machines via `./ansible_ssh.sh` in the `deployment-all` repository checkout.
- For a Beaker-based machine, manually update the Beaker repository files in `/etc/yum.repos.d/beaker-*.repo` on the machine itself to the target version via

  ```shell
  source /etc/os-release
  TARGET_VERSION=37
  sed -Ei "s/F-([0-9]+)/F-$TARGET_VERSION/" /etc/yum.repos.d/beaker-*.repo
  ```

- Download updates via

  ```shell
  TARGET_VERSION=37
  dnf system-upgrade download --releasever=$TARGET_VERSION
  ```

- Trigger the upgrade process via `dnf system-upgrade reboot`.
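After the machine comes back up, a minimal post-upgrade check (a sketch; the cleanup steps are optional) could look like this:

```shell
# Hypothetical sketch: verify the in-place upgrade landed and optionally clean up.
grep VERSION_ID /etc/os-release   # should show the target release
sudo dnf system-upgrade clean     # drop the downloaded upgrade packages
sudo dnf autoremove               # optional: remove packages that are no longer needed
```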
Cleanup
- If GitLab runners were disabled on the GitLab side, reactivate them in the `deployment-all` repository checkout via `gitlab-runner-config/deploy.sh activations apply`.
- Remove any additional node added to the RabbitMQ cluster during the upgrade process.
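For the RabbitMQ node removal, the same `rabbitmqctl` commands as during reprovisioning apply; as a sketch, with `<NODENAME>` taken from `sudo rabbitmqctl cluster_status`:

```shell
# Hypothetical sketch: retire the temporary RabbitMQ cluster node again.
# On the temporary node, stop the RabbitMQ application first:
sudo rabbitmqctl stop_app
# Then, on any remaining cluster node, remove the stopped node from the cluster:
sudo rabbitmqctl forget_cluster_node <NODENAME>
```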