Recover from Spot instance failures by switching to on-demand instances

How to diagnose and mitigate GitLab job failures caused by AWS spot instance failures

Problem

Normally, CKI pipeline GitLab jobs run on AWS EC2 spot instances.

For build jobs, the spot instances are requested via InstanceRequirements (see the sketch after the list):

  • 16 vCPUs
  • 32 GB of RAM
  • us-east-1 region, availability zones A, B, C and F
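
The exact runner fleet configuration lives in the deployment repository; purely as an illustration, the following AWS CLI sketch lists the instance types that satisfy these requirements (the x86_64 architecture and hvm virtualization values are assumptions, not taken from the actual configuration):

    # Illustration only: list instance types matching the requirements above.
    aws ec2 get-instance-types-from-instance-requirements \
        --region us-east-1 \
        --architecture-types x86_64 \
        --virtualization-types hvm \
        --instance-requirements 'VCpuCount={Min=16,Max=16},MemoryMiB={Min=32768,Max=32768}'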

Sometimes AWS reclaims the spot instances, and the jobs running on them cannot complete successfully. Such jobs fail midway, most likely during the Compiling step, because the underlying machines are shutting down. The shutdown also causes various secondary errors, as in the following example:

Compiling the kernel with: rpmbuild --rebuild --target x86_64 --with up --with zfcpdump --with base --without trace --without arm64_16k --without arm64_64k --without debug --without realtime --without realtime_arm64_64k
Running after_script
WARNING: Failed to inspect build container 4fb57520e3c8595d879a039776025612780b1b007a4f130deef49bc9118166ee Get "https://10.10.10.10:2376/v1.47/containers/4fb57520e3c8595d879a039776025612780b1b007a4f130deef49bc9118166ee/json": dial tcp 10.10.10.10:2376: i/o timeout (docker_command.go:155:10s)
Using effective pull policy of [always] for container quay.io/cki/builder-stream10:production
Pulling docker image quay.io/cki/builder-stream10:production ...
WARNING: Failed to pull image with policy "always": Post "https://10.10.10.10:2376/v1.47/images/create?fromImage=quay.io%2Fcki%2Fbuilder-stream10&tag=production": dial tcp 10.10.10.10:2376: i/o timeout (manager.go:238:10s)
WARNING: after_script failed, but job will continue unaffected: failed to pull image "quay.io/cki/builder-stream10:production" with specified policies [always]: Post "https://10.10.10.10:2376/v1.47/images/create?fromImage=quay.io%2Fcki%2Fbuilder-stream10&tag=production": dial tcp 10.10.10.10:2376: i/o timeout (manager.go:238:10s)
Uploading artifacts for failed job
WARNING: Failed to inspect predefined container 29f49fb3b1a6add5183a7704b5a8a7195ed27e6e00148c78cc1a014a10294975 Get "https://10.10.10.10:2376/v1.47/containers/29f49fb3b1a6add5183a7704b5a8a7195ed27e6e00148c78cc1a014a10294975/json": dial tcp 10.10.10.10:2376: i/o timeout (docker_command.go:155:10s)
Using helper image:  registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v18.0.2
Using effective pull policy of [always] for container registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v18.0.2
Authenticating with credentials from job payload (GitLab Registry)
Pulling docker image registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v18.0.2 ...
WARNING: Failed to pull image with policy "always": Post "https://10.10.10.10:2376/v1.47/images/create?fromImage=registry.gitlab.com%2Fgitlab-org%2Fgitlab-runner%2Fgitlab-runner-helper&tag=x86_64-v18.0.2": dial tcp 10.10.10.10:2376: i/o timeout (manager.go:238:10s)
Cleaning up project directory and file based variables
WARNING: Failed to inspect predefined container 29f49fb3b1a6add5183a7704b5a8a7195ed27e6e00148c78cc1a014a10294975 Get "https://10.10.10.10:2376/v1.47/containers/29f49fb3b1a6add5183a7704b5a8a7195ed27e6e00148c78cc1a014a10294975/json": dial tcp 10.10.10.10:2376: i/o timeout (docker_command.go:155:10s)
Using helper image:  registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v18.0.2
Using effective pull policy of [always] for container registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v18.0.2
Authenticating with credentials from job payload (GitLab Registry)
Pulling docker image registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v18.0.2 ...
WARNING: Failed to pull image with policy "always": Post "https://10.10.10.10:2376/v1.47/images/create?fromImage=registry.gitlab.com%2Fgitlab-org%2Fgitlab-runner%2Fgitlab-runner-helper&tag=x86_64-v18.0.2": dial tcp 10.10.10.10:2376: i/o timeout (manager.go:238:10s)
ERROR: Failed to cleanup volumes
ERROR: Job failed (system failure): waiting for container: Post "https://10.10.10.10:2376/v1.47/containers/4fb57520e3c8595d879a039776025612780b1b007a4f130deef49bc9118166ee/wait?condition=not-running": dial tcp 10.10.10.10:2376: i/o timeout

While the pipeline-herder automatically retries jobs that fail this way, repeatedly failing build jobs cause kernel CI pipelines to stall, which over time leads to unacceptable delays in the kernel workflow.

Steps

  1. Confirm that the jobs are failing because AWS reclaimed their spot instances. The job logs should match the failure pattern described above.
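
    Optionally, the reclaim can be cross-checked on the AWS side. This is a minimal sketch, assuming AWS CLI access to the account with permission to describe spot instance requests in us-east-1:

    # Spot requests whose instances were terminated by AWS; status codes
    # such as instance-terminated-no-capacity point to a reclaim.
    aws ec2 describe-spot-instance-requests \
        --region us-east-1 \
        --query 'SpotInstanceRequests[?contains(Status.Code, `terminated`)].[InstanceId,Status.Code,Status.UpdateTime]' \
        --output table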

  2. Make sure you have an up-to-date deployment-all checkout, and are logged into SSO and all required services as described in the README there.
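
    For example, assuming the checkout lives at ~/deployment-all (the path and the AWS login check are assumptions; the README there is authoritative for which logins are required):

    # Assumed checkout location; adjust to wherever deployment-all lives.
    git -C ~/deployment-all pull --ff-only

    # Sanity-check AWS credentials; verify the other logins per the README.
    aws sts get-caller-identity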

  3. Inspect the paused setting of the runner activations via

    gitlab-runner-config/deploy.sh activations diff --interactive

    and take note of any deviations from the default settings. These will need to be accounted for in the next step via additional --activate and --deactivate options.

  4. Switch from spot to on-demand instances via

    gitlab-runner-config/deploy.sh activations apply --activate '.*-ondemand' --deactivate '.*-spot'

    Similarly, you can switch back to spot instances via

    gitlab-runner-config/deploy.sh activations apply --activate '.*-spot' --deactivate '.*-ondemand'
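
    If step 3 revealed intentional deviations, account for them by repeating the options, which step 3 implies can be given more than once. A hedged example with a hypothetical runner pattern 'arm64-.*' that should stay paused:

    # 'arm64-.*' is a hypothetical pattern standing in for whatever deviation
    # was noted in step 3; keep it deactivated while moving to on-demand.
    gitlab-runner-config/deploy.sh activations apply \
        --activate '.*-ondemand' \
        --deactivate '.*-spot' \
        --deactivate 'arm64-.*'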