Debugging a failing GitLab pipeline job

How to let a kernel developer investigate a failed GitLab pipeline job from inside the job environment

This page has an internal companion page which might contain additional information.

Problem

A GitLab pipeline job failed, but the error is not reproducible outside the pipeline.

Steps

In a pipeline-definition MR, add debugging code that dumps the environment and then sleeps when the failing step fails, e.g. add the following in pipeline/stages/build.yml:

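if ! rpmbuild --rebuild ...; then
  # Temporary file that receives the dumped environment
  environment_file=$(mktemp)
  cki_echo_error "Build failed"
  cki_echo_notify "dumping environment to $environment_file"
  # Write all exported variables to the file for later inspection
  export > "$environment_file"
  cki_echo_notify "sleeping"
  # Keep the job and its container alive so it can be entered
  sleep infinity
  # Only reached if the sleep is interrupted
  exit 1
fi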

From the kernel repository MR, get the ID of the failed pipeline in the pipeline repositories. Then, in the pipeline-definition MR, retrigger that pipeline with the debug code by asking the bot in a comment such as:
```
@cki-ci-bot, please test [rhel/12345678]
```
Wait until the pipeline hits the sleep. In the log of the failing job, determine the gitlab-runner and the spawned EC2 machine from lines like:
```
Running with ...
on ...-aws-internal-a-...
...
Running on ... via runner-abcdef-arr-cki.prod.general.1234567-123abcdef...
```
Use the script ansible_ssh.sh from the deployment-all repository checkout to access the appropriate EC2 instance. Select arr-cki-prod-internal-runner-us-east-1a.infra.cki-project.org if the pipeline runs in ...-aws-internal-a-... or arr-cki-prod-internal-runner-us-east-1b.infra.cki-project.org if the pipeline runs in ...-aws-internal-b-....
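
If the job log has been saved to a file, the two values can be picked out with standard tools. The following is only a sketch: the log excerpt is the example from above with its elided parts left as `...`, the variable names are made up for illustration, and a real log may contain color codes or other decoration that would need to be stripped first.

```shell
# Hypothetical excerpt of a saved job log; elided parts stay as "...".
job_log=$(mktemp)
cat > "$job_log" <<'EOF'
Running with ...
on ...-aws-internal-a-...
...
Running on ... via runner-abcdef-arr-cki.prod.general.1234567-123abcdef...
EOF

# Spawned machine: the word after "via" on the "Running on" line,
# without the trailing "..." that ends the log line.
machine=$(sed -n 's/^Running on .* via \([^ ]*\)\.\.\.$/\1/p' "$job_log")

# Runner host: selected by the a/b letter in the gitlab-runner name.
case $(grep -o -- '-aws-internal-[ab]-' "$job_log" | head -n 1) in
  -aws-internal-a-) runner_host=arr-cki-prod-internal-runner-us-east-1a.infra.cki-project.org ;;
  -aws-internal-b-) runner_host=arr-cki-prod-internal-runner-us-east-1b.infra.cki-project.org ;;
  *) runner_host=unknown ;;
esac

echo "runner host: $runner_host"
echo "machine:     $machine"
rm -f "$job_log"
```

The `unknown` fallback is deliberate: if neither pattern matches, it is safer to read the log by hand than to guess a host.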

From there, log into the spawned machine, using the machine name from the job log:

sudo docker-machine ssh runner-abcdef-arr-cki.prod.general.1234567-123abcdef

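The Docker container can then be entered by listing the running containers and opening a shell in the one that runs the job (replace container-id with its ID):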

sudo docker ps
sudo docker exec -it container-id /bin/bash
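
Once inside the container, the environment that the debugging code dumped can be loaded back into a shell, because the output of `export` consists of shell statements that can be sourced. The sketch below simulates this end to end so that it can be tried anywhere; `DEMO_BUILD_ARCH` is a made-up variable, and in a real session the file to source is the path printed in the job log by the "dumping environment to" message.

```shell
# Simulate the dump written by the debugging code (made-up variable).
environment_file=$(mktemp)
( export DEMO_BUILD_ARCH="x86_64"; export > "$environment_file" )

# Start from an empty environment, load the dump, and read a variable back.
restored=$(env -i bash -c 'source "$1"; printf "%s" "$DEMO_BUILD_ARCH"' _ "$environment_file")
echo "$restored"

rm -f "$environment_file"
```

Sourcing the dump in a separate shell, as done here with `bash -c`, keeps the restored variables from leaking into the interactive debugging session.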