Retrigger failed GitLab jobs whose failures appear to be caused by infrastructure problems

Configuration via environment variables

| Name | Type | Secret | Required | Description |
|------|------|--------|----------|-------------|
| HERDER_ACTION | enum | no | no | report (default) or retry jobs |
| HERDER_RETRY_LIMIT | int | no | no | maximum number of retries for a job, defaults to 3 |
| HERDER_RETRY_DELAYS | list of int | no | no | comma-delimited delays between retries in minutes, defaults to 0,3,10 |
| HERDER_MAXIMUM_ARTIFACT_SIZE | int | no | no | artifacts larger than this are treated as empty, defaults to 100 MB |
| GITLAB_TOKENS | json | no | yes | URL/environment variable pairs of GitLab instances and private tokens as a JSON object |
| GITLAB_TOKEN | string | yes | yes | GitLab private token as configured in GITLAB_TOKENS above |
| IRCBOT_URL | url | no | no | IRC bot endpoint |
| CKI_LOGGING_LEVEL | enum | no | no | Python logging level for CKI modules, defaults to WARN |
| CKI_METRICS_ENABLED | bool | no | no | enable Prometheus metrics, defaults to false |
| CKI_METRICS_PORT | int | no | no | port where Prometheus metrics are exposed, defaults to 8000 |
| SENTRY_DSN | url | yes | no | Sentry DSN |
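The token configuration could look like the following (hypothetical instance URLs, variable names, and token values; each value in GITLAB_TOKENS names another environment variable that holds the actual private token):

```shell
# Hypothetical example: map each GitLab instance URL to the name of the
# environment variable that contains its private token
export GITLAB_TOKENS='{"https://gitlab.com": "GITLAB_TOKEN", "https://gitlab.example.com": "EXAMPLE_GITLAB_TOKEN"}'
# The referenced variables hold the actual secrets
export GITLAB_TOKEN='glpat-redacted'
export EXAMPLE_GITLAB_TOKEN='glpat-redacted'
```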

RabbitMQ setup

The herder delays the restart of jobs via RabbitMQ dead-letter queues. These need to be set up as described in the resilient message queue documentation.
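The delay mechanism can be sketched as follows (a minimal illustration under assumptions, not the herder's actual implementation; the queue name and function names are made up). A message that should be retried later is published to a delay queue whose TTL comes from HERDER_RETRY_DELAYS; when the TTL expires, RabbitMQ dead-letters the message back to the main work queue:

```python
def parse_retry_delays(value="0,3,10"):
    """Parse a HERDER_RETRY_DELAYS-style string into delays in minutes."""
    return [int(part) for part in value.split(",")]


def delay_queue_arguments(retry_count, target_queue, delays):
    """Build AMQP arguments for a dead-letter based delay queue.

    When the per-queue message TTL expires, RabbitMQ republishes the
    message to the default exchange with the routing key of the main
    work queue, effectively delaying its redelivery.
    """
    # retries beyond the configured list reuse the last delay
    delay_minutes = delays[min(retry_count, len(delays) - 1)]
    return {
        "x-message-ttl": delay_minutes * 60 * 1000,  # TTL in milliseconds
        "x-dead-letter-exchange": "",                # default exchange
        "x-dead-letter-routing-key": target_queue,   # back to the work queue
    }


# with the default delays, the second retry waits 3 minutes
delays = parse_retry_delays()
print(delay_queue_arguments(1, "pipeline-herder", delays)["x-message-ttl"])  # 180000
```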

Checking a single job

It is possible to run all matchers against a single job to see whether anything matches by specifying the job URL:

python3 -m cki_tools.pipeline_herder.main \
    --job-url https://instance/project/-/jobs/012345

Analyzing coverage

The batch_check module can be used to analyze coverage of successfully recovered jobs. In other words, jobs that failed and were restarted successfully should be detected by the herder.

First, the database of jobs in ~/.cache/pipeline-herder/ needs to be populated per project with

python3 -m cki_tools.pipeline_herder.batch_check \
    --project-url PROJECT_URL \
    --private-token PRIVATE_TOKEN \
    --days DAYS
Next, the current coverage can be analyzed with --check.

Interactive review of recovered/failed jobs

Traces of recovered jobs can be interactively reviewed with --review. Ad-hoc regular expressions can be added that can later be integrated into the matchers module. This mode stores a configuration file in ~/.cache/pipeline-herder/review.yaml.

In the main loop, a table is shown that displays the currently available matchers/expressions together with the number of matching recovered jobs (true positives) and matching failed jobs (false positives).

Generally speaking, the number of matching failed jobs should be kept close to zero. These jobs were not restarted (or at least not successfully), and therefore represent jobs that should not be touched by the herder. Any false positives for a matcher/expression should be inspected to make sure that jobs are not needlessly restarted by the herder. This is especially important for test jobs where a restart might take a long time.

The following commands are accepted:

  • q: quit the main loop
  • [tf]123: load all traces for true/false positive jobs for a given matcher/expression in gvim
  • n: open all traces of the next missed recovered job in gvimdiff, and allow the addition of regular expressions

When inspecting the next missed recovered job, new expressions can be entered that are directly tested on all missed job traces. An expression can be stored in the configuration file with y. If the job failed because of an issue that cannot be solved by a restart, the job can be ignored with i so it will not show up again.
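An ad-hoc expression of this kind is a Python regular expression matched against the job trace. As a hypothetical illustration (the pattern and trace below are made up, not actual matchers from the herder):

```python
import re

# Hypothetical infrastructure-failure pattern as it might be entered
# during interactive review
pattern = re.compile(r"Job failed \(system failure\):")

trace = """\
Running with gitlab-runner 15.0.0
ERROR: Job failed (system failure): aborted: terminated
"""

# the expression matches the trace of the recovered job
print(bool(pattern.search(trace)))  # True
```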

Supported failure conditions

Pod resource exhaustion (exit code 137)

Pods might get terminated prematurely with exit code 137. Most of the time, this indicates that the pod has exhausted its ephemeral storage. One example where this can happen is during kernel RPM building. In this case, the job can be retried directly, hopefully moving it to another node. A proper fix would be to provide a larger scratch NFS volume.
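A matcher for this condition only needs to find the exit code in the trace. A minimal sketch (the function name and trace lines are illustrative, not the herder's actual matcher):

```python
def looks_like_pod_exhaustion(trace):
    """Heuristically detect a pod terminated with exit code 137."""
    return "exit code 137" in trace


# illustrative trace excerpt from a failed kernel RPM build
trace = "make[1]: *** [binrpm-pkg] Error 137\ncommand terminated with exit code 137\n"
print(looks_like_pod_exhaustion(trace))  # True
```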

Prometheus Metrics

If CKI_METRICS_ENABLED is true, Prometheus metrics are exposed on the port configured via CKI_METRICS_PORT.

The following metrics are exposed:

| Name | Type | Labels | Description |
|------|------|--------|-------------|
| cki_message_delayed | Counter | none | Number of queued messages delayed via the retry queue |
| cki_herder_problem_detected | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs processed where a problem was found |
| cki_herder_problem_retries | Histogram | gitlab_stage, gitlab_job, matcher | Number of retries for a job with a problem |
| cki_herder_no_problem_detected | Counter | gitlab_stage, gitlab_job | Number of jobs processed where no problem was found |
| cki_herder_problem_reported | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs reported (and not retried) after finding a problem |
| cki_herder_problem_retried | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs retried after finding a problem |
| cki_herder_process_time_seconds | Histogram | none | Time spent matching a job |