Retrigger GitLab jobs whose failures appear to be caused by infrastructure problems

Configuration via environment variables

| Name | Type | Secret | Required | Description |
|------|------|--------|----------|-------------|
| PIPELINE_HERDER_CONFIG | yaml | no | no | Configuration in YAML. If not present, falls back to PIPELINE_HERDER_CONFIG_PATH. |
| PIPELINE_HERDER_CONFIG_PATH | path | no | no | Path to the configuration YAML file |
| GITLAB_TOKENS | json | no | yes | URL/environment variable pairs of GitLab instances and private tokens as a JSON object |
| GITLAB_TOKEN | string | yes | yes | GitLab private tokens as configured in gitlab_tokens above |
| CHATBOT_URL | url | no | no | chat bot endpoint |
| CKI_LOGGING_LEVEL | enum | no | no | Python logging level for CKI modules, defaults to WARN |
| CKI_METRICS_ENABLED | bool | no | no | Enable prometheus metrics. Default: false |
| CKI_METRICS_PORT | int | no | no | Port where prometheus metrics are exposed. Default: 8000 |
| SENTRY_DSN | url | yes | no | Sentry DSN |
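
The indirection between GITLAB_TOKENS and GITLAB_TOKEN, and the fallback from PIPELINE_HERDER_CONFIG to PIPELINE_HERDER_CONFIG_PATH, can be illustrated with a minimal sketch. The helper names `resolve_gitlab_token` and `load_config_text` are hypothetical, and the exact format of the keys in GITLAB_TOKENS (full URL versus host name) is an assumption; this is not the herder's actual code.

```python
import json
import os


def resolve_gitlab_token(instance_url, environ=os.environ):
    """Return the private token for a GitLab instance, or None if unknown."""
    # GITLAB_TOKENS maps instance URLs to the *names* of environment
    # variables, not to the tokens themselves (key format is an assumption).
    token_vars = json.loads(environ.get("GITLAB_TOKENS", "{}"))
    variable_name = token_vars.get(instance_url)
    if variable_name is None:
        return None
    # The secret lives in the variable the mapping points to, e.g. GITLAB_TOKEN.
    return environ.get(variable_name)


def load_config_text(environ=os.environ):
    """Return the raw YAML configuration, preferring the inline variable."""
    inline = environ.get("PIPELINE_HERDER_CONFIG")
    if inline:
        return inline
    path = environ.get("PIPELINE_HERDER_CONFIG_PATH")
    if path is None:
        return None
    with open(path, encoding="utf-8") as handle:
        return handle.read()


example_env = {
    "GITLAB_TOKENS": '{"https://gitlab.example.com": "GITLAB_TOKEN"}',
    "GITLAB_TOKEN": "s3cret",
}
print(resolve_gitlab_token("https://gitlab.example.com", example_env))
```

Keeping only variable names in GITLAB_TOKENS means the JSON object itself does not need to be treated as a secret, which matches the Secret column above.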

Configuration file

Job matching is configured in the shipped configuration file.

  - name: image-pull
    description: Failure during image pull
    messages:
      - 'ERROR: Job failed: image pull failed'
      - 'ERROR: Job failed: failed to pull image'
  - name: integrity
    description: Job failed with data_integrity_failure
    failure_reason: data_integrity_failure
  - action: report
    matchers:
      - name: no-trace
        description: Job has no trace
        builtin: no_trace
      - name: tests-not-run
        description: Test job has tests that did not run
        job_name: test
        job_status: []
        builtin: missed_tests

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| name | string | empty | matcher name |
| description | string | empty | matcher description |
| action | string | retry | retry, report or alert |
| maximum_artifact_size | int | 1_000_000 | maximum artifact size to process |
| retry_delays | list[string] | [5m] | delay between successive retries |
| retry_limit | int | 3 | maximum number of retries, 0 to disable |
| web_url | list[url] | empty | job URL prefixes |
| job_status | list[string] | [failed] | success, failed |
| job_name | string | empty | job name prefix |
| variables | dict[str,list[regex or None]] | empty | allowed trigger variable values |
| exemplars | list[url] | empty | job URLs that should be matched by this node |
| failure_reason | string | empty | data_integrity_failure, stuck_or_timeout_failure, … |
| builtin | string | empty | no_trace, missed_tests |
| messages | list[str or /regex/] | [] | patterns to look for in the log file |
| file_name | string | empty | log file name, uses console log if empty |
| tail_lines | int | 300 | number of lines to check |
| matchers | object | empty | further sub-matchers |

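In general, matchers are processed recursively and depth-first via the matchers field. Fields set on a node are inherited by its sub-matchers, and a field that is redefined at a deeper level overrides the inherited value. A node without a matchers field is a leaf: this is where the actual matching takes place, using all fields collected along the way.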
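
The inheritance rule can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not the herder's implementation; `flatten_matchers` is a hypothetical helper, and the configuration is assumed to be already parsed into plain lists and dicts.

```python
def flatten_matchers(nodes, inherited=None):
    """Yield one merged field dict per leaf matcher, depth-first."""
    inherited = inherited or {}
    for node in nodes:
        # Fields set on this node override values inherited from ancestors.
        own_fields = {k: v for k, v in node.items() if k != "matchers"}
        merged = {**inherited, **own_fields}
        children = node.get("matchers")
        if children:
            # Inner node: descend, passing the collected fields down.
            yield from flatten_matchers(children, merged)
        else:
            # Leaf (matchers missing or empty): matching would happen here.
            yield merged


config = [
    {"name": "integrity", "failure_reason": "data_integrity_failure"},
    {
        "action": "report",
        "matchers": [
            {"name": "no-trace", "builtin": "no_trace"},
            {"name": "tests-not-run", "job_name": "test",
             "job_status": [], "builtin": "missed_tests"},
        ],
    },
]

for leaf in flatten_matchers(config):
    # "retry" is the documented default for the action field.
    print(leaf["name"], leaf.get("action", "retry"))
```

With the sample above, the two sub-matchers pick up action: report from their parent, while integrity keeps the default action.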

For message matching via regular expressions, regex modifiers/flags cannot be appended after the trailing slash. They have to be provided inline via (?aiLmsux), e.g. /(?i)job failed/ for a case-insensitive match.
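
A minimal sketch of how a messages entry might be interpreted, using Python's `re` module. The helper `compile_message` is hypothetical, and treating plain strings as literal substrings searched anywhere in the log is an assumption, not confirmed behaviour of the herder.

```python
import re


def compile_message(pattern):
    """Compile a messages entry: /regex/ is a regex, anything else a literal."""
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        # Nothing may follow the trailing slash, so flags must be inline.
        return re.compile(pattern[1:-1])
    # Plain strings are escaped so that characters like ":" or "." match literally.
    return re.compile(re.escape(pattern))


literal = compile_message("ERROR: Job failed: image pull failed")
insensitive = compile_message("/(?i)failed to pull image/")
sensitive = compile_message("/failed to pull image/")

log_line = "error: job failed: FAILED TO PULL IMAGE"
print(bool(insensitive.search(log_line)))  # matches thanks to the inline (?i)
print(bool(sensitive.search(log_line)))    # no match without the flag
```

Inline flags have to appear at the very start of the expression in current Python versions.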

RabbitMQ setup

The herder delays the restart of jobs via RabbitMQ dead-letter queues. These need to be set up as described in the resilient message queue documentation.
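
As an illustration of how retry_delays and retry_limit could translate into a delay before a job is retried, here is a sketch under stated assumptions: the helpers `parse_delay` and `next_retry_delay` are hypothetical, only the suffixes s, m and h are handled, and the last delay is reused once the list is exhausted. The herder's real rules may differ.

```python
import re

# Assumed unit suffixes; the documentation only shows "5m".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_delay(text):
    """Convert a delay such as '5m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported delay: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def next_retry_delay(retries_so_far, retry_delays=("5m",), retry_limit=3):
    """Return the delay in seconds before the next retry, or None to stop."""
    # A retry_limit of 0 disables retries, as documented in the field table.
    if retries_so_far >= retry_limit or not retry_delays:
        return None
    # Assumption: reuse the last configured delay when the list is shorter
    # than the number of allowed retries.
    index = min(retries_so_far, len(retry_delays) - 1)
    return parse_delay(retry_delays[index])


print(next_retry_delay(0))                  # first retry with the defaults
print(next_retry_delay(2, ["1m", "10m"]))   # list exhausted, last delay reused
print(next_retry_delay(3))                  # limit reached
```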

Checking a single job

To see whether any matcher applies to a single job, run all matchers against it by specifying the job URL via

python3 -m cki_tools.pipeline_herder.main \
    --job-url https://instance/project/-/jobs/012345
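
For reference, a job URL of this shape can be split into instance, project path and job ID as sketched below. The helper `parse_job_url` is hypothetical and only illustrates the URL layout; it is not part of the tool.

```python
from urllib.parse import urlsplit


def parse_job_url(job_url):
    """Split a GitLab job URL into (instance URL, project path, job ID)."""
    parts = urlsplit(job_url)
    # GitLab separates the project path from the job ID with "/-/jobs/".
    project_path, separator, job_id = parts.path.partition("/-/jobs/")
    job_id = job_id.rstrip("/")
    if not separator or not job_id.isdigit():
        raise ValueError(f"not a GitLab job URL: {job_url!r}")
    instance = f"{parts.scheme}://{parts.netloc}"
    return instance, project_path.strip("/"), int(job_id)


print(parse_job_url("https://instance/project/-/jobs/012345"))
```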

Validating the configuration

All embedded job exemplars can be checked via

python3 -m cki_tools.pipeline_herder.main --validate

Prometheus Metrics

If CKI_METRICS_ENABLED is true, Prometheus metrics are exposed on the CKI_METRICS_PORT port.

The exposed data is the following:

| Name | Type | Labels | Description |
|------|------|--------|-------------|
| cki_message_delayed | Counter | no | Number of queued messages delayed via retry queue |
| cki_herder_problem_detected | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs processed where a problem was found |
| cki_herder_problem_retries | Histogram | gitlab_stage, gitlab_job, matcher | Number of retries for a job with a problem |
| cki_herder_no_problem_detected | Counter | gitlab_stage, gitlab_job | Number of jobs processed where no problem was found |
| cki_herder_problem_reported | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs reported (and not retried) after finding a problem |
| cki_herder_problem_retried | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs retried after finding a problem |
| cki_herder_process_time_seconds | Histogram | no | Time spent matching a job |