Retrigger failed GitLab jobs whose failures appear to be caused by infrastructure problems

Configuration via environment variables

| Name | Type | Secret | Required | Description |
|------|------|--------|----------|-------------|
| HERDER_ACTION | enum | no | no | report (default) or retry jobs |
| HERDER_RETRY_LIMIT | int | no | no | maximum number of retries for a job, defaults to 3 |
| HERDER_RETRY_DELAYS | list of int | no | no | comma-delimited delays between retries in minutes, defaults to 0,3,10 |
| HERDER_MAXIMUM_ARTIFACT_SIZE | int | no | no | artifacts larger than this are treated as empty, defaults to 100 MB |
| GITLAB_TOKENS | json | no | yes | URL/environment variable pairs of GitLab instances and private tokens as a JSON object |
| GITLAB_TOKEN | string | yes | yes | GitLab private token as configured in GITLAB_TOKENS above |
| IRCBOT_URL | url | no | no | IRC bot endpoint |
| CKI_LOGGING_LEVEL | enum | no | no | Python logging level for CKI modules, defaults to WARN |
| CKI_METRICS_ENABLED | bool | no | no | enable Prometheus metrics, defaults to false |
| CKI_METRICS_PORT | int | no | no | port where Prometheus metrics are exposed, defaults to 8000 |
| SENTRY_DSN | url | yes | no | Sentry DSN |
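The token configuration could look like the following (hypothetical instance URLs, variable names, and token values; each value in GITLAB_TOKENS names another environment variable that holds the actual private token):

```shell
# Hypothetical example: map each GitLab instance URL to the name of the
# environment variable that contains its private token
export GITLAB_TOKENS='{"https://gitlab.com": "GITLAB_TOKEN", "https://gitlab.example.com": "EXAMPLE_GITLAB_TOKEN"}'
# The referenced variables hold the actual secrets
export GITLAB_TOKEN='glpat-redacted'
export EXAMPLE_GITLAB_TOKEN='glpat-redacted'
```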

RabbitMQ setup

The herder delays the restart of jobs via RabbitMQ dead-letter queues. These need to be set up as described in the resilient message queue documentation.
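The delay mechanism can be sketched as follows (a minimal illustration under assumptions, not the herder's actual implementation; the queue name and function names are made up). A message that should be retried later is published to a delay queue whose TTL comes from HERDER_RETRY_DELAYS; when the TTL expires, RabbitMQ dead-letters the message back to the main work queue:

```python
def parse_retry_delays(value="0,3,10"):
    """Parse a HERDER_RETRY_DELAYS-style string into delays in minutes."""
    return [int(part) for part in value.split(",")]


def delay_queue_arguments(retry_count, target_queue, delays):
    """Build AMQP arguments for a dead-letter based delay queue.

    When the per-queue message TTL expires, RabbitMQ republishes the
    message to the default exchange with the routing key of the main
    work queue, effectively delaying its redelivery.
    """
    # retries beyond the configured list reuse the last delay
    delay_minutes = delays[min(retry_count, len(delays) - 1)]
    return {
        "x-message-ttl": delay_minutes * 60 * 1000,  # TTL in milliseconds
        "x-dead-letter-exchange": "",                # default exchange
        "x-dead-letter-routing-key": target_queue,   # back to the work queue
    }


# with the default delays, the second retry waits 3 minutes
delays = parse_retry_delays()
print(delay_queue_arguments(1, "pipeline-herder", delays)["x-message-ttl"])  # 180000
```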

Checking a single job

It is possible to run all matchers against a single job to see whether anything matches by specifying the job URL:

python3 -m cki_tools.pipeline_herder.main \
    --job-url https://instance/project/-/jobs/012345

Analyzing coverage

The batch_check module can be used to analyze coverage of successfully recovered jobs. In other words, jobs that failed and were restarted successfully should be detected by the herder.

First, the database of jobs in ~/.cache/pipeline-herder/ needs to be populated per project with

python3 -m cki_tools.pipeline_herder.batch_check \
    --project-url PROJECT_URL \
    --private-token PRIVATE_TOKEN \
    --days DAYS
Next, the current coverage can be analyzed with --check.

Interactive review of recovered/failed jobs

Traces of recovered jobs can be interactively reviewed with --review. Ad-hoc regular expressions can be added that can later be integrated into the matchers module. This mode stores a configuration file in ~/.cache/pipeline-herder/review.yaml.

In the main loop, a table is shown that displays the currently available matchers/expressions together with the number of matching recovered jobs (true positives) and matching failed jobs (false positives).

Generally speaking, the number of matching failed jobs should be kept close to zero. These jobs were not restarted (or at least not successfully), and therefore represent jobs that should not be touched by the herder. Any false positives for a matcher/expression should be inspected to make sure that jobs are not needlessly restarted by the herder. This is especially important for test jobs where a restart might take a long time.

The following commands are accepted:

  • q: quit the main loop
  • [tf]123: load all traces for true/false positive jobs for a given matcher/expression in gvim
  • n: open all traces of the next missed recovered job in gvimdiff, and allow the addition of regular expressions

When inspecting the next missed recovered job, new expressions can be entered that are directly tested on all missed job traces. An expression can be stored in the configuration file with y. If the job failed because of an issue that cannot be solved by a restart, the job can be ignored with i so it will not show up again.
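An ad-hoc expression of this kind is a Python regular expression matched against the job trace. As a hypothetical illustration (the pattern and trace below are made up, not actual matchers from the herder):

```python
import re

# Hypothetical infrastructure-failure pattern as it might be entered
# during interactive review
pattern = re.compile(r"Job failed \(system failure\):")

trace = """\
Running with gitlab-runner 15.0.0
ERROR: Job failed (system failure): aborted: terminated
"""

# the expression matches the trace of the recovered job
print(bool(pattern.search(trace)))  # True
```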

Supported failure conditions

Pod resource exhaustion (exit code 137)

Pods might get terminated prematurely with exit code 137. Most of the time, this indicates that the pod has exhausted its ephemeral storage. One example where this can happen is during kernel RPM building. In this case, the job can be retried directly, hopefully moving it to another node. A proper fix would be to provide a larger scratch NFS volume.
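A matcher for this condition only needs to find the exit code in the trace. A minimal sketch (the function name and trace lines are illustrative, not the herder's actual matcher):

```python
def looks_like_pod_exhaustion(trace):
    """Heuristically detect a pod terminated with exit code 137."""
    return "exit code 137" in trace


# illustrative trace excerpt from a failed kernel RPM build
trace = "make[1]: *** [binrpm-pkg] Error 137\ncommand terminated with exit code 137\n"
print(looks_like_pod_exhaustion(trace))  # True
```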

Prometheus Metrics

If CKI_METRICS_ENABLED is true, Prometheus metrics are exposed on the port configured via CKI_METRICS_PORT.

The following metrics are exposed:

| Name | Type | Labels | Description |
|------|------|--------|-------------|
| cki_message_delayed | Counter | none | Number of queued messages delayed via the retry queue |
| cki_herder_problem_detected | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs processed where a problem was found |
| cki_herder_problem_retries | Histogram | gitlab_stage, gitlab_job, matcher | Number of retries for a job with a problem |
| cki_herder_no_problem_detected | Counter | gitlab_stage, gitlab_job | Number of jobs processed where no problem was found |
| cki_herder_problem_reported | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs reported (and not retried) after finding a problem |
| cki_herder_problem_retried | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs retried after finding a problem |
| cki_herder_process_time_seconds | Histogram | none | Time spent matching a job |