Retrigger failed GitLab jobs whose failures seem to be caused by infrastructure problems

Configuration via environment variables

| Name                         | Type        | Secret | Required | Description                                                                              |
|------------------------------|-------------|--------|----------|------------------------------------------------------------------------------------------|
| HERDER_ACTION                | enum        | no     | no       | report (default) or retry jobs                                                           |
| HERDER_RETRY_LIMIT           | int         | no     | no       | Maximum number of retries for a job, defaults to 3                                       |
| HERDER_RETRY_DELAYS          | list of int | no     | no       | Comma-delimited delays between retries in minutes, defaults to 0,3,10                    |
| HERDER_MAXIMUM_ARTIFACT_SIZE | int         | no     | no       | Artifacts larger than this will be treated as empty, defaults to 100MB                   |
| GITLAB_TOKENS                | json        | no     | yes      | URL/environment variable pairs of GitLab instances and private tokens as a JSON object   |
| GITLAB_TOKEN                 | string      | yes    | yes      | GitLab private tokens as configured in GITLAB_TOKENS above                               |
| CHATBOT_URL                  | url         | no     | no       | Chat bot endpoint                                                                        |
| CKI_LOGGING_LEVEL            | enum        | no     | no       | Python logging level for CKI modules, defaults to WARN                                   |
| CKI_METRICS_ENABLED          | bool        | no     | no       | Enable Prometheus metrics, defaults to false                                             |
| CKI_METRICS_PORT             | int         | no     | no       | Port where Prometheus metrics are exposed, defaults to 8000                              |
| SENTRY_DSN                   | url         | yes    | no       | Sentry DSN                                                                               |
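
As a rough illustration of how the integer and list settings above could be read, the following sketch parses HERDER_RETRY_LIMIT and HERDER_RETRY_DELAYS from a mapping such as os.environ. The function names, and the rule of reusing the last delay once the list is exhausted, are assumptions made for this example and not necessarily what the herder does.

```python
import os


def parse_retry_settings(environ=os.environ):
    """Read the retry limit and delays, using the documented defaults."""
    limit = int(environ.get('HERDER_RETRY_LIMIT', '3'))
    raw_delays = environ.get('HERDER_RETRY_DELAYS', '0,3,10')
    # Comma-delimited list of minutes; empty items are skipped.
    delays = [int(item) for item in raw_delays.split(',') if item.strip()]
    return limit, delays


def delay_for_retry(retry_index, delays):
    """Return the delay in minutes before the given retry (0-based).

    Assumption: once the list is exhausted, the last delay is reused.
    """
    if not delays:
        return 0
    return delays[min(retry_index, len(delays) - 1)]
```

With the defaults, the first retry would happen immediately, the second after 3 minutes and the third after 10 minutes.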

RabbitMQ setup

The herder delays the restart of jobs via RabbitMQ dead-letter queues. These need to be set up as described in the resilient message queue documentation.
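
For orientation, delaying via dead-lettering generally works by parking a message in a queue with a time-to-live. When the message expires, RabbitMQ forwards it to the configured dead-letter exchange. The sketch below only builds the standard queue arguments for such a retry queue. The helper name and the exchange name are hypothetical, and the real setup should follow the resilient message queue documentation.

```python
def retry_queue_arguments(delay_minutes, dead_letter_exchange):
    """Build RabbitMQ queue arguments for a delay queue (illustrative only)."""
    return {
        # RabbitMQ expects the time-to-live in milliseconds.
        'x-message-ttl': delay_minutes * 60 * 1000,
        # Expired messages are republished to this exchange.
        'x-dead-letter-exchange': dead_letter_exchange,
    }


# Hypothetical exchange name, 3 minute delay.
arguments = retry_queue_arguments(3, 'herder.exchange')
```

A dictionary like this can be passed as the queue arguments when declaring the queue with a RabbitMQ client library.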

Checking a single job

To see whether any matcher applies to a single job, run all matchers against it by specifying the job URL:

python3 -m cki_tools.pipeline_herder.main \
    --job-url https://instance/project/-/jobs/012345
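
A job URL of this shape carries the GitLab instance, the project path and the job ID. The following sketch shows one way such a URL could be split into those parts. The function name is hypothetical and this is not necessarily how the herder parses the URL internally.

```python
from urllib.parse import urlsplit


def split_job_url(job_url):
    """Split a GitLab job URL into (instance URL, project path, job ID)."""
    parts = urlsplit(job_url)
    # GitLab separates the project path from the resource with '/-/'.
    project, _, job_id = parts.path.partition('/-/jobs/')
    if not job_id:
        raise ValueError(f'not a GitLab job URL: {job_url}')
    instance = f'{parts.scheme}://{parts.netloc}'
    return instance, project.strip('/'), int(job_id.strip('/'))


instance, project, job_id = split_job_url('https://instance/project/-/jobs/012345')
```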

Prometheus Metrics

If CKI_METRICS_ENABLED is true, Prometheus metrics are exposed on the port given by CKI_METRICS_PORT.

The following metrics are exposed:

| Name                            | Type      | Labels                            | Description                                                       |
|---------------------------------|-----------|-----------------------------------|-------------------------------------------------------------------|
| cki_message_delayed             | Counter   | none                              | Number of queued messages delayed via retry queue                 |
| cki_herder_problem_detected     | Counter   | gitlab_stage, gitlab_job, matcher | Number of jobs processed where a problem was found                |
| cki_herder_problem_retries      | Histogram | gitlab_stage, gitlab_job, matcher | Number of retries for a job with a problem                        |
| cki_herder_no_problem_detected  | Counter   | gitlab_stage, gitlab_job          | Number of jobs processed where no problem was found               |
| cki_herder_problem_reported     | Counter   | gitlab_stage, gitlab_job, matcher | Number of jobs reported (and not retried) after finding a problem |
| cki_herder_problem_retried      | Counter   | gitlab_stage, gitlab_job, matcher | Number of jobs retried after finding a problem                    |
| cki_herder_process_time_seconds | Histogram | none                              | Time spent matching a job                                         |