cki_tools.pipeline_herder
Retrigger failed GitLab jobs that seem to be caused by infrastructure failures
This page has an internal companion page which might contain additional information.
Configuration via environment variables
Name | Type | Secret | Required | Description |
---|---|---|---|---|
HERDER_ACTION | enum | no | no | report (default) or retry jobs |
HERDER_RETRY_LIMIT | int | no | no | maximum number of retries for a job, defaults to 3 |
HERDER_RETRY_DELAYS | list of int | no | no | comma-delimited delays between retries in minutes, defaults to 0,3,10 |
HERDER_MAXIMUM_ARTIFACT_SIZE | int | no | no | artifacts larger than this will be treated as empty, defaults to 100MB |
GITLAB_TOKENS | json | no | yes | URL/environment variable pairs of GitLab instances and private tokens as a JSON object |
GITLAB_TOKEN | string | yes | yes | GitLab private tokens as configured in GITLAB_TOKENS above |
CHATBOT_URL | url | no | no | chat bot endpoint |
CKI_LOGGING_LEVEL | enum | no | no | Python logging level for CKI modules, defaults to WARN |
CKI_METRICS_ENABLED | bool | no | no | Enable Prometheus metrics. Default: false |
CKI_METRICS_PORT | int | no | no | Port where Prometheus metrics are exposed. Default: 8000 |
SENTRY_DSN | url | yes | no | Sentry DSN |
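To illustrate how the variables above fit together, here is a minimal sketch of reading them in Python. It is not the herder's actual implementation: the function names `load_herder_config` and `resolve_gitlab_token` are hypothetical, and the sketch assumes that each key of the `GITLAB_TOKENS` JSON object is a GitLab instance URL or host name and each value is the name of the environment variable holding the token.

```python
import json
import os
from urllib.parse import urlparse


def load_herder_config(environ=None):
    """Read the herder settings from a mapping of environment variables."""
    env = os.environ if environ is None else environ
    # Comma-delimited minutes, e.g. "0,3,10" (the documented default).
    raw_delays = env.get("HERDER_RETRY_DELAYS", "0,3,10")
    return {
        "action": env.get("HERDER_ACTION", "report"),
        "retry_limit": int(env.get("HERDER_RETRY_LIMIT", "3")),
        "retry_delays": [int(item) for item in raw_delays.split(",")],
    }


def _host(value):
    """Return the host part of a URL, or the value itself if it has no scheme."""
    return urlparse(value).netloc if "://" in value else value


def resolve_gitlab_token(url, environ=None):
    """Look up the private token for the GitLab instance serving `url`.

    Assumption: GITLAB_TOKENS maps an instance URL or host name to the NAME
    of an environment variable that contains the token.
    """
    env = os.environ if environ is None else environ
    tokens = json.loads(env.get("GITLAB_TOKENS", "{}"))
    for instance, variable_name in tokens.items():
        if _host(instance) == _host(url):
            return env.get(variable_name)
    return None
```

With `GITLAB_TOKENS` set to `{"https://gitlab.example.com": "GITLAB_TOKEN"}` (a made-up instance), `resolve_gitlab_token` would return the content of `GITLAB_TOKEN` for any job URL on that host.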
RabbitMQ setup
The herder delays the restart of jobs via RabbitMQ dead-letter queues. These queues need to be set up as described in the resilient message queue documentation.
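As a rough sketch of the general dead-letter technique (the authoritative setup is the one in the resilient message queue documentation), a delay can be implemented with a queue that has no consumers: messages expire after a TTL and are then dead-lettered back to the exchange the herder consumes from. The helper names, and the choice to reuse the last delay once the list is exhausted, are assumptions for illustration; the `x-` keys are standard RabbitMQ queue arguments.

```python
def delay_minutes(retry_count, delays=(0, 3, 10)):
    """Pick the delay in minutes before retry number `retry_count` (0-based).

    Assumption: retries beyond the end of the list reuse the last delay.
    """
    return delays[min(retry_count, len(delays) - 1)]


def retry_queue_arguments(delay, exchange, routing_key):
    """Build RabbitMQ queue arguments for a consumer-less delay queue.

    Messages published to such a queue expire after `delay` minutes and are
    then routed to `exchange` with `routing_key`.
    """
    return {
        "x-message-ttl": delay * 60 * 1000,  # RabbitMQ expects milliseconds
        "x-dead-letter-exchange": exchange,
        "x-dead-letter-routing-key": routing_key,
    }
```

The resulting dictionary could be passed as the `arguments` of a queue declaration in a RabbitMQ client library.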
Checking a single job
To see whether any matcher applies to a single job, run all matchers against it by passing the job URL via --job-url:
python3 -m cki_tools.pipeline_herder.main \
--job-url https://instance/project/-/jobs/012345
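A job URL of this shape carries the instance, the project path, and the job ID. The following sketch shows one way to split it using only the standard library; `split_job_url` is a hypothetical helper, not a function of `cki_tools`.

```python
from urllib.parse import urlparse


def split_job_url(job_url):
    """Split a GitLab job URL into (instance URL, project path, job ID)."""
    parsed = urlparse(job_url)
    # GitLab separates the project path from the job ID with "/-/jobs/".
    project_path, marker, job_id = parsed.path.partition("/-/jobs/")
    if not marker or not job_id.isdigit():
        raise ValueError(f"not a GitLab job URL: {job_url}")
    instance = f"{parsed.scheme}://{parsed.netloc}"
    return instance, project_path.strip("/"), int(job_id)
```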
Prometheus Metrics
If CKI_METRICS_ENABLED is true, Prometheus metrics are exposed on the CKI_METRICS_PORT port.
The following metrics are exposed:
Name | Type | Labels | Description |
---|---|---|---|
cki_message_delayed | Counter | no | Number of queued messages delayed via retry queue |
cki_herder_problem_detected | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs processed where a problem was found |
cki_herder_problem_retries | Histogram | gitlab_stage, gitlab_job, matcher | Number of retries for a job with a problem |
cki_herder_no_problem_detected | Counter | gitlab_stage, gitlab_job | Number of jobs processed where no problem was found |
cki_herder_problem_reported | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs reported (and not retried) after finding a problem |
cki_herder_problem_retried | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs retried after finding a problem |
cki_herder_process_time_seconds | Histogram | no | Time spent matching a job |
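For a quick look at these metrics without a Prometheus server, the text exposed on the metrics port can be filtered directly. The sketch below is a simplified parser of the Prometheus text format, written for illustration only: `herder_samples` is a hypothetical helper, and it assumes that sample lines carry no trailing timestamp.

```python
def herder_samples(metrics_text, prefix="cki_herder_"):
    """Return {"name{labels}": value} for all samples starting with `prefix`."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # The value follows the last space (assumes no trailing timestamp).
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.startswith(prefix):
            samples[name_and_labels] = float(value)
    return samples
```

The input text could be fetched with `urllib.request.urlopen`, for example from `http://localhost:8000/metrics` when the default port is used; the host and path here are assumptions about a local setup.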