cki_tools.pipeline_herder
Configuration via environment variables
Name | Type | Secret | Required | Description |
---|---|---|---|---|
HERDER_ACTION | enum | no | no | report (default) or retry jobs |
HERDER_RETRY_LIMIT | int | no | no | maximum number of retries for a job, defaults to 3 |
HERDER_RETRY_DELAYS | list of int | no | no | comma-delimited delays between retries in minutes, defaults to 0,3,10 |
HERDER_MAXIMUM_ARTIFACT_SIZE | int | no | no | artifacts larger than this will be treated as empty, defaults to 100MB |
GITLAB_TOKENS | json | no | yes | URL/environment variable pairs of GitLab instances and private tokens as a JSON object |
GITLAB_TOKEN | string | yes | yes | GitLab private tokens as configured in GITLAB_TOKENS above |
IRCBOT_URL | url | no | no | IRC bot endpoint |
CKI_LOGGING_LEVEL | enum | no | no | Python logging level for CKI modules, defaults to WARN |
CKI_METRICS_ENABLED | bool | no | no | Enable prometheus metrics. Default: false |
CKI_METRICS_PORT | int | no | no | Port where prometheus metrics are exposed. Default: 8000 |
SENTRY_DSN | url | yes | no | Sentry DSN |
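To illustrate how the variables in the table fit together, here is a minimal sketch of reading them from an environment mapping. The helper `load_herder_config`, the example URL and the dummy token are assumptions made for illustration; this is not the herder's own configuration code.

```python
import json


def load_herder_config(env):
    """Read herder settings from a mapping such as os.environ.

    Hypothetical helper; the defaults are the ones listed in the table.
    """
    delays = env.get('HERDER_RETRY_DELAYS', '0,3,10')
    tokens = json.loads(env.get('GITLAB_TOKENS', '{}'))
    return {
        'action': env.get('HERDER_ACTION', 'report'),
        'retry_limit': int(env.get('HERDER_RETRY_LIMIT', '3')),
        # Comma-delimited minutes become a list of integers.
        'retry_delays': [int(delay) for delay in delays.split(',')],
        # GITLAB_TOKENS maps an instance URL to the NAME of the variable
        # that holds the token, so the name is resolved in the same mapping.
        'tokens': {url: env.get(name) for url, name in tokens.items()},
    }


example_env = {
    'HERDER_ACTION': 'retry',
    'HERDER_RETRY_DELAYS': '0,5',
    'GITLAB_TOKENS': '{"https://gitlab.example": "GITLAB_TOKEN"}',
    'GITLAB_TOKEN': 'not-a-real-token',
}
config = load_herder_config(example_env)
print(config['action'], config['retry_limit'], config['retry_delays'])
```

Keeping the token itself in a separate variable means that GITLAB_TOKENS can name several instances while each secret stays in its own variable.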
RabbitMQ setup
The herder delays the restart of jobs via RabbitMQ dead-letter queues. These need to be set up as described in the resilient message queue documentation.
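A common way to delay messages with dead-letter queues is a holding queue without consumers: messages expire after a per-queue TTL and are then dead-lettered back to the working exchange. The sketch below only builds the queue arguments for that pattern. The function name, the exchange name `herder.work` and the routing key are assumptions, and the actual setup in the resilient message queue documentation may differ.

```python
def delay_queue_arguments(delay_minutes, work_exchange, routing_key):
    """Build RabbitMQ queue arguments for a holding queue (hypothetical helper)."""
    return {
        # RabbitMQ expects the TTL in milliseconds.
        'x-message-ttl': delay_minutes * 60 * 1000,
        # Expired messages are republished to this exchange ...
        'x-dead-letter-exchange': work_exchange,
        # ... with this routing key, so they reach the working queue again.
        'x-dead-letter-routing-key': routing_key,
    }


# One holding queue per configured delay (default HERDER_RETRY_DELAYS).
for minutes in (0, 3, 10):
    print(minutes, delay_queue_arguments(minutes, 'herder.work', 'herder'))
```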
Checking a single job
To see whether any matcher applies to a single job, run all matchers against it by passing the job URL:
python3 -m cki_tools.pipeline_herder.main \
--job-url https://instance/project/-/jobs/012345
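The job URL carries the instance, the project path and the job ID. As a rough sketch of how such a URL can be taken apart, assuming the `/-/jobs/` layout shown in the command (`parse_job_url` is a hypothetical helper, not part of the herder):

```python
from urllib.parse import urlsplit


def parse_job_url(job_url):
    """Split a GitLab job URL into (instance URL, project path, job ID)."""
    parts = urlsplit(job_url)
    # 'group/project/-/jobs/123' -> ('group/project', '/-/jobs/', '123')
    project, separator, job_id = parts.path.strip('/').partition('/-/jobs/')
    if not separator or not job_id.isdigit():
        raise ValueError(f'not a job URL: {job_url}')
    return f'{parts.scheme}://{parts.netloc}', project, int(job_id)


print(parse_job_url('https://instance/project/-/jobs/012345'))
```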
Analyzing coverage
The batch_check module can be used to analyze coverage of successfully recovered jobs. In other words, jobs that failed and were restarted successfully should be detected by the herder.
First, the database of jobs in ~/.cache/pipeline-herder/ needs to be populated per project with
python3 -m cki_tools.pipeline_herder.batch_check \
--project-url PROJECT_URL \
--private-token PRIVATE_TOKEN \
--days DAYS \
--update
Next, the current coverage can be analyzed with --check.
Interactive review of recovered/failed jobs
Traces of recovered jobs can be interactively reviewed with --review. Ad-hoc regular expressions can be added that can later be integrated into the matchers module. This mode stores a configuration file in ~/.cache/pipeline-herder/review.yaml.
In the main loop, a table is shown that displays the currently available matchers/expressions together with the number of matching recovered jobs (true positive) and matching failed jobs (false positive).
Generally speaking, the number of matching failed jobs should be kept close to zero. These jobs were not restarted (or at least not successfully), and therefore represent jobs that should not be touched by the herder. Any false positives for a matcher/expression should be inspected to make sure that jobs are not needlessly restarted by the herder. This is especially important for test jobs where a restart might take a long time.
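The two columns of that table can be sketched as follows. This is a simplified illustration that treats every matcher as a plain regular expression applied to trace text. The function name and the sample traces are made up for the example.

```python
import re


def count_matches(expressions, recovered_traces, failed_traces):
    """Return {expression: (true_positives, false_positives)}."""
    table = {}
    for expression in expressions:
        pattern = re.compile(expression)
        # Recovered jobs that match: restarts that would have helped.
        true_positives = sum(1 for t in recovered_traces if pattern.search(t))
        # Failed jobs that match: restarts that would have been wasted.
        false_positives = sum(1 for t in failed_traces if pattern.search(t))
        table[expression] = (true_positives, false_positives)
    return table


recovered = ['... connection reset by peer ...', '... exit code 137 ...']
failed = ['... test suite failed ...', '... connection reset by peer ...']
result = count_matches([r'exit code 137', r'connection reset'], recovered, failed)
for expression, (tp, fp) in result.items():
    print(f'{expression!r}: {tp} recovered, {fp} failed')
```

In this made-up data set, the second expression also matches a failed job, so it would need a closer look before being promoted to a matcher.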
The following commands are accepted:
q: quit the main loop
[tf]123: load all traces for true/false positive jobs for a given matcher/expression in gvim
n: open all traces of the next missed recovered job in gvimdiff, and allow the addition of regular expressions
When inspecting the next missed recovered job, new expressions can be entered that are directly tested on all missed job traces. An expression can be stored in the configuration file with y. If the job failed because of an issue that cannot be solved by a restart, the job can be ignored with i so it will not show up again.
Supported failure conditions
Pod resource exhaustion (exit code 137)
Pods might get terminated prematurely with exit code 137. Most of the time, this indicates that the pod has exhausted its ephemeral storage. One example where this can happen is during kernel RPM building. In this case, the job can be retried directly, hopefully moving it to another node. A proper fix would be to provide a larger scratch NFS volume.
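A check for this condition could look roughly like the sketch below. The exact wording of the trace line is an assumption, and the real matcher in the matchers module may use a different expression or additional criteria.

```python
import re

# \b keeps the pattern from matching longer codes such as 1370.
EXIT_137 = re.compile(r'exit code 137\b')


def should_retry(trace):
    """Return True if the trace shows a termination with exit code 137."""
    return EXIT_137.search(trace) is not None


# Assumed example of a trace line; real traces may be worded differently.
sample_trace = 'ERROR: Job failed: command terminated with exit code 137\n'
print(should_retry(sample_trace))
```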
Prometheus Metrics
If CKI_METRICS_ENABLED is true, Prometheus metrics are exposed on the CKI_METRICS_PORT port.
The exposed data is the following:
Name | Type | Labels | Description |
---|---|---|---|
cki_message_delayed | Counter | no | Number of queued messages delayed via retry queue |
cki_herder_problem_detected | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs processed where a problem was found |
cki_herder_problem_retries | Histogram | gitlab_stage, gitlab_job, matcher | Number of retries for a job with a problem |
cki_herder_no_problem_detected | Counter | gitlab_stage, gitlab_job | Number of jobs processed where no problem was found |
cki_herder_problem_reported | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs reported (and not retried) after finding a problem |
cki_herder_problem_retried | Counter | gitlab_stage, gitlab_job, matcher | Number of jobs retried after finding a problem |
cki_herder_process_time_seconds | Histogram | no | Time spent matching a job |
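For a quick look at these metrics without a Prometheus server, the text exposition output can be summed per metric name. The sketch below is a simplified parser for illustration only. The sample text, including its label values, is invented, and the exported sample names may carry suffixes such as `_total` depending on the client library.

```python
import re

# name, optional {labels}, then the sample value
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')


def sum_by_metric(exposition_text):
    """Sum all samples of each metric across their label sets."""
    totals = {}
    for line in exposition_text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match:
            name = match['name']
            totals[name] = totals.get(name, 0.0) + float(match['value'])
    return totals


# Invented sample output; label values are placeholders.
sample = '''\
# TYPE cki_herder_problem_retried counter
cki_herder_problem_retried{gitlab_stage="build",gitlab_job="build x86_64",matcher="code137"} 2.0
cki_herder_problem_retried{gitlab_stage="test",gitlab_job="test x86_64",matcher="code137"} 1.0
cki_herder_no_problem_detected{gitlab_stage="build",gitlab_job="build x86_64"} 7.0
'''
totals = sum_by_metric(sample)
print(totals)
```

The text itself could be fetched with `urllib.request.urlopen` from the CKI_METRICS_PORT port. The `/metrics` path commonly used for this is an assumption here.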