CKI-009: CKI pipeline plan generation

Knowing the expected output of a CKI pipeline from the beginning

Abstract

DataWarehouse is turning into the source of truth about CKI pipelines for many of the services that require results data.

With the CKI pipeline generating its own KCIDB data, the data in DataWarehouse is becoming more reliable every day, but one thing is still missing: knowing when all the builds and tests for a checkout have finished and the results can be reported.

Using KCIDB, we can generate a “plan” of expected results beforehand to help us understand the status of the CKI pipeline.

Motivation

Goals

Knowing what we expect from a pipeline will help us improve its overall quality in several ways.

Describe state

Our current implementation of the pipeline-herder monitors the job executions, compares the output to known failures, and retries the jobs if needed. This works well for known issues but does not protect us from unknown failures.

With a testing plan we can describe the expected outcome for a given checkout: certain build and test results.

Knowing what the state of the pipeline should be, we can evaluate the results and identify data that is missing because of incomplete runs.

For example, if a given pipeline needs to have build results for architectures A, B and C, and one of them is missing, we can retry the corresponding job to fulfill the expected plan.
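As a rough illustration, such a completeness check could look like the sketch below. The planned architecture set, helper names and the retry hook are all hypothetical, not part of any existing tool:

```python
# Hypothetical plan for one checkout: the architectures we expect builds for.
PLANNED_ARCHES = {"aarch64", "ppc64le", "s390x", "x86_64"}


def missing_builds(reported_builds):
    """Architectures the plan expects but DataWarehouse has no build for yet."""
    reported = {build["architecture"] for build in reported_builds}
    return PLANNED_ARCHES - reported


def enforce_plan(reported_builds, retry_job):
    # retry_job stands in for whatever mechanism re-triggers a pipeline job.
    for arch in sorted(missing_builds(reported_builds)):
        retry_job(arch)
```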

Knowing when it finished

If we know what we are expecting, it is easy to tell when a pipeline has finished.

At the beginning, all the objects are submitted without a result; once the builds and tests are updated in DataWarehouse, we know what has finished and what is still running.

Approach

Generating the test plan as early as possible is the best way forward: knowing what we are expecting from the very beginning lets us spot missing results sooner.

Given the current architecture of the pipeline and the tools used, the build architectures can be deduced at any point thanks to the pipeline variables, but only after kpet is run (in the setup stage) do we have enough information to know all the tests we are expecting.

In other words:

  • The merge stage should generate the checkout and, if it is valid, the child builds.

  • The setup stage, which runs after the builds succeed, should generate the child tests.

This makes the idea of a “plan” a little weaker, but it is good enough for a first iteration.

The concept of a finished pipeline changes from “all objects have results” to “(checkout invalid) or (all builds have results and any build failed) or (all tests have results)”.
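A minimal sketch of this predicate, assuming the plan objects are dictionaries whose valid/status fields stay unset (None) until DataWarehouse receives a result:

```python
def pipeline_finished(checkout, builds, tests):
    """True when no more results can be expected for this checkout."""
    if checkout.get("valid") is False:
        return True  # invalid checkout: no builds or tests will ever run
    builds_done = bool(builds) and all(b.get("valid") is not None for b in builds)
    if builds_done and any(b["valid"] is False for b in builds):
        return True  # every build reported and at least one failed, so tests are skipped
    tests_done = bool(tests) and all(t.get("status") is not None for t in tests)
    return tests_done
```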

Requirements

Reproducible Build IDs

In order to be able to update the KCIDB objects in DataWarehouse, we need to guarantee that the IDs are known and reproducible.

This means that in the merge job we already know which ID each build is going to have, and retried jobs should generate the same IDs.

To get reproducible build IDs, a structure similar to redhat:{pipeline_id}_{architecture}(_debug)? is good enough, as we build each architecture only once, plus an optional debug variant that can be appended to the architecture name.
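A sketch of such an ID scheme; the prefix and separators follow the structure above, while the function name and example values are illustrative:

```python
def build_id(pipeline_id, architecture, debug=False):
    """Reproducible build ID: the same inputs always yield the same ID."""
    suffix = "_debug" if debug else ""
    return f"redhat:{pipeline_id}_{architecture}{suffix}"


# build_id(123456, "x86_64")             -> "redhat:123456_x86_64"
# build_id(123456, "x86_64", debug=True) -> "redhat:123456_x86_64_debug"
```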

Reproducible Test IDs

For tests it’s not so simple, as some of them are run more than once for each architecture and a {build_id}_{test_path} approach will generate colliding objects.

Unique IDs for the tasks inside the recipesets are necessary. These values can be simple counters, and they do not need to be reproducible between different kpet calls: as the Beaker XML is generated in the setup stage, future test job retries run with the same plan and therefore generate the same reproducible IDs.
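For illustration, a per-build counter could be applied like this (helper name and example paths are hypothetical):

```python
def test_ids(build_id, test_paths):
    """Reproducible test IDs: a simple counter avoids collisions for repeated paths."""
    return [f"{build_id}_{index}" for index, _path in enumerate(test_paths, start=1)]


# test_ids("redhat:123456_x86_64", ["memory/oom", "memory/oom", "net/tcp"])
# -> ["redhat:123456_x86_64_1", "redhat:123456_x86_64_2", "redhat:123456_x86_64_3"]
```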

Chained objects after failure

When an object that has child objects fails, those dependent objects should be marked as not run.

With the split approach (some objects created in the merge stage, others in the setup stage), we can simply avoid creating child objects if the parent failed.
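A sketch of that behaviour in the setup stage, assuming a hypothetical tests_for_build() helper that expands a successful build into its planned test objects:

```python
def planned_tests(builds, tests_for_build):
    """Only expand successful builds into planned test objects."""
    plan = []
    for build in builds:
        if build.get("valid") is not True:
            continue  # failed parent: its child tests are never created
        plan.extend(tests_for_build(build["id"]))
    return plan
```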

Drawbacks

Reproducible IDs for non reproducible objects

While two consecutive runs of the same build job would generate the same ID, the resulting binaries (and results) might not be the same, as the builds themselves are not reproducible.

This is even more obvious for tests, as the same test will probably be executed in a different environment each time it is run.

For this reason, valid results (like a test failing because of flakiness or a transient issue) are going to be overwritten by the retries, effectively erasing data about failures from DataWarehouse.