CKI-008: Manipulating KCIDB data in the pipeline

Deprecating the rc file in favor of using KCIDB data in the pipeline


The pipeline uses the rc file to store data and pass it between pipeline stages. Since pipeline data must ultimately be submitted to Datawarehouse in KCIDB format, it would be simpler to use KCIDB data throughout the pipeline, rather than having to convert it.

Data must be dumped throughout a job because:

  1. Partial data is desired if a job fails midway
  2. Some data cannot be collected at the end of a job, such as how long a single command like make takes to run

For this reason, data cannot all be collected at the end of a stage by a Python script; a CLI tool is needed to manipulate the data incrementally throughout the job.

Complex edits

The CLI must support more than just simple getting and setting like the rc file, since JSON is being manipulated. Either an existing JSON tool could be used for manipulating data, or a custom tool could be written.

Use jq

One approach would be to use a tool like jq. For example, an output file could be added to a build by running

jq '.builds[0].output_files += [{"file": "url"}]' kcidb_data.json > kcidb_data.json.tmp
mv kcidb_data.json.tmp kcidb_data.json

This approach would require no added development effort to start using in the pipeline.

Custom CLI

A second approach would be to implement a CLI customized to the exact operations the pipeline must perform. Using the same example as above, an output file could be added by:

  1. Adding a KCIDBFile wrapper in cki-lib to open and parse a file
  2. Adding a tool in cki-tools. This tool could have a command such as build add-output-file url file, which would then run kcidb_file['builds'][0]['output_files'].append({'file': 'url'}).
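A minimal sketch of what the wrapper and command might look like. The KCIDBFile name comes from this document, but the method names, the empty-file skeleton, and the shape of the appended object are assumptions, not the actual cki-lib or cki-tools implementation:

```python
import json


class KCIDBFile:
    """Hypothetical wrapper around a KCIDB data file (sketch)."""

    def __init__(self, path):
        self.path = path
        try:
            with open(path) as handle:
                self.data = json.load(handle)
        except FileNotFoundError:
            # Start from an empty skeleton; the version value is illustrative.
            self.data = {'version': {'major': 4, 'minor': 0},
                         'checkouts': [], 'builds': []}

    def save(self):
        with open(self.path, 'w') as handle:
            json.dump(self.data, handle, indent=2)


def build_add_output_file(kcidb_file, url, file_name):
    """Sketch of the `build add-output-file url file` command."""
    # Mirrors the {"file": "url"} shape from the example above; the real
    # KCIDB output_files schema may differ.
    build = kcidb_file.data['builds'][0]
    build.setdefault('output_files', []).append({file_name: url})
```

Because the command only exposes one narrow operation, its assumptions about the schema live in one place and can be covered by unit tests.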

Adding very specific commands to the tool would ensure that schema changes could be properly tested. Rather than hiding complex assumptions about KCIDB in the pipeline, they could be validated with kcidb-io.

This approach would allow unit testing more complex operations, and it would make validation using kcidb-io much simpler. It would potentially require frequent modifications to the CLI interface whenever different functionality was desired.

The list problem

KCIDB stores data about checkouts and builds in lists, so these objects must be filtered by ID. Pipeline jobs, however, can only discover IDs from the KCIDB file itself, so filtering by ID would be circular and IDs cannot be used.

It is not possible for every pipeline job to generate the same ID. Because jobs can be retried, the build job can run multiple times, creating multiple build IDs. The publish job only knows the correct build_id from what it finds in the KCIDB file, so it has no way to generate the correct ID independently.

For this reason, either the pipeline must assume KCIDB lists contain single objects, or the pipeline could split KCIDB build and checkout objects into individual files.

In practice, the approaches would be very similar. The formats would store equivalent data; for example, a single build can be converted between formats easily using build_single = kcidb_data['builds'][0] and kcidb_data = {'version': ..., 'builds': [build_single]}. The only difference would be style and wrapper code.
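The equivalence between the two formats can be shown directly with the conversion from the paragraph above (the field contents are illustrative placeholders):

```python
# Full KCIDB document whose builds list holds a single object
kcidb_data = {
    'version': {'major': 4, 'minor': 0},
    'builds': [{'id': 'redhat:123', 'checkout_id': 'redhat:1'}],
}

# Extract the single build (the "split into individual files" view) ...
build_single = kcidb_data['builds'][0]

# ... and wrap it back into a full KCIDB document
rebuilt = {'version': kcidb_data['version'], 'builds': [build_single]}

# The round trip loses nothing
assert rebuilt == kcidb_data
```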

An alternative would be to add a schema that enforced len(builds) <= 1 and len(checkouts) <= 1.
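Such a constraint could be a small check layered on top of the normal kcidb-io validation. A sketch, with a made-up function name:

```python
def validate_single_object_lists(kcidb_data):
    """Reject KCIDB data whose checkout or build lists hold more than one object.

    Hypothetical helper enforcing len(builds) <= 1 and len(checkouts) <= 1;
    it would run alongside, not replace, kcidb-io schema validation.
    """
    for key in ('checkouts', 'builds'):
        count = len(kcidb_data.get(key, []))
        if count > 1:
            raise ValueError(
                f'expected at most one object in {key!r}, found {count}')


# A single build passes; two builds would raise ValueError.
validate_single_object_lists({'builds': [{'id': 'redhat:123'}]})
```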