# CKI-008: Manipulating KCIDB data in the pipeline

## Problem
The pipeline uses the rc file to store data and pass it between pipeline stages. Since pipeline data must ultimately be submitted to Datawarehouse in KCIDB format, it would be simpler to use KCIDB data throughout the pipeline, rather than having to convert it.
Data must be dumped throughout a job because:

- Partial data is desired if a job fails midway
- Some data cannot be collected at the end of a job, such as the time for a single command like `make` to run
For this reason, data cannot all be collected at the end of a stage by a Python script; a CLI tool is needed to allow manipulating data.

## Complex edits
The CLI must support more than just simple getting and setting like the rc file, since JSON is being manipulated. Either an existing JSON tool could be used for manipulating data, or a custom tool could be written.
### Use jq

One approach would be to use a tool like `jq`. For example, an output file could be added to a build by running:

```shell
jq '.builds[0].output_files += [{"file": "url"}]' kcidb_data.json > kcidb_data.json.tmp
mv kcidb_data.json.tmp kcidb_data.json
```
This approach would take no added effort to start using in the pipeline.
### Custom CLI
A second approach would be to implement a CLI customized to the exact operations the pipeline must perform. Using the same example as above, an output file could be added by:
- Adding a `KCIDBFile` wrapper in `cki-lib` to open and parse a file
- Adding a tool `cki_edit.py` in `cki-tools`. This tool could have a command `cki_edit.py build add-output-file url file` which would then run `kcidb_file['builds'][0]['output_files'].append({'file': 'url'})`.
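The proposed wrapper and command could look roughly like the following sketch. All names here (`KCIDBFile`, `add_output_file`, the assumed schema version) are hypothetical illustrations of the idea, not an existing API:

```python
import json

KCIDB_VERSION = {'major': 4, 'minor': 0}  # assumed; use whatever schema version the pipeline targets


class KCIDBFile:
    """Hypothetical cki-lib wrapper for opening, editing, and saving a KCIDB file."""

    def __init__(self, path):
        self.path = path
        try:
            with open(path) as handle:
                self.data = json.load(handle)
        except FileNotFoundError:
            # Start a fresh document when the job has not dumped any data yet
            self.data = {'version': KCIDB_VERSION}

    def save(self):
        with open(self.path, 'w') as handle:
            json.dump(self.data, handle, indent=2)


def add_output_file(kcidb_file, url, file):
    """What `cki_edit.py build add-output-file url file` could run internally."""
    # Assumes the single-build convention discussed under "The list problem"
    build = kcidb_file.data.setdefault('builds', [{}])[0]
    build.setdefault('output_files', []).append({file: url})
```

Each subcommand would wrap one such small, testable edit, keeping the KCIDB structure knowledge out of the pipeline YAML.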
Adding very specific commands to the tool would ensure that schema changes could be properly tested. Rather than hiding complex assumptions about KCIDB in the pipeline, they could be validated with `kcidb-io`.
This approach would allow unit testing more complex operations, and it would make validation using `kcidb-io` much simpler. It would potentially require frequent modifications to the `cki_edit.py` CLI interface whenever different functionality was desired.
## The list problem
KCIDB stores data about checkouts and builds in lists, so these objects must be filtered by ID. Pipeline jobs, however, can only discover IDs from the KCIDB file itself, so using IDs for filtering would be circular, and IDs cannot be used.
It is not possible for every pipeline job to generate the same ID. Because jobs can be retried, the build job can be run multiple times, creating multiple `build_id`s. The publish job only knows the correct `build_id` based on what it finds in the KCIDB file, so it has no way to generate the correct ID.
For this reason, either the pipeline must assume KCIDB lists contain single objects, or the pipeline could split KCIDB build and checkout objects into individual files.
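Under the single-object assumption, the publish job's ID discovery reduces to reading the one build out of the file. A minimal sketch (the function name is hypothetical):

```python
import json


def find_build_id(path):
    """Discover the build ID by reading the KCIDB file produced earlier.

    The publish job cannot regenerate the ID itself (retries would have
    produced several candidates), so the file is the source of truth.
    """
    with open(path) as handle:
        data = json.load(handle)
    builds = data.get('builds', [])
    # Single-object assumption: at most one build in the list
    return builds[0]['id'] if builds else None
```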
In practice, the approaches would be very similar. The formats would store equivalent data; for example, a single build can be converted between formats easily using `build_single = kcidb_data['builds'][0]` and `kcidb_data = {'version': ..., 'builds': [build_single]}`. The only difference would be style and wrapper code.
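The two one-liners above amount to a trivial pair of helpers; a sketch, assuming a fixed schema version for the wrapping direction:

```python
KCIDB_VERSION = {'major': 4, 'minor': 0}  # assumed; use the schema version the pipeline targets


def to_single(kcidb_data):
    # Full KCIDB document -> single build object
    return kcidb_data['builds'][0]


def to_kcidb(build_single):
    # Single build object -> full KCIDB document
    return {'version': KCIDB_VERSION, 'builds': [build_single]}
```

The round trip loses nothing, which is why the choice between the two formats is purely one of style.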
An alternative would be to add a schema that enforced `len(builds) <= 1 and len(checkouts) <= 1`.
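Such a constraint could sit as a small check on top of the regular KCIDB schema validation; a sketch (the function name is hypothetical):

```python
def check_single_object(kcidb_data):
    """Enforce the extra constraint: at most one checkout and at most one build."""
    for key in ('checkouts', 'builds'):
        if len(kcidb_data.get(key, [])) > 1:
            raise ValueError(f'expected at most one object in {key!r}')
```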