27 May 2019, 00:00

CKI pipeline under the hood part 2: Don't reinvent the wheel

Now that we know what to test, we need to figure out how to do it. To reiterate, the base stages we need are:

  • Apply the patch(es) to the git tree
  • Build the kernel
  • Test the kernel
  • Profit (send the report)

As we discussed in the previous post, not all steps are relevant to all types of testing.

When the project started, we used Jenkins to execute the stages. We’ve since moved away from it as we didn’t find it flexible enough for our needs. We often ran into errors with plugins and general maintenance, deployments took longer than we wanted and, let’s be honest, writing and debugging Groovy is no fun either. We spent way too much time maintaining the setup and didn’t have time to actually work on new features.

We wanted a simple, straightforward pipeline that would allow us to easily modify and deploy the runs and wasn’t a pain to maintain. And we found it.

CKI ❤️ GitLab CI

The CI is already embedded in GitLab and we don’t need to do anything special to set it up. Just add a .gitlab-ci.yml file with any stages and bash commands you need. No need to sign up for a service, everything works out of the box. And as you can see in our pipeline definition, you can do really crazy things in the YAML definitions. But I’m getting ahead of myself, so let’s start from the beginning.

Pipelines are easily adjustable. The pipeline triggers pass environment variables describing the setup, and the pipeline can check these variables. Oh, kernel tree X has a cross compile bug? Add an if comparing the tree name and patch it! Don’t need a merge stage since we get a URL to an already built kernel? Don’t specify the stage in the tree description.
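The real pipeline does this with bash inside the YAML definitions. As a rough Python sketch of the same idea, the function below reads the trigger variables and decides which stages run and which workarounds apply. The variable names (TREE_NAME, STAGES), the tree name and the workaround label are all hypothetical and only serve the illustration.

```python
def plan_run(env):
    """Decide which stages run and which workarounds apply, based on trigger variables."""
    # STAGES and TREE_NAME are hypothetical variable names, not the real CKI ones.
    stages = env.get("STAGES", "merge build test").split()
    workarounds = []
    # Hypothetical tree-specific fix: applied only when the tree name matches
    # and the build stage is actually part of this run.
    if env.get("TREE_NAME") == "example-tree" and "build" in stages:
        workarounds.append("cross-compile-fix")
    return stages, workarounds


# A trigger that hands over an already built kernel simply leaves out the earlier stages.
print(plan_run({"TREE_NAME": "example-tree", "STAGES": "test"}))
print(plan_run({"TREE_NAME": "example-tree"}))
```

The point of the design is that nothing tree-specific has to be hardcoded into the stage list: the trigger describes the setup and the pipeline reacts to it.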

Pipelines are easy to deploy. Write the code, open a merge request to test it, click ‘merge’ aaaand done! No downtime and new pipelines automatically pick up the new fixes (and bugs…).

These were the main selling points for us, but other great features include automatic retries for infrastructure issues and storage of artifacts that can be used as a yum/dnf repo and are automatically cleaned up X days after the job finishes.

How do we tame the beast?

If you opened the pipeline definition repo already, you likely noticed there’s no .gitlab-ci.yml file. That’s because this repo just serves as storage for the code to be executed and the actual testing pipelines don’t run here.

There are separate repos set up internally as they need access to internal test infrastructure. Each repository contains multiple branches for the tested trees and all they contain is a .gitlab-ci.yml file with the following contents:

include:
  - https://gitlab.com/cki-project/pipeline-definition/raw/master/cki_pipeline.yml
  - https://gitlab.com/cki-project/pipeline-definition/raw/master/trees/<TREE_NAME>.yml

The <TREE_NAME> in the example is, of course, replaced by the name of a specific file from the trees directory.

These two files are expanded before the pipeline starts and define what stages are actually executed and how. This expansion guarantees consistent behavior across retries and ensures smooth completion, as newly committed changes have no effect on already running pipelines. If you want to use a different pipeline definition, you have to trigger a new pipeline – which might be a bit painful if you have just fixed a bug and need to re-run all the affected pipelines – but the benefit of reproducible runs outweighs this occasional annoyance.

The devil is in the details

That’s all for the high-level description! What we have not discussed yet is the actual code used for the stages. These are just bash snippets for grabbing the patches to apply, cloning the git tree, compiling the kernel… nothing fancy to look at or even worth describing. What is worth describing, however, are the different workarounds and tricks embedded in the pipeline.

  • Network hiccups or upstream server issues – a problem we are all familiar with. Our downloads are retried a few times by default.

  • Avoiding reporting infrastructure issues – an obvious point related to the first one. Any part of the surrounding code can fail, not only the make call itself. We explicitly record stage results right after the important calls, in a file that’s saved as an artifact, so that failures in the surrounding code are not reported. However, make can fail because of infrastructure issues too. Detecting this automatically is not easy, but we try to minimize these cases as much as possible (e.g. by ensuring we have enough free space for the build).

  • Speeding up builds with ccache – throwing more CPUs at kernel builds doesn’t scale infinitely. There’s no reason to rebuild unmodified code all over again, so why shouldn’t we reuse the previous build results? Thanks to this, we can get tarballs in ~5-10 minutes and RPMs in ~30 minutes (note: we don’t build debug packages).

  • Avoiding broken machines – sometimes the machines used for testing in Beaker don’t behave as they should. We use Beaker metrics to filter out machines with a high percentage of aborted jobs and allow up to 3 job executions on top of that.

  • Filtering out test issues – some tests need to clone a git tree, download some packages or expect enough disk space to run. Failures in these steps don’t mean the tested kernel is bad. The tests are responsible for verifying that their requirements are met, checking return codes and aborting in case of unexpected results. Check out our test repo for more information!

  • Only running tests relevant to tested patches – we use kpet to analyze the patches and map them to existing tests. If the patch in question changed the networking stack, there’s no need to spend time running storage tests. This both shortens the test time and minimizes the number of false positives, as the patch couldn’t possibly introduce a bug in an unrelated subsystem and thus shouldn’t be blocked. For base builds, a standard set of tier 1 tests is executed to ensure all applied patches play nicely together.
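To make the first two workarounds from the list a bit more concrete, here is a minimal Python sketch (the pipeline itself uses bash). It retries a download a few times and writes the stage result to a file only based on the important call, so a failure in the surrounding code is marked as an infrastructure error instead of a kernel failure. The function names, the result labels and the JSON file format are assumptions made for this example, not the real CKI code.

```python
import json
import os
import tempfile
import time


def retry(action, attempts=3, delay=0.0):
    """Call action() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            # OSError (including ConnectionError) stands in for a transient hiccup.
            if attempt == attempts:
                raise
            time.sleep(delay)


def run_stage(fetch, build, result_path):
    """Run a stage and record the outcome of the important call only."""
    result = {"build": "not_run"}
    try:
        source = retry(fetch)
        # Record the result right after the important call itself.
        result["build"] = "pass" if build(source) == 0 else "fail"
    except OSError:
        # The surrounding code failed: infrastructure issue, not a kernel bug.
        result["build"] = "infra_error"
    # This file plays the role of the artifact a reporter would read later.
    with open(result_path, "w") as handle:
        json.dump(result, handle)
    return result


calls = {"count": 0}


def flaky_fetch():
    """Simulated download that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network hiccup")
    return "kernel-source"


path = os.path.join(tempfile.mkdtemp(), "result.json")
print(run_stage(flaky_fetch, lambda source: 0, path))
```

Only the "fail" label would be worth reporting to a patch author here; "infra_error" is something for the CI maintainers to look at.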

Wait, where’s the reporting part?

The reporter is set up as a webhook triggered on pipeline completion. It interprets the data saved in the pipeline, pushes it into a template and sends the email. We’ve learned the hard way that integrating reporting into the pipeline itself is a recipe for trouble – it would need to be an extra stage at the end, which means no earlier stage could fail or error out if we wanted the reporting stage to run. Each stage would have to start with error/failure checks, and this still doesn’t account for infrastructure issues which make the pipeline crash and abort, leading to lost reports. This is a limitation of the generic pipeline design, not of GitLab CI.
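The "interpret data, fill a template, send the email" flow could look roughly like the Python sketch below. It only builds the message; receiving the webhook and actually sending the email are left out. The template text, the result labels and the function name are made up for the example and don’t reflect the real reporter.

```python
from email.message import EmailMessage
from string import Template

# Hypothetical report template; the real one is certainly more detailed.
REPORT_TEMPLATE = Template(
    "Pipeline $pipeline_id finished with overall result: $overall\n\n$details\n"
)


def build_report(pipeline_id, stage_results, recipient):
    """Turn saved stage results into an email message (without sending it)."""
    # The run counts as failed as soon as any stage recorded a real failure.
    overall = "FAILED" if "fail" in stage_results.values() else "PASSED"
    details = "\n".join(
        f"{stage}: {result}" for stage, result in stage_results.items()
    )
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = f"Test report for pipeline {pipeline_id}: {overall}"
    message.set_content(
        REPORT_TEMPLATE.substitute(
            pipeline_id=pipeline_id, overall=overall, details=details
        )
    )
    return message


report = build_report(
    42, {"merge": "pass", "build": "pass", "test": "fail"}, "dev@example.com"
)
print(report["Subject"])
print(report.get_content())
```

Because this runs outside the pipeline, it still gets a chance to produce a report when a stage fails or the pipeline itself aborts.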

Want to talk more about the pipeline design?

Come to the CKI hackfest after the Plumbers conference this year! The hackfest is planned for September 12-13 in Lisbon and anyone interested in CI is welcome to join us. Check out the invite and contact us if you have any questions or want to sign up!