Secure Compile Environment
This is a work in progress.
Network
For production, the VPC is configured with the 10.10.0.0/16 CIDR block and is split into two subnets per availability zone (AZ):
- gitlab-runner: 10.10.{0,1,2}.0/24
- workers: 10.10.{0,4,8}.0/22
All subnets have internet access (`igw`) and an S3 endpoint.
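The actual definitions live in `vpc.yml`; purely to illustrate the layout described above, a boto3 sketch for a single AZ might look like the following (region, AZ and resource names here are assumptions, not taken from the playbooks):

```python
# Illustration only: the real deployment is driven by vpc.yml, not by this
# script. Region, AZ and resource names are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

vpc_id = ec2.create_vpc(CidrBlock="10.10.0.0/16")["Vpc"]["VpcId"]

# One runner subnet and one worker subnet for the first AZ, matching the
# CIDR layout listed above.
ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.10.0.0/24",
                  AvailabilityZone="us-east-1a")
ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.10.4.0/22",
                  AvailabilityZone="us-east-1a")

# Internet gateway for outbound access plus a gateway endpoint for S3.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
ec2.create_vpc_endpoint(VpcId=vpc_id,
                        ServiceName="com.amazonaws.us-east-1.s3")
```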
The runner network is only accessible from Red Hat IPs.
The runner itself has a public IP address and can be reached via SSH.
The workers are only accessible from the runner network.
These properties are configured in `vpc.yml` and `security_groups.yml`.
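As a rough sketch of what these rules express (again only an illustration; `security_groups.yml` is authoritative, and the CIDR below is a placeholder rather than a real Red Hat range):

```python
# Sketch of the access rules described above. The VPC ID and the
# "Red Hat" CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2")
vpc_id = "vpc-0123456789abcdef0"  # placeholder

runner_sg = ec2.create_security_group(
    GroupName="gitlab-runner", Description="runner network",
    VpcId=vpc_id)["GroupId"]
worker_sg = ec2.create_security_group(
    GroupName="workers", Description="worker network",
    VpcId=vpc_id)["GroupId"]

# SSH to the runner only from (placeholder) Red Hat IP ranges.
ec2.authorize_security_group_ingress(
    GroupId=runner_sg,
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
                    "IpRanges": [{"CidrIp": "203.0.113.0/24"}]}])

# Workers only accept traffic coming from the runner security group.
ec2.authorize_security_group_ingress(
    GroupId=worker_sg,
    IpPermissions=[{"IpProtocol": "-1",
                    "UserIdGroupPairs": [{"GroupId": runner_sg}]}])
```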
Eventually, more subnets might be needed with the following properties:
- runner network: internet access, SSH access to workers
- trusted worker network: internet access
- untrusted worker network: no internet access
Policy
Five different policies are defined that map to four different roles (`roles/cki-iam/*`):
- git-cache-update-worker:
  - write access to the git-cache S3 bucket (`update-git-cache`)
- merge-and-build-worker:
  - write access to the artifacts S3 bucket (`update-artifacts`)
  - write access to the runner-cache S3 bucket (`update-runner-cache`)
- runner:
  - spawn worker VMs (`manage-instances`)
- test-worker:
  - spawn test VMs (`test-boot`)
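To illustrate what one of these policies grants, here is a hedged boto3 sketch of a write policy for a git-cache bucket, roughly corresponding to update-git-cache; the bucket name, account ID and action list are assumptions, the real definitions live under `roles/cki-iam/`:

```python
# Illustration only: bucket name, account ID and the action list are
# placeholders, not the actual policy content.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::example-git-cache",
                     "arn:aws:s3:::example-git-cache/*"],
    }],
}

iam.create_policy(PolicyName="update-git-cache",
                  PolicyDocument=json.dumps(policy_document))
iam.attach_role_policy(
    RoleName="git-cache-update-worker",
    PolicyArn="arn:aws:iam::123456789012:policy/update-git-cache")
```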
Accounting and Configuration
In `vars.yml`, the `cki_tags` variables are used for accounting. The `ArrCkiEnvironment` field is set depending on the user or overridden on the command line, which allows deploying different environments into the same AWS account.
GitLab Runner
The GitLab runner is based on CentOS 7 and runs in the same AZ as its workers.
During deployment, the public SSH keys of the gitlab.com users listed in `GITLAB_COM_SSH_ACCOUNTS` are downloaded from gitlab.com and written to the `authorized_keys` file of the root user.
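gitlab.com serves each user's public keys at `https://gitlab.com/<username>.keys`, so this step boils down to something like the following sketch (the account list is a placeholder):

```python
# Sketch of the key provisioning step; the account list is a placeholder,
# the target file follows the description above.
import urllib.request

GITLAB_COM_SSH_ACCOUNTS = ["example-user1", "example-user2"]  # placeholder

with open("/root/.ssh/authorized_keys", "a", encoding="utf-8") as authorized:
    for account in GITLAB_COM_SSH_ACCOUNTS:
        with urllib.request.urlopen(f"https://gitlab.com/{account}.keys") as response:
            keys = response.read().decode()
        authorized.write(keys if keys.endswith("\n") else keys + "\n")
```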
The runner is configured to:
- have the official `gitlab-runner` RPM repository
- install `dnf-automatic`: automatic updates of the base OS
- install `gitlab-runner`: the runner itself
- install `docker`: needed to spawn any workers
- install `docker-machine` from source: needed to spawn VMs for the workers
- install `python-boto3`: needed to create AWS EC2 keypairs
- install `python3`: needed to configure the GitLab coordinator with the runners
- install `python-gitlab` with pip3: needed to configure the GitLab coordinator with the runners
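The coordinator side is handled with python-gitlab; a minimal sketch of registering one runner with a registration token (the token, description and tags are placeholders, and the actual provisioning script may differ) could look like this:

```python
# Sketch only: registers a runner with the GitLab coordinator.
# Token, description and tags are placeholders.
import gitlab

gl = gitlab.Gitlab("https://gitlab.com")

runner = gl.runners.create({
    "token": "REGISTRATION-TOKEN",            # placeholder registration token
    "description": "secure-compile-runner",   # placeholder description
    "tag_list": ["aws", "general-worker"],    # placeholder tags
})
# The runner token returned by the registration is what ends up in the
# runner's config.toml.
print(runner.id)
```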
The private key for worker access is generated on AWS. As it can only be downloaded once, it is written to `/etc/gitlab-runner/worker-key` for use by `docker-machine`.
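A hedged boto3 sketch of that step (the key pair name is illustrative):

```python
# Sketch of the key pair handling described above: the private key material
# is only returned at creation time, so it is saved immediately.
import os
import boto3

ec2 = boto3.client("ec2")

key = ec2.create_key_pair(KeyName="worker-key")  # placeholder key name
path = "/etc/gitlab-runner/worker-key"

with open(path, "w", encoding="utf-8") as key_file:
    key_file.write(key["KeyMaterial"])
os.chmod(path, 0o600)  # keep the private key readable by root only
```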
The workers are spawned with the configuration from `roles/gitlab-runner/vars/main.yml` and `roles/gitlab-runner/templates/config.toml`.
Workers
Several different types of workers are used:
- general-worker:
- t3a.medium spot instance (4GiB, 2 vCPUs for 4:48)
- 50GB hard disk
- ramdisk at /ramdisk
- git-cache-update-worker:
- t3a/t3/t2.medium spot instance (4GiB, 2 vCPUs)
- 10GB hard disk
- ramdisk at /ramdisk
- only one concurrent worker allowed
- merge-and-build-worker:
- c5d.4xlarge spot instance (32GiB, 16 vCPUs, 400GiB NVMe SSD)
- ramdisk at /ramdisk
- attached SSD at /var/lib/docker
- test-worker:
- t3a.micro instance (1GiB, 2 vCPUs for 2:24)
- 20GB hard disk
- ramdisk at /ramdisk
For a description of the instance types, see EC2instances.info.
Git cache and associated runner configuration
The Git cache is stored in S3 and is used for two purposes:
- to reduce time and network traffic needed to clone repositories
- to provide a fallback in case the repositories are down
The first point is especially important for the kernel repositories, which are around 2 GB each.
The git-cache-update GitLab runner is run on a schedule to keep the repositories up-to-date. Repositories are stored as uncompressed tarballs, as Git repositories can only be compressed by about 10%. Next to each tarball, the MD5 sum of the Git reference list as obtained by `git ls-remote` is stored. This is used to determine whether it is actually necessary to update the cache for a repository. If an update is necessary, the existing tarball is streamed from S3 into `tar` and extracted on the hard disk.
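A simplified Python sketch of this freshness check and the streaming extraction (bucket, repository URL and paths are placeholders, and uploading the refreshed tarball is omitted):

```python
# Simplified sketch of the cache check described above; bucket, keys,
# repository URL and target directory are placeholders.
import hashlib
import subprocess
import boto3

s3 = boto3.client("s3")
BUCKET = "example-git-cache"                    # placeholder bucket
REPO = "https://gitlab.com/example/kernel.git"  # placeholder repository
NAME = "kernel"

# MD5 over the current reference list as returned by git ls-remote.
refs = subprocess.run(["git", "ls-remote", REPO],
                      check=True, capture_output=True).stdout
current = hashlib.md5(refs).hexdigest()

# Checksum stored next to the tarball in S3.
stored = s3.get_object(Bucket=BUCKET,
                       Key=f"{NAME}.md5")["Body"].read().decode().strip()

if stored != current:
    # Stream the existing tarball from S3 straight into tar.
    body = s3.get_object(Bucket=BUCKET, Key=f"{NAME}.tar")["Body"]
    with subprocess.Popen(["tar", "-x", "-f", "-", "-C", "/var/cache/git"],
                          stdin=subprocess.PIPE) as tar:
        for chunk in body.iter_chunks():
            tar.stdin.write(chunk)
        tar.stdin.close()
    # ... followed by a fetch in the extracted repository and an upload of
    # the new tarball and checksum (omitted here).
```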
Storing the repositories on the RAM disk doesn't increase processing speed, and will result in memory errors as there is nothing to swap out. As the repositories are about 2 GiB, instances need at least 4 GiB of RAM+swap. The cheapest instance types on Amazon EC2 that fulfill that requirement purely from RAM are `t3a.medium`, `t3.medium` and `t2.medium`, which cost about $0.0125/hour for a spot instance.
Pulling and extracting a kernel repository tarball on a `t3.medium` instance in this way takes about 10 seconds. Cloning a kernel repository takes about 7 minutes, which seems to be mostly CPU-bound (90% CPU utilization). Pushing it again after updating takes about 20 seconds. Just checking all repositories without any updates needed takes about 30 seconds. It still takes about 2 minutes to spin up an instance on Amazon EC2. Moving to a 2 GiB RAM instance type like `t3.small` with added swap increases the duration of the jobs by about 5%, but reduces the price per hour by 50%. The git-cache-update runner has a limit of one worker to prevent any interference between concurrent update jobs. The idle time is set to zero to shut down the worker as soon as the job is finished.