Secure Compile Environment
This is a work in progress.
Network
For production, the VPC is configured with the 10.10.0.0/16 CIDR block and is split into two subnets per availability zone (AZ):
- gitlab-runner: 10.10.{0,1,2}.0/24
- workers: 10.10.{0,4,8}.0/22
All subnets have internet access (`igw`) and an S3 endpoint.
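The actual definitions live in `vpc.yml`; purely to illustrate the layout described above, a boto3 sketch for a single AZ might look like the following (region, AZ and resource names here are assumptions, not taken from the playbooks):

```python
# Illustration only: the real deployment is driven by vpc.yml, not by this
# script. Region, AZ and resource names are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

vpc_id = ec2.create_vpc(CidrBlock="10.10.0.0/16")["Vpc"]["VpcId"]

# One runner subnet and one worker subnet for the first AZ, matching the
# CIDR layout listed above.
ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.10.0.0/24",
                  AvailabilityZone="us-east-1a")
ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.10.4.0/22",
                  AvailabilityZone="us-east-1a")

# Internet gateway for outbound access plus a gateway endpoint for S3.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
ec2.create_vpc_endpoint(VpcId=vpc_id,
                        ServiceName="com.amazonaws.us-east-1.s3")
```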
The runner network is only accessible from Red Hat IPs.
The runner itself has a public IP address and can be reached via SSH.
The workers are only accessible from the runner network.
These properties are configured in `vpc.yml` and `security_groups.yml`.
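As a rough sketch of what these rules express (again only an illustration; `security_groups.yml` is authoritative, and the CIDR below is a placeholder rather than a real Red Hat range):

```python
# Sketch of the access rules described above. The VPC ID and the
# "Red Hat" CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2")
vpc_id = "vpc-0123456789abcdef0"  # placeholder

runner_sg = ec2.create_security_group(
    GroupName="gitlab-runner", Description="runner network",
    VpcId=vpc_id)["GroupId"]
worker_sg = ec2.create_security_group(
    GroupName="workers", Description="worker network",
    VpcId=vpc_id)["GroupId"]

# SSH to the runner only from (placeholder) Red Hat IP ranges.
ec2.authorize_security_group_ingress(
    GroupId=runner_sg,
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
                    "IpRanges": [{"CidrIp": "203.0.113.0/24"}]}])

# Workers only accept traffic coming from the runner security group.
ec2.authorize_security_group_ingress(
    GroupId=worker_sg,
    IpPermissions=[{"IpProtocol": "-1",
                    "UserIdGroupPairs": [{"GroupId": runner_sg}]}])
```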
Eventually, more subnets might be needed with the following properties:
- runner network: internet access, SSH access to workers
- trusted worker network: internet access
- untrusted worker network: no internet access
Policy
Five different policies are defined that map to four different roles (`roles/cki-iam/*`):
- git-cache-update-worker:
  - write access to the git-cache S3 bucket (`update-git-cache`)
- merge-and-build-worker:
  - write access to the artifacts S3 bucket (`update-artifacts`)
  - write access to the runner-cache S3 bucket (`update-runner-cache`)
- runner:
  - spawn worker VMs (`manage-instances`)
- test-worker:
  - spawn test VMs (`test-boot`)
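To illustrate what one of these policies grants, here is a hedged boto3 sketch of a write policy for a git-cache bucket, roughly corresponding to update-git-cache; the bucket name, account ID and action list are assumptions, the real definitions live under `roles/cki-iam/`:

```python
# Illustration only: bucket name, account ID and the action list are
# placeholders, not the actual policy content.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::example-git-cache",
                     "arn:aws:s3:::example-git-cache/*"],
    }],
}

iam.create_policy(PolicyName="update-git-cache",
                  PolicyDocument=json.dumps(policy_document))
iam.attach_role_policy(
    RoleName="git-cache-update-worker",
    PolicyArn="arn:aws:iam::123456789012:policy/update-git-cache")
```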
Accounting and Configuration
In `vars.yml`, the `cki_tags` variables are used for accounting. The `ArrCkiEnvironment` field is set depending on the user or overridden on the command line, which allows deploying different environments into the same AWS account.
GitLab Runner
The GitLab runner is based on CentOS 7 and runs in the same AZ as its workers.
During deployment, the public SSH keys of the gitlab.com users listed in `GITLAB_COM_SSH_ACCOUNTS` are downloaded from gitlab.com and written to the `authorized_keys` file of the root user.
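gitlab.com serves each user's public keys at `https://gitlab.com/<username>.keys`, so this step boils down to something like the following sketch (the account list is a placeholder):

```python
# Sketch of the key provisioning step; the account list is a placeholder,
# the target file follows the description above.
import urllib.request

GITLAB_COM_SSH_ACCOUNTS = ["example-user1", "example-user2"]  # placeholder

with open("/root/.ssh/authorized_keys", "a", encoding="utf-8") as authorized:
    for account in GITLAB_COM_SSH_ACCOUNTS:
        with urllib.request.urlopen(f"https://gitlab.com/{account}.keys") as response:
            keys = response.read().decode()
        authorized.write(keys if keys.endswith("\n") else keys + "\n")
```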
The runner is configured to:
- have the official `gitlab-runner` RPM repository
- install `dnf-automatic`: automatic updates of the base OS
- install `gitlab-runner`: the runner itself
- install `docker`: needed to spawn any workers
- install `docker-machine` from source: needed to spawn VMs for the workers
- install `python-boto3`: needed to create AWS EC2 keypairs
- install `python3`: needed to configure the GitLab coordinator with the runners
- install `python-gitlab` with pip3: needed to configure the GitLab coordinator with the runners
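The coordinator side is handled with python-gitlab; a minimal sketch of registering one runner with a registration token (the token, description and tags are placeholders, and the actual provisioning script may differ) could look like this:

```python
# Sketch only: registers a runner with the GitLab coordinator.
# Token, description and tags are placeholders.
import gitlab

gl = gitlab.Gitlab("https://gitlab.com")

runner = gl.runners.create({
    "token": "REGISTRATION-TOKEN",            # placeholder registration token
    "description": "secure-compile-runner",   # placeholder description
    "tag_list": ["aws", "general-worker"],    # placeholder tags
})
# The runner token returned by the registration is what ends up in the
# runner's config.toml.
print(runner.id)
```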
The private key for worker access is generated on AWS. As it can only be downloaded once, it is written to `/etc/gitlab-runner/worker-key` for use by `docker-machine`.
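A hedged boto3 sketch of that step (the key pair name is illustrative):

```python
# Sketch of the key pair handling described above: the private key material
# is only returned at creation time, so it is saved immediately.
import os
import boto3

ec2 = boto3.client("ec2")

key = ec2.create_key_pair(KeyName="worker-key")  # placeholder key name
path = "/etc/gitlab-runner/worker-key"

with open(path, "w", encoding="utf-8") as key_file:
    key_file.write(key["KeyMaterial"])
os.chmod(path, 0o600)  # keep the private key readable by root only
```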
The workers are spawned with the configuration from `roles/gitlab-runner/vars/main.yml` and `roles/gitlab-runner/templates/config.toml`.
Workers
Several different types of workers are used:
- general-worker:
- t3a.medium spot instance (4GiB, 2 vCPUs for 4:48)
- 50GB hard disk
- ramdisk at /ramdisk
- git-cache-update-worker:
- t3a/t3/t2.medium spot instance (4GiB, 2 vCPUs)
- 10GB hard disk
- ramdisk at /ramdisk
- only one concurrent worker allowed
- merge-and-build-worker:
- c5d.4xlarge spot instance (32GiB, 16 vCPUs, 400GiB NVMe SSD)
- ramdisk at /ramdisk
- attached SSD at /var/lib/docker
- test-worker:
- t3a.micro instance (1GiB, 2 vCPUs for 2:24)
- 20GB hard disk
- ramdisk at /ramdisk
For a description of the instance types, see EC2instances.info.
Git cache and associated runner configuration
The Git cache is stored in S3 and is used for two purposes:
- to reduce time and network traffic needed to clone repositories
- to provide a fallback in case the repositories are down
The first point is especially important for the kernel repositories, which are around 2 GB each.
The git-cache-update GitLab runner is run on a schedule to keep the repositories up-to-date. Repositories are stored as uncompressed tarballs, as Git repositories can only be compressed by about 10%. Next to each tarball, the MD5 sum of the Git reference list as obtained by `git ls-remote` is stored. This is used to determine whether it is actually necessary to update the cache for a repository. If an update is necessary, the existing tarball is streamed from S3 into `tar` and extracted on the hard disk.
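A simplified Python sketch of this freshness check and the streaming extraction (bucket, repository URL and paths are placeholders, and uploading the refreshed tarball is omitted):

```python
# Simplified sketch of the cache check described above; bucket, keys,
# repository URL and target directory are placeholders.
import hashlib
import subprocess
import boto3

s3 = boto3.client("s3")
BUCKET = "example-git-cache"                    # placeholder bucket
REPO = "https://gitlab.com/example/kernel.git"  # placeholder repository
NAME = "kernel"

# MD5 over the current reference list as returned by git ls-remote.
refs = subprocess.run(["git", "ls-remote", REPO],
                      check=True, capture_output=True).stdout
current = hashlib.md5(refs).hexdigest()

# Checksum stored next to the tarball in S3.
stored = s3.get_object(Bucket=BUCKET,
                       Key=f"{NAME}.md5")["Body"].read().decode().strip()

if stored != current:
    # Stream the existing tarball from S3 straight into tar.
    body = s3.get_object(Bucket=BUCKET, Key=f"{NAME}.tar")["Body"]
    with subprocess.Popen(["tar", "-x", "-f", "-", "-C", "/var/cache/git"],
                          stdin=subprocess.PIPE) as tar:
        for chunk in body.iter_chunks():
            tar.stdin.write(chunk)
        tar.stdin.close()
    # ... followed by a fetch in the extracted repository and an upload of
    # the new tarball and checksum (omitted here).
```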
Storing the repositories on the RAM disk doesn't increase processing speed, and will result in memory errors as there is nothing to swap out. As the repositories are about 2 GiB, instances need at least 4 GiB of RAM+swap. The cheapest instance types on Amazon EC2 that fulfill that requirement purely from RAM are `t3a.medium`, `t3.medium` and `t2.medium`, which cost about $0.0125/hour for a spot instance.
Pulling and extracting a kernel repository tarball on a `t3.medium` instance in this way takes about 10 seconds. Cloning a kernel repository takes about 7 minutes, which seems to be mostly CPU-bound (90% CPU utilization). Pushing it again after updating takes about 20 seconds. Just checking all repositories without any updates needed takes about 30 seconds. It still takes about 2 minutes to spin up an instance on Amazon EC2. Moving to a 2 GiB RAM instance type like `t3.small` with added swap increases the duration of the jobs by about 5%, but reduces the price per hour by 50%. The git-cache-update runner has a limit of one worker to prevent any interference between concurrent update jobs. The idle time is set to zero to shut down the worker as soon as the job is finished.