Refreshing recipe Beaker machines

How to solve a lack of machines after an outage

Rationale

The list of eligible machines for each recipe is evaluated once, at submission time. The list does not change while the recipe is waiting in the Beaker queue, so the following situation can occur:

  1. There is a lab outage, taking out most of the eligible machines.
  2. A recipe is submitted, with only a very small number of working machines in its pool.
  3. The outage is resolved and the machines come back.
  4. The recipe keeps waiting for one of the machines in its tiny pool to become available and ignores the machines that came back.

This situation ends up blocking the pipelines, and a manual fix of those recipes may be required.
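The steps above can be illustrated with a toy model. This is only a sketch of the snapshot behavior described here, not Beaker's scheduler code; the class, function, and machine names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueuedRecipe:
    """Toy model of a recipe whose machine pool is fixed at submission."""
    recipe_id: int
    pool: frozenset  # snapshot of eligible machines, never refreshed


def submit(recipe_id, working_machines):
    # The pool is copied once, at submission time (hypothetical helper).
    return QueuedRecipe(recipe_id, frozenset(working_machines))


all_machines = {"m1", "m2", "m3", "m4", "m5"}
during_outage = {"m1"}                # step 1: most machines are down

recipe = submit(101, during_outage)   # step 2: submitted with a tiny pool
after_outage = set(all_machines)      # step 3: the machines come back

# Step 4: the queued recipe still only waits for its original pool.
ignored = after_outage - recipe.pool
print(sorted(recipe.pool))
print(sorted(ignored))
```

In this model the only way to pick up the returned machines is to call submit again, which is what the fix below relies on.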

Solution

The problematic recipes can be identified by checking the queued recipes table in Grafana. Look at the recipes that were submitted before the outage was resolved, and compare the number of machines listed for each of them in the table with the number of machines currently available. A recipe whose listed number is noticeably smaller is a candidate for the fix.
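The comparison can be sketched as a small filter. The record fields (id, submitted, pool_size) and the available_now mapping are assumptions made for this example; they do not reflect the actual Grafana table columns or any Beaker API, so the data would have to be gathered by hand or by whatever export is available.

```python
from datetime import datetime


def find_stale_recipes(queued, outage_resolved_at, available_now):
    """Return IDs of recipes that were queued before the outage ended and
    whose recorded pool is smaller than what is available now."""
    stale = []
    for recipe in queued:
        submitted_before = recipe["submitted"] < outage_resolved_at
        current = available_now.get(recipe["id"], 0)
        if submitted_before and recipe["pool_size"] < current:
            stale.append(recipe["id"])
    return stale


# Hypothetical data, as it might be copied from the queued recipes table.
queued = [
    {"id": 101, "submitted": datetime(2024, 1, 1, 9, 0), "pool_size": 1},
    {"id": 102, "submitted": datetime(2024, 1, 1, 9, 30), "pool_size": 5},
    {"id": 103, "submitted": datetime(2024, 1, 1, 12, 0), "pool_size": 1},
]
resolved_at = datetime(2024, 1, 1, 11, 0)
# Machines currently matching each recipe's requirements (assumed known).
available_now = {101: 5, 102: 5, 103: 1}

stale_ids = find_stale_recipes(queued, resolved_at, available_now)
print(stale_ids)
```

Recipe 102 is skipped because its pool is already complete, and recipe 103 because it was submitted after the outage was resolved.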

upt resubmits a Beaker job if it gets canceled, and this feature is essential to fixing the problem. For each identified recipe, open the recipe in Beaker and cancel it (the bkr command-line client can be used for this as well). upt will then take care of resubmitting the recipe.
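When several recipes need canceling, the command-line route can be scripted. The sketch below assumes the bkr client's job-cancel subcommand with a --msg option and an R:<id> taskspec; verify this against bkr job-cancel --help on your installation before relying on it. The helper names and recipe IDs are hypothetical, and the default is a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_cancel_command(recipe_id, reason):
    # Assumed invocation: bkr job-cancel --msg <reason> R:<recipe id>
    return ["bkr", "job-cancel", "--msg", reason, f"R:{recipe_id}"]


def cancel_recipes(recipe_ids, reason, dry_run=True):
    """Cancel each recipe, or only print the commands when dry_run is set."""
    commands = [build_cancel_command(rid, reason) for rid in recipe_ids]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # Requires an installed and authenticated bkr client.
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical recipe IDs; nothing is canceled in dry-run mode.
planned = cancel_recipes([101, 104], "refresh machine pool after outage")
```

Keeping the dry run as the default makes it easy to review the list before anything is canceled; pass dry_run=False to actually run the commands.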

You can verify the increased number of available machines by checking the newly submitted recipe in Beaker. Most of the time, the new recipe should also show up in the updated Grafana table; however, that will not happen if the recipe picks up a free machine immediately upon submission.