Changing the configuration of the RabbitMQ nodes

How to mess with the RabbitMQ cluster without causing an outage

Problem

Invasive changes to the RabbitMQ cluster need to be implemented via rolling upgrades to avoid an outage.

Approach

As a first step, an additional node with the changed configuration is added to the cluster. Afterwards, the existing nodes are reconfigured one by one. Finally, the additional node is removed again.

Test-running the changed configuration on an additional node keeps certain problems invisible to production clients, because they do not use that node:

  • configuration changes that prevent the RabbitMQ service from coming up will not result in a degraded cluster
  • configuration changes that prevent the RabbitMQ node from joining the cluster will not result in lost messages
  • configuration changes that prevent clients from connecting to the RabbitMQ node can be detected in isolation

Alternative approaches

The additional node is not strictly necessary. Any changes can also be performed directly on the production nodes. The following aspects should be kept in mind in that case.

Any configuration change that prevents the RabbitMQ service from restarting will result in a node being offline for RabbitMQ clients. By default, clients round-robin through all configured nodes until a connection succeeds, so in practice this should not be a problem.
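
If needed, it can be verified that clients have reconnected to the remaining nodes by listing the connections known to the cluster; the columns shown here are only a suggestion:

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
        sudo rabbitmqctl list_connections user peer_host state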

While a node is offline, the cluster will be running in a degraded state. As automatic updates of the other nodes are still enabled, the cluster could degrade further, e.g. because of nodes rebooting to install offline updates.
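
If the changes are expected to take a while, it might also make sense to temporarily pause automatic updates on the other nodes for the duration of the maintenance. Assuming the updates are driven by the dnf-automatic timer (the actual unit name depends on how automatic updates are set up on these machines), something like the following could be used, and started again once the maintenance is finished:

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
        sudo systemctl stop dnf-automatic.timer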

Any configuration change that leaves a RabbitMQ node running but not joined to the cluster (split cluster) will result in listening clients connecting successfully without being able to consume any messages. More importantly, any messages published to that node will not be synchronized to the cluster and are effectively lost.
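
Whether a reconfigured node correctly rejoined the cluster can be checked from any node via

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
        sudo rabbitmqctl cluster_status

A node that is missing from the Running Nodes list, or that only lists itself when the command is run on it, has not rejoined the cluster.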

Steps

  1. Create a new RabbitMQ node in us-east-1d and join it to the cluster via

    PLAYBOOK_NAME=aws-arr-rabbitmq-instance \
        ./ansible_deploy.sh \
        --extra-vars '{"rabbitmq_instances": ["us-east-1d"]}' \
        --limit arr-cki-prod-rabbitmq-us-east-1d,localhost \
        --skip-tags qualys
    

    The Qualys cloud agent installation can be skipped via --skip-tags qualys if dnf is not available locally.

  2. Log into the RabbitMQ management console. Ensure that there are now four nodes in the cluster. Identify the newly joined node by looking for the node with the lowest uptime and check its status.
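
    Alternatively, the cluster membership can be checked from the command line on the new node via

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1d \
        sudo rabbitmqctl cluster_status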

  3. One by one, implement the changes on the RabbitMQ nodes. After restarting a changed node, ensure that it is healthy and fully synced to the cluster before continuing with the next node.
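
    As a sketch of such a check, assuming quorum queues are in use (adjust to the queue types that are actually deployed), the following can be run against the node that was just restarted, using the us-east-1a node as an example:

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
        sudo rabbitmq-diagnostics check_running
    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
        sudo rabbitmq-diagnostics check_local_alarms
    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
        sudo rabbitmq-queues check_if_node_is_quorum_critical

    The last check fails if taking the node offline would cause a quorum queue to lose its quorum, which also guards against moving on before the other replicas have caught up again.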

  4. To remove the additional node in us-east-1d, drain it first via

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1d \
        sudo rabbitmq-upgrade drain
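
    Depending on the RabbitMQ version, the node status output can be used to double-check that the node is now in maintenance mode:

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1d \
        sudo rabbitmq-diagnostics status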
    

    In the AWS console, disable termination protection of the EC2 instance via Actions -> Instance settings -> Change termination protection. Then terminate it via Instance state -> Terminate instance.

    To remove the node from the cluster, get the lists of cluster nodes via

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
        sudo rabbitmqctl cluster_status
    

    Compare the Disk Nodes and Running Nodes lists to find the name of the terminated additional node, and remove it from the cluster via

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
        sudo rabbitmqctl forget_cluster_node rabbit@ip-123-45-67-89.ec2.internal
    
  5. In the RabbitMQ management console, ensure that there are only three nodes shown.
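
    The same can be verified from the command line via

    ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
        sudo rabbitmqctl cluster_status

    The forgotten node should no longer show up in the Disk Nodes list.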