Site maintenance

This document describes how to re-install and replace hosts in various scenarios.

For operating system upgrades, see OS Upgrade.

See Add your first site for instructions on how to add a site to the system.

See Local unseal for instructions on how to unseal an isolated site. Unseal is how an existing site retrieves its decryption keys after being restarted. This is not to be confused with site unwrap (where a site gets its secret keys at installation time), mentioned below.

During development, a host may need to be re-installed in order to test installation scripts, perform OS upgrades, change disks, etc. This can often be done without concern for handling running applications or preserving configuration.

During production, host re-installation is likely rare, but when it happens, care must be taken to properly handle running applications.

When a host is re-installed, it is assumed that the Edge Enforcer installation script is executed again.

See Sites for details about how hosts call home etc.

Maintenance modes overview

Each host in a site has a maintenance-mode setting which can be set to one of three values:

  • off: The default; the host can be used normally.

  • blocked: No new service instances will be scheduled to this host, but existing service instances are not rescheduled. The host will also not be used to handle any new volga topics, but existing topics are not moved away from the host. When draining services from multiple hosts, it is good practice to first place them all in this mode, to avoid having services relocate to hosts that are about to be drained.

  • out-of-service: Like blocked, but service instances scheduled to this host will be rescheduled to other hosts, and volga topics handled by this host will be moved to other hosts. Useful when a host is to be permanently removed, or taken offline for a substantial amount of time.

In addition, there is the drain mode, which can only be set by calling the drain-host action; it reschedules service instances to other hosts. This is useful when a host needs to be rebooted. The drain mode is ephemeral: once the host is restarted, its maintenance-mode reverts to its previous value.
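As a concrete sketch, the mode can be inspected and changed with the same supctl commands that are used later on this page (the site name sthlm and the host id are example values from this page):

```shell
# Hedged sketch: inspect and set maintenance modes on a site.
# "sthlm" and the host id are example values from this page.

# Show the current maintenance-mode of every host on the site:
show_modes() {
  supctl show --site sthlm system cluster hosts \
    --fields host-id,cluster-hostname,maintenance-mode
}

# Block a host so that no new service instances or volga topics
# are placed on it (existing ones stay where they are):
block_host() {
  supctl merge system sites sthlm <<EOF
hosts:
  - host-id: $1
    maintenance-mode: blocked
EOF
}
```

For example, `block_host 38cfff0d-cce6-4f71-99dd-98612719cd79` would place the first example host in blocked mode.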

Re-install a host on a site with a single host

Suppose you have created a site with a single host, and installed and started the host. Then you stop the host and re-install it. The Edge Enforcer on the host will perform its call-home procedure. However, this will fail, since the call-home server thinks that the host is already a member of an initialized site.

When the Edge Enforcer doesn't start, check the systemd status:

systemctl status supd

and the logs:

journalctl -xu supd

In this case, you might see messages like:

<NOTICE> 2022-12-13 14:03:00.261629Z UNKNOWN_CHOSTNAME boot_callhome:200 <0.9.0>
Host id 38cfff0d-cce6-4f71-99dd-98612719cd79 calling home to https://192.168.100.101/v1/host-init resulted in reply code 403 (forbidden).
Check the volga topic system:logs in Control Tower and verify that this host identifier is configured on a site, retrying

You can check the volga topic system:logs in the Control Tower:

supctl do volga topics system:logs consume --payload-only

And you should see messages like:

<INFO>    2022-12-13 13:58:39.551647Z topdc-001: host-init ignoring duplicate host id 38cfff0d-cce6-4f71-99dd-98612719cd79 (already running in site sthlm)

This issue can be resolved in two different ways:

Re-create the site

One solution is to delete the site in the Control Tower, and then re-create it.

supctl show --config system sites sthlm > /tmp/sthlm
supctl delete system sites sthlm
supctl create system sites < /tmp/sthlm

When this is done, the host, which is still calling home, will get a successful reply from the Control Tower.

Re-create the host

Another solution is to delete the host from the site in the Control Tower, and then re-create it. In this case you can keep the rest of the site config intact, but you will have to allow the site to unwrap again; see below.

supctl show --config system sites sthlm
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79

Remove the host:

supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
EOF

Add the host back:

supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
EOF

At this point, the host encounters a new problem:

journalctl -xu supd
Failed to unwrap site bundle

This is because the site is only allowed to unwrap its site bundle once, and it was already unwrapped when the site was first initialized. The site bundle contains the site's unique secrets, needed to establish the site and to communicate with the Control Tower.

To fix this, we can manually allow the site to unwrap the site bundle again:

supctl do system sites sthlm reallow-site-unwrap

When this is done, the host, which is still calling home, will get a successful reply from the Control Tower.
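The remove, re-add, and reallow steps above can be combined into a single shell function. This is only a sketch that assumes the example site config shown above (site name, parent site, and access list); adapt the config to your actual site before use:

```shell
# Hedged sketch: re-create a host on a single-host site, then allow
# the site to unwrap its site bundle again. The config in the heredocs
# is the example config from this page; adapt it to your site.
recreate_single_host() {
  site="$1"
  host_id="$2"

  # Remove the host, keeping the rest of the site config:
  supctl replace system sites "$site" <<EOF
name: $site
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
EOF

  # Add the host back:
  supctl merge system sites "$site" <<EOF
hosts:
  - host-id: $host_id
EOF

  # Allow the site to unwrap its site bundle again:
  supctl do system sites "$site" reallow-site-unwrap
}
```

For example, `recreate_single_host sthlm 38cfff0d-cce6-4f71-99dd-98612719cd79` reproduces the steps above in one call.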

Re-install a host on a site with more than one host

Suppose you have created a site with three hosts, and installed and started the hosts. Then you stop one of the hosts and re-install it. The Edge Enforcer on the host will perform its call-home procedure. However, this will fail, since the call-home server thinks that the host is already a member of an initialized site.

When the Edge Enforcer doesn't start, check the systemd status:

systemctl status supd

and the logs:

journalctl -xu supd

In this case, you might see messages like:

<NOTICE> 2022-12-13 14:03:00.261629Z UNKNOWN_CHOSTNAME boot_callhome:200 <0.9.0>
Host id 38cfff0d-cce6-4f71-99dd-98612719cd79 calling home to https://192.168.100.101/v1/host-init resulted in reply code 403 (forbidden).
Check system:log in Control Tower and verify that this host identifier is configured on a site, retrying

You can also check the system:logs volga topic in the Control Tower:

supctl do volga topics system:logs consume --payload-only

And you should see messages like:

<INFO>    2022-12-13 13:58:39.551647Z topdc-001: host-init ignoring duplicate host id 38cfff0d-cce6-4f71-99dd-98612719cd79 (already running in site sthlm)

This issue can be resolved by recreating the host:

Re-create the host

Delete the host from the site in the Control Tower, and then re-create it.

supctl show --config system sites sthlm
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  - host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  - host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6

Remove the host:

supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  - host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
EOF

Add the host back:

supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  - host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  - host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
EOF

Replace hosts in a site

Suppose you have a site with three hosts and running applications, and these hosts need to be replaced by new updated hardware. While it is possible to just remove the hosts from the site configuration, this may result in unnecessary application downtime and data loss for volga topics. The following steps describe how to bring the old hosts down one by one, while migrating applications and topic data to the new hosts.

NOTE: This procedure should be followed even when only some of the hosts are replaced.

Site Preconditions

Ensure that all hosts are up and running.

supctl show --site sthlm system cluster hosts --fields host-id,cluster-hostname,hostname,oper-status
- host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  cluster-hostname: sthlm-001
  hostname: sthlm-3
  oper-status: up
- host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  cluster-hostname: sthlm-002
  hostname: sthlm-1
  oper-status: up
- host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
  cluster-hostname: sthlm-003
  hostname: sthlm-2
  oper-status: up

Configure and start the new hosts

Add the host-ids of the new hosts to the config:

supctl merge system sites sthlm << EOF
hosts:
- host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
- host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
- host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
EOF

Then start the new hosts. Verify that all hosts are up and running (note that the new hosts will not show up until they have attached to the site):

supctl show --site sthlm system cluster hosts --fields host-id,cluster-hostname,hostname,oper-status,controller
- host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  cluster-hostname: sthlm-001
  hostname: sthlm-3
  oper-status: up
  controller: true
- host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  cluster-hostname: sthlm-002
  hostname: sthlm-1
  oper-status: up
  controller: true
- host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
  cluster-hostname: sthlm-003
  hostname: sthlm-2
  oper-status: up
  controller: true
- host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
  cluster-hostname: sthlm-004
  hostname: sthlm-4
  oper-status: up
  controller: false
- host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
  cluster-hostname: sthlm-005
  hostname: sthlm-6
  oper-status: up
  controller: false
- host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
  cluster-hostname: sthlm-006
  hostname: sthlm-5
  oper-status: up
  controller: false

Set maintenance mode

Set the maintenance-mode for all hosts that are to be removed to blocked. This ensures that no new service instances are scheduled to the old hosts, and no new volga topics will be assigned to them.

warning

Do not set the hosts to out-of-service in this step, as that could lead to data loss.

supctl merge system sites sthlm << EOF
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
    maintenance-mode: blocked
  - host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
    maintenance-mode: blocked
  - host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
    maintenance-mode: blocked
EOF

Verify that the hosts are blocked:

supctl show --site sthlm system cluster hosts --fields host-id,cluster-hostname,maintenance-mode
- host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  cluster-hostname: sthlm-001
  maintenance-mode: blocked
- host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  cluster-hostname: sthlm-002
  maintenance-mode: blocked
- host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
  cluster-hostname: sthlm-003
  maintenance-mode: blocked
- host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
  cluster-hostname: sthlm-004
- host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
  cluster-hostname: sthlm-005
- host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
  cluster-hostname: sthlm-006

Set the old hosts to out-of-service

Set the maintenance-mode for the old hosts to out-of-service, one host at a time:

supctl merge system sites sthlm << EOF
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
    maintenance-mode: out-of-service
EOF

This will move services and volga topic replicas from this host to the new hosts (since the remaining old hosts were all set to blocked in the previous step).

Check the value of safe-to-remove on the host that is being taken out of service. It becomes true when all service instances and all replicated volga topics have been moved to other hosts:

supctl show --site sthlm system cluster hosts --fields hostname,safe-to-remove

Repeat this step for each host that is to be removed.
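As a sketch, this check can be turned into a polling loop. The command and field names are the ones shown above; the exact output layout (hostname followed by safe-to-remove, in the flat list format of the listings on this page) is an assumption:

```shell
# Hedged sketch: poll until a host reports safe-to-remove: true.
# Assumes the flat YAML output layout shown above, with the fields
# printed as "hostname" followed by "safe-to-remove".
wait_safe_to_remove() {
  host="$1"
  until supctl show --site sthlm system cluster hosts \
          --fields hostname,safe-to-remove \
        | grep -A1 "hostname: $host" | grep -q 'safe-to-remove: true'; do
    echo "waiting for $host to become safe to remove..."
    sleep 10
  done
}
```

For example, `wait_safe_to_remove sthlm-3` returns once the first example host can safely be removed.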

Remove the old hosts

Once all of the old hosts have been set to out-of-service, they can be removed from the site configuration. Note that it is important that the hosts are up and running at this point.

supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
  - host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
  - host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
EOF

The new hosts will copy the shared state from the old hosts. The old hosts will detect that they have been removed from the site and stop themselves. Verify that the old hosts are no longer running, and that the new hosts are controllers and up and running (this may take a short while):

supctl show --site sthlm system cluster hosts --fields host-id,cluster-hostname,hostname,oper-status,controller
- host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
  cluster-hostname: sthlm-004
  hostname: sthlm-4
  oper-status: up
  controller: true
- host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
  cluster-hostname: sthlm-005
  hostname: sthlm-6
  oper-status: up
  controller: true
- host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
  cluster-hostname: sthlm-006
  hostname: sthlm-5
  oper-status: up
  controller: true

At this point, the Edge Enforcer has wiped its local state and stopped, and the host VM or physical machine can be stopped.
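Taken together, the replacement steps above (block all old hosts, then take them out of service one at a time while waiting for safe-to-remove) can be sketched as one shell function. The supctl commands are those shown above; the safe-to-remove output layout and the polling interval are assumptions:

```shell
# Hedged sketch of the host-replacement flow from this page.
# The supctl commands are the ones shown above; the safe-to-remove
# output layout is an assumption based on the example listings.
replace_old_hosts() {
  site="$1"; shift

  # 1. Block all old hosts first, so that nothing relocates onto a
  #    host that is about to be removed:
  for h in "$@"; do
    supctl merge system sites "$site" <<EOF
hosts:
  - host-id: $h
    maintenance-mode: blocked
EOF
  done

  # 2. Take the old hosts out of service one at a time, waiting for
  #    safe-to-remove: true before moving on to the next:
  for h in "$@"; do
    supctl merge system sites "$site" <<EOF
hosts:
  - host-id: $h
    maintenance-mode: out-of-service
EOF
    until supctl show --site "$site" system cluster hosts \
            --fields host-id,safe-to-remove \
          | grep -A1 "host-id: $h" | grep -q 'safe-to-remove: true'; do
      sleep 10
    done
  done

  # 3. Finally, remove the old hosts from the site config with
  #    "supctl replace system sites ..." as shown above.
}
```

For example, `replace_old_hosts sthlm 38cfff0d-cce6-4f71-99dd-98612719cd79 b9933b6b-8861-42ac-bb07-2ffe674f49ae 36c16c1d-fcb7-49fd-9785-ddd6502825b6` performs steps 1 and 2 for the example hosts; the final supctl replace is left as a manual step.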

Rescheduling of service instances

Some maintenance scenarios can affect the distribution of running service instances among the hosts, leaving them poorly balanced once the maintenance is done. One such scenario is when all hosts are drained, upgraded, and restarted in sequence, one by one.

When this is the case, it is possible to invoke the reschedule action to re-balance the service instances among the running hosts within a site.

supctl do --site sthlm system cluster reschedule
rescheduled-service-instances:
  - name: telco.alpine.fifth-srv-4
    from-host: h05
    to-host: h07
  - name: telco.alpine.fifth-srv-3
    from-host: h04
    to-host: h06
  - name: telco.alpine.fifth-srv-2
    from-host: h03
    to-host: h07
  - name: telco.alpine.second-srv-1
    from-host: h06
    to-host: h07
  - name: telco.alpine.fifth-srv-1
    from-host: h06
    to-host: h07

Another case where this is useful is when new hosts are added to a site, since no automatic rebalancing of service instances between hosts is done at that point.
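As a sketch, adding a host and immediately triggering a rebalance could look like this (the site name is this page's example, and the commands are the ones shown above; the helper itself is hypothetical):

```shell
# Hedged sketch: add a new host to the site, then manually rebalance
# service instances, since no automatic rebalancing happens on host add.
add_and_rebalance() {
  supctl merge system sites sthlm <<EOF
hosts:
  - host-id: $1
EOF
  supctl do --site sthlm system cluster reschedule
}
```

For example, `add_and_rebalance 84716d5b-b9d2-4af2-87fb-4de58ba86fd6` would add one of the example hosts and rebalance in one step.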