Site maintenance
This document describes how to re-install and replace hosts in various scenarios.
For operating system upgrades, see OS Upgrade.
See Add your first site for instructions on how to add a site to the system.
See Local unseal for instructions on how to unseal an isolated site. Unseal is how an existing site retrieves its decryption keys after being restarted. This is not to be confused with site unwrap (where a site gets its secret keys at the time of installation), mentioned below.
During development, a host may need to be re-installed in order to test installation scripts, perform OS upgrades, change disks, etc. This can often be done without any concern for handling running applications or keeping configuration.
In production, host re-installation is likely rare, but when it happens, care has to be taken to properly handle running applications.
When a host is re-installed, it is assumed that the Edge Enforcer installation script is executed again.
See Sites for details about how hosts call home, etc.
Maintenance modes overview
Each host in a site has a maintenance-mode setting which can be set to one of three values:
- off: The default; the host can be used normally.
- blocked: No new service instances will be scheduled to this host, but existing service instances are not rescheduled. The host will also not be used to handle any new volga topics, but old topics are not moved away from the host. When draining services from multiple hosts, it is good practice to first place them all in this mode, to avoid having any services relocate to these hosts.
- out-of-service: Like blocked, but service instances scheduled to this host will be rescheduled to other hosts, and volga topics handled by this host will be moved to other hosts. Useful when a host is to be permanently removed, or taken offline for a substantial amount of time.
In addition, there is the drain mode, which can only be set by calling the drain-host action to reschedule service instances to other hosts. This is useful when a host needs to be rebooted. The drain mode is ephemeral: once the host is restarted, its maintenance-mode will revert back to its previous value.
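For example, a host can be placed in blocked mode by merging the setting into the site configuration, while drain is triggered through the drain-host action. A minimal sketch, reusing the example site sthlm and a host id from the sections below; the exact path of the drain-host action is an assumption here, so verify it in your installation:
supctl merge system sites sthlm << EOF
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
    maintenance-mode: blocked
EOF
# Drain a host before a reboot (action path assumed; check your installation)
supctl do --site sthlm system cluster hosts sthlm-001 drain-host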
Re-install a host on a site with a single host
Suppose you have created a site with a single host, and installed and started the host. Then you stop the host and re-install it. The Edge Enforcer on the host will perform its call-home procedure. However, this will fail, since the call-home server thinks that the host is already a member of an initialized site.
When the Edge Enforcer doesn't start, check the systemd status:
systemctl status supd
and the logs:
journalctl -xu supd
In this case, you might see messages like:
<NOTICE> 2022-12-13 14:03:00.261629Z UNKNOWN_CHOSTNAME boot_callhome:200 <0.9.0>
Host id 38cfff0d-cce6-4f71-99dd-98612719cd79 calling home to https://192.168.100.101/v1/host-init resulted in reply code 403 (forbidden).
Check the volga topic system:logs in Control Tower and verify that this host identifier is configured on a site, retrying
You can check the volga topic system:logs in the Control Tower:
supctl do volga topics system:logs consume --payload-only
And you should see messages like:
<INFO> 2022-12-13 13:58:39.551647Z topdc-001: host-init ignoring duplicate host id 38cfff0d-cce6-4f71-99dd-98612719cd79 (already running in site sthlm)
This issue can be resolved in two different ways:
Re-create the site
One solution is to delete the site in the Control Tower, and then re-create it.
supctl show --config system sites sthlm > /tmp/sthlm
supctl delete system sites sthlm
supctl create system sites < /tmp/sthlm
When this is done, the host, which is still calling home, will get a successful reply from the Control Tower.
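To verify, you can consume the system:logs topic in the Control Tower and follow the supd journal on the host; a minimal sketch, assuming the host from this example:
# In the Control Tower: watch for a successful host-init for the host id
supctl do volga topics system:logs consume --payload-only
# On the host: follow the Edge Enforcer journal until supd starts cleanly
journalctl -fu supd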
Re-create the host
Another solution is to delete the host from the site in the Control Tower, and then re-create it. In this case you can keep the rest of the site config intact, but you will have to allow the site to unwrap its site bundle again; see below.
supctl show --config system sites sthlm
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
First remove the host from the site configuration:
supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
EOF
Then add the host back:
supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
EOF
At this point, the host encounters a new problem:
journalctl -xu supd
Failed to unwrap site bundle
This is because the site is only allowed to unwrap its site bundle once, and it was already unwrapped when the site was first initialized. The site bundle holds the site's unique secrets, needed to establish the site and to communicate with the Control Tower.
To fix this, we can manually allow the site to unwrap the site bundle again:
supctl do system sites sthlm reallow-site-unwrap
When this is done, the host, which is still calling home, will get a successful reply from the Control Tower.
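You can follow the journal on the host to confirm that the unwrap now succeeds; a minimal sketch:
# Follow the Edge Enforcer journal; the "Failed to unwrap site bundle"
# messages should stop once the unwrap has been re-allowed
journalctl -fu supd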
Re-install a host on a site with more than one host
Suppose you have created a site with three hosts, and installed and started the hosts. Then you stop one of the hosts and re-install it. The Edge Enforcer on the host will perform its call-home procedure. However, this will fail, since the call-home server thinks that the host is already a member of an initialized site.
When the Edge Enforcer doesn't start, check the systemd status:
systemctl status supd
and the logs:
journalctl -xu supd
In this case, you might see messages like:
<NOTICE> 2022-12-13 14:03:00.261629Z UNKNOWN_CHOSTNAME boot_callhome:200 <0.9.0>
Host id 38cfff0d-cce6-4f71-99dd-98612719cd79 calling home to https://192.168.100.101/v1/host-init resulted in reply code 403 (forbidden).
Check system:log in Control Tower and verify that this host identifier is configured on a site, retrying
You can also check the system:logs volga topic in the Control Tower:
supctl do volga topics system:logs consume --payload-only
And you should see messages like:
<INFO> 2022-12-13 13:58:39.551647Z topdc-001: host-init ignoring duplicate host id 38cfff0d-cce6-4f71-99dd-98612719cd79 (already running in site sthlm)
This issue can be resolved by re-creating the host:
Re-create the host
Delete the host from the site in the Control Tower, and then re-create it.
supctl show --config system sites sthlm
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  - host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  - host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
First remove the host to be re-installed from the site configuration, keeping the other hosts:
supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  - host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
EOF
Then add the host back:
supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  - host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  - host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
EOF
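Once the host calls home again and attaches to the site, you can verify that all three hosts are up; a minimal sketch, using the same field selection as elsewhere in this document:
supctl show --site sthlm system cluster hosts --fields host-id,cluster-hostname,oper-status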
Replace hosts in a site
Suppose you have a site with three hosts and running applications, and these hosts need to be replaced by new, updated hardware. While it is possible to simply remove the hosts from the site configuration, this may result in unnecessary application downtime and data loss for volga topics. The following steps describe how to bring the old hosts down one by one, while migrating applications and topic data to the new hosts.
NOTE: This procedure should be followed even if not all hosts are replaced.
Site preconditions
Ensure that all hosts are up and running.
supctl show --site sthlm system cluster hosts --fields host-id,cluster-hostname,hostname,oper-status
- host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  cluster-hostname: sthlm-001
  hostname: sthlm-3
  oper-status: up
- host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  cluster-hostname: sthlm-002
  hostname: sthlm-1
  oper-status: up
- host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
  cluster-hostname: sthlm-003
  hostname: sthlm-2
  oper-status: up
Configure and start the new hosts
Add the host-ids of the new hosts to the config:
supctl merge system sites sthlm << EOF
hosts:
  - host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
  - host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
  - host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
EOF
Then start the new hosts. Verify that all hosts are up and running (NOTE: the new hosts will not show up until they have attached to the site):
supctl show --site sthlm system cluster hosts --fields host-id,cluster-hostname,hostname,oper-status,controller
- host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  cluster-hostname: sthlm-001
  hostname: sthlm-3
  oper-status: up
  controller: true
- host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  cluster-hostname: sthlm-002
  hostname: sthlm-1
  oper-status: up
  controller: true
- host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
  cluster-hostname: sthlm-003
  hostname: sthlm-2
  oper-status: up
  controller: true
- host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
  cluster-hostname: sthlm-004
  hostname: sthlm-4
  oper-status: up
  controller: false
- host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
  cluster-hostname: sthlm-005
  hostname: sthlm-6
  oper-status: up
  controller: false
- host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
  cluster-hostname: sthlm-006
  hostname: sthlm-5
  oper-status: up
  controller: false
Set maintenance mode
Set the maintenance-mode for all hosts that are to be removed to blocked. This ensures that no new service instances are scheduled to the old hosts, and no new volga topics will be assigned to them. Do not set the hosts to out-of-service in this step, as that could potentially lead to data loss.
supctl merge system sites sthlm << EOF
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
    maintenance-mode: blocked
  - host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
    maintenance-mode: blocked
  - host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
    maintenance-mode: blocked
EOF
Verify that the hosts are blocked:
supctl show --site sthlm system cluster hosts --fields host-id,cluster-hostname,maintenance-mode
- host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
  cluster-hostname: sthlm-001
  maintenance-mode: blocked
- host-id: b9933b6b-8861-42ac-bb07-2ffe674f49ae
  cluster-hostname: sthlm-002
  maintenance-mode: blocked
- host-id: 36c16c1d-fcb7-49fd-9785-ddd6502825b6
  cluster-hostname: sthlm-003
  maintenance-mode: blocked
- host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
  cluster-hostname: sthlm-004
- host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
  cluster-hostname: sthlm-005
- host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
  cluster-hostname: sthlm-006
Set the old hosts to out-of-service
Set the maintenance-mode for the old hosts to out-of-service, one host at a time:
supctl merge system sites sthlm << EOF
hosts:
  - host-id: 38cfff0d-cce6-4f71-99dd-98612719cd79
    maintenance-mode: out-of-service
EOF
This will move services and volga topic replicas from this host to the new hosts (since the remaining old hosts were all set to blocked in the previous step).
Check the value of safe-to-remove on the host that is taken out of service. It will get the value true when all service instances and all replicated volga topics have been moved to other hosts:
supctl show --site sthlm system cluster hosts --fields hostname,safe-to-remove
Repeat this step for each host that is to be removed.
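The per-host loop can also be scripted. A hedged sketch, assuming the example host ids above and that safe-to-remove appears in the output next to each host-id; verify the output format in your installation before relying on it:
# Take each old host out of service and wait until it is safe to remove
for id in 38cfff0d-cce6-4f71-99dd-98612719cd79 \
          b9933b6b-8861-42ac-bb07-2ffe674f49ae \
          36c16c1d-fcb7-49fd-9785-ddd6502825b6; do
  supctl merge system sites sthlm << EOF
hosts:
  - host-id: $id
    maintenance-mode: out-of-service
EOF
  # Poll until this host reports safe-to-remove: true
  until supctl show --site sthlm system cluster hosts \
        --fields host-id,safe-to-remove \
        | grep -A1 "$id" | grep -q 'safe-to-remove: true'; do
    sleep 10
  done
done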
Remove the old hosts
Once all of the old hosts have been set to out-of-service, they can be removed from the site configuration. Note that it is important that the hosts are up and running at this point.
supctl replace system sites sthlm << EOF
name: sthlm
type: edge
topology:
  parent-site: control-tower
ingress-allocation-method: dhcp
management-ipv4-access-list:
  - 192.168.100.1
hosts:
  - host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
  - host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
  - host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
EOF
The new hosts will copy the shared state from the old hosts. The old hosts will detect that they have been removed from the site and stop themselves. Verify that the old hosts are no longer running, and that the new hosts are controllers and up and running (this may take a short while):
supctl show --site sthlm system cluster hosts --fields host-id,cluster-hostname,hostname,oper-status,controller
- host-id: 84716d5b-b9d2-4af2-87fb-4de58ba86fd6
  cluster-hostname: sthlm-004
  hostname: sthlm-4
  oper-status: up
  controller: true
- host-id: aa3dfa3e-9102-42f2-a631-c87e75f1c5c9
  cluster-hostname: sthlm-005
  hostname: sthlm-6
  oper-status: up
  controller: true
- host-id: b27ee070-00c2-44a1-a65e-0fecb19de073
  cluster-hostname: sthlm-006
  hostname: sthlm-5
  oper-status: up
  controller: true
At this point, the Edge Enforcer has wiped its local state and stopped, and the host VM or physical machine can be shut down.
Rescheduling of service instances
Some maintenance scenarios might leave the distribution of running service instances poorly balanced across the running hosts. One such scenario is when all hosts are drained, upgraded and restarted in sequence, one by one.
When this is the case, it is possible to invoke the reschedule action to re-balance the service instances among the running hosts within a site.
supctl do --site sthlm system cluster reschedule
rescheduled-service-instances:
  - name: telco.alpine.fifth-srv-4
    from-host: h05
    to-host: h07
  - name: telco.alpine.fifth-srv-3
    from-host: h04
    to-host: h06
  - name: telco.alpine.fifth-srv-2
    from-host: h03
    to-host: h07
  - name: telco.alpine.second-srv-1
    from-host: h06
    to-host: h07
  - name: telco.alpine.fifth-srv-1
    from-host: h06
    to-host: h07
Rescheduling can also be useful when new hosts are added to a site, since no automatic rebalancing of service instances between hosts is done at that point.
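For example, after adding a new host you can trigger the rebalancing manually. A minimal sketch, with a hypothetical host id:
# Add a new host to the site (hypothetical host id)
supctl merge system sites sthlm << EOF
hosts:
  - host-id: 11111111-2222-3333-4444-555555555555
EOF
# Once the host has attached and is up, spread existing service instances onto it
supctl do --site sthlm system cluster reschedule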