Host metrics
The Avassa system collects metrics related to:
- the hosts on the site
- edge applications on each site
If you are a site provider, use the host metrics to observe the edge infrastructure proactively. If you are an application owner, use the application metrics to monitor the health of your applications.
Volga, the Avassa built-in pub/sub bus, collects the metrics, and a number of dedicated topics exist for each category above. The metrics are collected and kept on the site, so you must direct your query to the specific site in order to subscribe to them. Also note that different tenants have access to different topics: an application owner cannot, for example, read host metrics, and one application owner cannot read metrics belonging to another application owner.
You can read our dedicated guide on how to monitor the hosts and applications. The purpose of this section is to explain the metrics in detail.
The table below summarizes the metrics, which are elaborated further in the following subsections.
| Category | Metrics | Description |
|---|---|---|
| Host metrics | Memory metrics | Total, free and available memory on the host. |
| Host metrics | Load metrics | Load average for the latest 1, 5 and 15 minutes. |
| Host metrics | Disk metrics | Disk usage parameters for file systems used by either the Edge Enforcer (supd) or by any application managed by the Edge Enforcer. |
| Application metrics | Per container | Per individual container: memory, CPU usage, and container disk usage. Disk usage is only reported if quotas are enabled. |
| Application metrics | Per service | Ephemeral and persistent disk metrics per service instance. This is only reported if the underlying file system has quota support. |
| Application metrics | Per application | Memory, CPU, and disk usage per host. Disk metrics are only reported if quotas are enabled. |
The sections below explain the above metrics in detail.
Host metrics
The host metrics describe the health of the hosts that the Edge Enforcer is running on. Samples related to CPU, memory and disk are collected once every 30 seconds.
Host - memory metrics
Memory metrics taken from /proc/meminfo:
- total: Total usable RAM in bytes.
- free: Free usable RAM in bytes.
- available: An estimate of how much RAM in bytes is available for starting new applications, without swapping.
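As an illustration of where these three values come from, the sketch below parses /proc/meminfo-style text and converts the values to bytes. The mapping from `MemTotal`, `MemFree` and `MemAvailable` to total, free and available is an assumption based on the field descriptions above, the function name is hypothetical, and the sample numbers are made up for the example.

```python
def parse_meminfo(text: str) -> dict:
    """Extract total, free and available memory in bytes from /proc/meminfo text."""
    # Assumed mapping from kernel field names to the metric names used above.
    wanted = {"MemTotal": "total", "MemFree": "free", "MemAvailable": "available"}
    result = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # /proc/meminfo reports these fields in kB, meaning 1024-byte units.
            result[wanted[key]] = int(rest.split()[0]) * 1024
    return result


# Illustrative sample, not taken from a real host.
SAMPLE = """\
MemTotal:        4004640 kB
MemFree:          512000 kB
MemAvailable:    2500000 kB
Buffers:          100000 kB
"""

memory = parse_meminfo(SAMPLE)
print(memory)
```

On a Linux host you could pass the contents of /proc/meminfo instead of `SAMPLE` to compare the result with what the metrics report.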
Host - load metrics
Load average metrics are collected from /proc/loadavg and are available for the last 1, 5 and 15 minutes.
Load average indicates whether the system resources (CPU and IO) are adequately available for the processes (system load) that are running, runnable or in uninterruptible sleep states. This measures demand, which can be greater than what the system is currently processing.
A runnable process is a process that is waiting for the CPU. A process is in an uninterruptible sleep state when it is waiting for a resource and cannot be interrupted and will return from sleep only when the resource becomes available or a timeout occurs. For example, a process may be waiting for disk or network I/O. Runnable processes indicate that we are short of CPUs. Similarly, processes in an uninterruptible sleep state indicate I/O bottlenecks. The load average is the exponential moving average of the load number during the previous n minutes.
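To make the exponential moving average concrete, here is a minimal sketch of the update step. It assumes a 5-second sampling interval, as the Linux kernel uses, and works in floating point rather than the kernel's fixed-point arithmetic. The function and variable names are hypothetical.

```python
import math


def update_load(prev: float, active: int, window_s: float, interval_s: float = 5.0) -> float:
    """Fold one sample of the load number into an exponential moving average.

    active is the number of running, runnable or uninterruptible processes
    at sampling time; window_s is 60, 300 or 900 for the 1, 5 and 15 minute averages.
    """
    decay = math.exp(-interval_s / window_s)
    return prev * decay + active * (1.0 - decay)


# Start from an idle system and hold 4 active processes for one minute
# (12 samples at 5-second intervals).
load1 = 0.0
for _ in range(12):
    load1 = update_load(load1, active=4, window_s=60.0)

print(round(load1, 2))
```

After one minute of constant load, the 1-minute average has covered only about 63% of the distance to the new level, which is why the averages lag behind sudden changes in demand.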
Zero means there is no load. If you have a system with four CPU cores, a value of four means the system is fully loaded, and a value of eight means the system is overloaded. The rule of thumb is that if the load average is consistently higher than the number of cores or threads available to the OS, the server is overloaded.
Some rules of thumb below:
- If the averages are 0.0, then your system is idle.
- If the 1-minute average is higher than the 5 or 15 minute averages, then load is increasing.
- If the 1-minute average is lower than the 5 or 15 minute averages, then load is decreasing.
- If they are higher than your CPU count, then you might have a performance problem.
Load average gives a better picture of the system's capability to meet processing requirements than CPU utilization. CPU utilization is the time the CPU is working, expressed as a percentage of the total elapsed clock time. On process-intensive systems this could be 100%. On a single-CPU system, the load average may be any number greater than 1 while CPU utilization stays around 100% in every case. The problem nevertheless becomes more severe as the load average takes on higher values (1, 2, 3, ...), which CPU utilization alone does not reveal.
So when looking at the load average metrics, relate them to the number of CPUs available as indicated in the CPU metrics.
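The rules of thumb above can be sketched as a small helper that relates the load averages to the CPU count. The function name and the result layout are hypothetical. As a simplification, the trend compares the 1-minute average with the 5-minute average only, and "overloaded" requires all three averages to exceed the CPU count, reflecting the "consistently higher" wording.

```python
def assess_load(loadavg: dict, vcpus: int) -> dict:
    """Apply the load average rules of thumb to one sample."""
    averages = [loadavg["avg1"], loadavg["avg5"], loadavg["avg15"]]
    if loadavg["avg1"] > loadavg["avg5"]:
        trend = "increasing"
    elif loadavg["avg1"] < loadavg["avg5"]:
        trend = "decreasing"
    else:
        trend = "steady"
    return {
        "idle": all(a == 0 for a in averages),
        "trend": trend,
        # Consistently above the CPU count: all three windows exceed it.
        "overloaded": all(a > vcpus for a in averages),
    }


# Values in the shape of a Volga host metrics sample, with 2 vCPUs.
result = assess_load({"avg1": 0.01, "avg5": 0.02, "avg15": 0}, vcpus=2)
print(result)
```

With those values the host is neither idle nor overloaded, and the helper reports a decreasing trend because the 1-minute average is below the 5-minute average.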
Example from Volga
"cpu": {
"vcpus": 2
...
{
"time": "2023-02-14T09:11:28.343Z",
"site": "gothenburg-bergakungen",
...
},
"loadavg": {
"total": 226,
"running": 1,
"avg5": 0.02,
"avg15": 0,
"avg1": 0.01
},
Host - disk metrics
Disk metrics report a set of usage parameters for file systems used by either the Edge Enforcer (supd) or by any application managed by the Edge Enforcer. All file systems listed are seen from inside the Edge Enforcer container. A metric whose mount point is reported as CONTAINER-ROOT describes the usage of the Edge Enforcer container's root file system, not the host's root file system.
We have the following kinds of disk metrics:
- Edge Enforcer writes (supd state dir): config files and secrets as files
- Container root filesystem
- Mounted ephemeral and persistent volumes
For each of these the Edge Enforcer will collect:
- size: Total number of 1K-blocks where K is 1024 bytes.
- used: Number of used 1K-blocks where K is 1024 bytes.
- free: Number of available (free) 1K-blocks where K is 1024 bytes.
- percentage-used: A percentage value with up to 2 fractional digits, calculated as used divided by size.
For example, looking at df on the host:
df
Filesystem 1K-blocks Used Available Use% Mounted on
tmpfs 400464 2240 398224 1% /run
/dev/sda1 4903276 3147704 1739188 65% /
tmpfs 2002320 0 2002320 0% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
/dev/sda15 99800 5184 94616 6% /boot/efi
tmpfs 2002320 0 2002320 0% /var/lib/supd/state/secrets
tmpfs 2002320 0 2002320 0% /var/lib/supd/state/dhclient/_pid
tmpfs 400464 4 400460 1% /run/user/1000
And the corresponding metrics, where /dev/sda1 is the file system to compare:
"disk": [
{
"used": 3147840,
"type": "ext4",
"size": 4903276,
"percentage-used": 65,
"mount": "/avassa-data/supd",
"free": 1739052,
"filesystem": "/dev/sda1"
},
{
"used": 3147840,
"type": "overlay",
"size": 4903276,
"percentage-used": 65,
"mount": "CONTAINER-ROOT",
"free": 1739052,
"filesystem": "overlay"
}
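As a sketch of how a consumer might use these entries, the helper below converts the 1K-block counts to bytes and flags file systems at or above a usage threshold, using the reported percentage-used value as is. The function name, the output field names and the default threshold of 80 are hypothetical choices for the example. The input entries are copied from the sample above.

```python
BLOCK_SIZE = 1024  # size, used and free are reported in 1K-blocks


def summarize_disks(disks: list, threshold: float = 80.0) -> list:
    """Convert block counts to bytes and flag file systems above a usage threshold."""
    summary = []
    for disk in disks:
        summary.append({
            "mount": disk["mount"],
            "size_bytes": disk["size"] * BLOCK_SIZE,
            "free_bytes": disk["free"] * BLOCK_SIZE,
            "above_threshold": disk["percentage-used"] >= threshold,
        })
    return summary


disks = [
    {"used": 3147840, "type": "ext4", "size": 4903276, "percentage-used": 65,
     "mount": "/avassa-data/supd", "free": 1739052, "filesystem": "/dev/sda1"},
    {"used": 3147840, "type": "overlay", "size": 4903276, "percentage-used": 65,
     "mount": "CONTAINER-ROOT", "free": 1739052, "filesystem": "overlay"},
]

for entry in summarize_disks(disks):
    print(entry)
```

Both sample entries report the same numbers because the container root file system and the supd state directory sit on the same underlying device in this example.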
The host metrics also provide details on disk I/O, such as reads, writes, time-spent-reading and writes-completed, as well as details on CPUs and on the temperatures of thermal zones, if made available by the Linux kernel.