
Host and application metrics

The Avassa system collects metrics related to

  1. the hosts on the site
  2. edge applications on each site

If you are a site provider, use #1 to observe the edge infrastructure proactively. If you are an application owner, use #2 to monitor the health of your applications. The metrics are published on Volga, the Avassa built-in pub/sub bus, with a number of dedicated topics per category above. Metrics are collected and kept on the site, so you need to query the specific site in order to subscribe to them. Also note that different tenants have access to different topics: an application owner cannot, for example, read host metrics, and one application owner cannot read metrics belonging to another application owner.

You can read our dedicated guide on how to monitor hosts and applications. The purpose of this section is to explain the metrics in detail.

The table below summarizes the metrics; they are further elaborated in the following subsections.

Metrics

Host metrics

  • Memory metrics: Total, free and available memory on the host.
  • Load metrics: Load average over the last 1, 5 and 15 minutes.
  • Disk metrics: Disk usage parameters for file systems used by either the Edge Enforcer (supd) or by any application managed by the Edge Enforcer.

Application metrics

  • Per container: Per individual container: memory, CPU usage, and container disk usage. Disk usage is only reported if quotas are enabled.
  • Per service: Ephemeral and persistent disk metrics per service instance. Only reported if the underlying file system has quota support.
  • Per application: Memory, CPU, and disk usage per host. Disk metrics are only reported if quotas are enabled.

The sections below explain these metrics in detail.

Host metrics

The host metrics report on the health of the hosts that the Edge Enforcer is running on. Samples related to CPU, memory and disk are collected once every 30 seconds.

Host - memory metrics

Memory metrics taken from /proc/meminfo:

  • total: Total usable RAM memory in bytes.
  • free: Free usable RAM memory in bytes.
  • available: An estimate of how much RAM memory in bytes is available for starting new applications, without swapping.
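As an illustrative sketch (not part of the Avassa API), the mapping from /proc/meminfo to these metrics can be expressed in a few lines of Python. Note that /proc/meminfo reports values in kB, while the metrics are in bytes; the sample text below is made up:

```python
# Hypothetical sketch: deriving the host memory metrics from /proc/meminfo.
# /proc/meminfo reports kB; the Avassa memory metrics are in bytes.

SAMPLE_MEMINFO = """\
MemTotal:        4004640 kB
MemFree:         1234567 kB
MemAvailable:    2345678 kB
"""

def parse_meminfo(text):
    """Map the kB-valued fields of /proc/meminfo to byte-valued metrics."""
    fields = {}
    for line in text.splitlines():
        key, rest = line.split(":", 1)
        kb = int(rest.strip().split()[0])  # numeric value before the "kB" unit
        fields[key] = kb * 1024            # convert kB to bytes
    return {
        "total": fields["MemTotal"],
        "free": fields["MemFree"],
        "available": fields["MemAvailable"],
    }

metrics = parse_meminfo(SAMPLE_MEMINFO)
```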

Host - load metrics

Average load metrics collected from /proc/loadavg. Metrics are available for the last 1, 5 and 15 minutes.

Load average indicates whether the system resources (CPU and IO) are adequately available for the processes (system load) that are running, runnable or in uninterruptible sleep states. This measures demand, which can be greater than what the system is currently processing.

A runnable process is a process that is waiting for the CPU. A process is in an uninterruptible sleep state when it is waiting for a resource and cannot be interrupted and will return from sleep only when the resource becomes available or a timeout occurs. For example, a process may be waiting for disk or network I/O. Runnable processes indicate that we are short of CPUs. Similarly, processes in an uninterruptible sleep state indicate I/O bottlenecks. The load average is the exponential moving average of the load number during the previous n minutes.

Zero means there is no load. On a system with four CPU cores, a value of four means the system is fully loaded, and a value of eight means the system is overloaded. The rule of thumb is: if the load average is consistently higher than the number of cores/threads available to the OS, the server is overloaded.

Some rules of thumb below:

  • If the averages are 0.0, then your system is idle.
  • If the 1-minute average is higher than the 5 or 15 minute averages, then load is increasing.
  • If the 1-minute average is lower than the 5 or 15 minute averages, then load is decreasing.
  • If they are higher than your CPU count, then you might have a performance problem.

Load average gives a better picture of the system's capability to meet processing requirements than CPU utilization. CPU utilization is the time the CPU is working expressed as a percentage of the total elapsed clock time; on process-intensive systems this can reach 100%. On a single CPU system, any load average greater than 1 may correspond to a CPU utilization of around 100%, but the problem grows more severe as the load average takes on higher values (1, 2, 3, ...), even though the utilization stays pinned at around 100%.

So when looking at the load average metrics, relate this to the number of CPUs available as indicated in the CPU metrics.
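The rules of thumb above can be sketched as a small Python helper (illustrative only; the function name is hypothetical, and avg1/avg5/avg15 and vcpus correspond to the loadavg and cpu metrics):

```python
def load_status(avg1, avg5, avg15, vcpus):
    """Apply the load-average rules of thumb to one sample.

    Returns (overloaded, trend): overloaded is True when the 1-minute
    average exceeds the CPU count; trend describes the load direction.
    """
    overloaded = avg1 > vcpus              # demand exceeds available cores
    if avg1 > max(avg5, avg15):
        trend = "increasing"               # recent load above longer averages
    elif avg1 < min(avg5, avg15):
        trend = "decreasing"               # recent load below longer averages
    else:
        trend = "steady"
    return overloaded, trend
```

For example, a 1-minute average of 8 on a 4-vCPU host is both overloaded and increasing relative to 5- and 15-minute averages of 4 and 2.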

Example from Volga:

{
  "time": "2023-02-14T09:11:28.343Z",
  "site": "gothenburg-bergakungen",
  ...
  "cpu": {
    "vcpus": 2,
    ...
  },
  "loadavg": {
    "total": 226,
    "running": 1,
    "avg1": 0.01,
    "avg5": 0.02,
    "avg15": 0
  },
  ...
}

Host - disk metrics

Disk metrics report a set of usage parameters for file systems used by either the Edge Enforcer (supd) or by any application managed by the Edge Enforcer. All file systems listed are from inside the Edge Enforcer container. A mount point reported as CONTAINER-ROOT indicates the usage of the Edge Enforcer container's root file system, not the host's root file system.

We have the following kinds of disk metrics:

  • The Edge Enforcer state directory (supd state dir), where configuration files and secrets are stored as files
  • Container root file systems
  • Mounted ephemeral and persistent volumes

For each of these, the Edge Enforcer collects:

  • size: Total number of 1K-blocks, where K is 1024 bytes.
  • used: Number of used 1K-blocks.
  • free: Number of available (free) 1K-blocks.
  • percentage-used: A percent value with up to 2 fractional digits, calculated as used divided by size.
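A minimal sketch of the percentage-used calculation, under the stated definition (used divided by size, as a percent with up to 2 fractional digits); the function name is illustrative:

```python
def percentage_used(used_blocks, size_blocks):
    """percentage-used: used / size as a percent, up to 2 fractional digits.

    Both arguments are 1K-block counts (K = 1024 bytes), as reported
    in the disk metrics.
    """
    return round(100 * used_blocks / size_blocks, 2)
```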

For example, looking at df on the host:

df
Filesystem     1K-blocks    Used Available Use% Mounted on
tmpfs             400464    2240    398224   1% /run
/dev/sda1        4903276 3147704   1739188  65% /
tmpfs            2002320       0   2002320   0% /dev/shm
tmpfs               5120       0      5120   0% /run/lock
/dev/sda15         99800    5184     94616   6% /boot/efi
tmpfs            2002320       0   2002320   0% /var/lib/supd/state/secrets
tmpfs            2002320       0   2002320   0% /var/lib/supd/state/dhclient/_pid
tmpfs             400464       4    400460   1% /run/user/1000

And corresponding metrics:

"disk": [
  {
    "used": 3147840,
    "type": "ext4",
    "size": 4903276,
    "percentage-used": 65,
    "mount": "/avassa-data/supd",
    "free": 1739052,
    "filesystem": "/dev/sda1"
  },
  {
    "used": 3147840,
    "type": "overlay",
    "size": 4903276,
    "percentage-used": 65,
    "mount": "CONTAINER-ROOT",
    "free": 1739052,
    "filesystem": "overlay"
  }
]

The host metrics also provide details on disk I/O (reads, writes, time-spent-reading, writes-completed, etc.), on CPUs, and on thermal-zone temperatures, where exposed by the Linux kernel.

Application metrics

Application metrics are aggregated per container, service, and application. The metrics cover CPU, memory, disk and network traffic, and are collected once every 10 seconds and aggregated every 30 seconds.

Application - per container metrics

A rich set of metrics is available per container:

  • Memory: memory used and available for the container, as well as percentage used.
  • CPU
    • nanoseconds: The total CPU usage of all tasks in the container.
    • cpus: The CPU limit for the container.
    • shares: The CPU shares limit for the container. CPU shares are a relative priority between containers: if three containers have shares 1024, 512 and 512 on a single-CPU system, they get 0.5, 0.25 and 0.25 CPU respectively.
    • percentage-used: Percentage of CPU used in relation to the limit, calculated as (nanoseconds(t2) - nanoseconds(t1)) / (sample interval in nanoseconds * cpus). For example, with a limit of 2 CPUs, 30 seconds of total usage over a 30-second interval gives 50%.
  • Container layer: Reports size, used, free and percentage used for the container layer disk. Note well: this is only reported if the underlying file system has quota support.
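The percentage-used calculation above can be sketched in Python from two nanosecond samples (the function name is illustrative; ns_t1 and ns_t2 are the nanoseconds counter at the two sample times):

```python
def cpu_percentage_used(ns_t1, ns_t2, interval_seconds, cpus):
    """Percentage of the CPU limit consumed between two samples.

    percentage = (ns(t2) - ns(t1)) / (interval in ns * cpus), as a percent.
    """
    interval_ns = interval_seconds * 1_000_000_000
    return 100 * (ns_t2 - ns_t1) / (interval_ns * cpus)
```

With a 2-CPU limit and 30 seconds of usage over a 30-second interval, this yields 50%, matching the worked example above.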

An example Volga message is shown below

"per-container": {
  "service-instance": "fifth-srv-1",
  "container": "alpine",
  "memory": {
    "used": 348160,
    "total": 2087116800,
    "percentage-used": 1
  },
  "cpu": {
    "nanoseconds": 200690282,
    "cpus": 1.0,
    "shares": 1024,
    "percentage-used": 1
  },
  "container-layer": {
    "size": 4183040,
    "used": 583868,
    "free": 3599172,
    "percentage-used": 14
  }
}

Application - per service metrics

Ephemeral and persistent disk metrics are aggregated per service. The same precondition holds as above: this is only reported if the underlying file system has quota support.

Here you will see disk metrics like this:

{
  "time": "2023-02-15T12:10:37.415Z",
  "host": "gbg-1",
  "application": "test-application",
  "per-service": {
    "service-instance": "svc-1",
    "ephemeral-volumes": [
      {
        "size": 4884,
        "used": 20,
        "free": 4864,
        "percentage-used": 1,
        "volume-name": "cfg"
      }
    ],
    "persistent-volumes": []
  }
},

The above says that the ephemeral volume named cfg has ~5 MB of storage (4884 1K-blocks), of which 1% is used.

Application - per application metrics

Application metrics are divided into host metrics and gateway-network metrics.

Host metrics:

  • memory-percentage-used: Percentage of aggregated memory used by the application on this host in relation to total available memory.
  • cpu-percentage-used: Percentage of maximum CPU used among all containers for the application on this host in relation to total available CPUs.
  • disk-percentage-used: Percentage of maximum disk used among all container layer, ephemeral and persistent volumes for the application on this host. This metric is only reported if the underlying file system has quota support

An example of application host metrics is shown below:

{
  "time": "2023-02-15T12:10:57.291Z",
  "host": "gbg-1",
  "application": "test-application",
  "per-application": {
    "hosts": [
      {
        "host": "gbg-1",
        "memory-percentage-used": 11,
        "cpu-percentage-used": 25,
        "disk-percentage-used": 3
      }
    ]
  }
},

Network metrics:

These metrics show the transmitted and received traffic on the gateway network, that is, traffic originating outside the host or leaving the host. Two relative measures are most relevant:

  • TX and RX bytes per second: Intensity of transmitted/received external traffic in bytes per second, averaged over the interval between the last two reported samples.
  • Upstream and downstream bandwidth utilization: If an upstream-bandwidth-per-host / downstream-bandwidth-per-host limit is configured for this application, this value indicates the fraction of the available bandwidth used by the application. It is based on the tx-bytes-per-second/rx-bytes-per-second metric.
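As a sketch of how these two measures relate, the per-second rate and the utilization fraction can be derived from two consecutive samples (function and parameter names are illustrative, not part of the Avassa API; shown for the TX direction only):

```python
def tx_rate_and_utilization(prev, curr, interval_seconds,
                            upstream_limit_bps=None):
    """Average TX rate between two gateway-network samples, plus the
    upstream utilization fraction if an upstream-bandwidth-per-host
    limit (in bytes per second) is configured."""
    # Average intensity over the interval between the two samples.
    tx_bps = (curr["tx-bytes"] - prev["tx-bytes"]) / interval_seconds
    # Fraction of the configured bandwidth limit in use, if any.
    utilization = tx_bps / upstream_limit_bps if upstream_limit_bps else None
    return tx_bps, utilization
```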

An example payload is shown below:

{
  "time": "2023-02-15T12:10:57.419Z",
  "host": "gbg-1",
  "application": "test-application",
  "per-application": {
    "gateway-network": {
      "tx-packets": 56577,
      "tx-bytes": 5479085,
      "rx-packets": 59612,
      "rx-bytes": 80352670,
      "tx-packets-per-second": 5,
      "tx-bytes-per-second": 562,
      "rx-packets-per-second": 6,
      "rx-bytes-per-second": 8095
    }
  }
},