How to monitor edge infrastructure and applications

The purpose of this document is to answer the question:

How do I monitor my edge infrastructure and applications?

We describe two monitoring perspectives:

  • Site provider: you own and manage the infrastructure, edge sites and hosts
  • Application owner: you manage the life-cycle of the applications

When you read this instruction you might want to dig into more details on the various metrics.

Basics

There are three primary sources for your monitoring activities:

  • React to alerts. Alerts are published on special Volga topics and represent undesirable states that require action.
  • Subscribe to telemetry data available as topics on the built-in pub/sub bus Volga.
  • Inspect the state of objects

It is essential to understand that different states and topics live in the central Control Tower and the sites respectively. On the sites, you have the site's most real-time, up-to-date state and site local telemetry. In general, most telemetry stays on the site and is not automatically pushed to Control Tower. Some coarse-grained state is propagated to Control Tower.

Also note that sites might be disconnected. In that case Control Tower holds the latest known state for those sites.

Volga topics live at the sites and in Control Tower, respectively. Some topics:

  • …exist at both but with different content; for example system:logs
  • …are exclusive to Control Tower: example system:events
  • …are exclusive to the site: example system:host-metrics and the container logs
  • …are propagated to the Control Tower: example system:site-alerts

To see which topics exist where, list them.

Topics in Control Tower:

supctl list volga topics
- system:alerts
- system:all-scheduler-events
...

Topics on a site:

supctl list --site amsterdam volga topics
- system:alerts
- system:all-scheduler-events
...

The same pattern holds for inspecting state. To get the details on an application you would have to read from the site, whereas you have a more coarse-grained latest state always available in Control Tower:

supctl show applications popcorn
name: popcorn
version: "2.0"
services:
- name: popcorn-service
mode: replicated
replicas: 1
containers:
...

While on the site you will see detailed run-time state:

supctl show --site amsterdam applications popcorn
name: popcorn
version: "2.0"
services:
- name: popcorn-service
...as above
...
service-instances:
- name: popcorn-service-1
application-version: "2.0"
oper-status: running
ready: true
host: amsterdam-001
application-network:
ips:
- 172.29.0.1/16
dns-records:
- popcorn-service-1.popcorn.internal. 15 IN A 172.29.0.1
- popcorn-service.popcorn.internal. 15 IN A 172.29.0.1
gateway-network:
ips:
- 172.28.255.2/24
ingress:
ips: []
containers:
- name: popcontainer
id: 54adc135bd9f
oper-status: running
ready: true
start-time: 2023-02-01T11:08:37.650Z
current-restarts: 0
total-restarts: 0
probes:
startup:
status: success
readiness:
status: success
liveness:
status: success

As the output above shows, the site holds runtime information about the service instances.

Please refer to the Volga reference guide for more information on each topic. To read more on the attributes of the objects, see our REST API reference guide. Also note that topics are tenant-specific. For example, an application owner does not see topics related to the infrastructure. The purpose of this document is to give overall guidance on which topics and states are helpful for which purpose.

To perform monitoring activities, the Avassa Control Tower contains the functions you need for manual work. All functions described here are also available over APIs so that you can integrate and automate the activities in your overall monitoring solution.

When it comes to state, Avassa interprets it using contextual information. How severe is the state of an object? That depends on context. For example, if a container fails within the only service instance of an application, the application fails as well and raises an application alert. However, if the application runs two service replicas and the container fails in only one of them, the application does not fail. See the illustrations below:

Single replica fails:

Single replica failing

2 replicas, one failing:

1 of 2 replicas failing
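
The two cases above can be summarised as a small rule. The following Python sketch only illustrates the behaviour described in this section; it is not Avassa's implementation. The function name and the reduction of a service instance to the strings "running" and "failed" are assumptions made for the example.

```python
def application_failed(instance_states):
    """Return True when no service instance is left running.

    instance_states: one string per service instance, either
    "running" or "failed" (a hypothetical simplification).
    """
    return len(instance_states) > 0 and all(s == "failed" for s in instance_states)


# Single replica whose container fails: the application fails too.
print(application_failed(["failed"]))             # True
# Two replicas, one failing: the application keeps running.
print(application_failed(["failed", "running"]))  # False
```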

Alert topics send discrete events for alert raise and alert clear. The latest raised alerts are available under the clock icon:

Alerts

You can use the “View all” link or the “Alerts” menu item in the left menu bar to see and search all alerts.

All alerts

Note: the Avassa system does not currently correlate clear and raise events; they are shown as individual events in time order. Furthermore, when the Edge Enforcer restarts, it sends clear alerts for all states.
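
Because raise and clear arrive as separate events, an external integration may want to fold them into a view of the alerts that are still active. Below is a minimal Python sketch of that idea. The event keys "id", "site" and "state" are hypothetical; inspect the real payload on your system (for example with the topic search) before relying on any field name.

```python
def active_alerts(events):
    """Fold a time-ordered list of alert events into the alerts still raised.

    Each event is a dict with the hypothetical keys "id", "site" and
    "state" ("raise" or "clear").
    """
    active = {}
    for event in events:
        key = (event["site"], event["id"])
        if event["state"] == "raise":
            active[key] = event
        elif event["state"] == "clear":
            # A clear without a matching raise (for example after an
            # Edge Enforcer restart) is simply ignored.
            active.pop(key, None)
    return active


events = [
    {"id": "disk-full", "site": "amsterdam", "state": "raise"},
    {"id": "app-failed", "site": "bangalore", "state": "raise"},
    {"id": "disk-full", "site": "amsterdam", "state": "clear"},
    {"id": "probe-failed", "site": "amsterdam", "state": "clear"},
]
print(sorted(active_alerts(events)))  # [('bangalore', 'app-failed')]
```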

Infrastructure monitoring

If you are a site provider, you need to monitor the health of the infrastructure, which corresponds to sites and hosts in Avassa.

We have the notion of hosts in distress, meaning that we consider the hosts unhealthy. A host can enter this state for several reasons, such as disk space issues. By providing built-in heuristics for these states we simplify the bootstrapping of your monitoring solution. If any host on a site is in distress, the site is marked red in the Control Tower UI as an indication that you should drill down. Alerts are also generated so that you are notified even if you are not looking at the “right place” in the UI.

There might sometimes be a blame-game between infrastructure teams and application teams. Therefore we provide linkage from the infrastructure to the running applications (service impact) and vice versa (root cause).

As an infrastructure owner you can see which applications and tenants exist on a site. For each site, the list to the left in the Web UI shows the number of hosts (the number within the circle), whether any hosts are in distress (red circle), whether applications are running (A), and whether tenants exist on the site (T). This guides your operational priorities: an issue on a site without running applications is less severe than one on a site with tenants and/or applications. If you select a site, you can see which applications and tenants exist on it. The same information is available over APIs; details follow further down in this document.

Control Tower topics

The easiest way to get examples from your live system is to use the topic search in the Control Tower. Click the topics icon to the left:

Perform a search there and learn the payload from your system.

You can also use supctl, for example:


supctl do volga topics system:alerts consume

See the tutorial on supctl and Volga tutorial for more examples.

The table below lists and summarizes relevant topics for infrastructure monitoring.

Topic | Description
system:alerts | Alerts related to applications, such as issues with ephemeral storage or the operational state of the application.
system:site-alerts | Alerts related to site infrastructure, such as disk usage on a host. You should react to any alert published here.
system:events | Notifications regarding hosts connecting/disconnecting. A site disconnecting might not be critical as such, but it is essential to be aware of it. No events are pushed to the Control Tower while a site is disconnected. You might want to troubleshoot why the site is not connected.
system:logs | It is a good practice to monitor for EMERGENCY, ALERT, CRITICAL, ERROR, and WARNING. Entries here mostly relate to the Avassa platform in the Control Tower. In combination with the site-alerts above, log entries with high severity in this topic are good candidates for integration into an alarm system.
system:config-events | Events about objects being created, updated, and deleted. For example, if you want to monitor sites being created, you can monitor this topic.
system:audit-trail-log | A full audit trail log of operations performed over APIs, supctl, and the Web UI. This is useful for integration into an overall audit trail system. The topic is a union of all sites’ audit trail logs, so that audit information is available even if a site is disconnected.
system:unauthenticated-audit-trail-log | Some operations, such as login, are not authenticated; their events are posted here.

An overall recommendation would be as follows:

  • Integrate system:alerts, system:site-alerts and system:logs (a subset of severity levels) into an alerting system.
  • Define your procedure for treating disconnected sites and integrate connection events accordingly.
  • Integrate system:config-events into your inventory/CMDB system to keep an updated view of your edge artifacts.
  • Integrate the audit trail log topics into your overall audit trail system.
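
Forwarding only a subset of severity levels from system:logs could look like the Python sketch below. The severity names are the ones listed in the table above. The entry keys "severity" and "msg" are hypothetical; check the actual log payload on your system before adapting this.

```python
# Severity levels recommended above for forwarding to an alerting system.
FORWARD_LEVELS = {"EMERGENCY", "ALERT", "CRITICAL", "ERROR", "WARNING"}


def should_forward(entry, levels=FORWARD_LEVELS):
    """Return True if a log entry is severe enough to forward.

    entry: dict with a hypothetical "severity" key.
    """
    return str(entry.get("severity", "")).upper() in levels


entries = [
    {"severity": "INFO", "msg": "site connected"},
    {"severity": "error", "msg": "probe failed"},
    {"severity": "CRITICAL", "msg": "volume unavailable"},
]
print([e["msg"] for e in entries if should_forward(e)])
# ['probe failed', 'volume unavailable']
```

Passing a smaller set as levels, for example only EMERGENCY, ALERT and CRITICAL, narrows what reaches the alerting system.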

Site topics

The easiest way to get examples from your live system is to use the topic search function on a site. Click sites, pick a site, and go to the topics tab:

Perform a search there and learn the payload from your system.

You can also use supctl:

supctl do --site gothenburg-bergakungen volga topics system:host-metrics consume

Topic | Description
system:site-alerts | This is the basis for your infrastructure monitoring per site. These alerts are also propagated to the Control Tower.
system:host-metrics | A highly relevant topic for monitoring your sites. Every 30 seconds, various host metrics are published, for example for disk, memory, and load.
system:logs | It is a good practice to monitor for EMERGENCY, ALERT, CRITICAL, ERROR, and WARNING. Entries might refer to the infrastructure as well as applications, depending on your tenant.

A simple recommendation is to integrate system:logs (a subset of severity levels) and system:site-alerts into an overall alerting system. The system:host-metrics topic should be integrated into a time-series analytics solution with thresholding and anomaly detection.
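
As an illustration of simple thresholding on periodically published metrics, the Python sketch below flags a value only after it has stayed above a limit for several consecutive samples, which avoids reacting to a single spike. It is a generic technique, not an Avassa feature; the class name, the 90 percent limit, and the idea of feeding it a disk usage percentage are assumptions for the example.

```python
from collections import deque


class SustainedThreshold:
    """Flag a metric once it has exceeded a limit for N consecutive samples."""

    def __init__(self, limit, samples):
        self.limit = limit
        self.window = deque(maxlen=samples)

    def update(self, value):
        # Remember whether this sample exceeded the limit, then check
        # that the window is full and every sample in it exceeded it.
        self.window.append(value > self.limit)
        return len(self.window) == self.window.maxlen and all(self.window)


# Hypothetical disk usage percentages, one per published sample.
check = SustainedThreshold(limit=90.0, samples=3)
readings = [85.0, 92.0, 95.0, 97.0, 80.0]
print([check.update(v) for v in readings])
# [False, False, False, True, False]
```

With metrics published every 30 seconds, three samples correspond to roughly 90 seconds above the limit.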

State, drill down

Many state changes for objects result in events on the topics mentioned above. Both as a drill-down from an event and as an active pull, the read requests listed below make sense from a monitoring perspective.

The following commands read state in Control Tower, as last reported from the sites. (If you run the commands with the option --verbose you will see the path that can be used in a REST call.)

supctl show system site-status sites

This lists all sites together with a summary of their status:

name: amsterdam
type: edge
descriptive-name: Amsterdam
hosts: 1
hosts-called-home: 1
hosts-in-distress: 0
cluster-established: true
connection-state:
connected: true
last-connect: 2023-02-02T12:58:21.520Z
last-disconnect: 2023-02-01T11:41:42.770Z
location:
latitude: 52.37713501238853
longitude: 4.895028823252919
tenants: 0
deployed-applications: true

name: avassahq
...
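
The site summary lends itself to a simple triage. The Python sketch below assumes the listing has already been parsed into dictionaries that use the field names shown in the output above; the parsing step, the two function names, and the sites site-b and site-c are hypothetical. It flags sites that need attention and puts those with deployed applications first.

```python
def needs_attention(site):
    """Return the reasons, if any, why a site deserves a closer look."""
    reasons = []
    if site.get("hosts-in-distress", 0) > 0:
        reasons.append("hosts in distress")
    if not site.get("connection-state", {}).get("connected", False):
        reasons.append("disconnected")
    if site.get("hosts-called-home", 0) < site.get("hosts", 0):
        reasons.append("hosts not called home")
    return reasons


def prioritize(sites):
    """List problem sites, those with deployed applications first."""
    flagged = [(s, needs_attention(s)) for s in sites]
    flagged = [(s, r) for s, r in flagged if r]
    flagged.sort(key=lambda item: (not item[0].get("deployed-applications", False),
                                   item[0]["name"]))
    return [(s["name"], r) for s, r in flagged]


sites = [
    {"name": "amsterdam", "hosts": 1, "hosts-called-home": 1, "hosts-in-distress": 0,
     "connection-state": {"connected": True}, "deployed-applications": True},
    {"name": "site-b", "hosts": 2, "hosts-called-home": 2, "hosts-in-distress": 1,
     "connection-state": {"connected": True}, "deployed-applications": False},
    {"name": "site-c", "hosts": 1, "hosts-called-home": 1, "hosts-in-distress": 0,
     "connection-state": {"connected": False}, "deployed-applications": True},
]
print(prioritize(sites))
# [('site-c', ['disconnected']), ('site-b', ['hosts in distress'])]
```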

supctl show system sites bangalore

This gives you details about the site and its hosts, such as connection information, host volumes, host interfaces, and the Edge Enforcer (container supd) version.

name: bangalore
descriptive-name: Bangalore
type: edge
location:
latitude: 13.2678728306991
longitude: 77.53790818278752
creation-time: 2023-02-01T09:11:27.122Z
domain: bangalore.wallan.edge.avassa.dev
labels:
system/type: edge
system/name: bangalore
size: large
region: Asia
country: India
topology:
parent-site: control-tower
ingress-allocation-method: pool
connection-state:
connected: true
last-connect: 2023-02-02T15:09:30.546Z
last-disconnect: 2023-02-01T11:41:34.722Z
call-home-state:
all-hosts:
called-home: 1
total: 1
controller-hosts:
called-home: 1
total: 1
cluster-established: true
hosts:
- host-id: c3f06066-e095-641f-027e-2214a28c1f2c
controller: true
local-volumes:
- name: volume1
path: /xfs
network-interfaces:
- name: default
host-interface-by-default-route: true
ingress-ipv4-address-ranges:
- range: 172.31.21.108
network-prefix-length: 20
- range: 172.31.22.92
network-prefix-length: 20
cluster-hostname: bangalore-001
hostname: hugely-assuring-quail
supd-version: 23.2.0-27467d86
last-call-home:
time: 2023-02-01T11:41:45Z
reason: startup
smbios:
board-asset-tag: i-0f373feda10d171a3
board-vendor: Amazon EC2
chassis-asset-tag: Amazon EC2
chassis-type: "1"
chassis-vendor: Amazon EC2
product-name: t3.small
product-serial: ec23a594-2bf7-135d-8110-adcaf1243f30
product-uuid: ec23a594-2bf7-135d-8110-adcaf1243f30
platform:
architecture: x86_64
total-memory: 1994064 KiB
vcpus: 2
operating-system: Debian GNU/Linux 11 (bullseye)
kernel-version: 5.10.0-21-cloud-amd64
docker:
version: 20.10.23
api-version: "1.41"
os: linux
arch: amd64
git-commit: 6051f14
components:
- name: Engine
version: 20.10.23
- name: containerd
version: 1.6.16
- name: runc
version: 1.1.4
- name: docker-init
version: 0.19.0

supctl show system sites bangalore allocations

This will give you a list of tenants and their respective applications on the site.

tenants:
- name: tenant1
applications:
- name: popcorn

You can also get a summary of the applications running on a site and their respective status. Application monitoring is covered in more detail in the next section; the point here is that, if you are responsible for the infrastructure, you can easily find out which applications are running on a site.

supctl --tabular show --site amsterdam applications --fields name,version,oper-status
name version oper-status
------- ------- -----------
popcorn 2.0 running

If you are responsible for the hosts, it is also important to see which services are running on a host, if any. For example, if you would like to upgrade the OS and want to wait until no services are running there, you can run:

supctl show --site amsterdam system cluster service-instances
- name: edge.popcorn.popcorn-service-1
tenant: edge
application: popcorn
application-version: "2.0"
oper-status: running
....

Other relevant infrastructure monitoring states are related to the call home process.

supctl show system call-home <TAB>
duplicates
unclaimed

The duplicates list shows duplicate host ids calling home. This might just be a human error, but it might also be a security issue: someone trying to reuse a host id. The unclaimed list shows hosts calling home without any matching host id in a site.

To get even more details on a host you can read the host state from the site:

supctl show --site amsterdam system cluster hosts
- cluster-hostname: amsterdam-001
version: 23.2.0-27467d86
oper-status: up
maintenance-mode: off
hostname: reasonably-famous-kitten
host-id: dec622a5-376b-769e-1481-428503b3b06d
controller: true
labels:
system/controller: "true"
device-labels: {}
devices: []
local-volumes:
- name: volume1
status: up
filesystem:
type: xfs
size-limit-support: true
is-mountpoint: true
space:
total: 1014 MiB
allocated: 0 B
free: 1014 MiB
host-network:
ip-addresses:
- 172.31.22.155
- fe80::412:f0ff:fe77:5804
site-overlay-network:
status: up
ip-addresses:
- 172.30.0.1
network-interfaces:
- name: default
os-interface-name: ens5
ingress-ipv4-address-allocation:
summary:
unallocated: 2
allocated: 0
unallocated:
ranges:
- 172.31.20.181
- 172.31.26.133
active-pool: interface-specific

Apart from the state related to the host, such as volumes and networking, it is worthwhile pointing out two states:

  • oper-status: whether, from an Avassa perspective, the host is considered up or in distress
  • maintenance-mode: there are maintenance actions, like blocking and draining a host, that let you perform maintenance such as OS upgrades. off indicates normal operation; other states can be blocked or drain.

Avassa is capable of discovering udev devices, for example cameras, and using them for application scheduling. You can read the devices from a host by issuing:

supctl do --site amsterdam system cluster hosts amsterdam-001 get-udevs

Application monitoring

We strongly recommend that you define application probes for your containers. This gives you much better insight into the health of your applications than inspecting the container state alone. Many issues are related to response times rather than total failures, and that is where application probes help. Probes not only give you application-level monitoring, they also provide it per site: one site might suffer from poor application response times while others do not.

In order to inspect the application health at a site in the UI, pick the application and thereafter the specific site:

View application state

Note the contextual state propagation: a failure of a container might propagate an error to the service, to the application running on the site, and all the way to the application in the application list to the left. Say you have the popcorn application running at 21 sites (as above) and it fails on two of them: the popcorn application is then shown in red, together with the number of failed applications. You can drill down all the way to the failing container, as in the screenshot below:

Failing application

At this stage you would probably take a number of actions in the Avassa system:

  • inspect the container logs: issue within the container logic?
  • inspect the system:logs topic: issue outside the container logic?

More on that below.

Control Tower topics

Topic | Description
system:alerts | This is a fundamental topic to subscribe to. It is also what is included in the Alert UI shown previously. Whenever an application changes state from running, an alert is generated.
system:logs | Again, a fundamental topic to monitor for errors. There will be some application-related events here, but this topic is primarily useful on the sites.
system:deployment-events | Events regarding the progress of ongoing deployments. Fine-grained information about what is happening on each site can be fetched from the site using system:scheduler-events.
system:config-events | Events about objects being created, updated, and deleted. For example, if you want to monitor applications being created or modified, you can monitor this topic.

The system:alerts and system:logs topics (only the highest severity levels) should be integrated into your alerting system. If you want an external system to monitor the progress of deployments, integrate system:deployment-events. Finally, to keep an up-to-date inventory of your application versions, you can inspect system:config-events.

Site topics

Topic | Description
system:application-metrics | This will be your favorite topic for monitoring your applications. It constantly publishes metrics on your applications’ resource usage, such as memory usage per container. It also collects network statistics for your applications.
system:logs | Again, this is a fundamental topic to monitor for errors. It is a mix of application and infrastructure events. You will, for example, see errors from configured probes here. Drill down here when troubleshooting application life-cycles.
system:scheduler-events | Detailed information on the progress of the scheduler, such as starting applications.
system:container-logs.container-id | All your containers’ output on stderr/stdout is available on this topic. The container logs are kept local to each site, but you can query across sites by using the query API.

The system:application-metrics topic should be integrated into your time-series solution for thresholding and anomaly detection. As stated previously, select the highest severity levels from system:logs and integrate them into your alerting system.

In order to have a detailed overview of the status of ongoing deployments you might want to combine Control Tower deployment events with the scheduler events from each site.

The container logs deserve a closer look. First of all, Avassa keeps the container logs in Volga at each site; they are not propagated to Control Tower. There are several reasons for this, such as traffic load and the fact that not all content is likely to be relevant centrally. Note, however, that Avassa provides a powerful query API that lets you perform queries from Control Tower across sites and across topics. You can view this as a distributed query engine for your container logs. There are therefore a number of alternatives for how to treat your container logs from a monitoring perspective; which one fits depends on your use cases and the content and format of the logs.

  • Perform scheduled distributed queries
  • Export the container logs, filtered and enriched, from the edge site to a central solution, for example CloudWatch. For this purpose we provide a Fluent Bit plugin that you can run at each site (deployed by Avassa); it performs the required local processing and then forwards the result to your central log analytics tool
  • Ad hoc searches when troubleshooting, using the Avassa UI and/or command line

An example of how to perform a topic search across sites is shown below. It searches all sites with the label city for ERROR in the container logs for a certain container.

supctl do volga query-topics --topics topic-names=system:container-logs.<NAME> filter-re-match=ERROR output-payload-only=true --match-site-labels city

State, drill-down

In order to inspect the application state on a site you can perform the following supctl command. (This is what the User Interface shows in the screen shot above)

supctl show --site gothenburg-bergakungen applications pop

Note that inspecting the application state is often combined with inspecting the container logs and system:logs in order to drill down and analyse a situation.

You can also perform actions to try to fix the issue, for example restart a service:

Restart service

Or using supctl:


supctl do --site amsterdam applications popcorn service-instances popcorn-service-1 restart

You can also use the exec-interactive action to run commands in the container:

supctl do --site amsterdam applications popcorn service-instances popcorn-service-1 containers popcontainer exec-interactive sh
/ # ps
PID USER TIME COMMAND
1 root 0:03 /sbin/docker-init -- /bin/sh -c $EXECUTABLE
7 root 0:01 kettle-popper-manager
24 root 0:00 sh
30 root 0:00 ps
/ #

The same is available in the User Interface: select the container and use the … menu.

Open exec terminal

Exec interactive