
Volga Topics

The system offers a set of built-in Volga topics that applications can consume.

Events

An item on an event topic is an event record. It's a JSON object with the following structure:

Definition of events sent on volga streams to tenants.

Name         Type           Description
event        string         The name of the specific event.
occurred-at  date-and-time  Time when the event occurred.
id           string         Unique identifier of the event instance.
tenant       string         The event occurred in the context of this tenant.
data         JSON Object    Event-specific data.
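
As a minimal sketch, assuming an event record arrives as a JSON string, the following Python checks for the five fields listed above and returns the parsed record. The function name parse_event_record and the sample payload are hypothetical, and whether every field is always present is an assumption.

```python
import json

# Field names taken from the event record table.
EVENT_FIELDS = ("event", "occurred-at", "id", "tenant", "data")


def parse_event_record(raw: str) -> dict:
    """Parse one event record and report any documented field that is absent."""
    record = json.loads(raw)
    missing = [name for name in EVENT_FIELDS if name not in record]
    if missing:
        raise ValueError(f"event record lacks fields: {missing}")
    return record


# Hypothetical payload: every value below is invented for illustration.
sample = json.dumps({
    "event": "site-connected",
    "occurred-at": "2024-01-01T12:00:00Z",
    "id": "example-id",
    "tenant": "example-tenant",
    "data": {"site": "example-site"},
})
record = parse_event_record(sample)
print(record["event"], record["occurred-at"])
```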

Alerts

An item on an alert topic is an alert record. It's a JSON object with the following structure:

Definition of alerts sent on volga streams to tenants.

Name         Type           Description
alert        string         The name of the specific alert.
time         date-and-time  The time the alert was generated.
id           string         The site unique id of the alert.
site         name           The name of the site where the alert was generated.
severity     enumeration    The severity of the alert.
  • warning
    The 'warning' severity level indicates the detection of a
    potential or impending service-affecting fault, before any
    significant effects have been felt. Action should be
    taken to further diagnose (if necessary) and correct the
    problem in order to prevent it from becoming a more
    serious service-affecting fault.
  • minor
    The 'minor' severity level indicates the existence of a
    non-service-affecting fault condition and that corrective
    action should be taken in order to prevent a more serious
    (for example, service-affecting) fault. Such a severity
    can be reported, for example, when the detected alarm
    condition is not currently degrading the capacity of the
    resource.
  • major
    The 'major' severity level indicates that a
    service-affecting condition has developed and an urgent
    corrective action is required. Such a severity can be
    reported, for example, when there is a severe degradation
    in the capability of the resource and its full capability
    must be restored.
  • critical
    The 'critical' severity level indicates that a
    service-affecting condition has occurred and an immediate
    corrective action is required. Such a severity can be
    reported, for example, when a resource becomes totally out
    of service and its capability must be restored.
description  string         The description of the reason behind the alert.
expiry-time  date-and-time  After this date and time this alert should be considered expired.
cleared      boolean        Is true if the alert has been cleared, otherwise false.
data         JSON Object    Alert-specific data.
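
A consumer will often want only alerts that are still open and above some severity. The sketch below shows one way to filter on the severity and cleared fields. It assumes that the order in which the table lists the severities runs from least to most severe. The function name needs_attention and the sample alerts are hypothetical.

```python
# Severity names in the order the alert record table lists them.
# Assumption: that order runs from least to most severe.
SEVERITY_ORDER = ("warning", "minor", "major", "critical")


def needs_attention(alert: dict, minimum: str = "major") -> bool:
    """True if the alert is not cleared and is at least `minimum` severe."""
    if alert.get("cleared", False):
        return False
    return SEVERITY_ORDER.index(alert["severity"]) >= SEVERITY_ORDER.index(minimum)


# Hypothetical alert records, reduced to the fields used here.
alerts = [
    {"alert": "application-error", "severity": "critical", "cleared": False},
    {"alert": "custom-alert", "severity": "minor", "cleared": False},
    {"alert": "application-error", "severity": "major", "cleared": True},
]
open_alerts = [a["alert"] for a in alerts if needs_attention(a)]
print(open_alerts)
```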

Threshold alerts

For the threshold alerts container-layer-threshold-reached, ephemeral-volume-threshold-reached, persistent-volume-threshold-reached and disk-threshold-reached, the levels are:

  • CRITICAL: 100%
  • MAJOR: 90%
  • WARNING: 80%
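
The mapping from usage to level can be sketched as below. This assumes a level counts as reached when percentage-used is at or above the listed value, which the text does not state explicitly. The function name threshold_severity is hypothetical, and the lower-case names mirror the severity enumeration of the alert record.

```python
from typing import Optional

# Threshold levels from the list above, checked from highest to lowest.
THRESHOLDS = ((100.0, "critical"), (90.0, "major"), (80.0, "warning"))


def threshold_severity(percentage_used: float) -> Optional[str]:
    """Return the highest level reached, or None below the warning level."""
    for limit, severity in THRESHOLDS:
        # Assumption: "reached" means at or above the listed percentage.
        if percentage_used >= limit:
            return severity
    return None


for value in (50.0, 85.0, 95.0, 100.0):
    print(value, threshold_severity(value))
```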

Tenant specific events

An item on a tenant specific events topic is an event record. It's a JSON object described in the Events section.

Topic: system:scheduler-events

Available on every site.

This topic contains scheduler related events on the local site.

Event: application-status-changed

This event is generated when the oper-status of an application changes. The oper-status of an application is an aggregated status of all service instances in the application.

In some cases, this event may be generated even if the oper-status hasn't been changed. Clients need to be prepared for that.

Event-specific data:

Name                    Type         Description
application             name         The name of the application.
oper-status             enumeration  The oper-status of the application.
  • waiting-for-images
    Waiting for images to be automatically downloaded.
  • starting
    The application's service instances are being started.
    This is a combined status to make it easier to find
    problematic applications. For detailed information
    see the service-instances list.
  • running
    All the application's service instances have oper-status
    running.
  • upgrading
    The application is being upgraded.
  • error
    Some of the application's service instances are in
    an error state, or no service instances could be created.
    This is a combined status to make it easier to find
    problematic applications. For detailed information see
    the service-instances list.
application-version     string       The version of the application.
application-deployment  string       The name of the application deployment to which the application belongs.
site                    name         The name of the site where the application is running.
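
Because this event may arrive even when the oper-status is unchanged, a client can remember the last status it saw and ignore repeats. A minimal sketch, assuming the event-specific data is available as a dict; the class name StatusTracker and the sample values are hypothetical.

```python
class StatusTracker:
    """Remembers the last oper-status seen per (site, application)."""

    def __init__(self) -> None:
        self._last = {}

    def observe(self, data: dict) -> bool:
        """Record the status; return True only if it differs from the last one."""
        key = (data["site"], data["application"])
        status = data["oper-status"]
        changed = self._last.get(key) != status
        self._last[key] = status
        return changed


tracker = StatusTracker()
# Hypothetical event-specific data; the second event repeats the first.
for status in ("starting", "starting", "running"):
    data = {"site": "example-site", "application": "shop", "oper-status": status}
    print(status, tracker.observe(data))
```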

Event: container-completed

This event is generated when an init container has completed successfully.

Event-specific data:

Name                 Type    Description
application          name    The name of the application.
application-version  name    The version of the application.
service-name         name    The name of the service.
service-instance     name    The name of the service instance.
container-name       name    The name of the container.
container-id         string  The internal id of the container.
container-image      name    The name of the image that the container is running.
site                 name    The name of the site where the event was generated.
hostname             name    The name of the host where the event was generated.

Event: container-failed

This event is generated when the system detects that a container has failed; either because it exited, or because its startup or liveness probe failed.

Event-specific data:

Name                 Type         Description
application          name         The name of the application.
application-version  name         The version of the application.
service-name         name         The name of the service.
service-instance     name         The name of the service instance.
container-name       name         The name of the container.
container-id         string       The internal id of the container.
container-image      name         The name of the image that the container is running.
site                 name         The name of the site where the event was generated.
hostname             name         The name of the host where the event was generated.
reason               enumeration
  • exited
    The container process exited.
  • startup-probe
    The startup probe failed.
  • liveness-probe
    The liveness probe failed.
restart-after        duration     Time until the container is restarted.
    A duration in years, days, hours, minutes and seconds.
    Format is [<digits>y][<digits>d][<digits>h][<digits>m][<digits>s].
    Examples: 1y2d5h, 5h or 10m30s
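
A duration string such as 1y2d5h can be split into its parts with a regular expression. The sketch below accepts the units y, d, h, m and s, matching the description and the examples. It returns the parts rather than a number of seconds, since the length of a year is not defined here. The function name parse_duration is hypothetical.

```python
import re

# Optional groups in fixed order: years, days, hours, minutes, seconds.
_DURATION = re.compile(
    r"(?:(?P<y>\d+)y)?(?:(?P<d>\d+)d)?(?:(?P<h>\d+)h)?"
    r"(?:(?P<m>\d+)m)?(?:(?P<s>\d+)s)?"
)


def parse_duration(text: str) -> dict:
    """Split a duration string into {unit: value} for the units present."""
    match = _DURATION.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"not a duration: {text!r}")
    return {
        unit: int(value)
        for unit, value in match.groupdict().items()
        if value is not None
    }


for example in ("1y2d5h", "5h", "10m30s"):
    print(example, parse_duration(example))
```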

Event: container-ready

This event is generated when a container's readiness probe is successful. If no readiness probe is configured, this event is generated as soon as the container has started.

Event-specific data:

Name                 Type    Description
application          name    The name of the application.
application-version  name    The version of the application.
service-name         name    The name of the service.
service-instance     name    The name of the service instance.
container-name       name    The name of the container.
container-id         string  The internal id of the container.
container-image      name    The name of the image that the container is running.
site                 name    The name of the site where the event was generated.
hostname             name    The name of the host where the event was generated.

Event: container-starting

This event is generated when the container has been created, and just before it is started.

Event-specific data:

Name                 Type    Description
application          name    The name of the application.
application-version  name    The version of the application.
service-name         name    The name of the service.
service-instance     name    The name of the service instance.
container-name       name    The name of the container.
container-id         string  The internal id of the container.
container-image      name    The name of the image that the container is running.
site                 name    The name of the site where the event was generated.
hostname             name    The name of the host where the event was generated.

Event: container-stopped

This event is generated when the system has stopped the container for any reason.

Event-specific data:

Name                 Type    Description
application          name    The name of the application.
application-version  name    The version of the application.
service-name         name    The name of the service.
service-instance     name    The name of the service instance.
container-name       name    The name of the container.
container-id         string  The internal id of the container.
container-image      name    The name of the image that the container is running.
site                 name    The name of the site where the event was generated.
hostname             name    The name of the host where the event was generated.

Event: container-unready

This event is generated when a container's readiness probe fails.

Event-specific data:

Name                 Type    Description
application          name    The name of the application.
application-version  name    The version of the application.
service-name         name    The name of the service.
service-instance     name    The name of the service instance.
container-name       name    The name of the container.
container-id         string  The internal id of the container.
container-image      name    The name of the image that the container is running.
site                 name    The name of the site where the event was generated.
hostname             name    The name of the host where the event was generated.

Event: service-instance-creation-failed

This event is generated when no service instances could be created for an application.

Event-specific data:

Name                 Type    Description
application          name    The name of the application.
application-version  string  The version of the application.
service-name         name    The name of the service.
site                 name    The name of the site where the event was generated.
error-message        string  Additional information about the failure to create service instances.

Event: service-instance-failed

This event is generated when some container in a service failed, or when the service instance could not be scheduled or started.

Event-specific data:

Name                 Type                 Description
application          name                 The name of the application.
application-version  string               The version of the application.
service-name         name                 The name of the service.
service-instance     name                 The name of the service instance.
site                 name                 The name of the site where the event was generated.
hostname             name                 The name of the host where the event was generated.
application-ips      array of ip-address  The IP addresses assigned to this service instance on the application network.
ingress-ips          array of ip-address  Ingress IP addresses assigned to this service instance.
container-name       name                 The name of the failing container.
reason               enumeration
  • scheduling-error
    The service instance could not be scheduled to any host.
  • preparation-error
    No containers were started due to a
    preparation error, e.g., an error when preparing the
    application network or an error when preparing the volumes.
  • exited
    The container process exited.
  • startup-probe
    The startup probe failed.
  • liveness-probe
    The liveness probe failed.

Event: service-instance-ready

This event is generated when all containers in a service are ready, i.e., their readiness probes succeeded.

Event-specific data:

Name                 Type                 Description
application          name                 The name of the application.
application-version  string               The version of the application.
service-name         name                 The name of the service.
service-instance     name                 The name of the service instance.
site                 name                 The name of the site where the event was generated.
hostname             name                 The name of the host where the event was generated.
application-ips      array of ip-address  The IP addresses assigned to this service instance on the application network.
ingress-ips          array of ip-address  Ingress IP addresses assigned to this service instance.

Event: service-instance-starting

This event is generated just before the containers in a service instance are started.

Event-specific data:

Name                 Type                 Description
application          name                 The name of the application.
application-version  string               The version of the application.
service-name         name                 The name of the service.
service-instance     name                 The name of the service instance.
site                 name                 The name of the site where the event was generated.
hostname             name                 The name of the host where the event was generated.
application-ips      array of ip-address  The IP addresses assigned to this service instance on the application network.
ingress-ips          array of ip-address  Ingress IP addresses assigned to this service instance.

Event: service-instance-stopped

This event is generated when all containers in a service instance have been stopped.

Event-specific data:

Name                 Type                 Description
application          name                 The name of the application.
application-version  string               The version of the application.
service-name         name                 The name of the service.
service-instance     name                 The name of the service instance.
site                 name                 The name of the site where the event was generated.
hostname             name                 The name of the host where the event was generated.
application-ips      array of ip-address  The IP addresses assigned to this service instance on the application network.
ingress-ips          array of ip-address  Ingress IP addresses assigned to this service instance.
reason               enumeration
  • upgrading
    The service instance is stopped as part of being
    upgraded.
  • admin-down
    The service instance is stopped due to an administrative
    action, e.g., draining the host.
  • mounted-file-change
    The service instance is restarted because a mounted file
    was changed.
  • secret-change
    The service instance is restarted because a secret that
    is in the environment or on the command line was
    changed.
  • config-change
    The service instance is restarted because some
    configuration parameter that is used in the environment
    or on the command line was changed.
  • restart
    The service instance is stopped due to some administrative
    action, e.g., an explicit call to restart.
  • removed
    The service instance has been removed from the host.
  • rescheduled
    The service instance has been rescheduled, due to an invocation of
    the reschedule action.

Event: service-instance-unready

This event is generated when one or more containers in a service are unready; i.e., their readiness probes failed.

Event-specific data:

Name                 Type                 Description
application          name                 The name of the application.
application-version  string               The version of the application.
service-name         name                 The name of the service.
service-instance     name                 The name of the service instance.
site                 name                 The name of the site where the event was generated.
hostname             name                 The name of the host where the event was generated.
application-ips      array of ip-address  The IP addresses assigned to this service instance on the application network.
ingress-ips          array of ip-address  Ingress IP addresses assigned to this service instance.

Event: service-instance-updated

This event is generated when a service instance's specification has been updated.

Event-specific data:

Name                 Type                 Description
application          name                 The name of the application.
application-version  string               The version of the application.
service-name         name                 The name of the service.
service-instance     name                 The name of the service instance.
site                 name                 The name of the site where the event was generated.
hostname             name                 The name of the host where the event was generated.
application-ips      array of ip-address  The IP addresses assigned to this service instance on the application network.
ingress-ips          array of ip-address  Ingress IP addresses assigned to this service instance.

Topic: system:deployment-events

Available only on the top site.

This topic contains application deployment related events.

Event: application-summary-status-changed

Sent when a site has reported that an application deployed to the site has changed its oper-status.

Event-specific data:

Name                    Type           Description
application-deployment  name           The name of the application deployment.
application             name           The name of the application.
application-version     string         The version of the application.
is-canary               empty          Present if the event relates to deployment to a canary site.
sites                   array of name  The affected sites.
oper-status             enumeration    This is an aggregated status of the application.
  • waiting-for-distribution
    The application is being distributed to all selected sites.
  • waiting-to-be-scheduled
    The application is not yet started on at least one site.
  • starting
    The application is not yet running on at least one site.
  • running
    The application is running without an error on all sites.
  • error
    The application's oper-status is error on at least one site.
error-message           string         Gives additional information when oper-status is error.

Event: deployment-failed

Sent when a set of sites have reported that the application has failed to be deployed at the sites.

These sites are moved to the deploy-failed list in the application deployment's state.

Event-specific data:

Name                    Type           Description
application-deployment  name           The name of the application deployment.
application             name           The name of the application.
application-version     string         The version of the application.
is-canary               empty          Present if the event relates to deployment to a canary site.
sites                   array of name  The affected sites.
error-message           string         Error message describing the failure.

Event: deployment-initiated

Sent when a set of sites have been instructed to deploy a specific version of an application.

These sites are moved to the deploying-to list in the application deployment's state.

Event-specific data:

Name                    Type           Description
application-deployment  name           The name of the application deployment.
application             name           The name of the application.
application-version     string         The version of the application.
is-canary               empty          Present if the event relates to deployment to a canary site.
sites                   array of name  The affected sites.

Event: deployment-sites-removed

Sent when an application is no longer deployed to a set of sites.

When this event is sent, the sites have been instructed to remove the application.

Event-specific data:

Name                    Type           Description
application-deployment  name           The name of the application deployment.
application             name           The name of the application.
application-version     string         The version of the application.
is-canary               empty          Present if the event relates to deployment to a canary site.
sites                   array of name  The affected sites.

Event: deployment-succeeded

Sent when a set of sites have reported that the application has been successfully deployed at the sites. Successfully deployed means that all service instances are in state running.

These sites are moved to the deployed-to list in the application deployment's state.

Event-specific data:

Name                    Type           Description
application-deployment  name           The name of the application deployment.
application             name           The name of the application.
application-version     string         The version of the application.
is-canary               empty          Present if the event relates to deployment to a canary site.
sites                   array of name  The affected sites.
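
The deployment events describe how sites move between the deploying-to, deployed-to and deploy-failed lists. A consumer could mirror those lists locally, as sketched below for a single application deployment. The function name apply_deployment_event and the site names are hypothetical. The sketch assumes that a site belongs to at most one list and that deployment-sites-removed drops the sites from all of them.

```python
# Lists in the application deployment's state, keyed by the event that
# moves sites into them (taken from the event descriptions above).
TARGET_LIST = {
    "deployment-initiated": "deploying-to",
    "deployment-succeeded": "deployed-to",
    "deployment-failed": "deploy-failed",
}
REMOVED = "deployment-sites-removed"


def apply_deployment_event(state: dict, record: dict) -> dict:
    """Update a local {list name: set of sites} mirror from one event record."""
    event = record["event"]
    if event not in TARGET_LIST and event != REMOVED:
        return state  # e.g. application-summary-status-changed
    sites = set(record["data"]["sites"])
    # Assumption: a site is in at most one list, so clear it everywhere first.
    for members in state.values():
        members.difference_update(sites)
    if event in TARGET_LIST:
        state.setdefault(TARGET_LIST[event], set()).update(sites)
    return state


state = {}
for event, sites in [
    ("deployment-initiated", ["site-a", "site-b"]),
    ("deployment-succeeded", ["site-a"]),
    ("deployment-failed", ["site-b"]),
]:
    apply_deployment_event(state, {"event": event, "data": {"sites": sites}})
print(state)
```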

Topic: system:config-events

Available only on the top site.

This topic contains configuration change events.

Event: config-created

Sent when a new config object was created.

Event-specific data:

Name      Type    Description
username  string  Identifies the user that initiated the config change.
object    name    Kind of config object.
path      string  Identifier of the unique object.

Event: config-deleted

Sent when a config object was deleted.

Event-specific data:

Name      Type    Description
username  string  Identifies the user that initiated the config change.
object    name    Kind of config object.
path      string  Identifier of the unique object.

Event: config-updated

Sent when a config object was updated.

Event-specific data:

Name      Type    Description
username  string  Identifies the user that initiated the config change.
object    name    Kind of config object.
path      string  Identifier of the unique object.

Topic: system:events

Available only on the top site.

This topic contains important events from the system.

Event: site-connected

This event is generated at the top site when an edge site connects.

Event-specific data:

Name  Type  Description
site  name

Event: site-disconnected

This event is generated at the top site when the connection to an edge site is lost.

Event-specific data:

Name  Type  Description
site  name

Tenant specific application metrics

Topic: system:application-metrics

Available on every site except on top sites.

This topic contains application related metrics for the tenant.

JSON Object

The system samples metrics related to cpu, memory and network traffic once every 10 seconds. Every 30 seconds, related samples are collected into this object.

Metrics are sampled on three different levels: per container, per service instance, or for the whole application on a host.

Name     Type                  Description
site     name                  The name of the site where the metrics originated from.
tenant   name                  The name of the tenant.
entries  array of JSON Object  A set of samples for this tenant. See metrics-entry JSON Object.

metrics-entry JSON Object

Name              Type           Description
time              date-and-time  The time the sample was taken.
host              string         The host where the sample was taken.
application       name           The name of the application that was sampled.
per-container OR  JSON Object    See per-container-metrics.
per-service OR    JSON Object    See per-service-metrics.
per-application   JSON Object    See per-application-metrics.
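
Each metrics entry carries one of the three level objects. A consumer can dispatch on whichever key is present, as in the sketch below. The function name entry_level and the sample entry are hypothetical, and the sketch assumes that exactly one of the three keys appears in an entry.

```python
# Keys that select the metrics level of an entry, from the table above.
LEVEL_KEYS = ("per-container", "per-service", "per-application")


def entry_level(entry: dict):
    """Return (level key, level object) for one metrics entry."""
    present = [key for key in LEVEL_KEYS if key in entry]
    # Assumption: exactly one level object is present per entry.
    if len(present) != 1:
        raise ValueError(f"expected one level key, found: {present}")
    key = present[0]
    return key, entry[key]


# Hypothetical entry; all values are invented for illustration.
entry = {
    "time": "2024-01-01T12:00:00Z",
    "host": "host-1",
    "application": "shop",
    "per-service": {"service-instance": "web-1"},
}
level, metrics = entry_level(entry)
print(level, metrics)
```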


per-container-metrics JSON Object

Name              Type         Description
service-instance  name         The name of the service instance.
container         name         The name of the container.
memory            JSON Object  Memory metric data. See memory-metric JSON Object.
cpu               JSON Object  CPU metric data. See cpu-metric JSON Object.
container-layer   JSON Object  Container layer storage metric data. See disk-metric JSON Object.
    This section of metrics is only available if the underlying file
    system has support for quota.

memory-metric JSON Object

Name             Type     Description
used             uint64   The memory used by the container in bytes, at the time the sample was taken.
total            uint64   The total memory available in bytes for the container.
percentage-used  percent  Memory used by the container as a percentage of total.
    A percent value with up to 2 fractional digits. For example 12.77%.

cpu-metric JSON Object

Name             Type       Description
nanoseconds      uint64     The total number of CPU nanoseconds used by the container.
cpus             decimal64  The CPUs limit for the container, i.e., the maximum number of CPUs used by the container.
shares           uint16     The CPU shares limit for the container.
percentage-used  percent    Percentage of CPU used in relation to limits.
    A percent value with up to 2 fractional digits. For example 12.77%.

disk-metric JSON Object

Name             Type     Description
size             uint64   Total number of 1K-blocks where K is 1024 bytes.
used             gauge64  Number of used 1K-blocks where K is 1024 bytes.
free             gauge64  Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-used  percent  Percentage of used divided by size.
    A percent value with up to 2 fractional digits. For example 12.77%.
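
Since disk metrics are reported in 1K-blocks, converting to bytes means multiplying by 1024, and percentage-used is used divided by size. A minimal sketch of that arithmetic follows. The function name disk_usage and the sample numbers are hypothetical, and the rounding to two digits mirrors the percent type but may not match the system's own rounding.

```python
BLOCK_SIZE = 1024  # bytes per 1K-block, as stated in the table


def disk_usage(metric: dict) -> dict:
    """Convert a disk-metric object to bytes and recompute percentage-used."""
    size, used = metric["size"], metric["used"]
    return {
        "size-bytes": size * BLOCK_SIZE,
        "used-bytes": used * BLOCK_SIZE,
        "free-bytes": metric["free"] * BLOCK_SIZE,
        # Guard against a zero-sized volume to avoid division by zero.
        "percentage-used": round(100.0 * used / size, 2) if size else 0.0,
    }


# Hypothetical sample: 1000 blocks in total, 250 of them in use.
usage = disk_usage({"size": 1000, "used": 250, "free": 750})
print(usage)
```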

per-service-metrics JSON Object

Name                Type                  Description
service-instance    name                  The name of the service instance.
ephemeral-volumes   array of JSON Object  Ephemeral volume storage parameters. See disk-volume-metric JSON Object.
    Ephemeral volume parameters are only available for volumes where
    the underlying file system has support for quota.
persistent-volumes  array of JSON Object  Persistent volume storage parameters. See disk-volume-metric JSON Object.
    Persistent volume parameters are only available for volumes where
    the underlying file system has support for quota.

disk-volume-metric JSON Object

Name             Type     Description
size             uint64   Total number of 1K-blocks where K is 1024 bytes.
used             gauge64  Number of used 1K-blocks where K is 1024 bytes.
free             gauge64  Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-used  percent  Percentage of used divided by size.
    A percent value with up to 2 fractional digits. For example 12.77%.
volume-name      string

per-application-metrics JSON Object

Per-application metrics, aggregated for all service instances that are part of the same application.

Name             Type                  Description
gateway-network  JSON Object           See gateway-network-metrics JSON Object.
hosts            array of JSON Object  Resource metrics aggregated per application per host. See host-metric JSON Object.

gateway-network-metrics JSON Object

Name                              Type       Description
tx-packets                        uint64     Transmitted external traffic on the gateway network, i.e.,
    traffic leaving the host.
    Note that transmitted traffic is counted before bandwidth
    restrictions are applied. This is usually not a problem for
    protocols with flow-control, but in case of an application
    sending uncontrollably large amount of traffic with the
    destination outside the host the traffic is included here.
tx-bytes                          uint64     Transmitted external traffic on the gateway network, i.e.,
    traffic leaving the host. The value represents frame
    payload, Layer 2 header not included.
    Note that transmitted traffic is counted before bandwidth
    restrictions are applied. This is usually not a problem for
    protocols with flow-control, but in case of an application
    sending uncontrollably large amount of traffic with the
    destination outside the host the traffic is included here.
rx-packets                        uint64     Received external traffic on the gateway network, i.e.,
    traffic originated outside the host. Only the traffic allowed
    by the firewall is included, i.e. traffic on the open ingress
    ports and traffic on established connections.
    Note that received traffic is counted before bandwidth
    restrictions are applied. This is usually not a problem for
    protocols with flow-control, but in case of an external party
    sending uncontrollably large amount of traffic to the
    application the traffic is included here.
rx-bytes                          uint64     Received external traffic on the gateway network, i.e.,
    traffic originated outside the host. The value represents
    frame payload, Layer 2 header not included. Only the traffic
    allowed by the firewall is included, i.e. traffic on the open
    ingress ports and traffic on established connections.
    Note that received traffic is counted before bandwidth
    restrictions are applied. This is usually not a problem for
    protocols with flow-control, but in case of an external party
    sending uncontrollably large amount of traffic to the
    application the traffic is included here.
tx-packets-per-second             uint64     Intensity of transmitted external traffic in packets per second. This
    value is the average intensity over the interval between the last two
    reported samples. Hence it is not reported when there are not enough
    recent samples available.
tx-bytes-per-second               uint64     Intensity of transmitted external traffic in bytes per second. This
    value is the average intensity over the interval between the last two
    reported samples. Hence it is not reported when there are not enough
    recent samples available.
rx-packets-per-second             uint64     Intensity of received external traffic in packets per second. This
    value is the average intensity over the interval between the last two
    reported samples. Hence it is not reported when there are not enough
    recent samples available.
rx-bytes-per-second               uint64     Intensity of received external traffic in bytes per second. This value
    is the average intensity over the interval between the last two
    reported samples. Hence it is not reported when there are not enough
    recent samples available.
upstream-bandwidth-utilization    decimal64  If an upstream-bandwidth-per-host is configured for this application,
    then this value indicates the fraction of the available bandwidth
    used by the application. This value is based on the
    tx-bytes-per-second metric.
downstream-bandwidth-utilization  decimal64  If a downstream-bandwidth-per-host is configured for this
    application, then this value indicates the fraction of the available
    bandwidth used by the application. This value is based on the
    rx-bytes-per-second metric.
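
The per-second values are averages over the interval between the last two reported samples, which is why they are absent when only one sample exists. The arithmetic can be sketched as follows. It assumes the tx and rx counters are cumulative, which the table does not state. The function name average_rate and the sample values are hypothetical.

```python
from typing import Optional, Tuple

Sample = Tuple[float, int]  # (time in seconds, cumulative counter value)


def average_rate(previous: Optional[Sample], current: Sample) -> Optional[float]:
    """Average per-second rate between two samples, or None if unavailable."""
    if previous is None:
        return None  # not enough recent samples
    elapsed = current[0] - previous[0]
    if elapsed <= 0:
        return None
    return (current[1] - previous[1]) / elapsed


# Hypothetical tx-bytes samples taken 30 seconds apart.
print(average_rate((0.0, 1_000), (30.0, 4_000)))
print(average_rate(None, (30.0, 4_000)))
```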

host-metric JSON Object

Name                    Type     Description
host                    string   The application metrics are from this host.
memory-percentage-used  percent  Percentage of aggregated memory used by the application on this host
    in relation to total available memory.
    A percent value with up to 2 fractional digits. For example 12.77%.
cpu-percentage-used     percent  Percentage of maximum CPU used among all containers for the
    application on this host in relation to total available CPUs.
    A percent value with up to 2 fractional digits. For example 12.77%.
disk-percentage-used    percent  Percentage of maximum disk used among all container layers, ephemeral
    and persistent volumes with quota support for the application on
    this host.
    A percent value with up to 2 fractional digits. For example 12.77%.
    This metric is not available if no file system used by this
    application supports quota.

Tenant specific audit trail log

The audit trail log records all operations performed by a tenant. The log includes the access token, if provided, the operation, and any parameters provided.

In order to protect sensitive data, e.g., tokens and secrets, all such data is hashed using a tenant-specific HMAC before being logged. To search for some specific sensitive data in the logs, e.g., operations performed using a specific access token, the plain text version of the data can be hashed using the strongbox/audit/hmac operation.

Audit trail logs are streamed upwards from edge sites to the top site. This allows inspection of audit logs even if a site is compromised.

Topic: system:audit-trail-log

Available on every site.

This log contains all authenticated requests for the tenant.

JSON Object

Name                Type           Description
occurred-at         date-and-time  Time when access occurred.
request-time-ms     gauge64        Time to process request in milliseconds.
user                string         User or Approle that performed the access.
path                string         Path that was accessed.
query               string         Any supplied URL query parameters.
method              string         HTTP method that was used.
status              uint32         HTTP response status.
status-info         string         Text representation of status.
request-parameters  JSON Object    The parameters included in the request, if any, i.e., the request body.
client-ip           string         Address from which the client accessed the host.
x-forwarded-for     string         If the request went through a load balancer, the client's real
    IP may appear as x-forwarded-for. It may be a list of addresses
    where the first address should be the client's original address.
x-real-ip           string         If the request went through a load balancer, the client's real
    IP may appear as x-real-ip. It may be a list of addresses
    where the first address should be the client's original address.
site                string         Site where the access occurred.
host                string         Host on which the access occurred.
tenant              name           Tenant that performed the access.
token               string         Hashed representation of the access token. It can be
    used when identifying a specific session, and all
    accesses/operations using the same token.
accessor            string         Token accessor that can be used to terminate an ongoing
    session.
from                string         The 'from' request header from the originating HTTP request.
user-agent          string         User-agent that performed the request, e.g., curl or firefox.
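
When a request passed through a load balancer, the original client address should be the first entry of x-forwarded-for or x-real-ip. The sketch below picks that address and falls back to client-ip. It assumes the list is comma-separated, as is conventional for the X-Forwarded-For header; the log format itself does not say so. The function name original_client_ip and the addresses are hypothetical.

```python
from typing import Optional


def original_client_ip(entry: dict) -> Optional[str]:
    """Best guess at the client's original address for one audit log entry."""
    for field in ("x-forwarded-for", "x-real-ip"):
        value = entry.get(field)
        if value:
            # Assumption: multiple addresses are separated by commas.
            first = value.split(",")[0].strip()
            if first:
                return first
    return entry.get("client-ip")


# Hypothetical entries using documentation-only address ranges.
proxied = {"client-ip": "10.0.0.1", "x-forwarded-for": "203.0.113.7, 10.0.0.1"}
direct = {"client-ip": "198.51.100.4"}
print(original_client_ip(proxied), original_client_ip(direct))
```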

Tenant specific logs

An item on a log topic is a string.

Topic: system:container-logs:CONTAINER-ID

Available on the host where the container is running.

There is one topic per running container. It contains all output from standard output and standard error from the container.

The CONTAINER-ID is a string in the format APPLICATIONNAME.SERVICENAME-IX.CONTAINERNAME, where IX is a numeric index, one per service instance (replica).
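
The CONTAINER-ID can be taken apart with a regular expression, as sketched below. The sketch assumes that none of the three names contains a dot, which is not stated here; hyphens in the service name are handled. The function name parse_container_id and the sample id are hypothetical.

```python
import re

# APPLICATIONNAME.SERVICENAME-IX.CONTAINERNAME
# Assumption: the names themselves contain no dots.
_CONTAINER_ID = re.compile(
    r"(?P<application>[^.]+)\.(?P<service>[^.]+)-(?P<index>\d+)\.(?P<container>[^.]+)"
)


def parse_container_id(container_id: str) -> dict:
    """Split a CONTAINER-ID into application, service, index and container."""
    match = _CONTAINER_ID.fullmatch(container_id)
    if match is None:
        raise ValueError(f"unexpected CONTAINER-ID: {container_id!r}")
    parts = match.groupdict()
    parts["index"] = int(parts["index"])
    return parts


# Hypothetical id; the topic name is the documented prefix plus the id.
container_id = "shop.web-server-2.nginx"
print("system:container-logs:" + container_id)
print(parse_container_id(container_id))
```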

The default behavior, if a container is rescheduled to another host, is to delete the container log topic for the previous container. However, if container-log-archive is set to true in the application specification, the container log is instead appended with a timestamp and becomes read-only.

An archived container log will continue to exist on the original host for container-log-archive-days.

Topic: system:logs

Available on every site.

This topic contains tenant-specific log info generated by the Avassa system. Each log item has the format

<LEVEL> (TENANT) DATE TIME HOSTNAME SRCFILE PID

Where LEVEL is one of EMERGENCY, ALERT, CRITICAL, ERROR, WARNING, NOTICE, INFO, DEBUG.
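
A consumer may want to keep only log items at or above some level. The sketch below ranks the levels in the order listed above, assuming that order runs from most to least severe (as in syslog), and reads the level from the start of a line. Whether the angle brackets around LEVEL are literal is not stated, so the pattern accepts both forms. The function names and the sample line are hypothetical.

```python
import re
from typing import Optional

# Levels in the order the text lists them.
# Assumption: the order runs from most to least severe, as in syslog.
LEVELS = ("EMERGENCY", "ALERT", "CRITICAL", "ERROR",
          "WARNING", "NOTICE", "INFO", "DEBUG")

# Accepts "<ERROR> ..." as well as "ERROR ..." at the start of a line.
_LEVEL_AT_START = re.compile(r"^<?(?P<level>[A-Z]+)>?\s")


def line_level(line: str) -> Optional[str]:
    """Return the level a log item starts with, or None if unrecognized."""
    match = _LEVEL_AT_START.match(line)
    if match and match.group("level") in LEVELS:
        return match.group("level")
    return None


def is_at_least(level: str, minimum: str = "WARNING") -> bool:
    """True if `level` is as severe as `minimum` or more severe."""
    return LEVELS.index(level) <= LEVELS.index(minimum)


# Hypothetical log item laid out after the documented field order.
line = "<ERROR> (example-tenant) 2024-01-01 12:00:00 host-1 example.src 123"
level = line_level(line)
print(level, is_at_least(level))
```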

Tenant specific alerts

An item on a tenant specific alert topic is an alert record. It's a JSON object described in the Alerts section.

Topic: system:alerts

Available on every site.

This topic contains important alerts from the system.
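
Since every item on this topic is an alert record, a consumer can filter on the common severity field. The sketch below assumes that the order in which the Alerts section lists the severity levels (warning, minor, major, critical) runs from least to most severe. The helper name is_at_least and all sample values are hypothetical.

```python
import json

# Severity levels in the order they are listed in the Alerts section.
# Assumption: that order runs from least to most severe.
SEVERITIES = ("warning", "minor", "major", "critical")


def is_at_least(alert_item: str, minimum: str) -> bool:
    """True if the alert record's severity is `minimum` or worse."""
    record = json.loads(alert_item)
    return SEVERITIES.index(record["severity"]) >= SEVERITIES.index(minimum)


# Hypothetical alert item; the field values are made up.
sample = json.dumps({
    "alert": "application-error",
    "time": "2024-05-01T12:00:00Z",
    "id": "example-id",
    "site": "example-site",
    "severity": "major",
})
print(is_at_least(sample, "minor"))     # True
print(is_at_least(sample, "critical"))  # False
```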

Alert: application-error

Alert signaling when an application ends up in an erroneous oper-status.

Alert-specific data:

NameTypeDescription
applicationnameThe name of the application.
application-versionstringThe version of the application.

Note that this is the application version triggering the change of the
application oper-state.
application-deploymentstringThe name of the application deployment to which the application
belongs.
error-messagestringAdditional information about the application failure.

Alert: container-layer-threshold-reached

Alert signaling that a container-layer reached an alerting threshold.

Alert-specific data:

NameTypeDescription
hostnamenameThe hostname of the host where the alert arose.
applicationnameThe name of the application.
service-instancenameThe name of the service instance.
containernameThe name of the container.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.

Alert: custom-alert

A custom alert issued and controlled by the tenant.

Alert-specific data:

NameTypeDescription
custom-idstringThe custom identifier of the alert. This could be a unique
identifier of the entity affected by the fault causing the
alert. As an example it could be a unique identifier of the
failing service instance (replica) of an application on a
site.
custom-namestringThe custom name of the specific alert. This should be a
unique name describing the alert.
applicationnameThe name of the application. This attribute will only be
available if the alert or clear operation has been executed
from within the application container using an approle.
service-instancenameThe name of the service instance. This attribute will only be
available if the alert or clear operation has been executed
from within the application container using an approle.

Alert: ephemeral-volume-threshold-reached

Alert signaling that an ephemeral volume reached an alerting threshold.

Alert-specific data:

NameTypeDescription
hostnamenameThe hostname of the host where the alert arose.
applicationnameThe name of the application.
service-instancenameThe name of the service instance.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
volume-namestringThe ephemeral volume name.

Alert: invalid-auto-cert-configuration

An auto-cert has been configured with a TTL that exceeds either the renew-threshold or the activate-threshold of the CA certificate.

Alert-specific data:

NameTypeDescription
secret-namenameName of secret with invalid auto-cert configuration.
ca-namenameName of CA used to generate certificate.

Alert: persistent-volume-threshold-reached

Alert signaling that a persistent volume reached an alerting threshold.

Alert-specific data:

NameTypeDescription
hostnamestringThe hostname of the host where the alert arose.
applicationnameThe name of the application.
service-instancenameThe name of the service instance.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
volume-namestringThe persistent volume name.

Alert: unwrap-failure

Issued when multiple unwraps of a secret are attempted. It may be an indication of a security breach.

Alert-specific data:

NameTypeDescription
idnameId of the token subject to multiple unwraps.
metastring
peer-ipip-addressPeer IP of this unwrap attempt.
successful-peer-ipip-addressPeer IP of successful unwrap (first attempted unwrap)
successful-timedate-and-timeTime of successful unwrap.

Topic: system:notifications - DEPRECATED

This topic is deprecated and will be removed in a future release. Use system:alerts instead.

Alert: application-error

Alert signaling when an application ends up in an erroneous oper-status.

Alert-specific data:

NameTypeDescription
applicationnameThe name of the application.
application-versionstringThe version of the application.

Note that this is the application version triggering the change of the
application oper-state.
application-deploymentstringThe name of the application deployment to which the application
belongs.
error-messagestringAdditional information about the application failure.

Alert: container-layer-threshold-reached

Alert signaling that a container-layer reached an alerting threshold.

Alert-specific data:

NameTypeDescription
hostnamenameThe hostname of the host where the alert arose.
applicationnameThe name of the application.
service-instancenameThe name of the service instance.
containernameThe name of the container.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.

Alert: ephemeral-volume-threshold-reached

Alert signaling that an ephemeral volume reached an alerting threshold.

Alert-specific data:

NameTypeDescription
hostnamenameThe hostname of the host where the alert arose.
applicationnameThe name of the application.
service-instancenameThe name of the service instance.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
volume-namestringThe ephemeral volume name.

Alert: persistent-volume-threshold-reached

Alert signaling that a persistent volume reached an alerting threshold.

Alert-specific data:

NameTypeDescription
hostnamestringThe hostname of the host where the alert arose.
applicationnameThe name of the application.
service-instancenameThe name of the service instance.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
volume-namestringThe persistent volume name.

Alert: unwrap-failure

Issued when multiple unwraps of a secret are attempted. It may be an indication of a security breach.

Alert-specific data:

NameTypeDescription
idnameId of the token subject to multiple unwraps.
metastring
peer-ipip-addressPeer IP of this unwrap attempt.
successful-peer-ipip-addressPeer IP of successful unwrap (first attempted unwrap)
successful-timedate-and-timeTime of successful unwrap.

Site provider specific events

An item on a topic readable by site providers is an event record. It's a JSON object described in the Events section.

Topic: system:all-scheduler-events

Available on every site.

This topic contains the union of all scheduler-related events on the local site for all tenants and is readable only by site providers.

The events are the same as for the topic system:scheduler-events.

Site provider specific audit trail log

Topic: system:unauthenticated-audit-trail-log

Available only on the top site.

This log is available to the top site provider, and contains all unauthenticated requests to the system. The format is the same as for the system:audit-trail-log.

Site provider specific host metrics

This topic contains host-related metrics.

Topic: system:host-metrics

Available on every site, but only readable by sys tenant on top sites.

JSON Object

The system samples host metrics related to CPU, memory, and disk once every 30 seconds.

NameTypeDescription
timedate-and-timeThe time the sample was taken.
sitenameThe name of the site where the event was generated.
cluster-hostnamenameThe cluster hostname of the host where the event was generated.
hostnamedomain-nameThe hostname of the host where the event was generated.
cpuJSON Object
see cpu-params JSON Object
memoryJSON Object
see mem-params JSON Object
loadavgJSON Object
see loadavg-params JSON Object
diskarray of JSON Object
see disk-entry JSON Object
A set of disk usage parameters for file systems used by either the
Edge Enforcer (supd) or by any application managed by the Edge
Enforcer. All file systems listed are from inside the Edge Enforcer
container.

A mount point reported as CONTAINER-ROOT indicates that the metric
describes the Edge Enforcer container's root file system usage, not the
host's root file system.
disk-ioarray of JSON Object
see disk-io-entry JSON Object
A set of disk I/O metrics for disk partitions used by either the
Edge Enforcer (supd) or by any application managed by the Edge
Enforcer.
cpusarray of JSON Object
see mpstat-cpu-entry JSON Object
CPU related statistics. Entries can be either aggregated for
all CPUs or per CPU.

cpu-params JSON Object

CPU metrics taken from /proc/cpuinfo.

NameTypeDescription
vcpusuint32Total number of CPUs.

mem-params JSON Object

Memory metrics taken from /proc/meminfo.

NameTypeDescription
totaluint64Total usable RAM memory in bytes.
freeuint64Free usable RAM memory in bytes.
availableuint64An estimate of how much RAM memory in bytes is available for
starting new applications, without swapping.
useduint64Used RAM memory in bytes, calculated as:
(MemTotal + SwapTotal) -
(MemFree + SwapFree) -
Buffers -
(Cached + SReclaimable).
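
The formula for used can be reproduced from the contents of /proc/meminfo. The sketch below is illustrative and not the system's own implementation: parse_meminfo and used_memory are hypothetical helper names, and values carrying the kB unit are converted to bytes by multiplying by 1024.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field name: value in bytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue
        value = int(fields[0])
        # /proc/meminfo reports most values with a "kB" unit (1024 bytes).
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        values[name.strip()] = value
    return values


def used_memory(meminfo: dict) -> int:
    """Apply the formula given for the used field."""
    return (
        (meminfo["MemTotal"] + meminfo["SwapTotal"])
        - (meminfo["MemFree"] + meminfo["SwapFree"])
        - meminfo["Buffers"]
        - (meminfo["Cached"] + meminfo["SReclaimable"])
    )


if __name__ == "__main__":
    # Works on a Linux host; prints the used value in bytes.
    with open("/proc/meminfo") as f:
        print(used_memory(parse_meminfo(f.read())))
```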

loadavg-params JSON Object

Average load metrics taken from /proc/loadavg.

Metrics avg1, avg5 and avg15 reflect the average load over time periods of 1, 5 and 15 minutes respectively, counting processes that are:

  • queued for execution
  • being executed
  • sleeping while being uninterruptible, typically waiting for I/O

NameTypeDescription
avg1decimal64Average load of processes over the last minute.
avg5decimal64Average load of processes over the last 5 minutes.
avg15decimal64Average load of processes over the last 15 minutes.
runninggauge64Number of currently runnable kernel scheduling entities (processes,
threads).
totalgauge64Number of kernel scheduling entities (processes, threads) that
currently exist on the system.
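
To show how these fields relate to their source, the sketch below parses one line of /proc/loadavg into the five fields of loadavg-params. The file has five whitespace-separated fields: the three averages, a running/total pair, and the most recently created PID, which is not part of this object. parse_loadavg is a hypothetical helper name and the sample line is made up.

```python
from decimal import Decimal


def parse_loadavg(text: str) -> dict:
    """Map the contents of /proc/loadavg onto the loadavg-params fields."""
    avg1, avg5, avg15, entities, _last_pid = text.split()
    running, total = entities.split("/")
    return {
        "avg1": Decimal(avg1),
        "avg5": Decimal(avg5),
        "avg15": Decimal(avg15),
        "running": int(running),
        "total": int(total),
    }


# Made-up sample line in the /proc/loadavg layout.
print(parse_loadavg("0.20 0.18 0.12 1/80 11206\n"))
```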

disk-entry JSON Object

NameTypeDescription
filesystemstringThe source of the mount point, usually a device.
typestringFile system type.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
mountstringThe mount point.
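
The size, used and percentage-used fields are related as described above. A minimal sketch of that arithmetic follows, assuming the percentage is rounded (rather than truncated) to two fractional digits. The function names and sample numbers are hypothetical.

```python
def percentage_used(used_blocks: int, size_blocks: int) -> float:
    """Used divided by size, as a percentage with 2 fractional digits."""
    if size_blocks <= 0:
        raise ValueError("size must be positive")
    # Assumption: the value is rounded, not truncated.
    return round(used_blocks * 100 / size_blocks, 2)


def blocks_to_bytes(blocks: int) -> int:
    """Convert a count of 1K-blocks (K = 1024 bytes) to bytes."""
    return blocks * 1024


print(percentage_used(1277, 10000))  # 12.77
print(blocks_to_bytes(10000))        # 10240000
```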

disk-io-entry JSON Object

NameTypeDescription
device-namestringThe device name.
reads-completedgauge64The total number of reads completed successfully.
sectors-readgauge64The total number of sectors read successfully.
time-spent-readinggauge64This is the total number of milliseconds spent by all reads.
writes-completedgauge64The total number of writes completed successfully.
sectors-writtengauge64The total number of sectors written successfully.
time-spent-writinggauge64This is the total number of milliseconds spent by all writes.
ios-in-progressgauge64This is the number of I/Os currently in progress.
time-spent-on-iogauge64This is the total number of milliseconds spent doing I/Os, i.e. the
time during which ios-in-progress is non-zero.
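
One way to use these metrics is to derive a device busy percentage from two consecutive samples. The sketch below assumes that time-spent-on-io is a cumulative counter (as the word "total" suggests) and that samples are 30 seconds apart, matching the sampling interval stated for this topic. io_busy_percent is a hypothetical helper name and the sample values are made up.

```python
from typing import Optional


def io_busy_percent(prev: dict, curr: dict,
                    interval_seconds: float = 30.0) -> Optional[float]:
    """Percentage of the interval during which the device had I/O in progress.

    Assumption: time-spent-on-io is cumulative, in milliseconds.
    Returns None if the counter went backwards (for example after a reboot).
    """
    busy_ms = curr["time-spent-on-io"] - prev["time-spent-on-io"]
    if busy_ms < 0:
        return None
    return min(100.0, busy_ms * 100 / (interval_seconds * 1000))


# Two made-up disk-io-entry samples for the same device.
earlier = {"device-name": "sda", "time-spent-on-io": 1000}
later = {"device-name": "sda", "time-spent-on-io": 4000}
print(io_busy_percent(earlier, later))  # 10.0
```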

mpstat-cpu-entry JSON Object

NameTypeDescription
cpustringCPU number or all.
usrdecimal64Percentage of CPU utilization that occurred while executing at the
user level (application).
nicedecimal64Percentage of CPU utilization that occurred while executing at the
user level with nice priority.
sysdecimal64Percentage of CPU utilization that occurred while executing at the
system level (kernel). Note that this does not include time spent
servicing hardware and software interrupts.
iowaitdecimal64Percentage of time that the CPU or CPUs were idle during which the
system had an outstanding disk I/O request.
irqdecimal64Percentage of time spent by the CPU or CPUs to service hardware
interrupts.
softdecimal64Percentage of time spent by the CPU or CPUs to service software
interrupts.
stealdecimal64Percentage of time spent in involuntary wait by the virtual CPU or
CPUs while the hypervisor was servicing another virtual processor.
guestdecimal64Percentage of time spent by the CPU or CPUs to run a virtual
processor.
idledecimal64Percentage of time that the CPU or CPUs were idle and the system did
not have an outstanding disk I/O request.

Site provider specific supd metrics

This topic contains supd-related metrics.

Topic: system:supd-metrics

Available on every site, but only readable by sys tenant on top sites.

JSON Object

The system samples CPU and memory metrics for the supd container once every 30 seconds.

NameTypeDescription
timedate-and-timeThe time the sample was taken.
sitenameThe name of the site where the event was generated.
cluster-hostnamenameThe cluster hostname of the host where the event was generated.
hostnamedomain-nameThe hostname of the host where the event was generated.
cpuJSON Object
see supd-cpu-metric JSON Object
CPU metric data.
memoryJSON Object
see supd-memory-metric JSON Object
Memory metric data.
long-gcsarray of JSON Object
see gc-metric JSON Object
Long GC times.
large-heapsarray of JSON Object
see gc-metric JSON Object
Large heap sizes observed at GC.
long-schedulesarray of JSON Object
see schedule-metric JSON Object
Long schedule times.
busy-portsarray of JSON Object
see busy-port JSON Object
Busy ports.
busy-dist-portsarray of JSON Object
see busy-port JSON Object
Busy distribution ports.

supd-cpu-metric JSON Object

NameTypeDescription
nanosecondsuint64The total CPU usage of all tasks in the SUPD container. The value is
measured in nanoseconds.
cpusdecimal64The CPUs limit for the SUPD container.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of CPU used by the SUPD container in relation to limits.

supd-memory-metric JSON Object

NameTypeDescription
useduint64The memory used by the SUPD container in bytes, at the time the
sample was taken.
totaluint64The total memory available in bytes for the SUPD container.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of memory used by the SUPD container divided by total.

gc-metric JSON Object

NameTypeDescription
pidstringThe suspended Erlang process identifier.
registered-namestringThe suspended Erlang process's registered name.
millisecondsuint64The GC time in milliseconds.
heap-sizeuint64The size of the used part of the heap.
heap-block-sizeuint64The size of the memory block used for storing the heap and the stack.
old-heap-sizeuint64The size of the used part of the old heap.
old-heap-block-sizeuint64The size of the memory block used for storing the old heap.
stack-sizeuint64The size of the stack.
mbuf-sizeuint64The combined size of message buffers associated with the process.

schedule-metric JSON Object

NameTypeDescription
idstringThe Erlang process or port identifier.
registered-namestringThe Erlang process's registered name.
millisecondsuint64The schedule time in milliseconds.

busy-port JSON Object

NameTypeDescription
pidstringThe suspended Erlang process identifier.
registered-namestringThe suspended Erlang process's registered name.
portstringThe busy port identifier.

Site provider specific alerts

An alert item on a site-provider-specific topic is an alert record, a JSON object described in the Alerts section.

Topic: system:site-alerts

Available on every site.

This topic contains important site-related alerts from the system.

Alert: disk-threshold-reached

Alert signaling that a disk reached an alerting threshold.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.
filesystemstringThe source of the mount point, usually a device.
typestringFile system type.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
mountstringThe mount point.

Alert: host-down

Alert signaling that a host is down.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.

Alert: host-in-disaster-mode

Alert signaling that a host ended up in disaster mode.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.

Alert: host-in-distress

Alert signaling that a host ended up in distress.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.
distressenumeration
  • none
  • down
  • disaster-mode
  • no-space-on-disk
  • disk-threshold-reached
  • supd-down
Indicating the level of distress in the host.

  • none
    No indication of distress in the host.
  • down
    Indicating that the host is down.
  • disaster-mode
    Indicating that the host is in disaster mode.
  • no-space-on-disk
    Indicating that a disk has no more space.
  • disk-threshold-reached
    Indicating that a disk has reached a threshold value.
  • supd-down
    Indicating that the supd is down.
distress-infostringDetailed information about the distress.

Alert: no-space-left-on-disk

Alert signaling that a host has no space left on a critical disk.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.

Alert: supd-down

Alert signaling that supd is down.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.