Volga Topics

The system offers a set of built-in Volga topics that applications can consume.

Events

An item on an event topic is an event record. It's a JSON object with the following structure:

Definition of events sent on volga streams to tenants.

Name | Type | Description
event | string | The name of the specific event.
occurred-at | date-and-time | Time when the event occurred.
id | string | Unique identifier of the event instance.
tenant | string | The event occurred in the context of this tenant.
data | Object | Event-specific data.
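
As a rough illustration, an event record can be mapped onto a small typed structure. The Python sketch below assumes the record arrives as a JSON string and that occurred-at is an ISO 8601 timestamp with a numeric UTC offset. The EventRecord class, the parse_event helper and the sample values are hypothetical and not part of the system.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict


@dataclass
class EventRecord:
    event: str
    occurred_at: datetime
    id: str
    tenant: str
    data: Dict[str, Any]


def parse_event(raw: str) -> EventRecord:
    """Decode one event record; field names follow the table above."""
    obj = json.loads(raw)
    return EventRecord(
        event=obj["event"],
        # Assumption: the timestamp is parseable by datetime.fromisoformat.
        occurred_at=datetime.fromisoformat(obj["occurred-at"]),
        id=obj["id"],
        tenant=obj["tenant"],
        data=obj.get("data", {}),
    )


# Hypothetical sample record, for illustration only.
sample = (
    '{"event": "container-ready", '
    '"occurred-at": "2024-01-01T12:00:00+00:00", '
    '"id": "hypothetical-id-1", "tenant": "acme", '
    '"data": {"application": "demo"}}'
)
record = parse_event(sample)
```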

Alerts

An item on an alert topic is an alert record. It's a JSON object with the following structure:

Definition of alerts sent on volga streams to tenants.

Name | Type | Description
alert | string | The name of the specific alert.
time | date-and-time | The time the alert was generated.
id | string | The site-unique ID of the alert.
site | name | The name of the site where the alert was generated.
kind | enumeration | The type of alert: infrastructure, application, or security.
  • infrastructure
    Alerts related to hosts, sites, clusters, etc.
  • application
    Application-related alerts.
  • security
    Security-related alerts.
severity | enumeration | The severity of the alert.
  • warning
    The 'warning' severity level indicates the detection of a
    potential or impending service-affecting fault, before any
    significant effects have been felt. Action should be
    taken to further diagnose (if necessary) and correct the
    problem in order to prevent it from becoming a more
    serious service-affecting fault.
  • minor
    The 'minor' severity level indicates the existence of a
    non-service-affecting fault condition and that corrective
    action should be taken in order to prevent a more serious
    (for example, service-affecting) fault. Such a severity
    can be reported, for example, when the detected alarm
    condition is not currently degrading the capacity of the
    resource.
  • major
    The 'major' severity level indicates that a
    service-affecting condition has developed and an urgent
    corrective action is required. Such a severity can be
    reported, for example, when there is a severe degradation
    in the capability of the resource and its full capability
    must be restored.
  • critical
    The 'critical' severity level indicates that a
    service-affecting condition has occurred and an immediate
    corrective action is required. Such a severity can be
    reported, for example, when a resource becomes totally out
    of service and its capability must be restored.
description | string | The description of the reason behind the alert.
expiry-time | date-and-time | After this date and time, the alert should be considered expired.
cleared | boolean | True if the alert has been cleared, otherwise false.
data | Object | Alert-specific data.
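
A consumer will often want to ignore alerts that are cleared or past their expiry-time, and look at the most severe ones first. The following Python sketch shows one possible way to do that. It assumes alert records have already been decoded into dictionaries and that expiry-time is an ISO 8601 timestamp with a UTC offset. The helper names and the numeric ranking are illustrative assumptions; the ranking simply follows the order in which the severities are listed above.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Assumption: severities are ranked in the order the table lists them.
SEVERITY_RANK = {"warning": 0, "minor": 1, "major": 2, "critical": 3}


def is_active(alert: Dict, now: datetime) -> bool:
    """An alert is active if it is not cleared and not yet expired."""
    if alert.get("cleared", False):
        return False
    expiry = alert.get("expiry-time")
    if expiry is not None and now > datetime.fromisoformat(expiry):
        return False
    return True


def active_by_severity(alerts: List[Dict], now: datetime) -> List[Dict]:
    """Return active alerts, most severe first."""
    active = [a for a in alerts if is_active(a, now)]
    return sorted(active, key=lambda a: SEVERITY_RANK[a["severity"]], reverse=True)


# Hypothetical alert names and values, for illustration only.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
alerts = [
    {"alert": "a", "severity": "minor", "cleared": False,
     "expiry-time": "2024-01-02T00:00:00+00:00"},
    {"alert": "b", "severity": "critical", "cleared": False,
     "expiry-time": "2024-01-02T00:00:00+00:00"},
    {"alert": "c", "severity": "major", "cleared": True,
     "expiry-time": "2024-01-02T00:00:00+00:00"},
    {"alert": "d", "severity": "major", "cleared": False,
     "expiry-time": "2023-12-31T00:00:00+00:00"},
]
ordered = [a["alert"] for a in active_by_severity(alerts, now)]
```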

Threshold alerts

For threshold alerts such as container-layer-threshold-reached, ephemeral-volume-threshold-reached, persistent-volume-threshold-reached, and disk-threshold-reached, the severity levels are:

  • CRITICAL: 100%
  • MAJOR: 90%
  • WARNING: 80%
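
The mapping from usage to severity can be sketched as a small lookup. The Python example below assumes that a threshold counts as reached when usage is greater than or equal to the listed percentage; the function name and that comparison rule are assumptions, since the text only gives the three levels.

```python
from typing import Optional

# Levels from the list above, checked from highest to lowest.
THRESHOLDS = [(100.0, "critical"), (90.0, "major"), (80.0, "warning")]


def threshold_severity(percent_used: float) -> Optional[str]:
    """Return the severity for a usage percentage, or None below 80%."""
    for limit, severity in THRESHOLDS:
        if percent_used >= limit:
            return severity
    return None
```

For example, under these assumptions a volume at 93% usage would map to major, and one at 50% would not raise a threshold alert.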

Tenant specific events

An item on a tenant specific events topic is an event record. It's a JSON object described in the Events section.

Topic: system:scheduler-events

Available on every site.

This topic contains scheduler related events on the local site.

Event: application-status-changed

This event is generated when the oper-status of an application changes. The oper-status of an application is an aggregated status of all service instances in the application.

In some cases, this event may be generated even if the oper-status has not changed. Clients must be prepared to handle this.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
oper-status | enumeration | The oper-status of the application.
  • waiting-to-be-scheduled
    The application exists only in the application queue,
    waiting to be scheduled.
  • waiting-for-images
    Waiting for images to be automatically downloaded.
  • starting
    The application's service instances are being started.

    This is a combined status to make it easier to find
    problematic applications. For detailed information
    see the service-instances list.
  • running
    All the application's service instances have oper-status
    running.
  • upgrading
    The application is being upgraded.
  • error
    Some of the application's service instances are in
    an error state, or no service instances could be created.

    This is a combined status to make it easier to find
    problematic applications. For detailed information see
    the service-instances list.
application-version | string | The version of the application.
application-deployment | string | The name of the application deployment to which the application belongs.
site | name | The name of the site where the application is running.
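
Because application-status-changed may be delivered even when the oper-status is unchanged, a consumer can remember the last status it saw and react only to real transitions. The following Python sketch illustrates one way to do this. The StatusTracker class is hypothetical, and keying on the site and application fields is an assumption about what identifies an application instance.

```python
from typing import Dict, Tuple


class StatusTracker:
    """Remembers the last oper-status seen per (site, application)."""

    def __init__(self) -> None:
        self._last: Dict[Tuple[str, str], str] = {}

    def observe(self, data: dict) -> bool:
        """Record the event data; return True only on a real change."""
        key = (data["site"], data["application"])
        status = data["oper-status"]
        changed = self._last.get(key) != status
        self._last[key] = status
        return changed


# Hypothetical event data, for illustration only.
tracker = StatusTracker()
first = tracker.observe({"site": "s1", "application": "demo", "oper-status": "starting"})
repeat = tracker.observe({"site": "s1", "application": "demo", "oper-status": "starting"})
moved = tracker.observe({"site": "s1", "application": "demo", "oper-status": "running"})
```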

Event: container-completed

This event is generated when an init container has completed successfully.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | name | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
container-name | name | The name of the container.
container-id | string | The internal id of the container.
container-image | name | The name of the image that the container is running.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.

Event: container-exited

This event is generated when the system detects that a VM container has exited with no errors. Such a container is restarted directly.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | name | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
container-name | name | The name of the container.
container-id | string | The internal id of the container.
container-image | name | The name of the image that the container is running.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.

Event: container-failed

This event is generated when the system detects that a container has failed, either because it exited or because its startup or liveness probe failed.

A VM container that exited with status 0 does not generate this event; it generates a container-exited event instead.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | name | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
container-name | name | The name of the container.
container-id | string | The internal id of the container.
container-image | name | The name of the image that the container is running.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.
reason | enumeration |
  • exited
    The container process exited.
  • startup-probe
    The startup probe failed.
  • liveness-probe
    The liveness probe failed.
restart-after | duration | Time until the container is restarted.
    A duration in years, days, hours, minutes and seconds.
    Format is [<digits>y][<digits>d][<digits>h][<digits>m][<digits>s].
    Examples: 1y2d5h, 5h or 10m30s
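
A duration value such as restart-after can be turned into a time span with a small parser. The Python sketch below is one possible reading of the format. It accepts an optional hours component because the examples use one, and it treats a year as 365 days, which is an assumption the text does not confirm. The parse_duration name is hypothetical.

```python
import re
from datetime import timedelta

# Each component is optional, but they must appear in this order.
_DURATION = re.compile(
    r"(?:(\d+)y)?(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?"
)


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '1y2d5h', '5h' or '10m30s'."""
    match = _DURATION.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"not a duration: {text!r}")
    years, days, hours, minutes, seconds = (
        int(group) if group else 0 for group in match.groups()
    )
    # Assumption: one year is counted as 365 days.
    return timedelta(
        days=years * 365 + days, hours=hours, minutes=minutes, seconds=seconds
    )
```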

Event: container-ready

This event is generated when a container's readiness probe is successful. If no readiness probe is configured, this event is generated as soon as the container has started.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | name | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
container-name | name | The name of the container.
container-id | string | The internal id of the container.
container-image | name | The name of the image that the container is running.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.

Event: container-starting

This event is generated when the container has been created, and just before it is started.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | name | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
container-name | name | The name of the container.
container-id | string | The internal id of the container.
container-image | name | The name of the image that the container is running.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.

Event: container-stopped

This event is generated when the system has stopped the container for any reason.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | name | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
container-name | name | The name of the container.
container-id | string | The internal id of the container.
container-image | name | The name of the image that the container is running.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.

Event: container-unready

This event is generated when a container's readiness probe fails.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | name | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
container-name | name | The name of the container.
container-id | string | The internal id of the container.
container-image | name | The name of the image that the container is running.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.

Event: service-instance-creation-failed

This event is generated when no service instances could be created for an application.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | string | The version of the application.
service-name | name | The name of the service.
site | name | The name of the site where the event was generated.
error-message | string | Additional information about the failure to create service instances.

Event: service-instance-failed

This event is generated when some container in a service failed, or when the service instance could not be scheduled or started.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | string | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network.
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance.
container-name | name | The name of the failing container.
reason | enumeration |
  • scheduling-error
    The service instance could not be scheduled to any host.
  • preparation-error
    No containers were started due to a
    preparation error, e.g., an error when preparing the
    application network or an error when preparing the volumes.
  • exited
    The container process exited.
  • startup-probe
    The startup probe failed.
  • liveness-probe
    The liveness probe failed.

Event: service-instance-ready

This event is generated when all containers in a service are ready, i.e., their readiness probes succeeded.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | string | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network.
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance.

Event: service-instance-starting

This event is generated just before the containers in a service instance are started.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | string | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network.
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance.

Event: service-instance-stopped

This event is generated when all containers in a service instance have been stopped.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | string | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network.
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance.
reason | enumeration |
  • upgrading
    The service instance is stopped as part of being
    upgraded.
  • admin-down
    The service instance is stopped due to an administrative
    action, e.g., draining the host.
  • mounted-file-change
    The service instance is restarted because a mounted file
    was changed.
  • secret-change
    The service instance is restarted because a secret that
    is in the environment or on the command line was
    changed.
  • config-change
    The service instance is restarted because some
    configuration parameter that is used in the environment
    or on the command line was changed.
  • restart
    The service instance is stopped due to some administrative
    action, e.g., an explicit call to restart.
  • removed
    The service instance has been removed from the host.
  • rescheduled
    The service instance has been rescheduled, due to an invocation of
    the reschedule action.

Event: service-instance-unready

This event is generated when one or more containers in a service are unready; i.e., their readiness probes failed.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | string | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network.
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance.

Event: service-instance-updated

This event is generated when a service instance's specification has been updated.

Event-specific data:

Name | Type | Description
application | name | The name of the application.
application-version | string | The version of the application.
service-name | name | The name of the service.
service-instance | name | The name of the service instance.
site | name | The name of the site where the event was generated.
hostname | name | The name of the host where the event was generated.
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network.
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance.

Topic: system:deployment-events

Available only on the Control Tower.

This topic contains application deployment related events.

Event: application-summary-status-changed

Sent when a site has reported that an application deployed to the site has changed its oper-status.

Event-specific data:

Name | Type | Description
application-deployment | name | The name of the application deployment.
application | name | The name of the application.
application-version | string | The version of the application.
is-canary | empty | Present if the event relates to deployment to a canary site.
sites | array of name | The affected sites.
oper-status | enumeration | This is an aggregated status of the application.
  • waiting-for-distribution
    The application is being distributed to all selected sites.
  • waiting-to-be-scheduled
    The application is not yet started on at least one site.
  • starting
    The application is not yet running on at least one site.
  • running
    The application is running without an error on all sites.
  • error
    The application's oper-status is error on at least one site.
error-message | string | Gives additional information when oper-status is error.

Event: deployment-failed

Sent when a set of sites have reported that the application has failed to be deployed at the sites.

These sites are moved to the deploy-failed list in the application deployment's state.

Event-specific data:

Name | Type | Description
application-deployment | name | The name of the application deployment.
application | name | The name of the application.
application-version | string | The version of the application.
is-canary | empty | Present if the event relates to deployment to a canary site.
sites | array of name | The affected sites.
error-message | string | Error message describing the failure.

Event: deployment-initiated

Sent when a set of sites have been instructed to deploy a specific version of an application.

These sites are moved to the deploying-to list in the application deployment's state.

Event-specific data:

Name | Type | Description
application-deployment | name | The name of the application deployment.
application | name | The name of the application.
application-version | string | The version of the application.
is-canary | empty | Present if the event relates to deployment to a canary site.
sites | array of name | The affected sites.

Event: deployment-sites-removed

Sent when an application is no longer deployed to a set of sites.

When this event is sent, the sites have been instructed to remove the application.

Event-specific data:

Name | Type | Description
application-deployment | name | The name of the application deployment.
application | name | The name of the application.
application-version | string | The version of the application.
is-canary | empty | Present if the event relates to deployment to a canary site.
sites | array of name | The affected sites.

Event: deployment-succeeded

Sent when a set of sites have reported that the application has been successfully deployed at the sites. Successfully deployed means that all service instances are in state running.

These sites are moved to the deployed-to list in the application deployment's state.

Event-specific data:

Name | Type | Description
application-deployment | name | The name of the application deployment.
application | name | The name of the application.
application-version | string | The version of the application.
is-canary | empty | Present if the event relates to deployment to a canary site.
sites | array of name | The affected sites.
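
The deployment events describe how sites move between the deploying-to, deployed-to and deploy-failed lists. A client that wants a local picture of this could mirror the moves as events arrive. The Python sketch below is only an illustration of that idea. The apply_event helper and the state dictionary are hypothetical, and the authoritative lists remain those in the application deployment's state.

```python
from typing import Dict, Set

# Which list the affected sites move to, per the event descriptions above.
EVENT_TO_LIST = {
    "deployment-initiated": "deploying-to",
    "deployment-succeeded": "deployed-to",
    "deployment-failed": "deploy-failed",
}


def apply_event(state: Dict[str, Set[str]], event: dict) -> None:
    """Update a local mirror of the site lists from one event record."""
    name = event["event"]
    if name not in EVENT_TO_LIST and name != "deployment-sites-removed":
        return  # other events do not move sites between lists
    sites = set(event["data"]["sites"])
    # Take the sites out of whichever list currently holds them.
    for members in state.values():
        members.difference_update(sites)
    if name in EVENT_TO_LIST:
        state.setdefault(EVENT_TO_LIST[name], set()).update(sites)


# Hypothetical site names, for illustration only.
state: Dict[str, Set[str]] = {}
apply_event(state, {"event": "deployment-initiated", "data": {"sites": ["s1", "s2"]}})
apply_event(state, {"event": "deployment-succeeded", "data": {"sites": ["s1"]}})
apply_event(state, {"event": "deployment-failed", "data": {"sites": ["s2"]}})
```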

Topic: system:config-events

Available only on the Control Tower.

This topic contains configuration change events.

Event: config-created

Sent when a new config object was created.

Event-specific data:

Name | Type | Description
username | string | Identifies the user that initiated the config change.
object | name | Kind of config object.
path | string | Identifier of the unique object.

Event: config-deleted

Sent when a config object was deleted.

Event-specific data:

Name | Type | Description
username | string | Identifies the user that initiated the config change.
object | name | Kind of config object.
path | string | Identifier of the unique object.

Event: config-updated

Sent when a config object was updated.

Event-specific data:

Name | Type | Description
username | string | Identifies the user that initiated the config change.
object | name | Kind of config object.
path | string | Identifier of the unique object.

Topic: system:events

Available only on the Control Tower.

This topic contains important events from the system.

Event: site-connected

This event is generated at the Control Tower when an edge site connects.

Event-specific data:

Name | Type | Description
site | name |

Event: site-disconnected

This event is generated at the Control Tower when the connection to an edge site is lost.

Event-specific data:

Name | Type | Description
site | name |

Event: supd-download

This event is generated at the Control Tower when supd is being downloaded by a site.

Event-specific data:

Name | Type | Description
environment | string | The name of the Control Tower environment.
site | string | The name of the site that downloaded supd.
tenant | string | The name of the tenant that downloaded supd.
image | string | The full name of the supd image that the site downloaded.
peer | ip-address | The IP address of the site that downloaded supd. Note that when a load-balancer is used in front of the Control Tower, this will be the address of the load-balancer instead.
forwarded | array of ip-address | When a load-balancer is used in front of the Control Tower, the true address of the site will be found here.

Event: supd-upgrade-announcement

This event is generated at the Control Tower when supd is about to be upgraded.

Event-specific data:

Name | Type | Description
current-version | version |
new-version | version |
wait-time | duration | A duration in years, days, hours, minutes and seconds.
    Format is [<digits>y][<digits>d][<digits>h][<digits>m][<digits>s].
    Examples: 1y2d5h, 5h or 10m30s

Event: supd-upgraded

This event is generated at the Control Tower when supd has been successfully upgraded.

Event-specific data:

Name | Type | Description
version | version |

Event: ui-upgrade-announcement

This event is generated at the Control Tower when the UI is about to be upgraded.

Event-specific data:

Name | Type | Description
current-version | version |
new-version | version |
wait-time | duration | A duration in years, days, hours, minutes and seconds.
    Format is [<digits>y][<digits>d][<digits>h][<digits>m][<digits>s].
    Examples: 1y2d5h, 5h or 10m30s

Event: ui-upgrade-failed

This event is generated at the Control Tower when the UI has failed to be upgraded.

Event-specific data:

Name | Type | Description
version | version |
error-message | string |

Event: ui-upgraded

This event is generated at the Control Tower when the UI has been successfully upgraded.

Event-specific data:

Name | Type | Description
version | version |

Tenant specific application metrics

Topic: system:application-metrics

Available on every site except on the Control Tower.

This topic contains application related metrics for the tenant.

Object

The system samples metrics related to cpu, memory and network traffic once every 10 seconds. Every 30 seconds, related samples are collected into this object.

Metrics are sampled on three different levels: per container, per service instance, or for the whole application on a host.

Name | Type | Description
site | name | The name of the site where the metrics originated.
tenant | name | The name of the tenant.
entries | array of Object (see metrics-entry) | A set of samples for this tenant.

The metrics-entry Object

Name | Type | Description
time | date-and-time | The time the sample was taken.
host | string | The host where the sample was taken.
application | name | The name of the application that was sampled.
per-container OR per-service OR per-application | Object (see per-container-metrics) OR Object (see per-service-metrics) OR Object (see per-application-metrics) |
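
Since each metrics-entry carries one of per-container, per-service or per-application, a consumer typically branches on which key is present. The Python sketch below groups the entries of one decoded metrics object by level. The helper names and the sample values are hypothetical, and treating an entry with zero or several level keys as an error is an assumption.

```python
from typing import Dict, List

LEVELS = ("per-container", "per-service", "per-application")


def entry_level(entry: dict) -> str:
    """Return which of the three level keys this entry carries."""
    present = [key for key in LEVELS if key in entry]
    if len(present) != 1:
        raise ValueError(f"expected exactly one level key, found {present}")
    return present[0]


def group_by_level(record: dict) -> Dict[str, List[dict]]:
    """Group the entries of one metrics object by sampling level."""
    groups: Dict[str, List[dict]] = {level: [] for level in LEVELS}
    for entry in record.get("entries", []):
        groups[entry_level(entry)].append(entry)
    return groups


# Hypothetical metrics object, for illustration only.
record = {
    "site": "s1",
    "tenant": "acme",
    "entries": [
        {"time": "2024-01-01T12:00:00+00:00", "host": "h1", "application": "demo",
         "per-container": {"service-instance": "svc-1", "container": "c1"}},
        {"time": "2024-01-01T12:00:00+00:00", "host": "h1", "application": "demo",
         "per-application": {"hosts": []}},
    ],
}
groups = group_by_level(record)
```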

The per-container-metrics Object

Name | Type | Description
service-instance | name | The name of the service instance.
container | name | The name of the container.
memory | Object (see memory-metric) | Memory metric data.
cpu | Object (see cpu-metric) | CPU metric data.
container-layer | Object (see disk-metric) | Container layer storage metric data.
    This section of metrics is only available if the underlying file
    system has support for quota.

The memory-metric Object

Name | Type | Description
used | uint64 | The memory used by the container in bytes, calculated at the time the sample was taken.
total | uint64 | The total memory available in bytes for the container.
percentage-used | percent | Memory used by the container as a percentage of total.
    A percent value with up to 2 fractional digits. For example 12.77%.
used-hot | uint64 | The hot memory used by the container in bytes, calculated at the time the sample was taken. Hot memory is the same as used, but excluding inactive file buffers (inactive_file) and inactive anonymous mappings (inactive_anon).
percentage-used-hot | percent | Hot memory used by the container as a percentage of total.
    A percent value with up to 2 fractional digits. For example 12.77%.

The cpu-metric Object

Name | Type | Description
nanoseconds | uint64 | The total number of CPU nanoseconds used by the container.
cpus | decimal64 | The CPU limit for the container, i.e., the maximum number of CPUs used by the container.
shares | uint16 | The CPU shares limit for the container.
percentage-used | percent | Percentage of CPU used in relation to limits.
    A percent value with up to 2 fractional digits. For example 12.77%.

The disk-metric Object

Name | Type | Description
size | uint64 | Total number of 1K-blocks, where K is 1024 bytes.
used | gauge64 | Number of used 1K-blocks, where K is 1024 bytes.
free | gauge64 | Number of available (free) 1K-blocks, where K is 1024 bytes.
percentage-used | percent | Percentage of used divided by size.
    A percent value with up to 2 fractional digits. For example 12.77%.
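
The relationship between the disk-metric fields can be shown with a short calculation. The Python sketch below converts 1K-blocks to bytes and recomputes the percentage as used divided by size, rounded to two fractional digits. The disk_usage helper and its output keys are hypothetical, and returning 0.0 for a zero size is an assumption made to avoid a division by zero.

```python
BLOCK_BYTES = 1024  # one 1K-block, as stated in the table above


def disk_usage(metric: dict) -> dict:
    """Derive byte counts and percentage used from a disk-metric object."""
    size, used = metric["size"], metric["used"]
    # Assumption: report 0.0 when size is zero rather than failing.
    percent = round(100.0 * used / size, 2) if size else 0.0
    return {
        "size_bytes": size * BLOCK_BYTES,
        "used_bytes": used * BLOCK_BYTES,
        "percent_used": percent,
    }


# Hypothetical values, for illustration only.
usage = disk_usage({"size": 1000, "used": 250, "free": 750})
```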

The per-service-metrics Object

Name | Type | Description
service-instance | name | The name of the service instance.
ephemeral-volumes | array of Object (see disk-volume-metric) | Ephemeral volume storage parameters.
    Ephemeral volume parameters are only available for volumes where
    the underlying file system has support for quota.
persistent-volumes | array of Object (see disk-volume-metric) | Persistent volume storage parameters.
    Persistent volume parameters are only available for volumes where
    the underlying file system has support for quota.

The disk-volume-metric Object

Name | Type | Description
size | uint64 | Total number of 1K-blocks, where K is 1024 bytes.
used | gauge64 | Number of used 1K-blocks, where K is 1024 bytes.
free | gauge64 | Number of available (free) 1K-blocks, where K is 1024 bytes.
percentage-used | percent | Percentage of used divided by size.
    A percent value with up to 2 fractional digits. For example 12.77%.
volume-name | string |

The per-application-metrics Object

Per-application metrics, aggregated for all service instances that are part of the same application.

Name | Type | Description
gateway-network | Object (see gateway-network-metrics) |
hosts | array of Object (see host-metric) | Resource metrics aggregated per application per host.

The gateway-network-metrics Object

Name | Type | Description
tx-packets | uint64 | Transmitted external traffic on the gateway network, i.e., traffic leaving the host.
    Note that transmitted traffic is counted before bandwidth
    restrictions are applied. This is usually not a problem for
    protocols with flow control, but if an application sends an
    uncontrollably large amount of traffic to a destination
    outside the host, that traffic is included here.
tx-bytes | uint64 | Transmitted external traffic on the gateway network, i.e., traffic leaving the host. The value represents frame payload, Layer 2 header not included.
    Note that transmitted traffic is counted before bandwidth
    restrictions are applied. This is usually not a problem for
    protocols with flow control, but if an application sends an
    uncontrollably large amount of traffic to a destination
    outside the host, that traffic is included here.
rx-packets | uint64 | Received external traffic on the gateway network, i.e., traffic originating outside the host. Only the traffic allowed by the firewall is included, i.e., traffic on the open ingress ports and traffic on established connections.
    Note that received traffic is counted before bandwidth
    restrictions are applied. This is usually not a problem for
    protocols with flow control, but if an external party sends
    an uncontrollably large amount of traffic to the
    application, that traffic is included here.
rx-bytes | uint64 | Received external traffic on the gateway network, i.e., traffic originating outside the host. The value represents frame payload, Layer 2 header not included. Only the traffic allowed by the firewall is included, i.e., traffic on the open ingress ports and traffic on established connections.
    Note that received traffic is counted before bandwidth
    restrictions are applied. This is usually not a problem for
    protocols with flow control, but if an external party sends
    an uncontrollably large amount of traffic to the
    application, that traffic is included here.
tx-packets-per-second | uint64 | Intensity of transmitted external traffic in packets per second. This value is the average intensity over the interval between the last two reported samples. Hence it is not reported when there are not enough recent samples available.
tx-bytes-per-second | uint64 | Intensity of transmitted external traffic in bytes per second. This value is the average intensity over the interval between the last two reported samples. Hence it is not reported when there are not enough recent samples available.
rx-packets-per-second | uint64 | Intensity of received external traffic in packets per second. This value is the average intensity over the interval between the last two reported samples. Hence it is not reported when there are not enough recent samples available.
rx-bytes-per-second | uint64 | Intensity of received external traffic in bytes per second. This value is the average intensity over the interval between the last two reported samples. Hence it is not reported when there are not enough recent samples available.
upstream-bandwidth-utilization | decimal64 | If an upstream-bandwidth-per-host is configured for this application, then this value indicates the fraction of the available bandwidth used by the application. This value is based on the tx-bytes-per-second metric.
downstream-bandwidth-utilization | decimal64 | If a downstream-bandwidth-per-host is configured for this application, then this value indicates the fraction of the available bandwidth used by the application. This value is based on the rx-bytes-per-second metric.
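
The per-second and utilization values are derived quantities, and the arithmetic behind them can be sketched as follows. This Python example computes an average rate from the last two counter samples and returns nothing when only one sample exists, mirroring the note that the value is not reported without enough recent samples. The function names are hypothetical, and expressing the configured bandwidth in bytes per second is an assumption; the system's own unit and rounding are not stated here.

```python
from typing import Optional, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, cumulative counter)


def average_rate(prev: Optional[Sample], curr: Sample) -> Optional[int]:
    """Average counter increase per second between two samples."""
    if prev is None:
        return None  # not enough samples to compute a rate
    elapsed = curr[0] - prev[0]
    if elapsed <= 0 or curr[1] < prev[1]:
        return None  # clock went backwards or counter was reset
    return int((curr[1] - prev[1]) / elapsed)


def utilization(bytes_per_second: int, bandwidth_bytes_per_second: float) -> float:
    """Fraction of the configured bandwidth in use (assumed unit: bytes/s)."""
    return bytes_per_second / bandwidth_bytes_per_second


# Hypothetical samples taken 30 seconds apart.
rate = average_rate((0.0, 0), (30.0, 3000))
```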

The host-metric Object

Name | Type | Description
host | string | The application metrics are from this host.
memory-percentage-used | percent | Percentage of aggregated memory used by the application on this host in relation to total available memory.
    A percent value with up to 2 fractional digits. For example 12.77%.
cpu-percentage-used | percent | Percentage of maximum CPU used among all containers for the application on this host in relation to total available CPUs.
    A percent value with up to 2 fractional digits. For example 12.77%.
disk-percentage-used | percent | Percentage of maximum disk used among all container layers, ephemeral and persistent volumes with quota support for the application on this host.
    A percent value with up to 2 fractional digits. For example 12.77%.
    This metric is not available if no file system used by this
    application supports quota.

Tenant specific audit trail log

The audit trail log records all operations performed by a tenant. The log includes the access token, if provided, the operation, and any parameters provided.

In order to protect sensitive data, e.g., tokens and secrets, all such data is hashed using a tenant-specific HMAC before being logged. To search for some specific sensitive data in the logs, e.g., operations performed using a specific access token, the plain text version of the data can be hashed using the strongbox/audit/hmac operation.
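
The idea behind keyed hashing can be sketched locally as follows. This is an illustration only: the hash function (SHA-256), the hex encoding, and the key are assumptions, and the real tenant-specific key is not available to clients. To search the actual logs, use the strongbox/audit/hmac operation instead.

```python
import hashlib
import hmac


def keyed_hash(key: bytes, plaintext: str) -> str:
    """Return a hex HMAC of plaintext under key (SHA-256 is an assumption)."""
    return hmac.new(key, plaintext.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key and token, for illustration only.
example_key = b"tenant-specific-key"
hashed = keyed_hash(example_key, "my-access-token")
```

Because the same key and input always give the same digest, a hashed value can be matched against log entries without the plain text ever being stored.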

Audit trail logs are streamed upwards from edge sites to the Control Tower. This allows inspection of audit logs even if a site is compromised.

Messages in the audit trail log are cryptographically signed when stored to disk in order to guarantee their integrity. The signature is always verified when a message is retrieved from disk.

Topic: system:audit-trail-log

Available on every site.

This log contains all authenticated requests for the tenant.

Object

NameTypeDescription
occurred-atdate-and-timeTime when access occurred.
request-time-msgauge64Time to process request in milliseconds.
userstringUser or Approle that performed the access.
pathstringPath that was accessed.
querystringAny supplied URL query parameters.
methodstringHTTP method that was used.
statusuint32HTTP response status.
status-infostringText representation of status.
request-parametersObjectThe parameters included in the request, if any, i.e., the request body.
client-ipstringAddress from which the client accessed the host.
x-forwarded-forstringIf the request went through a load balancer, the client's real
IP may appear as x-forwarded-for. It may be a list of addresses
where the first address should be the client's original address.
x-real-ipstringIf the request went through a load balancer, the client's real
IP may appear as x-real-ip. It may be a list of addresses
where the first address should be the client's original address.
sitestringSite where the access occurred.
hoststringHost on which the access occurred.
tenantnameTenant that performed the access.
tokenstringHashed representation of the access token. It can be
used when identifying a specific session, and all
accesses/operations using the same token.
accessorstringToken accessor that can be used to terminate an ongoing
session.
fromstringThe 'from' request header from the originating HTTP request.
user-agentstringUser-agent that performed the request, e.g., curl or Firefox.
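
A consumer often wants the client's original address from an audit record. The sketch below assumes that a list of addresses is encoded as a comma-separated string, as is conventional for the X-Forwarded-For header; that encoding and the function name are assumptions.

```python
from typing import Optional


def original_client_ip(record: dict) -> Optional[str]:
    """Pick the most likely original client address from an audit record.

    Prefers the first address in x-forwarded-for, then x-real-ip, and
    falls back to client-ip.
    """
    for field in ("x-forwarded-for", "x-real-ip"):
        value = record.get(field)
        if value:
            # Assumption: multiple addresses are comma-separated.
            first = value.split(",")[0].strip()
            if first:
                return first
    return record.get("client-ip")
```

Note that forwarded headers are supplied by the client or by intermediate proxies, so the result should be treated as a hint rather than as verified information.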

Tenant specific logs

An item on a log topic is a string.

Topic: system:container-logs:CONTAINER-ID

Available on the host where the container is running.

There is one topic per running container. It contains all output from standard output and standard error from the container.

The CONTAINER-ID is a string in the format APPLICATIONNAME.SERVICENAME-IX.CONTAINERNAME, where IX is a numeric index, one per service instance (replica).
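
The CONTAINER-ID format can be split into its parts with a regular expression. This sketch assumes that none of the names contain a dot and that the service name itself may contain hyphens; both are assumptions about naming rules not stated here.

```python
import re
from typing import NamedTuple, Optional


class ContainerId(NamedTuple):
    application: str
    service: str
    index: int
    container: str


# APPLICATIONNAME.SERVICENAME-IX.CONTAINERNAME; the last hyphen-separated
# number in the middle part is taken as the instance index.
_CONTAINER_ID = re.compile(
    r"^(?P<app>[^.]+)\.(?P<svc>[^.]+)-(?P<ix>\d+)\.(?P<ctr>[^.]+)$"
)


def parse_container_id(container_id: str) -> Optional[ContainerId]:
    """Return the parsed parts, or None if the string does not match."""
    match = _CONTAINER_ID.match(container_id)
    if match is None:
        return None
    return ContainerId(match["app"], match["svc"], int(match["ix"]), match["ctr"])


def topic_name(container_id: str) -> str:
    """Build the container log topic name for a CONTAINER-ID."""
    return f"system:container-logs:{container_id}"
```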

The default behavior, if a container is rescheduled to another host, is to delete the container log topic for the previous container. However, if container-log-archive is set to true in the application specification, the container log will instead have a timestamp appended and become read-only.

An archived container log will continue to exist on the original host for container-log-archive-days.

Topic: system:logs

Available on every site.

This topic contains tenant-specific log info generated by the Avassa system. Each log item is in the format:

<LEVEL> (TENANT) DATE TIME HOSTNAME SRCFILE PID

Where LEVEL is one of EMERGENCY, ALERT, CRITICAL, ERROR, WARNING, NOTICE, INFO, DEBUG.
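
A log item in this format can be split into fields as sketched below. The sketch assumes that the angle brackets and parentheses are literal, that fields are separated by whitespace, and that any text after the PID is free-form; the sample line in the usage note is made up.

```python
import re
from typing import Optional

LEVELS = {"EMERGENCY", "ALERT", "CRITICAL", "ERROR",
          "WARNING", "NOTICE", "INFO", "DEBUG"}

# <LEVEL> (TENANT) DATE TIME HOSTNAME SRCFILE PID [remaining text]
_LOG_LINE = re.compile(
    r"^<(?P<level>[A-Z]+)>\s+\((?P<tenant>[^)]*)\)\s+(?P<date>\S+)\s+"
    r"(?P<time>\S+)\s+(?P<hostname>\S+)\s+(?P<srcfile>\S+)\s+(?P<pid>\S+)"
    r"(?:\s+(?P<rest>.*))?$"
)


def parse_log_item(item: str) -> Optional[dict]:
    """Return the fields of a log item, or None if it does not match."""
    match = _LOG_LINE.match(item)
    if match is None or match["level"] not in LEVELS:
        return None
    return match.groupdict()
```

For example, the hypothetical item `<WARNING> (acme) 2024-01-01 12:00:00 host-1 example.erl 4242` yields level WARNING and tenant acme.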

Tenant specific alerts

An alert item for a tenant-specific topic is an alert record, a JSON object described in the Alerts section.

Topic: system:alerts

Available on every site.

This topic contains important alerts from the system.
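
A consumer of this topic might filter alert records by severity. The sketch below assumes that each item arrives as a JSON string and that the severity values are ordered from least to most severe as listed in the Alerts section (warning, minor, major, critical); the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator

# Assumed ascending order of severity.
SEVERITY_ORDER = ["warning", "minor", "major", "critical"]


def at_least(items: Iterable[str], minimum: str) -> Iterator[dict]:
    """Yield decoded alert records whose severity is at least minimum."""
    threshold = SEVERITY_ORDER.index(minimum)
    for raw in items:
        alert = json.loads(raw)
        severity = alert.get("severity")
        if severity in SEVERITY_ORDER and SEVERITY_ORDER.index(severity) >= threshold:
            yield alert
```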

Alert: application-error

Alert signaling when an application ends up in an erroneous oper-status.

Alert-specific data:

NameTypeDescription
applicationnameThe name of the application.
application-versionstringThe version of the application.

Note that this is the application version triggering the change of the
application oper-state.
application-deploymentstringThe name of the application deployment to which the application
belongs.
error-messagestringAdditional information about the application failure.

Alert: bound-cidrs-violation

Issued when an attempt is made to use a token from an IP address not in the bound-cidrs range.

Alert-specific data:

NameTypeDescription
idnameId of the token that triggered the security breach.
hostnamedomain-nameThe hostname of the host where the alert arose.
metastring
peer-ipip-addressPeer IP of this attempt.
x-forwarded-forarray of ip-addressX-Forwarded-For IP.

Alert: container-layer-threshold-reached

Alert signaling that a container-layer reached an alerting threshold.

Alert-specific data:

NameTypeDescription
hostnamenameThe hostname of the host where the alert arose.
applicationnameThe name of the application.
service-instancenameThe name of the service instance.
containernameThe name of the container.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
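
The relationship between size, used, and percentage-used can be sketched as follows. Rounding to two fractional digits is an assumption; the system may truncate instead.

```python
BLOCK_SIZE = 1024  # bytes per 1K-block, as stated in the table


def percentage_used(used_blocks: int, size_blocks: int) -> float:
    """Used divided by size as a percentage with two fractional digits."""
    if size_blocks <= 0:
        raise ValueError("size must be positive")
    return round(100.0 * used_blocks / size_blocks, 2)


def blocks_to_bytes(blocks: int) -> int:
    """Convert a count of 1K-blocks to bytes."""
    return blocks * BLOCK_SIZE
```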

Alert: custom-alert

A custom alert issued and controlled by the tenant.

Alert-specific data:

NameTypeDescription
custom-idstringThe custom identifier of the alert. This could be a unique
identifier of the entity affected by the fault causing the
alert. As an example it could be a unique identifier of the
failing service instance (replica) of an application on a
site.
custom-namestringThe custom name of the specific alert. This should be a
unique name describing the alert.
applicationnameThe name of the application. This attribute will only be
available if the alert or clear operation has been executed
from within the application container using an approle.
service-instancenameThe name of the service instance id. This attribute will only be
available if the alert or clear operation has been executed
from within the application container using an approle.

Alert: ephemeral-volume-threshold-reached

Alert signaling that an ephemeral volume reached an alerting threshold.

Alert-specific data:

NameTypeDescription
hostnamenameThe hostname of the host where the alert arose.
applicationnameThe name of the application.
service-instancenameThe name of the service instance.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
volume-namestringThe ephemeral volume name.

Alert: failed-login-attempts

Issued when multiple login attempts have failed. It may be an indication of a security breach.

Alert-specific data:

NameTypeDescription
usernamenameUsername subject to login attempts.
attemptsuint32Number of failed consecutive attempts at the time of the alert.
peer-ipip-addressPeer IP of last failed login attempt.
x-forwarded-forarray of ip-addressX-Forwarded-For IP.

Alert: invalid-auto-cert-configuration

An auto-cert has been configured with a TTL that exceeds either the renew-threshold or the activate-threshold of the CA certificate.

Alert-specific data:

NameTypeDescription
secret-namenameName of secret with invalid auto-cert configuration.
ca-namenameName of CA used to generate certificate.

Alert: os-upgrade-failed

Alert signaling that an OS upgrade has failed on a host.

Alert-specific data:

NameTypeDescription
hostnamedomain-nameThe hostname of the host where the alert arose.

Alert: persistent-volume-threshold-reached

Alert signaling that a persistent volume reached an alerting threshold.

Alert-specific data:

NameTypeDescription
hostnamestringThe hostname of the host where the alert arose.
applicationnameThe name of the application.
service-instancenameThe name of the service instance.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
volume-namestringThe persistent volume name.

Alert: suspected-security-breach

Issued when use of a honeypot token is detected.

Alert-specific data:

NameTypeDescription
idnameId of the token that triggered the security breach.
hostnamedomain-nameThe hostname of the host where the alert arose.
metastring
peer-ipip-addressPeer IP of this attempt.
x-forwarded-forarray of ip-addressX-Forwarded-For IP.

Alert: unwrap-failure

Issued when multiple unwraps of a secret are attempted. It may be an indication of a security breach.

Alert-specific data:

NameTypeDescription
idnameId of the token subject to multiple unwraps.
metastring
peer-ipip-addressPeer IP of this unwrap attempt.
successful-peer-ipip-addressPeer IP of successful unwrap (first attempted unwrap).
successful-timedate-and-timeTime of successful unwrap.

Alert: volga-topic-integrity

Issued when a volga topic contains signed messages that cannot be verified, or messages appear out of order. Suggests tampering or file system corruption.

Alert-specific data:

NameTypeDescription
topicstringThe name of the affected topic.
fileuint64The affected topic file number.
positionuint64The file offset of the bad message.
expected-sequence-numberuint64For an out of order message, the expected sequence number.
actual-sequence-numberuint64For an out of order message, the actual sequence number.
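
The out-of-order condition can be illustrated with a simple check over message sequence numbers. This sketch assumes that sequence numbers increase by exactly one per message, which is not stated here; it does not cover signature verification.

```python
from typing import Iterable, Optional, Tuple


def first_out_of_order(sequence_numbers: Iterable[int]) -> Optional[Tuple[int, int]]:
    """Return (expected, actual) for the first out-of-order message.

    Returns None when every number follows its predecessor by one.
    """
    expected = None
    for actual in sequence_numbers:
        if expected is not None and actual != expected:
            return expected, actual
        expected = actual + 1
    return None
```

The returned pair corresponds to the expected-sequence-number and actual-sequence-number fields of the alert.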

Site provider specific events

An item on a topic readable by site providers is an event record. It's a JSON object described in the Events section.

Topic: system:all-scheduler-events

Available on every site.

This topic contains the union of all scheduler-related events on the local site for all tenants and is readable only by site providers.

The events are the same as for the topic system:scheduler-events.

Site provider specific audit trail log

Topic: system:unauthenticated-audit-trail-log

Available only on the Control Tower.

This log is available to the top site provider, and contains all unauthenticated requests to the system. The format is the same as for the system:audit-trail-log.

Site provider specific host metrics

This topic contains host-related metrics.

Topic: system:host-metrics

Available on every site, but only readable by the sys tenant on the Control Tower.

Object

The system samples host metrics related to CPU, memory, and disk once every 30 seconds.

NameTypeDescription
timedate-and-timeThe time the sample was taken.
sitenameThe name of the site where the event was generated.
cluster-hostnamenameThe cluster hostname of the host where the event was generated.
hostnamedomain-nameThe hostname of the host where the event was generated.
cpuObject
see cpu-params
memoryObject
see mem-params
loadavgObject
see loadavg-params
diskarray of Object
see disk-entry
A set of disk usage parameters for file systems used by either the
Edge Enforcer (supd) or by any application managed by the Edge
Enforcer. All file systems listed are from inside the Edge Enforcer
container.

A metric whose mount point is reported as CONTAINER-ROOT describes the
usage of the Edge Enforcer container's root file system, not the
host's root file system.
disk-ioarray of Object
see disk-io-entry
A set of disk io metrics for disk partitions used by either the
Edge Enforcer (supd) or by any application managed by the Edge
Enforcer.
cpusarray of Object
see mpstat-cpu-entry
CPU related statistics. Entries can be either aggregated for
all CPUs or per CPU.
thermal-zonesarray of Object
see thermal-zone-entry
Temperature related statistics per thermal zone.

The cpu-params Object

CPU metrics taken from /proc/cpuinfo.

NameTypeDescription
vcpusuint32Total number of CPUs.

The mem-params Object

Memory metrics taken from /proc/meminfo.

NameTypeDescription
totaluint64Total usable RAM memory in bytes.
freeuint64Free usable RAM memory in bytes.
availableuint64An estimate of how much RAM memory in bytes is available for
starting new applications, without swapping.
useduint64Used RAM memory in bytes, calculated as:
(MemTotal + SwapTotal) -
(MemFree + SwapFree) -
Buffers -
(Cached + SReclaimable).
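
The formula for used can be applied to the contents of /proc/meminfo as sketched below. The parsing helper is hypothetical; it converts the kB values that /proc/meminfo reports into bytes to match the unit of this object.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into a mapping of field name to bytes."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        value = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        values[name.strip()] = value
    return values


def used_memory(m: Dict[str, int]) -> int:
    """Used RAM in bytes, following the formula given for the used field."""
    return ((m["MemTotal"] + m["SwapTotal"])
            - (m["MemFree"] + m["SwapFree"])
            - m["Buffers"]
            - (m["Cached"] + m["SReclaimable"]))
```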

The loadavg-params Object

Average load metrics taken from /proc/loadavg.

Metrics avg1, avg5 and avg15 reflect the average load over time periods of 1, 5 and 15 minutes respectively, counting processes that are:

  • queued for execution
  • executed
  • sleeping while being uninterruptible, typically waiting for I/O
NameTypeDescription
avg1decimal64Average load of processes during the last minute.
avg5decimal64Average load of processes during the last 5 minutes.
avg15decimal64Average load of processes during the last 15 minutes.
runninggauge64Number of currently runnable kernel scheduling entities (processes,
threads).
totalgauge64Number of kernel scheduling entities (processes, threads) that
currently exist on the system.
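
For reference, the fields above map onto the single line in /proc/loadavg roughly as in this sketch. The function name is hypothetical; the fifth field of the file (the most recently created PID) has no counterpart in this object and is ignored.

```python
def parse_loadavg(text: str) -> dict:
    """Parse a /proc/loadavg line such as '0.20 0.18 0.12 1/80 11206'."""
    avg1, avg5, avg15, entities, _last_pid = text.split()
    running, total = entities.split("/")
    return {
        "avg1": float(avg1),
        "avg5": float(avg5),
        "avg15": float(avg15),
        "running": int(running),
        "total": int(total),
    }
```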

The disk-entry Object

NameTypeDescription
filesystemstringThe source of the mount point, usually a device.
typestringFile system type.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
mountstringThe mount point.

The disk-io-entry Object

NameTypeDescription
device-namestringThe device name.
reads-completedgauge64The total number of reads completed successfully.
sectors-readgauge64The total number of sectors read successfully.
time-spent-readinggauge64This is the total number of milliseconds spent by all reads.
writes-completedgauge64The total number of writes completed successfully.
sectors-writtengauge64The total number of sectors written successfully.
time-spent-writinggauge64This is the total number of milliseconds spent by all writes.
ios-in-progressgauge64This is the number of I/Os currently in progress.
time-spent-on-iogauge64This is the total number of milliseconds spent doing I/Os, i.e. the
time during which ios-in-progress is non-zero.
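
Because the counters in this object are cumulative totals, a consumer typically compares two consecutive samples. The sketch below estimates how busy a device was between two samples; the helper name is hypothetical, and treating the difference in time-spent-on-io as busy time is an interpretation of the field description.

```python
from typing import Optional


def io_utilization(previous: dict, current: dict, interval_ms: int) -> Optional[float]:
    """Percentage of the interval during which the device had I/O in progress.

    Returns None if the interval is invalid or the counter went backwards,
    for example after a reboot.
    """
    busy_ms = current["time-spent-on-io"] - previous["time-spent-on-io"]
    if interval_ms <= 0 or busy_ms < 0:
        return None
    return min(100.0, round(100.0 * busy_ms / interval_ms, 2))
```

With the 30 second sampling interval of this topic, interval_ms would normally be 30000.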

The mpstat-cpu-entry Object

NameTypeDescription
cpustringCPU number or all.
usrdecimal64Percentage of CPU utilization that occurred while executing at the
user level (application).
nicedecimal64Percentage of CPU utilization that occurred while executing at the
user level with nice priority.
sysdecimal64Percentage of CPU utilization that occurred while executing at the
system level (kernel). Note that this does not include time spent
servicing hardware and software interrupts.
iowaitdecimal64Percentage of time that the CPU or CPUs were idle during which the
system had an outstanding disk I/O request.
irqdecimal64Percentage of time spent by the CPU or CPUs to service hardware
interrupts.
softdecimal64Percentage of time spent by the CPU or CPUs to service software
interrupts.
stealdecimal64Percentage of time spent in involuntary wait by the virtual CPU or
CPUs while the hypervisor was servicing another virtual processor.
guestdecimal64Percentage of time spent by the CPU or CPUs to run a virtual
processor.
idledecimal64Percentage of time that the CPU or CPUs were idle and the system did
not have an outstanding disk I/O request.

The thermal-zone-entry Object

NameTypeDescription
iduint32Numeric id of the thermal zone based on the number in the path
/sys/class/thermal/thermal_zone*.
typestringType of the thermal zone.
tempint32Temperature of the thermal zone in millidegrees Celsius.
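
Since temp is reported in millidegrees, consumers usually convert it before display or comparison. A minimal sketch follows; the helper names and the example limit are hypothetical.

```python
from typing import Iterable, List


def to_celsius(temp_millidegrees: int) -> float:
    """Convert a thermal zone temp value to degrees Celsius."""
    return temp_millidegrees / 1000.0


def hot_zones(zones: Iterable[dict], limit_celsius: float) -> List[int]:
    """Return the ids of thermal-zone-entry objects above a limit."""
    return [z["id"] for z in zones if to_celsius(z["temp"]) > limit_celsius]
```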

Site provider specific supd metrics

This topic contains supd-related metrics.

Topic: system:supd-metrics

Available on every site, but only readable by the sys tenant on the Control Tower.

Object

The system samples CPU and memory metrics for the supd container once every 30 seconds.

NameTypeDescription
timedate-and-timeThe time the sample was taken.
sitenameThe name of the site where the event was generated.
cluster-hostnamenameThe cluster hostname of the host where the event was generated.
hostnamedomain-nameThe hostname of the host where the event was generated.
cpuObject
see supd-cpu-metric
CPU metric data.
memoryObject
see supd-memory-metric
Memory metric data.
long-gcsarray of Object
see gc-metric
Long GC times.
large-heapsarray of Object
see gc-metric
Large heap sizes at the time of GC.
long-schedulesarray of Object
see schedule-metric
Long schedule times.
long-msg-queuesarray of Object
see msg-queue-metric
Long msg queues.
busy-portsarray of Object
see busy-port
Busy ports.
busy-dist-portsarray of Object
see busy-port
Busy distribution ports.

The supd-cpu-metric Object

NameTypeDescription
nanosecondsuint64The total CPU usage of all tasks in the SUPD container. The value is
measured in nanoseconds.
cpusdecimal64The CPUs limit for the SUPD container.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of CPU used by the SUPD container in relation to limits.
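
One way to relate nanoseconds, cpus, and percentage-used is to compare two consecutive samples, as sketched below. How the system actually derives percentage-used is not described here, so this calculation and the function name are assumptions.

```python
from typing import Optional


def cpu_percentage(previous_ns: int, current_ns: int,
                   interval_seconds: float, cpus: float) -> Optional[float]:
    """CPU time consumed in the interval relative to the CPU limit.

    Returns None for an invalid interval or limit, or if the counter
    went backwards (for example after a container restart).
    """
    if interval_seconds <= 0 or cpus <= 0 or current_ns < previous_ns:
        return None
    available_ns = interval_seconds * 1e9 * cpus
    return round(100.0 * (current_ns - previous_ns) / available_ns, 2)
```

For example, 15 seconds of CPU time consumed over a 30 second interval with a limit of 2 CPUs corresponds to 25%.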

The supd-memory-metric Object

NameTypeDescription
useduint64The memory used by the SUPD container in bytes, calculated at the time the
sample was taken.
totaluint64The total memory available in bytes for the SUPD container.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of memory used by the SUPD container divided by total.
used-hotuint64The hot memory used by the SUPD container in bytes, calculated at
the time the sample was taken. Hot memory is the same as used,
but excluding inactive file buffers (inactive_file) and
inactive anonymous mappings (inactive_anon).
percentage-used-hotpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of hot memory used by the SUPD container divided by total.
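
The relationship between used, used-hot, and the two percentages can be sketched as follows. The inactive_file and inactive_anon inputs are assumed to be byte counts taken from the container's memory statistics; the function names are hypothetical.

```python
def hot_memory(used: int, inactive_file: int, inactive_anon: int) -> int:
    """Used memory excluding inactive file buffers and anonymous mappings."""
    return used - inactive_file - inactive_anon


def percentage_of_total(value: int, total: int) -> float:
    """Value divided by total as a percentage with two fractional digits."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * value / total, 2)
```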

The gc-metric Object

NameTypeDescription
pidstringThe suspended erlang process identifier.
registered-namestringThe suspended erlang process's registered name.
millisecondsuint64The GC time in milliseconds.
heap-sizeuint64The size of the used part of the heap.
heap-block-sizeuint64The size of the memory block used for storing the heap and the stack.
old-heap-sizeuint64The size of the used part of the old heap.
old-heap-block-sizeuint64The size of the memory block used for storing the old heap.
stack-sizeuint64The size of the stack.
mbuf-sizeuint64The combined size of message buffers associated with the process.

The schedule-metric Object

NameTypeDescription
idstringThe erlang process or port identifier.
registered-namestringThe erlang process's registered name.
millisecondsuint64The schedule time in milliseconds.

The busy-port Object

NameTypeDescription
pidstringThe suspended erlang process identifier.
registered-namestringThe suspended erlang process's registered name.
portstringThe busy port identifier.

The msg-queue-metric Object

NameTypeDescription
pidstringThe erlang process identifier.
registered-namestringThe erlang process's registered name.

Site provider specific alerts

An alert item for a site-provider-specific topic is an alert record, a JSON object described in the Alerts section.

Topic: system:site-alerts

Available on every site.

This topic contains important site-related alerts from the system.

Alert: disk-threshold-reached

Alert signaling that a disk reached an alerting threshold.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.
filesystemstringThe source of the mount point, usually a device.
typestringFile system type.
sizeuint64Total number of 1K-blocks where K is 1024 bytes.
usedgauge64Number of used 1K-blocks where K is 1024 bytes.
freegauge64Number of available (free) 1K-blocks where K is 1024 bytes.
percentage-usedpercentA percent value with up to 2 fractional digits. For example 12.77%.

Percentage of used divided by size.
mountstringThe mount point.

Alert: host-certificate-expires

Alert signaling that a host's certificates are about to expire, probably because the host is off-line.

Alert-specific data:

NameTypeDescription
cluster-hostnamedomain-nameThe cluster hostname of the host where the certificate is
about to expire.
hostnamedomain-nameThe hostname of the host where the certificate is
about to expire.
sitenameThe name of the site where the host is located.
expiresstringDate and time when the certificates expire.

Alert: host-down

Alert signaling that a host is down.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.

Alert: host-in-disaster-mode

Alert signaling that a host ended up in disaster mode.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.

Alert: host-in-distress

Alert signaling that a host ended up in distress.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.
distressenumeration
  • none
  • down
  • disaster-mode
  • no-space-on-disk
  • disk-threshold-reached
  • supd-down
  • system-certificates-expired
Indicating the level of distress in the host.

  • none
    No indication of distress in the host.
  • down
    Indicating that the host is down.
  • disaster-mode
    Indicating that the host is in disaster mode.
  • no-space-on-disk
    Indicating that a disk has no more space.
  • disk-threshold-reached
    Indicating that a disk has reached a threshold value.
  • supd-down
    Indicating that the supd is down.
  • system-certificates-expired
    Indicating that the system certificates have expired.
distress-infostringDetailed information about the distress.

Alert: no-space-left-on-disk

Alert signaling that a host has no space left on a critical disk.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.

Alert: reauthorize-requested

Alert signaling that a host is attempting to obtain a new set of certificates. It may indicate that the site has been off-line long enough for the old certificates to expire.

Alert-specific data:

NameTypeDescription
cluster-hostnamedomain-nameThe cluster hostname of the host where the certificate is
about to expire.
sitenameThe name of the site where the host is located.
peer-ipip-addressPeer IP of recovery request.
permittedbooleanWhether the operation was permitted, i.e., the reauthorize-host
action has been invoked for the host on this site, and this is
the first attempt to recover certificates after that.
Note that the operation is only allowed to be performed once
after calling reauthorize-host.
successfulbooleanIndicates if new certificates were successfully issued. This
value can only be true if permitted is also true, but may
be false if the request fails for some reason (e.g., root
certificates).

Alert: site-disconnected

This alert is generated at the Control Tower when the connection to an edge site is lost if when-disconnected is set to treat-as-error.

Alert-specific data:

NameTypeDescription
sitename

Alert: supd-down

Alert signaling that supd is down.

Alert-specific data:

NameTypeDescription
cluster-hostnamenameThe cluster hostname of the host where the alert arose.
hostnamedomain-nameThe hostname of the host where the alert arose.