Volga Topics
The system offers a set of built-in Volga topics that applications can consume.
Events
An item on an event topic is an event record. It's a JSON object with the following structure:
Definition of events sent on volga streams to tenants.
Name | Type | Description |
---|---|---|
event | string | The name of the specific event. |
occurred-at | date-and-time | Time when the event occurred. |
id | string | Unique identifier of the event instance. |
tenant | string | The event occurred in the context of this tenant. |
data | Object | Event-specific data. |
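As a rough illustration, the Python sketch below parses one event record and checks that the five fields from the table are present. The helper name `parse_event_record` and all sample values are hypothetical. The sketch also assumes that every field is always present and that `occurred-at` is an ISO 8601 timestamp with a numeric UTC offset; the table does not state either.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("event", "occurred-at", "id", "tenant", "data")

def parse_event_record(raw: str) -> dict:
    """Parse one event record and verify the documented fields exist."""
    record = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"event record lacks fields: {missing}")
    # Assumption: occurred-at is ISO 8601 with a numeric UTC offset.
    record["occurred-at"] = datetime.fromisoformat(record["occurred-at"])
    return record

# Hypothetical sample; every value here is made up for illustration.
SAMPLE = json.dumps({
    "event": "container-ready",
    "occurred-at": "2024-01-01T12:00:00+00:00",
    "id": "example-id",
    "tenant": "example-tenant",
    "data": {"application": "example-app"},
})

record = parse_event_record(SAMPLE)
print(record["event"], record["occurred-at"].isoformat())
```

The content of `data` depends on the event name, so a consumer would typically branch on `record["event"]` before reading it.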
Alerts
An item on an alert topic is an alert record. It's a JSON object with the following structure:
Definition of alerts sent on volga streams to tenants.
Name | Type | Description |
---|---|---|
alert | string | The name of the specific alert. |
time | date-and-time | The time the alert was generated. |
id | string | The site unique id of the alert. |
site | name | The name of the site where the alert was generated. |
kind | enumeration | Type of alert: infrastructure, application, or security. |
severity | enumeration | The severity of the alert. |
description | string | The description of the reason behind the alert. |
expiry-time | date-and-time | After this date and time this alert should be considered expired. |
cleared | boolean | Is true if the alert has been cleared, otherwise false. |
data | Object | Alert-specific data. |
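The `cleared` and `expiry-time` fields together tell a consumer whether an alert still needs attention. The sketch below shows one possible check in Python. The function name `is_active` and the sample values are hypothetical, and the timestamps are assumed to be ISO 8601 strings with a numeric UTC offset.

```python
from datetime import datetime, timezone

def is_active(alert: dict, now: datetime) -> bool:
    """Return True if the alert is neither cleared nor expired."""
    if alert.get("cleared", False):
        return False
    expiry = alert.get("expiry-time")
    # Assumption: expiry-time is ISO 8601 with a numeric UTC offset.
    if expiry is not None and now > datetime.fromisoformat(expiry):
        return False
    return True

# Hypothetical alert record; all values are illustrative only.
ALERT = {
    "alert": "application-error",
    "time": "2024-01-01T12:00:00+00:00",
    "id": "example-id",
    "site": "example-site",
    "expiry-time": "2024-01-02T12:00:00+00:00",
    "cleared": False,
}

before_expiry = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
after_expiry = datetime(2024, 1, 3, 0, 0, tzinfo=timezone.utc)
print(is_active(ALERT, before_expiry), is_active(ALERT, after_expiry))
```

The `now` argument must be timezone-aware, otherwise Python refuses to compare it with the parsed expiry time.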
Threshold alerts
For threshold alerts like container-layer-threshold-reached, ephemeral-volume-threshold-reached, persistent-volume-threshold-reached and disk-threshold-reached, the levels are:
- CRITICAL: 100%
- MAJOR: 90%
- WARNING: 80%
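The mapping from usage to level can be sketched as a small lookup. This is only an illustration: the function name `threshold_severity` is hypothetical, and the sketch assumes a level is reached when usage is greater than or equal to the listed percentage, which the list above does not spell out.

```python
from typing import Optional

# Levels from the list above, highest first.
THRESHOLD_LEVELS = [(100.0, "CRITICAL"), (90.0, "MAJOR"), (80.0, "WARNING")]

def threshold_severity(percentage_used: float) -> Optional[str]:
    """Return the highest level reached, or None below the lowest level."""
    for limit, severity in THRESHOLD_LEVELS:
        # Assumption: a level counts as reached at or above its percentage.
        if percentage_used >= limit:
            return severity
    return None

for value in (75.0, 80.0, 95.5, 100.0):
    print(value, threshold_severity(value))
```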
Tenant specific events
An item on a tenant specific events topic is an event record. It's a JSON object described in the Events section.
Topic: system:scheduler-events
Available on every site.
This topic contains scheduler related events on the local site.
Event: application-status-changed
This event is generated when the oper-status of an application changes. The oper-status of an application is an aggregated status of all service instances in the application.
In some cases, this event may be generated even if the oper-status has not changed. Clients need to be prepared for that.
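Because the event can arrive without an actual change, a client may want to remember the last status it saw and ignore repeats. The sketch below is one way to do that in Python. The class name `StatusTracker`, the helper `make_event` and the sample names and status values are hypothetical.

```python
class StatusTracker:
    """Remember the last oper-status seen per site and application."""

    def __init__(self) -> None:
        self._last = {}

    def is_real_change(self, event: dict) -> bool:
        """Return True only when oper-status differs from the last one seen."""
        data = event["data"]
        key = (data.get("site"), data["application"])
        status = data["oper-status"]
        if self._last.get(key) == status:
            return False  # repeated event, nothing actually changed
        self._last[key] = status
        return True

def make_event(status: str) -> dict:
    # Hypothetical record; names and status values are illustrative only.
    return {
        "event": "application-status-changed",
        "data": {"application": "example-app", "site": "example-site",
                 "oper-status": status},
    }

tracker = StatusTracker()
results = [tracker.is_real_change(make_event(status))
           for status in ("running", "running", "error")]
print(results)
```

Keying on both site and application keeps the same application on two sites from masking each other's changes.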
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
oper-status | enumeration | The oper-status of the application. |
application-version | string | The version of the application. |
application-deployment | string | The name of the application deployment to which the application belongs. |
site | name | The name of the site where the application is running. |
Event: container-completed
This event is generated when an init container has completed successfully.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | name | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
container-name | name | The name of the container. |
container-id | string | The internal id of the container. |
container-image | name | The name of the image that the container is running. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
Event: container-exited
This event is generated when the system detects that a VM container has exited with no errors. Such a container is restarted immediately.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | name | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
container-name | name | The name of the container. |
container-id | string | The internal id of the container. |
container-image | name | The name of the image that the container is running. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
Event: container-failed
This event is generated when the system detects that a container has failed; either because it exited, or because its startup or liveness probe failed.
A VM container that exited with status 0 does not generate this event; instead it generates a container-exited event.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | name | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
container-name | name | The name of the container. |
container-id | string | The internal id of the container. |
container-image | name | The name of the image that the container is running. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
reason | enumeration | |
restart-after | duration | Time until the container is restarted. A duration in years, days, hours, minutes and seconds. Format is [<digits>y][<digits>d][<digits>h][<digits>m][<digits>s]. Examples: 1y2d5h, 5h or 10m30s. |
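A duration value such as the one in restart-after can be split into its components with a regular expression. The sketch below is a minimal Python illustration that accepts the optional hours component used in the examples (5h, 1y2d5h). The names `parse_duration` and `to_seconds` are hypothetical, and the conversion to seconds assumes a 365-day year, which the description does not specify.

```python
import re

# Each unit is optional, but units must appear in the order y, d, h, m, s.
_DURATION_RE = re.compile(
    r"^(?:(?P<y>\d+)y)?(?:(?P<d>\d+)d)?(?:(?P<h>\d+)h)?"
    r"(?:(?P<m>\d+)m)?(?:(?P<s>\d+)s)?$"
)

def parse_duration(text: str) -> dict:
    """Split a duration string into integer components keyed by unit."""
    match = _DURATION_RE.match(text)
    if not text or match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return {unit: int(value or 0) for unit, value in match.groupdict().items()}

def to_seconds(parts: dict, days_per_year: int = 365) -> int:
    """Convert components to seconds. The year length is an assumption."""
    days = parts["y"] * days_per_year + parts["d"]
    return ((days * 24 + parts["h"]) * 60 + parts["m"]) * 60 + parts["s"]

for sample in ("1y2d5h", "5h", "10m30s"):
    print(sample, parse_duration(sample), to_seconds(parse_duration(sample)))
```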
Event: container-ready
This event is generated when a container's readiness probe is successful. If no readiness probe is configured, this event is generated as soon as the container has started.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | name | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
container-name | name | The name of the container. |
container-id | string | The internal id of the container. |
container-image | name | The name of the image that the container is running. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
Event: container-starting
This event is generated when the container has been created, and just before it is started.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | name | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
container-name | name | The name of the container. |
container-id | string | The internal id of the container. |
container-image | name | The name of the image that the container is running. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
Event: container-stopped
This event is generated when the system has stopped the container for any reason.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | name | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
container-name | name | The name of the container. |
container-id | string | The internal id of the container. |
container-image | name | The name of the image that the container is running. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
Event: container-unready
This event is generated when a container's readiness probe fails.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | name | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
container-name | name | The name of the container. |
container-id | string | The internal id of the container. |
container-image | name | The name of the image that the container is running. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
Event: service-instance-creation-failed
This event is generated when no service instances could be created for an application.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | string | The version of the application. |
service-name | name | The name of the service. |
site | name | The name of the site where the event was generated. |
error-message | string | Additional information about the failure to create service instances. |
Event: service-instance-failed
This event is generated when some container in a service failed, or when the service instance could not be scheduled or started.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | string | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network. |
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance. |
container-name | name | The name of the failing container. |
reason | enumeration | |
Event: service-instance-ready
This event is generated when all containers in a service are ready, i.e., their readiness probes succeeded.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | string | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network. |
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance. |
Event: service-instance-starting
This event is generated just before the containers in a service instance are started.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | string | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network. |
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance. |
Event: service-instance-stopped
This event is generated when all containers in a service instance have been stopped.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | string | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network. |
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance. |
reason | enumeration | |
Event: service-instance-unready
This event is generated when one or more containers in a service are unready; i.e., their readiness probes failed.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | string | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network. |
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance. |
Event: service-instance-updated
This event is generated when a service instance's specification has been updated.
Event-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | string | The version of the application. |
service-name | name | The name of the service. |
service-instance | name | The name of the service instance. |
site | name | The name of the site where the event was generated. |
hostname | name | The name of the host where the event was generated. |
application-ips | array of ip-address | The IP addresses assigned to this service instance on the application network. |
ingress-ips | array of ip-address | Ingress IP addresses assigned to this service instance. |
Topic: system:deployment-events
Available only on the Control Tower.
This topic contains application deployment related events.
Event: application-summary-status-changed
Sent when a site has reported that an application deployed to the site has changed its oper-status.
Event-specific data:
Name | Type | Description |
---|---|---|
application-deployment | name | The name of the application deployment. |
application | name | The name of the application. |
application-version | string | The version of the application. |
is-canary | empty | Present if the event relates to deployment to a canary site. |
sites | array of name | The affected sites. |
oper-status | enumeration | This is an aggregated status of the application. |
error-message | string | Gives additional information when oper-status is error. |
Event: deployment-failed
Sent when a set of sites have reported that the application has failed to be deployed at the sites.
These sites are moved to the deploy-failed list in the application deployment's state.
Event-specific data:
Name | Type | Description |
---|---|---|
application-deployment | name | The name of the application deployment. |
application | name | The name of the application. |
application-version | string | The version of the application. |
is-canary | empty | Present if the event relates to deployment to a canary site. |
sites | array of name | The affected sites. |
error-message | string | Error message describing the failure. |
Event: deployment-initiated
Sent when a set of sites have been instructed to deploy a specific version of an application.
These sites are moved to the deploying-to list in the application deployment's state.
Event-specific data:
Name | Type | Description |
---|---|---|
application-deployment | name | The name of the application deployment. |
application | name | The name of the application. |
application-version | string | The version of the application. |
is-canary | empty | Present if the event relates to deployment to a canary site. |
sites | array of name | The affected sites. |
Event: deployment-sites-removed
Sent when an application is no longer deployed to a set of sites.
When this event is sent, the sites have been instructed to remove the application.
Event-specific data:
Name | Type | Description |
---|---|---|
application-deployment | name | The name of the application deployment. |
application | name | The name of the application. |
application-version | string | The version of the application. |
is-canary | empty | Present if the event relates to deployment to a canary site. |
sites | array of name | The affected sites. |
Event: deployment-succeeded
Sent when a set of sites have reported that the application has been successfully deployed at the sites. Successfully deployed means that all service instances are in state running.
These sites are moved to the deployed-to list in the application deployment's state.
Event-specific data:
Name | Type | Description |
---|---|---|
application-deployment | name | The name of the application deployment. |
application | name | The name of the application. |
application-version | string | The version of the application. |
is-canary | empty | Present if the event relates to deployment to a canary site. |
sites | array of name | The affected sites. |
Topic: system:config-events
Available only on the Control Tower.
This topic contains configuration change events.
Event: config-created
Sent when a new config object was created.
Event-specific data:
Name | Type | Description |
---|---|---|
username | string | Identifies the user that initiated the config change. |
object | name | Kind of config object. |
path | string | Identifier of the unique object. |
Event: config-deleted
Sent when a config object was deleted.
Event-specific data:
Name | Type | Description |
---|---|---|
username | string | Identifies the user that initiated the config change. |
object | name | Kind of config object. |
path | string | Identifier of the unique object. |
Event: config-updated
Sent when a config object was updated.
Event-specific data:
Name | Type | Description |
---|---|---|
username | string | Identifies the user that initiated the config change. |
object | name | Kind of config object. |
path | string | Identifier of the unique object. |
Topic: system:events
Available only on the Control Tower.
This topic contains important events from the system.
Event: site-connected
This event is generated at the Control Tower when an edge site connects.
Event-specific data:
Name | Type | Description |
---|---|---|
site | name | |
Event: site-disconnected
This event is generated at the Control Tower when the connection to an edge site is lost.
Event-specific data:
Name | Type | Description |
---|---|---|
site | name | |
Event: supd-download
This event is generated at the Control Tower when supd is being downloaded by a site.
Event-specific data:
Name | Type | Description |
---|---|---|
environment | string | The name of the Control Tower environment. |
site | string | The name of the site that downloaded supd. |
tenant | string | The name of the tenant that downloaded supd. |
image | string | The full name of the supd image that the site downloaded. |
peer | ip-address | The IP address of the site that downloaded supd. Note that when a load balancer is used in front of the Control Tower, this will be the address of the load balancer instead. |
forwarded | array of ip-address | When a load balancer is used in front of the Control Tower, the true address of the site will be found here. |
Event: supd-upgrade-announcement
This event is generated at the Control Tower when supd is about to be upgraded.
Event-specific data:
Name | Type | Description |
---|---|---|
current-version | version | |
new-version | version | |
wait-time | duration | A duration in years, days, hours, minutes and seconds. Format is [<digits>y][<digits>d][<digits>h][<digits>m][<digits>s]. Examples: 1y2d5h, 5h or 10m30s. |
Event: supd-upgraded
This event is generated at the Control Tower when supd has been successfully upgraded.
Event-specific data:
Name | Type | Description |
---|---|---|
version | version | |
Event: ui-upgrade-announcement
This event is generated at the Control Tower when the UI is about to be upgraded.
Event-specific data:
Name | Type | Description |
---|---|---|
current-version | version | |
new-version | version | |
wait-time | duration | A duration in years, days, hours, minutes and seconds. Format is [<digits>y][<digits>d][<digits>h][<digits>m][<digits>s]. Examples: 1y2d5h, 5h or 10m30s. |
Event: ui-upgrade-failed
This event is generated at the Control Tower when the UI has failed to be upgraded.
Event-specific data:
Name | Type | Description |
---|---|---|
version | version | |
error-message | string | |
Event: ui-upgraded
This event is generated at the Control Tower when the UI has been successfully upgraded.
Event-specific data:
Name | Type | Description |
---|---|---|
version | version | |
Tenant specific application metrics
Topic: system:application-metrics
Available on every site except on the Control Tower.
This topic contains application related metrics for the tenant.
Object
The system samples metrics related to CPU, memory and network traffic once every 10 seconds. Every 30 seconds, related samples are collected into this object.
Metrics are sampled on three different levels: per container, per service instance, or for the whole application on a host.
Name | Type | Description |
---|---|---|
site | name | The name of the site where the metrics originated from. |
tenant | name | The name of the tenant. |
entries | array of Object see metrics-entry | A set of samples for this tenant. |
The metrics-entry Object
Name | Type | Description |
---|---|---|
time | date-and-time | The time the sample was taken. |
host | string | The host where the sample was taken. |
application | name | The name of the application that was sampled. |
per-container OR per-service OR per-application | Object see per-container-metrics OR Object see per-service-metrics OR Object see per-application-metrics | |
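Since each metrics-entry carries one of three alternative objects, a consumer has to check which key is present before reading the metrics. The sketch below shows one way to do that in Python. The function name `entry_level` and the sample entry are hypothetical, and the sketch assumes exactly one of the three keys appears in each entry.

```python
LEVEL_KEYS = ("per-container", "per-service", "per-application")

def entry_level(entry: dict) -> tuple:
    """Return (level key, metrics object) for one metrics-entry."""
    present = [key for key in LEVEL_KEYS if key in entry]
    # Assumption: exactly one of the three alternatives is present.
    if len(present) != 1:
        raise ValueError(f"expected one level key, found {present}")
    return present[0], entry[present[0]]

# Hypothetical entry; all names and numbers are illustrative only.
ENTRY = {
    "time": "2024-01-01T12:00:00+00:00",
    "host": "example-host",
    "application": "example-app",
    "per-container": {
        "service-instance": "example-service-1",
        "container": "example-container",
        "memory": {"used": 1024, "total": 4096},
    },
}

level, metrics = entry_level(ENTRY)
print(level, metrics["container"])
```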
The per-container-metrics Object
Name | Type | Description |
---|---|---|
service-instance | name | The name of the service instance. |
container | name | The name of the container. |
memory | Object see memory-metric | Memory metric data. |
cpu | Object see cpu-metric | CPU metric data. |
container-layer | Object see disk-metric | Container layer storage metric data. This section of metrics is only available if the underlying file system has support for quota. |
The memory-metric Object
Name | Type | Description |
---|---|---|
used | uint64 | The memory used by the container in bytes, calculated at the time the sample was taken. |
total | uint64 | The total memory available in bytes for the container. |
percentage-used | percent | A percent value with up to 2 fractional digits, for example 12.77%. Percentage of memory used by the container divided by total. |
used-hot | uint64 | The hot memory used by the container in bytes, calculated at the time the sample was taken. Hot memory is the same as used, but excluding inactive file buffers (inactive_file) and inactive anonymous mappings (inactive_anon). |
percentage-used-hot | percent | A percent value with up to 2 fractional digits, for example 12.77%. Percentage of hot memory used by the container divided by total. |
The cpu-metric Object
Name | Type | Description |
---|---|---|
nanoseconds | uint64 | The total number of CPU nanoseconds used by the container. |
cpus | decimal64 | The CPU limit for the container, i.e., the maximum number of CPUs the container can use. |
shares | uint16 | The CPU shares limit for the container. |
percentage-used | percent | A percent value with up to 2 fractional digits, for example 12.77%. Percentage of CPU used in relation to limits. |
The disk-metric Object
Name | Type | Description |
---|---|---|
size | uint64 | Total number of 1K-blocks where K is 1024 bytes. |
used | gauge64 | Number of used 1K-blocks where K is 1024 bytes. |
free | gauge64 | Number of available (free) 1K-blocks where K is 1024 bytes. |
percentage-used | percent | A percent value with up to 2 fractional digits, for example 12.77%. Percentage of used divided by size. |
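The block counts in a disk-metric object convert to bytes by multiplying with 1024, and the percentage follows from used divided by size. The Python sketch below illustrates the arithmetic. The function name `disk_usage`, the output key names and the sample numbers are hypothetical, and rounding to two decimals is an assumption about how the system derives its value.

```python
BLOCK_SIZE = 1024  # one 1K-block is 1024 bytes, per the table above

def disk_usage(metric: dict) -> dict:
    """Convert 1K-block counts to bytes and recompute the used percentage."""
    size, used = metric["size"], metric["used"]
    # Assumption: the percentage is rounded to two fractional digits.
    percentage = round(100 * used / size, 2) if size else 0.0
    return {
        "size-bytes": size * BLOCK_SIZE,
        "used-bytes": used * BLOCK_SIZE,
        "free-bytes": metric["free"] * BLOCK_SIZE,
        "percentage-used": percentage,
    }

# Hypothetical sample values.
usage = disk_usage({"size": 2048, "used": 512, "free": 1536})
print(usage)
```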
The per-service-metrics Object
Name | Type | Description |
---|---|---|
service-instance | name | The name of the service instance. |
ephemeral-volumes | array of Object see disk-volume-metric | Ephemeral volume storage parameters. Ephemeral volume parameters are only available for volumes where the underlying file system has support for quota. |
persistent-volumes | array of Object see disk-volume-metric | Persistent volume storage parameters. Persistent volume parameters are only available for volumes where the underlying file system has support for quota. |
The disk-volume-metric Object
Name | Type | Description |
---|---|---|
size | uint64 | Total number of 1K-blocks where K is 1024 bytes. |
used | gauge64 | Number of used 1K-blocks where K is 1024 bytes. |
free | gauge64 | Number of available (free) 1K-blocks where K is 1024 bytes. |
percentage-used | percent | A percent value with up to 2 fractional digits, for example 12.77%. Percentage of used divided by size. |
volume-name | string | |
The per-application-metrics Object
Per-application metrics, aggregated for all service instances that are part of the same application.
Name | Type | Description |
---|---|---|
gateway-network | Object see gateway-network-metrics | |
hosts | array of Object see host-metric | Resource metrics aggregated per application per host. |
The gateway-network-metrics Object
Name | Type | Description |
---|---|---|
tx-packets | uint64 | Transmitted external traffic on the gateway network, i.e., traffic leaving the host. Note that transmitted traffic is counted before bandwidth restrictions are applied. This is usually not a problem for protocols with flow-control, but in case of an application sending uncontrollably large amount of traffic with the destination outside the host the traffic is included here. |
tx-bytes | uint64 | Transmitted external traffic on the gateway network, i.e., traffic leaving the host. The value represents frame payload, Layer 2 header not included. Note that transmitted traffic is counted before bandwidth restrictions are applied. This is usually not a problem for protocols with flow-control, but in case of an application sending uncontrollably large amount of traffic with the destination outside the host the traffic is included here. |
rx-packets | uint64 | Received external traffic on the gateway network, i.e., traffic originated outside the host. Only the traffic allowed by the firewall is included, i.e. traffic on the open ingress ports and traffic on established connections. Note that received traffic is counted before bandwidth restrictions are applied. This is usually not a problem for protocols with flow-control, but in case of an external party sending uncontrollably large amount of traffic to the application the traffic is included here. |
rx-bytes | uint64 | Received external traffic on the gateway network, i.e., traffic originated outside the host. The value represents frame payload, Layer 2 header not included. Only the traffic allowed by the firewall is included, i.e. traffic on the open ingress ports and traffic on established connections. Note that received traffic is counted before bandwidth restrictions are applied. This is usually not a problem for protocols with flow-control, but in case of an external party sending uncontrollably large amount of traffic to the application the traffic is included here. |
tx-packets-per-second | uint64 | Intensity of transmitted external traffic in packets per second. This value is the average intensity over the interval between the last two reported samples. Hence it is not reported when there are not enough recent samples available. |
tx-bytes-per-second | uint64 | Intensity of transmitted external traffic in bytes per second. This value is the average intensity over the interval between the last two reported samples. Hence it is not reported when there are not enough recent samples available. |
rx-packets-per-second | uint64 | Intensity of received external traffic in packets per second. This value is the average intensity over the interval between the last two reported samples. Hence it is not reported when there are not enough recent samples available. |
rx-bytes-per-second | uint64 | Intensity of received external traffic in bytes per second. This value is the average intensity over the interval between the last two reported samples. Hence it is not reported when there are not enough recent samples available. |
upstream-bandwidth-utilization | decimal64 | If an upstream-bandwidth-per-host is configured for this application, then this value indicates the fraction of the available bandwidth used by the application. This value is based on the tx-bytes-per-second metric. |
downstream-bandwidth-utilization | decimal64 | If a downstream-bandwidth-per-host is configured for this application, then this value indicates the fraction of the available bandwidth used by the application. This value is based on the rx-bytes-per-second metric. |
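The per-second values are averages between two consecutive samples, and the utilization values are fractions of a configured limit. The Python sketch below reproduces that arithmetic for illustration only. The function names are hypothetical, the sample numbers are made up, and the limit is assumed to be already expressed in bytes per second; the unit of the configured bandwidth is not stated here. Counter resets are not handled.

```python
from datetime import datetime, timedelta
from typing import Optional

def average_rate(prev_count: int, prev_time: datetime,
                 curr_count: int, curr_time: datetime) -> Optional[float]:
    """Average per-second rate between two cumulative counter samples."""
    elapsed = (curr_time - prev_time).total_seconds()
    if elapsed <= 0:
        return None  # not enough samples to form an interval
    return (curr_count - prev_count) / elapsed

def utilization(bytes_per_second: float,
                limit_bytes_per_second: float) -> float:
    """Fraction of the limit in use. Assumes both values share one unit."""
    return bytes_per_second / limit_bytes_per_second

# Hypothetical samples taken 30 seconds apart.
t0 = datetime(2024, 1, 1, 12, 0, 0)
t1 = t0 + timedelta(seconds=30)
rate = average_rate(1_000, t0, 4_000, t1)
print(rate, utilization(rate, 400.0))
```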
The host-metric Object
Name | Type | Description |
---|---|---|
host | string | The application metrics are from this host. |
memory-percentage-used | percent | A percent value with up to 2 fractional digits, for example 12.77%. Percentage of aggregated memory used by the application on this host in relation to total available memory. |
cpu-percentage-used | percent | A percent value with up to 2 fractional digits, for example 12.77%. Percentage of maximum CPU used among all containers for the application on this host in relation to total available CPUs. |
disk-percentage-used | percent | A percent value with up to 2 fractional digits, for example 12.77%. Percentage of maximum disk used among all container layers, ephemeral and persistent volumes with quota support for the application on this host. This metric is not available if no file system for this application supports quota. |
Tenant specific audit trail log
The audit trail log records all operations performed by a tenant. The log includes the access token, if provided, the operation, and any parameters provided.
In order to protect sensitive data, e.g., tokens and secrets, all such data is hashed using a tenant-specific HMAC before being logged. To search for some specific sensitive data in the logs, e.g., operations performed using a specific access token, the plain text version of the data can be hashed using the strongbox/audit/hmac operation.
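The reason hashed values remain searchable is that a keyed hash is deterministic: the same key and plain text always give the same digest. The Python sketch below demonstrates that property with the standard `hmac` module. It is not the system's implementation: the digest algorithm (SHA-256 here), the hex encoding, the key values and the function name `keyed_hash` are all assumptions made for illustration. In practice the strongbox/audit/hmac operation performs the hashing.

```python
import hashlib
import hmac

def keyed_hash(key: bytes, plaintext: str) -> str:
    """Return a hex HMAC digest. SHA-256 is an assumption, not the spec."""
    return hmac.new(key, plaintext.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical keys and token; real keys are held by the system.
tenant_a_key = b"example-key-for-tenant-a"
tenant_b_key = b"example-key-for-tenant-b"
token = "example-access-token"

first = keyed_hash(tenant_a_key, token)
second = keyed_hash(tenant_a_key, token)
other_tenant = keyed_hash(tenant_b_key, token)

# Same key and text give the same digest, so a log search can match on it.
print(first == second, first == other_tenant)
```

Because the key is tenant-specific, the same token hashed for another tenant gives an unrelated digest.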
Audit trail logs are streamed upwards from edge sites to the Control Tower. This allows inspection of audit logs even if a site is compromised.
Messages in the audit trail log are cryptographically signed when stored to disk in order to guarantee their integrity. The signature is always verified when a message is retrieved from disk.
Topic: system:audit-trail-log
Available on every site.
This log contains all authenticated requests for the tenant.
Object
Name | Type | Description |
---|---|---|
occurred-at | date-and-time | Time when access occurred. |
request-time-ms | gauge64 | Time to process request in milliseconds. |
user | string | User or Approle that performed the access. |
path | string | Path that was accessed. |
query | string | Any supplied URL query parameters. |
method | string | HTTP method that was used. |
status | uint32 | HTTP response status. |
status-info | string | Text representation of status. |
request-parameters | Object | The parameters included in the request, if any, i.e., the request body. |
client-ip | string | Address from which the client accessed the host. |
x-forwarded-for | string | If the request went through a load balancer, the client's real IP address may appear as x-forwarded-for. It may be a list of addresses where the first address should be the client's original address. |
x-real-ip | string | If the request went through a load balancer, the client's real IP address may appear as x-real-ip. It may be a list of addresses where the first address should be the client's original address. |
site | string | Site where the access occurred. |
host | string | Host on which the access occurred. |
tenant | name | Tenant that performed the access. |
token | string | Hashed representation of the access token. It can be used when identifying a specific session, and all accesses/operations using the same token. |
accessor | string | Token accessor that can be used to terminate an ongoing session. |
from | string | The 'from' request header from the originating HTTP request. |
user-agent | string | User-agent that performed the request, e.g., curl or firefox. |
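As an illustration of how the address fields relate, the following minimal sketch picks the most likely original client address from an audit trail record. It assumes that a list of addresses is comma-separated, as in the HTTP X-Forwarded-For header; the record values and the function name are hypothetical.

```python
from typing import Optional


def original_client_address(record: dict) -> Optional[str]:
    """Pick the most likely original client address from an audit record."""
    for field in ("x-forwarded-for", "x-real-ip"):
        value = record.get(field)
        if value:
            # Assumption: multiple addresses are comma-separated, as in the
            # HTTP X-Forwarded-For header; the first is the original client.
            first = value.split(",")[0].strip()
            if first:
                return first
    # No proxy headers recorded: fall back to the directly connected peer.
    return record.get("client-ip")


# Hypothetical record, reduced to a few of the fields in the table above.
record = {
    "client-ip": "10.0.0.7",
    "x-forwarded-for": "203.0.113.5, 10.0.0.1",
    "method": "GET",
    "status": 200,
}
print(original_client_address(record))
```

Forwarding headers are supplied by the client or by intermediaries, so the result is best treated as a hint rather than as proof of origin.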
Tenant specific logs
An item on a log topic is a string.
Topic: system:container-logs:CONTAINER-ID
Available on the host where the container is running.
There is one topic per running container. It contains all output from standard output and standard error from the container.
The CONTAINER-ID is a string in the format APPLICATIONNAME.SERVICENAME-IX.CONTAINERNAME, where IX is a numeric index, one per service instance (replica).
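The CONTAINER-ID format can be taken apart with a regular expression. This is a minimal sketch, assuming that none of the names contain a dot; the helper names and the example identifier are hypothetical.

```python
import re
from typing import NamedTuple


class ContainerId(NamedTuple):
    application: str
    service: str
    index: int
    container: str


# Assumption: names never contain a dot, so dots only act as separators.
# The service name may itself contain hyphens; the index is the last
# hyphen-separated, all-digit part before the second dot.
_CONTAINER_ID = re.compile(
    r"^(?P<application>[^.]+)\.(?P<service>[^.]+)-(?P<index>\d+)\.(?P<container>[^.]+)$"
)


def parse_container_id(container_id: str) -> ContainerId:
    match = _CONTAINER_ID.match(container_id)
    if match is None:
        raise ValueError(f"unexpected CONTAINER-ID format: {container_id!r}")
    return ContainerId(
        application=match["application"],
        service=match["service"],
        index=int(match["index"]),
        container=match["container"],
    )


def container_log_topic(container_id: str) -> str:
    """Build the topic name for a container's log."""
    return f"system:container-logs:{container_id}"


parsed = parse_container_id("shop.web-frontend-1.nginx")
print(parsed)
print(container_log_topic("shop.web-frontend-1.nginx"))
```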
The default behavior, if a container is rescheduled to another host, is to delete the container log topic for the previous container. However, if container-log-archive is set to true in the application specification, the container log will instead have a timestamp appended and become read-only.
An archived container log will continue to exist on the original host for the number of days given by container-log-archive-days.
Topic: system:logs
Available on every site.
This topic contains tenant-specific log info generated by the Avassa system. Each log item is in the format
<LEVEL> (TENANT) DATE TIME HOSTNAME SRCFILE PID
where LEVEL is one of EMERGENCY, ALERT, CRITICAL, ERROR, WARNING, NOTICE, INFO, DEBUG.
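A consumer that only cares about the more severe log items could filter on the leading LEVEL field. The sketch below assumes that the levels are listed from most to least severe (as in syslog) and accepts the level with or without the surrounding angle brackets; the sample log items are hypothetical.

```python
from typing import Optional

# Assumption: listed from most to least severe, as in syslog.
LEVELS = ["EMERGENCY", "ALERT", "CRITICAL", "ERROR",
          "WARNING", "NOTICE", "INFO", "DEBUG"]


def log_level(item: str) -> Optional[str]:
    """Return the level of a log item, or None if it is not recognized."""
    tokens = item.split(maxsplit=1)
    if not tokens:
        return None
    # Tolerate both "<ERROR>" and "ERROR" as the first field.
    level = tokens[0].strip("<>")
    return level if level in LEVELS else None


def at_least(item: str, threshold: str) -> bool:
    """True if the item's level is at least as severe as the threshold."""
    level = log_level(item)
    return level is not None and LEVELS.index(level) <= LEVELS.index(threshold)


# Hypothetical log items following the field order described above.
items = [
    "<ERROR> (acme) 2024-05-01 12:00:00 edge-host-1 scheduler.erl 4711",
    "<INFO> (acme) 2024-05-01 12:00:01 edge-host-1 scheduler.erl 4711",
]
important = [item for item in items if at_least(item, "WARNING")]
print(important)
```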
Tenant specific alerts
An alert item on a tenant specific topic is an alert record. It's a JSON object described in the Alerts section.
Topic: system:alerts
Available on every site.
This topic contains important alerts from the system.
Alert: application-error
Alert signaling when an application ends up in an erroneous oper-status.
Alert-specific data:
Name | Type | Description |
---|---|---|
application | name | The name of the application. |
application-version | string | The version of the application. Note that this is the application version triggering the change of the application oper-state. |
application-deployment | string | The name of the application deployment to which the application belongs. |
error-message | string | Additional information about the application failure. |
Alert: bound-cidrs-violation
Issued when an attempt is made to use a token from an IP address not in the bound-cidrs range.
Alert-specific data:
Name | Type | Description |
---|---|---|
id | name | Id of the token that triggered security breach. |
hostname | domain-name | The hostname of the host where the alert arose. |
meta | string | |
peer-ip | ip-address | Peer IP of this attempt. |
x-forwarded-for | array of ip-address | X-Forwarded-For IP. |
Alert: container-layer-threshold-reached
Alert signaling that a container-layer reached an alerting threshold.
Alert-specific data:
Name | Type | Description |
---|---|---|
hostname | name | The hostname of the host where the alert arose. |
application | name | The name of the application. |
service-instance | name | The name of the service instance. |
container | name | The name of the container. |
size | uint64 | Total number of 1K-blocks where K is 1024 bytes. |
used | gauge64 | Number of used 1K-blocks where K is 1024 bytes. |
free | gauge64 | Number of available (free) 1K-blocks where K is 1024 bytes. |
percentage-used | percent | A percent value with up to 2 fractional digits. For example 12.77%. Percentage of used divided by size. |
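To make the relation between used, size and percentage-used concrete, here is a small worked sketch. It also maps the percentage to the threshold levels listed in the Threshold alerts section, assuming that a level applies at or above its percentage; the function names and numbers are illustrative only.

```python
from typing import Optional


def percentage_used(used: int, size: int) -> float:
    """Percentage of used divided by size, with 2 fractional digits."""
    if size <= 0:
        raise ValueError("size must be positive")
    return round(100.0 * used / size, 2)


def threshold_level(percentage: float) -> Optional[str]:
    """Map a percentage to a threshold level.

    Assumption: a level applies when the value is at or above its threshold.
    """
    if percentage >= 100.0:
        return "CRITICAL"
    if percentage >= 90.0:
        return "MAJOR"
    if percentage >= 80.0:
        return "WARNING"
    return None


# Example: 912345 used 1K-blocks out of 1000000.
pct = percentage_used(used=912_345, size=1_000_000)
print(pct, threshold_level(pct))
```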
Alert: custom-alert
A custom alert issued and controlled by the tenant.
Alert-specific data:
Name | Type | Description |
---|---|---|
custom-id | string | The custom identifier of the alert. This could be a unique identifier of the entity affected by the fault causing the alert. As an example it could be a unique identifier of the failing service instance (replica) of an application on a site. |
custom-name | string | The custom name of the specific alert. This should be a unique name describing the alert. |
application | name | The name of the application. This attribute will only be available if the alert or clear operation has been executed from within the application container using an approle. |
service-instance | name | The name of the service instance id. This attribute will only be available if the alert or clear operation has been executed from within the application container using an approle. |
Alert: ephemeral-volume-threshold-reached
Alert signaling that an ephemeral volume reached an alerting threshold.
Alert-specific data:
Name | Type | Description |
---|---|---|
hostname | name | The hostname of the host where the alert arose. |
application | name | The name of the application. |
service-instance | name | The name of the service instance. |
size | uint64 | Total number of 1K-blocks where K is 1024 bytes. |
used | gauge64 | Number of used 1K-blocks where K is 1024 bytes. |
free | gauge64 | Number of available (free) 1K-blocks where K is 1024 bytes. |
percentage-used | percent | A percent value with up to 2 fractional digits. For example 12.77%. Percentage of used divided by size. |
volume-name | string | The ephemeral volume name. |
Alert: failed-login-attempts
Issued when multiple login attempts have failed. It may be an indication of a security breach.
Alert-specific data:
Name | Type | Description |
---|---|---|
username | name | Username subject to login attempts. |
attempts | uint32 | Number of failed consecutive attempts at the time of the alert. |
peer-ip | ip-address | Peer IP of last failed login attempt. |
x-forwarded-for | array of ip-address | X-Forwarded-For IP. |
Alert: invalid-auto-cert-configuration
An auto-cert has been configured with a TTL that exceeds either the renew-threshold or the activate-threshold of the CA certificate.
Alert-specific data:
Name | Type | Description |
---|---|---|
secret-name | name | Name of secret with invalid auto-cert configuration. |
ca-name | name | Name of CA used to generate certificate. |
Alert: os-upgrade-failed
Alert signaling that an OS upgrade has failed on a host.
Alert-specific data:
Name | Type | Description |
---|---|---|
hostname | domain-name | The hostname of the host where the alert arose. |
Alert: persistent-volume-threshold-reached
Alert signaling that a persistent volume reached an alerting threshold.
Alert-specific data:
Name | Type | Description |
---|---|---|
hostname | string | The hostname of the host where the alert arose. |
application | name | The name of the application. |
service-instance | name | The name of the service instance. |
size | uint64 | Total number of 1K-blocks where K is 1024 bytes. |
used | gauge64 | Number of used 1K-blocks where K is 1024 bytes. |
free | gauge64 | Number of available (free) 1K-blocks where K is 1024 bytes. |
percentage-used | percent | A percent value with up to 2 fractional digits. For example 12.77%. Percentage of used divided by size. |
volume-name | string | The persistent volume name. |
Alert: suspected-security-breach
Issued when use of a honeypot token is detected.
Alert-specific data:
Name | Type | Description |
---|---|---|
id | name | Id of the token that triggered security breach. |
hostname | domain-name | The hostname of the host where the alert arose. |
meta | string | |
peer-ip | ip-address | Peer IP of this attempt. |
x-forwarded-for | array of ip-address | X-Forwarded-For IP. |
Alert: unwrap-failure
Issued when multiple unwraps of a secret are attempted. It may be an indication of a security breach.
Alert-specific data:
Name | Type | Description |
---|---|---|
id | name | Id of the token subject to multiple unwraps. |
meta | string | |
peer-ip | ip-address | Peer IP of this unwrap attempt. |
successful-peer-ip | ip-address | Peer IP of the successful unwrap (the first attempted unwrap). |
successful-time | date-and-time | Time of successful unwrap. |
Alert: volga-topic-integrity
Issued when a volga topic contains signed messages that cannot be verified, or messages appear out of order. Suggests tampering or file system corruption.
Alert-specific data:
Name | Type | Description |
---|---|---|
topic | string | The name of the affected topic. |
file | uint64 | The affected topic file number. |
position | uint64 | The file offset of the bad message. |
expected-sequence-number | uint64 | For an out of order message, the expected sequence number. |
actual-sequence-number | uint64 | For an out of order message, the actual sequence number. |
Site provider specific events
An item on a topic readable by site providers is an event record. It's a JSON object described in the Events section.
Topic: system:all-scheduler-events
Available on every site.
This topic contains the union of all scheduler related events on the local site for all tenants and is readable only by site providers.
The events are the same as for the topic system:scheduler-events.
Site provider specific audit trail log
Topic: system:unauthenticated-audit-trail-log
Available only on the Control Tower.
This log is available to the top site provider, and contains all unauthenticated requests to the system. The format is the same as for the system:audit-trail-log.
Site provider specific host metrics
This topic contains host-related metrics.
Topic: system:host-metrics
Available on every site, but only readable by the sys tenant on the Control Tower.
Object
The system samples host metrics related to cpu, memory and disk once every 30 seconds.
Name | Type | Description |
---|---|---|
time | date-and-time | The time the sample was taken. |
site | name | The name of the site where the event was generated. |
cluster-hostname | name | The cluster hostname of the host where the event was generated. |
hostname | domain-name | The hostname of the host where the event was generated. |
cpu | Object see cpu-params | |
memory | Object see mem-params | |
loadavg | Object see loadavg-params | |
disk | array of Object see disk-entry | A set of disk usage parameters for file systems used by either the Edge Enforcer (supd) or by any application managed by the Edge Enforcer. All file systems listed are from inside the Edge Enforcer container. Metrics with mount points reported as CONTAINER-ROOT indicate that the entry reports the Edge Enforcer container's root file system usage, not the host's root file system. |
disk-io | array of Object see disk-io-entry | A set of disk io metrics for disk partitions used by either the Edge Enforcer (supd) or by any application managed by the Edge Enforcer. |
cpus | array of Object see mpstat-cpu-entry | CPU related statistics. Entries can be either aggregated for all CPUs or per CPU. |
thermal-zones | array of Object see thermal-zone-entry | Temperature related statistics per thermal zone. |
The cpu-params Object
CPU metrics taken from /proc/cpuinfo.
Name | Type | Description |
---|---|---|
vcpus | uint32 | Total number of CPUs. |
The mem-params Object
Memory metrics taken from /proc/meminfo.
Name | Type | Description |
---|---|---|
total | uint64 | Total usable RAM memory in bytes. |
free | uint64 | Free usable RAM memory in bytes. |
available | uint64 | An estimate of how much RAM memory in bytes is available for starting new applications, without swapping. |
used | uint64 | Used RAM memory in bytes, calculated as: (MemTotal + SwapTotal) - (MemFree + SwapFree) - Buffers - (Cached + SReclaimable). |
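The formula for used can be reproduced from the corresponding /proc/meminfo fields. The sketch below parses a shortened, made-up meminfo sample and converts the kB values to bytes; the function names and sample numbers are hypothetical.

```python
# Shortened, made-up sample in the /proc/meminfo layout.
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    5000000 kB
Buffers:          200000 kB
Cached:          2500000 kB
SwapTotal:       2000000 kB
SwapFree:        2000000 kB
SReclaimable:     300000 kB
"""


def parse_meminfo(text: str) -> dict:
    """Parse meminfo text into a dict of byte values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) == 2 and fields[1] == "kB":
            # /proc/meminfo reports these values in units of 1024 bytes.
            values[key.strip()] = int(fields[0]) * 1024
    return values


def used_memory(m: dict) -> int:
    """(MemTotal + SwapTotal) - (MemFree + SwapFree) - Buffers
    - (Cached + SReclaimable), in bytes."""
    return (
        (m["MemTotal"] + m["SwapTotal"])
        - (m["MemFree"] + m["SwapFree"])
        - m["Buffers"]
        - (m["Cached"] + m["SReclaimable"])
    )


meminfo = parse_meminfo(SAMPLE_MEMINFO)
print(used_memory(meminfo))
```

With the sample values, the terms are 10000000 - 3000000 - 200000 - 2800000 = 4000000 kB before conversion to bytes.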
The loadavg-params Object
Average load metrics taken from /proc/loadavg.
Metrics avg1, avg5 and avg15 reflect the average load over time periods of 1, 5 and 15 minutes respectively. The load counts processes that are:
- queued for execution
- executing
- sleeping while uninterruptible, typically waiting for I/O
Name | Type | Description |
---|---|---|
avg1 | decimal64 | Average load of processes last minute. |
avg5 | decimal64 | Average load of processes last 5 minutes. |
avg15 | decimal64 | Average load of processes last 15 minutes. |
running | gauge64 | Number of currently runnable kernel scheduling entities (processes, threads). |
total | gauge64 | Number of kernel scheduling entities (processes, threads) that currently exist on the system. |
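For orientation, a line from /proc/loadavg maps onto the loadavg-params fields roughly as follows. This is a minimal sketch with a made-up sample line; the function name is hypothetical, and the fifth field of the file (the most recently created PID) has no counterpart in the table.

```python
from decimal import Decimal


def parse_loadavg(line: str) -> dict:
    """Map one /proc/loadavg line onto the loadavg-params field names."""
    avg1, avg5, avg15, entities, _last_pid = line.split()
    # The fourth field has the form RUNNING/TOTAL.
    running, total = entities.split("/")
    return {
        "avg1": Decimal(avg1),
        "avg5": Decimal(avg5),
        "avg15": Decimal(avg15),
        "running": int(running),
        "total": int(total),
    }


loadavg = parse_loadavg("0.20 0.18 0.12 1/80 11206")
print(loadavg)
```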
The disk-entry Object
Name | Type | Description |
---|---|---|
filesystem | string | The source of the mount point, usually a device. |
type | string | File system type. |
size | uint64 | Total number of 1K-blocks where K is 1024 bytes. |
used | gauge64 | Number of used 1K-blocks where K is 1024 bytes. |
free | gauge64 | Number of available (free) 1K-blocks where K is 1024 bytes. |
percentage-used | percent | A percent value with up to 2 fractional digits. For example 12.77%. Percentage of used divided by size. |
mount | string | The mount point. |
The disk-io-entry Object
Name | Type | Description |
---|---|---|
device-name | string | The device name. |
reads-completed | gauge64 | The total number of reads completed successfully. |
sectors-read | gauge64 | The total number of sectors read successfully. |
time-spent-reading | gauge64 | This is the total number of milliseconds spent by all reads. |
writes-completed | gauge64 | The total number of writes completed successfully. |
sectors-written | gauge64 | The total number of sectors written successfully. |
time-spent-writing | gauge64 | This is the total number of milliseconds spent by all writes. |
ios-in-progress | gauge64 | This is the number of I/Os currently in progress. |
time-spent-on-io | gauge64 | This is the total number of milliseconds spent doing I/Os, i.e. the time when ios-in-progress is non-zero. |
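Since the disk-io-entry values are totals, rates are obtained by comparing two samples for the same device. The following sketch assumes the counters are cumulative and do not reset between the two samples, and uses the 30 second sampling interval stated for this topic; the function name, output keys and sample values are hypothetical.

```python
def io_rates(previous: dict, current: dict, interval_ms: int = 30_000) -> dict:
    """Derive per-second rates and utilization from two samples.

    Assumption: counters are cumulative and did not reset in between.
    """
    def delta(name: str) -> int:
        return current[name] - previous[name]

    seconds = interval_ms / 1000
    return {
        "reads-per-second": delta("reads-completed") / seconds,
        "writes-per-second": delta("writes-completed") / seconds,
        # Share of the interval during which at least one I/O was in progress.
        "utilization-percent": round(
            100.0 * delta("time-spent-on-io") / interval_ms, 2
        ),
    }


# Two made-up samples for the same device, taken 30 seconds apart.
previous = {"reads-completed": 1000, "writes-completed": 500,
            "time-spent-on-io": 40_000}
current = {"reads-completed": 1600, "writes-completed": 800,
           "time-spent-on-io": 47_500}
rates = io_rates(previous, current)
print(rates)
```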
The mpstat-cpu-entry Object
Name | Type | Description |
---|---|---|
cpu | string | CPU number or all. |
usr | decimal64 | Percentage of CPU utilization that occurred while executing at the user level (application). |
nice | decimal64 | Percentage of CPU utilization that occurred while executing at the user level with nice priority. |
sys | decimal64 | Percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this does not include time spent servicing hardware and software interrupts. |
iowait | decimal64 | Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request. |
irq | decimal64 | Percentage of time spent by the CPU or CPUs to service hardware interrupts. |
soft | decimal64 | Percentage of time spent by the CPU or CPUs to service software interrupts. |
steal | decimal64 | Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor. |
guest | decimal64 | Percentage of time spent by the CPU or CPUs to run a virtual processor. |
idle | decimal64 | Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request. |
The thermal-zone-entry Object
Name | Type | Description |
---|---|---|
id | uint32 | Numeric id of the thermal zone based on the number in the path /sys/class/thermal/thermal_zone*. |
type | string | Type of the thermal zone. |
temp | int32 | Temperature of the thermal zone in millidegree Celsius. |
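Because temp is reported in millidegrees Celsius, consumers usually convert it before display. A minimal sketch, with made-up thermal zone entries and hypothetical helper names:

```python
def zone_celsius(entry: dict) -> float:
    """Convert a thermal-zone-entry temperature to degrees Celsius."""
    return entry["temp"] / 1000


def hottest_zone(zones: list) -> dict:
    """Return the entry with the highest temperature."""
    return max(zones, key=lambda zone: zone["temp"])


# Made-up entries following the thermal-zone-entry fields.
zones = [
    {"id": 0, "type": "acpitz", "temp": 27_800},
    {"id": 1, "type": "x86_pkg_temp", "temp": 45_500},
]
hottest = hottest_zone(zones)
print(hottest["type"], zone_celsius(hottest))
```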
Site provider specific supd metrics
This topic contains supd-related metrics.
Topic: system:supd-metrics
Available on every site, but only readable by the sys tenant on the Control Tower.
Object
The system samples CPU and memory metrics for the supd container once every 30 seconds.
Name | Type | Description |
---|---|---|
time | date-and-time | The time the sample was taken. |
site | name | The name of the site where the event was generated. |
cluster-hostname | name | The cluster hostname of the host where the event was generated. |
hostname | domain-name | The hostname of the host where the event was generated. |
cpu | Object see supd-cpu-metric | CPU metric data. |
memory | Object see supd-memory-metric | Memory metric data. |
long-gcs | array of Object see gc-metric | Long GC times. |
large-heaps | array of Object see gc-metric | Large heap sizes when GC. |
long-schedules | array of Object see schedule-metric | Long schedule times. |
long-msg-queues | array of Object see msg-queue-metric | Long msg queues. |
busy-ports | array of Object see busy-port | Busy ports. |
busy-dist-ports | array of Object see busy-port | Busy distribution ports. |
The supd-cpu-metric Object
Name | Type | Description |
---|---|---|
nanoseconds | uint64 | The total CPU usage of all tasks in the SUPD container. The value is measured in nanoseconds. |
cpus | decimal64 | The CPUs limit for the SUPD container. |
percentage-used | percent | A percent value with up to 2 fractional digits. For example 12.77%. Percentage of CPU used by the SUPD container in relation to limits. |
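A comparable CPU percentage can be derived from two consecutive samples of nanoseconds together with the cpus limit. This is only one plausible calculation, not necessarily how the system computes its own percentage-used; it assumes that nanoseconds is cumulative and that the samples are 30 seconds apart. The function name and values are hypothetical.

```python
def cpu_percentage(prev_ns: int, curr_ns: int, cpus: float,
                   interval_s: float = 30.0) -> float:
    """CPU time consumed in the interval, relative to the CPU limit.

    Assumption: the nanoseconds counter is cumulative, so the difference
    between two samples is the CPU time used during the interval.
    """
    if cpus <= 0 or interval_s <= 0:
        raise ValueError("cpus and interval_s must be positive")
    used_ns = curr_ns - prev_ns
    # CPU time that the limit allows during the interval.
    available_ns = interval_s * 1_000_000_000 * cpus
    return round(100.0 * used_ns / available_ns, 2)


# Made-up samples: 3 seconds of CPU time in 30 seconds, with a 0.5 CPU limit.
pct_cpu = cpu_percentage(prev_ns=10_000_000_000, curr_ns=13_000_000_000, cpus=0.5)
print(pct_cpu)
```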
The supd-memory-metric Object
Name | Type | Description |
---|---|---|
used | uint64 | The memory used by the SUPD container in bytes, calculated at the time the sample was taken. |
total | uint64 | The total memory available in bytes for the SUPD container. |
percentage-used | percent | A percent value with up to 2 fractional digits. For example 12.77%. Percentage of memory used by the SUPD container divided by total. |
used-hot | uint64 | The hot memory used by the SUPD container in bytes, calculated at the time the sample was taken. Hot memory is the same as used, but excluding inactive file buffers (inactive_file) and inactive anonymous mappings (inactive_anon). |
percentage-used-hot | percent | A percent value with up to 2 fractional digits. For example 12.77%. Percentage of hot memory used by the SUPD container divided by total. |
The gc-metric Object
Name | Type | Description |
---|---|---|
pid | string | The suspended erlang process identifier. |
registered-name | string | The suspended erlang process's registered name. |
milliseconds | uint64 | The GC time in milliseconds. |
heap-size | uint64 | The size of the used part of the heap. |
heap-block-size | uint64 | The size of the memory block used for storing the heap and the stack. |
old-heap-size | uint64 | The size of the used part of the old heap. |
old-heap-block-size | uint64 | The size of the memory block used for storing the old heap. |
stack-size | uint64 | The size of the stack. |
mbuf-size | uint64 | The combined size of message buffers associated with the process. |
The schedule-metric Object
Name | Type | Description |
---|---|---|
id | string | The erlang process or port identifier. |
registered-name | string | The erlang process's registered name. |
milliseconds | uint64 | The schedule time in milliseconds. |
The busy-port Object
Name | Type | Description |
---|---|---|
pid | string | The suspended erlang process identifier. |
registered-name | string | The suspended erlang process's registered name. |
port | string | The busy port identifier. |
The msg-queue-metric Object
Name | Type | Description |
---|---|---|
pid | string | The erlang process identifier. |
registered-name | string | The erlang process's registered name. |
Site provider specific alerts
An alert item on a site provider specific topic is an alert record. It's a JSON object described in the Alerts section.
Topic: system:site-alerts
Available on every site.
This topic contains important site-related alerts from the system.
Alert: disk-threshold-reached
Alert signaling that a disk reached an alerting threshold.
Alert-specific data:
Name | Type | Description |
---|---|---|
cluster-hostname | name | The cluster hostname of the host where the alert arose. |
hostname | domain-name | The hostname of the host where the alert arose. |
filesystem | string | The source of the mount point, usually a device. |
type | string | File system type. |
size | uint64 | Total number of 1K-blocks where K is 1024 bytes. |
used | gauge64 | Number of used 1K-blocks where K is 1024 bytes. |
free | gauge64 | Number of available (free) 1K-blocks where K is 1024 bytes. |
percentage-used | percent | A percent value with up to 2 fractional digits. For example 12.77%. Percentage of used divided by size. |
mount | string | The mount point. |
Alert: host-certificate-expires
Alert signaling that a host's certificates are about to expire, probably because the host is off-line.
Alert-specific data:
Name | Type | Description |
---|---|---|
cluster-hostname | domain-name | The cluster hostname of the host where the certificate is about to expire. |
hostname | domain-name | The hostname of the host where the certificate is about to expire. |
site | name | The name of the site where the host is located. |
expires | string | Date and time when the certificates expire. |
Alert: host-down
Alert signaling that a host is down.
Alert-specific data:
Name | Type | Description |
---|---|---|
cluster-hostname | name | The cluster hostname of the host where the alert arose. |
hostname | domain-name | The hostname of the host where the alert arose. |
Alert: host-in-disaster-mode
Alert signaling that a host ended up in disaster mode.
Alert-specific data:
Name | Type | Description |
---|---|---|
cluster-hostname | name | The cluster hostname of the host where the alert arose. |
hostname | domain-name | The hostname of the host where the alert arose. |
Alert: host-in-distress
Alert signaling that a host ended up in distress.
Alert-specific data:
Name | Type | Description |
---|---|---|
cluster-hostname | name | The cluster hostname of the host where the alert arose. |
hostname | domain-name | The hostname of the host where the alert arose. |
distress | enumeration | Indicates the level of distress in the host. |
distress-info | string | Detailed information about the distress. |
Alert: no-space-left-on-disk
Alert signaling that a host has no space left on a critical disk.
Alert-specific data:
Name | Type | Description |
---|---|---|
cluster-hostname | name | The cluster hostname of the host where the alert arose. |
hostname | domain-name | The hostname of the host where the alert arose. |
Alert: reauthorize-requested
Alert signaling that a host is attempting to obtain a new set of certificates. It may indicate that the site has been off-line long enough for the old certificates to expire.
Alert-specific data:
Name | Type | Description |
---|---|---|
cluster-hostname | domain-name | The cluster hostname of the host where the certificate is about to expire. |
site | name | The name of the site where the host is located. |
peer-ip | ip-address | Peer IP of recovery request. |
permitted | boolean | Whether the operation was permitted, i.e., whether the reauthorize-host action has been invoked for the host on this site and this is the first attempt to recover certificates after that. Note that the operation is only allowed to be performed once after calling reauthorize-host. |
successful | boolean | Indicates if new certificates were successfully issued. This value can only be true if permitted is also true, but may be false if the request fails for some reason (e.g., root certificates). |
Alert: site-disconnected
This alert is generated at the Control Tower when the connection to an edge site is lost, if when-disconnected is set to treat-as-error.
Alert-specific data:
Name | Type | Description |
---|---|---|
site | name | |
Alert: supd-down
Alert signaling that supd is down.
Alert-specific data:
Name | Type | Description |
---|---|---|
cluster-hostname | name | The cluster hostname of the host where the alert arose. |
hostname | domain-name | The hostname of the host where the alert arose. |