Viewing KUMA metrics

To monitor the performance of its components, the event stream, and the correlation context, KUMA collects and stores a large number of parameters. The VictoriaMetrics time series database is used to collect, store and analyze the parameters. The collected metrics are visualized using Grafana. Dashboards that visualize key performance parameters of various KUMA components can be found in the KUMA → Metrics section.
The KUMA Core service configures VictoriaMetrics and Grafana automatically; no user action is required.

The RPM package of the 'kuma-core' service generates the Grafana configuration and creates a separate dashboard for visualizing the metrics of each service. Graphs in the Metrics section appear with a delay of approximately 1.5 minutes.

For full information about the metrics, refer to the Metrics section of the KUMA web interface. Selecting this section opens the Grafana portal, which is deployed as part of the Core installation and is updated automatically. If the Metrics section shows core: <port number>, KUMA is deployed in a high availability configuration and the metrics were received from the host on which the Core was installed. In other configurations, the name of the host from which KUMA receives metrics is displayed.

Collector metrics

Metric name

Description

IO—metrics related to the service input and output.

Processing EPS

The number of events processed per second.

Output EPS

The number of events per second sent to the destination.

Output Latency

The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.

Output Errors

The number of errors occurring per second while event packets were sent to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.

Output Event Loss

The number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.

Output Disk Buffer Size

The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.

Write Network BPS

The number of bytes written to the network per second.

Connector Errors

The number of errors in the connector logs.

Normalization—metrics related to the normalizers.

Raw & Normalized event size

The size of the raw event and size of the normalized event. The median value is displayed.

Errors

The number of normalization errors per second.

Filtration—metrics related to filters.

EPS

The number of events per second matching the filter conditions and sent for processing. The collector only processes events that match the filtering criteria if the user has added the filter to the configuration of the collector service.

Aggregation—metrics related to the aggregation rules.

EPS

The number of events received and generated by the aggregation rule per second. This metric helps determine the effectiveness of aggregation rules.

Buckets

The number of buckets in the aggregation rule.

Enrichment—metrics related to enrichment rules.

Cache RPS

The number of requests per second to the local cache.

Source RPS

The number of requests per second to an enrichment source, such as a dictionary.

Source Latency

Time in milliseconds passed while sending a request to the enrichment source and receiving a response from it. The median value is displayed.

Queue

The size of the enrichment request queue. This metric helps to find bottleneck enrichment rules.

Errors

The number of errors per second while sending requests to the enrichment source.

Correlator metrics

Metric name

Description

IO—metrics related to the service input and output.

Processing EPS

The number of events processed per second.

Output EPS

The number of events per second sent to the destination.

Output Latency

The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.

Output Errors

The number of errors occurring per second while event packets were sent to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.

Output Event Loss

The number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.

Output Disk Buffer Size

The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.

Correlation—metrics related to correlation rules.

EPS

The number of correlation events per second generated by the correlation rule.

Buckets

The number of buckets in a correlation rule of the standard type.

Rate Limiter Hits

The number of times the correlation rule exceeded the rate limit per second.

Active Lists OPS

The number of operation requests per second sent to the active list, and the operations themselves.

Active Lists Records

The number of records in the active list.

Active Lists On-Disk Size

The size of the active list on the disk, in bytes.

Enrichment—metrics related to enrichment rules.

Cache RPS

The number of requests per second to the local cache.

Source RPS

The number of requests per second to an enrichment source, such as a dictionary.

Source Latency

Time in milliseconds passed while sending a request to the enrichment source and receiving a response from it. The median value is displayed.

Queue

The size of the enrichment request queue. This metric helps to find bottleneck enrichment rules.

Errors

The number of errors per second while sending requests to the enrichment source.

Response—metrics associated with response rules.

RPS

The number of times a response rule was activated per second.

Storage metrics

Metric name

Description

ClickHouse / General—metrics related to the general settings of the ClickHouse cluster.

Active Queries

The number of active queries sent to the ClickHouse cluster. This metric is displayed for each ClickHouse instance.

QPS

The number of queries per second sent to the ClickHouse cluster.

Failed QPS

The number of failed queries per second sent to the ClickHouse cluster.

Allocated memory

The amount of RAM, in gigabytes, allocated to the ClickHouse process.

ClickHouse / Insert—metrics related to inserting events into a ClickHouse instance.

Insert EPS

The number of events per second inserted into the ClickHouse instance.

Insert QPS

The number of ClickHouse instance insert queries per second sent to the ClickHouse cluster.

Failed Insert QPS

The number of failed ClickHouse instance insert queries per second sent to the ClickHouse cluster.

Delayed Insert QPS

The number of delayed ClickHouse instance insert queries per second sent to the ClickHouse cluster. Queries were delayed by the ClickHouse node due to exceeding the soft limit on active merges.

Rejected Insert QPS

The number of rejected ClickHouse instance insert queries per second sent to the ClickHouse cluster. Queries were rejected by the ClickHouse node due to exceeding the hard limit on active merges.

Active Merges

The number of active merges.

Distribution Queue

The number of temporary files with events that could not be inserted into the ClickHouse instance because it was unavailable. These events cannot be found using search.

ClickHouse / Select—metrics related to event selections in the ClickHouse instance.

Select QPS

The number of ClickHouse instance event select queries per second sent to the ClickHouse cluster.

Failed Select QPS

The number of failed ClickHouse instance event select queries per second sent to the ClickHouse cluster.

ClickHouse / Replication—metrics related to replicas of ClickHouse nodes.

Active Zookeeper Connections

The number of active connections to the Zookeeper cluster nodes. In normal operation, this number should be equal to the number of nodes in the Zookeeper cluster.

Read-only Replicas

The number of read-only replicas of ClickHouse nodes. In normal operation, there should be no read-only replicas.

Active Replication Fetches

The number of active processes downloading data from the ClickHouse node during data replication.

Active Replication Sends

The number of active processes sending data to the ClickHouse node during data replication.

Active Replication Consistency Checks

The number of active data consistency checks on replicas of ClickHouse nodes during data replication.

ClickHouse / Networking—metrics related to the network of the ClickHouse cluster.

Active HTTP Connections

The number of active connections to the HTTP server of the ClickHouse cluster.

Active TCP Connections

The number of active connections to the TCP server of the ClickHouse cluster.

Active Interserver Connections

The number of active service connections between ClickHouse nodes.
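The two "normal operation" conditions stated for the ClickHouse / Replication metrics can be written down as a simple check. The following is only an illustrative sketch: check_replication is a hypothetical helper, not a KUMA command, and it assumes the three values are read manually from the dashboard and passed as arguments.

```shell
# Sketch only: check_replication is a hypothetical helper, not part of KUMA.
# $1: Active Zookeeper Connections (value from the dashboard)
# $2: number of nodes in the Zookeeper cluster
# $3: Read-only Replicas (value from the dashboard)
check_replication() {
    rc=0
    # Active connections should equal the number of Zookeeper nodes.
    if [ "$1" -ne "$2" ]; then
        echo "WARN: $1 active Zookeeper connections, expected $2"
        rc=1
    fi
    # No read-only replicas should exist in normal operation.
    if [ "$3" -ne 0 ]; then
        echo "WARN: $3 read-only replicas, expected 0"
        rc=1
    fi
    [ "$rc" -eq 0 ] && echo "OK: replication metrics look normal"
    return "$rc"
}
```

For example, check_replication 3 3 0 reports that the metrics look normal for a three-node Zookeeper cluster, while check_replication 2 3 0 prints a warning and returns a non-zero status.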

Core metrics

Metric name

Description

Raft—metrics related to reading and updating the state of the Core.

Lookup RPS

The number of lookup procedure requests per second sent to the Core, and the procedures themselves.

Lookup Latency

Time in milliseconds spent running the lookup procedures, and the procedures themselves. The time is displayed for the 99th percentile of lookup procedures. One percent of lookup procedures may take longer to run.

Propose RPS

The number of propose procedure requests per second sent to the Core, and the procedures themselves.

Propose Latency

Time in milliseconds spent running the propose procedures, and the procedures themselves. The time is displayed for the 99th percentile of propose procedures. One percent of propose procedures may take longer to run.

API—metrics related to API requests.

RPS

The number of API requests made to the Core per second.

Latency

The time in milliseconds spent processing a single API request to the Core. The median value is displayed.

Errors

The number of errors per second while sending API requests to the Core.

Notification Feed—metrics related to user activity.

Subscriptions

The number of clients connected to the Core via SSE to receive server messages in real time. This number is normally equal to the number of clients that are using the KUMA web interface.

Errors

The number of errors per second while sending notifications to users.

Schedulers—metrics related to Core tasks.

Active

The number of active recurring system tasks. Tasks created by the user are not counted.

Latency

The time in milliseconds spent running the task. The median value is displayed.

Errors

The number of errors that occurred per second while performing tasks.
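Several latency metrics in the tables above show the median, while the Raft latency metrics show the 99th percentile. The difference can be illustrated with a small sketch. The percentile function below is a hypothetical helper that uses the nearest-rank method on raw samples; it is an assumption for illustration only, and the dashboards may compute these values differently.

```shell
# Sketch only: percentile is a hypothetical helper, not part of KUMA.
# Reads one numeric sample per line on stdin and prints the nearest-rank
# P-th percentile, where P is an integer from 1 to 100 (P=50 is the median).
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
            if (rank < 1) rank = 1
            print v[rank]
        }'
}

# 100 made-up latency samples in milliseconds: 98 fast requests and two slow ones.
samples=$(awk 'BEGIN { for (i = 0; i < 98; i++) print 2; print 40; print 900 }')

printf '%s\n' "$samples" | percentile 50   # prints 2
printf '%s\n' "$samples" | percentile 99   # prints 40
```

In this made-up data set the median hides both slow requests, the 99th percentile exposes the 40 ms one, and the single 900 ms request falls into the one percent that may take longer.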

KUMA agent metrics

Metric name

Description

IO—metrics related to the service input and output.

Processing EPS

The number of events processed per second.

Output EPS

The number of events per second sent to the destination.

Output Latency

The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.

Output Errors

The number of errors occurring per second while event packets were sent to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.

Output Event Loss

The number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.

Output Disk Buffer Size

The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.

Write Network BPS

The number of bytes written to the network per second.

Event router metrics

Metric name

Description

IO—metrics related to the service input and output.

Processing EPS

The number of events processed per second.

Output EPS

The number of events per second sent to the destination.

Output Latency

The time in milliseconds that passed while sending an event packet and receiving a response from the destination. The median value is displayed.

Output Errors

The number of errors occurring per second while event packets were sent to the destination. Network errors and errors writing to the disk buffer of the destination are displayed separately.

Output Event Loss

The number of events lost per second. Events can be lost due to network errors or errors writing to the disk buffer of the destination. Events are also lost if the destination responds with an error code, for example, in case of an invalid request.

Output Disk Buffer Size

The size of the disk buffer of the collector associated with the destination, in bytes. If a zero value is displayed, no event packets have been placed in the collector's disk buffer and the service is operating correctly.

Write Network BPS

The number of bytes written to the network per second.

Connector Errors

The number of errors in the connector log.

General metrics common for all services

Metric name

Description

Process—General process metrics.

Memory

RAM usage (RSS) in megabytes.

Disk BPS

The number of bytes read from or written to the disk per second.

Network BPS

The number of bytes received/transmitted over the network per second.

Network Packet Loss

The number of network packets lost per second.

GC Latency

The time, in milliseconds, spent executing a Go garbage collection cycle. The median value is displayed.

Goroutines

The number of active goroutines. This number is different from the operating system's thread count.

OS—metrics related to the operating system.

Load

Average load.

CPU

CPU load as a percentage.

Memory

RAM usage (RSS) as a percentage.

Disk

Disk space usage as a percentage.

Metrics storage period

KUMA operation data is saved for 3 months by default. This storage period can be changed.

To change the storage period for KUMA metrics:

  1. Log in to the OS of the server where the KUMA Core is installed.
  2. In the file /etc/systemd/system/multi-user.target.wants/kuma-victoria-metrics.service, in the ExecStart parameter, edit the --retentionPeriod=<metrics storage period, in months> flag by inserting the necessary period. For example, --retentionPeriod=4 means that the metrics will be stored for 4 months.
  3. Restart the kuma-victoria-metrics service by running the following commands in sequence:
    1. systemctl daemon-reload
    2. systemctl restart kuma-victoria-metrics

The storage period for metrics has been changed.
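Steps 2 and 3 of the procedure above can also be scripted. The following is a minimal sketch, not an official KUMA tool: the function names get_retention_period and set_retention_period are hypothetical, and the sketch assumes that the ExecStart parameter already contains a --retentionPeriod flag whose value has no spaces. Back up the unit file before changing it.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of KUMA.
UNIT=/etc/systemd/system/multi-user.target.wants/kuma-victoria-metrics.service

# Print the current value of the --retentionPeriod flag in a unit file.
# $1: path to the unit file
get_retention_period() {
    grep -o -- '--retentionPeriod=[^ ]*' "$1" | head -n 1 | cut -d= -f2
}

# Replace the value of the existing --retentionPeriod flag.
# $1: path to the unit file, $2: new storage period in months
set_retention_period() {
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temporary file, then overwrite the original
    # contents so that the file's owner, mode, and any symlink are kept.
    sed "s/--retentionPeriod=[^ ]*/--retentionPeriod=$2/" "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example, run as root on the server where the KUMA Core is installed:
#   get_retention_period "$UNIT"
#   set_retention_period "$UNIT" 4
#   systemctl daemon-reload
#   systemctl restart kuma-victoria-metrics
```

The edit is written through a temporary file and copied back with cat rather than moved into place, so that a unit file reached through a symbolic link is updated at its target instead of being replaced.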
