Comprehensive information about the performance of the KUMA Core, storage, collectors, and correlators is available in the Metrics section of the KUMA web interface. Selecting this section opens the Grafana portal that is deployed as part of the KUMA Core installation and is updated automatically. If the Metrics section shows core: <port number>, KUMA is deployed in a high availability configuration, and the metrics were received from the host on which the Core is installed. In other configurations, the name of the host from which KUMA receives metrics is displayed.
To determine on which host the Core is running, run the following command in the terminal of one of the controllers:
k0s kubectl get pod -n kuma -o wide
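If you only need the node name, you can narrow the output with a jsonpath query. This is a sketch: the app=core label selector is an assumption and may differ in your deployment.

k0s kubectl get pod -n kuma -l app=core -o jsonpath='{.items[0].spec.nodeName}'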
The default Grafana user name and password are admin and admin.
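It is good practice to change the default password after the first login. As a minimal sketch, assuming the grafana-cli utility is available on the host where Grafana is deployed, the admin password can be reset from the command line:

grafana-cli admin reset-admin-password <new password>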
Available metrics
Collector metrics
IO—metrics related to the service input and output.
Processing EPS—the number of processed events per second.
Processing Latency—the time required to process a single event (the median is displayed).
Output EPS—the number of events sent to the destination per second.
Output Latency—the time required to send a batch of events to the destination and receive a response from it (the median is displayed).
Output Errors—the number of errors per second that occur when sending event batches to the destination. Network errors and errors writing to the disk buffer are displayed separately.
Output Event Loss—the number of lost events per second. Events can be lost due to network errors or errors writing to the disk buffer. Events are also lost if the destination responds with an error code (for example, if the request was invalid).
Normalization—metrics related to the normalizers.
Raw & Normalized event size—the size of the raw event and size of the normalized event (the median is displayed).
Errors—the number of normalization errors per second.
Filtration—metrics related to the filters.
EPS—the number of events per second that match the filter conditions and are sent for processing. If a filter is added to the collector service configuration, the collector processes only the events that match the filter conditions.
Aggregation—metrics related to the aggregation rules.
EPS—the number of events received and created by the aggregation rule per second. This metric helps determine the effectiveness of aggregation rules.
Buckets—the number of buckets in the aggregation rule.
Enrichment—metrics related to the enrichment rules.
Cache RPS—the number of requests to the local cache per second.
Source RPS—the number of requests per second to the enrichment source (for example, the Dictionary resource).
Source Latency—the time required to send a request to the enrichment source and receive a response from it (the median is displayed).
Queue—the size of the enrichment request queue. This metric helps identify enrichment rules that create bottlenecks.
Errors—the number of enrichment source request errors per second.
Correlator metrics
IO—metrics related to the service input and output.
Processing EPS—the number of processed events per second.
Processing Latency—the time required to process a single event (the median is displayed).
Output EPS—the number of events sent to the destination per second.
Output Latency—the time required to send a batch of events to the destination and receive a response from it (the median is displayed).
Output Errors—the number of errors per second that occur when sending event batches to the destination. Network errors and errors writing to the disk buffer are displayed separately.
Output Event Loss—the number of lost events per second. Events can be lost due to network errors or errors writing to the disk buffer. Events are also lost if the destination responds with an error code (for example, if the request was invalid).
Correlation—metrics related to the correlation rules.
EPS—the number of correlation events created per second.
Buckets—the number of buckets in the correlation rule (only for the standard kind of correlation rules).
Active Lists—metrics related to the active lists.
RPS—the number of requests to the Active list per second, broken down by request type.
Records—the number of entries in the Active list.
WAL Size—the size of the write-ahead log (WAL). This metric helps determine the size of the Active list.
Storage metrics
IO—metrics related to the service input and output.
RPS—the number of requests to the Storage service per second.
Latency—the time of proxying a single request to the ClickHouse node (the median is displayed).
Core service metrics
IO—metrics related to the service input and output.
RPS—the number of requests to the Core service per second.
Latency—the time of processing a single request (the median is displayed).
Errors—the number of request errors per second.
Notification Feed—metrics related to user activity.
Subscriptions—the number of clients connected to the Core via SSE to receive server messages in real time. This number usually correlates with the number of clients using the KUMA web interface.
Errors—the number of message sending errors per second.
Schedulers—metrics related to Core tasks.
Active—the number of recurring active system tasks. Tasks created by users are not counted.
Latency—the time of processing a single request (the median is displayed).
Position—the position (timestamp) of the alert creation task. The next ClickHouse scan for correlation events will start from this position.
Errors—the number of task errors per second.
General metrics common to all services
Process—general process metrics.
CPU—CPU usage.
Memory—RAM usage (RSS).
Disk IOPS—the number of disk read/write operations per second.
Disk BPS—the number of bytes read from/written to the disk per second.
Network BPS—the number of bytes received/sent per second.
Network Packet Loss—the number of network packets lost per second.
GC Latency—the duration of a Go garbage collector cycle (the median is displayed).
Goroutines—the number of active goroutines. This number differs from the thread count.
OS—metrics related to the operating system.
Load—the average load.
CPU—CPU usage.
Memory—RAM usage (RSS).
Disk—disk space usage.
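All of the metrics listed above are stored in the VictoriaMetrics instance that backs the Grafana portal (see the next subsection) and can be queried over its Prometheus-compatible HTTP API. The following sketch assumes VictoriaMetrics listens on its default port 8428; process_cpu_seconds_total is a standard Go process metric, but the exact metric names exposed by KUMA services may differ in your deployment:

curl -s 'http://localhost:8428/api/v1/query' --data-urlencode 'query=rate(process_cpu_seconds_total[5m])'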
Metrics storage period
By default, KUMA metrics are stored for 3 months. This storage period can be changed.
To change the storage period for KUMA metrics:
Log in to the OS of the server where the KUMA Core is installed.
In the file /etc/systemd/system/multi-user.target.wants/kuma-victoria-metrics.service, in the ExecStart parameter, edit the --retentionPeriod=<metrics storage period, in months> flag by inserting the necessary period. For example, --retentionPeriod=4 means that the metrics will be stored for 4 months.
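For example, the edited line might look as follows; the binary path and the additional flag here are illustrative, and any other flags already present in your unit file should be left unchanged:

ExecStart=/opt/kaspersky/kuma/victoria-metrics/victoria-metrics --retentionPeriod=4 --storageDataPath=/opt/kaspersky/kuma/victoria-metrics/data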
Restart KUMA by running the following commands in sequence:
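systemctl daemon-reload
systemctl restart kuma-victoria-metrics.service

The first command makes systemd re-read the edited unit file; the second restarts the metrics storage service named in the path above. This minimal sequence assumes that only the VictoriaMetrics unit was changed; if your deployment has a broader KUMA restart procedure, follow it instead.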