Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nomad?at=release/7.2
HashiCorp Nomad by HTTP
Overview
This template is designed to monitor HashiCorp Nomad by Zabbix. It works without any external scripts. Currently the template supports Nomad servers and clients discovery.
Requirements
Zabbix version: 7.2 and higher.
Tested versions
This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
1. Create a synthetic Nomad host. It should be one of the Nomad cluster members, a load-balancing service (if a cluster is used), or a single node in the selected Nomad region.
2. Define the {$NOMAD.ENDPOINT.API.URL} macro value with the correct web protocol, host and port.
3. Prepare an ACL token with node:read, namespace:read-job, agent:read and management permissions applied. Define the {$NOMAD.TOKEN} macro value.
Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you have the HashiCorp Vault integration configured.
Additional information:
- The synthetic Nomad host is used only as an endpoint for servers and clients discovery (general cluster information). It is not monitored as a Nomad server or client, which prevents duplicate entities.
- If you're not using ACL, skip the 3rd setup step.
- The Nomad servers/clients discovery is limited by region. If you're using a multi-region cluster, create one synthetic host per region.
- The Nomad server/client templates are ready for separate usage. Feel free to use them if you prefer manual host creation.
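The setup above amounts to sending HTTP requests to the Nomad API with the ACL token attached. The following is a minimal Python sketch of such a request, assuming the token is passed in the X-Nomad-Token header and the client list is served under /v1/nodes. The helper name build_nomad_request is hypothetical and is not part of the template.

```python
from urllib.parse import urljoin


def build_nomad_request(endpoint_url, token="", path="/v1/nodes"):
    """Return (url, headers) for a Nomad API call.

    endpoint_url mirrors {$NOMAD.ENDPOINT.API.URL}; token mirrors {$NOMAD.TOKEN}.
    An empty token means ACLs are not in use, so no auth header is added.
    """
    # Normalise slashes so the base URL and the path join cleanly.
    url = urljoin(endpoint_url.rstrip("/") + "/", path.lstrip("/"))
    headers = {"Accept": "application/json"}
    if token:
        # Assumption: the ACL token travels in the X-Nomad-Token header.
        headers["X-Nomad-Token"] = token
    return url, headers


url, headers = build_nomad_request("http://localhost:4646", token="example-token")
print(url)
print(headers)
```

The returned pair can be handed to any HTTP client. Calling the helper without a token corresponds to the case where the ACL step is skipped.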
Useful links
- HashiCorp Nomad multi-region federation
- HashiCorp Nomad agent API reference
- HashiCorp Nomad raft operator API reference
- HashiCorp Nomad nodes API reference
Macros used
Name | Description | Default |
---|---|---|
{$NOMAD.ENDPOINT.API.URL} | API endpoint URL for one of the Nomad cluster members. | http://localhost:4646 |
{$NOMAD.TOKEN} | Nomad authentication token. | <PUT YOUR AUTH TOKEN> |
{$NOMAD.DATA.TIMEOUT} | Response timeout for an API. | 15s |
{$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for script and HTTP agent items. If this parameter is empty, then no proxy is used. | |
{$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. | 200 |
{$NOMAD.SERVER.NAME.MATCHES} | The filter to include HashiCorp Nomad servers by name. | .* |
{$NOMAD.SERVER.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad servers by name. | CHANGE_IF_NEEDED |
{$NOMAD.SERVER.DC.MATCHES} | The filter to include HashiCorp Nomad servers by datacenter belonging. | .* |
{$NOMAD.SERVER.DC.NOT_MATCHES} | The filter to exclude HashiCorp Nomad servers by datacenter belonging. | CHANGE_IF_NEEDED |
{$NOMAD.CLIENT.NAME.MATCHES} | The filter to include HashiCorp Nomad clients by name. | .* |
{$NOMAD.CLIENT.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad clients by name. | CHANGE_IF_NEEDED |
{$NOMAD.CLIENT.DC.MATCHES} | The filter to include HashiCorp Nomad clients by datacenter belonging. | .* |
{$NOMAD.CLIENT.DC.NOT_MATCHES} | The filter to exclude HashiCorp Nomad clients by datacenter belonging. | CHANGE_IF_NEEDED |
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.MATCHES} | The filter to include HashiCorp Nomad clients by scheduling eligibility. | .* |
{$NOMAD.CLIENT.SCHEDULE.ELIGIBILITY.NOT_MATCHES} | The filter to exclude HashiCorp Nomad clients by scheduling eligibility. | CHANGE_IF_NEEDED |
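Each MATCHES / NOT_MATCHES macro pair above acts as an include/exclude filter for discovery. The sketch below illustrates that logic in Python, assuming an entity is kept when its value matches the include expression and does not match the exclude expression. Python's re module only approximates the regular expression engine used by Zabbix, and the function name passes_filter and the sample server names are hypothetical.

```python
import re


def passes_filter(value, matches=".*", not_matches="CHANGE_IF_NEEDED"):
    """Keep a value only if it matches `matches` and does not match `not_matches`."""
    return bool(re.search(matches, value)) and not re.search(not_matches, value)


# Hypothetical server names; "^test-" plays the role of a customised
# {$NOMAD.SERVER.NAME.NOT_MATCHES} value.
servers = ["nomad-srv-1", "nomad-srv-2", "test-srv-1"]
kept = [s for s in servers if passes_filter(s, not_matches="^test-")]
print(kept)
```

With the defaults (.* and CHANGE_IF_NEEDED) every entity passes, unless its name happens to contain the literal placeholder text.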
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Nomad clients get | Nomad clients data in raw format. | HTTP agent | nomad.client.nodes.get |
Client nodes API response | Client nodes API response message. | Dependent item | nomad.client.nodes.api.response |
Nomad servers get | Nomad servers data in raw format. | Script | nomad.server.nodes.get |
Server-related APIs response | Server-related ( | Dependent item | nomad.server.api.response |
Region | Current cluster region. | Dependent item | nomad.region |
Nomad servers count | Nomad servers count. | Dependent item | nomad.servers.count |
Nomad clients count | Nomad clients count. | Dependent item | nomad.clients.count |
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad: Client nodes API connection has failed | Client nodes API connection has failed. | find(/HashiCorp Nomad by HTTP/nomad.client.nodes.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes |
HashiCorp Nomad: Server-related API connection has failed | Server-related API connection has failed. | find(/HashiCorp Nomad by HTTP/nomad.server.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes |
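The availability triggers above fire when the most recent API response value does not contain the success code. A rough Python sketch of that check follows, assuming that find() with the "like" operator is a substring match on the latest value. The function name and the sample response strings are hypothetical.

```python
def api_connection_failed(last_response, success_code="200"):
    """Mimic find(...,,"like",code)=0 evaluated on the most recent value."""
    # find() yields 1 when the pattern is found and 0 otherwise.
    found = 1 if success_code in last_response else 0
    # The trigger is in a problem state when nothing was found.
    return found == 0


print(api_connection_failed("HTTP/1.1 200 OK"))
print(api_connection_failed("HTTP/1.1 403 Forbidden"))
```

Because the match is a substring test, changing {$NOMAD.API.RESPONSE.SUCCESS} changes what counts as a healthy response.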
LLD rule Clients discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Clients discovery | Client nodes discovery. | Dependent item | nomad.clients.discovery |
LLD rule Servers discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Servers discovery | Server nodes discovery. | Dependent item | nomad.servers.discovery |
HashiCorp Nomad Client by HTTP
Overview
This template is designed to monitor HashiCorp Nomad clients by Zabbix. It works without any external scripts.
Requirements
Zabbix version: 7.2 and higher.
Tested versions
This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
1. Enable telemetry in the HashiCorp Nomad agent configuration file. Set the Prometheus metrics format. Refer to the vendor documentation.
2. Prepare an ACL token with node:read and namespace:read-job permissions applied. Define the {$NOMAD.TOKEN} macro value. Refer to the vendor documentation about Nomad native ACL or Nomad Vault-generated tokens if you're using integration with HashiCorp Vault.
3. Set the values for the {$NOMAD.CLIENT.API.SCHEME} and {$NOMAD.CLIENT.API.PORT} macros to define the common Nomad API web schema and connection port.
Additional information:
- You have to prepare an additional ACL token only if you wish to monitor Nomad clients as separate entities. If you're using clients discovery, the token will be inherited from the master host linked to the HashiCorp Nomad by HTTP template.
- If you're not using ACL, skip the 2nd setup step.
- The Nomad clients use the default web schema (HTTP) and the default API port (4646). If you're using clients discovery and need to redefine the macros for a particular host created from a prototype, use context macros such as {$NOMAD.CLIENT.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.CLIENT.API.PORT:NECESSARY.IP} at the master host or template level.
- Some metrics may not be collected depending on your HashiCorp Nomad agent version and configuration.
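The context macros mentioned above override the plain macro value for one specific host. The sketch below shows the lookup order in Python, assuming a simple exact-match context with fallback to the macro without context. Regular expression contexts are ignored here, and the function name resolve_macro, the IP address and the overridden port are hypothetical.

```python
def resolve_macro(macros, name, context=None):
    """Resolve a user macro, preferring the context-specific value."""
    if context is not None:
        key = "{$%s:%s}" % (name, context)
        if key in macros:
            return macros[key]
    # Fall back to the macro defined without context.
    return macros["{$%s}" % name]


macros = {
    "{$NOMAD.CLIENT.API.SCHEME}": "http",
    "{$NOMAD.CLIENT.API.PORT}": "4646",
    "{$NOMAD.CLIENT.API.PORT:10.0.0.5}": "4747",  # hypothetical per-host override
}

print(resolve_macro(macros, "NOMAD.CLIENT.API.PORT", "10.0.0.5"))
print(resolve_macro(macros, "NOMAD.CLIENT.API.PORT", "10.0.0.6"))
```

A host without a matching context keeps using the template-wide default, so only the exceptions need to be defined.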
Useful links:
- HashiCorp Nomad metrics list
- HashiCorp Nomad telemetry configuration reference
- HashiCorp Nomad metrics API reference
- HashiCorp Nomad nodes API reference
- HashiCorp Nomad allocations API reference
- Zabbix user macros with context
Macros used
Name | Description | Default |
---|---|---|
{$NOMAD.CLIENT.API.SCHEME} | Nomad client API scheme. | http |
{$NOMAD.CLIENT.API.PORT} | Nomad client API port. | 4646 |
{$NOMAD.TOKEN} | Nomad authentication token. | <PUT YOUR AUTH TOKEN> |
{$NOMAD.DATA.TIMEOUT} | Response timeout for an API. | 15s |
{$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used. | |
{$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. | 200 |
{$NOMAD.CLIENT.RPC.PORT} | Nomad RPC service port. | 4647 |
{$NOMAD.CLIENT.SERF.PORT} | Nomad serf service port. | 4648 |
{$NOMAD.CLIENT.OPEN.FDS.MAX.WARN} | Maximum percentage of used file descriptors. | 90 |
{$NOMAD.DISK.NAME.MATCHES} | The filter to include HashiCorp Nomad client disks by name. | .* |
{$NOMAD.DISK.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client disks by name. | CHANGE_IF_NEEDED |
{$NOMAD.JOB.NAME.MATCHES} | The filter to include HashiCorp Nomad client jobs by name. | .* |
{$NOMAD.JOB.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by name. | CHANGE_IF_NEEDED |
{$NOMAD.JOB.NAMESPACE.MATCHES} | The filter to include HashiCorp Nomad client jobs by namespace. | .* |
{$NOMAD.JOB.NAMESPACE.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by namespace. | CHANGE_IF_NEEDED |
{$NOMAD.JOB.TYPE.MATCHES} | The filter to include HashiCorp Nomad client jobs by type. | .* |
{$NOMAD.JOB.TYPE.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by type. | CHANGE_IF_NEEDED |
{$NOMAD.JOB.TASK.GROUP.MATCHES} | The filter to include HashiCorp Nomad client jobs by task group belonging. | .* |
{$NOMAD.JOB.TASK.GROUP.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client jobs by task group belonging. | CHANGE_IF_NEEDED |
{$NOMAD.DRIVER.NAME.MATCHES} | The filter to include HashiCorp Nomad client drivers by name. | .* |
{$NOMAD.DRIVER.NAME.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client drivers by name. | CHANGE_IF_NEEDED |
{$NOMAD.DRIVER.DETECT.MATCHES} | The filter to include HashiCorp Nomad client drivers by detection state. Possible filtering values: | .* |
{$NOMAD.DRIVER.DETECT.NOT_MATCHES} | The filter to exclude HashiCorp Nomad client drivers by detection state. Possible filtering values: | CHANGE_IF_NEEDED |
{$NOMAD.CPU.UTIL.MIN} | CPU utilization threshold. Measured as a percentage. | 90 |
{$NOMAD.RAM.AVAIL.MIN} | Available memory threshold. Measured as a percentage. | 5 |
{$NOMAD.INODES.FREE.MIN.WARN} | Warning threshold of the filesystem metadata utilization. Measured as a percentage. | 20 |
{$NOMAD.INODES.FREE.MIN.CRIT} | Critical threshold of the filesystem metadata utilization. Measured as a percentage. | 10 |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Telemetry get | Telemetry data in raw format. | HTTP agent | nomad.client.data.get |
Metrics | Nomad client metrics in raw format. | Dependent item | nomad.client.metrics.get |
Monitoring API response | Monitoring API response message. | Dependent item | nomad.client.data.api.response |
Service [rpc] state | Current [rpc] service state. | Simple check | net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}] |
Service [serf] state | Current [serf] service state. | Simple check | net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}] |
CPU allocated | Total amount of CPU shares the scheduler has allocated to tasks. | Dependent item | nomad.client.allocated.cpu |
CPU unallocated | Total amount of CPU shares free for the scheduler to allocate to tasks. | Dependent item | nomad.client.unallocated.cpu |
Memory allocated | Total amount of memory the scheduler has allocated to tasks. | Dependent item | nomad.client.allocated.memory |
Memory unallocated | Total amount of memory free for the scheduler to allocate to tasks. | Dependent item | nomad.client.unallocated.memory |
Disk allocated | Total amount of disk space the scheduler has allocated to tasks. | Dependent item | nomad.client.allocated.disk |
Disk unallocated | Total amount of disk space free for the scheduler to allocate to tasks. | Dependent item | nomad.client.unallocated.disk |
Allocations blocked | Number of allocations waiting for previous versions. | Dependent item | nomad.client.allocations.blocked |
Allocations migrating | Number of allocations migrating data from previous versions. | Dependent item | nomad.client.allocations.migrating |
Allocations pending | Number of allocations pending (received by the client but not yet running). | Dependent item | nomad.client.allocations.pending |
Allocations starting | Number of allocations starting. | Dependent item | nomad.client.allocations.start |
Allocations running | Number of allocations running. | Dependent item | nomad.client.allocations.running |
Allocations terminal | Number of allocations terminal. | Dependent item | nomad.client.allocations.terminal |
Allocations failed, rate | Number of allocations failed. | Dependent item | nomad.client.allocations.failed |
Allocations completed, rate | Number of allocations completed. | Dependent item | nomad.client.allocations.complete |
Allocations restarted, rate | Number of allocations restarted. | Dependent item | nomad.client.allocations.restart |
Allocations OOM killed | Number of allocations OOM killed. | Dependent item | nomad.client.allocations.oom_killed |
CPU idle utilization | CPU utilization in idle state. | Dependent item | nomad.client.cpu.idle |
CPU system utilization | CPU utilization in system space. | Dependent item | nomad.client.cpu.system |
CPU total utilization | Total CPU utilization. | Dependent item | nomad.client.cpu.total |
CPU user utilization | CPU utilization in user space. | Dependent item | nomad.client.cpu.user |
Memory available | Total amount of memory available to processes which includes free and cached memory. | Dependent item | nomad.client.memory.available |
Memory free | Amount of memory which is free. | Dependent item | nomad.client.memory.free |
Memory size | Total amount of physical memory on the node. | Dependent item | nomad.client.memory.total |
Memory used | Amount of memory used by processes. | Dependent item | nomad.client.memory.used |
Uptime | Uptime of the host running the Nomad client. | Dependent item | nomad.client.uptime |
Node info get | Node info data in raw format. | HTTP agent | nomad.client.node.info.get |
Nomad client version | Nomad client version. | Dependent item | nomad.client.version |
Nodes API response | Nodes API response message. | Dependent item | nomad.client.node.info.api.response |
Allocated jobs get | Allocated jobs data in raw format. | HTTP agent | nomad.client.job.allocs.get |
Allocations API response | Allocations API response message. | Dependent item | nomad.client.job.allocs.api.response |
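Most of the items above are dependent items that pull a single value out of the raw telemetry collected by the Telemetry get item. The Python sketch below illustrates the idea on a Prometheus-format payload. The sample payload, the metric names in it and the function name metric_value are assumptions made for illustration, and the parser is deliberately simplified (for example, it does not handle label values that contain a closing brace).

```python
import re

# Hypothetical sample of Prometheus text exposition output.
SAMPLE = """\
# HELP nomad_client_allocated_cpu Sample help text
# TYPE nomad_client_allocated_cpu gauge
nomad_client_allocated_cpu{node_id="abc"} 500
nomad_client_unallocated_cpu{node_id="abc"} 3500
"""


def metric_value(text, name):
    """Return the first sample value of metric `name` as a float, or None."""
    # Metric name, optional {labels}, whitespace, then the value.
    pattern = re.compile("^" + re.escape(name) + r"(?:\{[^}]*\})?\s+(\S+)")
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = pattern.match(line)
        if match:
            return float(match.group(1))
    return None


print(metric_value(SAMPLE, "nomad_client_allocated_cpu"))
print(metric_value(SAMPLE, "nomad_client_unallocated_cpu"))
```

Collecting the payload once and deriving many values from it keeps the number of API requests per host low.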
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Client: Monitoring API connection has failed | Monitoring API connection has failed. | find(/HashiCorp Nomad Client by HTTP/nomad.client.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes |
HashiCorp Nomad Client: Service [rpc] is down | Cannot establish the connection to [rpc] service port {$NOMAD.CLIENT.RPC.PORT}. | last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.RPC.PORT}]) = 0 | Average | Manual close: Yes |
HashiCorp Nomad Client: Service [serf] is down | Cannot establish the connection to [serf] service port {$NOMAD.CLIENT.SERF.PORT}. | last(/HashiCorp Nomad Client by HTTP/net.tcp.service[tcp,,{$NOMAD.CLIENT.SERF.PORT}]) = 0 | Average | Manual close: Yes |
HashiCorp Nomad Client: OOM killed allocations found | OOM killed allocations found. | last(/HashiCorp Nomad Client by HTTP/nomad.client.allocations.oom_killed) > 0 | Warning | Manual close: Yes |
HashiCorp Nomad Client: High CPU utilization | CPU utilization is too high. The system might be slow to respond. | min(/HashiCorp Nomad Client by HTTP/nomad.client.cpu.total, 10m) >= {$NOMAD.CPU.UTIL.MIN} | Average | |
HashiCorp Nomad Client: High memory utilization | RAM utilization is too high. The system might be slow to respond. | (min(/HashiCorp Nomad Client by HTTP/nomad.client.memory.available, 10m) / last(/HashiCorp Nomad Client by HTTP/nomad.client.memory.total))*100 <= {$NOMAD.RAM.AVAIL.MIN} | Average | |
HashiCorp Nomad Client: The host has been restarted | The host uptime is less than 10 minutes. | last(/HashiCorp Nomad Client by HTTP/nomad.client.uptime) < 10m | Warning | Manual close: Yes |
HashiCorp Nomad Client: Nomad client version has changed | Nomad client version has changed. | change(/HashiCorp Nomad Client by HTTP/nomad.client.version)<>0 | Info | Manual close: Yes |
HashiCorp Nomad Client: Nodes API connection has failed | Nodes API connection has failed. | find(/HashiCorp Nomad Client by HTTP/nomad.client.node.info.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes Depends on: |
HashiCorp Nomad Client: Allocations API connection has failed | Allocations API connection has failed. | find(/HashiCorp Nomad Client by HTTP/nomad.client.job.allocs.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes Depends on: |
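The High memory utilization expression above compares the lowest available-memory sample of the last 10 minutes, expressed as a percentage of total memory, with {$NOMAD.RAM.AVAIL.MIN}. The arithmetic can be sketched in Python as follows. The function name and the sample values are hypothetical.

```python
def high_memory_utilization(available_samples, total_bytes, avail_min_pct=5):
    """True when the smallest available-memory sample in the window is at or
    below avail_min_pct percent of total memory."""
    available_pct = min(available_samples) / total_bytes * 100
    return available_pct <= avail_min_pct


GIB = 1024 ** 3
# Lowest sample is 0.5 GiB of 16 GiB, which is 3.125 percent.
print(high_memory_utilization([0.5 * GIB, 0.7 * GIB, 0.6 * GIB], 16 * GIB))
# Lowest sample is 2 GiB of 16 GiB, which is 12.5 percent.
print(high_memory_utilization([2 * GIB, 3 * GIB], 16 * GIB))
```

With the default threshold of 5, the first call reports a problem and the second does not.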
LLD rule Drivers discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Drivers discovery | Client drivers discovery. | Dependent item | nomad.client.drivers.discovery |
Item prototypes for Drivers discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Driver [{#DRIVER.NAME}] state | Driver [{#DRIVER.NAME}] state. | Dependent item | nomad.client.driver.state["{#DRIVER.NAME}"] |
Driver [{#DRIVER.NAME}] detection state | Driver [{#DRIVER.NAME}] detection state. | Dependent item | nomad.client.driver.detected["{#DRIVER.NAME}"] |
Trigger prototypes for Drivers discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] is in unhealthy state | The [{#DRIVER.NAME}] driver is detected, but its state is unhealthy. | last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.state["{#DRIVER.NAME}"]) = 0 and last(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) = 1 | Warning | Manual close: Yes |
HashiCorp Nomad Client: Driver [{#DRIVER.NAME}] detection state has changed | The [{#DRIVER.NAME}] driver detection state has changed. | change(/HashiCorp Nomad Client by HTTP/nomad.client.driver.detected["{#DRIVER.NAME}"]) <> 0 | Info | Manual close: Yes |
LLD rule Physical disks discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Physical disks discovery | Physical disks discovery. | Dependent item | nomad.client.disk.discovery |
Item prototypes for Physical disks discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Disk ["{#DEV.NAME}"] space available | Amount of space which is available on ["{#DEV.NAME}"] disk. | Dependent item | nomad.client.disk.available["{#DEV.NAME}"] |
Disk ["{#DEV.NAME}"] inodes utilization | Disk space consumed by the inodes on ["{#DEV.NAME}"] disk. | Dependent item | nomad.client.disk.inodes_percent["{#DEV.NAME}"] |
Disk ["{#DEV.NAME}"] size | Total size of the ["{#DEV.NAME}"] device. | Dependent item | nomad.client.disk.size["{#DEV.NAME}"] |
Disk ["{#DEV.NAME}"] space utilization | Percentage of disk ["{#DEV.NAME}"] space used. | Dependent item | nomad.client.disk.used_percent["{#DEV.NAME}"] |
Disk ["{#DEV.NAME}"] space used | Amount of disk ["{#DEV.NAME}"] space which has been used. | Dependent item | nomad.client.disk.used["{#DEV.NAME}"] |
Trigger prototypes for Physical disks discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device | It may become impossible to write to a disk if there are no index nodes left. | min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.WARN:"{#DEV.NAME}"} | Warning | Manual close: Yes Depends on: |
HashiCorp Nomad Client: Running out of free inodes on [{#DEV.NAME}] device | It may become impossible to write to a disk if there are no index nodes left. | min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.inodes_percent["{#DEV.NAME}"],5m) >= {$NOMAD.INODES.FREE.MIN.CRIT:"{#DEV.NAME}"} | Average | Manual close: Yes |
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization | High disk [{#DEV.NAME}] utilization. | min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.WARN:"{#DEV.NAME}"} | Warning | Manual close: Yes Depends on: |
HashiCorp Nomad Client: High disk [{#DEV.NAME}] utilization | High disk [{#DEV.NAME}] utilization. | min(/HashiCorp Nomad Client by HTTP/nomad.client.disk.used_percent["{#DEV.NAME}"],5m) >= {$NOMAD.DISK.UTIL.MIN.CRIT:"{#DEV.NAME}"} | Average | Manual close: Yes |
LLD rule Allocated jobs discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Allocated jobs discovery | Allocated jobs discovery. | Dependent item | nomad.client.alloc.discovery |
Item prototypes for Allocated jobs discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
Job ["{#JOB.NAME}"] CPU allocated | Total CPU resources allocated by the ["{#JOB.NAME}"] job across all cores. | Dependent item | nomad.client.allocs.cpu.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU system utilization | Total CPU resources consumed by the ["{#JOB.NAME}"] job in system space. | Dependent item | nomad.client.allocs.cpu.system["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU user utilization | Total CPU resources consumed by the ["{#JOB.NAME}"] job in user space. | Dependent item | nomad.client.allocs.cpu.user["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU total utilization | Total CPU resources consumed by the ["{#JOB.NAME}"] job across all cores. | Dependent item | nomad.client.allocs.cpu.total_percent["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU throttled periods time | Total number of CPU periods that the ["{#JOB.NAME}"] job was throttled. | Dependent item | nomad.client.allocs.cpu.throttled_periods["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU throttled time | Total time that the ["{#JOB.NAME}"] job was throttled. | Dependent item | nomad.client.allocs.cpu.throttled_time["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] CPU ticks | CPU ticks consumed by the process for the ["{#JOB.NAME}"] job in the last collection interval. | Dependent item | nomad.client.allocs.cpu.total_ticks["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] Memory allocated | Amount of memory allocated by the ["{#JOB.NAME}"] job. | Dependent item | nomad.client.allocs.memory.allocated["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] Memory cached | Amount of memory cached by the ["{#JOB.NAME}"] job. | Dependent item | nomad.client.allocs.memory.cache["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] Memory used | Total amount of memory used by the ["{#JOB.NAME}"] job. | Dependent item | nomad.client.allocs.memory.usage["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
Job ["{#JOB.NAME}"] Memory swapped | Amount of memory swapped by the ["{#JOB.NAME}"] job. | Dependent item | nomad.client.allocs.memory.swap["{#JOB.NAME}","{#JOB.TASK.GROUP}","{#JOB.NAMESPACE}"] |
HashiCorp Nomad Server by HTTP
Overview
This template is designed to monitor HashiCorp Nomad servers by Zabbix. It works without any external scripts.
Requirements
Zabbix version: 7.2 and higher.
Tested versions
This template has been tested on:
- HashiCorp Nomad version 1.5.6/1.6.0
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
1. Enable telemetry in the HashiCorp Nomad agent configuration file. Set the Prometheus metrics format. Refer to the vendor documentation.
2. Set the values for the {$NOMAD.SERVER.API.SCHEME} and {$NOMAD.SERVER.API.PORT} macros to define the common Nomad API web schema and connection port.
Additional information:
- The Nomad servers use the default web schema (HTTP) and the default API port (4646). If you're using servers discovery and need to redefine the macros for a particular host created from a prototype, use context macros such as {$NOMAD.SERVER.API.SCHEME:NECESSARY.IP} and/or {$NOMAD.SERVER.API.PORT:NECESSARY.IP} at the master host or template level.
- Some metrics may not be collected depending on your HashiCorp Nomad agent version, configuration and cluster role.
- Don't forget to define the {$NOMAD.REDUNDANCY.MIN} macro value based on the number of nodes in your cluster, so that the failure tolerance triggers are configured correctly.
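As a rough guide for choosing the {$NOMAD.REDUNDANCY.MIN} value, the sketch below computes how many servers a Raft cluster can lose while still keeping quorum, assuming the usual majority rule (quorum is more than half of the servers). The function name failure_tolerance is hypothetical, and the value you actually configure is a policy decision for your cluster.

```python
def failure_tolerance(servers):
    """Number of servers that can fail while a majority quorum survives."""
    quorum = servers // 2 + 1
    return servers - quorum


for size in (1, 3, 4, 5, 7):
    print(size, failure_tolerance(size))
```

For a 3-node cluster this gives 1, which agrees with the default value of the macro. Note that a 4-node cluster tolerates no more failures than a 3-node one.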
Useful links:
- HashiCorp Nomad metrics list
- HashiCorp Nomad telemetry configuration reference
- HashiCorp Nomad metrics API reference
- HashiCorp Nomad agent API reference
- HashiCorp Nomad cluster failure tolerance reference
- Zabbix user macros with context
Macros used
Name | Description | Default |
---|---|---|
{$NOMAD.SERVER.API.SCHEME} | Nomad server API scheme. | http |
{$NOMAD.SERVER.API.PORT} | Nomad server API port. | 4646 |
{$NOMAD.TOKEN} | Nomad authentication token. | <PUT YOUR AUTH TOKEN> |
{$NOMAD.DATA.TIMEOUT} | Response timeout for an API. | 15s |
{$NOMAD.HTTP.PROXY} | Sets the HTTP proxy for HTTP agent item. If this parameter is empty, then no proxy is used. | |
{$NOMAD.API.RESPONSE.SUCCESS} | HTTP API successful response code. Availability triggers threshold. Change, if needed. | 200 |
{$NOMAD.SERVER.RPC.PORT} | Nomad RPC service port. | 4647 |
{$NOMAD.SERVER.SERF.PORT} | Nomad serf service port. | 4648 |
{$NOMAD.REDUNDANCY.MIN} | Number of redundant servers needed to keep the cluster safe. The default value is '1' for a 3-node cluster. Change if needed. | 1 |
{$NOMAD.OPEN.FDS.MAX} | Maximum percentage of used file descriptors. | 90 |
{$NOMAD.SERVER.LEADER.LATENCY} | Leader last contact latency threshold. | 0.3s |
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Telemetry get | Telemetry data in raw format. |
HTTP agent | nomad.server.data.get Preprocessing
|
Metrics | Nomad server metrics in raw format. |
Dependent item | nomad.server.metrics.get Preprocessing
|
Monitoring API response | Monitoring API response message. |
Dependent item | nomad.server.data.api.response Preprocessing
|
Internal stats get | Internal stats data in raw format. |
HTTP agent | nomad.server.stats.get Preprocessing
|
Internal stats API response | Internal stats API response message. |
Dependent item | nomad.server.stats.api.response Preprocessing
|
Nomad server version | Nomad server version. |
Dependent item | nomad.server.version Preprocessing
|
Nomad raft version | Nomad raft version. |
Dependent item | nomad.raft.version Preprocessing
|
Raft peers | Current cluster raft peers amount. |
Dependent item | nomad.server.raft.peers Preprocessing
|
Cluster role | Current role in the cluster. |
Dependent item | nomad.server.raft.cluster_role Preprocessing
|
CPU time, rate | Total user and system CPU time spent in seconds. |
Dependent item | nomad.server.cpu.time Preprocessing
|
Memory used | Memory utilization in bytes. |
Dependent item | nomad.server.runtime.alloc_bytes Preprocessing
|
Virtual memory size | Virtual memory size in bytes. |
Dependent item | nomad.server.virtual_memory_bytes Preprocessing
|
Resident memory size | Resident memory size in bytes. |
Dependent item | nomad.server.resident_memory_bytes Preprocessing
|
Heap objects | Number of objects on the heap. General memory pressure indicator. |
Dependent item | nomad.server.runtime.heap_objects Preprocessing
|
Open file descriptors | Number of open file descriptors. |
Dependent item | nomad.server.process_open_fds Preprocessing
|
Open file descriptors, max | Maximum number of open file descriptors. |
Dependent item | nomad.server.process_max_fds Preprocessing
|
Goroutines | Number of goroutines and general load pressure indicator. |
Dependent item | nomad.server.runtime.num_goroutines Preprocessing
|
Evaluations pending | Evaluations that are pending until an existing evaluation for the same job completes. |
Dependent item | nomad.server.broker.total_pending Preprocessing
|
Evaluations ready | Number of evaluations ready to be processed. |
Dependent item | nomad.server.broker.total_ready Preprocessing
|
Evaluations unacked | Evaluations dispatched for processing but incomplete. |
Dependent item | nomad.server.broker.total_unacked Preprocessing
|
CPU shares for blocked evaluations | Amount of CPU shares requested by blocked evals. |
Dependent item | nomad.server.blocked_evals.cpu Preprocessing
|
Memory shares by blocked evaluations | Amount of memory requested by blocked evals. |
Dependent item | nomad.server.blocked_evals.memory Preprocessing
|
CPU shares for blocked job evaluations | Amount of CPU shares requested by blocked evals of a job. |
Dependent item | nomad.server.blocked_evals.job.cpu Preprocessing
|
Memory shares for blocked job evaluations | Amount of memory requested by blocked evals of a job. |
Dependent item | nomad.server.blocked_evals.job.memory Preprocessing
|
Evaluations blocked | Count of evals in the blocked state for any reason (cluster resource exhaustion or quota limits). |
Dependent item | nomad.server.blocked_evals.total_blocked Preprocessing
|
Evaluations escaped | Count of evals that have escaped computed node classes. This indicates a scheduler optimization was skipped and is not usually a source of concern. |
Dependent item | nomad.server.blocked_evals.total_escaped Preprocessing
|
Evaluations waiting | Count of evals waiting to be enqueued. |
Dependent item | nomad.server.broker.total_waiting Preprocessing
|
Evaluations blocked due to quota limit | Count of blocked evals due to quota limits (the resources for these jobs are not counted in other blocked_evals metrics, except for total_blocked). |
Dependent item | nomad.server.blocked_evals.total_quota_limit Preprocessing
|
Evaluations enqueue time | Average time elapsed with evaluations waiting to be enqueued. |
Dependent item | nomad.server.broker.eval_waiting Preprocessing
|
RPC evaluation acknowledgement time | Time elapsed for Eval.Ack RPC call. |
Dependent item | nomad.server.eval.ack Preprocessing
|
RPC job summary time | Time elapsed for Job.Summary RPC call. |
Dependent item | nomad.server.job_summary.get_job_summary Preprocessing
|
Heartbeats active | Number of active heartbeat timers. Each timer represents a Nomad client connection. |
Dependent item | nomad.server.heartbeat.active Preprocessing
|
RPC requests, rate | Number of RPC requests being handled. |
Dependent item | nomad.server.rpc.request Preprocessing
|
RPC error requests, rate | Number of RPC requests being handled that result in an error. |
Dependent item | nomad.server.rpc.request_error Preprocessing
|
RPC queries, rate | Number of RPC queries. |
Dependent item | nomad.server.rpc.query Preprocessing
|
RPC job allocations time | Time elapsed for Job.Allocations RPC call. |
Dependent item | nomad.server.job.allocations Preprocessing
|
RPC job evaluations time | Time elapsed for Job.Evaluations RPC call. |
Dependent item | nomad.server.job.evaluations Preprocessing
|
RPC get job time | Time elapsed for Job.GetJob RPC call. |
Dependent item | nomad.server.job.get_job Preprocessing
|
Plan apply time | Time elapsed to apply a plan. |
Dependent item | nomad.server.plan.apply Preprocessing
|
Plan evaluate time | Time elapsed to evaluate a plan. |
Dependent item | nomad.server.plan.evaluate Preprocessing
|
RPC plan submit time | Time elapsed for Plan.Submit RPC call. |
Dependent item | nomad.server.plan.submit Preprocessing
|
Plan raft index processing time | Time the planner waits for the raft index of the plan to be processed. |
Dependent item | nomad.server.plan.wait_for_index Preprocessing
|
RPC list time | Time elapsed for Node.List RPC call. |
Dependent item | nomad.server.client.list Preprocessing
|
RPC update allocations time | Time elapsed for Node.UpdateAlloc RPC call. |
Dependent item | nomad.server.client.update_alloc Preprocessing
|
RPC update status time | Time elapsed for Node.UpdateStatus RPC call. |
Dependent item | nomad.server.client.update_status Preprocessing
|
RPC get client allocs time | Time elapsed for Node.GetClientAllocs RPC call. |
Dependent item | nomad.server.client.get_client_allocs Preprocessing
|
RPC eval dequeue time | Time elapsed for Eval.Dequeue RPC call. |
Dependent item | nomad.server.client.dequeue Preprocessing
|
Vault token last renewal | Time since last successful Vault token renewal. |
Dependent item | nomad.server.vault.token_last_renewal Preprocessing
|
Vault token next renewal | Time until next Vault token renewal attempt. |
Dependent item | nomad.server.vault.token_next_renewal Preprocessing
|
Vault token TTL | Time to live for Vault token. |
Dependent item | nomad.server.vault.token_ttl Preprocessing
|
Vault tokens revoked | Count of revoked tokens. |
Dependent item | nomad.server.vault.distributed_tokens_revoked Preprocessing
|
Jobs dead | Number of dead jobs. |
Dependent item | nomad.server.job_status.dead Preprocessing
|
Jobs pending | Number of pending jobs. |
Dependent item | nomad.server.job_status.pending Preprocessing
|
Jobs running | Number of running jobs. |
Dependent item | nomad.server.job_status.running Preprocessing
|
Job allocations completed | Number of complete allocations for a job. |
Dependent item | nomad.server.job_summary.complete Preprocessing
|
Job allocations failed | Number of failed allocations for a job. |
Dependent item | nomad.server.job_summary.failed Preprocessing
|
Job allocations lost | Number of lost allocations for a job. |
Dependent item | nomad.server.job_summary.lost Preprocessing
|
Job allocations unknown | Number of unknown allocations for a job. |
Dependent item | nomad.server.job_summary.unknown Preprocessing
|
Job allocations queued | Number of queued allocations for a job. |
Dependent item | nomad.server.job_summary.queued Preprocessing
|
Job allocations running | Number of running allocations for a job. |
Dependent item | nomad.server.job_summary.running Preprocessing
|
Job allocations starting | Number of starting allocations for a job. |
Dependent item | nomad.server.job_summary.starting Preprocessing
|
Gossip time | Time elapsed to broadcast gossip messages. |
Dependent item | nomad.server.memberlist.gossip Preprocessing
|
Leader barrier time | Time elapsed to establish a raft barrier during leader transition. |
Dependent item | nomad.server.leader.barrier Preprocessing
|
Reconcile peer time | Time elapsed to reconcile a serf peer with state store. |
Dependent item | nomad.server.leader.reconcile_member Preprocessing
|
Total reconcile time | Time elapsed to reconcile all serf peers with state store. |
Dependent item | nomad.server.leader.reconcile Preprocessing
|
Leader last contact | Time since the last contact with the leader. General indicator of Raft latency. |
Dependent item | nomad.server.raft.leader.lastContact Preprocessing
|
Plan queue | Count of evals in the plan queue. |
Dependent item | nomad.server.plan.queue_depth Preprocessing
|
Worker evaluation create time | Time elapsed for worker to create an eval. |
Dependent item | nomad.server.worker.create_eval Preprocessing
|
Worker evaluation dequeue time | Time elapsed for worker to dequeue an eval. |
Dependent item | nomad.server.worker.dequeue_eval Preprocessing
|
Worker invoke scheduler time | Time elapsed for worker to invoke the scheduler. |
Dependent item | nomad.server.worker.invoke_scheduler_service Preprocessing
|
Worker acknowledgement send time | Time elapsed for worker to send acknowledgement. |
Dependent item | nomad.server.worker.send_ack Preprocessing
|
Worker submit plan time | Time elapsed for worker to submit plan. |
Dependent item | nomad.server.worker.submit_plan Preprocessing
|
Worker update evaluation time | Time elapsed for worker to submit updated eval. |
Dependent item | nomad.server.worker.update_eval Preprocessing
|
Worker log replication time | Time the worker waits for the raft index of the eval to be processed. |
Dependent item | nomad.server.worker.wait_for_index Preprocessing
|
Raft calls blocked, rate | Count of blocking raft API calls. |
Dependent item | nomad.server.raft.barrier Preprocessing
|
Raft commit logs enqueued | Count of logs enqueued. |
Dependent item | nomad.server.raft.commit_num_logs Preprocessing
|
Raft transactions, rate | Number of Raft transactions. |
Dependent item | nomad.server.raft.apply Preprocessing
|
Raft commit time | Time elapsed to commit writes. |
Dependent item | nomad.server.raft.commit_time Preprocessing
|
Raft transaction commit time | Raft transaction commit time. |
Dependent item | nomad.server.raft.replication.appendEntries Preprocessing
|
FSM apply time | Time elapsed to apply write to FSM. |
Dependent item | nomad.server.raft.fsm.apply Preprocessing
|
FSM enqueue time | Time elapsed to enqueue write to FSM. |
Dependent item | nomad.server.raft.fsm.enqueue Preprocessing
|
FSM autopilot time | Time elapsed to apply Autopilot raft entry. |
Dependent item | nomad.server.raft.fsm.autopilot Preprocessing
|
FSM register node time | Time elapsed to apply RegisterNode raft entry. |
Dependent item | nomad.server.raft.fsm.register_node Preprocessing
|
FSM index | Current index applied to FSM. |
Dependent item | nomad.server.raft.applied_index Preprocessing
|
Raft last index | Most recent index seen. |
Dependent item | nomad.server.raft.last_index Preprocessing
|
Dispatch log time | Time elapsed to write log, mark in flight, and start replication. |
Dependent item | nomad.server.raft.leader.dispatch_log Preprocessing
|
Logs dispatched | Count of logs dispatched. |
Dependent item | nomad.server.raft.leader.dispatch_num_logs Preprocessing
|
Heartbeat fails | Count of heartbeat failures that led to starting an election. |
Dependent item | nomad.server.raft.transition.heartbeat_timeout Preprocessing
|
Objects freed, rate | Count of objects freed from heap by Go runtime GC. |
Dependent item | nomad.server.runtime.free_count Preprocessing
|
GC pause time | Go runtime GC pause times. |
Dependent item | nomad.server.runtime.gc_pause_ns Preprocessing
|
GC metadata size | Go runtime GC metadata size in bytes. |
Dependent item | nomad.server.runtime.sys_bytes Preprocessing
|
GC runs | Count of Go runtime GC runs. |
Dependent item | nomad.server.runtime.total_gc_runs Preprocessing
|
Memberlist events | Count of memberlist events received. |
Dependent item | nomad.server.serf.queue.event Preprocessing
|
Memberlist changes | Count of memberlist changes. |
Dependent item | nomad.server.serf.queue.intent Preprocessing
|
Memberlist queries | Count of memberlist queries. |
Dependent item | nomad.server.serf.queue.queries Preprocessing
|
Snapshot index | Current snapshot index. |
Dependent item | nomad.server.state.snapshot.index Preprocessing
|
Services ready to schedule | Count of service evals ready to be scheduled. |
Dependent item | nomad.server.broker.service_ready Preprocessing
|
Services unacknowledged | Count of unacknowledged service evals. |
Dependent item | nomad.server.broker.service_unacked Preprocessing
|
System evaluations ready to schedule | Count of system evals ready to be scheduled. |
Dependent item | nomad.server.broker.system_ready Preprocessing
|
System evaluations unacknowledged | Count of unacknowledged system evals. |
Dependent item | nomad.server.broker.system_unacked Preprocessing
|
BoltDB free pages | Number of BoltDB free pages. |
Dependent item | nomad.server.raft.boltdb.num_free_pages Preprocessing
|
BoltDB pending pages | Number of BoltDB pending pages. |
Dependent item | nomad.server.raft.boltdb.num_pending_pages Preprocessing
|
BoltDB free page bytes | Number of free page bytes. |
Dependent item | nomad.server.raft.boltdb.free_page_bytes Preprocessing
|
BoltDB freelist bytes | Number of freelist bytes. |
Dependent item | nomad.server.raft.boltdb.freelist_bytes Preprocessing
|
BoltDB read transactions, rate | Count of total read transactions. |
Dependent item | nomad.server.raft.boltdb.total_read_txn Preprocessing
|
BoltDB open read transactions | Number of current open read transactions. |
Dependent item | nomad.server.raft.boltdb.open_read_txn Preprocessing
|
BoltDB pages in use | Number of pages in use. |
Dependent item | nomad.server.raft.boltdb.txstats.page_count Preprocessing
|
BoltDB page allocations, rate | Number of page allocations. |
Dependent item | nomad.server.raft.boltdb.txstats.page_alloc Preprocessing
|
BoltDB cursors | Count of total database cursors. |
Dependent item | nomad.server.raft.boltdb.txstats.cursor_count Preprocessing
|
BoltDB nodes, rate | Count of total database nodes. |
Dependent item | nomad.server.raft.boltdb.txstats.node_count Preprocessing
|
BoltDB node dereferences, rate | Count of total database node dereferences. |
Dependent item | nomad.server.raft.boltdb.txstats.node_deref Preprocessing
|
BoltDB rebalance operations, rate | Count of total rebalance operations. |
Dependent item | nomad.server.raft.boltdb.txstats.rebalance Preprocessing
|
BoltDB split operations, rate | Count of total split operations. |
Dependent item | nomad.server.raft.boltdb.txstats.split Preprocessing
|
BoltDB spill operations, rate | Count of total spill operations. |
Dependent item | nomad.server.raft.boltdb.txstats.spill Preprocessing
|
BoltDB write operations, rate | Count of total write operations. |
Dependent item | nomad.server.raft.boltdb.txstats.write Preprocessing
|
BoltDB rebalance time | Sample of rebalance operation times. |
Dependent item | nomad.server.raft.boltdb.txstats.rebalance_time Preprocessing
|
BoltDB spill time | Sample of spill operation times. |
Dependent item | nomad.server.raft.boltdb.txstats.spill_time Preprocessing
|
BoltDB write time | Sample of write operation times. |
Dependent item | nomad.server.raft.boltdb.txstats.write_time Preprocessing
|
Service [rpc] state | Current [rpc] service state. |
Simple check | net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}] Preprocessing
|
Service [serf] state | Current [serf] service state. |
Simple check | net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}] Preprocessing
|
Namespace list time | Time elapsed for Namespace.ListNamespaces. |
Dependent item | nomad.server.namespace.list_namespace Preprocessing
|
Autopilot state | Current autopilot state. |
Dependent item | nomad.server.autopilot.state Preprocessing
|
Autopilot failure tolerance | The number of redundant healthy servers that can fail without causing an outage. |
Dependent item | nomad.server.autopilot.failure_tolerance Preprocessing
|
FSM allocation client update time | Time elapsed to apply AllocClientUpdate raft entry. |
Dependent item | nomad.server.alloc_client_update Preprocessing
|
FSM apply plan results time | Time elapsed to apply ApplyPlanResults raft entry. |
Dependent item | nomad.server.fsm.apply_plan_results Preprocessing
|
FSM update evaluation time | Time elapsed to apply UpdateEval raft entry. |
Dependent item | nomad.server.fsm.update_eval Preprocessing
|
FSM job registration time | Time elapsed to apply RegisterJob raft entry. |
Dependent item | nomad.server.fsm.register_job Preprocessing
|
Allocation reschedule attempts | Count of attempts to reschedule an allocation. |
Dependent item | nomad.server.scheduler.allocs.rescheduled.attempted Preprocessing
|
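Most items above are dependent items, so each one derives its value from data already collected by a master item rather than polling Nomad on its own. The preprocessing steps are not shown in this table, so the following Python snippet is only a rough sketch of the idea. It assumes the master item holds JSON shaped like the output of Nomad's `/v1/metrics` endpoint (`Gauges`, `Counters`, `Samples`), and that a key such as `nomad.server.broker.total_waiting` corresponds to the Nomad metric `nomad.nomad.broker.total_waiting`. The payload, its values, and the `metric_field` helper are hypothetical.

```python
import json

# Hypothetical, trimmed payload in the shape of Nomad's /v1/metrics JSON output.
# Names and values are illustrative assumptions, not captured data.
SAMPLE_METRICS = json.loads("""
{
  "Gauges": [
    {"Name": "nomad.nomad.broker.total_waiting", "Value": 3, "Labels": {}},
    {"Name": "nomad.nomad.heartbeat.active", "Value": 12, "Labels": {}}
  ],
  "Counters": [
    {"Name": "nomad.nomad.rpc.request", "Count": 40, "Rate": 4.0, "Labels": {}}
  ],
  "Samples": [
    {"Name": "nomad.nomad.plan.evaluate", "Count": 5, "Mean": 0.25, "Labels": {}}
  ]
}
""")


def metric_field(payload, section, name, field):
    """Return `field` of the first metric called `name` in `section`, or None."""
    for metric in payload.get(section, []):
        if metric.get("Name") == name:
            return metric.get(field)
    return None


# One master payload feeds many dependent values.
waiting = metric_field(SAMPLE_METRICS, "Gauges", "nomad.nomad.broker.total_waiting", "Value")
rpc_rate = metric_field(SAMPLE_METRICS, "Counters", "nomad.nomad.rpc.request", "Rate")
plan_eval_mean = metric_field(SAMPLE_METRICS, "Samples", "nomad.nomad.plan.evaluate", "Mean")
# A metric that is absent from the payload yields None instead of raising.
missing = metric_field(SAMPLE_METRICS, "Gauges", "nomad.nomad.does_not_exist", "Value")
```

In this sketch a metric missing from the payload simply yields `None`. How the template itself handles an absent metric depends on its preprocessing settings, which this table does not show.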
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
HashiCorp Nomad Server: Monitoring API connection has failed | Monitoring API connection has failed. | find(/HashiCorp Nomad Server by HTTP/nomad.server.data.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes |
HashiCorp Nomad Server: Internal stats API connection has failed | Internal stats API connection has failed. | find(/HashiCorp Nomad Server by HTTP/nomad.server.stats.api.response,,"like","{$NOMAD.API.RESPONSE.SUCCESS}")=0 | Average | Manual close: Yes Depends on: |
HashiCorp Nomad Server: Nomad server version has changed | Nomad server version has changed. | change(/HashiCorp Nomad Server by HTTP/nomad.server.version)<>0 | Info | Manual close: Yes |
HashiCorp Nomad Server: Cluster role has changed | Cluster role has changed. | change(/HashiCorp Nomad Server by HTTP/nomad.server.raft.cluster_role) <> 0 | Info | Manual close: Yes |
HashiCorp Nomad Server: Current number of open files is too high | Heavy file descriptor usage (i.e., near the process file descriptor limit) indicates a potential file descriptor exhaustion issue. | min(/HashiCorp Nomad Server by HTTP/nomad.server.process_open_fds,5m)/last(/HashiCorp Nomad Server by HTTP/nomad.server.process_max_fds)*100>{$NOMAD.OPEN.FDS.MAX} | Warning | |
HashiCorp Nomad Server: Dead jobs found | Jobs with the | last(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead) > 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.job_status.dead,5m) = 0 | Warning | Manual close: Yes |
HashiCorp Nomad Server: Leader last contact timeout exceeded | The nomad.raft.leader.lastContact metric is a general indicator of Raft latency which can be used to observe how Raft timing is performing and guide infrastructure provisioning. | min(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) >= {$NOMAD.SERVER.LEADER.LATENCY} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.raft.leader.lastContact,5m) = 0 | Warning | |
HashiCorp Nomad Server: Service [rpc] is down | Cannot establish the connection to [rpc] service port {$NOMAD.SERVER.RPC.PORT}. | last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.RPC.PORT}]) = 0 | Average | Manual close: Yes |
HashiCorp Nomad Server: Service [serf] is down | Cannot establish the connection to [serf] service port {$NOMAD.SERVER.SERF.PORT}. | last(/HashiCorp Nomad Server by HTTP/net.tcp.service[tcp,,{$NOMAD.SERVER.SERF.PORT}]) = 0 | Average | Manual close: Yes |
HashiCorp Nomad Server: Autopilot is unhealthy | The autopilot is in an unhealthy state. The probability of a successful failover is extremely low. | last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state) = 0 and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.state,5m) = 0 | Average | Manual close: Yes |
HashiCorp Nomad Server: Autopilot redundancy is low | The autopilot redundancy is low. | last(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance) < {$NOMAD.REDUNDANCY.MIN} and nodata(/HashiCorp Nomad Server by HTTP/nomad.server.autopilot.failure_tolerance,5m) = 0 | Warning | Manual close: Yes |
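
To illustrate how the "Current number of open files is too high" expression behaves, here is a minimal Python sketch of the same arithmetic: the minimum of the open file descriptor samples over the window, divided by the latest limit, as a percentage compared with a threshold. The function name, the sample values, and the threshold of 90 are hypothetical. Returning `False` on empty input is a simplification made for this sketch and is not a description of how Zabbix treats missing data.

```python
def open_fds_trigger(open_fds_window, max_fds, threshold_pct):
    """Mimic min(open_fds,5m) / last(max_fds) * 100 > threshold.

    open_fds_window: samples of open file descriptors within the window.
    max_fds: most recent value of the process file descriptor limit.
    threshold_pct: stand-in for {$NOMAD.OPEN.FDS.MAX} (value is an assumption).
    """
    if not open_fds_window or max_fds <= 0:
        # Simplification for this sketch: no usable data means "not firing".
        return False
    return min(open_fds_window) / max_fds * 100 > threshold_pct


# Usage stays high for the whole window: 940 / 1024 * 100 is about 91.8.
sustained = open_fds_trigger([950, 960, 940], 1024, 90)

# One sample dips to 600: 600 / 1024 * 100 is about 58.6.
dipped = open_fds_trigger([950, 600, 940], 1024, 90)
```

Because the expression uses `min` over the window, a single low sample is enough to keep the condition false, so it only holds when usage stays above the threshold for the entire five minutes.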
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at the ZABBIX forums.