Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia?at=release/7.2
Nvidia by Zabbix agent 2
Overview
This template is designed for Nvidia GPU monitoring and doesn't require any external scripts. All Nvidia GPUs will be discovered. Set filters with macros if you want to override default filter parameters.
Requirements
Zabbix version: 7.2 and higher.
Tested versions
This template has been tested on:
- Nvidia GTX 1650s
- Nvidia RTX 2070Ti
Configuration
Zabbix should be configured according to the instructions in the Templates out of the box section.
Setup
- Set up and configure Zabbix agent 2 compiled with the Nvidia monitoring plugin.
- Create a host with a Zabbix agent interface and attach the template to it.
Test availability: zabbix_get -s nvidia-host -k nvml.system.driver.version
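The availability check above can also be scripted. The following is a minimal sketch, assuming `zabbix_get` is installed and on `PATH`; the helper names `build_zabbix_get_cmd` and `check_availability` are hypothetical and exist only for illustration.

```python
import shutil
import subprocess


def build_zabbix_get_cmd(host: str, key: str) -> list:
    """Build the argument list for the availability check shown above."""
    return ["zabbix_get", "-s", host, "-k", key]


def check_availability(host: str, key: str = "nvml.system.driver.version") -> str:
    """Run zabbix_get and return its trimmed output; raises if the call fails."""
    if shutil.which("zabbix_get") is None:
        raise FileNotFoundError("zabbix_get was not found on PATH")
    result = subprocess.run(
        build_zabbix_get_cmd(host, key),
        capture_output=True,
        text=True,
        check=True,
        timeout=10,
    )
    return result.stdout.strip()


# Example call ("nvidia-host" is the placeholder host name used above):
#   print(check_availability("nvidia-host"))
```

If the agent and plugin are set up correctly, the call should return the driver version string; an error from `zabbix_get` usually points to a connectivity or plugin problem.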
Macros used
Name | Description | Default |
---|---|---|
{$NVIDIA.GPU.UTIL.WARN} | Warning threshold for GPU overall utilization, in %. | 80 |
{$NVIDIA.GPU.UTIL.CRIT} | Critical threshold for GPU overall utilization, in %. | 90 |
{$NVIDIA.ENCODER.UTIL.WARN} | Warning threshold for encoder utilization, in %. | 80 |
{$NVIDIA.ENCODER.UTIL.CRIT} | Critical threshold for encoder utilization, in %. | 90 |
{$NVIDIA.DECODER.UTIL.WARN} | Warning threshold for decoder utilization, in %. | 80 |
{$NVIDIA.DECODER.UTIL.CRIT} | Critical threshold for decoder utilization, in %. | 90 |
{$NVIDIA.MEMORY.UTIL.WARN} | Warning threshold for memory utilization, in %. | 80 |
{$NVIDIA.MEMORY.UTIL.CRIT} | Critical threshold for memory utilization, in %. | 90 |
{$NVIDIA.FAN.SPEED.WARN} | Warning threshold for fan speed, in %. | 80 |
{$NVIDIA.FAN.SPEED.CRIT} | Critical threshold for fan speed, in %. | 90 |
{$NVIDIA.TEMPERATURE.WARN} | Warning threshold for temperature, in degrees C. | 80 |
{$NVIDIA.TEMPERATURE.CRIT} | Critical threshold for temperature, in degrees C. | 90 |
{$NVIDIA.POWER.UTIL.WARN} | Warning threshold for power usage, in % of the power limit. | 80 |
{$NVIDIA.POWER.UTIL.CRIT} | Critical threshold for power usage, in % of the power limit. | 90 |
{$NVIDIA.NAME.MATCHES} | Filter to include GPUs by name in discovery. | .* |
{$NVIDIA.NAME.NOT_MATCHES} | Filter to exclude GPUs by name in discovery. | CHANGE IF NEEDED |
{$NVIDIA.UUID.MATCHES} | Filter to include GPUs by UUID in discovery. | .* |
{$NVIDIA.UUID.NOT_MATCHES} | Filter to exclude GPUs by UUID in discovery. | CHANGE IF NEEDED |
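To illustrate how the four discovery filter macros interact, here is a minimal sketch. It assumes that all four conditions are combined with a logical AND and uses Python's `re` module as a stand-in for the regular expression engine Zabbix uses; the device names, UUIDs, and the `is_discovered` helper are hypothetical.

```python
import re

# Default values copied from the macro table above.
DEFAULT_MACROS = {
    "{$NVIDIA.NAME.MATCHES}": ".*",
    "{$NVIDIA.NAME.NOT_MATCHES}": "CHANGE IF NEEDED",
    "{$NVIDIA.UUID.MATCHES}": ".*",
    "{$NVIDIA.UUID.NOT_MATCHES}": "CHANGE IF NEEDED",
}


def is_discovered(name: str, uuid: str, macros: dict) -> bool:
    """Return True if a GPU passes every include and exclude filter (assumed AND)."""
    return (
        re.search(macros["{$NVIDIA.NAME.MATCHES}"], name) is not None
        and re.search(macros["{$NVIDIA.NAME.NOT_MATCHES}"], name) is None
        and re.search(macros["{$NVIDIA.UUID.MATCHES}"], uuid) is not None
        and re.search(macros["{$NVIDIA.UUID.NOT_MATCHES}"], uuid) is None
    )


# Hypothetical devices, for illustration only.
gpus = [
    ("NVIDIA GeForce GTX 1650 SUPER", "GPU-aaaa"),
    ("NVIDIA Tesla T4", "GPU-bbbb"),
]

# With the defaults, every GPU passes: ".*" matches anything and
# "CHANGE IF NEEDED" does not occur in these names or UUIDs.
all_found = [n for n, u in gpus if is_discovered(n, u, DEFAULT_MACROS)]

# Overriding one macro excludes GPUs whose name contains "Tesla".
custom = dict(DEFAULT_MACROS)
custom["{$NVIDIA.NAME.NOT_MATCHES}"] = "Tesla"
filtered = [n for n, u in gpus if is_discovered(n, u, custom)]
```

The placeholder default `CHANGE IF NEEDED` works as a "match nothing" pattern in practice, which is why the exclude filters have no effect until they are overridden.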
Items
Name | Description | Type | Key and additional info |
---|---|---|---|
Driver version | Retrieves the version of the system's graphics driver. For all Nvidia products. | Zabbix agent | nvml.system.driver.version (preprocessing applied) |
NVML library version | Retrieves the version of the NVML library. For all Nvidia products. | Zabbix agent | nvml.version (preprocessing applied) |
Number of devices | Retrieves the number of compute devices in the system. A compute device is a single GPU. For all Nvidia products. | Zabbix agent | nvml.device.count (preprocessing applied) |
Get devices | Retrieves the list of Nvidia devices in the system. | Zabbix agent | nvml.device.get |
Triggers
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
Nvidia: Driver version has changed | Driver version has changed. | change(/Nvidia by Zabbix agent 2/nvml.system.driver.version) <> 0 | Info | Manual close: Yes |
Nvidia: NVML library has changed | NVML library version has changed. | change(/Nvidia by Zabbix agent 2/nvml.version) <> 0 | Info | Manual close: Yes |
Nvidia: Number of devices has changed | Number of devices has changed. Check whether it was intentional. | change(/Nvidia by Zabbix agent 2/nvml.device.count) <> 0 | Warning | Manual close: Yes |
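The three triggers above share the pattern `change(...) <> 0`. A rough sketch of that logic follows, assuming the documented Zabbix behavior in which `change()` returns the numeric difference for numbers and 0 or 1 for strings (equal or different). The function names and the version strings in the usage note are hypothetical.

```python
def change(previous, latest):
    """Approximate change(): numeric difference, or 0/1 for string values."""
    if isinstance(previous, str) or isinstance(latest, str):
        return 0 if previous == latest else 1
    return latest - previous


def trigger_fires(previous, latest) -> bool:
    """Mirror the expression change(...) <> 0."""
    return change(previous, latest) != 0
```

For example, `trigger_fires("550.54.14", "555.42.02")` evaluates to `True` (a driver version change), while `trigger_fires(2, 2)` evaluates to `False` (device count unchanged).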
LLD rule GPU Discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
GPU Discovery | Nvidia GPU discovery in the system. | Dependent item | nvml.device.discovery (preprocessing applied) |
Item prototypes for GPU Discovery
Name | Description | Type | Key and additional info |
---|---|---|---|
[{#UUID}]: Serial number | Retrieves the globally unique board serial number associated with this device's board. For all products with an inforom. This number matches the serial number tag that is physically attached to the board. | Zabbix agent | nvml.device.serial["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Encoder utilization | Retrieves the current utilization for the Encoder. For Nvidia Kepler or newer fully supported devices. | Zabbix agent | nvml.device.encoder.utilization["{#UUID}"] |
[{#UUID}]: Decoder utilization | Retrieves the current utilization for the Decoder. For Nvidia Kepler or newer fully supported devices. | Zabbix agent | nvml.device.decoder.utilization["{#UUID}"] |
[{#UUID}]: Fan speed | Retrieves the intended operating speed of the device's specified fan. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, the output will not match the actual fan speed. For all Nvidia discrete products with dedicated fans. The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed. This value may exceed 100% in certain cases. | Zabbix agent | nvml.device.fan.speed.avg["{#UUID}"] |
[{#UUID}]: Power usage | Retrieves power usage, in watts, for this GPU and its associated circuitry (e.g. memory). For Nvidia Fermi or newer fully supported devices. On Fermi and Kepler GPUs the reading is accurate to within +/- 5% of current power draw. On Ampere (except GA100) or newer GPUs, the API returns power averaged over a 1-second interval. On GA100 and older architectures, instantaneous power is returned. | Zabbix agent | nvml.device.power.usage["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Power limit | Retrieves the power management limit associated with this device. For Nvidia Fermi or newer fully supported devices. The power limit defines the upper boundary for the card's power draw. If the card's total power draw reaches this limit, the power management algorithm kicks in. This reading is only available if power management mode is supported. | Zabbix agent | nvml.device.power.limit["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Energy consumption | Retrieves total energy consumption for this GPU in joules (J) since the driver was last reloaded. For Nvidia Volta or newer fully supported devices. | Zabbix agent | nvml.device.energy.consumption["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Temperature | Retrieves the current temperature readings for the device, in degrees C. For all Nvidia products. | Zabbix agent | nvml.device.temperature["{#UUID}"] |
[{#UUID}]: Memory frequency | Retrieves the current memory clock speed for the device. For Nvidia Fermi or newer fully supported devices. | Zabbix agent | nvml.device.memory.frequency["{#UUID}"] (preprocessing applied) |
[{#UUID}]: SM frequency | Retrieves the current SM clock speed for the device. For Nvidia Fermi or newer fully supported devices. | Zabbix agent | nvml.device.sm.frequency["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Graphics frequency | Retrieves the current graphics clock speed for the device. For Nvidia Fermi or newer fully supported devices. | Zabbix agent | nvml.device.graphics.frequency["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Video frequency | Retrieves the current video encoder/decoder clock speed for the device. For Nvidia Fermi or newer fully supported devices. | Zabbix agent | nvml.device.video.frequency["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Performance state | Retrieves the current performance state for the device. For Nvidia Fermi or newer fully supported devices. | Zabbix agent | nvml.device.performance.state["{#UUID}"] |
[{#UUID}]: Device utilization, get | Retrieves the current utilization rates for the device's major subsystems. For Nvidia Fermi or newer fully supported devices. | Zabbix agent | nvml.device.utilization["{#UUID}"] |
[{#UUID}]: GPU utilization | Percent of time over the past sample period during which one or more kernels were executing on the GPU. For Nvidia Fermi or newer fully supported devices. | Dependent item | nvml.device.utilization.gpu["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Memory utilization | Percent of time over the past sample period during which global (device) memory was being read or written. For Nvidia Fermi or newer fully supported devices. | Dependent item | nvml.device.utilization.memory["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Encoder stats | Retrieves the current encoder statistics for a given device. For Nvidia Maxwell or newer fully supported devices. | Zabbix agent | nvml.device.encoder.stats.get["{#UUID}"] |
[{#UUID}]: Encoder sessions | Retrieves the current count of active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices. | Dependent item | nvml.device.encoder.stats.sessions["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Encoder average FPS | Retrieves the trailing average FPS of all active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices. | Dependent item | nvml.device.encoder.stats.fps["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Encoder average latency | Retrieves the current encode latency for a given device. For Nvidia Maxwell or newer fully supported devices. | Dependent item | nvml.device.encoder.stats.latency["{#UUID}"] (preprocessing applied) |
[{#UUID}]: FB memory, get | Retrieves the amount of used, free, reserved and total memory available on the device. For all Nvidia products. Enabling ECC reduces the amount of total available memory, due to the extra required parity bits. Under WDDM most device memory is allocated and managed on startup by Windows. Under Linux and Windows TCC, the reported amount of used memory is equal to the sum of memory allocated by all active channels on the device. | Zabbix agent | nvml.device.memory.fb.get["{#UUID}"] |
[{#UUID}]: FB memory, total | Total physical memory on the device. For all Nvidia products. | Dependent item | nvml.device.memory.fb.total["{#UUID}"] (preprocessing applied) |
[{#UUID}]: FB memory, reserved | Memory reserved for system use (driver or firmware) on the device. For all Nvidia products. | Dependent item | nvml.device.memory.fb.reserved["{#UUID}"] (preprocessing applied) |
[{#UUID}]: FB memory, free | Unallocated memory on the device. For all Nvidia products. | Dependent item | nvml.device.memory.fb.free["{#UUID}"] (preprocessing applied) |
[{#UUID}]: FB memory, used | Allocated memory on the device. For all Nvidia products. | Dependent item | nvml.device.memory.fb.used["{#UUID}"] (preprocessing applied) |
[{#UUID}]: BAR1 memory, get | Gets the total, available, and used size of BAR1 memory. BAR1 is used to map the FB (device memory) so that it can be directly accessed by the CPU or by third-party devices (peer-to-peer on the PCIe bus). For Nvidia Kepler or newer fully supported devices. | Zabbix agent | nvml.device.memory.bar1.get["{#UUID}"] |
[{#UUID}]: BAR1 memory, total | Total BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices. | Dependent item | nvml.device.memory.bar1.total["{#UUID}"] (preprocessing applied) |
[{#UUID}]: BAR1 memory, free | Unallocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices. | Dependent item | nvml.device.memory.bar1.free["{#UUID}"] (preprocessing applied) |
[{#UUID}]: BAR1 memory, used | Allocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices. | Dependent item | nvml.device.memory.bar1.used["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Memory ECC errors, get | Retrieves the GPU device memory error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled. | Zabbix agent | nvml.device.errors.memory["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Memory ECC errors, corrected | Retrieves the count of GPU device memory errors that were corrected. For ECC errors, these are single-bit errors; for texture memory, these are errors fixed by resend. For Nvidia Fermi or newer fully supported devices. | Dependent item | nvml.device.errors.memory.corrected["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Memory ECC errors, uncorrected | Retrieves the count of GPU device memory errors that were not corrected. For ECC errors, these are double-bit errors; for texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices. | Dependent item | nvml.device.errors.memory.uncorrected["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Register file errors, get | Retrieves the GPU register file error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled. | Zabbix agent | nvml.device.errors.register["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Register file errors, corrected | Retrieves the count of GPU register file errors that were corrected. For ECC errors, these are single-bit errors; for texture memory, these are errors fixed by resend. For Nvidia Fermi or newer fully supported devices. | Dependent item | nvml.device.errors.register.corrected["{#UUID}"] (preprocessing applied) |
[{#UUID}]: Register file errors, uncorrected | Retrieves the count of GPU register file errors that were not corrected. For ECC errors, these are double-bit errors; for texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices. | Dependent item | nvml.device.errors.register.uncorrected["{#UUID}"] (preprocessing applied) |
[{#UUID}]: PCIe utilization, get | Retrieves PCIe utilization information. For Nvidia Maxwell or newer fully supported devices. | Zabbix agent | nvml.device.pci.utilization["{#UUID}"] |
[{#UUID}]: PCIe utilization, Rx | The PCIe Rx (receive) throughput over a 20 ms interval on the device. For Nvidia Maxwell or newer fully supported devices. | Dependent item | nvml.device.pci.utilization.rx.rate["{#UUID}"] (preprocessing applied) |
[{#UUID}]: PCIe utilization, Tx | The PCIe Tx (transmit) throughput over a 20 ms interval on the device. For Nvidia Maxwell or newer fully supported devices. | Dependent item | nvml.device.pci.utilization.tx.rate["{#UUID}"] (preprocessing applied) |
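Several prototypes above come in pairs: one "get" item of type Zabbix agent collects a whole group of values, and the Dependent items derive individual values from it. The sketch below illustrates that idea for FB memory. It assumes the master item returns a JSON object; the field names (`total`, `reserved`, `free`, `used`), the sample values, and the helper `split_master_value` are all hypothetical and may not match what the plugin actually returns.

```python
import json

# Hypothetical payload of the "FB memory, get" master item (values in bytes).
sample = (
    '{"total": 4294967296, "reserved": 67108864, '
    '"free": 3221225472, "used": 1006632960}'
)


def split_master_value(raw: str, uuid: str, fields: tuple) -> dict:
    """Parse the master value once and derive one value per dependent item key."""
    data = json.loads(raw)
    return {
        'nvml.device.memory.fb.%s["%s"]' % (field, uuid): data[field]
        for field in fields
    }


# "GPU-aaaa" is a made-up UUID standing in for {#UUID}.
values = split_master_value(
    sample, "GPU-aaaa", ("total", "reserved", "free", "used")
)
```

The point of this layout is that the agent is queried once per group, and every dependent value comes from the same sample, so related numbers stay consistent with each other.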
Trigger prototypes for GPU Discovery
Name | Description | Expression | Severity | Dependencies and additional info |
---|---|---|---|---|
Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold | [{#UUID}]: Encoder utilization is very high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.CRIT} | Average | |
Nvidia: [{#UUID}]: Encoder utilization exceeded warning threshold | [{#UUID}]: Encoder utilization is high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.WARN} | Warning | Depends on: |
Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold | [{#UUID}]: Decoder utilization is very high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.CRIT} | Average | |
Nvidia: [{#UUID}]: Decoder utilization exceeded warning threshold | [{#UUID}]: Decoder utilization is high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.WARN} | Warning | Depends on: |
Nvidia: [{#UUID}]: Fan speed exceeded critical threshold | [{#UUID}]: Fan speed is very high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.CRIT} | Average | |
Nvidia: [{#UUID}]: Fan speed exceeded warning threshold | [{#UUID}]: Fan speed is high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.WARN} | Warning | Depends on: |
Nvidia: [{#UUID}]: Power usage exceeded critical threshold | [{#UUID}]: Power usage is very high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | (min(/Nvidia by Zabbix agent 2/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.CRIT} | Average | |
Nvidia: [{#UUID}]: Power usage exceeded warning threshold | [{#UUID}]: Power usage is high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | (min(/Nvidia by Zabbix agent 2/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.WARN} | Warning | Depends on: |
Nvidia: [{#UUID}]: Power limit has changed | Power limit for the device has changed. Check whether it was intentional. | change(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"]) <> 0 | Info | Manual close: Yes |
Nvidia: [{#UUID}]: Temperature exceeded critical threshold | [{#UUID}]: Temperature is very high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.CRIT} | Average | |
Nvidia: [{#UUID}]: Temperature exceeded warning threshold | [{#UUID}]: Temperature is high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.WARN} | Warning | Depends on: |
Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold | [{#UUID}]: GPU utilization is very high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.CRIT} | Average | |
Nvidia: [{#UUID}]: GPU utilization exceeded warning threshold | [{#UUID}]: GPU utilization is high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.WARN} | Warning | Depends on: |
Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold | [{#UUID}]: Memory utilization is very high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.CRIT} | Average | |
Nvidia: [{#UUID}]: Memory utilization exceeded warning threshold | [{#UUID}]: Memory utilization is high. It may indicate abnormal behavior/activity. Change the corresponding macro in case of a false positive. | min(/Nvidia by Zabbix agent 2/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.WARN} | Warning | Depends on: |
Nvidia: [{#UUID}]: Encoder average latency is high | | last(/Nvidia by Zabbix agent 2/nvml.device.encoder.stats.latency["{#UUID}"]) > (2 * avg(/Nvidia by Zabbix agent 2/nvml.device.encoder.stats.latency["{#UUID}"],3m)) | Warning | |
Nvidia: [{#UUID}]: Total FB memory has changed | Total FB memory has changed. This could indicate memory degradation, hardware configuration changes, or memory reservation by the system or software. | change(/Nvidia by Zabbix agent 2/nvml.device.memory.fb.total["{#UUID}"]) <> 0 | Warning | Manual close: Yes |
Nvidia: [{#UUID}]: Total BAR1 memory has changed | Total BAR1 memory has changed. This could indicate memory degradation, hardware configuration changes, or memory reservation by the system or software. | change(/Nvidia by Zabbix agent 2/nvml.device.memory.bar1.total["{#UUID}"]) <> 0 | Warning | Manual close: Yes |
Nvidia: [{#UUID}]: Number of corrected memory ECC errors has changed | An increasing number of corrected ECC errors can indicate (but does not necessarily mean) aging or degrading memory. | change(/Nvidia by Zabbix agent 2/nvml.device.errors.memory.corrected["{#UUID}"]) <> 0 | Info | Manual close: Yes |
Nvidia: [{#UUID}]: Number of uncorrected memory ECC errors has changed | An increasing number of uncorrected ECC errors can indicate potential issues such as data corruption, system instability, or hardware issues. | change(/Nvidia by Zabbix agent 2/nvml.device.errors.memory.uncorrected["{#UUID}"]) <> 0 | Info | Manual close: Yes |
Nvidia: [{#UUID}]: Number of corrected register file errors has changed | An increasing number of corrected register file errors can indicate (but does not necessarily mean) wearing, aging, or degrading memory. | change(/Nvidia by Zabbix agent 2/nvml.device.errors.register.corrected["{#UUID}"]) <> 0 | Info | Manual close: Yes |
Nvidia: [{#UUID}]: Number of uncorrected register file errors has changed | An increasing number of uncorrected register file errors can indicate potential issues such as data corruption, system instability, or hardware degradation. | change(/Nvidia by Zabbix agent 2/nvml.device.errors.register.uncorrected["{#UUID}"]) <> 0 | Info | Manual close: Yes |
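The threshold expressions in the table above follow three patterns, sketched below in plain Python. This is only an approximation of the trigger logic, assuming each list holds the samples collected within the 3-minute window (oldest first, latest last) and that power usage and power limit are in the same unit; the function names are hypothetical.

```python
from statistics import mean


def min_exceeds(window, threshold):
    """min(item,3m) > threshold: true only if every sample in the window is above."""
    return min(window) > threshold


def power_usage_exceeds(usage_window, power_limit, threshold_pct):
    """(min(usage,3m) * 100 / last(limit)) > threshold, as a percentage of the limit."""
    return min(usage_window) * 100 / power_limit > threshold_pct


def latency_spike(latency_window):
    """last(latency) > 2 * avg(latency,3m); the latest sample is part of the window."""
    return latency_window[-1] > 2 * mean(latency_window)
```

Using `min` over the window means a single short spike does not raise a problem: with a critical threshold of 90, `min_exceeds([95, 92, 85], 90)` is `False`, whereas `min_exceeds([95, 92, 91], 90)` is `True`. The latency trigger works the other way around and reacts to a sudden jump relative to the recent average.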
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help at the Zabbix forums.