This template is for Zabbix version: 7.4

Also available for: 7.2

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia_agent2?at=release/7.4

Nvidia by Zabbix agent 2

Overview

This template is designed for Nvidia GPU monitoring and doesn't require any external scripts. All Nvidia GPUs will be discovered. Set filters with macros if you want to override default filter parameters.

Requirements

Zabbix version: 7.4 and higher.

Tested versions

This template has been tested on:

Nvidia GTX 1650s
Nvidia RTX 2070Ti

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Set up and configure Zabbix agent 2 compiled with the Nvidia monitoring plugin.
Create a host with a Zabbix agent interface and attach the template to it.

Test availability: zabbix_get -s nvidia-host -k nvml.system.driver.version
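
The availability test above can be repeated for every host-level key listed in the Items table. The following is a minimal sketch, assuming `zabbix_get` is on the `PATH` of the machine running it and the agent listens on the default port 10050; the helper names `build_zabbix_get_command` and `check_host` are hypothetical and not part of the template.

```python
import subprocess

# Host-level keys taken from the Items table of this template.
HOST_LEVEL_KEYS = [
    "nvml.system.driver.version",
    "nvml.version",
    "nvml.device.count",
    "nvml.device.get",
]


def build_zabbix_get_command(host, key, port=10050):
    """Build the argument list for one passive check via zabbix_get."""
    return ["zabbix_get", "-s", host, "-p", str(port), "-k", key]


def check_host(host, port=10050, timeout=10):
    """Query every host-level key and return {key: output or error text}."""
    results = {}
    for key in HOST_LEVEL_KEYS:
        cmd = build_zabbix_get_command(host, key, port)
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            # zabbix_get prints the value on stdout; fall back to stderr.
            results[key] = proc.stdout.strip() or proc.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            results[key] = f"error: {exc}"
    return results
```

Calling `check_host("nvidia-host")` returns one entry per key, so a missing plugin or an unreachable agent shows up as an error string instead of a value.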

Macros used

Name	Description	Default
{$NVIDIA.GPU.UTIL.WARN}	Warning threshold for overall GPU utilization, in %.	`80`
{$NVIDIA.GPU.UTIL.CRIT}	Critical threshold for overall GPU utilization, in %.	`90`
{$NVIDIA.ENCODER.UTIL.WARN}	Warning threshold for encoder utilization, in %.	`80`
{$NVIDIA.ENCODER.UTIL.CRIT}	Critical threshold for encoder utilization, in %.	`90`
{$NVIDIA.DECODER.UTIL.WARN}	Warning threshold for decoder utilization, in %.	`80`
{$NVIDIA.DECODER.UTIL.CRIT}	Critical threshold for decoder utilization, in %.	`90`
{$NVIDIA.MEMORY.UTIL.WARN}	Warning threshold for memory utilization, in %.	`80`
{$NVIDIA.MEMORY.UTIL.CRIT}	Critical threshold for memory utilization, in %.	`90`
{$NVIDIA.FAN.SPEED.WARN}	Warning threshold for fan speed, in %.	`80`
{$NVIDIA.FAN.SPEED.CRIT}	Critical threshold for fan speed, in %.	`90`
{$NVIDIA.TEMPERATURE.WARN}	Warning threshold for temperature, in degrees C.	`80`
{$NVIDIA.TEMPERATURE.CRIT}	Critical threshold for temperature, in degrees C.	`90`
{$NVIDIA.POWER.UTIL.WARN}	Warning threshold for power usage, in %.	`80`
{$NVIDIA.POWER.UTIL.CRIT}	Critical threshold for power usage, in %.	`90`
{$NVIDIA.NAME.MATCHES}	Filter to include GPUs by name in discovery.	`.*`
{$NVIDIA.NAME.NOT_MATCHES}	Filter to exclude GPUs by name in discovery.	`CHANGE IF NEEDED`
{$NVIDIA.UUID.MATCHES}	Filter to include GPUs by UUID in discovery.	`.*`
{$NVIDIA.UUID.NOT_MATCHES}	Filter to exclude GPUs by UUID in discovery.	`CHANGE IF NEEDED`
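
The four filter macros above decide which GPUs are discovered. The sketch below illustrates one plausible reading of that filter, assuming all four conditions are combined with AND and that each regular expression is searched anywhere in the value; the function `gpu_is_discovered` and the sample GPU names and UUIDs are hypothetical, and Python's `re` module only approximates the regular expression engine Zabbix uses.

```python
import re

# Default values from the "Macros used" table.
DEFAULT_MACROS = {
    "{$NVIDIA.NAME.MATCHES}": ".*",
    "{$NVIDIA.NAME.NOT_MATCHES}": "CHANGE IF NEEDED",
    "{$NVIDIA.UUID.MATCHES}": ".*",
    "{$NVIDIA.UUID.NOT_MATCHES}": "CHANGE IF NEEDED",
}


def gpu_is_discovered(name, uuid, macros=DEFAULT_MACROS):
    """Return True if a GPU passes all four filter conditions (assumed AND)."""
    return (
        re.search(macros["{$NVIDIA.NAME.MATCHES}"], name) is not None
        and re.search(macros["{$NVIDIA.NAME.NOT_MATCHES}"], name) is None
        and re.search(macros["{$NVIDIA.UUID.MATCHES}"], uuid) is not None
        and re.search(macros["{$NVIDIA.UUID.NOT_MATCHES}"], uuid) is None
    )


# Override one macro, as a host-level macro would, to exclude GTX cards.
custom = dict(DEFAULT_MACROS)
custom["{$NVIDIA.NAME.NOT_MATCHES}"] = "GTX"
```

With the defaults every GPU passes, because `.*` matches any value and the placeholder `CHANGE IF NEEDED` is unlikely to appear in a real GPU name or UUID. With `custom`, a card whose name contains `GTX` is dropped while others are still discovered.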

Items

Name	Description	Type	Key and additional info
Driver version	Retrieves the version of the system's graphics driver. For all Nvidia products.	Zabbix agent	nvml.system.driver.version Preprocessing Discard unchanged with heartbeat: `1d`
NVML library version	Retrieves the version of the NVML library. For all Nvidia products.	Zabbix agent	nvml.version Preprocessing Discard unchanged with heartbeat: `1d`
Number of devices	Retrieves the number of compute devices in the system. A compute device is a single GPU. For all Nvidia products.	Zabbix agent	nvml.device.count Preprocessing Discard unchanged with heartbeat: `1d`
Get devices	Retrieves a list of Nvidia devices in the system.	Zabbix agent	nvml.device.get

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
Nvidia: Driver version has changed	Driver version has changed. Check the Nvidia website for the specific driver version: https://www.nvidia.com/en-us/drivers/	`change(/Nvidia by Zabbix agent 2/nvml.system.driver.version) <> 0`	Info	Manual close: Yes
Nvidia: NVML library has changed	NVML library version has changed. Check the changelog for details: https://docs.nvidia.com/deploy/nvml-api/change-log.html	`change(/Nvidia by Zabbix agent 2/nvml.version) <> 0`	Info	Manual close: Yes
Nvidia: Number of devices has changed	Number of devices has changed. Check if this was intentional.	`change(/Nvidia by Zabbix agent 2/nvml.device.count) <> 0`	Warning	Manual close: Yes

LLD rule GPU Discovery

Name	Description	Type	Key and additional info
GPU Discovery	Nvidia GPU discovery in the system.	Dependent item	nvml.device.discovery Preprocessing Discard unchanged with heartbeat: `1d`

Item prototypes for GPU Discovery

Name	Description	Type	Key and additional info
[{#UUID}]: Serial number	Retrieves the globally unique board serial number associated with this device's board. For all products with an inforom. This number matches the serial number tag that is physically attached to the board.	Zabbix agent	nvml.device.serial["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `The device does not support operation to retrieve serial number.`
[{#UUID}]: Encoder utilization	Retrieves the current utilization for the Encoder. For Nvidia Kepler or newer fully supported devices.	Zabbix agent	nvml.device.encoder.utilization["{#UUID}"]
[{#UUID}]: Decoder utilization	Retrieves the current utilization for the Decoder. For Nvidia Kepler or newer fully supported devices.	Zabbix agent	nvml.device.decoder.utilization["{#UUID}"]
[{#UUID}]: Fan speed	Retrieves the intended operating speed of the specified device fan. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, the output will not match the actual fan speed. For all Nvidia discrete products with dedicated fans. The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed. In certain cases, this value may exceed 100%.	Zabbix agent	nvml.device.fan.speed.avg["{#UUID}"]
[{#UUID}]: Power usage	Retrieves power usage for this GPU (in watts) and its associated circuitry (e.g. memory). For Nvidia Fermi or newer fully supported devices. On Fermi and Kepler GPUs, the reading is accurate to within +/- 5% of current power draw. On Ampere (except GA100) or newer GPUs, the API returns power averaged over a 1 second interval. On GA100 and older architectures, instantaneous power is returned.	Zabbix agent	nvml.device.power.usage["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Power limit	Retrieves the power management limit associated with this device. For Nvidia Fermi or newer fully supported devices. The power limit defines the upper boundary for the card's power draw. If the card's total power draw reaches this limit, the power management algorithm kicks in. This reading is only available if power management mode is supported.	Zabbix agent	nvml.device.power.limit["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Energy consumption	Retrieves the total energy consumption of this GPU in joules since the last driver reload. For Nvidia Volta or newer fully supported devices.	Zabbix agent	nvml.device.energy.consumption["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Temperature	Retrieves the current temperature readings for the device, in degrees C. For all Nvidia products.	Zabbix agent	nvml.device.temperature["{#UUID}"]
[{#UUID}]: Memory frequency	Retrieves the current memory clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.memory.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: SM frequency	Retrieves the current SM clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.sm.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Graphics frequency	Retrieves the current graphics clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.graphics.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Video frequency	Retrieves the current video encoder/decoder clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.video.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Performance state	Retrieves the current performance state for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.performance.state["{#UUID}"]
[{#UUID}]: Device utilization, get	Retrieves the current utilization rates for the device's major subsystems. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.utilization["{#UUID}"]
[{#UUID}]: GPU utilization	Percentage of time over the past sampling period during which one or more kernels were running on the GPU. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.utilization.gpu["{#UUID}"] Preprocessing JSON Path: `$.device`
[{#UUID}]: Memory utilization	Percentage of time over the past sampling period during which global (device) memory was being read or written. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.utilization.memory["{#UUID}"] Preprocessing JSON Path: `$.memory`
[{#UUID}]: Encoder stats	Retrieves the current encoder statistics for a given device. For Nvidia Maxwell or newer fully supported devices.	Zabbix agent	nvml.device.encoder.stats.get["{#UUID}"]
[{#UUID}]: Encoder sessions	Retrieves the current count of active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.sessions["{#UUID}"] Preprocessing JSON Path: `$.session_count`
[{#UUID}]: Encoder average FPS	Retrieves the trailing average FPS of all active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.fps["{#UUID}"] Preprocessing JSON Path: `$.average_fps`
[{#UUID}]: Encoder average latency	Retrieves the current encode latency for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.latency["{#UUID}"] Preprocessing JSON Path: `$.average_latency_ms` Custom multiplier: `0.001`
[{#UUID}]: FB memory, get	Retrieves the amount of used, free, reserved, and total memory available on the device. For all Nvidia products. Enabling ECC reduces the amount of total available memory due to the extra required parity bits. Under WDDM, most of the device memory is allocated and managed on startup by Windows. Under Linux and Windows TCC, the reported amount of used memory is equal to the sum of memory allocated by all active channels on the device.	Zabbix agent	nvml.device.memory.fb.get["{#UUID}"]
[{#UUID}]: FB memory, total	Total physical memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.total["{#UUID}"] Preprocessing JSON Path: `$.total_memory_bytes`
[{#UUID}]: FB memory, reserved	Memory reserved for system use (driver or firmware) on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.reserved["{#UUID}"] Preprocessing JSON Path: `$.reserved_memory_bytes` ⛔️Custom on fail: Set error to: `NVML library too old to support this metric.`
[{#UUID}]: FB memory, free	Unallocated memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.free["{#UUID}"] Preprocessing JSON Path: `$.free_memory_bytes`
[{#UUID}]: FB memory, used	Allocated memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.used["{#UUID}"] Preprocessing JSON Path: `$.used_memory_bytes`
[{#UUID}]: BAR1 memory, get	Gets the total, available, and used size of BAR1 memory. BAR1 is used to map the FB (device memory) so that it can be directly accessed by the CPU or 3rd party devices (peer-to-peer on the PCIe bus). For Nvidia Kepler or newer fully supported devices.	Zabbix agent	nvml.device.memory.bar1.get["{#UUID}"]
[{#UUID}]: BAR1 memory, total	Total BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.total["{#UUID}"] Preprocessing JSON Path: `$.total_memory_bytes`
[{#UUID}]: BAR1 memory, free	Unallocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.free["{#UUID}"] Preprocessing JSON Path: `$.free_memory_bytes`
[{#UUID}]: BAR1 memory, used	Allocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.used["{#UUID}"] Preprocessing JSON Path: `$.used_memory_bytes`
[{#UUID}]: Memory ECC errors, get	Retrieves the GPU device memory error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled.	Zabbix agent	nvml.device.errors.memory["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `No ECC on the device or ECC mode is turned off.`
[{#UUID}]: Memory ECC errors, corrected	Retrieves the count of GPU device memory errors that were corrected. For ECC errors, these are single-bit errors, for Texture memory, these are errors fixed by resend. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.memory.corrected["{#UUID}"] Preprocessing JSON Path: `$.corrected`
[{#UUID}]: Memory ECC errors, uncorrected	Retrieves the count of GPU device memory errors that were not corrected. For ECC errors, these are double-bit errors, for Texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.memory.uncorrected["{#UUID}"] Preprocessing JSON Path: `$.uncorrected`
[{#UUID}]: Register file errors, get	Retrieves the GPU register file error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled.	Zabbix agent	nvml.device.errors.register["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `No ECC on the device or ECC mode is turned off.`
[{#UUID}]: Register file errors, corrected	Retrieves the count of GPU register file errors that were corrected. For ECC errors, these are single-bit errors, for Texture memory, these are errors fixed by resend. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.register.corrected["{#UUID}"] Preprocessing JSON Path: `$.corrected`
[{#UUID}]: Register file errors, uncorrected	Retrieves the count of GPU register file errors that were not corrected. For ECC errors, these are double-bit errors, for Texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.register.uncorrected["{#UUID}"] Preprocessing JSON Path: `$.uncorrected`
[{#UUID}]: PCIe utilization, get	Retrieves PCIe utilization information. For Nvidia Maxwell or newer fully supported devices.	Zabbix agent	nvml.device.pci.utilization["{#UUID}"]
[{#UUID}]: PCIe utilization, Rx	The PCIe Rx (receive) throughput over a 20ms interval on the device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.pci.utilization.rx.rate["{#UUID}"] Preprocessing JSON Path: `$.rx_rate_kb_s` Custom multiplier: `1024`
[{#UUID}]: PCIe utilization, Tx	The PCIe Tx (transmit) throughput over a 20ms interval on the device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.pci.utilization.tx.rate["{#UUID}"] Preprocessing JSON Path: `$.tx_rate_kb_s` Custom multiplier: `1024`
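
Many item prototypes above are dependent items that pull one field out of a master item's JSON and, in some cases, rescale it with a custom multiplier. The sketch below mimics those two preprocessing steps in plain Python. The field names and multipliers come from the table; the sample values and the helper names `json_path_field` and `custom_multiplier` are hypothetical, and the unit comments are assumptions inferred from the multipliers and item descriptions.

```python
import json


def json_path_field(master_value, field):
    """Mimic a simple `$.field` JSON Path step on a master item's JSON text."""
    return json.loads(master_value)[field]


def custom_multiplier(value, multiplier):
    """Mimic the 'Custom multiplier' preprocessing step."""
    return float(value) * multiplier


# Hypothetical value of the "FB memory, get" master item.
fb_memory_json = json.dumps({
    "total_memory_bytes": 4294967296,
    "reserved_memory_bytes": 67108864,
    "free_memory_bytes": 3221225472,
    "used_memory_bytes": 1006632960,
})

# Dependent item "FB memory, total": JSON Path `$.total_memory_bytes`.
fb_total = json_path_field(fb_memory_json, "total_memory_bytes")

# "Power usage": multiplier 0.001 (assumed milliwatts -> watts).
power_watts = custom_multiplier(75500, 0.001)

# "Graphics frequency": multiplier 1000000 (assumed MHz -> Hz).
graphics_hz = custom_multiplier(1500, 1000000)

# "PCIe utilization, Rx": JSON Path `$.rx_rate_kb_s`, then multiplier 1024
# (assumed KB/s -> bytes/s).
pci_json = json.dumps({"rx_rate_kb_s": 2048, "tx_rate_kb_s": 512})
rx_bytes_per_s = custom_multiplier(json_path_field(pci_json, "rx_rate_kb_s"), 1024)
```

Because one master item feeds several dependent items, the agent is polled once per GPU for each group (FB memory, BAR1 memory, PCIe utilization, and so on) rather than once per metric.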

Trigger prototypes for GPU Discovery

Name	Description	Expression	Severity	Dependencies and additional info
Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold	[{#UUID}]: Encoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Encoder utilization exceeded warning threshold	[{#UUID}]: Encoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold	[{#UUID}]: Decoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Decoder utilization exceeded warning threshold	[{#UUID}]: Decoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Fan speed exceeded critical threshold	[{#UUID}]: Fan speed is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.CRIT}`	Average
Nvidia: [{#UUID}]: Fan speed exceeded warning threshold	[{#UUID}]: Fan speed is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Fan speed exceeded critical threshold
Nvidia: [{#UUID}]: Power usage exceeded critical threshold	[{#UUID}]: Power usage is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`(min(/Nvidia by Zabbix agent 2/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Power usage exceeded warning threshold	[{#UUID}]: Power usage is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`(min(/Nvidia by Zabbix agent 2/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Power usage exceeded critical threshold
Nvidia: [{#UUID}]: Power limit has changed	Power limit for the device has changed. Check if this was intentional.	`change(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Temperature exceeded critical threshold	[{#UUID}]: Temperature is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.CRIT}`	Average
Nvidia: [{#UUID}]: Temperature exceeded warning threshold	[{#UUID}]: Temperature is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Temperature exceeded critical threshold
Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold	[{#UUID}]: GPU utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: GPU utilization exceeded warning threshold	[{#UUID}]: GPU utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold
Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold	[{#UUID}]: Memory utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Memory utilization exceeded warning threshold	[{#UUID}]: Memory utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold
Nvidia: [{#UUID}]: Encoder average latency is high		`last(/Nvidia by Zabbix agent 2/nvml.device.encoder.stats.latency["{#UUID}"]) > (2 * avg(/Nvidia by Zabbix agent 2/nvml.device.encoder.stats.latency["{#UUID}"],3m))`	Warning
Nvidia: [{#UUID}]: Total FB memory has changed	Total FB memory has changed. This could mean possible memory degradation, hardware configuration changes, or memory reservation by system or software.	`change(/Nvidia by Zabbix agent 2/nvml.device.memory.fb.total["{#UUID}"]) <> 0`	Warning	Manual close: Yes
Nvidia: [{#UUID}]: Total BAR1 memory has changed	Total BAR1 memory has changed. This could mean possible memory degradation, hardware configuration changes, or memory reservation by system or software.	`change(/Nvidia by Zabbix agent 2/nvml.device.memory.bar1.total["{#UUID}"]) <> 0`	Warning	Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected memory ECC errors has changed	An increasing number of corrected ECC errors can indicate (but does not necessarily mean) aging or degrading of memory.	`change(/Nvidia by Zabbix agent 2/nvml.device.errors.memory.corrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected memory ECC errors has changed	An increasing number of uncorrected ECC errors can indicate potential issues such as: data corruption, system instability, hardware issues.	`change(/Nvidia by Zabbix agent 2/nvml.device.errors.memory.uncorrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected register file errors has changed	An increasing number of corrected register file errors can indicate (but does not necessarily mean) wearing, aging or degrading of memory.	`change(/Nvidia by Zabbix agent 2/nvml.device.errors.register.corrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected register file errors has changed	An increasing number of uncorrected register file errors can indicate potential issues such as: data corruption, system instability, hardware degradation.	`change(/Nvidia by Zabbix agent 2/nvml.device.errors.register.uncorrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
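
The threshold trigger prototypes above share one pattern: `min(item,3m) > macro`, with the warning trigger depending on the critical one. The sketch below reproduces that logic on a plain list of samples, assuming the list holds every value collected in the last 3 minutes. The function names are hypothetical, and the returned strings are the severities shown in the table.

```python
def window_min_exceeds(samples, threshold):
    """min(item,3m) > threshold: true only if every sample is above it."""
    return min(samples) > threshold


def utilization_problem(samples, warn=80, crit=90):
    """Return the severity that would be shown, or None if no trigger fires.

    The warning trigger depends on the critical one, so while the critical
    trigger is in a problem state only "Average" is reported.
    """
    if window_min_exceeds(samples, crit):
        return "Average"
    if window_min_exceeds(samples, warn):
        return "Warning"
    return None


def power_utilization_percent(usage_samples_w, limit_w):
    """min(power.usage,3m) * 100 / last(power.limit), as in the power triggers."""
    return min(usage_samples_w) * 100 / limit_w


def value_changed(previous, latest):
    """change(item) <> 0: the latest value differs from the previous one."""
    return latest != previous
```

Using the minimum over the window means a single short spike does not raise a problem; the value has to stay above the threshold for the whole 3 minutes.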

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at ZABBIX forums

This template is for Zabbix version: 7.2

Also available for: 7.4

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia_agent2?at=release/7.2

Nvidia by Zabbix agent 2

Overview

Requirements

Zabbix version: 7.2 and higher.

Tested versions

This template has been tested on:

Nvidia GTX 1650s
Nvidia RTX 2070Ti

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Set up and configure Zabbix agent 2 compiled with the Nvidia monitoring plugin.
Create a host with a Zabbix agent interface and attach the template to it.

Test availability: zabbix_get -s nvidia-host -k nvml.system.driver.version

Macros used

Name	Description	Default
{$NVIDIA.GPU.UTIL.WARN}	Warning threshold for overall GPU utilization, in %.	`80`
{$NVIDIA.GPU.UTIL.CRIT}	Critical threshold for overall GPU utilization, in %.	`90`
{$NVIDIA.ENCODER.UTIL.WARN}	Warning threshold for encoder utilization, in %.	`80`
{$NVIDIA.ENCODER.UTIL.CRIT}	Critical threshold for encoder utilization, in %.	`90`
{$NVIDIA.DECODER.UTIL.WARN}	Warning threshold for decoder utilization, in %.	`80`
{$NVIDIA.DECODER.UTIL.CRIT}	Critical threshold for decoder utilization, in %.	`90`
{$NVIDIA.MEMORY.UTIL.WARN}	Warning threshold for memory utilization, in %.	`80`
{$NVIDIA.MEMORY.UTIL.CRIT}	Critical threshold for memory utilization, in %.	`90`
{$NVIDIA.FAN.SPEED.WARN}	Warning threshold for fan speed, in %.	`80`
{$NVIDIA.FAN.SPEED.CRIT}	Critical threshold for fan speed, in %.	`90`
{$NVIDIA.TEMPERATURE.WARN}	Warning threshold for temperature, in degrees C.	`80`
{$NVIDIA.TEMPERATURE.CRIT}	Critical threshold for temperature, in degrees C.	`90`
{$NVIDIA.POWER.UTIL.WARN}	Warning threshold for power usage, in %.	`80`
{$NVIDIA.POWER.UTIL.CRIT}	Critical threshold for power usage, in %.	`90`
{$NVIDIA.NAME.MATCHES}	Filter to include GPUs by name in discovery.	`.*`
{$NVIDIA.NAME.NOT_MATCHES}	Filter to exclude GPUs by name in discovery.	`CHANGE IF NEEDED`
{$NVIDIA.UUID.MATCHES}	Filter to include GPUs by UUID in discovery.	`.*`
{$NVIDIA.UUID.NOT_MATCHES}	Filter to exclude GPUs by UUID in discovery.	`CHANGE IF NEEDED`

Items

Name	Description	Type	Key and additional info
Driver version	Retrieves the version of the system's graphics driver. For all Nvidia products.	Zabbix agent	nvml.system.driver.version Preprocessing Discard unchanged with heartbeat: `1d`
NVML library version	Retrieves the version of the NVML library. For all Nvidia products.	Zabbix agent	nvml.version Preprocessing Discard unchanged with heartbeat: `1d`
Number of devices	Retrieves the number of compute devices in the system. A compute device is a single GPU. For all Nvidia products.	Zabbix agent	nvml.device.count Preprocessing Discard unchanged with heartbeat: `1d`
Get devices	Retrieves a list of Nvidia devices in the system.	Zabbix agent	nvml.device.get

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
Nvidia: Driver version has changed	Driver version has changed. Check the Nvidia website for the specific driver version: https://www.nvidia.com/en-us/drivers/	`change(/Nvidia by Zabbix agent 2/nvml.system.driver.version) <> 0`	Info	Manual close: Yes
Nvidia: NVML library has changed	NVML library version has changed. Check the changelog for details: https://docs.nvidia.com/deploy/nvml-api/change-log.html	`change(/Nvidia by Zabbix agent 2/nvml.version) <> 0`	Info	Manual close: Yes
Nvidia: Number of devices has changed	Number of devices has changed. Check if this was intentional.	`change(/Nvidia by Zabbix agent 2/nvml.device.count) <> 0`	Warning	Manual close: Yes

LLD rule GPU Discovery

Name	Description	Type	Key and additional info
GPU Discovery	Nvidia GPU discovery in the system.	Dependent item	nvml.device.discovery Preprocessing Discard unchanged with heartbeat: `1d`

Item prototypes for GPU Discovery

Name	Description	Type	Key and additional info
[{#UUID}]: Serial number	Retrieves the globally unique board serial number associated with this device's board. For all products with an inforom. This number matches the serial number tag that is physically attached to the board.	Zabbix agent	nvml.device.serial["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `The device does not support operation to retrieve serial number.`
[{#UUID}]: Encoder utilization	Retrieves the current utilization for the Encoder. For Nvidia Kepler or newer fully supported devices.	Zabbix agent	nvml.device.encoder.utilization["{#UUID}"]
[{#UUID}]: Decoder utilization	Retrieves the current utilization for the Decoder. For Nvidia Kepler or newer fully supported devices.	Zabbix agent	nvml.device.decoder.utilization["{#UUID}"]
[{#UUID}]: Fan speed	Retrieves the intended operating speed of the specified device fan. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, the output will not match the actual fan speed. For all Nvidia discrete products with dedicated fans. The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed. In certain cases, this value may exceed 100%.	Zabbix agent	nvml.device.fan.speed.avg["{#UUID}"]
[{#UUID}]: Power usage	Retrieves power usage for this GPU (in watts) and its associated circuitry (e.g. memory). For Nvidia Fermi or newer fully supported devices. On Fermi and Kepler GPUs, the reading is accurate to within +/- 5% of current power draw. On Ampere (except GA100) or newer GPUs, the API returns power averaged over a 1 second interval. On GA100 and older architectures, instantaneous power is returned.	Zabbix agent	nvml.device.power.usage["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Power limit	Retrieves the power management limit associated with this device. For Nvidia Fermi or newer fully supported devices. The power limit defines the upper boundary for the card's power draw. If the card's total power draw reaches this limit, the power management algorithm kicks in. This reading is only available if power management mode is supported.	Zabbix agent	nvml.device.power.limit["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Energy consumption	Retrieves the total energy consumption of this GPU in joules since the last driver reload. For Nvidia Volta or newer fully supported devices.	Zabbix agent	nvml.device.energy.consumption["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Temperature	Retrieves the current temperature readings for the device, in degrees C. For all Nvidia products.	Zabbix agent	nvml.device.temperature["{#UUID}"]
[{#UUID}]: Memory frequency	Retrieves the current memory clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.memory.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: SM frequency	Retrieves the current SM clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.sm.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Graphics frequency	Retrieves the current graphics clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.graphics.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Video frequency	Retrieves the current video encoder/decoder clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.video.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Performance state	Retrieves the current performance state for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.performance.state["{#UUID}"]
[{#UUID}]: Device utilization, get	Retrieves the current utilization rates for the device's major subsystems. For Nvidia Fermi or newer fully supported devices.	Zabbix agent	nvml.device.utilization["{#UUID}"]
[{#UUID}]: GPU utilization	Percentage of time over the past sampling period during which one or more kernels were running on the GPU. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.utilization.gpu["{#UUID}"] Preprocessing JSON Path: `$.device`
[{#UUID}]: Memory utilization	Percentage of time over the past sampling period during which global (device) memory was being read or written. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.utilization.memory["{#UUID}"] Preprocessing JSON Path: `$.memory`
[{#UUID}]: Encoder stats	Retrieves the current encoder statistics for a given device. For Nvidia Maxwell or newer fully supported devices.	Zabbix agent	nvml.device.encoder.stats.get["{#UUID}"]
[{#UUID}]: Encoder sessions	Retrieves the current count of active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.sessions["{#UUID}"] Preprocessing JSON Path: `$.session_count`
[{#UUID}]: Encoder average FPS	Retrieves the trailing average FPS of all active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.fps["{#UUID}"] Preprocessing JSON Path: `$.average_fps`
[{#UUID}]: Encoder average latency	Retrieves the current encode latency for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.latency["{#UUID}"] Preprocessing JSON Path: `$.average_latency_ms` Custom multiplier: `0.001`
[{#UUID}]: FB memory, get	Retrieves the amount of used, free, reserved, and total memory available on the device. For all Nvidia products. Enabling ECC reduces the amount of total available memory due to the extra required parity bits. Under WDDM, most of the device memory is allocated and managed on startup by Windows. Under Linux and Windows TCC, the reported amount of used memory is equal to the sum of memory allocated by all active channels on the device.	Zabbix agent	nvml.device.memory.fb.get["{#UUID}"]
[{#UUID}]: FB memory, total	Total physical memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.total["{#UUID}"] Preprocessing JSON Path: `$.total_memory_bytes`
[{#UUID}]: FB memory, reserved	Memory reserved for system use (driver or firmware) on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.reserved["{#UUID}"] Preprocessing JSON Path: `$.reserved_memory_bytes` ⛔️Custom on fail: Set error to: `NVML library too old to support this metric.`
[{#UUID}]: FB memory, free	Unallocated memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.free["{#UUID}"] Preprocessing JSON Path: `$.free_memory_bytes`
[{#UUID}]: FB memory, used	Allocated memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.used["{#UUID}"] Preprocessing JSON Path: `$.used_memory_bytes`
[{#UUID}]: BAR1 memory, get	Gets the total, available, and used size of BAR1 memory. BAR1 is used to map the FB (device memory) so that it can be directly accessed by the CPU or 3rd party devices (peer-to-peer on the PCIe bus). For Nvidia Kepler or newer fully supported devices.	Zabbix agent	nvml.device.memory.bar1.get["{#UUID}"]
[{#UUID}]: BAR1 memory, total	Total BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.total["{#UUID}"] Preprocessing JSON Path: `$.total_memory_bytes`
[{#UUID}]: BAR1 memory, free	Unallocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.free["{#UUID}"] Preprocessing JSON Path: `$.free_memory_bytes`
[{#UUID}]: BAR1 memory, used	Allocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.used["{#UUID}"] Preprocessing JSON Path: `$.used_memory_bytes`
[{#UUID}]: Memory ECC errors, get	Retrieves the GPU device memory error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled.	Zabbix agent	nvml.device.errors.memory["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `No ECC on the device or ECC mode is turned off.`
[{#UUID}]: Memory ECC errors, corrected	Retrieves the count of GPU device memory errors that were corrected. For ECC errors, these are single-bit errors; for texture memory, these are errors fixed by a resend. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.memory.corrected["{#UUID}"] Preprocessing JSON Path: `$.corrected`
[{#UUID}]: Memory ECC errors, uncorrected	Retrieves the count of GPU device memory errors that were not corrected. For ECC errors, these are double-bit errors; for texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.memory.uncorrected["{#UUID}"] Preprocessing JSON Path: `$.uncorrected`
[{#UUID}]: Register file errors, get	Retrieves the GPU register file error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled.	Zabbix agent	nvml.device.errors.register["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `No ECC on the device or ECC mode is turned off.`
[{#UUID}]: Register file errors, corrected	Retrieves the count of GPU register file errors that were corrected. For ECC errors, these are single-bit errors; for texture memory, these are errors fixed by a resend. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.register.corrected["{#UUID}"] Preprocessing JSON Path: `$.corrected`
[{#UUID}]: Register file errors, uncorrected	Retrieves the count of GPU register file errors that were not corrected. For ECC errors, these are double-bit errors; for texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.register.uncorrected["{#UUID}"] Preprocessing JSON Path: `$.uncorrected`
[{#UUID}]: PCIe utilization, get	Retrieves PCIe utilization information. For Nvidia Maxwell or newer fully supported devices.	Zabbix agent	nvml.device.pci.utilization["{#UUID}"]
[{#UUID}]: PCIe utilization, Rx	The PCIe Rx (receive) throughput over a 20ms interval on the device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.pci.utilization.rx.rate["{#UUID}"] Preprocessing JSON Path: `$.rx_rate_kb_s` Custom multiplier: `1024`
[{#UUID}]: PCIe utilization, Tx	The PCIe Tx (transmit) throughput over a 20ms interval on the device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.pci.utilization.tx.rate["{#UUID}"] Preprocessing JSON Path: `$.tx_rate_kb_s` Custom multiplier: `1024`
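
The dependent items above take the JSON returned by a master item, apply a JSON Path step, and optionally a custom multiplier. A minimal Python sketch of that pipeline for the two PCIe items follows. The field names and the multiplier come from the table; the sample payload and the helper name `dependent_value` are hypothetical and only illustrate the idea, not the agent's actual implementation.

```python
import json

# Hypothetical payload of nvml.device.pci.utilization; the values are invented.
SAMPLE = '{"rx_rate_kb_s": 2048, "tx_rate_kb_s": 512}'


def dependent_value(master_json, field, multiplier=1):
    """Mimic 'JSON Path: $.<field>' followed by 'Custom multiplier'."""
    return json.loads(master_json)[field] * multiplier


# The table applies a multiplier of 1024 to both Rx and Tx rates.
rx_bytes_per_s = dependent_value(SAMPLE, "rx_rate_kb_s", 1024)
tx_bytes_per_s = dependent_value(SAMPLE, "tx_rate_kb_s", 1024)
print(rx_bytes_per_s, tx_bytes_per_s)
```

Polling one master item and deriving several values from it keeps the number of agent requests per GPU low.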

Trigger prototypes for GPU Discovery

Name	Description	Expression	Severity	Dependencies and additional info
Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold	[{#UUID}]: Encoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Encoder utilization exceeded warning threshold	[{#UUID}]: Encoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold	[{#UUID}]: Decoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Decoder utilization exceeded warning threshold	[{#UUID}]: Decoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Fan speed exceeded critical threshold	[{#UUID}]: Fan speed is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.CRIT}`	Average
Nvidia: [{#UUID}]: Fan speed exceeded warning threshold	[{#UUID}]: Fan speed is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Fan speed exceeded critical threshold
Nvidia: [{#UUID}]: Power usage exceeded critical threshold	[{#UUID}]: Power usage is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`(min(/Nvidia by Zabbix agent 2/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Power usage exceeded warning threshold	[{#UUID}]: Power usage is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`(min(/Nvidia by Zabbix agent 2/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Power usage exceeded critical threshold
Nvidia: [{#UUID}]: Power limit has changed	Power limit for the device has changed. Check if this was intentional.	`change(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Temperature exceeded critical threshold	[{#UUID}]: Temperature is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.CRIT}`	Average
Nvidia: [{#UUID}]: Temperature exceeded warning threshold	[{#UUID}]: Temperature is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Temperature exceeded critical threshold
Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold	[{#UUID}]: GPU utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: GPU utilization exceeded warning threshold	[{#UUID}]: GPU utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold
Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold	[{#UUID}]: Memory utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Memory utilization exceeded warning threshold	[{#UUID}]: Memory utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold
Nvidia: [{#UUID}]: Encoder average latency is high		`last(/Nvidia by Zabbix agent 2/nvml.device.encoder.stats.latency["{#UUID}"]) > (2 * avg(/Nvidia by Zabbix agent 2/nvml.device.encoder.stats.latency["{#UUID}"],3m))`	Warning
Nvidia: [{#UUID}]: Total FB memory has changed	Total FB memory has changed. This could mean possible memory degradation, hardware configuration changes, or memory reservation by system or software.	`change(/Nvidia by Zabbix agent 2/nvml.device.memory.fb.total["{#UUID}"]) <> 0`	Warning	Manual close: Yes
Nvidia: [{#UUID}]: Total BAR1 memory has changed	Total BAR1 memory has changed. This could mean possible memory degradation, hardware configuration changes, or memory reservation by system or software.	`change(/Nvidia by Zabbix agent 2/nvml.device.memory.bar1.total["{#UUID}"]) <> 0`	Warning	Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected memory ECC errors has changed	An increasing number of corrected ECC errors can indicate (but does not necessarily mean) aging or degradation of memory.	`change(/Nvidia by Zabbix agent 2/nvml.device.errors.memory.corrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected memory ECC errors has changed	An increasing number of uncorrected ECC errors can indicate potential issues such as: data corruption, system instability, hardware issues.	`change(/Nvidia by Zabbix agent 2/nvml.device.errors.memory.uncorrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected register file errors has changed	An increasing number of corrected register file errors can indicate (but does not necessarily mean) wear, aging, or degradation of memory.	`change(/Nvidia by Zabbix agent 2/nvml.device.errors.register.corrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected register file errors has changed	An increasing number of uncorrected register file errors can indicate potential issues such as: data corruption, system instability, hardware degradation.	`change(/Nvidia by Zabbix agent 2/nvml.device.errors.register.uncorrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
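
The threshold triggers above use `min(...,3m) > threshold`, so a problem is raised only when every sample in the last three minutes is above the threshold, and each warning trigger depends on its critical counterpart. A rough Python sketch of that logic, assuming the default thresholds of 80 and 90; the function names and sample values are hypothetical, and the sketch simplifies how Zabbix handles dependencies:

```python
def min_exceeds(samples, threshold):
    """min(/host/key,3m) > threshold: the lowest sample must be above it."""
    return min(samples) > threshold


def power_util_percent(usage_samples_w, limit_w):
    """min(power.usage,3m) * 100 / last(power.limit), as in the power triggers."""
    return min(usage_samples_w) * 100 / limit_w


def visible_problems(samples, warn=80, crit=90):
    """The warning trigger depends on the critical one, so only one is shown."""
    if min_exceeds(samples, crit):
        return ["critical"]
    if min_exceeds(samples, warn):
        return ["warning"]
    return []


print(visible_problems([85, 88, 83]))   # sustained above warn only
print(visible_problems([95, 70, 97]))   # one dip below warn resets the window
print(power_util_percent([180.0, 200.0], 200.0))
```

Using `min` rather than `last` means a single short spike does not raise a problem.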

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.

This template is for Zabbix version: 7.4

Also available for: 7.2

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia_agent2_active?at=release/7.4

Nvidia by Zabbix agent 2 active

Overview

Requirements

Zabbix version: 7.4 and higher.

Tested versions

This template has been tested on:

Nvidia GTX 1650s
Nvidia RTX 2070Ti

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Set up and configure Zabbix agent 2 compiled with the Nvidia monitoring plugin.
Create a host and attach the template to it.

Test availability: zabbix_get -s nvidia-host -k nvml.system.driver.version

Macros used

Name	Description	Default
{$NVIDIA.GPU.UTIL.WARN}	Warning threshold for overall GPU utilization, in %.	`80`
{$NVIDIA.GPU.UTIL.CRIT}	Critical threshold for overall GPU utilization, in %.	`90`
{$NVIDIA.ENCODER.UTIL.WARN}	Warning threshold for encoder utilization, in %.	`80`
{$NVIDIA.ENCODER.UTIL.CRIT}	Critical threshold for encoder utilization, in %.	`90`
{$NVIDIA.DECODER.UTIL.WARN}	Warning threshold for decoder utilization, in %.	`80`
{$NVIDIA.DECODER.UTIL.CRIT}	Critical threshold for decoder utilization, in %.	`90`
{$NVIDIA.MEMORY.UTIL.WARN}	Warning threshold for memory utilization, in %.	`80`
{$NVIDIA.MEMORY.UTIL.CRIT}	Critical threshold for memory utilization, in %.	`90`
{$NVIDIA.FAN.SPEED.WARN}	Warning threshold for fan speed, in %.	`80`
{$NVIDIA.FAN.SPEED.CRIT}	Critical threshold for fan speed, in %.	`90`
{$NVIDIA.TEMPERATURE.WARN}	Warning threshold for temperature, in degrees C.	`80`
{$NVIDIA.TEMPERATURE.CRIT}	Critical threshold for temperature, in degrees C.	`90`
{$NVIDIA.POWER.UTIL.WARN}	Warning threshold for power usage, in %.	`80`
{$NVIDIA.POWER.UTIL.CRIT}	Critical threshold for power usage, in %.	`90`
{$NVIDIA.NAME.MATCHES}	Filter to include GPUs by name in discovery.	`.*`
{$NVIDIA.NAME.NOT_MATCHES}	Filter to exclude GPUs by name in discovery.	`CHANGE IF NEEDED`
{$NVIDIA.UUID.MATCHES}	Filter to include GPUs by UUID in discovery.	`.*`
{$NVIDIA.UUID.NOT_MATCHES}	Filter to exclude GPUs by UUID in discovery.	`CHANGE IF NEEDED`
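
The four filter macros above decide which GPUs are discovered. A minimal Python sketch of how such a filter behaves, assuming all four conditions are combined with AND and using Python's `re` module as a stand-in for the regular expression engine Zabbix uses; the GPU name, the UUID, and the function name `is_discovered` are hypothetical:

```python
import re

# Default macro values from the table above.
DEFAULTS = {
    "NAME.MATCHES": ".*",
    "NAME.NOT_MATCHES": "CHANGE IF NEEDED",
    "UUID.MATCHES": ".*",
    "UUID.NOT_MATCHES": "CHANGE IF NEEDED",
}


def is_discovered(name, uuid, macros=DEFAULTS):
    """Return True if the GPU passes every include and exclude filter."""
    return (
        re.search(macros["NAME.MATCHES"], name) is not None
        and re.search(macros["NAME.NOT_MATCHES"], name) is None
        and re.search(macros["UUID.MATCHES"], uuid) is not None
        and re.search(macros["UUID.NOT_MATCHES"], uuid) is None
    )


# Hypothetical device; with the defaults every GPU is discovered.
print(is_discovered("NVIDIA GeForce GTX 1650 SUPER", "GPU-aaaa"))

# Overriding the exclude macro drops any GPU whose name contains "GTX".
custom = {**DEFAULTS, "NAME.NOT_MATCHES": "GTX"}
print(is_discovered("NVIDIA GeForce GTX 1650 SUPER", "GPU-aaaa", custom))
```

The placeholder `CHANGE IF NEEDED` is unlikely to appear in a real GPU name or UUID, so the default exclude filters reject nothing.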

Items

Name	Description	Type	Key and additional info
Driver version	Retrieves the version of the system's graphics driver. For all Nvidia products.	Zabbix agent (active)	nvml.system.driver.version Preprocessing Discard unchanged with heartbeat: `1d`
NVML library version	Retrieves the version of the NVML library. For all Nvidia products.	Zabbix agent (active)	nvml.version Preprocessing Discard unchanged with heartbeat: `1d`
Number of devices	Retrieves the number of compute devices in the system. A compute device is a single GPU. For all Nvidia products.	Zabbix agent (active)	nvml.device.count Preprocessing Discard unchanged with heartbeat: `1d`
Get devices	Retrieves a list of Nvidia devices in the system.	Zabbix agent (active)	nvml.device.get

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
Nvidia: Driver version has changed	Driver version has changed. Check the Nvidia website for the specific driver version: https://www.nvidia.com/en-us/drivers/	`change(/Nvidia by Zabbix agent 2 active/nvml.system.driver.version) <> 0`	Info	Manual close: Yes
Nvidia: NVML library has changed	NVML library version has changed. Check the changelog for details: https://docs.nvidia.com/deploy/nvml-api/change-log.html	`change(/Nvidia by Zabbix agent 2 active/nvml.version) <> 0`	Info	Manual close: Yes
Nvidia: Number of devices has changed	Number of devices has changed. Check if this was intentional.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.count) <> 0`	Warning	Manual close: Yes
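
The three triggers above all use `change(...) <> 0`. A small Python sketch of that check, assuming the documented behavior of `change()` (the difference between the last and previous value for numeric items, and 1 or 0 for text items depending on whether the values differ); the version strings are invented:

```python
def change(previous, last):
    """Numeric items: last - previous. Text items: 1 if the values differ, else 0."""
    if isinstance(previous, str) or isinstance(last, str):
        return int(previous != last)
    return last - previous


def fires(previous, last):
    """Equivalent of the trigger expression change(...) <> 0."""
    return change(previous, last) != 0


print(fires("550.54.14", "550.54.14"))  # same driver version
print(fires("550.54.14", "555.42.02"))  # driver version changed
print(fires(2, 1))                      # a device disappeared
```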

LLD rule GPU Discovery

Name	Description	Type	Key and additional info
GPU Discovery	Nvidia GPU discovery in the system.	Dependent item	nvml.device.discovery Preprocessing Discard unchanged with heartbeat: `1d`

Item prototypes for GPU Discovery

Name	Description	Type	Key and additional info
[{#UUID}]: Serial number	Retrieves the globally unique board serial number associated with this device's board. For all products with an inforom. This number matches the serial number tag that is physically attached to the board.	Zabbix agent (active)	nvml.device.serial["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `The device does not support operation to retrieve serial number.`
[{#UUID}]: Encoder utilization	Retrieves the current utilization for the Encoder. For Nvidia Kepler or newer fully supported devices.	Zabbix agent (active)	nvml.device.encoder.utilization["{#UUID}"]
[{#UUID}]: Decoder utilization	Retrieves the current utilization for the Decoder. For Nvidia Kepler or newer fully supported devices.	Zabbix agent (active)	nvml.device.decoder.utilization["{#UUID}"]
[{#UUID}]: Fan speed	Retrieves the intended operating speed of the specified device fan. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, the output will not match the actual fan speed. For all Nvidia discrete products with dedicated fans. The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed. In certain cases, this value may exceed 100%.	Zabbix agent (active)	nvml.device.fan.speed.avg["{#UUID}"]
[{#UUID}]: Power usage	Retrieves power usage for this GPU (in watts) and its associated circuitry (e.g. memory). For Nvidia Fermi or newer fully supported devices. On Fermi and Kepler GPUs, the reading is accurate to within +/- 5% of current power draw. On Ampere (except GA100) or newer GPUs, the API returns power averaged over a 1 second interval. On GA100 and older architectures, instantaneous power is returned.	Zabbix agent (active)	nvml.device.power.usage["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Power limit	Retrieves the power management limit associated with this device. For Nvidia Fermi or newer fully supported devices. The power limit defines the upper boundary for the card's power draw. If the card's total power draw reaches this limit, the power management algorithm kicks in. This reading is only available if power management mode is supported.	Zabbix agent (active)	nvml.device.power.limit["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Energy consumption	Retrieves the total energy consumption of this GPU in joules since the last driver reload. For Nvidia Volta or newer fully supported devices.	Zabbix agent (active)	nvml.device.energy.consumption["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Temperature	Retrieves the current temperature readings for the device, in degrees C. For all Nvidia products.	Zabbix agent (active)	nvml.device.temperature["{#UUID}"]
[{#UUID}]: Memory frequency	Retrieves the current memory clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.memory.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: SM frequency	Retrieves the current SM clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.sm.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Graphics frequency	Retrieves the current graphics clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.graphics.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Video frequency	Retrieves the current video encoder/decoder clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.video.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Performance state	Retrieves the current performance state for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.performance.state["{#UUID}"]
[{#UUID}]: Device utilization, get	Retrieves the current utilization rates for the device's major subsystems. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.utilization["{#UUID}"]
[{#UUID}]: GPU utilization	Percentage of time over the past sampling period during which one or more kernels were running on the GPU. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.utilization.gpu["{#UUID}"] Preprocessing JSON Path: `$.device`
[{#UUID}]: Memory utilization	Percentage of time over the past sampling period during which global (device) memory was being read or written. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.utilization.memory["{#UUID}"] Preprocessing JSON Path: `$.memory`
[{#UUID}]: Encoder stats	Retrieves the current encoder statistics for a given device. For Nvidia Maxwell or newer fully supported devices.	Zabbix agent (active)	nvml.device.encoder.stats.get["{#UUID}"]
[{#UUID}]: Encoder sessions	Retrieves the current count of active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.sessions["{#UUID}"] Preprocessing JSON Path: `$.session_count`
[{#UUID}]: Encoder average FPS	Retrieves the trailing average FPS of all active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.fps["{#UUID}"] Preprocessing JSON Path: `$.average_fps`
[{#UUID}]: Encoder average latency	Retrieves the current encode latency for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.latency["{#UUID}"] Preprocessing JSON Path: `$.average_latency_ms` Custom multiplier: `0.001`
[{#UUID}]: FB memory, get	Retrieves the amount of used, free, reserved, and total memory available on the device. For all Nvidia products. Enabling ECC reduces the amount of total available memory due to the extra required parity bits. Under WDDM, most of the device memory is allocated and managed on startup by Windows. Under Linux and Windows TCC, the reported amount of used memory is equal to the sum of memory allocated by all active channels on the device.	Zabbix agent (active)	nvml.device.memory.fb.get["{#UUID}"]
[{#UUID}]: FB memory, total	Total physical memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.total["{#UUID}"] Preprocessing JSON Path: `$.total_memory_bytes`
[{#UUID}]: FB memory, reserved	Memory reserved for system use (driver or firmware) on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.reserved["{#UUID}"] Preprocessing JSON Path: `$.reserved_memory_bytes` ⛔️Custom on fail: Set error to: `NVML library too old to support this metric.`
[{#UUID}]: FB memory, free	Unallocated memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.free["{#UUID}"] Preprocessing JSON Path: `$.free_memory_bytes`
[{#UUID}]: FB memory, used	Allocated memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.used["{#UUID}"] Preprocessing JSON Path: `$.used_memory_bytes`
[{#UUID}]: BAR1 memory, get	Gets the total, available, and used size of BAR1 memory. BAR1 is used to map the FB (device memory) so that it can be directly accessed by the CPU or 3rd party devices (peer-to-peer on the PCIe bus). For Nvidia Kepler or newer fully supported devices.	Zabbix agent (active)	nvml.device.memory.bar1.get["{#UUID}"]
[{#UUID}]: BAR1 memory, total	Total BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.total["{#UUID}"] Preprocessing JSON Path: `$.total_memory_bytes`
[{#UUID}]: BAR1 memory, free	Unallocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.free["{#UUID}"] Preprocessing JSON Path: `$.free_memory_bytes`
[{#UUID}]: BAR1 memory, used	Allocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.used["{#UUID}"] Preprocessing JSON Path: `$.used_memory_bytes`
[{#UUID}]: Memory ECC errors, get	Retrieves the GPU device memory error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled.	Zabbix agent (active)	nvml.device.errors.memory["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `No ECC on the device or ECC mode is turned off.`
[{#UUID}]: Memory ECC errors, corrected	Retrieves the count of GPU device memory errors that were corrected. For ECC errors, these are single-bit errors; for texture memory, these are errors fixed by a resend. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.memory.corrected["{#UUID}"] Preprocessing JSON Path: `$.corrected`
[{#UUID}]: Memory ECC errors, uncorrected	Retrieves the count of GPU device memory errors that were not corrected. For ECC errors, these are double-bit errors; for texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.memory.uncorrected["{#UUID}"] Preprocessing JSON Path: `$.uncorrected`
[{#UUID}]: Register file errors, get	Retrieves the GPU register file error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled.	Zabbix agent (active)	nvml.device.errors.register["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `No ECC on the device or ECC mode is turned off.`
[{#UUID}]: Register file errors, corrected	Retrieves the count of GPU register file errors that were corrected. For ECC errors, these are single-bit errors; for texture memory, these are errors fixed by a resend. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.register.corrected["{#UUID}"] Preprocessing JSON Path: `$.corrected`
[{#UUID}]: Register file errors, uncorrected	Retrieves the count of GPU register file errors that were not corrected. For ECC errors, these are double-bit errors; for texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.register.uncorrected["{#UUID}"] Preprocessing JSON Path: `$.uncorrected`
[{#UUID}]: PCIe utilization, get	Retrieves PCIe utilization information. For Nvidia Maxwell or newer fully supported devices.	Zabbix agent (active)	nvml.device.pci.utilization["{#UUID}"]
[{#UUID}]: PCIe utilization, Rx	The PCIe Rx (receive) throughput over a 20ms interval on the device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.pci.utilization.rx.rate["{#UUID}"] Preprocessing JSON Path: `$.rx_rate_kb_s` Custom multiplier: `1024`
[{#UUID}]: PCIe utilization, Tx	The PCIe Tx (transmit) throughput over a 20ms interval on the device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.pci.utilization.tx.rate["{#UUID}"] Preprocessing JSON Path: `$.tx_rate_kb_s` Custom multiplier: `1024`
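
Several directly polled items above carry a custom multiplier that converts the raw reading into base units. A short Python sketch of those conversions, assuming the plugin returns the raw NVML units (milliwatts, millijoules, and MHz); the mapping and the helper name `stored_value` are illustrative only, and the raw readings are invented:

```python
# Custom multipliers listed in the item prototype table, keyed by item key.
MULTIPLIERS = {
    "nvml.device.power.usage": 0.001,             # assumed mW -> W
    "nvml.device.power.limit": 0.001,             # assumed mW -> W
    "nvml.device.energy.consumption": 0.001,      # assumed mJ -> J
    "nvml.device.memory.frequency": 1_000_000,    # assumed MHz -> Hz
    "nvml.device.sm.frequency": 1_000_000,
    "nvml.device.graphics.frequency": 1_000_000,
    "nvml.device.video.frequency": 1_000_000,
}


def stored_value(key, raw):
    """Apply the item's custom multiplier; items without one are stored as-is."""
    return raw * MULTIPLIERS.get(key, 1)


print(stored_value("nvml.device.sm.frequency", 1500))    # Hz
print(stored_value("nvml.device.power.usage", 75000))    # W, floating point
print(stored_value("nvml.device.temperature", 61))       # unchanged
```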

Trigger prototypes for GPU Discovery

Name	Description	Expression	Severity	Dependencies and additional info
Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold	[{#UUID}]: Encoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Encoder utilization exceeded warning threshold	[{#UUID}]: Encoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold	[{#UUID}]: Decoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Decoder utilization exceeded warning threshold	[{#UUID}]: Decoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Fan speed exceeded critical threshold	[{#UUID}]: Fan speed is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.CRIT}`	Average
Nvidia: [{#UUID}]: Fan speed exceeded warning threshold	[{#UUID}]: Fan speed is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Fan speed exceeded critical threshold
Nvidia: [{#UUID}]: Power usage exceeded critical threshold	[{#UUID}]: Power usage is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`(min(/Nvidia by Zabbix agent 2 active/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2 active/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Power usage exceeded warning threshold	[{#UUID}]: Power usage is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`(min(/Nvidia by Zabbix agent 2 active/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2 active/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Power usage exceeded critical threshold
Nvidia: [{#UUID}]: Power limit has changed	Power limit for the device has changed. Check if this was intentional.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.power.limit["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Temperature exceeded critical threshold	[{#UUID}]: Temperature is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.CRIT}`	Average
Nvidia: [{#UUID}]: Temperature exceeded warning threshold	[{#UUID}]: Temperature is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Temperature exceeded critical threshold
Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold	[{#UUID}]: GPU utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: GPU utilization exceeded warning threshold	[{#UUID}]: GPU utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold
Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold	[{#UUID}]: Memory utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Memory utilization exceeded warning threshold	[{#UUID}]: Memory utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold
Nvidia: [{#UUID}]: Encoder average latency is high		`last(/Nvidia by Zabbix agent 2 active/nvml.device.encoder.stats.latency["{#UUID}"]) > (2 * avg(/Nvidia by Zabbix agent 2 active/nvml.device.encoder.stats.latency["{#UUID}"],3m))`	Warning
Nvidia: [{#UUID}]: Total FB memory has changed	Total FB memory has changed. This could mean possible memory degradation, hardware configuration changes, or memory reservation by system or software.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.memory.fb.total["{#UUID}"]) <> 0`	Warning	Manual close: Yes
Nvidia: [{#UUID}]: Total BAR1 memory has changed	Total BAR1 memory has changed. This could mean possible memory degradation, hardware configuration changes, or memory reservation by system or software.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.memory.bar1.total["{#UUID}"]) <> 0`	Warning	Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected memory ECC errors has changed	An increasing number of corrected ECC errors can indicate (but does not necessarily mean) aging or degradation of memory.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.errors.memory.corrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected memory ECC errors has changed	An increasing number of uncorrected ECC errors can indicate potential issues such as: data corruption, system instability, hardware issues.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.errors.memory.uncorrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected register file errors has changed	An increasing number of corrected register file errors can indicate (but does not necessarily mean) wear, aging, or degradation of memory.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.errors.register.corrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected register file errors has changed	An increasing number of uncorrected register file errors can indicate potential issues such as: data corruption, system instability, hardware degradation.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.errors.register.uncorrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
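
The power usage triggers in the table above differ from the other threshold triggers: they convert the lowest power reading of the last 3 minutes into a percentage of the current power limit before comparing it with the macro. A minimal sketch of that arithmetic, assuming readings already converted to watts; the function names and sample readings are hypothetical:

```python
def power_usage_percent(usage_samples_w, power_limit_w):
    """Mirror min(power.usage,3m) * 100 / last(power.limit)."""
    return min(usage_samples_w) * 100 / power_limit_w


def power_trigger_fires(usage_samples_w, power_limit_w, threshold_percent):
    """True when the lowest reading in the window exceeds the threshold."""
    return power_usage_percent(usage_samples_w, power_limit_w) > threshold_percent


if __name__ == "__main__":
    # Hypothetical readings collected during a 3-minute window, in watts.
    sustained = [118.0, 121.5, 119.2]
    with_dip = [118.0, 80.0, 119.2]
    limit_w = 125.0

    # 90 is the default of {$NVIDIA.POWER.UTIL.CRIT}.
    print(power_trigger_fires(sustained, limit_w, 90))
    print(power_trigger_fires(with_dip, limit_w, 90))
```

Because the expression uses `min`, a single low reading inside the window keeps the trigger from firing, so only sustained load raises a problem.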

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at ZABBIX forums

This template is for Zabbix version: 7.2

Also available for: 7.4

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia_agent2_active?at=release/7.2

Nvidia by Zabbix agent 2 active

Overview

Requirements

Zabbix version: 7.2 and higher.

Tested versions

This template has been tested on:

Nvidia GTX 1650s
Nvidia RTX 2070Ti

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Set up and configure Zabbix agent 2 compiled with the Nvidia monitoring plugin.
Create a host and attach the template to it.

Test availability: zabbix_get -s nvidia-host -k nvml.system.driver.version
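
The availability test above can also be scripted. A minimal sketch, assuming `zabbix_get` is installed and on the PATH and that `nvidia-host` resolves to the monitored host. The helper names `build_zabbix_get_command` and `check_driver_version` are hypothetical, and only the `-s` and `-k` options shown above are used:

```python
import subprocess


def build_zabbix_get_command(host: str, key: str) -> list[str]:
    """Build the argv for the availability test shown in the Setup section."""
    return ["zabbix_get", "-s", host, "-k", key]


def check_driver_version(host: str) -> str:
    """Run zabbix_get and return the driver version reported by the agent.

    Raises subprocess.CalledProcessError if zabbix_get exits with a
    non-zero status, and subprocess.TimeoutExpired after 10 seconds.
    """
    cmd = build_zabbix_get_command(host, "nvml.system.driver.version")
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=10
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Only prints the command; call check_driver_version() on a real setup.
    print(build_zabbix_get_command("nvidia-host", "nvml.system.driver.version"))
```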

Macros used

Name	Description	Default
{$NVIDIA.GPU.UTIL.WARN}	Warning threshold for overall GPU utilization, in %.	`80`
{$NVIDIA.GPU.UTIL.CRIT}	Critical threshold for overall GPU utilization, in %.	`90`
{$NVIDIA.ENCODER.UTIL.WARN}	Warning threshold for encoder utilization, in %.	`80`
{$NVIDIA.ENCODER.UTIL.CRIT}	Critical threshold for encoder utilization, in %.	`90`
{$NVIDIA.DECODER.UTIL.WARN}	Warning threshold for decoder utilization, in %.	`80`
{$NVIDIA.DECODER.UTIL.CRIT}	Critical threshold for decoder utilization, in %.	`90`
{$NVIDIA.MEMORY.UTIL.WARN}	Warning threshold for memory utilization, in %.	`80`
{$NVIDIA.MEMORY.UTIL.CRIT}	Critical threshold for memory utilization, in %.	`90`
{$NVIDIA.FAN.SPEED.WARN}	Warning threshold for fan speed, in %.	`80`
{$NVIDIA.FAN.SPEED.CRIT}	Critical threshold for fan speed, in %.	`90`
{$NVIDIA.TEMPERATURE.WARN}	Warning threshold for temperature, in degrees C.	`80`
{$NVIDIA.TEMPERATURE.CRIT}	Critical threshold for temperature, in degrees C.	`90`
{$NVIDIA.POWER.UTIL.WARN}	Warning threshold for power usage, in %.	`80`
{$NVIDIA.POWER.UTIL.CRIT}	Critical threshold for power usage, in %.	`90`
{$NVIDIA.NAME.MATCHES}	Filter to include GPUs by name in discovery.	`.*`
{$NVIDIA.NAME.NOT_MATCHES}	Filter to exclude GPUs by name in discovery.	`CHANGE IF NEEDED`
{$NVIDIA.UUID.MATCHES}	Filter to include GPUs by UUID in discovery.	`.*`
{$NVIDIA.UUID.NOT_MATCHES}	Filter to exclude GPUs by UUID in discovery.	`CHANGE IF NEEDED`
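
The four filter macros at the end of the table are regular expressions applied during GPU discovery. A rough sketch of how they might combine, assuming a device is kept only when both `MATCHES` patterns match and neither `NOT_MATCHES` pattern matches. The combination logic, the function name `gpu_is_discovered`, and the sample device name and UUID are assumptions for illustration and are not taken from the template:

```python
import re

# Default values from the macro table above.
DEFAULT_FILTERS = {
    "{$NVIDIA.NAME.MATCHES}": ".*",
    "{$NVIDIA.NAME.NOT_MATCHES}": "CHANGE IF NEEDED",
    "{$NVIDIA.UUID.MATCHES}": ".*",
    "{$NVIDIA.UUID.NOT_MATCHES}": "CHANGE IF NEEDED",
}


def gpu_is_discovered(name, uuid, filters=DEFAULT_FILTERS):
    """Return True if a GPU passes all four include/exclude filters."""
    return (
        re.search(filters["{$NVIDIA.NAME.MATCHES}"], name) is not None
        and re.search(filters["{$NVIDIA.NAME.NOT_MATCHES}"], name) is None
        and re.search(filters["{$NVIDIA.UUID.MATCHES}"], uuid) is not None
        and re.search(filters["{$NVIDIA.UUID.NOT_MATCHES}"], uuid) is None
    )


if __name__ == "__main__":
    # Hypothetical device name and UUID.
    name, uuid = "NVIDIA GeForce GTX 1650 SUPER", "GPU-example-uuid"
    print(gpu_is_discovered(name, uuid))

    # Excluding every GTX card by overriding one macro.
    no_gtx = dict(DEFAULT_FILTERS, **{"{$NVIDIA.NAME.NOT_MATCHES}": "GTX"})
    print(gpu_is_discovered(name, uuid, no_gtx))
```

With the defaults, the `CHANGE IF NEEDED` placeholder is unlikely to match any real name or UUID, which is why all GPUs are discovered until a macro is overridden.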

Items

Name	Description	Type	Key and additional info
Driver version	Retrieves the version of the system's graphics driver. For all Nvidia products.	Zabbix agent (active)	nvml.system.driver.version Preprocessing Discard unchanged with heartbeat: `1d`
NVML library version	Retrieves the version of the NVML library. For all Nvidia products.	Zabbix agent (active)	nvml.version Preprocessing Discard unchanged with heartbeat: `1d`
Number of devices	Retrieves the number of compute devices in the system. A compute device is a single GPU. For all Nvidia products.	Zabbix agent (active)	nvml.device.count Preprocessing Discard unchanged with heartbeat: `1d`
Get devices	Retrieves a list of Nvidia devices in the system.	Zabbix agent (active)	nvml.device.get
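
Three of the items above use the "Discard unchanged with heartbeat" preprocessing step with a `1d` heartbeat. A simplified model of that behavior, assuming a value is stored when it differs from the last stored value or when the heartbeat period has elapsed since the last stored value. The class name, timestamps, and version strings are hypothetical:

```python
HEARTBEAT_SECONDS = 24 * 60 * 60  # the `1d` heartbeat from the table above


class DiscardUnchangedWithHeartbeat:
    """Simplified model of the preprocessing step, not Zabbix's own code."""

    def __init__(self, heartbeat=HEARTBEAT_SECONDS):
        self.heartbeat = heartbeat
        self.last_value = None
        self.last_stored_at = None

    def process(self, value, timestamp):
        """Return True if the value is stored, False if it is discarded."""
        first = self.last_stored_at is None
        changed = first or value != self.last_value
        expired = (not first) and timestamp - self.last_stored_at >= self.heartbeat
        if changed or expired:
            self.last_value = value
            self.last_stored_at = timestamp
            return True
        return False


if __name__ == "__main__":
    step = DiscardUnchangedWithHeartbeat()
    # (value, seconds since start): hypothetical driver version readings.
    for value, ts in [("550.54", 0), ("550.54", 60), ("550.54", 86400), ("555.42", 86460)]:
        print(ts, value, step.process(value, ts))
```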

Triggers

Name	Description	Expression	Severity	Dependencies and additional info
Nvidia: Driver version has changed	Driver version has changed. Check the Nvidia website for the specific driver version: https://www.nvidia.com/en-us/drivers/	`change(/Nvidia by Zabbix agent 2 active/nvml.system.driver.version) <> 0`	Info	Manual close: Yes
Nvidia: NVML library has changed	NVML library version has changed. Check the changelog for details: https://docs.nvidia.com/deploy/nvml-api/change-log.html	`change(/Nvidia by Zabbix agent 2 active/nvml.version) <> 0`	Info	Manual close: Yes
Nvidia: Number of devices has changed	Number of devices has changed. Check if this was intentional.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.count) <> 0`	Warning	Manual close: Yes
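
All three triggers above fire when the latest value differs from the previous one. A small sketch of that comparison, assuming `change()` yields the numeric difference for numeric items and 1 or 0 for text items such as version strings. The function names and sample values are hypothetical:

```python
def change(previous, latest):
    """Model of change(): difference for numbers, 1/0 for text values."""
    if isinstance(previous, str) or isinstance(latest, str):
        return 1 if previous != latest else 0
    return latest - previous


def trigger_fires(previous, latest):
    """Mirror the `change(...) <> 0` expression used by the triggers."""
    return change(previous, latest) != 0


if __name__ == "__main__":
    print(trigger_fires("550.54", "555.42"))  # driver version changed
    print(trigger_fires(2, 2))                # device count unchanged
    print(trigger_fires(2, 1))                # one device disappeared
```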

LLD rule GPU Discovery

Name	Description	Type	Key and additional info
GPU Discovery	Nvidia GPU discovery in the system.	Dependent item	nvml.device.discovery Preprocessing Discard unchanged with heartbeat: `1d`

Item prototypes for GPU Discovery

Name	Description	Type	Key and additional info
[{#UUID}]: Serial number	Retrieves the globally unique board serial number associated with this device's board. For all products with an inforom. This number matches the serial number tag that is physically attached to the board.	Zabbix agent (active)	nvml.device.serial["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `The device does not support operation to retrieve serial number.`
[{#UUID}]: Encoder utilization	Retrieves the current utilization for the Encoder. For Nvidia Kepler or newer fully supported devices.	Zabbix agent (active)	nvml.device.encoder.utilization["{#UUID}"]
[{#UUID}]: Decoder utilization	Retrieves the current utilization for the Decoder. For Nvidia Kepler or newer fully supported devices.	Zabbix agent (active)	nvml.device.decoder.utilization["{#UUID}"]
[{#UUID}]: Fan speed	Retrieves the intended operating speed of the specified device fan. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, the output will not match the actual fan speed. For all Nvidia discrete products with dedicated fans. The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed. In certain cases, this value may exceed 100%.	Zabbix agent (active)	nvml.device.fan.speed.avg["{#UUID}"]
[{#UUID}]: Power usage	Retrieves power usage for this GPU (in watts) and its associated circuitry (e.g. memory). For Nvidia Fermi or newer fully supported devices. On Fermi and Kepler GPUs, the reading is accurate to within +/- 5% of current power draw. On Ampere (except GA100) or newer GPUs, the API returns power averaged over a 1 second interval. On GA100 and older architectures, instantaneous power is returned.	Zabbix agent (active)	nvml.device.power.usage["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Power limit	Retrieves the power management limit associated with this device. For Nvidia Fermi or newer fully supported devices. The power limit defines the upper boundary for the card's power draw. If the card's total power draw reaches this limit, the power management algorithm kicks in. This reading is only available if power management mode is supported.	Zabbix agent (active)	nvml.device.power.limit["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Energy consumption	Retrieves the total energy consumption of this GPU in joules since the last driver reload. For Nvidia Volta or newer fully supported devices.	Zabbix agent (active)	nvml.device.energy.consumption["{#UUID}"] Preprocessing Custom multiplier: `0.001`
[{#UUID}]: Temperature	Retrieves the current temperature readings for the device, in degrees C. For all Nvidia products.	Zabbix agent (active)	nvml.device.temperature["{#UUID}"]
[{#UUID}]: Memory frequency	Retrieves the current memory clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.memory.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: SM frequency	Retrieves the current SM clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.sm.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Graphics frequency	Retrieves the current graphics clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.graphics.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Video frequency	Retrieves the current video encoder/decoder clock speed for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.video.frequency["{#UUID}"] Preprocessing Custom multiplier: `1000000`
[{#UUID}]: Performance state	Retrieves the current performance state for the device. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.performance.state["{#UUID}"]
[{#UUID}]: Device utilization, get	Retrieves the current utilization rates for the device's major subsystems. For Nvidia Fermi or newer fully supported devices.	Zabbix agent (active)	nvml.device.utilization["{#UUID}"]
[{#UUID}]: GPU utilization	Percentage of time over the past sampling period during which one or more kernels were running on the GPU. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.utilization.gpu["{#UUID}"] Preprocessing JSON Path: `$.device`
[{#UUID}]: Memory utilization	Percentage of time over the past sampling period during which global (device) memory was being read or written. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.utilization.memory["{#UUID}"] Preprocessing JSON Path: `$.memory`
[{#UUID}]: Encoder stats	Retrieves the current encoder statistics for a given device. For Nvidia Maxwell or newer fully supported devices.	Zabbix agent (active)	nvml.device.encoder.stats.get["{#UUID}"]
[{#UUID}]: Encoder sessions	Retrieves the current count of active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.sessions["{#UUID}"] Preprocessing JSON Path: `$.session_count`
[{#UUID}]: Encoder average FPS	Retrieves the trailing average FPS of all active encoder sessions for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.fps["{#UUID}"] Preprocessing JSON Path: `$.average_fps`
[{#UUID}]: Encoder average latency	Retrieves the current encode latency for a given device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.encoder.stats.latency["{#UUID}"] Preprocessing JSON Path: `$.average_latency_ms` Custom multiplier: `0.001`
[{#UUID}]: FB memory, get	Retrieves the amount of used, free, reserved, and total memory available on the device. For all Nvidia products. Enabling ECC reduces the amount of total available memory due to the extra required parity bits. Under WDDM, most of the device memory is allocated and managed on startup by Windows. Under Linux and Windows TCC, the reported amount of used memory is equal to the sum of memory allocated by all active channels on the device.	Zabbix agent (active)	nvml.device.memory.fb.get["{#UUID}"]
[{#UUID}]: FB memory, total	Total physical memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.total["{#UUID}"] Preprocessing JSON Path: `$.total_memory_bytes`
[{#UUID}]: FB memory, reserved	Memory reserved for system use (driver or firmware) on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.reserved["{#UUID}"] Preprocessing JSON Path: `$.reserved_memory_bytes` ⛔️Custom on fail: Set error to: `NVML library too old to support this metric.`
[{#UUID}]: FB memory, free	Unallocated memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.free["{#UUID}"] Preprocessing JSON Path: `$.free_memory_bytes`
[{#UUID}]: FB memory, used	Allocated memory on the device. For all Nvidia products.	Dependent item	nvml.device.memory.fb.used["{#UUID}"] Preprocessing JSON Path: `$.used_memory_bytes`
[{#UUID}]: BAR1 memory, get	Gets Total, Available, and Used size of BAR1 memory. BAR1 is used to map the FB (device memory) so that it can be directly accessed by the CPU or 3rd party devices (peer-to-peer on the PCIe bus). For Nvidia Kepler or newer fully supported devices.	Zabbix agent (active)	nvml.device.memory.bar1.get["{#UUID}"]
[{#UUID}]: BAR1 memory, total	Total BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.total["{#UUID}"] Preprocessing JSON Path: `$.total_memory_bytes`
[{#UUID}]: BAR1 memory, free	Unallocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.free["{#UUID}"] Preprocessing JSON Path: `$.free_memory_bytes`
[{#UUID}]: BAR1 memory, used	Allocated BAR1 memory on the device. For Nvidia Kepler or newer fully supported devices.	Dependent item	nvml.device.memory.bar1.used["{#UUID}"] Preprocessing JSON Path: `$.used_memory_bytes`
[{#UUID}]: Memory ECC errors, get	Retrieves the GPU device memory error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled.	Zabbix agent (active)	nvml.device.errors.memory["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `No ECC on the device or ECC mode is turned off.`
[{#UUID}]: Memory ECC errors, corrected	Retrieves the count of GPU device memory errors that were corrected. For ECC errors, these are single-bit errors; for Texture memory, these are errors fixed by resend. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.memory.corrected["{#UUID}"] Preprocessing JSON Path: `$.corrected`
[{#UUID}]: Memory ECC errors, uncorrected	Retrieves the count of GPU device memory errors that were not corrected. For ECC errors, these are double-bit errors; for Texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.memory.uncorrected["{#UUID}"] Preprocessing JSON Path: `$.uncorrected`
[{#UUID}]: Register file errors, get	Retrieves the GPU register file error counters for the device. For Nvidia Fermi or newer fully supported devices. Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts. Only applicable to devices with ECC. Requires ECC Mode to be enabled.	Zabbix agent (active)	nvml.device.errors.register["{#UUID}"] Preprocessing Check for not supported value: `The text is too long. Please see the template.` ⛔️Custom on fail: Set error to: `No ECC on the device or ECC mode is turned off.`
[{#UUID}]: Register file errors, corrected	Retrieves the count of GPU register file errors that were corrected. For ECC errors, these are single-bit errors; for Texture memory, these are errors fixed by resend. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.register.corrected["{#UUID}"] Preprocessing JSON Path: `$.corrected`
[{#UUID}]: Register file errors, uncorrected	Retrieves the count of GPU register file errors that were not corrected. For ECC errors, these are double-bit errors; for Texture memory, these are errors where the resend fails. For Nvidia Fermi or newer fully supported devices.	Dependent item	nvml.device.errors.register.uncorrected["{#UUID}"] Preprocessing JSON Path: `$.uncorrected`
[{#UUID}]: PCIe utilization, get	Retrieves PCIe utilization information. For Nvidia Maxwell or newer fully supported devices.	Zabbix agent (active)	nvml.device.pci.utilization["{#UUID}"]
[{#UUID}]: PCIe utilization, Rx	The PCIe Rx (receive) throughput over a 20ms interval on the device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.pci.utilization.rx.rate["{#UUID}"] Preprocessing JSON Path: `$.rx_rate_kb_s` Custom multiplier: `1024`
[{#UUID}]: PCIe utilization, Tx	The PCIe Tx (transmit) throughput over a 20ms interval on the device. For Nvidia Maxwell or newer fully supported devices.	Dependent item	nvml.device.pci.utilization.tx.rate["{#UUID}"] Preprocessing JSON Path: `$.tx_rate_kb_s` Custom multiplier: `1024`
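
Many of the dependent items above follow the same pattern: a master item returns JSON, a JSON Path step extracts one field, and an optional custom multiplier rescales it. A minimal sketch of that chain for the PCIe items, using the field names and the `1024` multiplier from the table. The payload values and helper names are hypothetical, and the helper only handles simple `$.field` paths rather than full JSONPath:

```python
import json


def apply_json_path(raw_json, path):
    """Extract a top-level field addressed as `$.field` from a JSON string."""
    if not path.startswith("$."):
        raise ValueError("only simple '$.field' paths are supported in this sketch")
    return json.loads(raw_json)[path[2:]]


def preprocess(raw_json, path, multiplier=1):
    """JSON Path step followed by a Custom multiplier step."""
    return apply_json_path(raw_json, path) * multiplier


if __name__ == "__main__":
    # Hypothetical value of the master item nvml.device.pci.utilization.
    payload = '{"rx_rate_kb_s": 2048, "tx_rate_kb_s": 512}'

    rx = preprocess(payload, "$.rx_rate_kb_s", 1024)
    tx = preprocess(payload, "$.tx_rate_kb_s", 1024)
    print(rx, tx)  # 2097152 524288
```

Collecting the JSON once in the master item and deriving the Rx and Tx values from it means the agent is queried a single time per interval for both metrics.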

Trigger prototypes for GPU Discovery

Name	Description	Expression	Severity	Dependencies and additional info
Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold	[{#UUID}]: Encoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Encoder utilization exceeded warning threshold	[{#UUID}]: Encoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold	[{#UUID}]: Decoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Decoder utilization exceeded warning threshold	[{#UUID}]: Decoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Fan speed exceeded critical threshold	[{#UUID}]: Fan speed is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.CRIT}`	Average
Nvidia: [{#UUID}]: Fan speed exceeded warning threshold	[{#UUID}]: Fan speed is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Fan speed exceeded critical threshold
Nvidia: [{#UUID}]: Power usage exceeded critical threshold	[{#UUID}]: Power usage is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`(min(/Nvidia by Zabbix agent 2 active/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2 active/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Power usage exceeded warning threshold	[{#UUID}]: Power usage is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`(min(/Nvidia by Zabbix agent 2 active/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2 active/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Power usage exceeded critical threshold
Nvidia: [{#UUID}]: Power limit has changed	Power limit for the device has changed. Check if this was intentional.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.power.limit["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Temperature exceeded critical threshold	[{#UUID}]: Temperature is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.CRIT}`	Average
Nvidia: [{#UUID}]: Temperature exceeded warning threshold	[{#UUID}]: Temperature is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Temperature exceeded critical threshold
Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold	[{#UUID}]: GPU utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: GPU utilization exceeded warning threshold	[{#UUID}]: GPU utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold
Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold	[{#UUID}]: Memory utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.CRIT}`	Average
Nvidia: [{#UUID}]: Memory utilization exceeded warning threshold	[{#UUID}]: Memory utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.	`min(/Nvidia by Zabbix agent 2 active/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.WARN}`	Warning	Depends on: Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold
Nvidia: [{#UUID}]: Encoder average latency is high		`last(/Nvidia by Zabbix agent 2 active/nvml.device.encoder.stats.latency["{#UUID}"]) > (2 * avg(/Nvidia by Zabbix agent 2 active/nvml.device.encoder.stats.latency["{#UUID}"],3m))`	Warning
Nvidia: [{#UUID}]: Total FB memory has changed	Total FB memory has changed. This could mean possible memory degradation, hardware configuration changes, or memory reservation by system or software.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.memory.fb.total["{#UUID}"]) <> 0`	Warning	Manual close: Yes
Nvidia: [{#UUID}]: Total BAR1 memory has changed	Total BAR1 memory has changed. This could mean possible memory degradation, hardware configuration changes, or memory reservation by system or software.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.memory.bar1.total["{#UUID}"]) <> 0`	Warning	Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected memory ECC errors has changed	An increasing number of corrected ECC errors can indicate (but does not necessarily mean) aging or degradation of memory.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.errors.memory.corrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected memory ECC errors has changed	An increasing number of uncorrected ECC errors can indicate potential issues such as: data corruption, system instability, hardware issues.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.errors.memory.uncorrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected register file errors has changed	An increasing number of corrected register file errors can indicate (but does not necessarily mean) wear, aging, or degradation of memory.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.errors.register.corrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected register file errors has changed	An increasing number of uncorrected register file errors can indicate potential issues such as: data corruption, system instability, hardware degradation.	`change(/Nvidia by Zabbix agent 2 active/nvml.device.errors.register.uncorrected["{#UUID}"]) <> 0`	Info	Manual close: Yes
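
Most trigger prototypes above come in warning/critical pairs that use `min(...,3m)`, with the warning trigger depending on the critical one, while the encoder latency trigger compares the latest value with twice the 3-minute average. A compact sketch of both rules, assuming the averaging window includes the latest sample. The function names and sample readings are hypothetical; the returned strings are the severities listed in the table:

```python
from statistics import mean


def active_threshold_trigger(samples, warn, crit):
    """Return the severity of the trigger that would be shown, or None.

    The dependency means the Warning trigger is suppressed whenever the
    critical-threshold trigger (severity Average) is in a problem state.
    """
    lowest = min(samples)  # min(...,3m): every sample must exceed the threshold
    if lowest > crit:
        return "Average"
    if lowest > warn:
        return "Warning"
    return None


def encoder_latency_is_high(samples):
    """Mirror last(...) > 2 * avg(...,3m)."""
    return samples[-1] > 2 * mean(samples)


if __name__ == "__main__":
    # Hypothetical GPU utilization readings, in %, with the default macros.
    print(active_threshold_trigger([95, 97, 93], warn=80, crit=90))
    print(active_threshold_trigger([85, 92, 88], warn=80, crit=90))
    print(active_threshold_trigger([85, 40, 95], warn=80, crit=90))

    # Hypothetical encoder latency readings.
    print(encoder_latency_is_high([4, 4, 20]))
    print(encoder_latency_is_high([4, 4, 8]))
```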

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at ZABBIX forums

Link	Source	Compatibility	Type, Technology	Created Updated	Rating
NVidia Sensors This template integrates NVidia SMI for a single graphics card with Zabbix. The template adds monitoring of: GPU Utilisation, GPU Power Consumption, GPU Memory (Used, Free, Total), GPU Temperature, GPU Fan Speed. The following agent parameters can be used to add the metrics into Zabbix: UserParameter=gpu.temp,nvidia-smi ... template_nvidia-smi_integration	GitHub Community Templates	5.0+
