NVIDIA

NVIDIA is a software and fabless company that designs and supplies graphics processing units (GPUs), application programming interfaces (APIs) for data science and high-performance computing, and system on a chip units (SoCs) for mobile computing and the automotive market.

Available solutions




This template is for Zabbix version: 7.2

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia?at=release/7.2

Nvidia by Zabbix agent 2

Overview

This template is designed for Nvidia GPU monitoring and doesn't require any external scripts. All Nvidia GPUs will be discovered. Set filters with macros if you want to override default filter parameters.

Requirements

Zabbix version: 7.2 and higher.

Tested versions

This template has been tested on:

  • Nvidia GTX 1650s
  • Nvidia RTX 2070Ti

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

  1. Set up and configure Zabbix agent 2 compiled with the Nvidia monitoring plugin.
  2. Create a host with Zabbix agent interface and attach the template to it.

Test availability: zabbix_get -s nvidia-host -k nvml.system.driver.version
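
The availability test can also be scripted. The following is a minimal sketch, assuming zabbix_get is installed on the machine running the script and the agent listens on the default port 10050; the helper names build_zabbix_get_command and query_agent are hypothetical, and nvidia-host is the example host name used above.

```python
import shutil
import subprocess


def build_zabbix_get_command(host, key, port=10050):
    """Return the argv list for a zabbix_get query against one item key."""
    return ["zabbix_get", "-s", host, "-p", str(port), "-k", key]


def query_agent(host, key, port=10050, timeout=10):
    """Run zabbix_get and return its trimmed output.

    Raises FileNotFoundError if zabbix_get is not on PATH, and
    subprocess.CalledProcessError if the command exits with an error.
    """
    if shutil.which("zabbix_get") is None:
        raise FileNotFoundError("zabbix_get is not installed or not on PATH")
    result = subprocess.run(
        build_zabbix_get_command(host, key, port),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout.strip()


# Example call (requires a reachable agent):
# print(query_agent("nvidia-host", "nvml.system.driver.version"))
```

If the call returns a driver version string, the plugin is loaded and the host is reachable.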

Macros used

Name                         Description                                            Default
{$NVIDIA.GPU.UTIL.WARN}      Warning threshold for GPU overall utilization, in %.   80
{$NVIDIA.GPU.UTIL.CRIT}      Critical threshold for GPU overall utilization, in %.  90
{$NVIDIA.ENCODER.UTIL.WARN}  Warning threshold for encoder utilization, in %.       80
{$NVIDIA.ENCODER.UTIL.CRIT}  Critical threshold for encoder utilization, in %.      90
{$NVIDIA.DECODER.UTIL.WARN}  Warning threshold for decoder utilization, in %.       80
{$NVIDIA.DECODER.UTIL.CRIT}  Critical threshold for decoder utilization, in %.      90
{$NVIDIA.MEMORY.UTIL.WARN}   Warning threshold for memory utilization, in %.        80
{$NVIDIA.MEMORY.UTIL.CRIT}   Critical threshold for memory utilization, in %.       90
{$NVIDIA.FAN.SPEED.WARN}     Warning threshold for fan speed, in %.                 80
{$NVIDIA.FAN.SPEED.CRIT}     Critical threshold for fan speed, in %.                90
{$NVIDIA.TEMPERATURE.WARN}   Warning threshold for temperature, in degrees C.       80
{$NVIDIA.TEMPERATURE.CRIT}   Critical threshold for temperature, in degrees C.      90
{$NVIDIA.POWER.UTIL.WARN}    Warning threshold for power usage, in %.               80
{$NVIDIA.POWER.UTIL.CRIT}    Critical threshold for power usage, in %.              90
{$NVIDIA.NAME.MATCHES}       Filter to include GPUs by name in discovery.           .*
{$NVIDIA.NAME.NOT_MATCHES}   Filter to exclude GPUs by name in discovery.           CHANGE IF NEEDED
{$NVIDIA.UUID.MATCHES}       Filter to include GPUs by UUID in discovery.           .*
{$NVIDIA.UUID.NOT_MATCHES}   Filter to exclude GPUs by UUID in discovery.           CHANGE IF NEEDED
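
The effect of the four discovery filter macros can be illustrated with a small sketch. It assumes that all four conditions must hold at once and that Python's re module is a close enough stand-in for the regular expressions Zabbix evaluates; the function name is_discovered and the sample GPU name and UUID are hypothetical.

```python
import re

# Default macro values as listed in the table above.
DEFAULT_MACROS = {
    "{$NVIDIA.NAME.MATCHES}": ".*",
    "{$NVIDIA.NAME.NOT_MATCHES}": "CHANGE IF NEEDED",
    "{$NVIDIA.UUID.MATCHES}": ".*",
    "{$NVIDIA.UUID.NOT_MATCHES}": "CHANGE IF NEEDED",
}


def is_discovered(name, uuid, macros=DEFAULT_MACROS):
    """Return True if a GPU passes both include filters and neither exclude filter."""
    return (
        re.search(macros["{$NVIDIA.NAME.MATCHES}"], name) is not None
        and re.search(macros["{$NVIDIA.NAME.NOT_MATCHES}"], name) is None
        and re.search(macros["{$NVIDIA.UUID.MATCHES}"], uuid) is not None
        and re.search(macros["{$NVIDIA.UUID.NOT_MATCHES}"], uuid) is None
    )


# Hypothetical override: exclude every GPU whose name contains "GTX".
no_gtx = dict(DEFAULT_MACROS, **{"{$NVIDIA.NAME.NOT_MATCHES}": "GTX"})
```

With the defaults, every GPU is discovered, because ".*" matches any string and the placeholder "CHANGE IF NEEDED" is unlikely to occur in a real name or UUID.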

Items

Name Description Type Key and additional info
Driver version

Retrieves the version of the system's graphics driver.

For all Nvidia products.

Zabbix agent nvml.system.driver.version

Preprocessing

  • Discard unchanged with heartbeat: 1d

NVML library version

Retrieves the version of the NVML library.

For all Nvidia products.

Zabbix agent nvml.version

Preprocessing

  • Discard unchanged with heartbeat: 1d

Number of devices

Retrieves the number of compute devices in the system. A compute device is a single GPU.

For all Nvidia products.

Zabbix agent nvml.device.count

Preprocessing

  • Discard unchanged with heartbeat: 1d

Get devices

Retrieves the list of Nvidia devices in the system.

Zabbix agent nvml.device.get
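
The "Discard unchanged with heartbeat: 1d" step used by the items above stores a value only when it differs from the last stored one or when the heartbeat period has elapsed. The following is a rough sketch of that behavior, not Zabbix's implementation; the class name, the exact boundary handling (>=), and the sample version strings are assumptions.

```python
class DiscardUnchangedWithHeartbeat:
    """Keep a value if it changed or if the heartbeat period has passed."""

    def __init__(self, heartbeat_seconds):
        self.heartbeat = heartbeat_seconds
        self.last_value = None
        self.last_kept_at = None

    def keep(self, value, timestamp):
        """Return True if the value would be stored, False if discarded."""
        first = self.last_kept_at is None
        changed = first or value != self.last_value
        expired = (not first) and timestamp - self.last_kept_at >= self.heartbeat
        if changed or expired:
            self.last_value = value
            self.last_kept_at = timestamp
            return True
        return False


step = DiscardUnchangedWithHeartbeat(heartbeat_seconds=86400)  # 1d
results = [
    step.keep("550.54", 0),      # first value is always kept
    step.keep("550.54", 3600),   # unchanged within a day: discarded
    step.keep("550.54", 86400),  # unchanged, but heartbeat elapsed: kept
    step.keep("560.10", 86460),  # changed: kept
]
```

This keeps history small for values such as the driver version, which rarely change.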

Triggers

Name Description Expression Severity Dependencies and additional info
Nvidia: Driver version has changed

Driver version has changed.
Check the changelog for the specific driver version on the Nvidia website: https://www.nvidia.com/en-us/drivers/

change(/Nvidia by Zabbix agent 2/nvml.system.driver.version) <> 0 Info Manual close: Yes
Nvidia: NVML library has changed

NVML library version has changed.
Changelog can be found here: https://docs.nvidia.com/deploy/nvml-api/change-log.html

change(/Nvidia by Zabbix agent 2/nvml.version) <> 0 Info Manual close: Yes
Nvidia: Number of devices has changed

Number of devices has changed. Check whether it was intentional.

change(/Nvidia by Zabbix agent 2/nvml.device.count) <> 0 Warning Manual close: Yes
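
All three triggers above use the pattern change(...) <> 0. As a rough illustration, change() compares the two most recent values: for numbers it returns their difference, and for strings it returns 1 if they differ and 0 otherwise. The function names and sample values below are hypothetical.

```python
def change(previous, last):
    """Numeric values: last - previous. Strings: 1 if different, else 0."""
    if isinstance(previous, str) or isinstance(last, str):
        return 0 if previous == last else 1
    return last - previous


def trigger_fires(previous, last):
    """Mirror the expression change(item) <> 0."""
    return change(previous, last) != 0
```

For example, a device count that drops from 2 to 1 gives change() = -1, so the trigger fires, while two identical driver version strings give 0.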

LLD rule GPU Discovery

Name Description Type Key and additional info
GPU Discovery

Nvidia GPU discovery in the system.

Dependent item nvml.device.discovery

Preprocessing

  • Discard unchanged with heartbeat: 1d
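
For each discovered GPU, the item and trigger prototypes listed below are instantiated by substituting the {#UUID} macro. A minimal sketch of that substitution, assuming two discovered devices with made-up UUIDs; expand_prototype is a hypothetical helper, not part of Zabbix.

```python
def expand_prototype(key_prototype, lld_row):
    """Replace each LLD macro in a prototype key with its discovered value."""
    key = key_prototype
    for macro, value in lld_row.items():
        key = key.replace(macro, value)
    return key


# Hypothetical discovery result: one row per GPU.
discovered = [{"{#UUID}": "GPU-aaaa"}, {"{#UUID}": "GPU-bbbb"}]

keys = [
    expand_prototype('nvml.device.temperature["{#UUID}"]', row)
    for row in discovered
]
```

Each resulting key, such as nvml.device.temperature["GPU-aaaa"], becomes a separate item on the host.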

Item prototypes for GPU Discovery

Name Description Type Key and additional info
[{#UUID}]: Serial number

Retrieves the globally unique board serial number associated with this device's board.

For all products with an inforom.

This number matches the serial number tag that is physically attached to the board.

Zabbix agent nvml.device.serial["{#UUID}"]

Preprocessing

  • Check for not supported value: The text is too long. Please see the template.

    ⛔️Custom on fail: Set error to: The device does not support operation to retrieve serial number.

[{#UUID}]: Encoder utilization

Retrieves the current utilization for the Encoder.

For Nvidia Kepler or newer fully supported devices.

Zabbix agent nvml.device.encoder.utilization["{#UUID}"]
[{#UUID}]: Decoder utilization

Retrieves the current utilization for the Decoder.

For Nvidia Kepler or newer fully supported devices.

Zabbix agent nvml.device.decoder.utilization["{#UUID}"]
[{#UUID}]: Fan speed

Retrieves the intended operating speed of the device's specified fan.

Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, the output will not match the actual fan speed.

For all Nvidia discrete products with dedicated fans.

The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed. This value may exceed 100% in certain cases.

Zabbix agent nvml.device.fan.speed.avg["{#UUID}"]
[{#UUID}]: Power usage

Retrieves the power usage, in watts, of this GPU and its associated circuitry (e.g. memory).

For Nvidia Fermi or newer fully supported devices.

On Fermi and Kepler GPUs, the reading is accurate to within +/- 5% of current power draw. On Ampere (except GA100) or newer GPUs, the API returns power averaged over a 1-second interval. On GA100 and older architectures, instantaneous power is returned.

Zabbix agent nvml.device.power.usage["{#UUID}"]

Preprocessing

  • Custom multiplier: 0.001

[{#UUID}]: Power limit

Retrieves the power management limit associated with this device.

For Nvidia Fermi or newer fully supported devices.

The power limit defines the upper boundary for the card's power draw. If the card's total power draw reaches this limit the power management algorithm kicks in.

This reading is only available if power management mode is supported.

Zabbix agent nvml.device.power.limit["{#UUID}"]

Preprocessing

  • Custom multiplier: 0.001

[{#UUID}]: Energy consumption

Retrieves total energy consumption for this GPU in joules (J) since the driver was last reloaded.

For Nvidia Volta or newer fully supported devices.

Zabbix agent nvml.device.energy.consumption["{#UUID}"]

Preprocessing

  • Custom multiplier: 0.001

[{#UUID}]: Temperature

Retrieves the current temperature readings for the device, in degrees C.

For all Nvidia products.

Zabbix agent nvml.device.temperature["{#UUID}"]
[{#UUID}]: Memory frequency

Retrieves the current memory clock speed for the device.

For Nvidia Fermi or newer fully supported devices.

Zabbix agent nvml.device.memory.frequency["{#UUID}"]

Preprocessing

  • Custom multiplier: 1000000

[{#UUID}]: SM frequency

Retrieves the current SM clock speed for the device.

For Nvidia Fermi or newer fully supported devices.

Zabbix agent nvml.device.sm.frequency["{#UUID}"]

Preprocessing

  • Custom multiplier: 1000000

[{#UUID}]: Graphics frequency

Retrieves the current graphics clock speed for the device.

For Nvidia Fermi or newer fully supported devices.

Zabbix agent nvml.device.graphics.frequency["{#UUID}"]

Preprocessing

  • Custom multiplier: 1000000

[{#UUID}]: Video frequency

Retrieves the current video encoder/decoder clock speed for the device.

For Nvidia Fermi or newer fully supported devices.

Zabbix agent nvml.device.video.frequency["{#UUID}"]

Preprocessing

  • Custom multiplier: 1000000

[{#UUID}]: Performance state

Retrieves the current performance state for the device.

For Nvidia Fermi or newer fully supported devices.

Zabbix agent nvml.device.performance.state["{#UUID}"]
[{#UUID}]: Device utilization, get

Retrieves the current utilization rates for the device's major subsystems.

For Nvidia Fermi or newer fully supported devices.

Zabbix agent nvml.device.utilization["{#UUID}"]
[{#UUID}]: GPU utilization

Percent of time over the past sample period during which one or more kernels was executing on the GPU.

For Nvidia Fermi or newer fully supported devices.

Dependent item nvml.device.utilization.gpu["{#UUID}"]

Preprocessing

  • JSON Path: $.device

[{#UUID}]: Memory utilization

Percent of time over the past sample period during which global (device) memory was being read or written.

For Nvidia Fermi or newer fully supported devices.

Dependent item nvml.device.utilization.memory["{#UUID}"]

Preprocessing

  • JSON Path: $.memory

[{#UUID}]: Encoder stats

Retrieves the current encoder statistics for a given device.

For Nvidia Maxwell or newer fully supported devices.

Zabbix agent nvml.device.encoder.stats.get["{#UUID}"]
[{#UUID}]: Encoder sessions

Retrieves the current count of active encoder sessions for a given device.

For Nvidia Maxwell or newer fully supported devices.

Dependent item nvml.device.encoder.stats.sessions["{#UUID}"]

Preprocessing

  • JSON Path: $.session_count

[{#UUID}]: Encoder average FPS

Retrieves the trailing average FPS of all active encoder sessions for a given device.

For Nvidia Maxwell or newer fully supported devices.

Dependent item nvml.device.encoder.stats.fps["{#UUID}"]

Preprocessing

  • JSON Path: $.average_fps

[{#UUID}]: Encoder average latency

Retrieves the current encode latency for a given device.

For Nvidia Maxwell or newer fully supported devices.

Dependent item nvml.device.encoder.stats.latency["{#UUID}"]

Preprocessing

  • JSON Path: $.average_latency_ms

  • Custom multiplier: 0.001

[{#UUID}]: FB memory, get

Retrieves the amount of used, free, reserved and total memory available on the device.

For all Nvidia products.

Enabling ECC reduces the amount of total available memory, due to the extra required parity bits. Under WDDM most device memory is allocated and managed on startup by Windows.

Under Linux and Windows TCC, the reported amount of used memory is equal to the sum of memory allocated by all active channels on the device.

Zabbix agent nvml.device.memory.fb.get["{#UUID}"]
[{#UUID}]: FB memory, total

Total physical memory on the device.

For all Nvidia products.

Dependent item nvml.device.memory.fb.total["{#UUID}"]

Preprocessing

  • JSON Path: $.total_memory_bytes

[{#UUID}]: FB memory, reserved

Memory reserved for system use (driver or firmware) on the device.

For all Nvidia products.

Dependent item nvml.device.memory.fb.reserved["{#UUID}"]

Preprocessing

  • JSON Path: $.reserved_memory_bytes

    ⛔️Custom on fail: Set error to: NVML library too old to support this metric.

[{#UUID}]: FB memory, free

Unallocated memory on the device.

For all Nvidia products.

Dependent item nvml.device.memory.fb.free["{#UUID}"]

Preprocessing

  • JSON Path: $.free_memory_bytes

[{#UUID}]: FB memory, used

Allocated memory on the device.

For all Nvidia products.

Dependent item nvml.device.memory.fb.used["{#UUID}"]

Preprocessing

  • JSON Path: $.used_memory_bytes

[{#UUID}]: BAR1 memory, get

Gets Total, Available and Used size of BAR1 memory.

BAR1 is used to map the FB (device memory) so that it can be directly accessed by the CPU or by 3rd party devices (peer-to-peer on the PCIE bus).

For Nvidia Kepler or newer fully supported devices.

Zabbix agent nvml.device.memory.bar1.get["{#UUID}"]
[{#UUID}]: BAR1 memory, total

Total BAR1 memory on the device.

For Nvidia Kepler or newer fully supported devices.

Dependent item nvml.device.memory.bar1.total["{#UUID}"]

Preprocessing

  • JSON Path: $.total_memory_bytes

[{#UUID}]: BAR1 memory, free

Unallocated BAR1 memory on the device.

For Nvidia Kepler or newer fully supported devices.

Dependent item nvml.device.memory.bar1.free["{#UUID}"]

Preprocessing

  • JSON Path: $.free_memory_bytes

[{#UUID}]: BAR1 memory, used

Allocated BAR1 memory on the device.

For Nvidia Kepler or newer fully supported devices.

Dependent item nvml.device.memory.bar1.used["{#UUID}"]

Preprocessing

  • JSON Path: $.used_memory_bytes

[{#UUID}]: Memory ECC errors, get

Retrieves the GPU device memory error counters for the device.

For Nvidia Fermi or newer fully supported devices.

Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts.

Only applicable to devices with ECC.

Requires ECC Mode to be enabled.

Zabbix agent nvml.device.errors.memory["{#UUID}"]

Preprocessing

  • Check for not supported value: The text is too long. Please see the template.

    ⛔️Custom on fail: Set error to: No ECC on the device or ECC mode is turned off.

[{#UUID}]: Memory ECC errors, corrected

Retrieves the count of GPU device memory errors that were corrected. For ECC errors, these are single-bit errors; for texture memory, these are errors fixed by resend.

For Nvidia Fermi or newer fully supported devices.

Dependent item nvml.device.errors.memory.corrected["{#UUID}"]

Preprocessing

  • JSON Path: $.corrected

[{#UUID}]: Memory ECC errors, uncorrected

Retrieves the count of GPU device memory errors that were not corrected. For ECC errors, these are double-bit errors; for texture memory, these are errors where the resend fails.

For Nvidia Fermi or newer fully supported devices.

Dependent item nvml.device.errors.memory.uncorrected["{#UUID}"]

Preprocessing

  • JSON Path: $.uncorrected

[{#UUID}]: Register file errors, get

Retrieves the GPU register file error counters for the device.

For Nvidia Fermi or newer fully supported devices.

Requires NVML_INFOROM_ECC version 2.0 or higher to report aggregate location-based memory error counts. Requires NVML_INFOROM_ECC version 1.0 or higher to report all other memory error counts.

Only applicable to devices with ECC.

Requires ECC Mode to be enabled.

Zabbix agent nvml.device.errors.register["{#UUID}"]

Preprocessing

  • Check for not supported value: The text is too long. Please see the template.

    ⛔️Custom on fail: Set error to: No ECC on the device or ECC mode is turned off.

[{#UUID}]: Register file errors, corrected

Retrieves the count of GPU register file errors that were corrected. For ECC errors, these are single-bit errors; for texture memory, these are errors fixed by resend.

For Nvidia Fermi or newer fully supported devices.

Dependent item nvml.device.errors.register.corrected["{#UUID}"]

Preprocessing

  • JSON Path: $.corrected

[{#UUID}]: Register file errors, uncorrected

Retrieves the count of GPU register file errors that were not corrected. For ECC errors, these are double-bit errors; for texture memory, these are errors where the resend fails.

For Nvidia Fermi or newer fully supported devices.

Dependent item nvml.device.errors.register.uncorrected["{#UUID}"]

Preprocessing

  • JSON Path: $.uncorrected

[{#UUID}]: PCIe utilization, get

Retrieves PCIe utilization information.

For Maxwell or newer fully supported devices.

Zabbix agent nvml.device.pci.utilization["{#UUID}"]
[{#UUID}]: PCIe utilization, Rx

The PCIe Rx (receive) throughput over a 20 ms interval on the device.

For Maxwell or newer fully supported devices.

Dependent item nvml.device.pci.utilization.rx.rate["{#UUID}"]

Preprocessing

  • JSON Path: $.rx_rate_kb_s

  • Custom multiplier: 1024

[{#UUID}]: PCIe utilization, Tx

The PCIe Tx (transmit) throughput over a 20 ms interval on the device.

For Maxwell or newer fully supported devices.

Dependent item nvml.device.pci.utilization.tx.rate["{#UUID}"]

Preprocessing

  • JSON Path: $.tx_rate_kb_s

  • Custom multiplier: 1024
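
Many of the item prototypes above are dependent items: a master item returns one JSON document, and each dependent item extracts a single field with a JSON Path step, optionally followed by a custom multiplier. The sketch below imitates that chain for the PCIe items. The field names come from the JSON paths listed above, while the sample values, the helper names, and the reading of the 1024 multiplier as a KB/s to bytes/s conversion are assumptions.

```python
import json


def extract(master_value, field):
    """Simplified stand-in for a '$.field' JSON Path step on a flat JSON object."""
    return json.loads(master_value)[field]


def preprocess(master_value, field, multiplier=1):
    """Apply the JSON Path step, then the custom multiplier step."""
    return extract(master_value, field) * multiplier


# Hypothetical output of nvml.device.pci.utilization["{#UUID}"].
sample = '{"rx_rate_kb_s": 1500, "tx_rate_kb_s": 250}'

rx_bytes_per_s = preprocess(sample, "rx_rate_kb_s", 1024)  # 1536000
tx_bytes_per_s = preprocess(sample, "tx_rate_kb_s", 1024)  # 256000
```

Because both dependent items read the same master value, the agent is queried once per interval for each device rather than once per metric.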

Trigger prototypes for GPU Discovery

Name Description Expression Severity Dependencies and additional info
Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold

[{#UUID}]: Encoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.CRIT} Average
Nvidia: [{#UUID}]: Encoder utilization exceeded warning threshold

[{#UUID}]: Encoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.encoder.utilization["{#UUID}"],3m) > {$NVIDIA.ENCODER.UTIL.WARN} Warning Depends on:
  • Nvidia: [{#UUID}]: Encoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold

[{#UUID}]: Decoder utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.CRIT} Average
Nvidia: [{#UUID}]: Decoder utilization exceeded warning threshold

[{#UUID}]: Decoder utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.decoder.utilization["{#UUID}"],3m) > {$NVIDIA.DECODER.UTIL.WARN} Warning Depends on:
  • Nvidia: [{#UUID}]: Decoder utilization exceeded critical threshold
Nvidia: [{#UUID}]: Fan speed exceeded critical threshold

[{#UUID}]: Fan speed is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.CRIT} Average
Nvidia: [{#UUID}]: Fan speed exceeded warning threshold

[{#UUID}]: Fan speed is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.fan.speed.avg["{#UUID}"],3m) > {$NVIDIA.FAN.SPEED.WARN} Warning Depends on:
  • Nvidia: [{#UUID}]: Fan speed exceeded critical threshold
Nvidia: [{#UUID}]: Power usage exceeded critical threshold

[{#UUID}]: Power usage is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

(min(/Nvidia by Zabbix agent 2/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.CRIT} Average
Nvidia: [{#UUID}]: Power usage exceeded warning threshold

[{#UUID}]: Power usage is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

(min(/Nvidia by Zabbix agent 2/nvml.device.power.usage["{#UUID}"],3m) * 100 / last(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"])) > {$NVIDIA.POWER.UTIL.WARN} Warning Depends on:
  • Nvidia: [{#UUID}]: Power usage exceeded critical threshold
Nvidia: [{#UUID}]: Power limit has changed

Power limit for the device has changed. Check whether it was intentional.

change(/Nvidia by Zabbix agent 2/nvml.device.power.limit["{#UUID}"]) <> 0 Info Manual close: Yes
Nvidia: [{#UUID}]: Temperature exceeded critical threshold

[{#UUID}]: Temperature is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.CRIT} Average
Nvidia: [{#UUID}]: Temperature exceeded warning threshold

[{#UUID}]: Temperature is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.temperature["{#UUID}"],3m) > {$NVIDIA.TEMPERATURE.WARN} Warning Depends on:
  • Nvidia: [{#UUID}]: Temperature exceeded critical threshold
Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold

[{#UUID}]: GPU utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.CRIT} Average
Nvidia: [{#UUID}]: GPU utilization exceeded warning threshold

[{#UUID}]: GPU utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.utilization.gpu["{#UUID}"],3m) > {$NVIDIA.GPU.UTIL.WARN} Warning Depends on:
  • Nvidia: [{#UUID}]: GPU utilization exceeded critical threshold
Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold

[{#UUID}]: Memory utilization is very high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.CRIT} Average
Nvidia: [{#UUID}]: Memory utilization exceeded warning threshold

[{#UUID}]: Memory utilization is high. It may indicate abnormal behavior/activity. Change corresponding macro in case of false-positive.

min(/Nvidia by Zabbix agent 2/nvml.device.utilization.memory["{#UUID}"],3m) > {$NVIDIA.MEMORY.UTIL.WARN} Warning Depends on:
  • Nvidia: [{#UUID}]: Memory utilization exceeded critical threshold
Nvidia: [{#UUID}]: Encoder average latency is high

last(/Nvidia by Zabbix agent 2/nvml.device.encoder.stats.latency["{#UUID}"]) > (2 * avg(/Nvidia by Zabbix agent 2/nvml.device.encoder.stats.latency["{#UUID}"],3m)) Warning
Nvidia: [{#UUID}]: Total FB memory has changed

Total FB memory has changed. That could mean possible memory degradation, hardware configuration changes or memory reservation by system or software.

change(/Nvidia by Zabbix agent 2/nvml.device.memory.fb.total["{#UUID}"]) <> 0 Warning Manual close: Yes
Nvidia: [{#UUID}]: Total BAR1 memory has changed

Total BAR1 memory has changed. That could mean possible memory degradation, hardware configuration changes or memory reservation by system or software.

change(/Nvidia by Zabbix agent 2/nvml.device.memory.bar1.total["{#UUID}"]) <> 0 Warning Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected memory ECC errors has changed

An increasing number of corrected ECC errors can indicate (but does not necessarily mean) aging or degrading memory.

change(/Nvidia by Zabbix agent 2/nvml.device.errors.memory.corrected["{#UUID}"]) <> 0 Info Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected memory ECC errors has changed

An increasing number of uncorrected ECC errors can indicate potential issues such as data corruption, system instability, or hardware issues.

change(/Nvidia by Zabbix agent 2/nvml.device.errors.memory.uncorrected["{#UUID}"]) <> 0 Info Manual close: Yes
Nvidia: [{#UUID}]: Number of corrected register file errors has changed

An increasing number of corrected register file errors can indicate (but does not necessarily mean) wearing, aging or degrading memory.

change(/Nvidia by Zabbix agent 2/nvml.device.errors.register.corrected["{#UUID}"]) <> 0 Info Manual close: Yes
Nvidia: [{#UUID}]: Number of uncorrected register file errors has changed

An increasing number of uncorrected register file errors can indicate potential issues such as data corruption, system instability, or hardware degradation.

change(/Nvidia by Zabbix agent 2/nvml.device.errors.register.uncorrected["{#UUID}"]) <> 0 Info Manual close: Yes
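
The threshold trigger prototypes above share one shape: min(item,3m) must exceed a macro, and the warning trigger depends on the critical one, so only the more severe problem is shown. The sketch below mirrors that logic, along with the power usage percentage and the encoder latency comparison. The function names and sample readings are hypothetical, and each sample list stands in for the values collected during the 3-minute window.

```python
def min_exceeds(samples, threshold):
    """Mirror min(item,3m) > threshold: true only if every sample is above it."""
    return bool(samples) and min(samples) > threshold


def threshold_state(samples, warn, crit):
    """Return 'critical', 'warning', or None for one window of samples."""
    if min_exceeds(samples, crit):
        return "critical"  # the dependent warning trigger is suppressed
    if min_exceeds(samples, warn):
        return "warning"
    return None


def power_usage_percent(usage_samples_w, limit_w):
    """Lowest power draw in the window as a percentage of the power limit."""
    return min(usage_samples_w) * 100 / limit_w


def latency_spike(samples):
    """Mirror last(item) > 2 * avg(item,3m) for the encoder latency trigger."""
    return samples[-1] > 2 * (sum(samples) / len(samples))
```

With the default macros (80 and 90), a window of [85, 88, 92] yields "warning" because the lowest sample, 85, is above 80 but not above 90. Using min() rather than the last value means a single short spike does not raise a problem.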

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.
