Trigger names must be prefixed with the LLD object they belong to.
To keep trigger names short, do not use the {HOST.NAME} macro in them. The host name is already available in the host column.
Avoid using {ITEM.LASTVALUE} in trigger name
Don’t use {ITEM.LASTVALUE1-9} macros directly in trigger names. These macros are expanded to values at the moment the problem name is generated.
Use them in the operational data field (available since Zabbix 4.4) instead.
Explain the threshold in event name
Consider explaining why the trigger fired (the threshold) in parentheses ().
Use the event name field for it (supported since Zabbix 5.2), to keep the trigger name short. The event name, if defined, will be used for generating the problem name.
E. g.:
Other examples for event names:
Good | Bad |
---|---|
Temperature is too high (over 35 C for 5m) | Temperature is too high (now: 40) |
MySQL: Refused connections (max_connections limit reached) | MySQL: Refused connections |
Use this field to describe:
Trigger expressions should be reasonably flap-resistant, that is, they should not rely on the last value only but check the last 5 or 10 minutes instead. On the other hand, do not make the expressions overly complex: for example, do not use trigger hysteresis unless it really adds significant value.
Prefer user macros in trigger expressions so that thresholds can be tuned.
Good | Bad |
---|---|
last(/TEMPLATE_NAME/temperature)>{$TEMP.MAX.WARN} | last(/TEMPLATE_NAME/temperature)>30 |
Use newlines and spaces to make long trigger expressions more human-readable.
Always use time suffixes (1m, 5m, 1d...) and size suffixes (1K, 1M, 1G) in trigger expressions, problem names, trigger descriptions, and operational data to improve readability. Remember that you can use them in user macros, too.
Good | Bad |
---|---|
avg(/TEMPLATE_NAME/temperature,10m)>{$TEMP.MAX.WARN} | avg(/TEMPLATE_NAME/temperature,600)>{$TEMP.MAX.WARN} |
avg(/TEMPLATE_NAME/memory.free,10m)<{$MEM_FREE.WARN} where {$MEM_FREE.WARN} = 100M | avg(/TEMPLATE_NAME/memory.free,600)<{$MEM_FREE.WARN} where {$MEM_FREE.WARN} = 104857600 |
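
To illustrate why the suffixed forms in the table above are easier to read, the following minimal Python sketch expands suffixed values into the plain numbers they stand for. It assumes the usual Zabbix multipliers (time in seconds, size in powers of 1024); the helper name `expand_suffix` is hypothetical and is not part of Zabbix.

```python
# Multipliers for Zabbix-style suffixes: time in seconds, size in bytes (1K = 1024).
TIME_SUFFIXES = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
SIZE_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def expand_suffix(value: str) -> int:
    """Convert a suffixed value such as '10m' or '100M' to a plain integer.

    Hypothetical helper for illustration only; note that lowercase 'm'
    means minutes while uppercase 'M' means mebibytes.
    """
    value = value.strip()
    suffix = value[-1]
    if suffix in TIME_SUFFIXES:
        return int(value[:-1]) * TIME_SUFFIXES[suffix]
    if suffix in SIZE_SUFFIXES:
        return int(value[:-1]) * SIZE_SUFFIXES[suffix]
    return int(value)  # no suffix: already a plain number


print(expand_suffix("10m"))   # 600
print(expand_suffix("100M"))  # 104857600
```

The two printed values are exactly the unreadable constants from the "Bad" column.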
Triggers created in the templates are mapped to the standard Zabbix severity scale. Consider choosing the severity assigned to the trigger with the following in mind:
Severity | Description | Examples | Expected reaction type and time (not always true!), given as example only |
---|---|---|---|
Not classified | Not used under normal circumstances | ||
Info | The event happened that is not an alarm at all. This is the info that might be helpful in the future for retrospective analysis or for auditing. | Examples: s/n changed, user logged in, etc | None |
Warning | A minor alarm that could lead to some more serious problem if left without attention. | Examples: Disk usage is low but there is still some room | React during working hours, no notification is expected. |
Average | Performance alarms: Average alarm that indicates serious performance problems or key service degradation. Fault alarms: partial resource failure or warnings that if left without attention might lead to complete device fault. | Examples: CPU utilization is high, Low memory, High device temperature, Disk health failure in the disk array, Website is slow. | React during working hours, create an issue ticket if the problem stays for hours. |
High | Performance alarms: Key service is not available. Fault alarms: The device is not functioning or not available. | No ICMP PING, Website is down. | React off working hours if affects services with the page. React with a ticket during working hours otherwise. |
Disaster | Reserved for alarms indicating blackouts, disasters, global business service faults. There should be no triggers with disaster level severity in resource templates. | Riga DC is down, Level core network is down, >50% of users cannot purchase anything from our website. | Always react by paging the responsible person. |
Use tags to logically group triggers using the recommended tagging model.
Trigger tags
Tag | Value | Description |
---|---|---|
scope | performance; availability - a monitoring target or its part may become unavailable; capacity - a monitored resource may be exhausted; notice; security; compliance - reserved for user-defined templates | Specifies the type of a problem. Including at least one tag is mandatory; multiple tags are allowed. |
For example, the trigger High memory utilization might contain the following tags:
For macros used in trigger expressions (thresholds) use this form:
Use MAX|MIN when you need to highlight whether it is the high or low threshold.
Good | Bad |
---|---|
{$MYSQL.REPLICATION_LAG.MAX.WARN} | |
{$TEMP.MAX.WARN:"{#SENSOR}"} | |
{$SERVICE.STATUS.CRIT} | |
{$IF.ERRORS.MAX.WARN} | |
{$DISK.STATUS.OK} | {$DISK_OK_STATUS} |
{$DISK.STATUS.WARN} | |
{$DISK.STATUS.CRIT} | |
{$MEM_UTIL.MAX.WARN} | {$MEMORY_UTIL_MAX} |
{$MEM_UTIL.MAX.CRIT} | |
Check the following trigger snippets library and consider reusing configuration to avoid reinventing the wheel.
Case: Something has just been restarted
Trigger: <resource> has just been restarted (uptime < 10m)
Applicable for | For uptime counters for device, host, or software/service running |
---|---|
Name | <resource> has been restarted |
Event name | <resource> has been restarted (uptime < 10m) |
Description | <resource> uptime is less than 10 minutes |
Expression | last(/TEMPLATE_NAME/METRIC)<10m |
Recovery expression | - |
Recovery mode | - |
Manual close | Yes |
Severity | Warning for the host. Info for all others. |
Depends on | - |
Case: Any master item + preprocessing in dependent items
Trigger: Master item is not responding
<resource>: Failed to get items (no data for 30m)
Applicable for | Any type of items used for bulk data collection |
---|---|
Expression | nodata(/TEMPLATE_NAME/temperature,30m)=1 |
Recovery expression | - |
Recovery mode | - |
Manual close | Yes |
Severity | Warning |
Depends on | If present: <Proc> is not running |
Case: HTTP item + regex preprocessing in dependent items
Trigger: HTTP item is not responding
Applicable for | HTTP items that provide output for future regex preprocessing. Use ‘Headers and Body’ mode in the item. |
---|---|
Case: <VALUE> is too high (over X)/ is too low (under X) for slow to change values
For slow-changing values (e.g. temperature), use max() for high thresholds and min() for low thresholds to get an immediate response with a delayed (confirmed) recovery.
Trigger: <VALUE> is too high (over X)
Applicable for | High temperature (slow to change) |
---|---|
Expression | max(/TEMPLATE_NAME/METRIC,5m) > X |
Trigger: <VALUE> is too low (under X)
Applicable for | Low temperature (slow to change) |
---|---|
Expression | min(/TEMPLATE_NAME/METRIC,5m) < X |
Case: <VALUE> is too high (over X for 5m)/ is too low (under X for 5m) for quick-to-change and jumpy values
For jumpy values, use min() for high thresholds and max() for low thresholds to make triggers more tolerant of spikes and noise.
Trigger: <VALUE> is too high (over X for 5m)
Applicable for | CPU utilization (jumpy), signal strength (jumpy), network utilization |
---|---|
Expression | min(/TEMPLATE_NAME/METRIC,5m) > X |
Trigger: <VALUE> is too low (under X for 5m)
Applicable for | CPU utilization (jumpy), signal strength (jumpy), network utilization |
---|---|
Expression | max(/TEMPLATE_NAME/METRIC,5m) < X |
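
The difference between the two "too high" patterns above can be seen in a small simulation. This is a simplified Python sketch, not Zabbix itself: the function `fires`, the sample readings, the threshold of 80, and the 3-sample window are all assumptions made for illustration, and partial windows at the start of the series are evaluated as-is.

```python
def fires(samples, window, threshold, agg):
    """For each sample, report whether agg(last `window` samples) > threshold."""
    result = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        result.append(agg(recent) > threshold)
    return result


# Hypothetical readings: one isolated spike (95), then a sustained rise (90, 90, 90).
readings = [50, 95, 50, 50, 90, 90, 90, 50]

# max() over the window: fires on the first spike and recovers only after
# the whole window is back under the threshold (delayed, confirmed recovery).
print(fires(readings, 3, 80, max))
# [False, True, True, True, True, True, True, True]

# min() over the window: ignores the isolated spike and fires only once
# every sample in the window is over the threshold.
print(fires(readings, 3, 80, min))
# [False, False, False, False, False, False, True, False]
```

With max() the single spike at the second reading already raises the problem, which suits slow-changing values where any excursion is meaningful. With min() the same spike is ignored, which suits jumpy values such as CPU utilization.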
Case: Serial number has changed on the device
Trigger: Serial numbers controls
Applicable for | Serial numbers items |
---|---|
Name | <resource> has been replaced |
Event name | <resource> has been replaced (new serial number received) |
Description | <resource> serial number has changed. Ack to close |
Expression | last(/TEMPLATE_NAME/METRIC)<>last(/TEMPLATE_NAME/METRIC,#2) and length(last(/TEMPLATE_NAME/METRIC))>0 |
Recovery expression | - |
Recovery mode | None |
Manual close | Yes |
Severity | Info |
Depends on | - |
Case: Software version has changed on the device
Trigger: Version controls
Applicable for | Software version items |
---|---|
Name | <resource> version has changed |
Event name | <resource> version has changed (new version: {ITEM.VALUE}) |
Description | <resource> version has changed. Ack to close |
Expression | last(/TEMPLATE_NAME/METRIC)<>last(/TEMPLATE_NAME/METRIC,#2) and length(last(/TEMPLATE_NAME/METRIC))>0 |
Recovery expression | - |
Recovery mode | None |
Manual close | Yes |
Severity | Info |
Depends on | - |
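
The logic shared by the serial number and version triggers above (the latest value differs from the previous one and is not empty) can be sketched in Python as follows. The function name `value_changed` and the sample values are hypothetical; the list stands in for item history with the newest value last.

```python
def value_changed(history):
    """Return True if the newest value differs from the previous one and is non-empty.

    `history` is a list of string values, newest last (an assumption of this sketch).
    """
    if len(history) < 2:
        return False  # nothing to compare against yet
    last, previous = history[-1], history[-2]
    return last != previous and len(last) > 0


print(value_changed(["SN-001", "SN-001"]))  # False: unchanged
print(value_changed(["SN-001", "SN-002"]))  # True: replaced
print(value_changed(["SN-001", ""]))        # False: empty reading is ignored
```

The non-empty check keeps the trigger from firing when a poll temporarily returns an empty string instead of a real value.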
Case: Control how much disk space is left
Trigger: Filesystem space is critically low with timeleft with context macro
{$VFS.FS.PUSED.MAX.CRIT:"__RESOURCE__"} = 90
Applicable for | Filesystems |
---|---|
Name | Disk space is critically low |
Event name | Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"__RESOURCE__"}) |
Description | Space used: {ITEM.VALUE3} of {ITEM.VALUE2} ({ITEM.VALUE1}), time left till full: < 24h. Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.CRIT:"__RESOURCE__"}. Second condition should be one of the following: - The disk free space is less than 5G. - The disk will be full in less than 24 hours. |
Expression | last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},pused])>{$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"} and ((last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},total])-last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},used]))<{$VFS.FS.FREE.MIN.CRIT:"{#FSNAME}"} or timeleft(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},pused],1h,100)<1d) |
Recovery expression | - |
Recovery mode | None |
Manual close | Yes |
Severity | Average |
Depends on | - |
Trigger: Filesystem space is low with timeleft with context macro
{$VFS.FS.PUSED.MAX.WARN:"__RESOURCE__"} = 80
Applicable for | Filesystems |
---|---|
Name | Disk space is low |
Event name | Disk space is low (used > {$VFS.FS.PUSED.MAX.WARN:"__RESOURCE__"}) |
Description | Space used: {ITEM.VALUE3} of {ITEM.VALUE2} ({ITEM.VALUE1}), time left till full: < 24h. Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.WARN:"__RESOURCE__"}. Second condition should be one of the following: - The disk free space is less than 10G. - The disk will be full in less than 24 hours. |
Expression | last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},pused])>{$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"} and ((last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},total])-last(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},used]))<{$VFS.FS.FREE.MIN.WARN:"{#FSNAME}"} or timeleft(/TEMPLATE_NAME/vfs.fs.size[{#FSNAME},pused],1h,100)<1d) |
Recovery expression | - |
Recovery mode | None |
Manual close | Yes |
Severity | Warning |
Depends on | Disk space is critically low. |
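
The condition stated in the Description rows of the two disk space triggers (utilization above the limit, and then either little free space or less than 24 hours until full) can be sketched in Python as below. The function name, its parameters, and the sample numbers are hypothetical; the defaults of 90% and 5G are taken from the critical trigger above.

```python
G = 1024 ** 3  # bytes in 1G (size suffixes are powers of 1024)


def disk_space_problem(pused, free_bytes, hours_to_full,
                       pused_max=90.0, free_min=5 * G):
    """Utilization over the limit AND (little free space OR full within 24h)."""
    return pused > pused_max and (free_bytes < free_min or hours_to_full < 24)


# 92% used and only 3G free: problem.
print(disk_space_problem(92, 3 * G, 500))    # True
# 92% used on a large disk with 200G free and slow growth: no problem.
print(disk_space_problem(92, 200 * G, 500))  # False
# 50% used but filling fast: no problem yet, utilization is under the limit.
print(disk_space_problem(50, 200 * G, 2))    # False
```

The grouping of the second condition matters because `and` binds more tightly than `or`: without the parentheses around the `or` part, the third case would evaluate to true on the time-left check alone, regardless of utilization.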