etcd

etcd is an open source distributed key-value store used to hold and manage the critical information that distributed systems need to keep running. Most notably, it manages the configuration data, state data, and metadata for Kubernetes, the popular container orchestration platform.

This template is for Zabbix version: 7.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/7.0

Etcd by HTTP

Overview

This template monitors etcd with Zabbix and works without any external scripts. Most metrics are collected in a single request, thanks to Zabbix bulk data collection.

The Etcd by HTTP template collects metrics with the HTTP agent from the /metrics endpoint.

Refer to the vendor documentation.

For users of etcd version 3.4 and earlier:

Some metrics were deprecated in etcd v3.5. For details, see Upgrade etcd from 3.4 to 3.5. Please upgrade your etcd instance, or use an older version of the Etcd by HTTP template.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • Etcd 3.5.6

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

  1. Make sure that etcd allows the collection of metrics. You can test it by running: curl -L http://localhost:2379/metrics.

  2. Check that etcd is accessible from the Zabbix proxy or Zabbix server, depending on where you plan to run the monitoring. To verify it, run curl -L http://<etcd_node_address>:2379/metrics.

  3. Add the template to the etcd node. Set the hostname or IP address of the etcd host in the {$ETCD.HOST} macro. By default, the template uses the client port. You can change the location of the metrics endpoint with the etcd --listen-metrics-urls flag.

For more details, see the etcd documentation.

Additional points to consider:

  • If you have specified a non-standard port for etcd, remember to change the {$ETCD.SCHEME} and {$ETCD.PORT} macros.
  • You can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template and override them on a host level if necessary.
  • To test availability, run: zabbix_get -s etcd-host -k etcd.health.
  • Review the macros section, because the macros set the trigger thresholds.
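
The checks in steps 1 and 2 can also be scripted. The following is a minimal Python sketch, assuming the default macro values listed on this page. The function names build_metrics_url and fetch_metrics are hypothetical and are not part of the template.

```python
from urllib.request import urlopen


def build_metrics_url(scheme: str = "http", host: str = "localhost", port: int = 2379) -> str:
    """Compose the URL from values that mirror {$ETCD.SCHEME}, {$ETCD.HOST} and {$ETCD.PORT}."""
    return f"{scheme}://{host}:{port}/metrics"


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text served by etcd (raises on connection errors)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Running fetch_metrics(build_metrics_url(host="<etcd_node_address>")) on the Zabbix server or proxy host should return the same text as the curl command, provided the node is reachable from there.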

Macros used

Name Description Default
{$ETCD.HOST}

The hostname or IP address of the etcd API endpoint.

<SET ETCD HOST>
{$ETCD.PORT}

The port of the etcd API endpoint.

2379
{$ETCD.SCHEME}

The request scheme which may be http or https.

http
{$ETCD.USER}
{$ETCD.PASSWORD}
{$ETCD.LEADER.CHANGES.MAX.WARN}

The maximum number of leader changes.

5
{$ETCD.PROPOSAL.FAIL.MAX.WARN}

The maximum number of proposal failures.

2
{$ETCD.HTTP.FAIL.MAX.WARN}

The maximum number of HTTP request failures.

2
{$ETCD.PROPOSAL.PENDING.MAX.WARN}

The maximum number of proposals in queue.

5
{$ETCD.OPEN.FDS.MAX.WARN}

The maximum percentage of used file descriptors.

90
{$ETCD.GRPC_CODE.MATCHES}

The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

.*
{$ETCD.GRPC_CODE.NOT_MATCHES}

The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

CHANGE_IF_NEEDED
{$ETCD.GRPC.ERRORS.MAX.WARN}

The maximum number of gRPC request failures.

1
{$ETCD.GRPC_CODE.TRIGGER.MATCHES}

The filter of discoverable gRPC codes, which will create triggers.

Aborted|Unavailable
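
To see how the three gRPC code filter macros interact, here is a rough Python sketch. It assumes that the filters behave like unanchored regular-expression matches and that the trigger filter is applied on top of the discovery filters. The helper names and the list of codes are illustrative only.

```python
import re

# Default macro values copied from the table above.
GRPC_CODE_MATCHES = r".*"
GRPC_CODE_NOT_MATCHES = r"CHANGE_IF_NEEDED"
GRPC_CODE_TRIGGER_MATCHES = r"Aborted|Unavailable"


def is_discovered(code: str) -> bool:
    """A code is kept when it matches MATCHES and does not match NOT_MATCHES."""
    return bool(re.search(GRPC_CODE_MATCHES, code)) and not re.search(GRPC_CODE_NOT_MATCHES, code)


def gets_trigger(code: str) -> bool:
    """Assumption: only discovered codes that match TRIGGER.MATCHES receive a trigger."""
    return is_discovered(code) and bool(re.search(GRPC_CODE_TRIGGER_MATCHES, code))


codes = ["OK", "Aborted", "NotFound", "Unavailable"]
print([c for c in codes if is_discovered(c)])  # all four codes pass the default filters
print([c for c in codes if gets_trigger(c)])   # only Aborted and Unavailable
```

With the defaults, every code is discovered and only Aborted and Unavailable get a trigger. Setting {$ETCD.GRPC_CODE.NOT_MATCHES} to a real pattern would narrow the discovered set.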

Items

Name Description Type Key and additional info
Service's TCP port state Simple check net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"]

Preprocessing

  • Discard unchanged with heartbeat: 10m

Get node metrics HTTP agent etcd.get_metrics
Node health HTTP agent etcd.health

Preprocessing

  • JSON Path: $.health

  • Boolean to decimal

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m
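
The first two preprocessing steps of the Node health item can be approximated as follows. This is a sketch under assumptions: the sample response bodies are illustrative, the set of accepted boolean strings is a reduced subset of what Zabbix recognizes, and the heartbeat step is not modelled. The function name health_to_decimal is hypothetical.

```python
import json

# Assumption: a reduced set of boolean strings; Zabbix accepts more variants.
TRUE_VALUES = {"true", "t", "yes", "y", "on", "up", "ok"}
FALSE_VALUES = {"false", "f", "no", "n", "off", "down", "err"}


def health_to_decimal(body: str) -> int:
    """Apply JSON Path $.health, then Boolean to decimal with a fallback of 0."""
    # Step 1: JSON Path $.health (errors propagate, as no custom handler is listed for it).
    value = json.loads(body)["health"]
    # Step 2: Boolean to decimal.
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return 1
    if text in FALSE_VALUES:
        return 0
    # Conversion failed: mirrors "Custom on fail: Set value to: 0".
    return 0


print(health_to_decimal('{"health":"true"}'))
print(health_to_decimal('{"health":"false"}'))
```

Because unrecognized values fall back to 0, the healthcheck trigger fires for them as well, not only for an explicit "false".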

Server is a leader

Indicates whether this member is a leader:

1 - it is;

0 - otherwise.

Dependent item etcd.is.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_is_leader)

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Server has a leader

Indicates whether a leader exists:

1 - it exists;

0 - it does not.

Dependent item etcd.has.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_has_leader)

  • Discard unchanged with heartbeat: 10m
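
A Prometheus pattern step such as VALUE(etcd_server_has_leader) picks one sample out of the text returned by the master item. The sketch below imitates that for unlabeled metrics only. The sample payload is made up for illustration, and prometheus_value is a hypothetical helper, not a Zabbix function.

```python
import re
from typing import Optional

# Illustrative payload in the Prometheus text exposition format.
SAMPLE = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 3
"""


def prometheus_value(text: str, metric: str) -> Optional[float]:
    """Return the value of the first unlabeled sample named `metric`, or None if absent."""
    pattern = re.compile("^" + re.escape(metric) + r"\s+(\S+)", re.MULTILINE)
    match = pattern.search(text)
    return float(match.group(1)) if match else None


print(prometheus_value(SAMPLE, "etcd_server_has_leader"))
print(prometheus_value(SAMPLE, "etcd_server_leader_changes_seen_total"))
```

A metric that is missing from the payload yields None here. In the template, items with "Custom on fail: Set value to: 0" store 0 in that situation instead.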

Leader changes

The number of leader changes the member has seen since its start.

Dependent item etcd.leader.changes

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_leader_changes_seen_total)

Proposals committed per second

The number of consensus proposals committed.

Dependent item etcd.proposals.committed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_committed_total)

  • Change per second
Proposals applied per second

The number of consensus proposals applied.

Dependent item etcd.proposals.applied.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_applied_total)

  • Change per second
Proposals failed per second

The number of failed proposals seen.

Dependent item etcd.proposals.failed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_failed_total)

  • Change per second
Proposals pending

The current number of pending proposals to commit.

Dependent item etcd.proposals.pending

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_pending)
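
The rate items above end with a Change per second step, which turns a monotonically increasing counter into a per-second rate. A minimal sketch of that calculation follows. The names Sample and change_per_second are hypothetical, and returning None stands in for a value that is not stored.

```python
from typing import Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def change_per_second(prev: Sample, curr: Sample) -> Optional[float]:
    """Compute (value - prev_value) / (time - prev_time) for two consecutive samples."""
    prev_time, prev_value = prev
    curr_time, curr_value = curr
    # No rate can be derived if time did not advance or the counter went backwards
    # (for example, after an etcd restart reset it).
    if curr_time <= prev_time or curr_value < prev_value:
        return None
    return (curr_value - prev_value) / (curr_time - prev_time)


print(change_per_second((0.0, 100.0), (60.0, 220.0)))  # 120 proposals over 60 seconds
```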

Reads per second

The number of read actions by get/getRecursive, local to this member.

Dependent item etcd.reads.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_reads_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Writes per second

The number of writes (e.g., set/compareAndDelete) seen by this member.

Dependent item etcd.writes.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_writes_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

Dependent item etcd.network.grpc.received.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_received_bytes_total)

  • Change per second
Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

Dependent item etcd.network.grpc.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_sent_bytes_total)

  • Change per second
HTTP requests received

The number of requests received into the system (successfully parsed and authenticated).

Dependent item etcd.http.requests.rate

Preprocessing

  • Prometheus to JSON: etcd_http_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
HTTP 5XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 5XX.

Dependent item etcd.http.requests.5xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"5.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
HTTP 4XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 4XX.

Dependent item etcd.http.requests.4xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"4.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs received per second

The number of gRPC stream messages received on the server.

Dependent item etcd.grpc.received.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs sent per second

The number of gRPC stream messages sent by the server.

Dependent item etcd.grpc.sent.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_sent_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs started per second

The number of RPCs started on the server.

Dependent item etcd.grpc.started.rate

Preprocessing

  • Prometheus to JSON: grpc_server_started_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Get version HTTP agent etcd.get_version
Server version

The version of the etcd server.

Dependent item etcd.server.version

Preprocessing

  • JSON Path: $.etcdserver

  • Discard unchanged with heartbeat: 1d

Cluster version

The version of the etcd cluster.

Dependent item etcd.cluster.version

Preprocessing

  • JSON Path: $.etcdcluster

  • Discard unchanged with heartbeat: 1d

DB size

The total size of the underlying database.

Dependent item etcd.db.size

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_db_total_size_in_bytes)

Keys compacted per second

The number of DB keys compacted per second.

Dependent item etcd.keys.compacted.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_db_compaction_keys_total)

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Keys expired per second

The number of expired keys per second.

Dependent item etcd.keys.expired.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_store_expires_total)

  • Change per second
Keys total

The total number of keys.

Dependent item etcd.keys.total

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_keys_total)

Uptime

Etcd server uptime.

Dependent item etcd.uptime

Preprocessing

  • Prometheus pattern: VALUE(process_start_time_seconds)

  • JavaScript: The text is too long. Please see the template.

Virtual memory

The size of virtual memory expressed in bytes.

Dependent item etcd.virtual.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

Resident memory

The size of resident memory expressed in bytes.

Dependent item etcd.res.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

CPU

The total user and system CPU time spent in seconds.

Dependent item etcd.cpu.util

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

  • Change per second
Open file descriptors

The number of open file descriptors.

Dependent item etcd.open.fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

Maximum open file descriptors

The maximum number of open file descriptors.

Dependent item etcd.max.fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

Deletes per second

The number of deletes seen by this member per second.

Dependent item etcd.delete.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_delete_total)

  • Change per second
PUT per second

The number of puts seen by this member per second.

Dependent item etcd.put.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_put_total)

  • Change per second
Range per second

The number of ranges seen by this member per second.

Dependent item etcd.range.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Transaction per second

The number of transactions seen by this member per second.

Dependent item etcd.txn.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Pending events

The total number of pending events to be sent.

Dependent item etcd.events.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_pending_events_total)

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"])=0 Average Manual close: Yes
Etcd: Node healthcheck failed

See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.

last(/Etcd by HTTP/etcd.health)=0 Average Depends on:
  • Etcd: Service is unavailable
Etcd: Failed to fetch info data

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Etcd by HTTP/etcd.is.leader,30m)=1 Warning Manual close: Yes
Depends on:
  • Etcd: Service is unavailable
Etcd: Member has no leader

If a member does not have a leader, it is totally unavailable.

last(/Etcd by HTTP/etcd.has.leader)=0 Average
Etcd: Instance has seen too many leader changes

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN} Warning
Etcd: Too many proposal failures

Proposal failures are normally related to one of two issues: temporary failures during a leader election, or longer downtime caused by a loss of quorum in the cluster.

min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN} Warning
Etcd: Too many proposals are queued to commit

A rising number of pending proposals suggests that there is a high client load or that the member cannot commit proposals.

min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN} Warning
Etcd: Too many HTTP requests failures

Too many requests on the etcd instance failed with a 5xx HTTP code.

min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN} Warning
Etcd: Server version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0 Info Manual close: Yes
Etcd: Cluster version has changed

Etcd cluster version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0 Info Manual close: Yes
Etcd: Host has been restarted

Uptime is less than 10 minutes.

last(/Etcd by HTTP/etcd.uptime)<10m Info Manual close: Yes
Etcd: Current number of open files is too high

Heavy file descriptor usage (that is, usage near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.
If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.

min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN} Warning
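
The logic behind the windowed trigger expressions above can be restated in plain Python. This is only an illustration of the arithmetic, assuming the default macro values from this page. The function names are hypothetical, and each sequence stands for the values Zabbix collected during the evaluation window.

```python
from typing import Sequence


def too_many_leader_changes(changes_15m: Sequence[float], max_warn: float = 5) -> bool:
    """max(...,15m) - min(...,15m) > {$ETCD.LEADER.CHANGES.MAX.WARN}"""
    return max(changes_15m) - min(changes_15m) > max_warn


def sustained_above(values_5m: Sequence[float], threshold: float) -> bool:
    """min(...,5m) > threshold: true only when every value in the window exceeds it."""
    return min(values_5m) > threshold


def open_fds_too_high(open_fds_5m: Sequence[float], max_fds: float, max_warn_pct: float = 90) -> bool:
    """min(open.fds,5m) / last(max.fds) * 100 > {$ETCD.OPEN.FDS.MAX.WARN}"""
    return min(open_fds_5m) / max_fds * 100 > max_warn_pct


print(too_many_leader_changes([10, 12, 17]))   # 7 changes within the window
print(sustained_above([3.0, 1.0, 4.0], 2))     # one dip below the threshold
print(open_fds_too_high([950, 930, 960], 1024))
```

Using min() over the window means a single low sample keeps the trigger from firing, so short spikes do not raise problems.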

LLD rule gRPC codes discovery

Name Description Type Key and additional info
gRPC codes discovery Dependent item etcd.grpc_code.discovery

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for gRPC codes discovery

Name Description Type Key and additional info
RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

Dependent item etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second

Trigger prototypes for gRPC codes discovery

Name Description Expression Severity Dependencies and additional info
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE} min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN} Warning

LLD rule Peers discovery

Name Description Type Key and additional info
Peers discovery Dependent item etcd.peer.discovery

Preprocessing

  • Prometheus to JSON: etcd_network_peer_sent_bytes_total

Item prototypes for Peers discovery

Name Description Type Key and additional info
Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"})

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Send failures

The number of send failures from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
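
Peer discovery relies on the To label of etcd_network_peer_sent_bytes_total, as the Bytes sent prototype above shows. The sketch below derives a list of {#ETCD.PEER} values from such samples. The peer IDs in the payload are invented for illustration, and discover_peers is a hypothetical helper that only approximates what the LLD rule produces.

```python
import re
from typing import Dict, List

# Illustrative payload; the peer IDs are made up.
SAMPLE = """\
etcd_network_peer_sent_bytes_total{To="8e9e05c52164694d"} 1024
etcd_network_peer_sent_bytes_total{To="91bc3c398fb3c146"} 2048
"""


def discover_peers(text: str, metric: str = "etcd_network_peer_sent_bytes_total") -> List[Dict[str, str]]:
    """Return one LLD-style row per distinct value of the To label."""
    pattern = re.compile("^" + re.escape(metric) + r'\{To="([^"]+)"\}', re.MULTILINE)
    peer_ids = dict.fromkeys(pattern.findall(text))  # de-duplicate, keep first-seen order
    return [{"{#ETCD.PEER}": peer_id} for peer_id in peer_ids]


print(discover_peers(SAMPLE))
```

Each returned row corresponds to one set of item prototypes, for example etcd.bytes.sent.rate[8e9e05c52164694d] for the first sample peer.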

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at the Zabbix forums.
