Hadoop

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

Available solutions




This template is for Zabbix version: 7.2

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/hadoop_http?at=release/7.2

Hadoop by HTTP

Overview

This template monitors Hadoop over HTTP and works without any external scripts. It collects metrics by polling the Hadoop API remotely using an HTTP agent and JSONPath preprocessing. The Zabbix server (or proxy) sends requests directly to the ResourceManager, NodeManager, NameNode, and DataNode APIs. All metrics are collected at once, thanks to Zabbix bulk data collection.
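
The bulk collection described above amounts to fetching one JSON document and deriving many values from it. The following Python sketch illustrates that idea only; it is not the template's implementation. The sample payload, the `Threading` bean, and the helper names `find_bean` and `derive_metrics` are assumptions made for illustration. Only the `java.lang:type=Runtime` bean name and its `Uptime` attribute are taken from the JSONPath expressions listed on this page.

```python
import json

# Hypothetical sample of what a single master-item poll might return.
SAMPLE = json.dumps({
    "beans": [
        {"name": "java.lang:type=Runtime", "Uptime": 600000},
        {"name": "java.lang:type=Threading", "ThreadCount": 42},
    ]
})


def find_bean(doc, name):
    """Return the first bean whose name matches exactly, or None."""
    for bean in doc.get("beans", []):
        if bean.get("name") == name:
            return bean
    return None


def derive_metrics(raw):
    """Parse the payload once, then derive several values from it."""
    doc = json.loads(raw)
    runtime = find_bean(doc, "java.lang:type=Runtime")
    threading = find_bean(doc, "java.lang:type=Threading")
    return {
        "uptime_ms": runtime["Uptime"] if runtime else None,
        "threads": threading["ThreadCount"] if threading else None,
    }
```

The point of the design is that the remote API is queried once per polling cycle, and every derived value is computed locally from that single response.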

Requirements

Zabbix version: 7.2 and higher.

Tested versions

This template has been tested on:

  • Hadoop 3.1 and later

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Define the IP address (or FQDN) and Web-UI port of the ResourceManager in the {$HADOOP.RESOURCEMANAGER.HOST} and {$HADOOP.RESOURCEMANAGER.PORT} macros, and those of the NameNode in the {$HADOOP.NAMENODE.HOST} and {$HADOOP.NAMENODE.PORT} macros. Macros can be set in the template or overridden at the host level.
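
To illustrate how a host-level value takes precedence over a template default, here is a small Python sketch. It only models the override order; it is not how Zabbix resolves macros internally. The default values are the ones listed under "Macros used", while `resolve_macros`, `expand`, and the host name `nn1.example.com` are hypothetical.

```python
from collections import ChainMap

# Template-level defaults, as listed in the "Macros used" table.
TEMPLATE_MACROS = {
    "{$HADOOP.RESOURCEMANAGER.HOST}": "ResourceManager",
    "{$HADOOP.RESOURCEMANAGER.PORT}": "8088",
    "{$HADOOP.NAMENODE.HOST}": "NameNode",
    "{$HADOOP.NAMENODE.PORT}": "9870",
}


def resolve_macros(host_macros):
    """Host-level macros are looked up first, template defaults second."""
    return ChainMap(host_macros, TEMPLATE_MACROS)


def expand(text, macros):
    """Replace every known macro name in the text with its resolved value."""
    for name, value in macros.items():
        text = text.replace(name, value)
    return text


# Hypothetical host that overrides only the NameNode address.
macros = resolve_macros({"{$HADOOP.NAMENODE.HOST}": "nn1.example.com"})
key = expand(
    'net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]',
    macros,
)
```

In this sketch the port is not overridden, so the expanded key keeps the template default of 9870.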

Macros used

Name | Description | Default
{$HADOOP.RESOURCEMANAGER.HOST} | The Hadoop ResourceManager host IP address or FQDN. | ResourceManager
{$HADOOP.RESOURCEMANAGER.PORT} | The Hadoop ResourceManager Web-UI port. | 8088
{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} | The Hadoop ResourceManager API page maximum response time in seconds for trigger expression. | 10s
{$HADOOP.NAMENODE.HOST} | The Hadoop NameNode host IP address or FQDN. | NameNode
{$HADOOP.NAMENODE.PORT} | The Hadoop NameNode Web-UI port. | 9870
{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} | The Hadoop NameNode API page maximum response time in seconds for trigger expression. | 10s
{$HADOOP.CAPACITY_REMAINING.MIN.WARN} | The Hadoop cluster capacity remaining percent for trigger expression. | 20

Items

Name Description Type Key and additional info
ResourceManager: Service status

Hadoop ResourceManager API port availability.

Simple check net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]

Preprocessing

  • Discard unchanged with heartbeat: 10m

ResourceManager: Service response time

Hadoop ResourceManager API performance.

Simple check net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"]
Get ResourceManager stats HTTP agent hadoop.resourcemanager.get
ResourceManager: Uptime Dependent item hadoop.resourcemanager.uptime

Preprocessing

  • JSON Path: $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()

  • Custom multiplier: 0.001

ResourceManager: Get info Dependent item hadoop.resourcemanager.info

Preprocessing

  • JSON Path: $.beans[?(@.name=~'Hadoop:service=ResourceManager,name=*')]

    ⛔️Custom on fail: Set value to: []

ResourceManager: RPC queue & processing time

Average time spent on processing RPC requests.

Dependent item hadoop.resourcemanager.rpc_processing_time_avg

Preprocessing

  • JSON Path: The text is too long. Please see the template.

ResourceManager: Active NMs

Number of Active NodeManagers.

Dependent item hadoop.resourcemanager.num_active_nm

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 6h

ResourceManager: Decommissioning NMs

Number of Decommissioning NodeManagers.

Dependent item hadoop.resourcemanager.num_decommissioning_nm

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 6h

ResourceManager: Decommissioned NMs

Number of Decommissioned NodeManagers.

Dependent item hadoop.resourcemanager.num_decommissioned_nm

Preprocessing

  • JSON Path: The text is too long. Please see the template.

ResourceManager: Lost NMs

Number of Lost NodeManagers.

Dependent item hadoop.resourcemanager.num_lost_nm

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 6h

ResourceManager: Unhealthy NMs

Number of Unhealthy NodeManagers.

Dependent item hadoop.resourcemanager.num_unhealthy_nm

Preprocessing

  • JSON Path: The text is too long. Please see the template.

ResourceManager: Rebooted NMs

Number of Rebooted NodeManagers.

Dependent item hadoop.resourcemanager.num_rebooted_nm

Preprocessing

  • JSON Path: The text is too long. Please see the template.

ResourceManager: Shutdown NMs

Number of Shutdown NodeManagers.

Dependent item hadoop.resourcemanager.num_shutdown_nm

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Service status

Hadoop NameNode API port availability.

Simple check net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]

Preprocessing

  • Discard unchanged with heartbeat: 10m

NameNode: Service response time

Hadoop NameNode API performance.

Simple check net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"]
Get NameNode stats HTTP agent hadoop.namenode.get
NameNode: Uptime Dependent item hadoop.namenode.uptime

Preprocessing

  • JSON Path: $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()

  • Custom multiplier: 0.001

NameNode: Get info Dependent item hadoop.namenode.info

Preprocessing

  • JSON Path: $.beans[?(@.name=~'Hadoop:service=NameNode,name=*')]

    ⛔️Custom on fail: Set value to: []

NameNode: RPC queue & processing time

Average time spent on processing RPC requests.

Dependent item hadoop.namenode.rpc_processing_time_avg

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Block Pool Renaming Dependent item hadoop.namenode.percent_block_pool_used

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Transactions since last checkpoint

Total number of transactions since last checkpoint.

Dependent item hadoop.namenode.transactions_since_last_checkpoint

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Percent capacity remaining

Available capacity in percent.

Dependent item hadoop.namenode.percent_remaining

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 6h

NameNode: Capacity remaining

Available capacity.

Dependent item hadoop.namenode.capacity_remaining

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Corrupt blocks

Number of corrupt blocks.

Dependent item hadoop.namenode.corrupt_blocks

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Missing blocks

Number of missing blocks.

Dependent item hadoop.namenode.missing_blocks

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Failed volumes

Number of failed volumes.

Dependent item hadoop.namenode.volume_failures_total

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Alive DataNodes

Count of alive DataNodes.

Dependent item hadoop.namenode.num_live_data_nodes

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 6h

NameNode: Dead DataNodes

Count of dead DataNodes.

Dependent item hadoop.namenode.num_dead_data_nodes

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 6h

NameNode: Stale DataNodes

DataNodes that do not send a heartbeat within 30 seconds are marked as "stale".

Dependent item hadoop.namenode.num_stale_data_nodes

Preprocessing

  • JSON Path: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 6h

NameNode: Total files

Total count of files tracked by the NameNode.

Dependent item hadoop.namenode.files_total

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Total load

The current number of concurrent file accesses (read/write) across all DataNodes.

Dependent item hadoop.namenode.total_load

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Blocks allocable

Maximum number of blocks allocable.

Dependent item hadoop.namenode.block_capacity

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Total blocks

Count of blocks tracked by NameNode.

Dependent item hadoop.namenode.blocks_total

Preprocessing

  • JSON Path: The text is too long. Please see the template.

NameNode: Under-replicated blocks

The number of blocks with insufficient replication.

Dependent item hadoop.namenode.under_replicated_blocks

Preprocessing

  • JSON Path: The text is too long. Please see the template.

Get NodeManagers states HTTP agent hadoop.nodemanagers.get

Preprocessing

  • JavaScript: The text is too long. Please see the template.

Get DataNodes states HTTP agent hadoop.datanodes.get

Preprocessing

  • JavaScript: The text is too long. Please see the template.
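
Several items above share two preprocessing patterns: selecting the `Uptime` attribute of the Runtime bean and scaling it by 0.001, and selecting beans by a name pattern with a fallback of `[]` on failure. The Python sketch below mimics both steps under stated assumptions. It is not the JSONPath engine Zabbix uses; the function names and the sample payload are hypothetical, and the regular expression is applied with Python's `re.search`, whose matching rules may differ in detail from those of Zabbix.

```python
import json
import re


def uptime_seconds(raw):
    """First Runtime bean's Uptime (milliseconds) scaled by 0.001 to seconds."""
    doc = json.loads(raw)
    matches = [
        bean["Uptime"]
        for bean in doc["beans"]
        if bean.get("name") == "java.lang:type=Runtime" and "Uptime" in bean
    ]
    if not matches:
        raise LookupError("no Runtime bean with an Uptime attribute")
    return matches[0] * 0.001


def service_beans(raw, pattern):
    """Beans whose name matches the pattern; [] if the payload is unusable."""
    try:
        doc = json.loads(raw)
        return [b for b in doc["beans"] if re.search(pattern, b.get("name", ""))]
    except (ValueError, KeyError, TypeError):
        # Mirrors the "Custom on fail: Set value to: []" step.
        return []


# Hypothetical payload with one service bean and one JVM bean.
PAYLOAD = json.dumps({
    "beans": [
        {"name": "java.lang:type=Runtime", "Uptime": 600000},
        {"name": "Hadoop:service=ResourceManager,name=ClusterMetrics"},
    ]
})
```

Returning an empty list on failure keeps the dependent items downstream from becoming unsupported when the payload is missing or malformed, which appears to be the purpose of the fallback in the template.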

Triggers

Name Description Expression Severity Dependencies and additional info
Hadoop: ResourceManager: Service is unavailable last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"])=0 Average Manual close: Yes
Hadoop: ResourceManager: Service response time is too high min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.RESOURCEMANAGER.HOST}","{$HADOOP.RESOURCEMANAGER.PORT}"],5m)>{$HADOOP.RESOURCEMANAGER.RESPONSE_TIME.MAX.WARN} Warning Manual close: Yes
Depends on:
  • Hadoop: ResourceManager: Service is unavailable
Hadoop: ResourceManager: Service has been restarted

Uptime is less than 10 minutes.

last(/Hadoop by HTTP/hadoop.resourcemanager.uptime)<10m Info Manual close: Yes
Hadoop: ResourceManager: Failed to fetch ResourceManager API page

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Hadoop by HTTP/hadoop.resourcemanager.uptime,30m)=1 Warning Manual close: Yes
Depends on:
  • Hadoop: ResourceManager: Service is unavailable
Hadoop: ResourceManager: Cluster has no active NodeManagers

Cluster is unable to execute any jobs without at least one NodeManager.

max(/Hadoop by HTTP/hadoop.resourcemanager.num_active_nm,5m)=0 High
Hadoop: ResourceManager: Cluster has unhealthy NodeManagers

YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and large numbers of unhealthyNodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved.

min(/Hadoop by HTTP/hadoop.resourcemanager.num_unhealthy_nm,15m)>0 Average
Hadoop: NameNode: Service is unavailable last(/Hadoop by HTTP/net.tcp.service["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"])=0 Average Manual close: Yes
Hadoop: NameNode: Service response time is too high min(/Hadoop by HTTP/net.tcp.service.perf["tcp","{$HADOOP.NAMENODE.HOST}","{$HADOOP.NAMENODE.PORT}"],5m)>{$HADOOP.NAMENODE.RESPONSE_TIME.MAX.WARN} Warning Manual close: Yes
Depends on:
  • Hadoop: NameNode: Service is unavailable
Hadoop: NameNode: Service has been restarted

Uptime is less than 10 minutes.

last(/Hadoop by HTTP/hadoop.namenode.uptime)<10m Info Manual close: Yes
Hadoop: NameNode: Failed to fetch NameNode API page

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Hadoop by HTTP/hadoop.namenode.uptime,30m)=1 Warning Manual close: Yes
Depends on:
  • Hadoop: NameNode: Service is unavailable
Hadoop: NameNode: Cluster capacity remaining is low

A good practice is to ensure that disk use never exceeds 80 percent capacity.

max(/Hadoop by HTTP/hadoop.namenode.percent_remaining,15m)<{$HADOOP.CAPACITY_REMAINING.MIN.WARN} Warning
Hadoop: NameNode: Cluster has missing blocks

A missing block is far worse than a corrupt block, because a missing block cannot be recovered by copying a replica.

min(/Hadoop by HTTP/hadoop.namenode.missing_blocks,15m)>0 Average
Hadoop: NameNode: Cluster has volume failures

HDFS now allows for disks to fail in place, without affecting DataNode operations, until a threshold value is reached. This is set on each DataNode via the dfs.datanode.failed.volumes.tolerated property; it defaults to 0, meaning that any volume failure will shut down the DataNode; on a production cluster where DataNodes typically have 6, 8, or 12 disks, setting this parameter to 1 or 2 is typically the best practice.

min(/Hadoop by HTTP/hadoop.namenode.volume_failures_total,15m)>0 Average
Hadoop: NameNode: Cluster has DataNodes in Dead state

The death of a DataNode causes a flurry of network activity, as the NameNode initiates replication of blocks lost on the dead nodes.

min(/Hadoop by HTTP/hadoop.namenode.num_dead_data_nodes,5m)>0 Average
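
The trigger expressions above combine an aggregate function over a time window with a comparison. The Python sketch below shows why `min(...,5m)>threshold` fires only on sustained slowness, why `max(...,5m)=0` fires only when no sample was above zero, and what `nodata(...,30m)=1` checks. It is a simplified model, not Zabbix's evaluator: the `(timestamp, value)` history format and all function names are hypothetical, and returning False for an empty window is an assumption of this sketch.

```python
def window(history, now, seconds):
    """Values whose timestamp falls within the last `seconds` before `now`."""
    return [value for ts, value in history if now - seconds <= ts <= now]


def response_time_too_high(history, now, threshold_s=10.0, period_s=300):
    """min(..., 5m) > threshold: true only if every sample in the window is slow."""
    values = window(history, now, period_s)
    return bool(values) and min(values) > threshold_s


def no_active_nodemanagers(history, now, period_s=300):
    """max(..., 5m) = 0: true only if no sample in the window was above zero."""
    values = window(history, now, period_s)
    return bool(values) and max(values) == 0


def nodata(history, now, period_s=1800):
    """nodata(..., 30m) = 1: true if no value arrived during the period."""
    return not window(history, now, period_s)
```

Using `min` for "too high" conditions and `max` for "equals zero" conditions means a single outlier sample does not raise a problem; the condition has to hold for the whole window.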

LLD rule Node manager discovery

Name Description Type Key and additional info
Node manager discovery HTTP agent hadoop.nodemanager.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

Item prototypes for Node manager discovery

Name Description Type Key and additional info
Hadoop NodeManager {#HOSTNAME}: Get stats HTTP agent hadoop.nodemanager.get[{#HOSTNAME}]
{#HOSTNAME}: RPC queue & processing time

Average time spent on processing RPC requests.

Dependent item hadoop.nodemanager.rpc_processing_time_avg[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: Container launch avg duration Dependent item hadoop.nodemanager.container_launch_duration_avg[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: JVM Threads

The number of JVM threads.

Dependent item hadoop.nodemanager.jvm.threads[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: JVM Garbage collection time

The JVM garbage collection time in milliseconds.

Dependent item hadoop.nodemanager.jvm.gc_time[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: JVM Heap usage

The JVM heap usage in MBytes.

Dependent item hadoop.nodemanager.jvm.mem_heap_used[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: Uptime Dependent item hadoop.nodemanager.uptime[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()

  • Custom multiplier: 0.001

Hadoop NodeManager {#HOSTNAME}: Get raw info Dependent item hadoop.nodemanager.raw_info[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.[?(@.HostName=='{#HOSTNAME}')].first()

    ⛔️Custom on fail: Discard value

{#HOSTNAME}: State

State of the node - valid values are: NEW, RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN.

Dependent item hadoop.nodemanager.state[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.State

  • Discard unchanged with heartbeat: 6h

{#HOSTNAME}: Version Dependent item hadoop.nodemanager.version[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.NodeManagerVersion

  • Discard unchanged with heartbeat: 6h

{#HOSTNAME}: Number of containers Dependent item hadoop.nodemanager.numcontainers[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.NumContainers

{#HOSTNAME}: Used memory Dependent item hadoop.nodemanager.usedmemory[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.UsedMemoryMB

{#HOSTNAME}: Available memory Dependent item hadoop.nodemanager.availablememory[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.AvailableMemoryMB
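
Items such as State and Version use the "Discard unchanged with heartbeat" step, which stores a value only when it changes or when the heartbeat period has elapsed since the last stored value. The Python sketch below models that behavior in a simplified way. The class name and its `accept` method are hypothetical, and the exact boundary handling (`>=` rather than `>`) is an assumption, not a statement about Zabbix internals.

```python
class DiscardUnchangedWithHeartbeat:
    """Keep a value if it changed, or if the heartbeat period has passed."""

    def __init__(self, heartbeat_s):
        self.heartbeat_s = heartbeat_s
        self._last_value = None
        self._last_kept_at = None

    def accept(self, value, now):
        """Return True if the value should be stored, False if discarded."""
        first = self._last_kept_at is None
        changed = first or value != self._last_value
        expired = not first and now - self._last_kept_at >= self.heartbeat_s
        if changed or expired:
            self._last_value = value
            self._last_kept_at = now
            return True
        return False


# 6h heartbeat, as used by the State and Version item prototypes.
throttle = DiscardUnchangedWithHeartbeat(heartbeat_s=6 * 3600)
results = [
    throttle.accept("RUNNING", now=0),            # first value is kept
    throttle.accept("RUNNING", now=60),           # unchanged, discarded
    throttle.accept("LOST", now=120),             # changed, kept
    throttle.accept("LOST", now=120 + 6 * 3600),  # heartbeat elapsed, kept
]
```

For rarely changing values this reduces stored history to state transitions plus a periodic confirmation that the value is still being collected.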

Trigger prototypes for Node manager discovery

Name Description Expression Severity Dependencies and additional info
Hadoop: {#HOSTNAME}: Service has been restarted

Uptime is less than 10 minutes.

last(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}])<10m Info Manual close: Yes
Hadoop: {#HOSTNAME}: Failed to fetch NodeManager API page

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Hadoop by HTTP/hadoop.nodemanager.uptime[{#HOSTNAME}],30m)=1 Warning Manual close: Yes
Depends on:
  • Hadoop: {#HOSTNAME}: NodeManager has state {ITEM.VALUE}.
Hadoop: {#HOSTNAME}: NodeManager has state {ITEM.VALUE}.

The state is different from normal.

last(/Hadoop by HTTP/hadoop.nodemanager.state[{#HOSTNAME}])<>"RUNNING" Average
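
The "Depends on" entries above mean that a dependent problem is not raised while the trigger it depends on is already in a problem state. The following Python sketch models only that direct relationship; the function name and the shortened trigger names are hypothetical, and the sketch does not follow chains of dependencies.

```python
def visible_problems(firing, depends_on):
    """Drop problems whose direct parent trigger is also firing."""
    return {
        name
        for name in firing
        if not any(parent in firing for parent in depends_on.get(name, ()))
    }


# Shortened, illustrative trigger names.
DEPENDS_ON = {
    "Failed to fetch NodeManager API page": ["NodeManager has state"],
}

both = visible_problems(
    {"Failed to fetch NodeManager API page", "NodeManager has state"},
    DEPENDS_ON,
)
only_fetch = visible_problems({"Failed to fetch NodeManager API page"}, DEPENDS_ON)
```

When a NodeManager is in a state other than RUNNING, the fetch failure is usually a consequence of that state, so only the state problem is shown.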

LLD rule Data node discovery

Name Description Type Key and additional info
Data node discovery HTTP agent hadoop.datanode.discovery

Preprocessing

  • JavaScript: The text is too long. Please see the template.

Item prototypes for Data node discovery

Name Description Type Key and additional info
Hadoop DataNode {#HOSTNAME}: Get stats HTTP agent hadoop.datanode.get[{#HOSTNAME}]
{#HOSTNAME}: Remaining

Remaining disk space.

Dependent item hadoop.datanode.remaining[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: Used

Used disk space.

Dependent item hadoop.datanode.dfs_used[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: Number of failed volumes

Number of failed storage volumes.

Dependent item hadoop.datanode.numfailedvolumes[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: JVM Threads

The number of JVM threads.

Dependent item hadoop.datanode.jvm.threads[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: JVM Garbage collection time

The JVM garbage collection time in milliseconds.

Dependent item hadoop.datanode.jvm.gc_time[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: JVM Heap usage

The JVM heap usage in MBytes.

Dependent item hadoop.datanode.jvm.mem_heap_used[{#HOSTNAME}]

Preprocessing

  • JSON Path: The text is too long. Please see the template.

{#HOSTNAME}: Uptime Dependent item hadoop.datanode.uptime[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.beans[?(@.name=='java.lang:type=Runtime')].Uptime.first()

  • Custom multiplier: 0.001

Hadoop DataNode {#HOSTNAME}: Get raw info Dependent item hadoop.datanode.raw_info[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.[?(@.HostName=='{#HOSTNAME}')].first()

    ⛔️Custom on fail: Discard value

{#HOSTNAME}: Version

DataNode software version.

Dependent item hadoop.datanode.version[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.version

  • Discard unchanged with heartbeat: 6h

{#HOSTNAME}: Admin state

Administrative state.

Dependent item hadoop.datanode.admin_state[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.adminState

  • Discard unchanged with heartbeat: 6h

{#HOSTNAME}: Oper state

Operational state.

Dependent item hadoop.datanode.oper_state[{#HOSTNAME}]

Preprocessing

  • JSON Path: $.operState

  • Discard unchanged with heartbeat: 6h
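
The "Get raw info" item selects the record for one discovered host from a list, and the Version, Admin state, and Oper state items then read single fields from that record. The Python sketch below mirrors that two-step selection. The field names `HostName`, `version`, `adminState`, and `operState` come from the JSONPath expressions above; the function name, the host names, and the sample values are hypothetical, and returning `None` stands in for the "Discard value" step.

```python
import json


def raw_info(raw, hostname):
    """Return the first record for the host, or None to signal 'discard'."""
    try:
        records = json.loads(raw)
        matches = [r for r in records if r.get("HostName") == hostname]
    except (ValueError, AttributeError, TypeError):
        return None
    return matches[0] if matches else None


# Hypothetical list of DataNode states with two hosts.
STATES = json.dumps([
    {"HostName": "dn1.example.com", "version": "3.3.6",
     "adminState": "In Service", "operState": "Live"},
    {"HostName": "dn2.example.com", "version": "3.3.6",
     "adminState": "In Service", "operState": "Dead"},
])

record = raw_info(STATES, "dn2.example.com")
oper_state = record["operState"] if record else None
```

Discarding the value when the host is absent from the list avoids storing an empty record for a DataNode that has been removed from the cluster.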

Trigger prototypes for Data node discovery

Name Description Expression Severity Dependencies and additional info
Hadoop: {#HOSTNAME}: Service has been restarted

Uptime is less than 10 minutes.

last(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}])<10m Info Manual close: Yes
Hadoop: {#HOSTNAME}: Failed to fetch DataNode API page

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Hadoop by HTTP/hadoop.datanode.uptime[{#HOSTNAME}],30m)=1 Warning Manual close: Yes
Depends on:
  • Hadoop: {#HOSTNAME}: DataNode has state {ITEM.VALUE}.
Hadoop: {#HOSTNAME}: DataNode has state {ITEM.VALUE}.

The state is different from normal.

last(/Hadoop by HTTP/hadoop.datanode.oper_state[{#HOSTNAME}])<>"Live" Average

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.
