Zabbix Administrators Guide

Dashboards

Zabbix provides several types of view that serve different functions, each of which could be considered a “dashboard” of sorts.

NB. If a Screen, Map or IT Service (SLA) contains any host that the user does not have access to, an “object doesn’t exist” error is returned.

System Dashboard

A high-level status of the system and the monitored hosts as a whole. Bookmarked screens, maps and graphs are also found here.

Not much customisation can be done to the Dashboard, but what is shown can be filtered using the spanner in the top right-hand corner.

Overview

This dashboard visualises alerts (and data values) in a matrix of hosts and metrics; either hosts or metrics can be placed along the top, with the other down the side. Results can be queried by host group and filtered on many fields.

Triggers

A simple snapshot of current alerts in table format. It can be ordered by severity, name or date, searched by host group and host, and filtered on many fields.

Triggers can be acknowledged by a user, who should enter a brief comment on the situation and, ideally, the reference number of any ticket raised against the alert. Clicking show in the description column displays any actions operators should take, as previously configured by an administrator in the trigger configuration.

Maps

Maps are a way to present a visual topology of the interconnecting systems and services. Custom images (page section 3) can be added for improved usability.

Screens

Screens are by far the most configurable dashboard in Zabbix. They allow multiple columns and rows where each cell can contain a number of different types of elements, such as a graph, trigger overview, events list, maps, live issues by host group, etc.

Screens come in two forms: templated and custom.

Templated screens set a model for how a screen should look for any host linked to the template. Every host linked to the template automatically gets all of its templated screens. Templated screens are limited to the monitoring items that are available to the template, including inherited items.

Custom screens are for special cases where the scope of the screen goes beyond the items that are configured on or linked to a template. Custom screens have far more element types available and can even include data from different hosts and host groups.

Configuring screens

As an administrator, the configuration section for custom screens is found under the screens sub-section of the monitoring header. If all screens are not immediately visible, click the all screens breadcrumb root.

Clicking create opens a new screen properties page, where the screen name and the number of columns and rows can be configured, the owner changed and permissions assigned to users and user groups. Clicking the properties link of a previously created screen reopens this page.

Clicking the name of any screen will take you to the user view of the configured screen.

In this section the user can click on the host name to get a menu to other areas and tools for the host. Clicking or hovering over any alerts will present the user with a brief list of historic events (full list is available by clicking the date) and the description of the trigger (which could contain what actions are recommended for the event).

Once the user has taken all the necessary actions, they should click yes in the Ack column, enter a comment and any reference and save to mark the alert as actioned.

If the user has read-write privileges to the screen, they will see an edit button, which takes them to the constructor section (also available from the all screens section).

Here is where the magic happens. By default, the user will see a table of empty cells with a change link inside. Around the outside of the cells are plus (+) and minus (-) symbols. These are used to add or remove columns and rows.

Clicking the change link opens a dialogue where the user can select the type of element to populate the cell with, its presentation options and any additional parameters required by that resource type.

The URL element type embeds external content in the screen.

Users and User Groups

Users in Zabbix have been set up for individuals and for monitoring screens. User groups have been set up to define roles and permissions to hosts. For example:

User Groups

Group::Read-Write – allows all users as part of this group to be able to modify all monitored instances in the all systems host group

All monitored hosts should be set to be added to this host group automatically upon auto-registration, and will therefore be read-write to this user group by default. There is also a Group::Read-Only user group, which initially contains the user used for displaying dashboards on a wall monitor.

Group::Read-Only – defines read-only access to all systems

Users

TBC

Items

Below are details of the built-in checks. They are far more efficient at scale than using system.run for everything, and they have already been rigorously tested by Zabbix so you don’t have to.

Zabbix Agent items

These are checks handled by the agent on the monitored host. They are the most frequently used and most useful type.

They can be performed:

· actively (recommended): where the agent keeps a record of what should be monitored and handles all the collection and pushes the data to the proxy. It asks the proxy regularly (configurable) for what needs to be monitored. This distributes load across the monitoring infrastructure and reduces network traffic

· passively: where the agent sits passively waiting for the proxy to ask it to go and collect each metric

https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/simple_checks

Calculated items

Calculated items derive new values from one or more existing items. For example, working out the percentage of requests that are failing would look like 100*last("req.count.keyfail")/last("req.count.keytotal")
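
The arithmetic behind that expression can be sketched in Python. This only illustrates the calculation and is not how Zabbix evaluates it; the function name and the sample counts are hypothetical, and the zero-total guard is an assumption about how an idle service might be treated.

```python
def failure_percentage(failed, total):
    """Return the percentage of failed requests, mirroring 100*last(fail)/last(total)."""
    if total == 0:
        # Assumption: with no requests at all, report 0% rather than dividing by zero.
        return 0.0
    return 100.0 * failed / total


# Hypothetical latest values of the two request counters.
print(failure_percentage(5, 200))
```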

https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/calculated

Internal items

Used for monitoring the inner workings of the Zabbix application.

https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/internal

Aggregate items

Aggregate items take a host group and an item, and apply a function across the values of that item returned by all hosts in the host group. For example, to count how many java processes are running across all of the RCM hosts, the item would look like grpsum["RCM","proc.num[java,rcm,,rcm]",last]. In this case we’re adding up the last value returned by the proc.num check (an individual host check of the number of java processes owned by the rcm user and with rcm in the process string) on each host in the RCM host group.

Aggregate items need a dummy host to hold them. It’s useful to name this host after the service that the host group provides, because the aggregated items give you an overall view of how the whole service is performing across all hosts.
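
What the aggregate does with those per-host values can be sketched as follows. This is a simplified illustration rather than Zabbix's implementation; the function name, host names and counts are all hypothetical.

```python
def grpsum_last(latest_values):
    """Sum the most recent value reported by each host in a host group."""
    return sum(latest_values.values())


# Hypothetical last proc.num results, keyed by host, for the RCM host group.
rcm_latest = {"rcm-01": 2, "rcm-02": 3, "rcm-03": 1}
print(grpsum_last(rcm_latest))
```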

https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/aggregate

Administration

Queue

The Zabbix queue is an administration view of how long it is taking to capture and present metrics from monitored hosts. Overview shows delays by item type, the by proxy view shows whether any particular proxy is having trouble returning data, and Details shows the individual items and how long each has been delayed.

Actions

Auto Registration

Zabbix can automatically start monitoring hosts as they spin up. It can use various factors to decide what role a host has and what to monitor based on that role. The hostname and the proxy the host is monitored by can be used, but the decision can also be driven programmatically through the HostMetadataItem agent configuration option, if set.
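
The idea of deriving a role from host metadata can be sketched in Python. In Zabbix itself the matching is done by auto-registration action conditions, not by code like this; the metadata format, rule strings and template names below are hypothetical.

```python
# Hypothetical rules: a substring to look for in the metadata, and the templates to link.
ROLE_RULES = [
    ("role=web", ["Template Web Role"]),
    ("role=db", ["Template Database Role"]),
]


def templates_for(metadata):
    """Return the templates of every rule whose substring appears in the metadata."""
    matched = []
    for needle, templates in ROLE_RULES:
        if needle in metadata:
            matched.extend(templates)
    return matched


# Hypothetical metadata string as an agent might report it.
print(templates_for("os=linux role=web"))
```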

API

curl -i -X POST -H 'Content-Type: application/json-rpc' -d '{"params": {"password": "your_pass", "user": "your_user"}, "jsonrpc": "2.0", "method": "user.login", "id": 0}' http://zabbix-web/zabbix/api_jsonrpc.php
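
The same login request can be sketched in Python using only the standard library. The URL and credentials are the placeholders from the curl command above, the function names are hypothetical, and the error handling is a minimal assumption rather than a complete client.

```python
import json
import urllib.request

API_URL = "http://zabbix-web/zabbix/api_jsonrpc.php"  # placeholder from the curl example


def build_login_request(user, password, request_id=0):
    """Build the JSON-RPC 2.0 body for the user.login method."""
    return {
        "jsonrpc": "2.0",
        "method": "user.login",
        "params": {"user": user, "password": password},
        "id": request_id,
    }


def login(user, password, url=API_URL):
    """POST the login request and return the 'result' field of the reply."""
    body = json.dumps(build_login_request(user, password)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json-rpc"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        reply = json.load(response)
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


# Show the request body only; calling login() needs a reachable Zabbix front end.
print(json.dumps(build_login_request("your_user", "your_pass"), indent=2))
```

The value returned by login() is then passed as the auth field of subsequent API requests.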

Zabbix Daemons

zabbix_agentd

This is a small management agent that runs on any host you wish to monitor. It can run in active and/or passive mode.

In active mode, the agent queries a remote server or proxy and asks it what should be monitored on the local host and then actively collects the data before pushing the results to the server/proxy.

In passive mode, the agent sits idle until the remote proxy or server requests each metric individually.

Active mode is recommended for most metrics to distribute load, but the Zabbix infrastructure is set up to allow a mix of both, and in many cases a mix of passive and active checks will be the best solution.

/etc/zabbix/zabbix_agentd.conf

Key configuration options are:

ServerActive: addresses of remote servers to query for checks and send metrics to. Set this to the comma separated list of available Zabbix Proxies.

Server: addresses of remote servers that are allowed to query for metrics. Currently this is set to only one proxy; it should be set to the comma separated list of available Zabbix Proxies.

EnableRemoteCommands: allows the Zabbix Agent to use system.run items to open a shell and run system commands and scripts. Set this to “1”.

LogRemoteCommands: logs any remote commands that are run to the specified log file. Set this to 1.

ListenIP: what INET address the Zabbix Agent will listen on. Leaving this blank will set the agent listening on all interfaces. Recommend leaving blank for cases where DHCP will be used for monitored hosts. Puppet module puppet-zabbix config (around line 262) works out what to set this to, if left undefined.

StartAgents: how many processes to start for passive checks. The number of passive checks should be low. Set this to 1.

Hostname: configured name of the server to send with metrics and checks for items to monitor. Should be left blank.

RefreshActiveChecks: how often the agent should query the server to see if there are any new items it should monitor. Default is 120. This is probably okay unless you foresee circumstances of needing to ensure a new item is monitored ASAP

BufferSend: how long to keep collected metric data in the buffer. Increasing this lengthens the time a server or proxy can be inaccessible without data being lost, but uses more memory on the monitored host. With two proxies, 300-600 (seconds) should be sufficient.

BufferSize: the number of metrics to collect before sending the results to the Zabbix Server. Lower numbers mean values are sent more frequently, at the cost of more load; higher numbers use more memory and add delay to collection. The default of 100 is fine for most cases.

UserParameter: if you need to run complex scripts (beyond the field size allowed for system.run keys), UserParameter settings can be used to create a custom key, in the form: UserParameter=<key>,<shell command>
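
The recommendations above could be captured in a small script that renders the corresponding lines of zabbix_agentd.conf. This is a sketch only: the proxy host names and the function name are hypothetical, and the values are simply the ones suggested in this section.

```python
# Hypothetical proxy addresses; substitute your own.
PROXIES = ["proxy1.example.com", "proxy2.example.com"]


def render_agent_config(proxies):
    """Return zabbix_agentd.conf lines reflecting the values suggested in this guide."""
    proxy_list = ",".join(proxies)
    settings = [
        ("ServerActive", proxy_list),
        ("Server", proxy_list),
        ("EnableRemoteCommands", 1),
        ("LogRemoteCommands", 1),
        ("StartAgents", 1),
        ("RefreshActiveChecks", 120),
        ("BufferSend", 300),
        ("BufferSize", 100),
    ]
    # ListenIP and Hostname are deliberately omitted so that they stay unset.
    return "\n".join(f"{key}={value}" for key, value in settings)


print(render_agent_config(PROXIES))
```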

zabbix_proxy

A modified version of the zabbix_server daemon that is set up to proxy and buffer data between agents and server. It helps distribute load, adds a layer of resilience for data and makes remote or secured environments easier to reach.

/etc/zabbix/zabbix_proxy.conf

Key parameters are:

ProxyMode: whether the proxy is actively checking and sending data to the backend server or whether it is waiting passively for the server to send it new configuration and pick-up buffered monitoring data. Default of 0 (active) is recommended

Server: same reference as Zabbix Agent above

Hostname: same reference as Zabbix Agent above

DB*: various settings for communicating with the local database

ProxyOfflineBuffer: how long to store monitoring data that has yet to be synchronised with the backend server. This parameter can be used as data resilience for loss of connectivity to the server. The higher the number, the greater the length of time without data loss but more diskspace will be required on the volume hosting the database to store the data. The final value is an executive decision but 1-6hrs is normal

ConfigFrequency: how often an active proxy will be checking for new things that need to be monitored. Higher numbers mean potentially slower time for new monitoring to be picked up but this is generally low impact so should be set to 60-120 (seconds).

DataSenderFrequency: how frequently an active proxy forwards proxied data to the server. The default of 1 (second) is unnecessarily frequent for most cases; a value between 10 and 60 (seconds) is best for most circumstances.

Start*: different configuration options to tell the proxy how many particular types of process to start. E.g. StartTrappers=5 tells the daemon to start 5 processes for receiving monitoring data from zabbix_sender and active zabbix_agentd processes. Default values should be fine for most implementations but throughout the lifetime of the Zabbix monitoring solution, high Zabbix queues will determine if these need to be increased

HousekeepingFrequency: setting to tell the server how often to run the housekeeper process. Fine tuning of what to clean and how much to clean is configured in the Zabbix Web front end. From testing we have found that the best setup here is to leave this setting as default, disable (in the Web front end) housekeeping on the history tables and set up partitioning management through stored procedures in the database (details of this can be found in the zabbix_db.pp in the puppet-control-repo).

*Cache*: the internal cache metrics need to be monitored for issues as the number of monitored hosts and metrics increases.
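
To put a rough number on the disk space trade-off described for ProxyOfflineBuffer, a back-of-the-envelope estimate can be sketched as below. The function name is hypothetical, and the throughput and bytes-per-value figures are pure assumptions for illustration; measure your own database to get real numbers.

```python
def offline_buffer_bytes(values_per_second, buffer_hours, bytes_per_value):
    """Estimate the storage needed to hold buffer_hours of unsent values."""
    seconds = buffer_hours * 3600
    return values_per_second * seconds * bytes_per_value


# Assumed figures: 500 values per second, a 6 hour buffer, 100 bytes per stored value.
estimate = offline_buffer_bytes(500, 6, 100)
print(f"{estimate / 1024 ** 3:.2f} GiB")
```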

zabbix_server

Full implementation of the main application server, which handles the consolidation of data from proxies and updates the database.

/etc/zabbix/zabbix_server.conf

Many of the same settings available for the proxy configuration are available for the server. Particular exclusions from above are: ProxyMode, Server, ProxyOfflineBuffer, ConfigFrequency and DataSenderFrequency. Differences include:

Start*: in a fully proxy distributed monitoring platform, many of the pollers and trappers may be unnecessary on the Server;

Proxy*Frequency: server side settings for proxies that are set to be passive;

AlertScriptsPath: directory location for storing scripts used for alert actions

Debugging

Logging

Logs for all daemons are kept in /var/log/zabbix. All Zabbix processes have an option to change the log level of the running process, e.g. zabbix_agentd -R log_level_increase or zabbix_server -R log_level_decrease. The resulting output can be found in the relevant log for the daemon.
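
If the log level needs changing from a script, the runtime control commands above can be wrapped as in the following sketch. The function names are hypothetical, and actually running the command assumes the daemon binary is on the PATH and that the caller has sufficient privileges.

```python
import subprocess

DAEMONS = {"zabbix_agentd", "zabbix_proxy", "zabbix_server"}


def log_level_command(daemon, direction):
    """Build the runtime control command that raises or lowers a daemon's log level."""
    if daemon not in DAEMONS:
        raise ValueError(f"unknown daemon: {daemon}")
    if direction not in ("increase", "decrease"):
        raise ValueError(f"unknown direction: {direction}")
    return [daemon, "-R", f"log_level_{direction}"]


def change_log_level(daemon, direction):
    """Run the command; raises CalledProcessError if the daemon rejects it."""
    subprocess.run(log_level_command(daemon, direction), check=True)


# Print the command rather than running it, so the sketch is safe to try anywhere.
print(" ".join(log_level_command("zabbix_agentd", "increase")))
```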

On Zabbix Proxy and server, slow running DB queries can be logged by setting LogSlowQueries to an upper value (in milliseconds) of what is considered a slow query.
