Icinga2 error "check command does not exist" because of missing constant

Author Christian Reading time 8 minutes

Photo by Christina Morillo: https://www.pexels.com/photo/software-engineer-standing-beside-server-racks-1181354/

This problem kept me busy for far too long, because I kept looking only at the Icinga2 master's logfiles. The main reason: the service definition for my icinga CheckCommand dated from a time when there was only one master without any agents. This led to it being executed on the master, and hence I never saw the problems on the agent.

Additionally, the cluster and cluster-health checks only check whether all endpoints are connected, which was the case all the time. Therefore I got no error there either.

But what happened?

I defined a new CheckCommand. It worked fine on the master. Then I rewrote the service apply rule so that it matches all monitored Linux hosts. And then I got the "check command does not exist" error for all these new service checks on all agent hosts.

I deleted the API config sync directories and restarted Icinga2 on the agents to trigger a new sync:

root@agent:/etc/icinga2# rm /var/lib/icinga2/api/zones-stage/* -rf && rm /var/lib/icinga2/api/zones/* -rf
root@agent:/etc/icinga2# systemctl restart icinga2.service

And suddenly all CheckCommands which are not part of the Icinga Template Library stopped working on the agents.

Uhm, ok. At this point I suspected I had somehow messed up my /etc/icinga2/zones.conf file some time ago. Turns out, this wasn't the case.

The root cause

Some weeks ago I defined a service check which is only executed on my Icinga2 master. However, I stored the CheckCommand and service configuration under /etc/icinga2/zones.d anyway (in the global zones global-commands and global-templates), as you never know when this comes in handy. (This has since been corrected in the article.) But the Telegram API requires a token, which I defined in /etc/icinga2/constants.conf. This file isn't synced, as it lies outside of /etc/icinga2/zones.d. I did that on purpose, as I didn't want to sync the token to all agents.

This apparently caused the config sync to fail validation on the agents, as the constant for the token couldn't be resolved there.
But again: this was only logged in the logfiles on the agents.

root@agent:/etc/icinga2# cat /var/log/icinga2/icinga2.log
[...]
[2024-07-17 22:39:04 +0200] information/ApiListener: Received configuration for zone 'global-templates' from endpoint 'master.domain.tld'. Comparing the timestamp and checksums.
[2024-07-17 22:39:04 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/eventcommands.conf' for zone 'global-templates'.
[2024-07-17 22:39:04 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/groups.conf' for zone 'global-templates'.
[2024-07-17 22:39:04 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/host-templates.conf' for zone 'global-templates'.
[2024-07-17 22:39:04 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/notifications.conf' for zone 'global-templates'.
[2024-07-17 22:39:04 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/service-templates.conf' for zone 'global-templates'.
[2024-07-17 22:39:04 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/telegrambot-notifications.conf' for zone 'global-templates'.
[2024-07-17 22:39:04 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/templates.conf' for zone 'global-templates'.
[2024-07-17 22:39:04 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/timeperiods.conf' for zone 'global-templates'.
[2024-07-17 22:39:04 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/users.conf' for zone 'global-templates'.
[2024-07-17 22:39:04 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/global-templates' (6688 Bytes).
[2024-07-17 22:39:04 +0200] information/ApiListener: Received configuration updates (2) from endpoint 'master.domain.tld' are different to production, triggering validation and reload.
[2024-07-17 22:39:04 +0200] critical/ApiListener: Config validation failed for staged cluster config sync in '/var/lib/icinga2/api/zones-stage/'. Aborting. Logs: '/var/lib/icinga2/api/zones-stage//startup.log'
[...]

The /var/lib/icinga2/api/zones-stage/startup.log has the details:

root@agent:/etc/icinga2# cat /var/lib/icinga2/api/zones-stage/startup.log
[2024-07-17 23:36:19 +0200] information/cli: Icinga application loader (version: r2.12.3-1)
[2024-07-17 23:36:19 +0200] information/cli: Loading configuration file(s).
[2024-07-17 23:36:19 +0200] information/ConfigItem: Committing config item(s).
[2024-07-17 23:36:19 +0200] critical/config: Error: Error while evaluating expression: Tried to access undefined script variable 'TelegramBotToken'
Location: in /var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf: 46:26-46:41
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(44):     HOSTDISPLAYNAME = "$host.display_name$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(45):     SERVICEDISPLAYNAME = "$service.display_name$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(46):     TELEGRAM_BOT_TOKEN = TelegramBotToken
                                                                                                               ^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(47):     TELEGRAM_CHAT_ID = "$user.vars.telegram_chat_id$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(48):

[2024-07-17 23:36:19 +0200] critical/config: Error: Error while evaluating expression: Tried to access undefined script variable 'TelegramBotToken'
Location: in /var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf: 20:26-20:41
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(18):     NOTIFICATIONCOMMENT = "$notification.comment$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(19):     HOSTDISPLAYNAME = "$host.display_name$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(20):     TELEGRAM_BOT_TOKEN = TelegramBotToken
                                                                                                               ^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(21):     TELEGRAM_CHAT_ID = "$user.vars.telegram_chat_id$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(22):

[2024-07-17 23:36:19 +0200] critical/config: 2 errors
[2024-07-17 23:36:19 +0200] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.

However, the tricky part is that a regular config validation on the agent will succeed! It checks only the active configuration, not the staged one:

root@agent:/etc/icinga2# icinga2 daemon -C
[2024-07-18 00:00:16 +0200] information/cli: Icinga application loader (version: r2.12.3-1)
[2024-07-18 00:00:16 +0200] information/cli: Loading configuration file(s).
[2024-07-18 00:00:16 +0200] information/ConfigItem: Committing config item(s).
[2024-07-18 00:00:16 +0200] information/ApiListener: My API identity: agent.domain.tld
[2024-07-18 00:00:16 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-07-18 00:00:16 +0200] information/ConfigItem: Instantiated 5 Zones.
[2024-07-18 00:00:16 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-07-18 00:00:16 +0200] information/ConfigItem: Instantiated 2 Endpoints.
[2024-07-18 00:00:16 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2024-07-18 00:00:16 +0200] information/ConfigItem: Instantiated 235 CheckCommands.
[2024-07-18 00:00:16 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2024-07-18 00:00:16 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2024-07-18 00:00:16 +0200] information/cli: Finished validating the configuration file(s).

And this was the reason why I was too focused on the master.
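
Since the failure only shows up in the agent's own logfile, it can help to search that file for the exact message. The following is a minimal sketch, not an official Icinga tool: the message text is copied from the log excerpt above, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: find staged-config-sync validation failures in an Icinga 2 log.
# The message text is taken from the agent log excerpt in this article.
STAGE_FAIL_MSG='Config validation failed for staged cluster config sync'

# stage_sync_failures LOGFILE  (hypothetical helper name)
# Prints every matching log line; returns non-zero if there is none.
stage_sync_failures() {
    grep -F -- "$STAGE_FAIL_MSG" "$1"
}

# count_stage_sync_failures LOGFILE  (hypothetical helper name)
# Prints the number of matching lines (0 if there is none).
count_stage_sync_failures() {
    grep -c -F -- "$STAGE_FAIL_MSG" "$1" || true
}
```

On an agent this would be called as `stage_sync_failures /var/log/icinga2/icinga2.log`, assuming the default log location shown above.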

What I learned later is that you can use the following command to validate the configuration in the stage directory.
It is described in the Icinga documentation under "Config Sync: Receive Config".

root@agent:/var/log/icinga2# icinga2 daemon -C --define System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stage/
[2024-07-21 16:28:51 +0200] information/cli: Icinga application loader (version: r2.12.3-1)
[2024-07-21 16:28:51 +0200] information/cli: Loading configuration file(s).
[2024-07-21 16:28:51 +0200] information/ConfigItem: Committing config item(s).
[2024-07-21 16:28:51 +0200] critical/config: Error: Error while evaluating expression: Tried to access undefined script variable 'TelegramBotToken'
Location: in /var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf: 20:26-20:41
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(18):     NOTIFICATIONCOMMENT = "$notification.comment$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(19):     HOSTDISPLAYNAME = "$host.display_name$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(20):     TELEGRAM_BOT_TOKEN = TelegramBotToken
                                                                                                               ^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(21):     TELEGRAM_CHAT_ID = "$user.vars.telegram_chat_id$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(22):

[2024-07-21 16:28:51 +0200] critical/config: Error: Error while evaluating expression: Tried to access undefined script variable 'TelegramBotToken'
Location: in /var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf: 46:26-46:41
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(44):     HOSTDISPLAYNAME = "$host.display_name$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(45):     SERVICEDISPLAYNAME = "$service.display_name$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(46):     TELEGRAM_BOT_TOKEN = TelegramBotToken
                                                                                                               ^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(47):     TELEGRAM_CHAT_ID = "$user.vars.telegram_chat_id$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(48):

[2024-07-21 16:28:51 +0200] critical/config: 2 errors
[2024-07-21 16:28:51 +0200] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.
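
To avoid retyping the long option, the command above can be wrapped in a small function. This is only a sketch: the function name and the ICINGA2_BIN override are assumptions made for illustration (the override allows a dry run without Icinga installed), while the icinga2 arguments are exactly the ones used above.

```shell
#!/bin/sh
# Sketch: validate the staged cluster config by hand on an agent.
# ICINGA2_BIN is a hypothetical override so the wrapper can be dry-run
# (e.g. ICINGA2_BIN=echo); by default the real icinga2 binary is called.

# validate_stage [STAGE_DIR]  (hypothetical helper name)
validate_stage() {
    stage_dir="${1:-/var/lib/icinga2/api/zones-stage}"
    # Normalize to exactly one trailing slash, as in the command above.
    stage_dir="${stage_dir%/}/"
    "${ICINGA2_BIN:-icinga2}" daemon -C --define "System.ZonesStageVarDir=$stage_dir"
}
```

Calling `validate_stage` without arguments uses the default stage directory from this article. The exit status is the one returned by icinga2 itself.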

The solution

On the master I moved the two files from /etc/icinga2/zones.d to /etc/icinga2/conf.d, so they are no longer synced to the agents, and restarted the service.

root@master:/etc/icinga2# mv /etc/icinga2/zones.d/global-commands/telegrambot-commands.conf /etc/icinga2/conf.d/
root@master:/etc/icinga2# mv /etc/icinga2/zones.d/global-templates/telegrambot-notifications.conf /etc/icinga2/conf.d/
root@master:/etc/icinga2# systemctl restart icinga2.service

On the agent a simple restart is enough:

root@agent:/etc/icinga2# systemctl restart icinga2.service

And after that everything worked again.
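
To catch this kind of mistake on the master before it reaches the agents, one could search the synced tree for constants that are used there but defined elsewhere. The following is a rough, text-level sketch and not an Icinga feature: the function name is made up, it relies on a grep that supports -r, and it would also match the name inside comments.

```shell
#!/bin/sh
# Sketch: find synced config files that reference a constant which is not
# defined anywhere inside the synced tree (e.g. only in constants.conf).

# unsynced_constant_refs CONST_NAME ZONES_DIR  (hypothetical helper name)
# Prints the files under ZONES_DIR that mention CONST_NAME, but only if no
# file under ZONES_DIR defines it with "const CONST_NAME".
# Prints nothing otherwise.
unsynced_constant_refs() {
    const_name="$1"
    zones_dir="$2"
    # A definition inside the synced tree means the agents receive it too.
    if grep -rqE "^[[:space:]]*const[[:space:]]+${const_name}([[:space:]]|=)" "$zones_dir"; then
        return 0
    fi
    # -w matches the name as a whole word, -l lists file names only.
    grep -rlw -- "$const_name" "$zones_dir" | sort
}
```

For the case in this article the call would be `unsynced_constant_refs TelegramBotToken /etc/icinga2/zones.d`. Any file it prints would fail validation on a host that does not define the constant locally.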

Another problem detected - an even deeper rooted cause

In the aftermath I was curious why Icinga didn't notify me that the config in the stage directory couldn't be validated. Shouldn't there be some kind of built-in check for this?

Yes, it turns out the built-in icinga CheckCommand does exactly this. But it was never executed on my agent, as I still had a service definition from a time when I didn't have any agents. Initially the configuration was the following:

// Checks the agent health
apply Service "icinga" {
  import "generic-service"

  check_command = "icinga"

  assign where (host.address || host.address6) && host.vars.os == "Linux"
}

This was a remnant of having only the Icinga master and no agents, and it led to the check being executed on the master. Which is... not smart if you want to validate the configuration on the agent.

After changing it to the following:

// Checks the agent health - must be executed on the agent
apply Service "icinga" {
  import "generic-service"

  check_command = "icinga"

  command_endpoint = host.vars.agent_endpoint

  assign where host.vars.agent_endpoint
}

The check worked as intended.

Oh, and I opened a pull request to enhance Icinga's documentation regarding the config sync: https://github.com/Icinga/icinga2/pull/10101. Let's see if it gets accepted.