Using ping to monitor if your systems are working is wrong
I've seen this far too often: when monitoring a system, a simple ping command is used to verify that the system is "up and running". In reality, that is not what a ping actually checks.
Ping uses the Internet Control Message Protocol (ICMP) to send echo request packets to a server, which the server answers with echo reply packets. This gives us a method to verify that the server is actively answering our requests. But a server that answers these packets is not necessarily in working condition, and a reply does not mean that all its services are running.
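To make concrete how little is involved in such an exchange, here is a minimal sketch of how an ICMP echo request is put together. The function names are my own for illustration, and the sketch only builds the packet; actually sending it would need a raw socket and usually root privileges.

```python
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type of an echo request; the echo reply uses type 0


def internet_checksum(data: bytes) -> int:
    """One's-complement sum over 16-bit words, as used in the ICMP header."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Build an echo request: type, code, checksum, identifier, sequence, payload."""
    # The checksum is computed with the checksum field itself set to zero.
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, identifier, sequence)
    return header + payload
```

Nothing in this exchange involves a service running on the server, so a reply on its own says nothing about whether those services work.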
There are several reasons for this:
1. The network is reliable - mostly
In https://aphyr.com/posts/288-the-network-is-reliable, Kyle Kingsbury (a.k.a. "Aphyr") and Peter Bailis discuss common network fallacies, similar to the Fallacies of distributed computing by L. Peter Deutsch. It is commonly assumed that the network "just works" and that no strange things will happen, when in fact they happen all the time.
In regard to our ICMP ping this means:
- There can be a firewall blocking ICMP or simply all traffic from our monitoring system
- Routing can be misconfigured
- Datacenter outages can happen
- Bandwidth can be drastically reduced
- VLANs can be misconfigured
- Cables can be broken
- Switchports can be defective
- Add your own ideas of what can go wrong in a network
So you want a monitoring method that lets you reliably distinguish between network problems and system problems.
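As a rough sketch of what that distinction could look like, the following combines the result of a ping with the result of a TCP connect to the service port. The function names and the status labels are assumptions of mine, not taken from any particular monitoring system, and the ping result is passed in as a plain boolean.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def classify(ping_ok: bool, tcp_ok: bool) -> str:
    """Combine an ICMP result with a TCP connect result (labels are illustrative)."""
    if tcp_ok:
        # The port accepted a connection; a missing ping reply then usually
        # just means ICMP is filtered somewhere on the path.
        return "service reachable" if ping_ok else "service reachable, ICMP filtered"
    if ping_ok:
        # The kernel answers pings, but the service port does not accept connections.
        return "host reachable, service not responding"
    return "host unreachable (network or system problem)"
```

A successful connect only shows that the port accepts connections. It is still a much stronger signal than a ping reply, but it is not a full application-level check.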
2. CPU Lockups
ICMP packets are answered by the kernel itself. This has a nasty side effect when your server hangs, trapped in a state known as either a soft or a hard lockup. While lockups are somewhat rare overall, CPU soft lockups still occur from time to time in my experience, especially with earlier versions of hypervisors for virtual machines (VMs), where a CPU soft lockup can be triggered simply by too much CPU load on a system.
The nasty side effect of CPU soft lockups? The system will still reply to ICMP packets while all other services are unreachable.
I once had problems with power management (ACPI) on a server's hardware. Somehow the ACPI kernel module would lock resources without freeing them. This effectively meant that the system came to a complete stop, but it didn't reboot or shut down. Nor did it crash in the sense of being completely unreachable. No, ICMP packets were still answered just fine.
But no SSH connection was possible, and no TCP or UDP services were reachable, because the CPU was stuck in a certain operation and never switched to process other tasks.
Only disabling ACPI by adding the acpi=off parameter to the GRUB kernel boot command line "fixed" this.
Regarding soft lockups I can recommend reading the following:
- Linux Magic System Request Key Hacks: Here you learn how you can trigger a kernel panic yourself and how to configure a few things
- https://www.baeldung.com/linux/terminal-kernel-panic also has a nice list of ways to trigger a kernel panic from the command line
- https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt Kernel documentation regarding "Softlockup detector and hardlockup detector (aka nmi_watchdog)"
- This SuSE knowledge base article also has some good basic information on how to configure timers, etc. https://www.suse.com/support/kb/doc/?id=000018705
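If you want to see how the lockup detectors are configured on one of your own systems, a small sketch like the following reads the sysctl files described in the kernel documentation linked above. The helper names are my own, and the sketch assumes a Linux system with /proc mounted; files that do not exist on your kernel are reported as None.

```python
from pathlib import Path

# sysctl files described in the kernel's lockup-watchdogs documentation
WATCHDOG_KEYS = ("watchdog_thresh", "softlockup_panic", "hardlockup_panic", "nmi_watchdog")


def read_watchdog_settings(base: str = "/proc/sys/kernel") -> dict:
    """Return the integer value of each watchdog sysctl, or None if unavailable."""
    settings = {}
    for key in WATCHDOG_KEYS:
        try:
            settings[key] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            settings[key] = None  # file missing, unreadable, or not an integer
    return settings


def soft_lockup_threshold(settings: dict):
    """Soft lockup threshold in seconds: twice watchdog_thresh, per the kernel docs."""
    thresh = settings.get("watchdog_thresh")
    return None if thresh is None else 2 * thresh


if __name__ == "__main__":
    current = read_watchdog_settings()
    print(current)
    print("soft lockup threshold (s):", soft_lockup_threshold(current))
```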
Takeaways
- ICMP is well suited to checking whether the system is reachable on the network
- After all, ICMP is cheaper than TCP in terms of packet size and number of packets sent
- A TCP connect to the port of the service in use is usually the better check, for the reasons stated above
- You must incorporate your network topology into your monitoring system; only then will you be able to properly distinguish between "System unreachable", "Misconfigured switchport" and "Service stopped responding"
- This means switches, firewalls, routers, loadbalancers, gateways - everything your users and services depend on to be reachable must be included in your monitoring system, and:
- If possible the dependencies between them should be defined
- Like: Router → Firewall → LoadBalancer → Switch → System → Service
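The dependency chain above can be sketched in a few lines. This is only an illustration under my own assumptions: each component is represented by a name and a check function, and the checks here are hard-coded stand-ins for whatever probes your monitoring system really runs.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each entry pairs a component name with a zero-argument check
# that returns True when the component is healthy.
Check = Tuple[str, Callable[[], bool]]


def first_failure(chain: Sequence[Check]) -> Optional[str]:
    """Walk the chain in dependency order and return the first failing component."""
    for name, check in chain:
        if not check():
            # Everything behind this component cannot be judged from here,
            # so alerting on it would only add noise.
            return name
    return None


if __name__ == "__main__":
    # Hypothetical check results: the switch is broken, so the system and
    # service behind it are never even probed.
    chain = [
        ("router", lambda: True),
        ("firewall", lambda: True),
        ("loadbalancer", lambda: True),
        ("switch", lambda: False),
        ("system", lambda: True),
        ("service", lambda: True),
    ]
    print(first_failure(chain))  # prints "switch"
```

Walking the chain in order is what lets the monitoring report one broken switchport instead of a flood of alerts for every system and service behind it.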
Conclusion
Knowing all this you should keep the following in mind: A successful ping only verifies that the system is reachable via your network. And this doesn't imply anything about the state of the OS.
Yes, this is no mind-blowing truth that I reveal here. But far too often I still encounter monitoring setups where ICMP is used to verify that a system is "up and running".