Feuerfest

Just the private blog of a Linux sysadmin

Working in IT is sometimes like this...

Task: Upgrade a custom application on a customer system

Technical details: This involves running a script beforehand, then a Puppet run and some menial follow-up/cleanup tasks.

  1. Prepare everything. Shut down the system. Take a snapshot.
  2. Executing the script: Error! I can't find this dependency!
    • Oh, yeah. That package requiring the dependency isn't needed here. Commenting out that package from the list.
  3. Executing script again. It finishes properly. Fine.
  4. Running Puppet agent.
    • Error after 45 seconds: Leftover process found. Aborting.
    • Checking the release notes: Ah, known bug. The workaround is to find the process with ps -ef and kill its PID. Ok.
    • ps'ing and kill'ing.
  5. Run puppet agent -t again.
  6. It runs for 30min, then aborts. A partition doesn't have enough free space. Narf.
    • Searching for stuff to delete/compress/move.
    • Start Puppet again to verify it works. It does.
  7. Revert system to snapshot as this is policy.
  8. Doing 1-6 again.
  9. Puppet agent run.
  10. It runs for 2.5 hours. Then breaks with Execution expired and some Ruby stacktrace.
    • Checking the release notes again. Nothing. Hmm ok.
    • Reading code to understand the problem. No real insight gained. It SHOULD work.
  11. Reverting to snapshot. Now taking a 2nd snapshot right before the Puppet run.
  12. 3rd Puppet agent run. Same error after 2.5 hours.
    • Oh come on..
  13. Reverting to snapshot.
  14. Manually installing the RPM whose installation triggers the stack trace. It works. Package install takes 45min.
  15. Reverting to snapshot.
  16. Just for the sake of trying I start a 4th Puppet run. Expecting no change.
  17. After 55min of Puppet running I send a status mail to the project lead, some other involved people, the application developer, etc., saying that I couldn't update the system in time today and will need help and an additional 1-2 hours tomorrow.
  18. 5min after sending the mail: Hey, it's me! Puppet! I finished. No errors!
  19. I send out a follow-up mail, stating that the biggest part of the update is done, but that due to the missed time window I will still need 1-2 hours tomorrow for the follow-up/cleanup tasks.
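
The failures in steps 4 and 6 (leftover process, full partition) are the kind of thing that could be caught before Puppet even starts. Here is a minimal sketch of such a pre-flight check in Python. Everything specific in it is an assumption for illustration: the process name `custom-app`, the path `/var` and the 5 GiB threshold are hypothetical, and `preflight()` is my own helper, not part of Puppet or the application.

```python
import shutil
import subprocess
from typing import List


def find_leftover_pids(ps_output: str, pattern: str) -> List[int]:
    """Return PIDs from `ps -eo pid,args` style output whose command line contains pattern."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)  # split into PID and the rest of the line
        if len(parts) == 2 and parts[0].isdigit() and pattern in parts[1]:
            pids.append(int(parts[0]))
    return pids


def has_free_space(path: str, required_bytes: int) -> bool:
    """True if the filesystem containing path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


def preflight(pattern: str = "custom-app",          # hypothetical process name
              path: str = "/var",                   # hypothetical partition
              required_bytes: int = 5 * 1024**3) -> List[str]:  # hypothetical 5 GiB
    """Collect human-readable problems; an empty list means 'go ahead'."""
    problems = []
    ps = subprocess.run(["ps", "-eo", "pid,args"],
                        capture_output=True, text=True, check=True)
    pids = find_leftover_pids(ps.stdout, pattern)
    if pids:
        problems.append(f"leftover processes matching {pattern!r}: {pids}")
    if not has_free_space(path, required_bytes):
        problems.append(f"less than {required_bytes} bytes free on {path}")
    return problems
```

Calling `preflight()` right before `puppet agent -t` and aborting if the returned list is non-empty would have saved at least two of the reverts above. It would not have helped with the Execution expired error, though.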

And I still don't know where or what the error was! Due to Execution expired I suspect that somewhere deep in the Ruby code something timed out. Maybe something which is neither documented nor written to any logfile. Something which had a maintenance window at exactly the same hours as we had.

Hopefully the developer knows more.

Sometimes IT sucks.😂
