Working in IT is sometimes like this...
Task: Upgrade a custom application on a customer system
Technical details: This involves executing some script beforehand, then a Puppet run and some menial afterwork/clean up tasks.
- Prepare everything. Shut down system. Take a snapshot.
- Executing the script: Error! I can't find this dependency!
- Oh, yeah. That package requiring the dependency isn't needed here. Commenting out that package from the list.
- Executing script again. It finishes properly. Fine.
- Running Puppet agent.
- Error after 45 seconds: Left over process found. Aborting.
- Checking Release Notes: Ah, known bug. ps -ef | grep process and kill pid is the workaround. OK. ps'ing and kill'ing.
- Run puppet agent -t again.
- It runs for 30min, then aborts. A partition does not have enough free space. Narf.
- Searching for stuff to delete/compress/move.
- Start Puppet again to verify it works. It does.
- Revert system to snapshot as this is policy.
- Doing 1-6 again.
- Puppet agent run.
- It runs for 2.5 hours. Then breaks with "Execution expired" and some Ruby stacktrace.
- Checking Release Notes again. Nothing. Hmm ok.
- Reading code to understand the problem. No real insight gained. It SHOULD work.
- Reverting to snapshot. Now taking a 2nd snapshot right before the Puppet run.
- 3rd Puppet agent run. Same error after 2.5 hours.
- Oh come on..
- Reverting to snapshot.
- Manually installing the RPM whose installation triggers the stacktrace. It works. Package install takes 45min.
- Reverting to snapshot.
- Just for the sake of trying I start a 4th Puppet run. Expecting no change.
- After 55min of Puppet running I send a status mail to project lead, some other involved people, application developer, etc. that I couldn't update the system in time today and will need help and an additional 1-2 hours tomorrow.
- 5min after sending the mail: Hey, it's me! Puppet! I finished. No errors!
- I send out a follow-up mail, stating that the biggest part of the update is done. But due to the missed time window I will still need 1-2 hours tomorrow for the afterwork/clean up tasks.
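In hindsight, two of these failures (the leftover process and the full partition) might have been caught before starting a long Puppet run. A minimal sketch of such pre-flight checks in shell. This is my own assumption of what could help, not something from the Release Notes: the function names, the process pattern and the size threshold are hypothetical.

```sh
#!/bin/sh
# Hypothetical pre-flight helpers to run before a long "puppet agent -t".
# Names, patterns and thresholds are illustrative assumptions.

# has_free_space PATH MIN_KB
# Succeeds if the filesystem holding PATH has at least MIN_KB kilobytes available.
# Uses POSIX output format (-P) so the data is always on the second line.
has_free_space() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$2" ]
}

# leftover_pids PATTERN
# Prints the PIDs of running processes whose command line matches PATTERN.
# Returns non-zero if nothing matches.
leftover_pids() {
    pgrep -f "$1"
}
```

Usage could look like this, with /var and 5 GB standing in for whatever partition and size the upgrade really needs: has_free_space /var 5242880 || echo "not enough space on /var". The output of leftover_pids can be reviewed by hand before passing anything to kill.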
And I still don't know where or what the error was! Due to the "Execution expired" message I suspect that somewhere deep in the Ruby code something timed out. Maybe something which is neither documented nor written to any logfile. Something which had a maintenance window at exactly the same hours as we had.
Hopefully the developer knows more.
Sometimes IT sucks.😂