Date: 2024-09-01
Duration: Approximately 2 hours and 54 minutes¹
Impact: Metropolis’ server was unreachable, causing disruption to all services dependent on it

Timeline

  • 6:14 PM: Routine server maintenance initiated with an apt update && apt upgrade.

  • 6:24 PM: Server rebooted as part of the update process, since the upgrade included a kernel update.

  • 6:27 PM: Reboot completed, but SSH access was unavailable. Attempts to connect resulted in “Connection timed out” errors.

  • 6:33 PM:

    • Gained access to the server via the recovery console
    • Checked the sshd service status; it was active and running, indicating the SSH server itself was operational
    • Ping commands to the host failed, suggesting a network connectivity issue (the diagnostic commands from this stretch are sketched after the timeline)
  • 6:39 PM:

    • Further attempts to diagnose the issue.
    • Disabling the Uncomplicated Firewall (ufw) had no effect, ruling it out as the cause of the connectivity problem.
  • 6:42 PM:

    • An attempt to install net-tools failed due to the network being unreachable.
    • A ping to 1.1.1.1 (Cloudflare’s DNS resolver) also failed, confirming that the problem was a general loss of connectivity on the server side and not just a DNS issue
  • 6:44 PM:

    • A power cycle was initiated in an attempt to reset the network hardware and potentially resolve the issue
  • 6:45 PM:

  • 6:54 PM:

    • After a bit more digging, it turned out that the eth1 interface did not exist either
    • The netplan status command reported the primary interface eth0 as initialized and up, yet further investigation showed it had no valid IP configuration
  • 7:02 PM:

    • In an attempt to recover the missing configuration file, a backup restoration was initiated.
    • The netplan generate command, which should regenerate network configuration, produced unusable output, further complicating the recovery.
  • 8:01 PM:

    • A decision was made to restore from a backup, despite the potential loss of 3 days’ worth of data.
  • 8:11 PM:

    • The backups themselves were found to be broken or unusable, as they exhibited the same issues.
  • 8:18 PM:

    • A final attempt to restart the networking service (systemctl restart networking.service) failed as networking.service couldn’t be found, indicating deeper issues with the network configuration beyond just missing files.
  • 8:21 PM:

    • With other options exhausted, booting from a recovery ISO was initiated as a potential way to access and repair the system
  • 8:41 PM:

    • The root cause was definitively identified: the apt upgrade / kernel update had inadvertently removed or corrupted the network settings.
  • 8:50 PM:

    • With the recovery ISO boot seemingly unsuccessful, the possibility of migrating to a new VPS was considered as a last resort
  • 8:53 PM:

    • I removed all IP interfaces (besides docker0) and added them back from scratch using the gateway, public IP, and network mask I found in the Digital Ocean dashboard (see the recovery sketch after the timeline)
      • sudo ifconfig eth0 {ip} netmask {netmask} up
      • sudo ip route add default via {gateway}
  • 8:58 PM:

    • Network was partially up again
  • 9:24 PM:

    • Network was fully up and all services restored.
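
For reference, the checks from the 6:33–6:54 PM window correspond roughly to the commands below. This is a reconstruction rather than the exact session, assuming a stock Ubuntu host with systemd, ufw, and netplan:

    systemctl status ssh       # confirm the OpenSSH daemon (sshd) is running
    sudo ufw status verbose    # inspect the firewall state
    sudo ufw disable           # temporarily rule the firewall out
    ping -c 3 1.1.1.1          # test raw IP connectivity, bypassing DNS
    ip addr show               # list interfaces and their addresses
    netplan status             # netplan's view of the interfaces and their config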

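A minimal sketch of the manual recovery performed at 8:53 PM, written with the ip equivalents of the ifconfig command used at the time. The address, prefix, and gateway below are placeholders, not the real values (those came from the Digital Ocean dashboard), and changes made this way do not persist across reboots:

    sudo ip addr flush dev eth0                  # clear any stale addressing on eth0
    sudo ip addr add 203.0.113.10/24 dev eth0    # placeholder public IP and netmask
    sudo ip link set eth0 up                     # bring the interface up
    sudo ip route add default via 203.0.113.1    # placeholder gateway
    ping -c 3 1.1.1.1                            # verify outbound connectivity

The 8:18 PM failure is also consistent with this setup: on a netplan-managed Ubuntu system there is no networking.service (that unit belongs to ifupdown), and systemd-networkd is the service that would be restarted instead.
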
Root Cause Analysis

The apt upgrade inadvertently altered or removed critical network configuration files, specifically /etc/netplan/50-cloud-init.yaml, leading to the loss of the default route and of the eth0 / eth1 network interface configuration (eth0 being the interface that connected the server to the public internet). This rendered the server unreachable even though sshd was running.
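
For illustration, a working /etc/netplan/50-cloud-init.yaml for this kind of setup looks roughly like the following; the addresses, gateway, and nameservers here are placeholders rather than the values from the incident:

    network:
      version: 2
      ethernets:
        eth0:
          addresses:
            - 203.0.113.10/24        # placeholder public IP
          routes:
            - to: default
              via: 203.0.113.1       # placeholder gateway
          nameservers:
            addresses: [1.1.1.1, 8.8.8.8]

Once a file like this is back in place, netplan try applies it with an automatic rollback if connectivity is lost, which makes it a safer way to test a rebuilt configuration than netplan apply.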

Contributing Factors

  • Lack of robust backup and recovery procedures: The unavailability or corruption of backups hindered recovery efforts and prolonged the outage
  • Debugging on production: While the exact actions taken during troubleshooting weren’t fully captured in the logs, some comments suggest that changes were made directly on the production system, which can increase risk

Lessons Learned and Action Items

  • Review and improve backup strategy: Ensure backups are regularly tested and can be reliably restored
  • Implement a staging environment: Critical updates and configuration changes should be tested in a staging environment before being deployed to production
  • Automate network configuration: Use tools like Ansible or cloud-init to manage network settings, making them less susceptible to accidental changes during updates (a brief sketch follows below)
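
As a sketch of that last item, assuming Ansible is the chosen tool (the playbook name, inventory host, and files/ path are illustrative, not an existing setup):

    # deploy-netplan.yml: push a version-controlled netplan config and apply it
    - name: Manage netplan configuration
      hosts: metropolis              # placeholder inventory name
      become: true
      tasks:
        - name: Copy the managed netplan config
          ansible.builtin.copy:
            src: files/50-cloud-init.yaml          # kept in version control
            dest: /etc/netplan/50-cloud-init.yaml
            mode: "0600"
          notify: Apply netplan
      handlers:
        - name: Apply netplan
          ansible.builtin.command: netplan apply

With the netplan file in version control, an unexpected change after an upgrade becomes a one-command restore rather than a multi-hour recovery.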

Conclusion

This incident underscores the importance of careful change management and robust recovery procedures. While the outage was eventually resolved, the associated downtime and potential data loss could have been mitigated with better preparation.

Footnotes