Your production server is dead. After the hard reboot, what caused the issue?

Last week, one of our production servers became completely unresponsive: No SSH. No SSM. No ping. The application was down, the database was spiking, and the monitoring had been screaming for minutes before anyone noticed. We had to force-reboot to recover. But then came the hard part: figuring out what happened on a machine where all the pre-crash state was gone.

This is the story of how basic GNU/Linux tools (the kind most cloud engineers never bother learning these days, since "we can check Grafana") gave us the complete picture when our fancy observability stack had nothing.

The scene

Here's what we knew after the reboot:

- The EC2 instance (48 CPUs, 92 GB RAM, Ubuntu 24.04) had been completely unreachable for ~18 minutes until the team decided to hard-reboot it
- Both the RDS primary and its replica showed a CPU spike during the same window
- Application logs in our centralized logging (Loki) showed nothing unusual
- Sentry had captured DNS resolution failures to the database

The ops team