8 Hours of Internet Downtime: Lessons from the Google Cloud Blackout
On Thursday morning, June 12, 2025, the world was as peaceful as ever. People went about their daily lives, sending and receiving email, doing their work, attending video conferences. Many of the services we use without a second thought are powered by a giant heart called Google Cloud Platform (GCP). No one imagined that this heart would suddenly stop.

A small crack can cause chaos

The chaos began with Service Control, Google Cloud's invisible gatekeeper, which manages the hundreds of millions of API requests arriving from all over the world. In late May, new code had been quietly added to support more sophisticated policy management, and it carried a fatal flaw: an unguarded code path that could trigger the infamous null pointer error.

At 10:45 a.m. Pacific Time on June 12, 2025, a single policy change was entered. It contained unintended blank fields, and within seconds the data was replicated to Google's data centers around the globe. Moments later, Service Control instances in region after region hit the same error and crashed. (A simplified sketch of this failure mode appears near the end of this post.)

The moment the world stopped

At 10:51 a.m., developers' monitors around the world began to fill with "503 Service Unavailable" errors. At first, everyone assumed the problem was in their own code, but they soon realized how serious the situation was. The problem was far bigger: a large part of the internet itself was down. Services that depend on Google Cloud, such as Cloudflare, Supabase, and Sentry, failed at the same time, and the companies, schools, and hospitals that run on those services were paralyzed across the world. When GCP, a hub of the digital world, stopped, dependent services collapsed one after another, as if the internet were falling into darkness.

A desperate fight for recovery

Google's SRE (Site Reliability Engineering) teams went into emergency mode at once. The problem was being triaged within two minutes of the failure, and emergency response began immediately. Every engineer was working toward one goal: pressing the "red button," the kill switch that disables the faulty code path and restores the system.

But even recovery was not smooth. In us-central1, one of the largest regions, Service Control tasks restarted all at once, and a "herd effect" overloaded the underlying infrastructure, making the failure there worse. Engineers slowed the restarts, diverted traffic to other regions, and carefully controlled the load on the system as it came back. (The usual defense against this kind of stampede, randomized backoff, is sketched near the end of this post.) Tension was at its peak, and an hour felt like a day.

8 hours of silence, and the promise of dawn

Finally, after a struggle of roughly eight hours, the last affected region was restored. Internet services came back one by one, and people gradually returned to their daily lives. Google issued an official apology and published a detailed incident report. The report explained how a small code error brought so much of the digital world to a halt, and what measures would be taken to avoid repeating the mistake. Google pledged to harden its systems, to test even small changes thoroughly, and to build an architecture that a single small error cannot bring down.

A warning to modern society

This incident showed clearly how much of modern society rests on an invisible, and unsteady, technological foundation. The internet infrastructure we take for granted is by no means perfect; it contains vulnerabilities that even the smallest error can shake.
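For developers, the root cause is worth seeing in miniature. The Go sketch below is purely illustrative — the types, field names, and functions are invented, not Google's actual Service Control code — but it shows how a code path that assumes an optional field is always present can crash an entire request path when that field arrives empty, and how a simple guard turns the crash into a handled error.

```go
package main

import "fmt"

// Hypothetical, simplified types -- not Google's actual Service Control code.
// The quota spec is optional, so it is modeled as a pointer that may be nil.
type QuotaSpec struct {
	Limit int
}

type Policy struct {
	Name      string
	QuotaSpec *QuotaSpec
}

// checkQuota mirrors the failure mode described above: the new code path
// assumes QuotaSpec is always present and dereferences it without a nil check.
func checkQuota(p *Policy) int {
	return p.QuotaSpec.Limit // panics with a nil pointer dereference when QuotaSpec is nil
}

// checkQuotaSafe adds the guard that turns a crash into a handled error.
func checkQuotaSafe(p *Policy) (int, error) {
	if p.QuotaSpec == nil {
		return 0, fmt.Errorf("policy %q has no quota spec", p.Name)
	}
	return p.QuotaSpec.Limit, nil
}

func main() {
	// The replicated policy change arrives with the new field left blank.
	bad := &Policy{Name: "global-policy-update"}

	if limit, err := checkQuotaSafe(bad); err != nil {
		fmt.Println("rejected gracefully:", err)
	} else {
		fmt.Println("limit:", limit)
	}

	// Calling checkQuota(bad) instead would panic -- and because the same bad
	// policy was replicated everywhere, every replica would panic the same way.
}
```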
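The recovery story also has a standard illustration. The sketch below shows randomized exponential backoff ("full jitter"), the common defense against a thundering herd: instead of every client or restarting task retrying at the same instant, each waits a random slice of an exponentially growing window. The flaky backend and the parameters here are invented for the example; this is the general technique, not Google's internal tooling.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op with randomized exponential backoff ("full jitter"):
// each failed attempt waits a random duration in [0, base * 2^attempt), so a
// crowd of clients or restarting tasks does not hammer a recovering backend
// at the same instant.
func retryWithJitter(op func() error, maxAttempts int, base time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		window := base << attempt // base * 2^attempt
		sleep := time.Duration(rand.Int63n(int64(window)))
		fmt.Printf("attempt %d failed (%v); backing off %v\n", attempt+1, err, sleep)
		time.Sleep(sleep)
	}
	return err
}

func main() {
	calls := 0
	// flakyBackend stands in for an overloaded dependency that only succeeds
	// once the request rate has dropped enough for it to recover.
	flakyBackend := func() error {
		calls++
		if calls < 4 {
			return errors.New("503 service unavailable")
		}
		return nil
	}

	if err := retryWithJitter(flakyBackend, 6, 100*time.Millisecond); err != nil {
		fmt.Println("gave up:", err)
	} else {
		fmt.Printf("recovered after %d calls\n", calls)
	}
}
```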
The eight-hour internet outage of June 12, 2025 was a wake-up call for all of us. We now have to recognize that the digital civilization we enjoy rests not on solid ground but on an uncertainty that can shake at any moment. Remembering those eight hours, we need to think again, and seriously, about the stability and reliability of our systems and about our responsibility for the technology we build. The incident was bitter, but it was a historic event that taught us a lesson we had to learn.
- Haebom
https://haebom.dev/4z7pvx2k9pwd12ek8653