Google's Gmail self-heals after shooting itself in the foot with a config update: systems restored after 25 minutes, outage ends

There is tons of press on the Gmail outage. I was on the phone when the outage occurred, so Gmail being down didn't bother me, but it did bother many others.

Gmail goes down briefly and everybody flips out

Atlanta Journal Constitution - ‎5 hours ago‎
If you're watching this, congratulations! You've survived the Great Google Outage of Jan. 24, 2014. At about 2:15 p.m. eastern time Friday, Gmail users across the world began seeing Temporary Error (500) error message while trying to access their email ...
 

Google's Gmail outage leaves many in the dark [San Jose Mercury News]

Businessweek - ‎14 hours ago‎
Jan. 24--MOUNTAIN VIEW -- An unexplained outage affected countless users of Google's (GOOG) popular Gmail service for more than an hour Friday, while also disrupting the Google Plus social network and some of the company's other Web services, ...
 

Google's reliability team was prepping for a reddit AMA when Gmail went down

Washington Post (blog) - ‎50 minutes ago‎
While most of Twitter panicked over (and Yahoo celebrated) a Gmail outage, Google's Site Reliability Engineering was preparing to do an "Ask Me Anything" reddit thread. Depending on how paranoid you are, that may seem either incredibly ironic or like ...
 

Google services go down as Reliability team takes questions on Reddit

Fox News - ‎10 hours ago‎
Many of Google's services hiccupped briefly on Friday, as an unexplained outage knocked offline such popular services as Gmail, Calendar, Talk, Docs, Drive and more. As of 3:23 p.m. EST, the service was back up and running smoothly, according to the ...
 

Here's what caused that massive Gmail outage

Washington Post (blog) - ‎50 minutes ago‎
The outage, Traynor continued, essentially fixed itself when the system responsible for the malfunction automatically generated the correct configuration and began propagating that throughout Google's live services. Google offered an apology for the mishap ...

Here is the blog post from Google VP of Engineering Ben Treynor. His brief summary of the problem, and of how it repaired itself, follows.

At 10:55 a.m. PST this morning, an internal system that generates configurations—essentially, information that tells other systems how to behave—encountered a software bug and generated an incorrect configuration. The incorrect configuration was sent to live services over the next 15 minutes, caused users’ requests for their data to be ignored, and those services, in turn, generated errors. Users began seeing these errors on affected services at 11:02 a.m., and at that time our internal monitoring alerted Google’s Site Reliability Team. Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users’ service was restored.

Naive users are comparing Yahoo's email outage to Google's Gmail outage. Did Yahoo self-heal? No.

Google knows it can win the email battle with better availability. Failures happen, but if you can recover quickly and find the cause, overall site reliability should improve.

With services once again working normally, our work is now focused on (a) removing the source of failure that caused today’s outage, and (b) speeding up recovery when a problem does occur. We'll be taking the following steps in the next few days:
1. Correcting the bug in the configuration generator to prevent recurrence, and auditing all other critical configuration generation systems to ensure they do not contain a similar bug.
2. Adding additional input validation checks for configurations, so that a bad configuration generated in the future will not result in service disruption.
3. Adding additional targeted monitoring to more quickly detect and diagnose the cause of service failure.
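
Step 2 above, validating a configuration before it reaches live services, can be illustrated with a small sketch. This is not Google's implementation, which the post does not describe. The `Config` shape (backend names mapped to traffic shares), the `validate` rules, and the `Pusher` class are all hypothetical, chosen only to show the idea of rejecting a bad configuration and continuing to serve the last known-good one.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical generated configuration: backend name -> share of traffic."""
    version: int
    routes: dict[str, float] = field(default_factory=dict)


def validate(config: Config) -> list[str]:
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    if not config.routes:
        problems.append("no routes: every request would be ignored")
    for backend, share in config.routes.items():
        if share < 0:
            problems.append(f"negative traffic share for {backend}")
    total = sum(config.routes.values())
    if config.routes and abs(total - 1.0) > 1e-6:
        problems.append(f"traffic shares sum to {total}, expected 1.0")
    return problems


class Pusher:
    """Hypothetical rollout gate that keeps the last known-good config live."""

    def __init__(self, initial: Config):
        problems = validate(initial)
        if problems:
            raise ValueError(f"initial config is invalid: {problems}")
        self.live = initial
        self.rejected: list[tuple[int, list[str]]] = []

    def push(self, candidate: Config) -> bool:
        """Make the candidate live only if it validates; otherwise record why not."""
        problems = validate(candidate)
        if problems:
            self.rejected.append((candidate.version, problems))
            return False
        self.live = candidate
        return True


if __name__ == "__main__":
    pusher = Pusher(Config(1, {"a": 0.5, "b": 0.5}))
    pusher.push(Config(2, {}))          # bad config from a buggy generator
    pusher.push(Config(3, {"a": 1.0}))  # corrected config from the next run
    print(pusher.live.version, pusher.rejected)
```

With a gate like this, a bad configuration from the generator would be logged and dropped instead of propagating, and the next correct configuration would go live as usual.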