As More People Use Google Services, Is Google Able To Handle The Load?
I support Google and the services they offer, including their mail service, Gmail. But yesterday the system suffered another outage which, according to their official web site, was caused by an underestimate of usage. That made me start to think: is Google going to be able to handle the load in the future? If and when the company comes out with its Google OS and more people use its services, will Google finally show its Achilles' heel, and will there be more downtime?
On their blog, Google reps state the following:
Gmail’s web interface had a widespread outage earlier today, lasting about 100 minutes. We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there’s a problem with the service. Thus, right up front, I’d like to apologize to all of you — today’s outage was a Big Deal, and we’re treating it as such. We’ve already thoroughly investigated what happened, and we’re currently compiling a list of things we intend to fix or improve as a result of the investigation.
Here’s what happened: This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.
The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google’s architecture), distributed the traffic across the request routers, and the Gmail web interface came back online.
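The chain reaction Google describes, where a few overloaded request routers drop out and push their traffic onto the rest, can be illustrated with a toy simulation. This is only a rough sketch and not Google's actual routing logic: the function name, the even split of traffic across healthy routers, and all the capacity and load numbers are assumptions made up for illustration.

```python
def simulate_cascade(total_load, capacities):
    """Return how many routers are still healthy after each round.

    Assumption: load is split evenly across the healthy routers, and a
    router whose share exceeds its capacity drops out entirely, handing
    its traffic to the routers that remain.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        survivors = [cap for cap in healthy if cap >= share]
        if len(survivors) == len(healthy):
            break  # every remaining router can carry its share
        healthy = survivors
        history.append(len(healthy))
    return history


# Hypothetical fleet: six routers that handle 100 requests/s, four that handle 120.
routers = [100] * 6 + [120] * 4

print(simulate_cascade(950, routers))                # load as estimated
print(simulate_cascade(1050, routers))               # load slightly underestimated
print(simulate_cascade(1050, routers + [120] * 10))  # extra routers brought online
```

With these made-up numbers, a load of 950 is carried without trouble, while 1050 first knocks out the six weaker routers and then, one round later, the remaining four. Adding ten more routers at the same 1050 load keeps everything up, which mirrors the fix described in the blog entry.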
Though the blog entry goes on to state that Google is going to make sure that this doesn’t happen again, it does make me wonder. If we continue to flock to Google services, will this one day bite us all in the butt?
What do you think?
Comments welcome.

5 Comments
the oracle
September 2nd, 2009
at 5:59am
Ron, I have been with Hotmail, from Microsoft, in one form or another, since 1997. It has had outages, or periods of overburdened activity (which for me, is the same thing) for 5-7 hours many times, yet it seems to get a pass.
Now I am not faulting Microsoft, or Google, it happens. That is one area however, where Microsoft seems to get absolutely no criticism whatsoever. Really odd.
That is why I have more than one e-mail account. Heck, the Verizon mail server, which that provider wonders why I use so little, is down monthly, almost like clockwork … and they wonder?
I just got a Google Voice number, and the choices of custom numbers were fairly slim. Now you were wondering about services overburdening the system?
I fully expect problems; it is the nature of things, and also of anything touched by humans.
Ron Schenone
September 2nd, 2009
at 6:34am
Hi Marc,
Thanks for the info.
I have an MSN account and I expect it to fail, which it does about once a week. My ISP account fails on average about once a month.
IMHO Google needs to be held to a higher standard. Failure is not an option.
Hotrao
September 3rd, 2009
at 2:56am
I'd add a couple of points for discussion:
a) On one side, I think that Google, like any other business, is facing the problem of optimizing costs, and this means running an infrastructure that is always sized “on the edge”, with little room for error.
b) On the other side, I’m a little bit worried that Google engineers say “we had slightly underestimated the load which some recent changes” placed on the request routers: Google engineers are human and can make mistakes, but what if the error in the estimate had been not slight but big?
c) We are relying too heavily on network infrastructure: in the past, if Google went down, we would have said “I’ll try again later”, while everybody is now acting as if this outage were an outage at a nuclear plant.
Google outage and credibility « How I see the world
September 3rd, 2009
at 2:57am
[...] Google outage and credibility Ron Schenone at Lockergnome writes an article asking if, with an increasing number of people using Google services, Google is still able to handle the load (full article at http://www.lockergnome.com/blade/2009/09/02/as-more-people-use-google-services-is-google-able-to-han…). [...]
As More People Use Google Services, Is Google Able To Handle The Load? - Gmail Blog
September 3rd, 2009
at 9:54am
[...] This article is featured on the custom Gmail Blog at Auto-Blogs.us. [...]