Yesterday, August 14, at about 10am, many users of Outlook.com and SkyDrive began to experience and report email outages and found themselves unable to connect to files in their SkyDrive. While outages of this nature are not uncommon, when people do lose connectivity to their email and files they tend to freak out. Outlook.com, formerly known as "Hotmail," for the most part serves consumers as opposed to professional businesses.
Our organization has been in Microsoft's cloud for email for almost three years now. There have been a few outages during that time frame, so I was not really that worried about this one. I also use Outlook.com for my personal email and I really like it. However, I must report that the "outages" we have experienced with Microsoft's 365 service have been short and have affected only specific areas of the service. For example, once or twice Outlook clients would not connect to the email service, but email continued to flow to mobile devices and Web Outlook. Another time Web Outlook was down, but once again mobile devices received email and the clients worked. Each of these occurrences lasted a few hours, much less than yesterday's Outlook.com outage.
What caused yesterday's outage? Microsoft has issued the following statement: 
"At 13:35 PM PDT on August 13th, 2013 there was a service 
interruption that affected some people's access to a small part of the SkyDrive 
service, but primarily Hotmail.com and Outlook.com. Availability was restored 
over the course of the afternoon and evening, and fully restored by 5:43 AM PDT 
on August 14th, 2013.
On the afternoon of the 12th, in one physical region of one of our 
datacenters, we performed our regular process of updating the firmware on a core 
part of our physical plant. This is an update that had been done successfully 
previously, but failed in this specific instance in an unexpected way. This 
failure resulted in a rapid and substantial temperature spike in the datacenter. 
This spike was significant enough before it was mitigated that it caused our 
safeguards to come in to place for a large number of servers in this part of the 
datacenter. 
These safeguards prevented access to mailboxes housed on these servers and 
also prevented any other pieces of our infrastructure to automatically failover 
and allow continued access. This area of the datacenter houses parts of the 
Hotmail.com, Outlook.com, and SkyDrive infrastructure, and so some people trying 
to access those services were impacted.
Once the safeguards kicked in on these systems, the team was instantly 
alerted and they immediately began to get to work to restore access. Based on 
the failure scenario, there was a mix of infrastructure software and human 
intervention that was needed to bring the core infrastructure back online. 
Requiring this kind of human intervention is not the norm for our services and 
added significant time to the restoration.
From that point onward, the team brought back access in waves throughout the 
evening. The majority of the impacted mailboxes were fully restored before 
midnight and the rest completed by 5:30 AM."
We should accept and expect that as we rely on the "cloud" for more and more services, service interruptions will occur from time to time. What you should require from your provider, however, comes down to a few very important things:
1. When outages occur, communicate immediately with clear and concise information regarding the issue. 
2. Work to get the service restored as quickly as possible.
3. And most important of all, protect our data!

"What caused yesterday's outage? Microsoft has issued the following statement:
At 13:35 PM PDT on March 12th, 2013 there was a service interruption that affected some people's access to a small part of the SkyDrive service, but primarily Hotmail.com and Outlook.com. Availability was restored over the course of the afternoon and evening, and fully restored by 5:43 AM PDT on March 13th, 2013."
...March??? Isn't this article about the August 2013 outage???
Well done. You caught an obvious typo in Microsoft's statement! For the record, I corrected it here so as not to confuse readers.