Thursday, August 15, 2013

Outlook.com & SkyDrive Outage Over

Yesterday, August 14, at about 10 a.m., many users of Outlook.com and SkyDrive began to report email outages and found themselves unable to reach files in their SkyDrive. While outages of this nature are not uncommon, when people do lose connectivity to their email and files they tend to freak out. Outlook.com, formerly known as "Hotmail," for the most part serves consumers rather than businesses.

Our organization has been in Microsoft's cloud for email for almost three years now. There have been a few outages during that time, so I was not really that worried about this one. I also use Outlook.com for my personal email and I really like it. I must report, however, that the "outages" we have experienced with Microsoft's 365 service have been short and have affected only specific areas of the service. For example, once or twice Outlook clients would not connect to the email service, but email continued to flow to mobile devices and Web Outlook. Another time Web Outlook was down, but once again mobile devices received email and the Outlook clients worked. Each of these occurrences lasted a few hours, much less than yesterday's Outlook.com outage.

What caused yesterday's outage? Microsoft has issued the following statement:

"At 13:35 PM PDT on August 13th, 2013 there was a service interruption that affected some people's access to a small part of the SkyDrive service, but primarily Hotmail.com and Outlook.com. Availability was restored over the course of the afternoon and evening, and fully restored by 5:43 AM PDT on August 14th, 2013.

On the afternoon of the 12th, in one physical region of one of our datacenters, we performed our regular process of updating the firmware on a core part of our physical plant. This is an update that had been done successfully previously, but failed in this specific instance in an unexpected way. This failure resulted in a rapid and substantial temperature spike in the datacenter. This spike was significant enough before it was mitigated that it caused our safeguards to come in to place for a large number of servers in this part of the datacenter.

These safeguards prevented access to mailboxes housed on these servers and also prevented any other pieces of our infrastructure to automatically failover and allow continued access. This area of the datacenter houses parts of the Hotmail.com, Outlook.com, and SkyDrive infrastructure, and so some people trying to access those services were impacted.

Once the safeguards kicked in on these systems, the team was instantly alerted and they immediately began to get to work to restore access. Based on the failure scenario, there was a mix of infrastructure software and human intervention that was needed to bring the core infrastructure back online. Requiring this kind of human intervention is not the norm for our services and added significant time to the restoration.

From that point onward, the team brought back access in waves throughout the evening. The majority of the impacted mailboxes were fully restored before midnight and the rest completed by 5:30 AM."

We should accept and expect that, as we rely on the "cloud" for more and more services, service interruptions will occur from time to time. What you should require from your provider, however, are a few very important things:

1. When outages occur, communicate immediately with clear and concise information regarding the issue.

2. Work to get the service restored as quickly as possible.

3. And most important of all, protect our data!

2 comments:

  1. "What caused yesterday's outage? Microsoft has issued the following statement:
    At 13:35 PM PDT on March 12th, 2013 there was a service interruption that affected some people's access to a small part of the SkyDrive service, but primarily Hotmail.com and Outlook.com. Availability was restored over the course of the afternoon and evening, and fully restored by 5:43 AM PDT on March 13th, 2013."
    ...March??? Isn't this article about the August 2013 outage???

  2. Well done. You caught an obvious typo in Microsoft's statement! For the record, I corrected it here so as not to confuse the readers.
