Yesterday, August 14, at about 10am, many users of Outlook.com and SkyDrive began to experience and report email outages, as well as being unable to connect to files in their SkyDrive. While outages of this nature are not completely uncommon, when people do lose connectivity to their email and files they tend to freak out. Outlook.com, formerly known as "Hotmail", for the most part serves the consumer base as opposed to professional businesses.
Our organization has been in Microsoft's cloud for almost three years now for email. There have been a few outages during that time frame, so I was not really that worried about this one. I also use Outlook.com for my personal email and I really like it. However, I must report that the "outages" we have experienced with Microsoft's 365 service have been short and only affected specific areas of the service. For example, once or twice Outlook clients would not connect to the email service, but email continued to flow to mobile devices and Web Outlook. Another time Web Outlook was down, but once again mobile devices received email and the clients worked. Each of these occurrences lasted a few hours, much less than yesterday's Outlook.com outage.
What caused yesterday's outage? Microsoft has issued the following statement:
"At 13:35 PM PDT on August 13th, 2013 there was a service
interruption that affected some people's access to a small part of the SkyDrive
service, but primarily Hotmail.com and Outlook.com. Availability was restored
over the course of the afternoon and evening, and fully restored by 5:43 AM PDT
on August 14th, 2013.
On the afternoon of the 12th, in one physical region of one of our
datacenters, we performed our regular process of updating the firmware on a core
part of our physical plant. This is an update that had been done successfully
previously, but failed in this specific instance in an unexpected way. This
failure resulted in a rapid and substantial temperature spike in the datacenter.
This spike was significant enough before it was mitigated that it caused our
safeguards to come in to place for a large number of servers in this part of the
datacenter.
These safeguards prevented access to mailboxes housed on these servers and
also prevented any other pieces of our infrastructure to automatically failover
and allow continued access. This area of the datacenter houses parts of the
Hotmail.com, Outlook.com, and SkyDrive infrastructure, and so some people trying
to access those services were impacted.
Once the safeguards kicked in on these systems, the team was instantly
alerted and they immediately began to get to work to restore access. Based on
the failure scenario, there was a mix of infrastructure software and human
intervention that was needed to bring the core infrastructure back online.
Requiring this kind of human intervention is not the norm for our services and
added significant time to the restoration.
From that point onward, the team brought back access in waves throughout the
evening. The majority of the impacted mailboxes were fully restored before
midnight and the rest completed by 5:30 AM."
We should accept and expect that, as we rely on the "cloud" for more and more services, service interruptions will occur from time to time. What you should require from your provider, however, are a few very important things:
1. When outages occur, communicate immediately with clear and concise information regarding the issue.
2. Work to get the service restored as quickly as possible.
3. And most important of all, protect our data!
"What caused yesterday's outage? Microsoft has issued the following statement:
At 13:35 PM PDT on March 12th, 2013 there was a service interruption that affected some people's access to a small part of the SkyDrive service, but primarily Hotmail.com and Outlook.com. Availability was restored over the course of the afternoon and evening, and fully restored by 5:43 AM PDT on March 13th, 2013."
...March??? Isn't this article about the August 2013 outage???
Well done. You caught an obvious typo in Microsoft's statement! For the record, I corrected it here so as not to confuse the readers.