by Kyle Hilgendorf | February 29, 2012 | 9 Comments
Today, Microsoft Windows Azure had an advertised outage. As of this writing, the outage is still in recovery mode. I spent the morning talking to a handful of Azure customers via phone, email, and Twitter. Here are some observations that are becoming quite evident, along with important lessons for cloud customers and cloud providers:
- Cloud providers continue to track cloud outages/issues based only on availability whereas it must also include performance and response metrics
- Service dashboards continue to rely on the underlying cloud service being online
- Customers can never get enough information during the outage from the provider
- We all know outages are a fact of life, but in the midst of one, pain is real
- Customer application design needs to continue to evolve
Let me dive into each of these points with my own commentary.
- Cloud providers continue to track cloud outages/issues based only on availability, whereas tracking must also include performance and response metrics: Azure’s health dashboard and communication originally stated that only 3.8% of customers were affected by this outage. There was no context around where the 3.8% came from or how it was measured, but I spoke to several customers this morning who suspect they were not included in that 3.8%. Just recently, the percentages on the dashboard were increased. The latest affected-customer percentages, by region, are 6.7%, 37%, and 28% (and may still change). Some customers informed me that their various Azure roles (web, worker, VM) are up and online, but that service performance is degraded to the point of being unusable. Because most provider SLAs are based upon uptime and availability, not performance or response, these customers may not be counted as affected. You can follow some of my interactions on Twitter (@kylehilgendorf) from this morning to see a couple of examples. Providers MUST start including performance and response SLAs in their standard service. A degraded service is often as impactful as a down service. A great quote came in on Twitter this morning via @qthrul: “…a falling tower is ‘up’ until it is ‘down’.” A falling tower is not very useful for most customers.
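To make the availability-versus-performance distinction concrete, here is a minimal sketch of two SLA calculations over the same monitoring samples. The sample data and the 2-second response target are hypothetical, not any provider's actual SLA terms.

```python
# Hypothetical SLA check: availability-only vs. availability plus response time.
# Sample data and thresholds are illustrative, not any provider's real SLA.

SAMPLES = [  # (reachable, response_ms) for one service over a window
    (True, 180), (True, 210), (True, 9500), (True, 11200), (True, 8700),
]

def availability_only(samples):
    """Counts a sample as 'up' if the endpoint answered at all."""
    up = sum(1 for reachable, _ in samples if reachable)
    return up / len(samples)

def availability_and_response(samples, max_ms=2000):
    """Counts a sample as 'up' only if it answered within a response target."""
    ok = sum(1 for reachable, ms in samples if reachable and ms <= max_ms)
    return ok / len(samples)

print(availability_only(SAMPLES))          # 1.0 -- "no outage"
print(availability_and_response(SAMPLES))  # 0.4 -- service is barely usable
```

Under an availability-only SLA the window above counts as 100% up, even though most requests took nearly ten seconds; a response-time clause is what surfaces the degradation.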
- Service dashboards continue to rely on the underlying cloud service being online: The Azure Service Dashboard (http://www.windowsazure.com/en-us/support/service-dashboard/) has been experiencing very intermittent availability. Throughout this morning, I have had about a 25%-30% success rate in getting the dashboard to load. I’ve been telling providers frequently that service health systems and dashboards must be hosted independently of the provider’s cloud service. If the cloud service is down or degraded, customers had better be able to see the status at all times. I recently finished a lengthy document on evaluation criteria for public IaaS providers that will publish in the near future, and one of those criteria specifically states this as a requirement. If the service dashboard is the primary vessel by which cloud providers communicate outage updates, it must be up while the service is down.
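As a minimal sketch of the out-of-band monitoring this implies, the probe below checks whether a status page loads at all, from infrastructure independent of the provider being watched. The URL, timeout, and polling policy are all illustrative.

```python
# Hypothetical out-of-band probe for a provider status page. Run it from a
# network and host that do not depend on the provider being monitored.
import urllib.request

def dashboard_reachable(url, timeout=5):
    """Return True only if the status page responds with HTTP 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # DNS failure, refused connection, timeout, HTTP error
        return False
```

A scheduled job on an independent host could run this every minute and alert when it fails, which is exactly the visibility customers lose when the dashboard shares the cloud's fate.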
- Customers can never get enough information during the outage from the provider: Looking back at the 2011 AWS and Microsoft outages, it became very clear that frequent status updates are paramount during an outage. AWS led the way with 30-45 minute outage updates through their painful EBS outage and Ireland issues. While updates don’t solve the problem, they do demonstrate customer advocacy and concern. Some customers told me this morning that they feel completely in the dark. There is no reason why a cloud provider should not have a dedicated communication team providing updates at least every 30 minutes throughout the entire outage. Microsoft seems to have settled into a good cadence of more frequent updates late this morning, but there were large gaps in updates when the outage first occurred. More important, in my opinion, is a thorough post-mortem on the outage once the service has been restored. This should come within 3-4 days of the outage and must be very open and honest about the root cause, the fix, and the takeaways for the future. Providers, please note: the world is very smart. If a provider even tries to mask or hide any of the details, it will come back to reflect negatively. Honesty wins.
- We all know outages are a fact of life, but in the midst of one, pain is real: I’ve heard from some customers who are heavily impacted and, as a result, very frustrated and disappointed. When a cloud service has a good track record, we all admit that an outage will happen at some point. Yet in the middle of an outage, emotion gets involved. Therefore, see point #5.
- Customer application design needs to continue to evolve: As with previous cloud outages, customer application design must continue to evolve to account for possible (some would say probable) cloud outages and issues. No cloud service is identical to another, and each has its own unique design and configuration options. Most cloud services have the concept of zones and regions from a geographical or hosting-location standpoint. In most cloud outages, not every zone or region is affected. Therefore, the best-prepared applications will be those designed cross-zone and cross-region to avoid an outage or degradation in any one area. However, this comes with extreme complexity and increased cost, often 3x-10x the cost advertised by providers. If you will be running a critical application at a cloud provider, expect an outage, design for resiliency, and be prepared to pay for it. This may also mean that you have to hire or retain some very skilled cloud staff.
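The cross-region design argument above can be sketched as a simple failover routing decision. Region names and the health map are placeholders, not any provider's actual API:

```python
# Illustrative failover order for a cross-region deployment. Region names and
# the health map are placeholders, not any specific provider's API.

REGIONS = ["us-east", "us-west", "eu-west"]  # application deployed in all three

def pick_region(healthy):
    """Route to the first deployed region reported healthy; fail loudly if none."""
    for region in REGIONS:
        if healthy.get(region):
            return region
    raise RuntimeError("all regions degraded -- trigger incident response")

# During a single-region outage, traffic shifts automatically:
print(pick_region({"us-east": False, "us-west": True, "eu-west": True}))  # us-west
```

The hard (and expensive) part is everything behind `healthy`: replicating data and deployments into each region so the failover target can actually carry the load, which is where the 3x-10x cost multiple comes from.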
It is always a sad day as a cloud analyst to see these outages. However, it seems that significant change in the industry, at both a provider and customer level, only tends to come after an emergency.
I’d love your comments here. Let’s engage in a conversation.
Category: Cloud Microsoft Outage Providers Tags: Azure, Cloud, Microsoft, Outage
by Kyle Hilgendorf | February 3, 2012 | 9 Comments
I’m pretty vocal when it comes to challenging Cloud Service Providers (CSPs) to increase the amount of public transparency they share, not only with customers but with prospects. On a very regular basis, I take calls from Gartner clients about the challenges in evaluating CSPs and the frustration with the lack of published information at most providers.
I’ve seen some CSPs make some very good strides lately in terms of improving websites and publishing architectural and security related information. One particular aspect where the industry has seen very little improvement is transparency with audits.
A common discussion for me at Gartner has centered on SAS 70 Type II audits, and now SSAE 16 / SOC 1 reports. The latter has replaced SAS 70, and having an SSAE 16 audit and SOC 1 report completed by an independent third party is table stakes for competing in the public cloud services market. There are many problems with the SSAE 16 audit, namely that CSPs still get to designate which control objectives an auditing agency verifies. If a CSP does a poor job at logical access security, it can simply choose not to have the third party audit that control. This seems unfair, and it is a loophole. As a result, customers actually do need to see the SOC 1 report and must sign a confidentiality agreement with each provider to do so. That does not scale well.
But why a confidentiality agreement? Why don’t CSPs simply publish their SOC 1 report online? I’ve spent the last month talking to a number of CSPs about this. I get the token response that it would divulge sensitive security configurations that if published would put the cloud service in jeopardy of being attacked/exposed. My response to that is, “Ok, but let’s get creative.” I have not been able to understand why a CSP cannot publish a summary report listing each of the controls that were audited and the relative findings for each objective. There is a stark difference in mentioning that a third party confirmed security surveillance cameras are in place versus actually listing each physical location of all individual cameras.
Well, after having several in-depth conversations with many providers, I believe our crosshairs need not focus on the CSPs as much as on the auditing agencies. More than a few of the CSPs have apparently gone to their auditing agency and requested the right to publish the SOC 1 report publicly. All providers that have done this were denied. The auditing agency holds the copyright to the report, and the legal agreements of the audit restrict the CSP from publishing without auditor consent.
A few providers claim they have gone further and asked the auditor if they can take portions of the report and publish them as an executive summary or FAQ to highlight for customers the controls and summarized results. Again, those providers were not able to obtain the rights to do so.
What do these auditing agencies / large consulting companies have to hide? If they truly are independent third parties, why can’t they stand behind their reports publicly? If not the entire report, why not a summary of findings?
Providers are not 100% absolved of responsibility here either. Even if the auditing agency refuses to release any information from the report, the provider should still publicly list the controls it asked the auditor to examine. That would be a big step for many providers and would at least start to level the playing field for customer evaluations. Furthermore, the best CSPs will put more emphasis on obtaining ISO 27001 certification, which does provide a base standard for controls.
I would love to hear from you on this. Are you a customer that is tired of signing agreements simply to confirm controls? Are you a provider that wants to publish more information but are restricted by auditors? Are you an auditor that would like to have a deeper discussion? Please contact me.
Category: Cloud Evaluation Providers Tags: Audit, Cloud, Evaluation, SSAE 16
by Kyle Hilgendorf | December 16, 2011 | 2 Comments
I’ve focused a significant amount of effort in 2011 in assisting our clients through assessments of various cloud providers, namely at the IaaS level. The topic has been so popular in fact, that I presented an “Evaluating Cloud Providers” session at our Gartner Catalyst 2011 conference as well as a free Gartner webinar (which is available on replay).
We have several pieces of research in the works that we are excited will further assist customers in evaluating cloud providers in early 2012.
However, I would be remiss if I did not call attention to a very encouraging announcement recently made by the Cloud Security Alliance. I’ve personally been an advocate for the CSA and the effort they’ve put into improving security standards within cloud computing. The recent announcement concerns a public cloud provider registry named STAR. The intent of STAR is to provide a publicly accessible registry where cloud providers publish the security controls that they offer in their service.
Most cloud providers in my recent experience have become quite good and open in sharing their security controls with prospective clients, but it is very time consuming for clients to hop from provider to provider, ask to see these controls, and document the controls for comparison. Furthermore, many of the providers still require a signed NDA with the client to share the controls.
My hope with STAR is that most providers opt in, as this is exactly the type of registry and knowledge-sharing location that customers want. However, there is one potential risk. The CSA is a member-driven organization, and many of the public cloud providers are key members. There is a risk that the members will tune the security criteria over time to best match their capabilities. Yet I have faith that the consensus opinion of many providers (i.e., competitors) will triumph over collusion, and we at Gartner will keep a close eye on this. It is a positive sign that the CSA does not require a cloud provider to be a CSA member in order to be listed in STAR. As a result, there really is no excuse for a cloud provider not to opt in. If you are a significant customer at a major cloud provider and you also believe in this, encourage your provider to participate.
This entire entry is my own personal opinion, not an official position from Gartner.
Category: Cloud CSA Evaluation Providers Tags: Cloud, CSA, Evaluation, Providers
by Kyle Hilgendorf | August 29, 2011 | 3 Comments
VMworld kicked off this week with a flurry of announcements and improvements and I wanted to highlight two of the important ones for my coverage area.
Global Connect – this new announcement is intriguing for enterprise customers. The goal is for vCloud Datacenter partners to develop global relationships with one another to provide customers true global coverage for vCloud hosting while only having to maintain a relationship with a vCloud Datacenter partner in their region. The announcement kicks off with Bluelock (US), Softbank (Japan), and Singtel (Singapore).
While it will take some time for the technology to be put in place and work effectively, my concerns are not around technology. The non-technical side of this announcement will prove extremely challenging. For each of the providers to establish legal agreements with one another such that Customer X can deploy a workload to Bluelock and later move or copy it to Singtel or Softbank (or vice versa) will be no small feat. It is hard enough for Customer X to establish satisfactory terms and conditions with any single provider today. For a customer to establish those terms, and then for the provider to pass those terms on to other providers, is daunting to consider.
Surely these providers will be motivated, but tracking the agreements will be an area to watch closely. Furthermore, customers will now have additional context to consider when negotiating agreements. Pay close attention to language around the mobility of workloads from provider to provider.
If VMware can successfully broker their providers through this Global Connect initiative, it squarely places vCloud Datacenter providers (or the ecosystem) in line to compete with Amazon Web Services from a global availability perspective, an area where, until this announcement, vCloud was getting beaten badly.
vCloud Connector 1.5 - Earlier this year I published a research document named Moving Applications to the Cloud: Finding Your Right Path. The document provides guidance on the process of moving virtual machines from internal data centers to public cloud providers. In the document, I highlighted several concerns with vCloud Director and vCloud Connector 1.0. At publication time, there were issues such as no network transmission intelligence, no restartable protocol, and the fact that vCloud Connector had to temporarily make a copy of the OVF/VMDK and store it in a temporary holding area. All of these factors contributed to really poor VM mobility performance.
It is great to see that version 1.5 addresses some of these issues. vCloud Connector no longer requires a copy of the OVF/VMDK to a temporary location. Furthermore, a restartable protocol has now been introduced. Now a single network hiccup will not completely interrupt the transmission and require you to start over. Both advancements are welcome.
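To illustrate why a restartable protocol matters, here is a toy chunked-transfer sketch; the chunk size and bookkeeping are hypothetical, not vCloud Connector's actual wire protocol. After an interruption, only unacknowledged chunks are resent.

```python
# Toy sketch of a restartable (chunked) transfer. The chunking scheme is
# hypothetical, not vCloud Connector's actual protocol.

CHUNK = 64 * 1024 * 1024  # 64 MB chunks (illustrative)

def transfer(total_bytes, completed_chunks, send_chunk):
    """Send only the chunk offsets not already acknowledged by the receiver.

    completed_chunks: set of byte offsets confirmed before any interruption.
    send_chunk(offset, length): callback that ships one chunk over the wire.
    """
    for offset in range(0, total_bytes, CHUNK):
        if offset in completed_chunks:
            continue  # already transferred before the interruption
        send_chunk(offset, min(CHUNK, total_bytes - offset))
        completed_chunks.add(offset)
    return completed_chunks
```

Without the `completed_chunks` bookkeeping, any network hiccup forces the whole OVF/VMDK upload to restart from byte zero, which is exactly the 1.0 behavior version 1.5 fixes.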
However, as noted in the document, the biggest problem in V2C mobility is inconsistent/slow Internet upload speeds from customers to vCloud Datacenter providers. I would still like to see network intelligence built into vCloud Director and Connector (e.g., compression, acceleration, de-duplication). Adding these enhancements will really start to change the game for VM mobility performance and move the industry from a nice concept to reality. VMware’s stiff competitor, Citrix, is already aggressively innovating in this space with its NetScaler Cloud Bridge announcement earlier in the year. Perhaps VMware can build, acquire, or partner with someone to bring similar capabilities to vCloud.
I hope to see many of you at VMworld this week. It is always a great week and wonderful chance to learn, network, and gain perspective.
Category: Cloud Hybrid IaaS Mobility Providers vCloud VMware Tags: Availability, Mobility, V2C, vCloud, VMware
by Kyle Hilgendorf | August 23, 2011 | 1 Comment
I just finished two webinars last week on Evaluating Cloud Providers. Attendance was really fantastic, and I hope our research has helped a variety of companies in their journey of evaluating cloud providers. Below are my thoughts from the sessions.
Category: Cloud Evaluation Hosting IaaS Providers Tags: Assessment, Cloud, Evaluation, Providers, Transparency, Vendor Management
by Kyle Hilgendorf | August 12, 2011 | 2 Comments
In my recently published Gartner IT1 guidance document, “Moving Applications to the Cloud: Finding Your Right Path”, I highlighted the Internet bottleneck between enterprise data centers and public cloud providers as a major contributing factor to the failure of VM-to-cloud mobility. In fact, for many large enterprise organizations, achieving uploads of 2-3 GB/hour across the Internet is a best-case scenario today.
I wanted to highlight two recent announcements in the industry that will hopefully help solve this bottleneck soon.
First, Citrix announced NetScaler Cloud Bridge at the Citrix Synergy conference in May. As the name implies, the product aims to bridge connections between corporate data centers and cloud providers. It is a layer 2 network bridge that includes end-to-end security (IPsec) and network performance optimization. Cloud Bridge accomplishes the network optimization through compression, packet optimizations, and de-duplication. Performance improvement data is not yet available, but I plan to follow up on that in the near future.
More recently, Amazon Web Services announced AWS Direct Connect, which allows organizations to establish a dedicated circuit between corporate infrastructure/applications and AWS at either 1 Gbps or 10 Gbps. AWS already has an intriguing Virtual to Cloud (V2C) tool (AWS VM Import) that works well for converting Windows Server 2008 VMware-based images to AWS AMIs, but adoption is slow due to Internet bandwidth bottlenecks. It simply hasn’t been practical to convert and move large VMs yet. By establishing a dedicated circuit to AWS, the bottleneck is largely erased for customers that are serious about moving large VMs or large amounts of data to AWS quickly.
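Some back-of-the-envelope arithmetic shows the size of the gap a dedicated circuit closes. The 500 GB image size is illustrative, and both rates assume ideal sustained throughput:

```python
# Back-of-the-envelope transfer times for a 500 GB VM image (illustrative size).
# Assumes ideal sustained rates; real-world throughput will be lower.

SIZE_GB = 500

internet_gb_per_hour = 2.5             # midpoint of the 2-3 GB/hour figure
hours_internet = SIZE_GB / internet_gb_per_hour

gbps = 1                               # a 1 Gbps Direct Connect circuit
gb_per_hour_circuit = gbps / 8 * 3600  # bits -> bytes, seconds -> hours = 450 GB/h
hours_circuit = SIZE_GB / gb_per_hour_circuit

print(round(hours_internet))    # 200 hours over the Internet (over a week)
print(round(hours_circuit, 1))  # 1.1 hours over a dedicated 1 Gbps link
```

Even allowing for protocol overhead, a dedicated 1 Gbps circuit turns a week-plus transfer into roughly an afternoon, which is why the dedicated-circuit model matters for serious V2C moves.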
Personally, I think both solutions will be very successful and have different use cases. Cloud Bridge is more versatile and will improve and secure connections to several cloud providers. AWS Direct Connect is for those customers highly dedicated to using AWS for serious data transfers. Both solutions will need some more time to bake, however. NetScaler Cloud Bridge will need time to proliferate into all the major cloud providers, as it requires a presence at each one, and AWS Direct Connect is currently only available as a cross-connect from the Ashburn, VA Equinix facility.
We are approaching a “meet me in the middle” solution where most enterprises need a lot of on-premise hosting, some traditional managed hosting and setup (like dedicated circuits, design assistance, concierge-level support), and also some cloud (for workloads that need scalability or can handle elasticity). The hybrid cloud picture is coming together. Both these announcements are positive improvements in supporting this concept.
I’m excited to see the adoption rate and early feedback on both solutions and am wondering what other players might get serious in this market soon. How about you?
Category: AWS Citrix Cloud Dedicated Hosting Hybrid IaaS Mobility Tags: AWS, Citrix, Cloud, Hosting, Hybrid
by Kyle Hilgendorf | July 21, 2011 | Comments Off
Chris Wolf and I will be participating in the first-ever Catalyst Twitter Chat at the Gartner Catalyst conference next week in San Diego. I am thrilled to be part of this event, and the topic is extremely popular right now. For those who know Chris, he has some of the best technical depth in virtualization of anyone around. I will be bringing my background on cloud service providers and their capabilities for conducting virtual-to-cloud migrations. In fact, my first Gartner IT1 research paper centers on this very topic. IT1 clients can access “Moving Applications to the Cloud: Finding Your Right Path” now.
If you will be at Catalyst next week, please stop by and participate face to face with us. If you are not at Catalyst or prefer to interact virtually, please engage with us via Twitter and submit questions per the guidelines below. We’re looking forward to a great week!
Analyst Chat invitation
@Gartner_inc will be hosting an “Analyst Chat” via Twitter during Gartner’s Catalyst Conference.
When: Wednesday, July 27th at 5:30pm PT.
Where: Aqua 302
Details: Join Chris Wolf (@cswolf) and Kyle Hilgendorf (@kylehilgendorf) to discuss VM and Cloud Mobility – What You Should Worry About. This discussion will cover:
• Important considerations: Converting the VM is the easy part!
• Architectural and product pitfalls that directly impact mobility options
• What options do current cloud providers and server virtualization vendors offer?
• What independent software options exist?
• How will this market mature and how long will it take?
Migrating VMs to and from the public cloud is not for the faint of heart. Yet organizations have compelling arguments for VM to cloud (IaaS) mobility. A market is emerging around cloud brokers and orchestrators, but it is very immature. What are the key aspects when considering moving VMs to or from IaaS cloud providers, or even between your own private data centers? Why isn’t it as simple as converting the VM file format?
To participate and track the conversation, Twitter users should key into #CAT11. We encourage you to share comments, insights, and questions to help further the discussion of VM and cloud mobility. This chat will be open to all; we look forward to seeing you on Twitter.
• Use #CAT11 to tap into the discussion.
• New question/topic every 10 minutes
• DM questions to @Gartner_inc without hashtag
Category: Catalyst Cloud Gartner IaaS Mobility Virtualization Tags: Cloud, Mobility, Virtualization, VM
by Kyle Hilgendorf | May 31, 2011 | 4 Comments
Citrix’s Project Olympus announcement last week at Synergy regarding commercial support for OpenStack officially signaled the beginning of an intriguing battle for the attention of enterprise cloud computing environments.
VMware is the industry behemoth in the data center, holding the vast majority of the server virtualization business. Their vCloud initiative over the past two years has been trying to retain enterprise clients by offering a private, hybrid, and public cloud thread that binds things together. Very few options existed for enterprise clients unless those clients wanted to venture away from VMware, try something on their own, or live on the bleeding edge with innovative cloud service providers.
The OpenStack announcement, driven by Citrix, Rackspace, and Dell, is the counterargument to vCloud, and it has support from many companies. But which route will enterprises take?
I hear from many clients growing concerns about staying 100% all-in with VMware and about the cost of VMware. Gartner advises clients to keep options open; yet at the same time, VMware has a proven track record, credibility, and the simplest migration story thus far. If we can be sure of anything, VMware-to-VMware will work well.
OpenStack looks to be the open source alternative. If OpenStack makes it easy to migrate and interoperate between vSphere in the data center and OpenStack, then I believe “game on”. I have no doubt that will be a major focus and those involved have a vested interest to make it work well. Further evidence is that OpenStack supports vSphere as a back-end hypervisor and Citrix even wrote the integrations.
But, are too many companies involved in OpenStack? At what point does “groupthink” cripple innovation? Today everyone is excited, developing, partnering, and innovating. When will greedy hands start to mess things up? When will OpenStack “forks” develop and will we find ourselves in interoperability hell? Individual companies have a hard enough time providing backwards and forwards compatibility with their own products. How will a conglomerate of companies fare any better? Count me optimistically skeptical. I am hopeful, but very cautious. Will we see variants of OpenStack all over the place, similar to the state of Linux today? Will Citrix be the thread that holds it all together?
The world appears to be out to get VMware right now. That’s business. Kings of the hill are always the target, just ask Microsoft. Is OpenStack the legitimate threat? I am excited to watch this battle take place and am anxious to see how it unfolds. How do you think it will transpire?
Category: Citrix Cloud IaaS OpenStack Private Cloud vCloud VMware Tags: Citrix, Cloud, Iaas, Olympus, OpenStack, vCloud, VMware
by Kyle Hilgendorf | April 27, 2011 | 1 Comment
I have now been employed at Gartner for three months. While that is a very short time of employment, I have come to appreciate the knowledge and perspective I gain daily through peer discussions, vendor briefings, and client inquiries. I am beyond impressed with the talent level within Gartner and the challenging and thought-provoking work environment we create for ourselves.
The recent acquisitions of Terremark by Verizon Business and Savvis by CenturyLink further demonstrate why working for Gartner is great. At the Gartner Catalyst North America 2010 conference last July in San Diego, I attended as a client. During the conference I heard my now close colleague and mentor, Drue Reeves, say on stage that the “cloud IaaS market was the telco’s market to lose”. He had a lot of rationale and evidence for this position, but I did not think about it too much at the time.
Fast forward 6-9 months and we are seeing exactly Drue’s position come to light. Verizon and CenturyLink have made it clear that telcos are attacking this market, and both have gone the acquisition route. I realize now that Gartner analysts like Drue are in a unique position. We are able to sit objectively between vendors and clients and see the big picture. I obviously agree with Drue’s position now, as much of my initial research indicates that the major technical problems with enterprises moving to IaaS involve network performance and latency issues (not to mention non-technical legal, transparency, and availability concerns). Telcos are in an ideal position, with their global networks, to solve these network issues faster than more traditional hosters or software companies. We may not see all telcos dive in, but it is their market to win or lose.
While three months of employment is still the honeymoon phase, it is long enough for me to see the value that Gartner offers as a business and the enviable position in which we analysts get to reside. Analysts offer opinions, but a lot of research and fact backs those opinions. I already have a unique and objective position in the market that I did not have six months ago, and that perspective is only going to grow clearer in the years ahead.
Category: Cloud Gartner Hosting IaaS Tags: Cloud, Gartner, Iaas
by Kyle Hilgendorf | April 22, 2011 | 2 Comments
The AWS outage was sad for me to see. As a research analyst who covers cloud computing, I have a vested interest in the success and viability of the cloud.
Amazon is a major bellwether for the cloud. They’ve set direction, driven innovation, and challenged status quo. Cloud providers everywhere either try to emulate Amazon or look to shoot holes in Amazon’s boat. Many in the industry, analysts included, give Amazon little slack and a tough time. This happens to leaders and those on top. It’s a fact of life and I am sure there are many out there taking joy in this outage. Do not count me among those people.
But I do want to find some positives in this outage. A common sign in gyms and workout rooms around the world says, “Pain is weakness leaving the body.” With the AWS outage, we might just find ourselves saying, “Pain is weakness leaving the cloud.”
The outage has been painful for customers, Amazon, and other cloud providers. The cloud concept, technology, and movement took a big hit. However, I believe this pain will lessen some cloud weaknesses. Let me highlight just three examples.
1. Transparency – if customer due diligence and transparency demands were real before, they’ll be even more elevated now. I don’t know of a single enterprise customer that won’t perform more due diligence and demand more transparency as a result of this outage. Cloud providers like Amazon will have no choice but to respond better than they did before. Brand is at stake. Outages happen because technology does fail; the cloud is no exception. But for customers to architect around potential outages, they have to better understand the architecture and limitations at the provider. The good news? Amazon has been pretty transparent with this outage on their health dashboard. Let’s hope that continues with both Amazon and other providers. It might have been a wakeup call.
2. Liability – many legal experts have said that it is going to take a major cloud outage to bring lawsuits, rulings, and ultimately precedents on cloud liability and responsibility. Is this the outage to do that? It’s too early to tell, but I’d be surprised if zero lawsuits come out of this. Customers and providers need better understanding on how courts will rule. The waterfall effect on those rulings could be enormous and lead to more or less cloud adoption. Simply clearing legal uncertainty, in itself, is a big step.
3. Hybrid clouds – with very few exceptions, the industry is nascent and undefined in terms of architecting solutions across clouds or between the cloud and the enterprise. Many enterprises already understood the need or were asking for innovative solutions to help with this. As a result of this outage, there will now be heightened awareness of protecting service levels, applications, and data against the failure of any one cloud or provider. “How can I architect my solution to be resilient against any one cloud provider?” ISVs, orchestration software companies, and venture capitalists will seize this opportunity, and more development will be aimed at these solutions. I am sure it is already occurring in strategic planning sessions at many software companies today. “How can we help customers avoid an outage like this?” The net result will be improved solutions for customers.
We should never like to see an outage like this happen, but every once in a while an alarm is necessary to set the ship back on course. The cloud is not invincible and was never sold as such. In many ways, shame on anyone who was starting to believe that. Customers and providers alike have no choice but to make the best of this situation and use it to improve the overall industry.
Category: AWS Cloud IaaS Tags: AWS, Cloud, Liability, Outage, Transparency