Kyle Hilgendorf

A member of the Gartner Blog Network

Research Director
3 years with Gartner
13 years in IT industry

Kyle Hilgendorf works as a Research Director in Gartner for Technical Professionals (GTP). He covers public cloud computing and hybrid cloud computing. Areas of focus include cloud computing technology, providers, IaaS, SaaS, managed hosting, and colocation. He brings 10 years of enterprise IT operations and architecture experience.

AWS moves from ECU to vCPU

by Kyle Hilgendorf  |  April 16, 2014  |  10 Comments

In a quiet move yesterday, Amazon Web Services apparently abandoned its Elastic Compute Unit (ECU) approach to describing and selling EC2 instance types and moved towards a more traditional vCPU approach. This was done without any formal announcement, and I wonder what effect it will have (positively or negatively) on customers.

For existing AWS customers that have grown accustomed to ECU over the past years, this could be a somewhat disruptive change. That is especially true for those at larger scale that have invested a lot of time and money optimizing instance size and horizontal scalability, based on their own performance testing and analysis of which kind and how many EC2 instances they need for their use case. Initially, this may not matter much for existing deployments, but it will have an impact on scaling out and on new use cases. Bottom line – these types of customers are pretty savvy and will find ways to adjust.

For new or prospective AWS customers, the ECU was always a gnarly concept, and it took time to grasp. More traditional deployments, like those based upon VMware, were always specified in vCPUs. Bottom line – more traditional IT Ops admins and new AWS customers will likely welcome this as a move toward familiarity and simplicity.

However, for all customers, there is one aspect of this that could be problematic. AWS is a massive-scale cloud provider, with a wide mix of servers and processor architectures in service. Therefore, two instances each with 2 vCPU will not necessarily be equivalent. One instance could reside on a 2012-era processor while the other could reside on a 2014-era processor. Many people have written about the fact that EC2 processor architecture varies across instance types and across regions, even for instances described as having the “same specs”. As a result, some savvy organizations have moved to a “deploy and ditch” strategy whereby they deploy many instances, interrogate them all for processor architecture, and then ditch all the ones that are not up to current or fastest specs.
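A minimal sketch of the “interrogate and ditch” decision, assuming each instance's `/proc/cpuinfo` text has already been collected by some other means (launching and terminating instances is left out). The allow-list, the instance IDs and the helper names are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical allow-list of acceptable processor model substrings.
# These names are illustrative, not an AWS-published list.
ACCEPTABLE_MODELS = ("E5-2670",)


def cpu_model(cpuinfo_text):
    """Return the first 'model name' value from /proc/cpuinfo-style text."""
    match = re.search(r"^model name\s*:\s*(.+)$", cpuinfo_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def triage(instances, acceptable=ACCEPTABLE_MODELS):
    """Split {instance_id: cpuinfo_text} into (keep, ditch) lists of ids."""
    keep, ditch = [], []
    for instance_id, text in sorted(instances.items()):
        model = cpu_model(text) or ""
        if any(tag in model for tag in acceptable):
            keep.append(instance_id)
        else:
            ditch.append(instance_id)
    return keep, ditch


# Made-up sample data standing in for what two instances might report.
sample = {
    "i-aaa": "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz\n",
    "i-bbb": "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz\n",
}
keep, ditch = triage(sample)
print("keep:", keep, "ditch:", ditch)
```

Matching on a model substring is a deliberately crude rule; an organization doing this for real would more likely rank instances by its own benchmark results.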

This will further escalate an important transparency issue for AWS. AWS will need to clarify the physical processor architecture strategy per instance type or instance family. As a customer, I will want to know which instance types are based on Sandy Bridge processor architectures, for example, because that tells me what a vCPU will equate to. I will want to know the processor strategy similarities and differences between an m2 and an m3, or between an m3.medium and an m3.large. And if there are no differences – I will want to know that also and have something in writing stating as much.  Customers wanted this before with ECU, but ECU gave AWS a way to deflect these customer questions.

ECU was a foreign concept to grasp initially, but it did provide one benefit – a standard of measure. Now that AWS has moved to a vCPU strategy, will customers applaud this or complain? I’d love to hear your thoughts in the comments below.


Category: AWS Cloud

Microsoft joins the Open Compute Project (OCP) – Cloud Transparency

by Kyle Hilgendorf  |  January 28, 2014  |  1 Comment

Today Microsoft announced that it is joining the Open Compute Project by contributing what Microsoft calls the “Microsoft cloud server specification”.  To date, almost all of the major public cloud services have failed to expose the inner workings and configurations of the infrastructure that powers the public cloud service.  At Gartner, we often advocate (on behalf of our clients) the importance of exposing underlying infrastructure configuration to cloud customers.  In fact, in our research, “Evaluation Criteria for Public Cloud IaaS Providers”, we have a specific set of requirements stipulating published infrastructure transparency by IaaS providers.

With ever-increasing demand for hybrid cloud architectures, customers really do need some level of insight into the underlying infrastructure configuration, especially in IaaS, in order to assess the risk and compatibility of using the environment.  Furthermore, understanding the relevant details of the configurations affects migration, compliance, licensing, configuration and performance.

Obviously, providers can go too far and expose too much information to customers, which could lead to targeted security attacks.  Gartner is not advocating sharing information such as the location and number of surveillance cameras, the number of trained people on site at any one time, or the security policies configured for IDS/IPS systems.  But what Microsoft is doing today, I believe, strikes the right balance.

I believe it also continues to confirm that Microsoft not only is serious about playing in the cloud provider market, but is also listening to enterprise requirements and taking obstacles out of the equation.  Customers will now be able to clearly understand the makeup of servers and configurations within Windows Azure, discern local levels of redundancy and availability, and make intelligent decisions about when to use Fault Domains or larger availability protections such as deployments into multiple locations, geographies or additional providers.  At the end of the day, customers want as much information as they can get to make the most informed decisions.

Furthermore, what the Microsoft blog entry does not highlight is the long term benefit that sharing these details offers to large customers and partners to build hybrid clouds and reap its benefits.  As the Microsoft and OCP initiative moves forward, there is no reason why large customers and partners cannot start to deploy the same Azure-like infrastructure internally and ensure compatibility in a hybrid cloud architecture as workloads migrate to the public cloud or back to the internal, private cloud.

I’ll be closely watching this evolution in Microsoft’s strategy and paying attention to how enterprise customers react.  I will also be very curious what (if any) impact this has on AWS.  Microsoft has often emulated moves AWS has made (especially with price cuts) and it will be fascinating to see if AWS responds to this by increasing the transparency of their environment to customers.

What will you be watching for?



Category: Cloud Microsoft

We’re Hiring Cloud Experts – Why Work for Gartner?

by Kyle Hilgendorf  |  January 23, 2014  |  1 Comment

In just two short weeks I will hit my third anniversary with Gartner.  I am often asked by peers, clients, vendors, colleagues and friends what it’s like working as an analyst at Gartner.  But more importantly, we have two new open positions right now and I wanted to entice those of you that are interested to look at the positions, read this blog and if it sounds like a fit – apply.

Virtualization and Private Cloud Analyst

Public and Hybrid Cloud Analyst

As I reflect on my three years at Gartner, here are the things I love most about working here.

  1. The Gartner research community is an incredible thought-leader warehouse that has a small family feel.  On a daily basis I am blown away by the amazing depth of thought and analysis that comes out of the collective Gartner research division.  You would think that this many intellectual individuals, most of whom have been high performers in previous jobs, would come with an insane amount of ego and competitive natures.  But by and large, I find quite the opposite.  The team atmosphere and dedication to uncovering the right analysis drive a true value for one another and partnership for the greater goal rather than individual achievements.
  2. Being a Gartner analyst is a truly unique industry position.  Over the course of each of my years, I have had the opportunity to speak intimately with several hundred end user organizations.  These engagements happen daily on phone inquiries and regularly in conference one-on-ones or on-site and face to face visits.  I had a really good perspective prior to Gartner about what the company I worked for needed and wanted.  But now I have a great perspective on what many organizations collectively want and need for business solutions.  This knowledge allows me to think and analyze what the industry actually needs and make recommendations to vendors and providers to help move the industry forward.  I have truly come to cherish my access to each of the Gartner clients that interact with me and the trust they put in me to help advise them strategically and tactically.  Finally, Gartner is objective.  We do not accept any vendor money to sponsor research.  Not once have I ever been influenced or forced to write anything other than what my own research has uncovered.  Gartner takes this very seriously and it is what makes Gartner the best analyst firm in existence.
  3. Gartner offers a great work-life balance.  The majority of Gartner analysts get to work from home.  Working from home is not for everyone and it could get lonely, but personally I love the flexibility it offers for me and my family and the quiet atmosphere that a corporate environment with cubicle farms can never offer.  Furthermore, even though most of us work from home, I very much feel part of a team environment.  We leverage phone and video conference technologies frequently and engage in conversations to keep the interaction alive.  In many ways I feel as much a part of a team working from home as I ever felt working in a cubicle farm.
  4. Gartner is committed to the needs of our clients.  It’s a statement all companies make.  But I have found Gartner to really mean it.  I think a big reason for this is that each of us talks directly with our clients every single day.  It’s easy to remember your focus when you interact with that focus (our clients) routinely.  But Gartner as a company also keeps investing in what our clients want.  A great example is these two open positions.  Cloud Computing continues to be a fast-growing, high-demand coverage area for our clients.  Therefore, we are hiring more experts.  These experts will get to work alongside already great cloud colleagues such as Lydia Leong, Alessandro Perilli, Chris Gaun, Gonzalo Ruiz, Drue Reeves, Douglas Toombs… and many, many others.
  5. Gartner stratifies its research.  Gartner has always been known as the best for CIO and Senior IT Leadership research.  But Gartner has also broadened and invested in other areas of research.  I work in the Gartner for Technical Professionals research division, an area of research aimed at senior level technology professionals (e.g., enterprise architects and engineers).  This research division completes a holistic research offering that other analyst firms simply cannot offer.  It also allows us internally to collaborate among analysts that specialize in all levels of an IT organization to deliver timely, accurate and tailored research to each individual in an IT organization, specific to their current role.

A while back, my colleague, Lydia Leong wrote two separate blog entries about working for Gartner that I will link here. I encourage you to also read her insights.

Do you love research, analysis and opportunities to expand your insight into IT and the industry as a whole?  Do you have a specific expertise in private, hybrid or public cloud right now?  If so, click the links at the top, apply, and hopefully join our great team!   I look forward to meeting you.  If you would like to engage in a private conversation first, please email me at kyle <dot> hilgendorf <at>


Category: Cloud Gartner

Cloud Exit Strategies – You DO need them!

by Kyle Hilgendorf  |  September 18, 2013  |  9 Comments

My colleague, Jay Heiser, also has a good take on this in his blog.  I will not repeat his thoughts.

Multiple media outlets have been reporting that Nirvanix, a popular public cloud storage provider, is closing down and giving customers only two weeks (reports now say October 15 instead of September 30) to get their data off the service.  As further evidence, Gartner has been receiving client inquiry requests in the last 24 hours from Nirvanix customers asking for immediate planning assistance in moving off the Nirvanix service.


What are clients to do?  For most – react…and react in panic.  You have 2 weeks.  Go!  You don’t have time to worry about how much data you have stored there.  You don’t have time to upgrade network connections or bandwidth.  You don’t have time to order large drives or arrays to ship to the provider to get your data back.  You may not even get any support from the provider!  You may be facing the worst company fear – losing actual data.
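To make the time pressure concrete, here is a back-of-the-envelope sketch of how long it takes to pull data back over a network link. The data volumes, the 1 Gbps link and the 70% usable-throughput figure are all illustrative assumptions, not numbers from any Nirvanix customer.

```python
def transfer_days(data_terabytes, link_mbps, efficiency=0.7):
    """Days needed to move data_terabytes over a link of link_mbps.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable; 0.7 is an illustrative guess, not a measured value.
    """
    bits = data_terabytes * 1e12 * 8  # decimal terabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second / 86400  # seconds to days


def fits_deadline(data_terabytes, link_mbps, deadline_days, efficiency=0.7):
    """True when the transfer would finish within deadline_days."""
    return transfer_days(data_terabytes, link_mbps, efficiency) <= deadline_days


# Hypothetical volumes checked against a two-week deadline on a 1 Gbps link.
for tb in (10, 200):
    days = transfer_days(tb, link_mbps=1000)
    print(f"{tb} TB: {days:.1f} days, fits in 14 days: {fits_deadline(tb, 1000, 14)}")
```

Under these assumptions a modest archive comes back in a day or two, while a couple of hundred terabytes cannot make a two-week window at all. That is the kind of number an exit plan should already contain.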

Gartner has been advocating the importance of Cloud Exit Strategies to clients for some time.  In Gartner for Technical Professionals, we even published a very comprehensive strategy document titled “Devising a Cloud Exit Strategy: Proper Planning Prevents Poor Performance”.  I’m sad to say, however, that compared to many other Gartner research documents, this document has not seen nearly the amount of demand or uptake from our clients.  Why is that?  I suspect it is because cloud exits are not nearly as sexy as cloud deployments – they are an afterthought.  It’s analogous to Disaster Recovery and other mundane IT risk mitigation responsibilities.  These functions rarely receive the attention they deserve in IT, except immediately following major events like Hurricane Sandy or 9/11.

Does that mean this news regarding Nirvanix will be a catalyst for cloud customers to pay attention to the importance of building exit strategies?  Perhaps.

If you are a Nirvanix customer, it’s too late to build a strategy.  Drop whatever you are doing and get as much of the data as you can back immediately.

If you are a customer of any other cloud service (that is basically all of us) – take some time and build a cloud exit strategy/plan for every service you depend upon.  Cloud providers will continue to go out of business.  It may not be a frequent occurrence, but it will happen.  And even if your cloud provider does not go out of business, here is a list of many other factors which may signal that you need to exit a cloud service:

  • Provider’s services less reliable than advertised in SLAs, contracts or expressed expectations
  • Soured relationship with provider
  • Change in service levels
  • Change of provider ownership
  • Change of price
  • Change of terms and conditions
  • Expiration of enterprise agreement or contract
  • Lack of support
  • Data, security or privacy breach
  • Provider inability to stay competitive with industry features
  • Repeated or prolonged outages
  • Lack of remuneration for services lost
  • Change of internal leadership, strategy or corporate direction

Cloud customers, don’t delay.  All the risk mitigation tasks you would perform if one of your in-house application vendors suddenly went out of business should ideally be done in advance, before you leverage cloud services. Exit strategies are important and necessary insurance policies.  Don’t be caught off guard.


Category: Cloud Providers

vCloud Hybrid Service: My take from VMworld 2013

by Kyle Hilgendorf  |  August 28, 2013  |  12 Comments

This week at VMworld 2013, VMware announced the general availability of vCloud Hybrid Service (vCHS).  vCHS has been in an early adopter program for the last couple of months but will enter GA on Monday.

vCHS is VMware’s public IaaS offering, which will attract comparison against AWS, Azure Infrastructure Services, and others in Gartner’s Cloud IaaS Magic Quadrant.  However, vCHS will be limited for a while.  At launch, vCHS is a US-only hosted service, although it is sure to expand to Europe and Asia in 2014.  Q4 of 2013 promises services like DR-as-a-service and Desktop-as-a-service, but more basic services that have become the norm at competing providers, such as object storage, will be missing for the foreseeable future.  In fact, many developer-enhancing and cloud-native services (e.g. auto scaling, continuous deployment, packaging) are not part of vCHS at launch.

My expectation is that the interest from enterprise customers will be very high around vCHS, so what is my early take?

First, VMware had to create a public cloud service offering.  AWS has changed the industry and created a market and VMware had no choice but to compete with a public IaaS offering.   VMware is the private datacenter virtualization and private cloud behemoth.  Yet, increasingly, customers are considering public cloud deployments for future state (cloud-native) applications.  As organizations are using public clouds for cloud-native applications and dev/test workloads, an inflection point is on the horizon for the 80-90% of all other workloads possibly moving to public cloud environments.  VMware did not want to find themselves left out of that future shift.  Therefore, VMware had to try on their own to enter this market.  If not that, then they would have had to find a way to partner with AWS.  As of today, they’ve not found such a partnership.

Second, VMware has a compelling opportunity.  Clients are hugely invested in VMware technology and there is reason to believe these same organizations are looking for quick and easy runways into the public cloud for traditional workloads.  Migrating or converting traditional workloads into AWS or Azure has been minimal at best.  No vendor or provider has a better chance of “holding onto” VMware workloads than VMware itself.  VMware understands the importance of the network in a hybrid cloud environment, and their opportunity with SDN, the NSX offering from Nicira and the ability to cross connect into vCHS data centers will help their hybrid cloud story.  Finally, a true hybrid cloud story centers around management, and here VMware is in a better position than most major public CSPs, who struggle greatly with native management.

Third, I don’t see vCHS impacting AWS negatively.  I do see it impacting a large market of many smaller or regional vCloud providers.  Because vCHS will be missing many of the features that AWS users have come to depend upon, I do not expect to see any exodus of AWS customers to vCHS.  VMware claims that vCHS and AWS will attract different buyers and that AWS does not focus on enterprise-grade or compliant workloads.  I disagree.  From 2006-2012, AWS did struggle to capture the enterprise buyer, but every move AWS has made in 2012 and 2013 (and, I expect, every future move) is aimed directly at enterprise buyers and enterprise-grade applications.  Furthermore, few providers can compete with AWS on security and compliance capabilities.  However, with the price point of vCHS and with the traditional VMware feature set, many VMware providers, including VSPPs, will face a very fierce new competitor in vCHS.  VSPPs will have to be extremely clear about what value proposition they bring against vCHS (for example, industry vertical specialization) or be relegated to reselling into vCHS.

Fourth, I’m intrigued by the vCHS franchise service design initially rolling out with Savvis.  VMware must expand domestically and internationally quickly.  They cannot do that on their own.  vCloud Datacenter Services was VMware’s first attempt to do this, but it mostly failed because the various providers differed enough to erode compatibility.  With the vCHS franchise program, VMware owns and operates the vCHS architecture and the franchisees provide the location, network and facility hosting services.  VMware does not have a large portfolio of datacenters to compete on their own, nor do they have any significant ownership in WAN networking or Internet peering.  Savvis with CenturyLink brings the networking breadth to the relationship and other future franchisees will do much the same internationally.  Expect the cross connects to be similar to AWS Direct Connect, and that is a win for customers.  Both VMware and the franchisee can sell into the service, and the benefit for the franchisee is potential viral hosting growth of vCHS in their facilities as well as the opportunity to upsell customers into managed hosting, colocation and network cross connects.  Franchising will not be easy though.  VMware will have to manage it very closely to ensure quality and consistency, much like McDonald’s corporate tightly oversees all franchise restaurants.  It’s about ensuring a consistent and stable user experience and that should not be understated.  But it is VMware’s opportunity to enter new locations very quickly.  It would not surprise me if vCHS is in as many or more locations as AWS and Azure within 12-18 months through franchising.

Fifth, expect there to be growing pains.  VMware hired a fantastic leader in Bill Fathers, formerly from Savvis.  Bill brings a great leadership background in running services and is already pushing vCHS into a 6-week release cycle – a concept foreign to traditional VMware products.  But vCHS is not a commodity; it’s a uniquely created service. Multiple vCloud providers have told me that VMware is in for a surprise with their own products, in that VMware will start to find the breaking points of product scalability.  Therefore, I expect vCHS to go through growing pains similar to those other major CSPs have gone through over the past few years.  vCHS will not be perfect, it will have outages and it may not be as seamless between franchisees as promised. Customers should know this and pay attention to how VMware responds in the midst of issues, rather than hold them to perfection.  And if customers cannot accept this risk today, they should wait on the sidelines or look to a provider with more years in the market.

Finally, the public IaaS provider market is starting to show some interesting segmentation lines.  AWS is the dominating force, but mega vendors in the form of Microsoft, VMware and Google have made their intentions known and the service development and innovation each company possesses is creating a line between mega providers and the rest of the market.

So what does vCHS mean to you?  Well, I think long term, many organizations will not be able to avoid using it in some capacity.  Even some of the largest AWS adopters will find a place where vCHS shines past AWS.  Perhaps it’s the DRaaS or DaaS offerings on the horizon.  Perhaps it’s simplified lift and shift of large pools of VMs.  Maybe it’s more seamless management between the private datacenter and a public IaaS offering.  Whatever the use case ends up being, there is plenty of room in this market for multiple providers and most organizations will want at least 2-3 strategic IaaS partners for properly placing workloads based on individual requirements.  With the saturation of VMware in the enterprise, vCHS will surely be a logical endpoint for many of those workloads.  But vCHS will come with a ramp up and improvement period.  For organizations that want to assess it on day 1 and on an ongoing basis, Gartner’s “Evaluation Criteria for Public Cloud IaaS Providers” can help you there.

What are your thoughts on vCHS?


Category: AWS Cloud Hybrid IaaS Microsoft Providers vCloud VMware

Cloud Security Configurations: Who is responsible?

by Kyle Hilgendorf  |  April 2, 2013  |  3 Comments

A Rapid7 report surfaced last week that found some 126 billion AWS S3 objects exposed to the general public.  AWS has since borne the brunt of criticism from many blogs and tech magazines for its “lack of security”.  But I have to say, as an objective analyst, this is not the fault of AWS.

Security in S3 is binary for each object.  Private or Public.  Within private, there are a number of different settings one can employ.  Private is also the default security control for all S3 objects.  The AWS customer must manually go in and configure each individual object as “public”.  There might be very good reason for doing so.  For example, companies use S3 all the time to post public information that they want to share or make accessible for the world.  S3, and other object stores, are great for posting content such as public websites, videos, webinars, recordings, or pictures.  In other words, there might be very good reason why 126 billion objects are publicly accessible.

But for those objects that should not have been made public, the question really comes down to who is responsible – the provider or the customer?  I’ll argue this is the customer responsibility.  AWS offers customers what they want.  Security or public accessibility – the customer chooses.  There are reasons for both and customers have the power to choose.  Consider how many customers would be upset if AWS took away public accessibility options from S3.  I’d bet a large percentage of S3 customers would complain as S3 is great for publishing public websites and content.

If AWS has any fault here, it is making self-service and automation too smooth and easy, but isn’t that the goal of public cloud?  It is quite easy to create a bucket policy that opens up access to all current and future objects in a bucket for anonymous users and perhaps that is what happened to some of the more critical or private data that Rapid7 found in this study.  It is possible that one admin created a bucket policy and another admin or user uploaded sensitive data into the bucket unaware of the security configuration.  But at the same time, these bucket policies can be incredibly helpful for organizations that want to expose all objects in a bucket, for instance, a public web site.  However, the AWS management console does not provide simple visibility into what objects are accessible publicly or to anonymous users.  To gain this level of insight, you will need to have an understanding of the AWS S3 API.
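As a rough illustration of the kind of check an organization could run itself, the sketch below scans a bucket policy document for `Allow` statements whose principal is anonymous. It assumes the policy JSON has already been fetched through the S3 API; the helper names and the sample bucket are hypothetical, and the check deliberately ignores `Condition` elements and per-object ACLs, so it is a starting point rather than a complete audit.

```python
import json


def _as_list(value):
    """Policy fields may hold a single item or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def is_anonymous(principal):
    """True when a policy Principal element matches any caller."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        return "*" in _as_list(principal.get("AWS", []))
    return False


def public_statements(policy_json):
    """Return Allow statements in a bucket policy that grant anonymous access."""
    policy = json.loads(policy_json)
    return [
        stmt
        for stmt in _as_list(policy.get("Statement", []))
        if stmt.get("Effect") == "Allow" and is_anonymous(stmt.get("Principal"))
    ]


# Hypothetical policy of the "open the whole bucket" kind described above.
sample_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicRead",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
})
for stmt in public_statements(sample_policy):
    print("anonymous access:", stmt["Action"], "on", stmt["Resource"])
```

Run on a schedule against every bucket, a check like this could help catch the scenario where one admin opens a bucket and another later uploads sensitive data into it.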

But, in the end, customers are responsible.  Customers will always be responsible in the public cloud for their applications and their data – beware of configurations, features, and options.  I do not dispute that many objects found in the report may contain sensitive information; unfortunately, user error or confusion could have led to the accidental public exposure of such objects.  Therefore, it is paramount that organizations employing public cloud services build not only clear governance practices, but also monitoring and alerting practices to raise awareness within the organization when digital assets may be exposed or not secured in the fashion that the data warrants.



Category: AWS Cloud Providers

Devising a Cloud Exit Strategy

by Kyle Hilgendorf  |  March 29, 2013  |  2 Comments

I’m happy to share today that our newest Gartner for Technical Professionals research is out: Devising a Cloud Exit Strategy: Proper Planning Prevents Poor Performance.

There is very little published in the industry about how to create a cloud exit strategy or plan.  If you search hard, you’ll find some blogs or magazine articles about why a cloud exit strategy is important, but almost nothing in terms of how you put the strategy and plan in place.

Proper risk management includes four pillars:  Accept, Avoid, Transfer and Mitigate.  For the last pillar, mitigate, customers have many options available: encrypt data, distribute availability, scale horizontally, enlist cloud brokers, implement hybrid architectures, monitor, alert, etc.  However, perhaps the ultimate risk mitigation is to have a cloud exit strategy.  While we all have faith in the future of cloud computing, many events might occur that would warrant an exit from a cloud service, including:

  • Provider’s services less reliable than advertised in SLAs, contracts or expressed expectations
  • Soured relationship with provider
  • Change in service levels
  • Change of provider ownership
  • Change of price
  • Change of terms and conditions
  • Expiration of enterprise agreement or contract
  • Lack of support
  • Data, security or privacy breach
  • Provider inability to stay competitive with industry features
  • Repeated or prolonged outages
  • Lack of remuneration for services lost
  • Change in your organization’s internal leadership, strategy or corporate direction

And so the remainder of this research focuses on how to put that cloud exit strategy and plan in place.  This is where we as GTP take pride: in delivering practical, how-to advice for our clients. It’s also a focus for us at this year’s Gartner Catalyst conference in our Cloud Track.  Our Cloud Track theme is “Cloud Services: Moving Past Partial Commitment”.  We’ve exhausted the ‘what’ and the ‘why’ of cloud computing, and for organizations to move past occasional uses of cloud services, we will be delivering how-to advice for:

  • Building cloud strategies
  • Mitigating cloud risks
  • Building and delivering cloud native applications
  • Managing and operating cloud applications and assets
  • Forging and managing cloud provider relationships
  • Building private cloud architectures
  • Leveraging Hybrid Cloud solutions

Are you using cloud services?  Are you prepared with a Cloud Exit Strategy?  Are you looking for more “how-to” advice? We hope to see you at Gartner Catalyst 2013!



Category: Cloud Evaluation Gartner Outage Providers

Google Reader’s Death – A case for auto scaling

by Kyle Hilgendorf  |  March 14, 2013  |  Comments Off

When I talk about cloud characteristics with clients, elasticity and scalability often come up in conversation.  However, far too often clients jump to conclusions that auto scaling capabilities are out of scope or too complex for what they actually need.  Many times they may be right, but a recent industry announcement provides a perfect case study as to why you should always consider auto scaling designs.

Yesterday, Google announced that it would be shutting down Google Reader on July 1, 2013.  It was a painful announcement for me as my research is very dependent upon a meticulously designed news consumption strategy.  Google Reader and RSS feeds are vital to my personal strategy for keeping up with the industry.

A lot has been published on Google Reader alternatives.  One of the more popular options mentioned is NewsBlur.  However, judging from tweets and news today, NewsBlur has been having major trouble keeping up with the incredible demand on its service from defecting Google Reader users looking for alternatives.  Furthermore, I suspect the service issues at NewsBlur today may lead some prospective customers to conclude NewsBlur is not for them and move on to other alternatives.  And that is too bad.  Below is a tweet from NewsBlur today.

Auto scaling application designs should not only rely on your internal forecast for growth and possible demand, but should also consider external forces that might drive additional demand your way.  What if your top competitor or the leader in the market announced tomorrow that they were shutting their doors?  What if that competitor had a catastrophic corporate or technical event?  Are you ready and able to take on all their customers?  If not, they might go to the next option – opportunity lost!
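One simple way to reason about this is a proportional scaling rule: size the fleet so that the load on each instance moves back toward a target. The sketch below is only an illustration of that idea, not a description of how NewsBlur or any particular provider scales; the function name, the metric and the bounds are assumptions.

```python
import math


def desired_capacity(current_instances, observed_load, target_load,
                     min_instances=1, max_instances=100):
    """Proportional scaling: size the fleet so per-instance load nears target.

    observed_load and target_load are per-instance values of the same
    metric (for example requests per second). The bounds are illustrative.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current_instances * observed_load / target_load)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, wanted))


# Hypothetical surge: 4 instances each seeing 3x their target load.
print(desired_capacity(current_instances=4, observed_load=900, target_load=300))
```

The `max_instances` ceiling is where the external-demand question bites: a ceiling set from an internal growth forecast alone may be far too low on the day a competitor shuts down.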

These scenarios do not apply to every application, but they serve as an interesting case study for your business being ready to respond to demand not only when you do something good, but also when your competitors do something bad.  And the latter will likely come at the most unexpected times.


Category: Cloud Uncategorized

Can cloud customers learn from the Super Bowl power outage?

by Kyle Hilgendorf  |  February 4, 2013  |  Comments Off

Last night, 108M+ people tuned in for Super Bowl XLVII.  It looked to be a blowout of a game, yet ended in exciting fashion, perhaps due to one factor not related at all to football.

A power outage.

Early in the 3rd quarter, exactly half of the lights went out in the Superdome.  A 34-minute game delay ensued while a likely frantic army of individuals behind the scenes attempted to get the lights back on.  After the delay, the game momentum turned dramatically.  Thankfully (in my opinion) the Ravens held on to win or we’d be hearing nothing but “Power Gate” complaints for the next 6 months.

It got me thinking though – is there anything cloud consumers can learn from this power outage? They seem like unrelated events, but let me clarify some of my brief thoughts.

  • Outage pain is often more about time to recovery – would anyone have been upset if the lights came back on 2 minutes later?  Probably not.  But just like with cloud outages, we’re dealing with highly complex, slow-to-recover systems.  When something goes wrong in the cloud, don’t expect it fixed in minutes.  By comparison, 34 minutes would be fantastic.  Therefore, cloud customers should prepare contingency, recovery, or triage plans to operate in the midst of a prolonged outage.
  • Even highly resilient systems will fail – failure is inevitable. We all assume that something the size of the Superdome has multiple power paths, protections, circuits, breakers, and generators.  Even with all that planning, something went wrong.  To my knowledge, no cause has been determined with 100% certainty (more on that later).  Similarly with cloud, no matter how many geographic zones or data centers you distribute your application across, there is always some event that can knock you down.  We’ve seen outages due to control plane complexity, software bugs, and outages of resiliency-enabling components like load balancers.  Customers should architect for resiliency, but also architect for many levels of failure if possible.  Keep in mind all of this comes at a cost too.
  • Root cause analysis – as of yet, no one has taken responsibility for the Superdome power outage. Eventually, the truth will come out.  We are not sure whether it will be an admitted mistake, or a mistake uncovered by investigative journalism.  But we will get the truth.  Cloud providers improved dramatically in this regard during 2012.  Customers are getting much better post mortems and root cause analysis documents from providers after an outage.  Sometimes these take a few days or a week, but they come.  If you, as a customer, are not getting a post mortem on a cloud issue, I’d encourage you to demand it or move to a different provider.
  • Outages improve the market – whatever the cause of the power outage, you can bet it will be addressed or improved prior to any other major event at the Superdome.  Furthermore, every major sporting arena will be tuning in to see what they can learn from the issue in New Orleans.  Power designs in sporting stadiums will improve as a result of this.  Similarly, after every major cloud outage, both the provider affected and its competitors learn and improve.  Outages, while painful, are often beneficial.
  • Employing the best staff is not foolproof – I’ve read that the Super Bowl and Superdome had some of the best technicians on hand both planning and running the event.  Yet the power still went out.  Cloud providers also tend to employ the best and brightest, but issues still happen. Humans, no matter how brilliant, are not perfect, nor can they prevent every issue.
  • Outages are not only for nascent markets – I’ve heard many people blame cloud outages on the fact that many providers are young and services are immature/nascent.  It’s a fair argument.  But power distribution to major sporting events is a very mature market.  And yet a problem still occurred. Similarly, as cloud providers mature, we should expect fewer outages, but they will not disappear.  See prior bullets for justification.
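
The first two bullets can be put into rough numbers.  Below is a back-of-the-envelope sketch using the standard availability formula, MTBF / (MTBF + MTTR), plus the usual formula for redundant copies.  The function names and every figure in it are hypothetical, and the redundancy formula assumes the copies fail independently, which is exactly the assumption that shared control planes and load balancers tend to break.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time a system is up, given mean time between failures
    and mean time to recovery (both in hours)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def combined_availability(single, replicas):
    """Availability of N redundant replicas, ASSUMING independent failures.
    The service is down only when every replica is down at the same time."""
    return 1 - (1 - single) ** replicas


def downtime_minutes_per_year(avail):
    """Expected minutes of downtime per (365-day) year at a given availability."""
    return (1 - avail) * 365 * 24 * 60


# Same failure rate, different recovery times (illustrative numbers only).
fast_recovery = availability(mtbf_hours=1000, mttr_hours=2 / 60)   # 2 minutes
slow_recovery = availability(mtbf_hours=1000, mttr_hours=34 / 60)  # 34 minutes

print(downtime_minutes_per_year(fast_recovery))
print(downtime_minutes_per_year(slow_recovery))

# Adding zones helps, but the result approaches 100% without ever reaching it.
for zones in (1, 2, 3):
    print(zones, combined_availability(0.99, zones))
```

Two things stand out even in a toy model like this: recovery time moves yearly downtime as much as failure frequency does, and no number of replicas brings availability to exactly 100%.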

These are just a few of the correlations I’ve come up with.  What other connections do you see?


Category: Cloud Outage

Go West Cloud Customers?

by Kyle Hilgendorf  |  October 23, 2012  |  Comments Off

Yesterday was marked by another major cloud outage.  Amazon Web Services experienced a single availability zone issue in the US-East-1 region.  As with all major cloud provider outages, I get the opportunity to speak to customers affected by the outage or customers considering broader public cloud adoption.

One question was asked of me in multiple conversations: Does US-East-1 have systemic design and availability issues?

This question stems from the fact that most (but not all) AWS issues have occurred in the US-East-1 region.  Unfortunately the answer is not a definitive yes or a no, but let me elaborate.

US-East-1 (also simply referred to as US-East) is the oldest (i.e., original) and substantially largest AWS region.  It is unclear exactly how much larger US-East-1 is than other regions such as US-West or EU-West, but substantial is probably an understatement.  In the July 2, 2012 power outage post mortem, AWS stated that US-East-1 is composed of more than 10 data centers.

US-East-1 is also the cheapest and default region for many deployments.  Therefore, the scale and impact of US-East-1 are quantitatively larger than those of other AWS regions.

So while US-East-1 may not have systemic design and availability issues, it is fair to say that US-East-1 pushes the limits in terms of scale, capacity, stress on software logic, distribution, and complexity.  While AWS does not deliberately use US-East-1 as a test bed or trial ground, the unfortunate result of it being so much larger than the other regions is that US-East-1 by default becomes that trial ground.

Which leads to the basis of the question in the title of this blog:  Are customers better off moving to other AWS Regions (e.g., US-West or EU-West)?  Unfortunately the answer may be yes.  It might be beneficial to not be in the biggest AWS pond where scale and complexity issues first occur.  The advantage of this is that fixes and optimizations can be uncovered in US-East-1 and deployed to the region you reside in before that region gets to the same size/scale.

However, perhaps you must be in US-East-1 for location requirements, price constraints, or a number of other reasons.  But if you don’t have an affinity to the east coast of the United States, and if you can tolerate slightly higher prices in another region, perhaps it is time to move west, AWS cloud customers.

I’d love to hear your comments and thoughts.


Category: AWS Cloud IaaS Outage Providers