I corresponded with some members of the HP cloud team in email, and then colleagues and I spoke with HP on the phone, after my last blog post called, “Cloud IaaS SLAs can be Meaningless“. HP provided some useful clarifications, which I’ll detail below, but I haven’t changed my fundamental opinion, although arguably the nuances make the HP SLA slightly better than the AWS SLA.
The most significant difference between the SLAs is that the HP’s SLA is intended to cover a single-instance failure, where you can’t replace that single instance; AWS requires that all of your instances in at least two AZs be unavailable. HP requires that you try to re-launch that instance in a different AZ, but a failure of that launch attempt in any of the other AZs in the region will be considered downtime. You do not need to be running in two AZs all the time in order to get the SLA; for the purposes of the SLA clause requiring two AZs, the launch attempt into a second AZ counts.
HP begins counting downtime when, post-instance-failure, you make the launch API call that is destined to fail — downtime begins to accrue 6 minutes after you make that unsuccessful API call. (To be clear, the clock starts when you issue the API call, not when the call has actually failed, from what I understand.) When the downtime clock stops is unclear, though — it stops when the customer has managed to successfully re-launch a replacement instance, but there’s no clarity regarding the customer’s responsibility for retry intervals.
(In discussion with HP, I raised the issue of this potentially resulting in customers hammering the control plane with requests in mass outages, along with intervals when the control plane might have degraded response and some calls succeed while others fail, etc. — i.e., the unclear determination of when downtime ends, and whether customers trying to fulfill SLA responsibilities contribute to making an outage worse. HP was unable to provide a clear answer to this, other than to discuss future plans for greater monitoring transparency, and automation.)
I’ve read an awful lot of SLAs over the years — cloud IaaS SLAs, as well as SLAs for a whole bunch of other types of services, cloud and non-cloud. The best SLAs are plain-language comprehensible. The best don’t even need examples for illustration, although it can be useful to illustrate anything more complicated. Both HP and AWS sin in this regard, and frankly, many providers who have good SLAs still force you through a tangle of verbiage to figure out what they intend. Moreover, most customers are fundamentally interested in solution SLAs — “is my stuff working”, regardless of what elements have failed. Even in the world of cloud-native architecture, this matters — one just has to look at the impact of EBS and ELB issues in previous AWS outages to see why.