Google made a number of App Engine-related announcements earlier this week. The most notable was a preview of the future paid service, which allows you to extend App Engine’s quotas. Google has hinted at pricing before: at its developer conference this past May, it said that the first 5 MPV (million page views) would effectively be free, and that each MPV after that would cost about $40.
The problem is not the price. It’s the way the quotas are structured. It looks like Google is going to let you raise the quota caps and pay for whatever you use above the free level, but never let you exceed the limit you set. That means Google is committing itself to a quota model, not backing away from it.
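To make the capped model concrete, here’s a toy billing calculation using the figures Google gave at its conference (first 5 MPV free, about $40 per MPV after that). The function name `bill`, and the assumption that traffic above your cap is refused rather than billed, are my own illustration of the cap described above, not Google’s published pricing formula.

```python
# Toy model of capped, pay-as-you-go billing.
# Assumptions (illustrative, not Google's published formula): usage is
# measured in millions of page views (MPV), the first 5 MPV are free,
# each MPV beyond that costs $40, and demand above the cap is refused.

FREE_MPV = 5.0        # free allowance cited at Google's developer conference
PRICE_PER_MPV = 40.0  # approximate price per MPV beyond the free tier


def bill(demand_mpv: float, cap_mpv: float) -> tuple[float, float]:
    """Return (cost_in_dollars, refused_mpv) for one billing period."""
    served = min(demand_mpv, cap_mpv)          # never serve past the cap
    refused = max(0.0, demand_mpv - cap_mpv)   # everything above it is denied
    cost = max(0.0, served - FREE_MPV) * PRICE_PER_MPV
    return cost, refused


if __name__ == "__main__":
    for demand in (3, 8, 20, 50):
        cost, refused = bill(demand, cap_mpv=10)
        print(f"demand={demand} MPV cost=${cost:.2f} refused={refused} MPV")
```

With a 10 MPV cap, 20 MPV of demand costs $200 and 10 MPV is simply refused. The cap bounds your worst-case bill only by turning traffic away; raise the cap to avoid refusals and the worst-case bill rises with it.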
Let me explain why quotas suck as a way to run your business.
Here’s how App Engine’s quotas work. As you approach a limit (currently set by Google, but eventually set by you), Google starts denying requests. If you’re reaching the limit of a metered API call, Google raises an exception when your app makes that call, and your app can catch and handle it. That’s inelegant, but at least it’s something you can present to the user as a handled error. If you’re reaching a more fundamental limit, like bandwidth, Google instead starts answering page requests with the 403 HTTP status code. A 403 prevents your user from getting the page at all, and there’s no elegant way to handle it in App Engine (no custom error pages).
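A minimal sketch of the handled case, in Python (App Engine’s original runtime). `OverQuotaError` and `fetch_user_profile` are hypothetical stand-ins for a metered API call and the over-quota exception it raises, simulated here so the code runs anywhere. The app can catch the exception and render a page it controls; the bandwidth 403, by contrast, is sent before any of this code runs, so there is nothing equivalent to catch.

```python
class OverQuotaError(Exception):
    """Hypothetical stand-in for the exception a metered API raises when over quota."""


def fetch_user_profile(user_id: str, quota_left: int) -> dict:
    # Hypothetical metered API call: fails once its quota is exhausted.
    if quota_left <= 0:
        raise OverQuotaError("API quota exceeded")
    return {"id": user_id, "name": "example"}


def handle_request(user_id: str, quota_left: int) -> tuple[int, str]:
    """Return (http_status, body); the app handles the API failure itself."""
    try:
        profile = fetch_user_profile(user_id, quota_left)
    except OverQuotaError:
        # Inelegant, but the user still gets a page the app controls.
        return 503, "We're temporarily over capacity; please try again later."
    return 200, f"Hello, {profile['name']}"


if __name__ == "__main__":
    print(handle_request("u1", quota_left=5))
    print(handle_request("u1", quota_left=0))
```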
As you approach your quota, Google tries to budget your requests so that only some of them fail. If you get a traffic spike, it’ll drop some of those requests so that it still has quota left to serve traffic later. (Steve Jones’ SOA blog chronicles quite a bit of empirical testing, for those who want to see what this “throttling” looks like in practice.)
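Google hasn’t published the exact throttling algorithm, so the following is only a guess at its general shape: a pacing rule that spreads a daily quota evenly across the day, plus a small burst allowance. The class `PacedQuota` and its `burst` parameter are hypothetical.

```python
class PacedQuota:
    """Toy pacing throttle that spreads a daily quota evenly across the day.

    A request is admitted only if total usage stays within the share of
    the quota that has "accrued" so far, plus a fixed burst allowance.
    This is an illustrative guess, not Google's actual algorithm.
    """

    def __init__(self, daily_quota: int, day_seconds: int = 86_400, burst: int = 0):
        self.daily_quota = daily_quota
        self.day_seconds = day_seconds
        self.burst = burst
        self.used = 0

    def allow(self, t: float) -> bool:
        """Decide whether a request arriving t seconds into the day is served."""
        accrued = self.daily_quota * min(t, self.day_seconds) / self.day_seconds
        limit = min(self.daily_quota, accrued + self.burst)
        if self.used + 1 <= limit:
            self.used += 1
            return True
        return False  # dropped: the user sees a failure


if __name__ == "__main__":
    quota = PacedQuota(daily_quota=1000, burst=50)
    served = sum(quota.allow(3600) for _ in range(200))  # spike at 01:00
    print(f"served {served} of 200 spike requests")
```

Under these assumptions, a spike of 200 requests one hour into the day gets 91 served and 109 dropped, even though more than 90% of the day’s quota is still untouched.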
The problem is that you now have what are essentially random failures in your application. If API calls are failing, you have to handle the errors, and your users will probably retry, which exacerbates your quota problem and creates an application headache. (For instance, what if I have to make two database API calls to commit the data from an operation, and the first succeeds but the second fails? Now I have inconsistent data and, because API calls keep failing, quite possibly no way to fix it. Google’s Datastore transactions are restricted to operations within a single entity group, so transactions won’t solve every such problem.) Worse still, if you’re getting 403 errors, your site is functionally down, and your users see a mysterious error. If you run a business online, do you really want your site failing more or less at random whenever traffic is heavy?
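Here’s a toy version of the two-write problem. `FlakyStore` is a hypothetical in-memory stand-in for the Datastore whose writes start failing once a quota is spent, and the two records are assumed to sit in different entity groups, so a single transaction can’t cover both writes.

```python
class OverQuotaError(Exception):
    """Simulated over-quota failure from a metered datastore write."""


class FlakyStore:
    """Hypothetical in-memory store whose put() fails once its quota runs out."""

    def __init__(self, quota: int):
        self.quota = quota
        self.data: dict[str, int] = {}

    def put(self, key: str, value: int) -> None:
        if self.quota <= 0:
            raise OverQuotaError(key)
        self.quota -= 1
        self.data[key] = value


def transfer(store: FlakyStore, src: str, dst: str, amount: int) -> None:
    """Move `amount` from src to dst using two separate writes."""
    original = store.data[src]
    store.put(src, original - amount)             # write 1 succeeds
    try:
        store.put(dst, store.data[dst] + amount)  # write 2 may hit the quota
    except OverQuotaError:
        # Try to undo write 1. While the quota is still exhausted, this
        # compensating write fails too, leaving the records inconsistent.
        store.put(src, original)
        raise


if __name__ == "__main__":
    store = FlakyStore(quota=1)
    store.data.update({"alice": 100, "bob": 0})
    try:
        transfer(store, "alice", "bob", 10)
    except OverQuotaError:
        pass
    print(store.data)  # {'alice': 90, 'bob': 0}: 10 units have vanished
```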
Well, one might counter, if you don’t want that to happen, just set your quota limits really, really high: high enough that you never expect a request to fail. The problem is that if you do, you have no way to predict what your costs will actually be, or to throttle heavy traffic in a more reasonable way.
If you’re on traditional computing infrastructure, or, say, a cloud like Amazon EC2, you decide how many servers to provision. Under heavy traffic your site’s performance would probably degrade, but you would not get random failures, and you certainly would not get random failures outside the window of heavy traffic. With the quota system Google uses, you could get past the spike with enough quota left to serve traffic for most of the rest of the day, and still cross the over-quota random-drop threshold later that day. You’d have to micro-manage (temporarily adjusting your allowable quota after a traffic spike, say) or just accept a chance of failure. Either way, it’s a terrible way to operate.
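A toy comparison of the two operating models over one day, in hourly buckets. The function `simulate` and its parameters are hypothetical, and the model is deliberately crude: provisioned servers simply queue excess requests for later hours (real servers can also time out under overload), while the quota model serves requests until a fixed daily allowance runs out.

```python
def simulate(hourly_demand: list[int], hourly_capacity: int,
             daily_quota: int) -> tuple[list[int], list[int]]:
    """Return (backlog_per_hour, failures_per_hour) for one toy day.

    provisioned: requests beyond hourly_capacity wait in a backlog and
        are served in later hours (slower, but not dropped).
    quota: requests are served until the day's quota is spent; every
        request after that fails, however quiet the rest of the day is.
    """
    backlog = 0
    remaining = daily_quota
    delayed: list[int] = []
    failed: list[int] = []
    for demand in hourly_demand:
        # Provisioned servers: serve what capacity allows, queue the rest.
        backlog = max(0, backlog + demand - hourly_capacity)
        delayed.append(backlog)
        # Quota: serve until the daily allowance is gone, then fail.
        served = min(demand, remaining)
        remaining -= served
        failed.append(demand - served)
    return delayed, failed


if __name__ == "__main__":
    demand = [100] * 24
    demand[9] = 1000  # traffic spike at 09:00
    delayed, failed = simulate(demand, hourly_capacity=200, daily_quota=3000)
    print("peak backlog:", max(delayed))
    print("failed per hour:", failed)
```

With 100 requests an hour, a 1,000-request spike at hour 9, capacity for 200 requests an hour, and a 3,000-request daily quota, the provisioned model’s backlog peaks at 800 and clears by hour 17 with no failures, while the quota model survives the spike and then fails every request in hours 21 to 23.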
This is yet another example of how Google App Engine is not ready for prime time and won’t be in the near term, and, more broadly, of how Google continues to fail to understand the basic operational needs of people who run their businesses online. It’s not just risk-averse enterprises that can’t use something with this kind of problem; it’s start-ups, too. Amazon has set a very high bar for reliability and for understanding what you need to run a business online, and Google is devoting a lot of misdirected technical acumen to building something that misses the mark.