Yes, Virginia, there are single points of failure
by Jay Heiser | May 30, 2011
The Commonwealth of Virginia recently announced that it has settled with its service provider, Northrop Grumman, over an incident last year that apparently brought down 3/4 of state applications, resulted in the loss of several days' worth of driver's license photos, and forced state offices to open on weekends. Compensation to the state, payment for the audit, and upgrades to prevent future failures will cost Northrop Grumman just shy of $5,000,000.
A February 15 report from professional services firm Agilysys analyzes exactly what went wrong with the storage system to cause the loss of service and the data corruption. The report is based in part on interviews with Northrop Grumman and hardware supplier EMC, both of which seem to hem and haw over what amounts to “We put a lot of eggs into a single basket, and when we dropped the basket, a lot of eggs broke.”
A quote on p. 7 sounds like an auto-immune condition to me: “the failure to suspend SRDF before the maintenance event allowed the corrupted data to be replicated to the SWESC location, thus corrupting the disks that contained the copies of the Global Catalog in the SWESC location”. Yes, they should have turned off replication during this particular recovery effort (see p. 15), and it's likely that there would have been a lot less damage and a much faster recovery. Yes, a more granular level of snapshotting would have provided a more current recovery point. But there’s a bigger issue here.
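As an aside, the replication problem in that quote can be illustrated with a toy model. This is only a sketch under my own assumptions: the `MirroredVolume` class, the `run_maintenance` function, and the `"catalog"` block name are hypothetical, and none of it reflects how SRDF actually works or is operated. It shows just the general idea that a mirror faithfully copies bad writes unless replication is suspended first.

```python
class MirroredVolume:
    """Toy primary/replica pair (hypothetical; not a model of SRDF itself)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.replicating = True

    def write(self, block, data):
        # Every write lands on the primary; it is copied to the
        # replica only while replication is active.
        self.primary[block] = data
        if self.replicating:
            self.replica[block] = data


def run_maintenance(volume, suspend_first):
    """Simulate a maintenance event that corrupts a block on the primary."""
    if suspend_first:
        volume.replicating = False
    # Assumed failure: the maintenance writes garbage to the primary.
    volume.write("catalog", "CORRUPT")
    return volume.replica["catalog"]


def demo(suspend_first):
    volume = MirroredVolume()
    volume.write("catalog", "good")  # healthy data, mirrored to the replica
    return run_maintenance(volume, suspend_first)


if __name__ == "__main__":
    print("replication left on: ", demo(suspend_first=False))
    print("replication suspended:", demo(suspend_first=True))
```

With replication left on, the replica ends up holding the same corrupted block as the primary; with replication suspended first, the replica keeps the last good copy.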
Much of the discussion in the 41-page audit report surrounds the decision on the part of a single technician to replace board 0 before replacing board 1 (or was it the other way around?). Perhaps storage specialists take this for granted, but I take a different lesson from this arcane analysis of the repair sequence of an IT product that is meant to epitomize uptime. The majority of state systems were hanging by a single thread of fault tolerance, and a routine attempt to repair that thread resulted in it breaking. The break resulted not only in the loss of fault tolerance but also in the loss of data. It's not just a question of what should have been done to prevent the failure. The more significant question is whether too high a percentage of state systems are dependent upon a single Storage Area Network. It's a single point of failure. The fact that it was widely considered at the time to be a ‘one in a billion failure mode’ only reinforces my point that stuff happens, and that high concentrations of service and data mean that more stuff will go missing.
Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.