The Washington Post recently published an article discussing the difficulties the Commonwealth of Virginia has had with its IT systems: Crash of Va. computer network has implications for tech world, state politics. Apparently, something went wrong in the outsourced IT capability managed by Northrop Grumman under a $2.4B contract. The outage affected multiple agencies, including the DMV, where 74 offices were unable to renew driver’s licenses for days. Quoting: “In all, computers at 26 of the state’s 89 agencies were affected.”
What interests me from an EA perspective are two things.
First, I’m glad to see the dirty laundry aired with some level of transparency. Public sector enterprises “suffer” from this openness, but I think it’s a good trend, and one we should all embrace. These are real discussions we must have about IT (and EA) value, or the lack thereof. Too often, any kind of failure is immediately hushed up and festers as stories told at the water cooler, never put on paper or discussed openly. We need to learn from our failures.
Second, I’m amazed by the words the newspaper, and the people it quotes, actually use to describe the problem that occurred. In fact, there isn’t one description; there are many. Perhaps that should NOT surprise me, but it does. Here are examples of how the problem is described in this one article (I quote, with my own underlining to highlight the different terminology):
- “The data storage unit that failed in a warehouse outside of Richmond last week, wreaking havoc in the computer networks of a number of Virginia agencies for more than a week, is a ubiquitous bit of technology …”
- “The crash — still baffling to state officials — exposes the vulnerability of modern, massively complex interconnected computer networks”
- “the worst network failure since …”
- “computers have been down”
- “In all, computers at 26 of the state’s 89 agencies were affected.”
- “Chief Information Officer Sam Nixon said Thursday that the problem began Aug. 25 with the crash of a pair of three-year-old memory cards — one was supposed to back up another.”
- “That led to 485 of the state’s 4,800 data servers being knocked off-line.”
- “the storage units are fundamental building blocks used to hold the mass databases of information necessary to run complex organizations in a digital age”
- “A blistering legislative audit released in October found that the computer system had caused problems at almost every state agency that uses computers.”
What crashed: data storage unit, computer networks, networks, computers, memory cards, data servers, storage units, mass databases, computer systems?
Are these people all talking about the same thing? The same problem? The same effect and cause? Yes, but … it certainly doesn’t seem like that. For EA practitioners, let this be a lesson to us all: there are so many different views of things. In EA, we have to understand and speak to them all.
One consistent thread: what failed was some kind of technology, and the outcome was reduced service to citizens (the DMV and other examples are cited). This we can all agree on. There is a set of related things, each of which can be named as the cause of the failure; here it might be the memory card, storage unit, computer systems, mass databases, or computer networks, all in a dependency thread that grows larger as the dependencies pile up. This was apparently shared infrastructure (a SAN) on which many applications (and their computer systems) depended. This is a technical architecture thread (or pattern, if repeated in many actual solution threads). We’re left asking whether too many eggs depended on a single basket, and whether any single basket can ever be fully fault tolerant.
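To make the idea of a dependency thread concrete, here is a minimal sketch in Python. The layer names are my own illustration of the chain described in the article, not Virginia’s actual stack; the point is only that a failure anywhere in the chain cascades to everything downstream of it:

```python
# Hypothetical dependency thread, ordered from lowest layer to the
# citizen-facing service. Each item depends on the one before it.
DEPENDENCY_THREAD = [
    "memory card",
    "storage unit (SAN)",
    "data servers",
    "applications / computer systems",
    "business services (e.g., DMV license renewal)",
]

def cascade(failed: str) -> list[str]:
    """Return the failed component plus everything downstream of it."""
    idx = DEPENDENCY_THREAD.index(failed)
    return DEPENDENCY_THREAD[idx:]

if __name__ == "__main__":
    # A failure at the very bottom takes out the whole thread.
    for component in cascade("memory card"):
        print("impacted:", component)
```

Every quoted description in the article names a different point on this one thread, which is why they can all be “right” at once.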
Here’s my challenge to you: Can you define the dependency thread of failure in your enterprise?
If such a variously described “thing” failed in your environment, which eggs would break when the technology basket fails? Would you already know which 1) systems, 2) applications, 3) people, and, most importantly, 4) business services are impacted? This is where EA information could be leveraged: you would know which business services (via FEA’s BRM, or better the SRM, business architecture function or service models, etc.) depend on which other services, technologies, or implementations. You would be able to run a quick analysis or risk assessment showing the extent of an outage’s impact on the eggs in the basket (predicted or actually occurring).
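As a sketch of that kind of impact analysis, the following assumes a hypothetical EA-repository extract of “depends on” relationships (all names here are invented for illustration). Inverting the edges and walking upward from the failed component yields the business services at risk:

```python
from collections import deque

# Hypothetical EA-repository extract: "X depends on Y" edges.
DEPENDS_ON = {
    "dmv_license_renewal_service": ["dmv_app"],
    "tax_filing_service": ["tax_app"],
    "dmv_app": ["server_a"],
    "tax_app": ["server_b"],
    "server_a": ["san_01"],
    "server_b": ["san_01"],
}

BUSINESS_SERVICES = {"dmv_license_renewal_service", "tax_filing_service"}

def impacted_services(failed_component: str) -> set[str]:
    """Collect every business service that transitively depends on
    the failed component, via breadth-first search over inverted edges."""
    # Invert the edges: component -> things that depend on it.
    dependents: dict[str, list[str]] = {}
    for node, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)

    impacted: set[str] = set()
    queue = deque([failed_component])
    seen = {failed_component}
    while queue:
        node = queue.popleft()
        if node in BUSINESS_SERVICES:
            impacted.add(node)
        for parent in dependents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return impacted
```

With a populated EA repository, a query like `impacted_services("san_01")` is exactly the quick risk assessment described above: the shared SAN fails, and every dependent service surfaces immediately.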
Perhaps this case example suggests some opportunities for what EA can do for key stakeholders. Here, the EA failure would be not taking advantage of a crisis to show the value of the EA work.
I have no visibility into the Commonwealth of Virginia’s EA program (but I have met with people there over the years), so if anyone has more info on what happened in this case from an EA slant, please share. Or, just any other cases where EA information helped in managing a crisis — everyone would like to hear about some successes.