
You Cannot Run an Open Data Project Like an Open Source Project, Unless…

By Andrea Di Maio | March 10, 2010 | 6 Comments


Earlier today I dared to tweet my disagreement with a piece that Tim O’Reilly had flagged as a really important one. Tim tweeted back, so I guess I owe him a better explanation than what I could squeeze into a couple of tweets.

Let me start by saying that the piece I criticized – Truly Open Data by Nat Torkington – is a good one, as it makes a number of interesting points about how open source approaches might apply to data. Nat highlights very clearly the problem of data quality that will affect many datasets and may indeed have a negative impact on open government initiatives. His intuition, which sounds quite intriguing, is that one could apply the best thinking and practices from open source to open data.

However, there is a fundamental flaw in this line of thought. Open source projects gather a number of developers who collaborate on an equal footing to develop a product they are jointly responsible for, as a community.

Government does not have that luxury. An agency publishing crime statistics, weather forecasts or traffic information is ultimately accountable for what it publishes.

Indeed, collaboration can be built around that data, and several mechanisms used in open source projects can be effectively leveraged, such as mailing lists, bug trackers and ways to report problems and inaccuracies: in a nutshell, mechanisms that would help those in a government agency who are charged with the quality and accuracy of that data do a better job.

I would also argue that it is quite idealistic to think of open government data users having the same degree of tolerance that users of open source software would have. There is an expectation that government does the right thing and provides trusted and accurate datasets.

Where the author hits the nail on the head is where he says that

we need to change attitudes and social systems. Data is produced as the product of work done, and is rarely conceived of as having a life outside the original work that produced it. Some datasets will (some won’t–think of how many projects fail to interest anyone but the person who started them). This means thinking of yourself not just as the person who does the work, but the person who leads a project of interested outsiders and (in some cases) collaborators and who is building something that will last beyond their time

However, this is not the way open government works today. To realize this vision, governments need to overcome the asymmetry that I’ve highlighted several times (see here and here) and give the same dignity to citizen-collected data as to government-sanctioned data. Only then can we start thinking about open source communities around open data: some communities may be led by government, some by external stakeholders.

But all this raises a number of questions about who is accountable for what, and about the fine line between trust and truth, which I addressed in a recent post.

This is why, while I am intrigued by Nat’s proposal, I cannot buy it at this stage. If we want to pursue that path, governments need to open not only their data but also the processes they use to collect, qualify, manage and publish it. I am not sure they are ready yet.



  • Tim O'Reilly says:

    You don’t seem to understand open source projects very well here. Do you really think that Linus Torvalds isn’t accountable to his users for the quality of the Linux kernel?

    If government agencies were as responsible about the quality of their data as most open source developers are of their code, that would be a huge step up in quality.

    There may be good reasons that open source practices won’t work for open data, but the ones that you offer don’t meet even a minimal test for reasonableness.

    I do agree that governments need to improve the process by which they collect, qualify, manage and publish data. But that was exactly the point of Nat’s article.

    BTW, I do apologize for comparing you to Andrew Keen – which is why I deleted the tweet shortly after posting it. It was an intemperate comment. However, it was obviously correct enough that you loved the opportunity to grow traffic via controversy and retweeted it yourself. Oh well.

    I do however appreciate that you took the time to explain yourself. At least now we can see the argument.

  • Joe says:

    I would say the answer lies somewhere in between. There are clear differences between open government data and open source, but there are a slew of similarities. One of the big differences is that data doesn’t do anything, code does; this is to say, that the operational ability to modify data and “patch” it locally isn’t as important in data vs. code. But that doesn’t mean that users can’t “fork” open government data, especially if an agency, etc. is too stubborn to fix flaws in data sets themselves.

And, I suspect the biggest kernel in the comparison here is one that Andrea has written about a lot: participation. In monolithic projects like Linux, it can be very difficult to work one’s way from the periphery to a place where one’s contributions are taken seriously. The day when corrections and extensions to government data sets are embraced in a timely manner is perhaps very far off… in fact, there are likely barriers that will keep erroneous data elements around for longer than anyone would like (for example, cruft in data that must be reported raw and that cannot be updated other than by redoing specific statutory/regulatory-dictated measurements).

    Anyway, I have a neat little example of how open government access worked, in one trivial and isolated example, much like an open source project (a feedback loop of participation was closed and a correction was made). I can blog about that if you’re interested.

    And, guys, twitter is the perfect medium for pouring gasoline on flame wars—it has the impersonality of email *plus* large public followings *plus* 140 bytes per bite. You’re both at the vanguard of thinking here, so there are bound to be disagreements. No reason to not give each other the benefit of the doubt.

  • @Tim – Thanks for your comment and indeed I’m happy we have the ability to discuss on more-than-140 messages.

    First of all – and this is a point I should have made in the post – there is a fundamental difference between software and open government data. Data is a representation of a fact or event (or series thereof). Software is an invention.

    The open source development process – which I do happen to know relatively well – is very well suited to share and evolve inventions, but less to share and evolve “official” data.

The main point is that when GOVERNMENT data is published, there is a clear expectation that it is correct. Do you really believe that government agencies should throw DRAFT data out, assuming that it will be corrected by the community? One could think about using an open source process inside government, but only assuming that different agencies have a common interest in improving each other’s data, for which I frankly fail to see a business case. The original piece does not even make this distinction, but assumes that government data might be treated like software, which – in my humble opinion – shows an insufficient understanding of what accountability and compliance mean in a government context.
    As I said, one can think about an open source process only when there is a mutual understanding that government and the public collaborate to create data: this is not what current initiatives (including the OGD) suggest. If government agencies are rewarded for publishing “beta versions” of their data, then we can start thinking about an open process.

Incidentally, I did not find your tweet inappropriate at all. I am deleting the retweet, as I hadn’t realized you had deleted yours (TweetDeck does not help in that respect). You made me reflect on how my statements may be perceived, and gave me food for thought to explain the reasons for (as well as the boundaries of) my cynicism.

@Joe – Thanks for adding further elements to explain the difference between data and software. I do believe that Tim and I are not so far apart in our thinking about this. I think that the open source suggestion is intriguing but needs to be applied within reason. I’d love to see any of the large agencies adopt a process like this internally: actually, before data is stamped as “public”, it could be collaboratively improved by engaging specific constituencies (at the agency’s discretion). This is closer to what we at Gartner call “community source”, i.e. a gated community of individuals who have the credentials to collaborate on creating / improving data. The model could then be extended toward an open community once the statutory and regulatory barriers are overcome.

  • joe says:

Absolutely, community source (which I know from the SAKAI project) is a much more appropriate model (although much, much younger).

@Nat – Thanks for your indirect response to my post. While I will not comment on coherence or lucidity, I would argue that neither you nor Tim O’Reilly understands the principles of accountability. Data that is collected by government as a consequence of a statutory/regulatory requirement is something for which government is accountable: if an open source process can be used, it has to be driven and owned by that government agency, which is why I talked about community source (the agency selects the community it wants to work with). Once we agree that the process used to collect and validate data does not change accountability lines, then we can start diving into the mechanics of that process. It seems to me that the two of you still need to get closer to that understanding.