Gartner Blog Network

Tag Cloud

by Whit Andrews  |  October 29, 2008

Any number of fine books include artists toying with the idea that something they create comes to life. I suppose the oldest version of this that I can think of is the lovely Pygmalion, in which a sculptor finds that his ideal woman takes life. In my case, it’s the Hostile Information Ecosystem, something I first wrote about in 2006. I was reading Geeking With Greg (again) tonight, and I was delighted to discover academic research reflecting my own analytically dispassionate fears about the extreme vulnerability of tag systems out in the long tail. He’s written about such tag vulnerability before, and I missed it.

So, anyway, it’s all about me, right? I imagined this first! Well,

It’s hard to see what can be done about the ecosystem in general. The good news for people and sites that are heavily tagged is that it’s hard to foul up where they sit in the “standings” when one does a search that’s going to be influenced by tags. The bad news is that for the sites way out on the tail, it’s easy to botch up where they stand. What interests me (because here in my Home Office in Massachusetts I am rarely disturbed by actual knowledge) is that there are two kinds of highly likely attacks, the “piggyback attack” and the “overload attack.” I looked online (Clusty, Google) for other mentions of these attack types, but they appear to be fairly recently named, possibly by the researchers?
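To make the head-versus-tail point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the site names, the user names, the counts, and the idea of measuring damage as the share of a resource's tags that are misleading. It models a generic tag-injection attack, not necessarily either of the attacks the researchers name.

```python
from collections import Counter


def tag_share(assignments, resource, tag):
    """Return the fraction of a resource's tag assignments that use `tag`."""
    counts = Counter(t for (_user, r, t) in assignments if r == resource)
    total = sum(counts.values())
    return counts[tag] / total if total else 0.0


# Hypothetical data: (user, resource, tag) triples.
honest = [(f"user{i}", "popular-site", "recipes") for i in range(1000)]
honest += [(f"user{i}", "tail-site", "recipes") for i in range(3)]

# Hypothetical attack: 20 fake accounts apply a misleading tag to both sites.
attack = [
    (f"sock{i}", site, "spam")
    for i in range(20)
    for site in ("popular-site", "tail-site")
]

poisoned = honest + attack
popular_share = tag_share(poisoned, "popular-site", "spam")
tail_share = tag_share(poisoned, "tail-site", "spam")

print(f"popular-site: {popular_share:.1%} of its tags are misleading")
print(f"tail-site:    {tail_share:.1%} of its tags are misleading")
```

The same 20 fake accounts barely register against a thousand honest taggers, yet they outnumber the three honest taggers on the tail site several times over.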

In any event, that’s just the abstract. The researchers delve, in a lovely matrix (and by lovely, I mean, of course, “deeply paranoid”), into a variety of different kinds of attacks. They see the problem as roughly four-dimensional (those of you who actually know math can delete “4D” and substitute “four-tuple,” and you can do it with my blessing, if not my aid). They see three things in the tagging ecosystem, accurately: the resource, the tag, and the user. The fourth is the relationship created by linking these things. Now, this points to a problem I recently pointed out with the New York Times’ own site, which is that this presumes no computable value judgment by the user. In other words, a tag is a tag is a tag. You can’t tag something as “NOT A GOOD RESOURCE ON” or “ONLY A GOOD RESOURCE ON TUESDAYS” or whatever. That’s problematic when you have that little memo field on something like, which lets you say things like “This recipe is the worst I have ever used: do not use it, or if you do, God help you.”
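The four-part view described above can be sketched as a small data structure. This is my own illustration, not the researchers' notation: the class names `TagAssignment` and `Folksonomy`, the method names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TagAssignment:
    """One link between a user, a resource, and a tag."""
    user: str
    resource: str
    tag: str
    # Note what is absent: no field records whether the user meant
    # the tag as praise, as a warning, or as "only on Tuesdays".


@dataclass
class Folksonomy:
    """Users, tags, resources, and the assignments linking them."""
    users: set = field(default_factory=set)
    tags: set = field(default_factory=set)
    resources: set = field(default_factory=set)
    assignments: set = field(default_factory=set)

    def add(self, user, resource, tag):
        self.users.add(user)
        self.resources.add(resource)
        self.tags.add(tag)
        self.assignments.add(TagAssignment(user, resource, tag))

    def count(self, resource, tag):
        """Count how many users linked `tag` to `resource`."""
        return sum(
            1 for a in self.assignments
            if a.resource == resource and a.tag == tag
        )


site = Folksonomy()
site.add("fan", "recipe-page", "recipes")     # loved the recipe
site.add("critic", "recipe-page", "recipes")  # hated the recipe
votes = site.count("recipe-page", "recipes")
print(votes)
```

Both the fan and the critic count as one vote each for "recipes," because the model has nowhere to store the difference. Any opinion lives only in a free-text memo that the counting never reads.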

I am not saying that the academics have fouled up here — I am saying this is hard.

In any event, my point is that in these different areas, the academics have identified that one can navigate in ways that are related to the user, the tag, or the resource, and that the target elements can also be classified that way, as users, tags or resources. Here’s what I like (and you’ll need to read Greg’s blog entry and the research, which is a tidy 14 pages and which I read in 30 minutes, to the academics’ eternal credit): there are at least SEVEN (7) POSSIBLE ATTACKS.

Now, that, pumpkins, is a scary Halloween fact. Can I get a witness?


Whit Andrews
VP Distinguished Analyst
14 years at Gartner
18 years IT industry

Whit Andrews is a vice president and distinguished analyst in Gartner Research. Andrews covers enterprise search and enterprise video content management.

Thoughts on Tag Cloud

  1. Whit, as you know, I’m a big fan of pointing out the damage done by the adversarial nature of “objective” relevance functions, which is just the academic way of talking about the hostile ecosystem. What we really need–on the web as much as in the enterprise–is transparency and user control.

    These principles also apply to tagging. Take away the anonymity or quasi-anonymity of tagging, and give users control over which tags affect the user experience. Given that transparency and control, I’ll only consider the tags of users I trust to have non-spammy motives and ideally good judgment / taste. We’ll get there, even if it’s only when the current model collapses under its own weight.

  2. Dan Sholler says:

    I do need to point out that this problem is not just with tags, but with the entire notion of the web and hypermedia. After all, a tag is just an abstraction for a list of links (all the things that have that tag). This is even how the resources are represented if you get them from del.icio.us or whoever. The real issue is that links have no machine-readable distinctions. In fact, much of the business of the internet (search optimization in particular) consists of heuristic attempts to create those kinds of distinctions.

    The Semantic Web supposedly gives us a mechanism for attaching metadata to those links, but that only gives the mechanism, and the models still need to be created (possibly by using the same techniques we do today). In the end, all of these techniques are fundamentally statistical, and we all know that statistics lie when the data set is too small. This is what happens at the long tail, and nothing anyone does can change that.

    Mr. Tunkelang above suggests that we abandon the statistical techniques for what is essentially a deterministic one, in which each individual controls the level of influence of the various inputs that could be used. This has some appeal as a means of allowing user input to improve the results, but it still relies on the underlying statistical techniques (unless the user is planning to spend a lot of time choosing which of the elements on the list for the del.icio.us tag “programming” are relevant to his or her current search…). Since the information volumes are so great, these kinds of manual approaches usually cannot work, or they restrict the results to only those items that have been manually vetted.

    So basically, we end up with two choices: either we use a statistical technique that is pretty much assured of screwing up in the long tail, or we use a manual technique which is assured of being very limited in coverage.

    Sorry Whit, this one is a problem that IMHO will not go away.

  3. Whit Andrews says:

    This absolutely is a problem that will not go away; I could not agree more. I think that Dan Tunkelang’s point is well-taken, and I am also a believer in transparency as a critical element of future relevancy calculations, especially those in which the intention is to exploit the apparently naive but possibly malicious or simply calculated behaviors of passionate users. However, as we know, most people decline to invest the effort in understanding a system such that the transparency may be allowed to work effectively. So, as a result, the default settings are allowed to dominate, no matter what. Now, what we have here is the ability to alter the default settings at a very high level, if one can make this work.

    Dan Sholler, on the other hand, is of course also right. Tagging, however, is even more vulnerable than other kinds of link manipulation. Google bombing requires technical sophistication (you don’t think so? explain it to your friend who is not interested in how it works). Piggyback attacks require some sophistication, but overload attacks would be trivially easy for someone with enough interested, or compensated, minions.

    My intention here is to point out that any roomful of perfect new solutions for relevancy is always vulnerable insofar as those solutions benefit from publicly available tweaking.
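The suggestion in the first comment, counting only the tags of users one trusts, can be sketched as follows. This is one possible reading of that idea, not a description of any real tagging service: the function name `tag_counts`, the `trusted` parameter, and all the sample users and tags are hypothetical.

```python
from collections import Counter


def tag_counts(assignments, resource, trusted=None):
    """Count tags on `resource`; if `trusted` is given, ignore everyone else."""
    counts = Counter()
    for user, res, tag in assignments:
        if res != resource:
            continue
        if trusted is not None and user not in trusted:
            continue
        counts[tag] += 1
    return counts


# Hypothetical data: two honest taggers and ten compensated minions.
assignments = [
    ("alice", "tail-site", "recipes"),
    ("bob", "tail-site", "recipes"),
]
assignments += [(f"sock{i}", "tail-site", "spam") for i in range(10)]

default_view = tag_counts(assignments, "tail-site")
my_view = tag_counts(assignments, "tail-site", trusted={"alice", "bob"})

print(default_view.most_common(1))
print(my_view.most_common(1))
```

The trade-off raised in the second comment shows up here too: the filtered view is clean, but it can only describe resources that a trusted user happened to tag. And as the third comment notes, anyone who never sets `trusted` gets the default view.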

