
AAA is Not Enough Security in the Big Data Era

By Merv Adrian | January 13, 2014 | 7 Comments

Tags: Security, Industry Trends, Gartner, DBMS, Data Warehouse, Data Integration, Apache HDFS, Apache Hadoop, Data and Analytics Strategies

Talk to security folks, especially network ones, and AAA will likely come up. It stands for authentication, authorization and accounting (sometimes audit). There are even protocols such as RADIUS (Remote Authentication Dial In User Service, much evolved from its first uses) and Diameter, its significantly expanded (and punnily named) newer cousin, implemented in commercial and open-source versions and built into network and storage hardware. AAA is and will remain a key foundation of security in the big data era, but as a longtime information management person, I believe it’s time to acknowledge that it’s not enough, and we need a new A – anonymization.

I realize I’m speaking out of turn here. I’m not a security guy myself, and I don’t pretend to be deep in the disciplines that decide whether you are who you claim to be (authentication) and govern whether you can get to the network. Nor do I know the detailed nuances, spread across many different resources, that grant me permission to do what I will be allowed to do with those resources when I get there (authorization). I don’t understand the various protections that assure breaches do not/have not occurred, which depend on audit capabilities (the latter, as accounting, also provides the mechanism to report on all of the above).

What I do spend some time on is what happens within the resource that holds the data, when an authenticated, authorized person who is appropriately audited gets to it. For example, we need to distinguish what DBAs can see from what an analyst can – financial types call that “separation of duties” – and it’s typically managed by a DBMS, which has mechanisms to interact with authorization capabilities to implement policy. That control can be coarse- or quite fine-grained, and it’s one of the reasons we analysts always like to remind people that we talk about database management systems, not just databases.

But here’s the problem: in the big data era, much of the data we work with is not in DBMSs – and more and more of it will not be, as file-based systems like Hadoop gain broader use. File systems don’t provide that granular control, so intervening layers will be required. They too can be coarse – we can encrypt/decrypt everything, for example. Or they can be fine-grained, offering selective, policy-based decryption – in memory, after the bits come off the disk, before handing the data to the requester.
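To make the idea of such an intervening layer concrete, here is a minimal sketch in Python. The policy table, field names and the pass-through `decrypt` stand-in are all hypothetical – a real product would wire this into key management and the cluster’s authorizer – but it shows the shape of selective, policy-based decryption: everything stays encrypted at rest, and a field is decrypted in memory only when the caller’s role permits it.

```python
from typing import Dict

# Hypothetical policy: role -> set of fields that role may see in the clear.
POLICY: Dict[str, set] = {
    "analyst": {"zipcode", "purchase_total"},
    "dba": set(),  # DBAs manage storage but see no clear-text fields
}

def read_record(encrypted: Dict[str, bytes], role: str,
                decrypt=lambda v: v.decode()) -> Dict[str, str]:
    """Return a record with only policy-permitted fields decrypted.

    `decrypt` is a pass-through stand-in; real decryption and key
    management are assumed to live in a separate service.
    """
    allowed = POLICY.get(role, set())
    return {
        field: decrypt(value) if field in allowed else "<encrypted>"
        for field, value in encrypted.items()
    }
```

Used this way, an analyst reading a record sees the permitted fields in the clear while everything else – including the name – remains opaque, and a DBA sees nothing but ciphertext markers.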

Personally, I hope people who model disease vectors, or even purchase behaviors, can build effective predictive models that describe what happens to people with certain characteristics. I just don’t want that process to result in my name being “on their list.” If they can intuit and classify what I am by my behavior and assign me to a category in some separate process, that is a different issue.

One approach that matters a great deal is obfuscation, which replaces a field like a name or SSN with valid characters, but not the original data. Its value is that, if properly implemented, it maintains mathematical cohesion and permits statistical analysis, aggregation, model building, etc., to proceed without individuation of the records they are performed over – the core privacy concern. Redaction – the familiar “blacking out” of content – is also used, but in some policy scenarios, being able to peer into the redacted data might subsequently be of value, and redaction typically doesn’t permit this.
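A toy illustration of the obfuscation idea, assuming a keyed-hash approach (not any particular vendor’s implementation, and simpler than true format-preserving encryption): each digit of an SSN is replaced via an HMAC of the whole value, so the output still looks like a valid SSN, identical inputs map to identical outputs (so joins, counts and aggregations still work), and the original cannot be recovered without the key.

```python
import hashlib
import hmac

# Assumption: in practice this key would come from a key-management service.
SECRET_KEY = b"example-key-managed-elsewhere"

def pseudonymize_ssn(ssn: str) -> str:
    """Replace each digit deterministically, preserving the SSN's format."""
    digits = "".join(c for c in ssn if c.isdigit())
    mac = hmac.new(SECRET_KEY, digits.encode(), hashlib.sha256).digest()
    out, i = [], 0
    for c in ssn:
        if c.isdigit():
            out.append(str(mac[i] % 10))  # substitute digit, keep position
            i += 1
        else:
            out.append(c)  # preserve separators like '-'
    return "".join(out)
```

Because the mapping is deterministic, two records carrying the same SSN still group together in analysis – the “mathematical cohesion” above – even though no analyst ever sees the real number.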

Both approaches, however, can be classified as anonymization (or data deidentification, but I prefer to add another A to AAA for consistency!), and in an era where big data will increasingly be used to track human behaviors for medical, commercial and security reasons, I believe it’s time for anonymization to join the other three As. Perhaps it’s time to talk of authentication, authorization, anonymization and audit as the true foundation for data security.

Thanks to my esteemed Gartner colleague Neil MacDonald for commenting on and improving this post.

Comments


  • Dude says:


An acronym used within the obfuscation method is FPE, or Format Preserving Encryption. Eschewing any specific implementation or service, I direct you to Wikipedia –

    The results of this type of effort yield the statistical data options you list.


    • Merv Adrian says:

Thanks, dude. (Always wanted to say that here.) I wasn’t trying to get detailed about protocols; that was just a way to launch the discussion. But much appreciated.

  • I think you’re spot on here, Merv. We’re seeing almost every customer ask about or have expectations for security along these lines as they are considering Hadoop. Without it, they simply can’t do much with Hadoop other than storage and processing. For those looking to do analytics, security is an absolute must. This is why we created Sentry, and have a strong partner ecosystem for data encryption, to ensure our clients and prospects can get the most out of their deployment. Hadoop with security has very quickly become a must-have rather than a nice-to-have.

    • Merv Adrian says:

      Thanks, Clarke – the work you and your competitors are doing – and work from several firms partnering with you and them – is certainly making it possible for people to get a handle on this. We’ll be covering more about it in upcoming research.

  • Dan Graham says:

    In my days working at a security start up, the 3rd A was always audit. When there’s a breach, there is always a pogrom to find and persecute the offenders as well as the security guys who didn’t have enough security. 95% of the time, the security staff will tell you they had no money to do a better job or directives to not do a better job.

    One thing I learned in the startup was a simple saying: “Security is like a chocolate cake. Do you bake it into the batter or spread chocolate icing on top?” What this means is that once the bad guy penetrates the secure perimeter (the icing), security is gone; it’s a free-for-all. If the security is baked in at the beginning and permeates everything, well, you are as secure as you can be with that technology.

    With so many repositories of data, it’s increasingly hard to obfuscate and bake the security in at the foundation. I think the Logical Data Warehouse needs to address this, but culturally I doubt most data centers will do what’s right anyway. So maybe there is a case for keeping the most secure data in the data warehouse and controlling the BI tools and federation tools. This can be done, whereas protecting and obfuscating every file in the data center is unlikely.

    Anyway — when you tell the chocolate cake story, give me a plug once in a while.

    BTW — love the CDs I got for Xmas.

    • Merv Adrian says:

      Thanks, Dan. The notion that some repositories will be trustworthy and some not is at the heart of the question. Can Hadoop be ready if we don’t protect it? Will it get the attention and resources within the enterprise to make it trustworthy if it requires an entirely separate mechanism there from the ones we use inside DBMSs?
      The emerging players like Dataguise, Gazzang, Protegrity and Vormetric, and existing ones like IBM, offer an intervening layer where policy can be centralized for multiple data stores. That may be where we need to go….
      Delighted the CD suggestions helped!

  • Merv,

    Great article. I would add on to Dan’s comment and state that security within an organization should be viewed as a layered cake. There are multiple levels of security that enterprises should pursue, from perimeter security, to securing the internal network, to application-level security, and finally security at the data storage and OS levels. A layer in itself may not be foolproof, but together they present a more comprehensive strategy.

    Data masking, or “anonymizing,” is thus another layer within data storage security. There is indeed a strong use case in big data. Given the volume and velocity of data, enterprises do not want to create multiple copies of the same data for different teams and keep maintaining them. They would rather have everyone access the same repository with clear segregation of duties, fine-grained access control, advanced rules to enable dynamic policies (policies based on attributes such as purpose, place, etc.), as well as encryption and masking. To control the environment, they need a strong auditing mechanism that can support forensic-like capabilities.

    Solving the security challenge within Hadoop and big data is a core part of our work at our security startup, XA Secure. We have embraced the exact principles that you laid out in your article. I would welcome you all to test-drive our solution and check the use cases we have tried to solve.

    Balaji Ganesan
    XA Secure