Talk to security folks, especially network ones, and AAA will likely come up. It stands for authentication, authorization and accounting (sometimes audit). There are even protocols such as Radius (Remote Authentication Dial In User Service, much evolved from its first uses) and Diameter, its significantly expanded (and punnily named) newer cousin, implemented in commercial and open source versions, included in hardware for networks and storage. AAA is and will remain a key foundation of security in the big data era, but as a longtime information management person, I believe it’s time to acknowledge that it’s not enough, and we need a new A – anonymization.
I realize I’m speaking out of turn here. I’m not a security guy myself, and I don’t pretend to be deep in the disciplines that decide whether you are who claim to be (authentication) and govern whether you can get to the network. Nor do I know the detailed nuances, spread across many different resources, that grant me permission to do what I will be allowed to do with those resources when I get there (authorization.) I don’t understand the various protections that assure breaches do not/have not occurred, which depend on the audit capabilities (the latter, as accounting, also provides the mechanism to report on all of the above.
What I do spend some time on is what happens within the resource that holds the data, when an authenticated, authorized person who is appropriately audited gets to it. For example, we need to distinguish what DBAs can see from what an analyst can – financial types call that “separation of concerns,” and it’s typically managed by a DBMS, which has mechanisms to interact with authorization capabilities to implement policy. It can be coarse- or quite fine-grained, and it’s one of the reasons we analysts always like to remind people that we talk about database management systems, not just databases.
But here’s the problem: in the big data era, much of the data we work with is not in DBMSs – and more and more of it will not be, as file-based systems like Hadoop gain broader and broader use. File systems don’t provide that granular control, so intervening layers will be required. They too can be coarse – we can encrypt/decrypt everything, for example. Or they too can be fine-grained, offering selective, policy-based decryption – in memory, after the bits come off the disk, before handing to the requester.
Personally, I hope people who model disease vectors, or even purchase behaviors, can build effective predictive models that describe what happens to people with certain characteristics. I just don’t want that process to result in my name being “on their list.” If they can intuit and classify what I am by my behavior and assign me to a category in some separate process, that is a different issue.
One approach that matters a great deal is obfuscation, which replaces a field like name or SSN with valid characters, but not the original data. Its value is that if properly implemented, it maintains mathematical cohesion and permits statistical analysis, aggregation, model building, etc to proceed without individuation of the records they are performed over. This is a privacy concern. Redaction – the familiar “blacking out” of content, is also used – but in some policy scenarios, being able to peer into the redacted data might subsequently be of value, and redaction typically doesn’t permit this.
Both approaches, however, can be classified as anonymization (or data deidentification, but I prefer to add another A to AAA for consistency!), and in an era where big data will increasingly be used to track human behaviors for medical, commercial and security reasons, I believe it’s time for anonymization to join the other 3 As. Perhaps it’s time to talk of authentication, authorization, anonymization and audit as the true foundation for data security.
Thanks to my esteemed Gartner colleague Neil MacDonald for commenting on and improving this post.
Read Complimentary Relevant Research
Organizing for Big Data Through Better Process and Governance
With big data past the Peak of Inflated Expectations on the Hype Cycle, organizations are addressing next-level challenges and asking,...
View Relevant Webinars
What Big Data Means Today and How to Position Effectively
Gartner's original prediction that the term "Big Data" would become meaningless by 2020 was actually a bit off its largely useless already...
Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.