If any company must face issues around data security, a credit card payment processor is a likely candidate. Square’s card readers and applications likely process millions of payments per day. One output of this processing is a lot of data. According to a presentation delivered at Facebook, Square stores a substantial amount of this data in Hadoop, conducting data analysis and offering data products to buyers and merchants.
Square does this without storing sensitive information in Hadoop. Redacted data meets about 80% of their use cases. Meeting the remaining 20% of use cases required some fairly radical engineering.
If you’re not familiar with Hadoop, it’s a distributed filesystem and data processing mechanism. You can put anything you want in those files. Square stores its data as Protocol Buffers, or “protobufs”. Protobufs is a way to serialize data. (Trying to stay as non-technical as possible in this post and already failing miserably. It’s a data format.)
Hadoop’s security model applies access control at the file level. This isn’t unreasonable since all it knows about are files. File contents are a mystery. However, file-level security is a blunt instrument in a modern data architecture. Addressing this security impedance mismatch led Square to look at a few options:
- Create a separate, secure Hadoop cluster for sensitive data. This was rejected because it introduces more operational overhead. It also wasn’t clear how to separate sensitive data. And everyone will want access to the secure cluster. Maybe the answers is to…
- Open up access to everyone and track access with audit logs. Yep, this is a bad idea. Nobody looks at logs.
So Square thought about what they needed to support the business. That resulted in the following requirements:
- Control access to individual fields within each record.
- Extend the security framework to the ingest method (Apache Kafka), not just Hadoop.
- Developer transparency.
- Support algorithmic agility and key management.
- Don’t store cryptographic keys with the data.
None of these requirements are unreasonable. All of them should be considered table stakes for an information management application. However, Square had to implement their own encryption model to meet these modest requirements. It did this by modifying the protobufs code to use an existing online crypto service. Key management overhead (always difficult) was challenging when facing growing data volumes and required a few iterations to get right.
Square didn’t disclose the number of hours spent on developing these capabilities. I think it’s fair to assume the investment was substantial.
Maybe the level of effort required is why securing your data in Hadoop isn’t a primary inhibitor to adoption. The lack of data-level security is keeping relevant projects off of Hadoop.
Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.