What do you do when you need to ensure data can be stored, shared and retrieved not just for the next five years, but for the next one hundred? I recently stumbled across a system called iRODS (Integrated Rule-Oriented Data System) that’s currently in use at multiple government and research organizations to solve this long-term data management problem. iRODS, open sourced under the BSD license, offers four core capabilities:
- Storage virtualization, typically through a storage gateway, although JBOD configurations are also supported.
- Data discovery using standard and user-defined metadata. Future releases will support discovery on data content via Elasticsearch.
- Workflow automation through a rules engine microservice supporting event-triggered process automation.
- Secure collaboration via a data federation capability.
Two components make up the system. The data itself is stored on heterogeneous storage systems behind the iRODS server, while standard and user-defined metadata is stored in an RDBMS (the “iCAT-enabled resource server”). The iCAT resource server gets its DR/HA capabilities from the underlying RDBMS. (Currently, MySQL, PostgreSQL and Oracle are supported; NoSQL DBMSs are on the roadmap.)
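To make the split between data and metadata concrete, here is a minimal sketch of a catalog that maps logical paths to physical replicas and supports metadata-based discovery. It is an illustration only: the table layout, the function names (`build_catalog`, `register`, `find_by_metadata`, `locate`) and the use of SQLite as a stand-in RDBMS are my assumptions, not the actual iCAT schema or any iRODS API.

```python
import sqlite3


def build_catalog():
    """Create an in-memory stand-in for the metadata catalog (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE data_object (id INTEGER PRIMARY KEY, logical_path TEXT UNIQUE);
        CREATE TABLE replica (object_id INTEGER, resource TEXT, physical_path TEXT);
        CREATE TABLE metadata (object_id INTEGER, attribute TEXT, value TEXT);
    """)
    return db


def register(db, logical_path, replicas, metadata):
    """Record one logical object, its physical replicas and its metadata pairs."""
    cur = db.execute(
        "INSERT INTO data_object (logical_path) VALUES (?)", (logical_path,)
    )
    object_id = cur.lastrowid
    db.executemany(
        "INSERT INTO replica VALUES (?, ?, ?)",
        [(object_id, resource, path) for resource, path in replicas],
    )
    db.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?)",
        [(object_id, attr, value) for attr, value in metadata.items()],
    )
    return object_id


def find_by_metadata(db, attribute, value):
    """Discovery: return logical paths whose metadata matches attribute=value."""
    rows = db.execute(
        """
        SELECT o.logical_path FROM data_object o
        JOIN metadata m ON m.object_id = o.id
        WHERE m.attribute = ? AND m.value = ?
        ORDER BY o.logical_path
        """,
        (attribute, value),
    )
    return [row[0] for row in rows]


def locate(db, logical_path):
    """Resolve a logical path to its (resource, physical_path) replicas."""
    rows = db.execute(
        """
        SELECT r.resource, r.physical_path FROM replica r
        JOIN data_object o ON o.id = r.object_id
        WHERE o.logical_path = ?
        ORDER BY r.resource
        """,
        (logical_path,),
    )
    return rows.fetchall()
```

The point of the sketch is that users only ever deal with the logical path and the metadata; where the bytes physically live is a lookup in the catalog, which is also why the catalog's availability depends on the database underneath it.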
iRODS isn’t an implementation option for a data warehouse or data lake. It solves a different set of problems related to geographic distribution and sharing of massive data sets. For example, the Academy of Motion Picture Arts and Sciences (AMPAS) uses iRODS as a core piece of CineGrid Exchange, a globally distributed repository storing motion pictures at HD, 2K and 4K resolutions, digital still images and digital audio in various formats. iRODS organizes the distributed data into a shareable collection, allowing the user to view files stored at multiple locations as a single logical collection.
The Wellcome Trust Sanger Institute uses iRODS to manage next-generation genomic sequencing data:
- Metadata tracks origin and processing history.
- Workflow automation handles data replication, checksumming and policy-based data resource selection.
- Secure collaboration spans workgroups, with support for isolating maintenance operations and defining policy on a per-group basis.
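The workflow automation described above can be pictured as rules that fire when a file lands in the system. The following is a rough sketch in plain Python, not the iRODS rule language or any of its microservices: the `RuleEngine` class, the `post_put` event name, the per-group site policy and all resource names are hypothetical and exist only to illustrate event-triggered checksumming, replication and policy-based resource selection.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    site: str
    store: dict = field(default_factory=dict)  # logical path -> bytes


@dataclass
class Catalog:
    checksums: dict = field(default_factory=dict)  # logical path -> sha256 hex
    replicas: dict = field(default_factory=dict)   # logical path -> resource names


class RuleEngine:
    """Tiny event dispatcher: rules register for an event and run when it fires."""

    def __init__(self):
        self._rules = {}

    def on(self, event):
        def decorator(fn):
            self._rules.setdefault(event, []).append(fn)
            return fn
        return decorator

    def fire(self, event, **ctx):
        for rule in self._rules.get(event, []):
            rule(**ctx)


def select_resource(resources, policy, group):
    """Policy-based selection: each group is pinned to a site, with a default."""
    site = policy.get(group, policy["default"])
    return next(r for r in resources if r.site == site)


def make_engine(resources, catalog):
    engine = RuleEngine()

    @engine.on("post_put")
    def record_checksum(path, data, source):
        # Store a checksum at ingest so later integrity checks have a baseline.
        catalog.checksums[path] = hashlib.sha256(data).hexdigest()

    @engine.on("post_put")
    def replicate_offsite(path, data, source):
        # Copy the object to the first resource at a different site.
        for resource in resources:
            if resource.site != source.site:
                resource.store[path] = data
                catalog.replicas.setdefault(path, []).append(resource.name)
                break

    return engine


def put(engine, resources, catalog, policy, group, path, data):
    """Ingest an object, then let the registered rules react to the event."""
    target = select_resource(resources, policy, group)
    target.store[path] = data
    catalog.replicas.setdefault(path, []).append(target.name)
    engine.fire("post_put", path=path, data=data, source=target)
    return target


def verify(resource, catalog, path):
    """Recompute the checksum of one replica and compare it with the catalog."""
    return hashlib.sha256(resource.store[path]).hexdigest() == catalog.checksums[path]
```

The design choice worth noticing is that the person uploading the file does nothing beyond the upload itself; replication and checksumming are policy attached to the event, so they are applied consistently regardless of who ingests the data.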
Several other bioinformatics centers, such as the Beijing Genomics Institute, The Broad Institute and Uppsala Genome Center, use iRODS in similar use cases. iRODS has also found uses in astronomy, biology, engineering, and numerous other disciplines.
Today, no vendors offer a commercial product based on iRODS. Support for the project is offered through the iRODS Partner Program.
Executive overview: http://irods.org/wp-content/uploads/2014/09/iRODS-Executive-Overview-August-2014.pdf
Technical overview: http://irods.org/wp-content/uploads/2012/04/iRODS-Overview-November-2014.pdf
Correction: Updated the text about support through the partner program.
Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.