Gartner Blog Network


Data Management for 100 Year Lifecycles

by Nick Heudecker  |  January 25, 2016

What do you do when you need to ensure data can be stored, shared and retrieved not just for the next five years, but for the next one hundred? I recently stumbled across a system called iRODS (integrated rule-oriented data system) that’s currently in use at multiple government and research organizations to solve this long-term data management problem. iRODS, open sourced under the BSD license, offers four main features:

  • Storage virtualization, typically through a storage gateway, but JBOD configurations are also supported.
  • Data discovery using standard and user-defined metadata. Future releases will support discovery on data content via Elasticsearch.
  • Workflow automation through a rules engine microservice supporting event-triggered process automation.
  • Secure collaboration via a data federation capability.

Architecturally, the system has two components. Data is stored in heterogeneous storage systems behind the iRODS server, while standard and user-defined metadata is stored in an RDBMS (the “iCAT-enabled resource server”). The iCAT resource server gets its disaster recovery and high availability (DR/HA) capabilities from the underlying RDBMS. (MySQL, PostgreSQL and Oracle are currently supported; NoSQL DBMSs are on the roadmap.)
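To make the split between stored data and cataloged metadata concrete, here is a minimal sketch of a metadata catalog. It is a toy model, not the iCAT schema or any iRODS API: the table names, the `register` and `find` helpers, and the use of SQLite as a stand-in RDBMS are all assumptions made for illustration.

```python
import sqlite3


def build_catalog():
    """Create an in-memory catalog. The schema is hypothetical, not iCAT's."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE data_object (
            id            INTEGER PRIMARY KEY,
            logical_path  TEXT UNIQUE NOT NULL,  -- name users see
            resource      TEXT NOT NULL,         -- which storage system holds it
            physical_path TEXT NOT NULL          -- where it lives on that system
        );
        CREATE TABLE metadata (
            object_id INTEGER NOT NULL REFERENCES data_object(id),
            attribute TEXT NOT NULL,
            value     TEXT NOT NULL
        );
    """)
    return conn


def register(conn, logical_path, resource, physical_path, **attrs):
    """Record where an object is stored and attach user-defined metadata."""
    cur = conn.execute(
        "INSERT INTO data_object (logical_path, resource, physical_path) "
        "VALUES (?, ?, ?)",
        (logical_path, resource, physical_path),
    )
    object_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO metadata (object_id, attribute, value) VALUES (?, ?, ?)",
        [(object_id, key, str(val)) for key, val in attrs.items()],
    )
    return object_id


def find(conn, attribute, value):
    """Discover objects by metadata without touching the storage systems."""
    rows = conn.execute(
        "SELECT d.logical_path, d.resource FROM data_object d "
        "JOIN metadata m ON m.object_id = d.id "
        "WHERE m.attribute = ? AND m.value = ? "
        "ORDER BY d.logical_path",
        (attribute, value),
    )
    return rows.fetchall()
```

The point of the design is that discovery queries hit only the catalog; the bytes stay wherever the storage layer put them, and the catalog inherits backup and failover from whatever database hosts it.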

iRODS isn’t an implementation option for a data warehouse or data lake. It solves a different set of problems related to geographic distribution and sharing of massive data sets. For example, the Academy of Motion Picture Arts and Sciences (AMPAS) uses iRODS as a core piece of CineGrid Exchange, a globally distributed repository storing motion pictures at HD, 2K and 4K resolutions, digital still images and digital audio in various formats. It organizes distributed data into a shareable collection, allowing the user to view files stored at multiple locations as a single logical collection.
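The idea of a single logical collection over many locations can be sketched in a few lines. This is a toy model under stated assumptions, not how iRODS or CineGrid Exchange is implemented: the `LogicalCollection` class, its methods, and the site names in the usage note are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalCollection:
    """One namespace over files that physically live at different sites."""
    name: str
    # logical name -> list of (site, physical path) replicas
    _replicas: dict = field(default_factory=dict)

    def add_replica(self, logical_name, site, physical_path):
        """Register a physical copy of a logical file at a given site."""
        self._replicas.setdefault(logical_name, []).append((site, physical_path))

    def listing(self):
        """What the user sees: logical names only, regardless of location."""
        return sorted(self._replicas)

    def locate(self, logical_name, preferred_site=None):
        """Resolve a logical name to one replica, favoring a preferred site."""
        replicas = self._replicas[logical_name]
        for site, path in replicas:
            if site == preferred_site:
                return site, path
        return replicas[0]  # fall back to the first registered copy
```

A user could register `reel1.dpx` at hypothetical sites `tokyo` and `amsterdam`, see one entry in `listing()`, and have `locate("reel1.dpx", preferred_site="amsterdam")` resolve to the nearer copy.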

The Wellcome Trust Sanger Institute uses iRODS to manage next-generation genomic sequencing data:

  1. Metadata tracks origin and processing history.
  2. Workflow automation handles data replication, checksumming and policy-based data resource selection.
  3. Secure collaboration spans workgroups, with support for isolating maintenance operations and defining policy on a per-group basis.
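The workflow automation described in the list above (event-triggered checksumming, replication and policy-based resource selection) can be illustrated with a small sketch. It does not use the iRODS rule language or its microservices: the `RuleEngine` class, the `ingest` event name, the resource names and the per-group policy table are all hypothetical stand-ins.

```python
import hashlib
from collections import defaultdict


class RuleEngine:
    """Toy event dispatcher: rules registered for an event run when it fires."""

    def __init__(self):
        self._rules = defaultdict(list)

    def on(self, event):
        def decorator(func):
            self._rules[event].append(func)
            return func
        return decorator

    def fire(self, event, **context):
        # Rules run in the order they were registered.
        for rule in self._rules[event]:
            rule(**context)


engine = RuleEngine()
resources = {"fast-disk": {}, "archive-a": {}, "archive-b": {}}  # toy storage
catalog = {}  # object name -> checksum and replica locations


def select_resources(group):
    """Policy-based resource selection (the policy table is an assumption)."""
    policy = {"sequencing": ["archive-a", "archive-b"]}
    return policy.get(group, ["fast-disk"])


@engine.on("ingest")
def checksum_rule(name, data, group):
    """Record a checksum at ingest so later copies can be verified."""
    catalog[name] = {"sha256": hashlib.sha256(data).hexdigest(), "replicas": []}


@engine.on("ingest")
def replicate_rule(name, data, group):
    """Copy the object to every resource the group's policy names."""
    for resource in select_resources(group):
        resources[resource][name] = data
        catalog[name]["replicas"].append(resource)


def verify(name):
    """Return True only if every replica still matches the recorded checksum."""
    expected = catalog[name]["sha256"]
    return all(
        hashlib.sha256(resources[r][name]).hexdigest() == expected
        for r in catalog[name]["replicas"]
    )
```

Calling `engine.fire("ingest", name="run1.bam", data=b"ACGT", group="sequencing")` would checksum the object and place copies on both archive resources. Over a long lifecycle, periodic `verify` calls are what let an archive notice a replica that has silently changed.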

Several other bioinformatics centers, such as the Beijing Genomics Institute, The Broad Institute and Uppsala Genome Center, use iRODS for similar purposes. iRODS has also found uses in astronomy, biology, engineering, and numerous other disciplines.

Today, no vendors offer a commercial product based on iRODS. Support for the project is offered through the iRODS Partner Program.

Executive overview: http://irods.org/wp-content/uploads/2014/09/iRODS-Executive-Overview-August-2014.pdf
Technical overview: http://irods.org/wp-content/uploads/2012/04/iRODS-Overview-November-2014.pdf

Correction: Updated the text about support through the partner program.

Nick Heudecker
Research Director
4 years at Gartner
18 years IT Industry

Nick Heudecker is an Analyst in Gartner's Research and Advisory Data Management group.





Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.