Sharing Genomic Data – Cornell Data Services

Maize Diversity (CC BY-NC-SA 2.0) by CIMMYT

The Need

The Maize Diversity Project (funded by NSF award: Biology of Rare Alleles in Maize and Its Wild Relatives) generates many tens of terabytes of genotype and phenotype data. A portion of this data must be shared with the public, and it must be preserved and protected from loss.

The Challenge

The project as a whole uses a wide array of information technology, both in core facilities such as the Biotechnology Resource Center and in the individual collaborating labs and institutions. These technologies and platforms involve tradeoffs in cost, speed, data protection and durability, scalability, ease of use, and the ability to share data with collaborators and the public.

Researchers and informatics consultants on the project evaluate the lifecycle of the data to make sure the solution used at each stage is optimized for the way the data is being used at that stage. For example, the technology for performing heavy computational analysis is different from that required to provide raw and analyzed datasets to the public and scientific community.

This optimization is challenging, and constraints on funding, available solutions, and technical expertise can affect the final choices.

The Solution

CyVerse, formerly known as iPlant Collaborative, is funded by the National Science Foundation with a mandate to serve all life sciences. CyVerse provides researchers with computational infrastructure to support research with solutions for data storage, analysis, education, consulting and collaboration.

Currently, these services and resources are provided at no direct cost to the project or researchers. In the future, there may be a cost-recovery charge of about $200/TB/year for data storage.

Initially, each user or project account is given a 100 GB data allocation, but considerably larger allocations may be requested via the data store. All requests are reviewed, and requests above 2 TB require approval by the CyVerse executive team. For large requests, it is important to provide a strong scientific justification, evidence of community interest in the data, a list of collaborators, and a timeline for making the data public. If there is an existing canonical repository for the data, the data should be moved there when analyses are complete. Otherwise, the data may be hosted in the CyVerse Data Commons.

Through a simple process of requesting additional storage, the Maize Diversity Project was able to publish 20 TB of public data and use the Panzea website to make the data discoverable.
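To make the allocation thresholds and the possible future charge described above concrete, here is a minimal Python sketch. It is not a CyVerse tool or API: the function names `review_level` and `estimated_annual_cost_usd` are hypothetical, 100 GB is treated as 0.1 TB (a decimal-unit assumption), and the $200/TB/year rate is only the approximate figure mentioned as a future possibility.

```python
# Illustrative sketch only; names and unit conventions are assumptions,
# not part of any CyVerse software.

DEFAULT_ALLOCATION_TB = 0.1           # 100 GB initial allocation (decimal units assumed)
EXECUTIVE_APPROVAL_TB = 2.0           # requests above this need executive-team approval
POSSIBLE_RATE_USD_PER_TB_YEAR = 200   # approximate, possible future cost-recovery rate


def review_level(requested_tb: float) -> str:
    """Classify a storage request by the thresholds described in the text."""
    if requested_tb <= DEFAULT_ALLOCATION_TB:
        return "default allocation"
    if requested_tb <= EXECUTIVE_APPROVAL_TB:
        return "standard review"
    return "executive approval"


def estimated_annual_cost_usd(stored_tb: float,
                              rate: float = POSSIBLE_RATE_USD_PER_TB_YEAR) -> float:
    """Estimate a yearly charge if the cost-recovery rate were applied."""
    return stored_tb * rate


if __name__ == "__main__":
    size_tb = 20  # the public data volume reported for the Maize Diversity Project
    print(review_level(size_tb))
    print(estimated_annual_cost_usd(size_tb))
```

Under these assumptions, a 20 TB request falls in the executive-approval tier, and storing it would cost roughly $4,000 per year if the charge were introduced at the stated rate.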