Data Storage

Active Data Storage

Active data is data that is currently being collected and analyzed.

When saving active research data, follow the 3-2-1 rule: keep three copies of your data, on two different storage media, with one copy off site.
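As a rough illustration, the 3-2-1 rule can be written as a simple check over a list of copies. This is a minimal sketch, not a backup tool: the `DataCopy` class, the `satisfies_3_2_1` function, and the example storage locations are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of a dataset (all values below are illustrative)."""
    label: str      # human-readable name for the copy
    medium: str     # kind of storage, e.g. "laptop SSD", "external HDD", "cloud"
    offsite: bool   # True if the copy is kept away from the primary location


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule."""
    enough_copies = len(copies) >= 3                       # three copies
    enough_media = len({c.medium for c in copies}) >= 2    # two storage media
    has_offsite = any(c.offsite for c in copies)           # one copy off site
    return enough_copies and enough_media and has_offsite


# Hypothetical storage plan for a small project.
plan = [
    DataCopy("working copy", "laptop SSD", offsite=False),
    DataCopy("lab backup", "external HDD", offsite=False),
    DataCopy("institutional storage", "cloud", offsite=True),
]
print(satisfies_3_2_1(plan))
```

A plan with only the first two copies would fail the check, because it has fewer than three copies and none of them is off site.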

Archive Data Storage

There are many resources available for data storage at the end of a research project.

Specialist repositories are those that are dedicated to specific kinds of data.

Examples include:

  1. GenBank. A comprehensive public repository of DNA sequences maintained by NCBI, supporting genomic and metagenomic research.
  2. NCBI BioProject. A registry that groups the data produced by a single research effort. BioProject registration is required as part of data deposit to several NCBI primary data archives, including SRA, TSA, and WGS.
  3. Sequence Read Archive (SRA). A repository for high-throughput sequencing data, storing raw sequencing reads from genomic studies.
  4. Transcriptome Shotgun Assembly (TSA). A repository for transcriptome sequences, containing assemblies from various organisms.
  5. Gene Expression Omnibus (GEO). A repository for functional genomic data, storing high-throughput gene expression and other genomic data.
  6. European Nucleotide Archive (ENA). A resource for raw sequence data, alignments, and assembly data from high-throughput sequencing projects.
  7. EBI Metagenomics. A European Bioinformatics Institute resource that offers tools for the analysis and archiving of metagenomic data.
  8. GISAID. A global repository for the sharing of influenza and other viral genome sequences to track and monitor viral evolution.
  9. TreeBASE. A repository of phylogenetic information, including published phylogenetic trees and associated data.
  10. Ag Data Commons. A USDA-managed repository for data related to agriculture, including genomic, phenotypic, and environmental data.
  11. Wheat Initiative’s WheatIS. A global wheat data repository for genomic, phenotypic, and breeding data to support agricultural research.
  12. PeptideAtlas. A repository of peptide and proteomics data, providing a large collection of observed peptides from mass spectrometry experiments.
  13. Protein Data Bank (PDB). A repository for 3D structural data of large biological molecules, including proteins and nucleic acids.

Generalist repositories accept research data of any type, although they have other limitations, such as dataset size.

In Canada, most institutions host generalist data repositories, typically Borealis (which is based on Dataverse). Larger datasets can be stored on the Federated Research Data Repository (FRDR).

The advantage of storing data in these Canadian repositories is that it can be catalogued by other services, making it more findable by other researchers. You can search Canadian research data using the Lunaris data search engine.

Other generalist repositories include:

  1. Zenodo. An open-access repository for research data, offering long-term storage for scientific research outputs.
  2. Figshare. A cloud-based repository where researchers can upload, share, and manage research data, figures, and publications.
  3. GDR. The Guide des dépôts de recherche is a platform providing access to a variety of scientific research data repositories.
  4. GitHub. A platform primarily used for code hosting and collaboration, but also for storing research datasets and project documentation. GitHub can be configured so that releases are automatically archived in Zenodo.
  5. Borealis. A Canadian research data repository that offers long-term storage and sharing of research data across institutions.
  6. FRDR (Federated Research Data Repository). A platform for discovering and sharing Canadian research data, aimed at supporting data management and preservation.

Repository software

  1. iRODS. The Integrated Rule-Oriented Data System (iRODS) is open-source data management software used by research organizations and government agencies worldwide.