Presenting the Data Vault

Blog post by University of Manchester project developer Tom Higgins:

Yesterday I gave a short presentation on the Data Vault project at an event in Lancaster:

I based this on the original pitch with a few updates reflecting the work we’ve done over the last couple of months.

Here’s some of the feedback and questions from the event – I think a lot of these are more relevant for “phase 2 and beyond” than the current prototyping:

  • How does the Data Vault differ from iRODS? Perhaps the policy model from iRODS could be useful, or iRODS could serve as a back-end. There was a comment that iRODS may be more useful where the researcher’s workflow is known and can be encoded into the system (i.e. where it’s deeply involved in the day-to-day handling of active data).
  • Archivematica (being explored by a project in York) can handle many preservation activities but has a specialist user interface which is not suitable for researchers to use directly. Perhaps a Data Vault could be used to ingest data and hand it over to Archivematica for preservation.
  • How would a Data Vault handle sensitive data? Would it need to be certified? What if the “back-end” was using a certified storage system – would that ease the burden at all? I mentioned that perhaps both a “general” and a locked-down “sensitive” instance of the software could be run in parallel.
  • How could a Data Vault handle a dataset that is changing over time? Perhaps snapshots could be captured periodically – would this use a lot of storage space?
  • Could data be ingested from instruments automatically? I think this is an interesting one because the researcher will presumably want to access the data on active storage too: just ingesting into the vault isn’t particularly useful, since you’d then need to pull it back out to actually work with the data, but you may want to have a frozen copy of the raw data too.
  • How could a Data Vault handle complex data e.g. from a database or an object store? In the simple case a user could export their data (e.g. in a backup format) and store that data (similar to how they might back up a database to a USB drive). Does it make sense for a vault to try to understand complex data?
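
On the snapshot question above, one way to think about the storage cost is that periodic snapshots only need to store the files that actually changed. The Python sketch below is purely illustrative and is not how the Data Vault prototype works: the `SnapshotStore` class and its methods are hypothetical names, and it assumes files are small enough to hold in memory and that unchanged files can be identified by a SHA-256 hash of their contents.

```python
import hashlib


class SnapshotStore:
    """Toy content-addressed store (hypothetical): each unique file body is kept once."""

    def __init__(self):
        self.blobs = {}      # sha256 hex digest -> file bytes
        self.snapshots = []  # one manifest per snapshot: {path: digest}

    def snapshot(self, files):
        """Record a snapshot of `files` (a dict of path -> bytes) and return its index."""
        manifest = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)  # store only content not seen before
            manifest[path] = digest
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1

    def restore(self, index):
        """Rebuild the path -> bytes mapping for a given snapshot."""
        return {path: self.blobs[d] for path, d in self.snapshots[index].items()}

    def stored_bytes(self):
        """Total size of the unique file bodies held by the store."""
        return sum(len(body) for body in self.blobs.values())


# Two snapshots of a dataset where only the notes file changes.
store = SnapshotStore()
v1 = {"raw/run1.csv": b"a" * 1000, "notes.txt": b"first draft"}
v2 = {"raw/run1.csv": b"a" * 1000, "notes.txt": b"second draft"}
first = store.snapshot(v1)
second = store.snapshot(v2)

naive_bytes = sum(len(b) for v in (v1, v2) for b in v.values())
print(store.stored_bytes(), "bytes stored vs", naive_bytes, "for two full copies")
```

Here the large raw file is stored once even though it appears in both snapshots, so the cost of a second snapshot is roughly the size of what changed. A dataset where most files are rewritten between snapshots would see little benefit from this approach.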

Here are some examples of “Active” and “Archive” systems which might be useful targets for integration:

  • Box
  • Hitachi Content Platform
  • DuraCloud
  • iRODS
  • Archivematica

