Presenting the Data Vault

Blog post by University of Manchester project developer Tom Higgins:

Yesterday I gave a short presentation on the Data Vault project at an event in Lancaster.

I based this on the original pitch with a few updates reflecting the work we’ve done over the last couple of months.

Here’s some of the feedback and questions from the event – I think a lot of these are more relevant for “phase 2 and beyond” than the current prototyping:

  • How does the Data Vault differ from iRODS? Perhaps the policy model from iRODS could be useful or iRODS could serve as a back-end. There was a comment that iRODS may be more useful where the researcher’s workflow is known and can be encoded into the system (e.g. it’s deeply involved in the day-to-day active data).
  • Archivematica (being explored by a project in York) can handle many preservation activities but has a specialist user interface which is not suitable for researchers to use directly. Perhaps a Data Vault could be used to ingest data and hand it over to Archivematica for preservation.
  • How would a Data Vault handle sensitive data? Would it need to be certified? What if the “back-end” was using a certified storage system – would that ease the burden at all? I mentioned that perhaps both a “general” and a locked-down “sensitive” instance of the software could be run in parallel.
  • How could a Data Vault handle a dataset that is changing over time? Perhaps snapshots could be captured periodically – would this use a lot of storage space?
  • Could data be ingested from instruments automatically? I think this is an interesting one because the researcher will presumably want to access the data on active storage too (e.g. just ingesting into the vault isn’t particularly useful since you’d then need to pull it back out to actually work with the data, but you may want to have a frozen copy of the raw data too).
  • How could a Data Vault handle complex data, e.g. from a database or an object store? In the simple case a user could export their data (e.g. in a backup format) and store that export (similar to how they might back up a database to a USB drive). Does it make sense for a vault to try to understand complex data?
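On the snapshot question above, how much storage periodic snapshots use depends largely on how much of the dataset changes between them. Here is a minimal sketch in Python, assuming a content-addressed store in which identical file contents are kept only once. The `SnapshotStore` class and its methods are hypothetical names made up for illustration and are not part of the Data Vault code:

```python
import hashlib


class SnapshotStore:
    """Toy content-addressed store: identical file contents are kept once.

    Hypothetical illustration only; everything is held in memory.
    """

    def __init__(self):
        self.blobs = {}       # SHA-256 hex digest -> file content (bytes)
        self.snapshots = []   # one {path: digest} manifest per snapshot

    def snapshot(self, files):
        """Record a snapshot of `files`, a dict mapping path -> bytes."""
        manifest = {}
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            # Only store the content if we have not seen it before.
            self.blobs.setdefault(digest, content)
            manifest[path] = digest
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1

    def restore(self, index):
        """Rebuild the {path: bytes} view of a given snapshot."""
        manifest = self.snapshots[index]
        return {path: self.blobs[digest] for path, digest in manifest.items()}

    def stored_bytes(self):
        """Total size of the unique content held across all snapshots."""
        return sum(len(blob) for blob in self.blobs.values())


store = SnapshotStore()
dataset = {"raw.csv": b"a,b\n1,2\n", "notes.txt": b"first draft"}
first = store.snapshot(dataset)

# Only one file changes before the next snapshot.
dataset["notes.txt"] = b"second draft"
second = store.snapshot(dataset)

naive_bytes = sum(
    len(content)
    for index in (first, second)
    for content in store.restore(index).values()
)
print(store.stored_bytes(), "bytes stored,", naive_bytes, "bytes if copied in full")
```

In this toy example the unchanged `raw.csv` is stored once, so the second snapshot only adds the new version of `notes.txt`. A dataset where most files change between snapshots would see little benefit from this approach.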

Here are some examples of “Active” and “Archive” systems which might be useful targets for integration:

  • Box
  • Hitachi Content Platform
  • DuraCloud
  • iRODS
  • Archivematica

Data Vault hackathon

The development model we chose for the Data Vault was to get us all in a room (Robin, Tom, Claire, Mary, Stuart) and collaboratively develop the proof of concept system over a few days. We were kindly hosted by the University of Manchester IT services in their Sackville Street building.

We started by looking at the skeleton framework that Tom and Robin had worked on, and then assigned areas of code to each person to write. For example, work was required on the user interface, the broker in the middle that manages the system, and the backend workers that perform the archiving.
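To illustrate how those three parts relate, here is a minimal sketch in Python, assuming the broker passes deposit jobs to a single worker through an in-process queue. The names (`Broker`, `archive_worker`, `deposit`) and the example path are hypothetical, and this is not the project's actual code or technology choice:

```python
import queue
import threading


def archive_worker(jobs, status):
    """Backend worker: takes deposit jobs off the queue and 'archives' them."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel value: no more work
            break
        # A real worker would copy the data to archive storage here.
        status[job["id"]] = "archived"


class Broker:
    """Sits between the user interface and the workers (hypothetical)."""

    def __init__(self):
        self.jobs = queue.Queue()
        self.status = {}  # job id -> "queued" or "archived"
        self._worker = threading.Thread(
            target=archive_worker, args=(self.jobs, self.status)
        )
        self._worker.start()

    def deposit(self, path):
        """Called by the user interface to request that `path` is archived."""
        job_id = len(self.status) + 1
        self.status[job_id] = "queued"
        self.jobs.put({"id": job_id, "path": path})
        return job_id

    def shutdown(self):
        """Tell the worker to stop and wait for queued jobs to finish."""
        self.jobs.put(None)
        self._worker.join()


# The "user interface" here is just a direct function call.
broker = Broker()
job_id = broker.deposit("/data/experiment-1")
broker.shutdown()
print(broker.status[job_id])
```

The point of the split is that the user interface only ever talks to the broker, so slow archiving work can happen in the background without the user having to wait on it.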

All of the code is stored openly on GitHub and is open source under an MIT license.


Work is now continuing following the hackathon to complete a few areas of remaining code before the next Jisc Data Spring programme meeting where we can share the system with others.