It’s been a busy summer for the DataVault, with further presentations taking place.
First up was a short trip to Dublin for the Open Repositories 2016 conference in June. The presentation was scheduled as part of the 24×7 session: 24 slides presented in 7 minutes – a mere 17.5 seconds per slide (rather than the usual rule-of-thumb of a minute or two!), with the slides auto-advancing for a bit of added fun. Thomas Higgins and Stuart Lewis presented, giving an overview of the project and the platform, followed by a demonstration.
Then in August, Mary McDerby and Stuart Lewis attended the Repository Fringe in Edinburgh, where Stuart presented the DataVault in a session led by the DCC’s Angus Whyte, alongside Rory Macneil from ResearchSpace, looking at the subject of research data workflows, and what this means for systems such as the DataVault which sit within those workflows.
At both events we presented an overview of the DataVault project, followed by a live demonstration. We then used the ‘Poll Everywhere’ audience participation system to gather feedback. In particular, we asked the audience the following questions:
Did we present valid use cases for a Data Vault system?
Do you see the need for a Data Vault at your institution?
Do you think the Data Vault does enough ‘preservation’?
Do you think the Data Vault is missing any obvious features?
Perhaps the most interesting piece of feedback came from question 2 (“Do you see the need for a Data Vault at your institution?”). At IDCC, which is an international conference, the response was split 50% yes and 50% no. However, at the Jisc event, which had a UK-only audience, the response was 100% yes.
We’re now in the third phase of the DataVault project, and as previously discussed, we are using fortnightly development sprints to undertake the remaining development tasks. Following our monthly project meeting yesterday, we now have draft sprints to take us up until the end of June, and the first full release of the software!
Keep an eye out for a future blog post: we’re scheduled to hold an event for potential early adopters of the DataVault system at their own institutions – 29th June, central London!
Between now and then we have planned a further four sprints (sprints 3, 4, 5, and 6). We plan these in detail at the start of each sprint, but right now we have indicative backlogs for the next three. Not only will these involve further development of the software, but also test installations at our institutions to allow more thorough testing of the software in situ, especially once fully integrated with local systems such as Shibboleth and PURE.
All of the Jisc #DataSpring projects have also been reviewed by both the SSI and the DPC in terms of sustainability and good practice from an open source perspective. We’re glad to report a relatively good result, but there are a few areas where we can improve – so we will also be addressing those in the coming weeks. These include better documentation, links about how to contribute to the project, and clearer contact details.
Phase three of the DataVault project is now underway in earnest. Two weeks ago we held the first of our monthly DataVault project meetings, with Mary McDerby and Thomas Higgins (University of Manchester) visiting us in Edinburgh. One of the changes we are making for this final six-month phase is to move to fortnightly development sprints.
We’re now almost at the end of the first sprint, so we reviewed the outputs during a Skype call today. We use Trello to manage a backlog of tasks, from which tasks are selected for each sprint. Most of the development tasks are now complete or almost finished. We also started planning the stories to be tackled in the next sprint; however, due to Easter, that one will be a little shorter than normal.
In the first sprint (sprint 1) we undertook some rationalisation of the API (which had grown arms and legs over time), added an enhanced auditing feature (using the PREMIS ontology), added an improved runtime configuration option, and provided better error reporting for a problem that occurs when CSRF tokens time out. As always, all of this can be seen in the project’s GitHub account.
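As an illustration of the auditing idea (the DataVault itself is written in Java, and its actual schema may differ), a PREMIS-style event record could look something like the sketch below. The field names follow the PREMIS data dictionary, but the function and identifiers are hypothetical:

```python
from datetime import datetime, timezone

def make_premis_event(event_type, detail, outcome, linked_object_id):
    """Build a minimal PREMIS-style audit event (illustrative structure,
    not the DataVault's actual schema)."""
    return {
        "eventType": event_type,                      # e.g. "ingestion", "fixity check"
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventDetail": detail,
        "eventOutcome": outcome,                      # e.g. "success" / "failure"
        "linkingObjectIdentifier": linked_object_id,  # the deposit being audited
    }

event = make_premis_event("ingestion", "Deposit archived to tape store",
                          "success", "deposit-0001")
```

Recording each deposit, retrieval, and fixity check as an event like this gives an audit trail that can later be exported alongside the archived data.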
In the next sprint we’ll be fixing a few bugs, updating some of the terminology (replacing ‘restore’ with ‘retrieve’ when getting data back from the DataVault), and adding better application logging.
A demo system is now available where you can see how the system works, and deposit and retrieve some example data sets.
Jisc have generously provided phase three funding to the Data Vault project. This funding will last for six months, and will provide resources to complete the development of the first complete version of the Data Vault software.
We’re pleased to announce that following the second round of project pitching in London on the 13th and 14th of July, the Data Vault project has been awarded further funding.
The first round of funding provided three months of effort, during which time the project developed a working proof of concept system that could archive and restore data. At the pitching session, team members demonstrated the prototype, and explained what would be achieved in the second round:
Critical success factor: Deliver a first version of a complete data vault platform
Implement the remaining requirements:
Authentication and authorisation
Integration with more storage options
Management / monitoring interface
Example interface to CRIS (PURE)
Development of retention and review policy
Two community engagement events (Manchester: 7th October, Edinburgh: 5th November)
We are grateful to Jisc for providing this second round of funding which will provide effort from August to November.
I based this on the original pitch with a few updates reflecting the work we’ve done over the last couple of months.
Here’s some of the feedback and questions from the event – I think a lot of these are more relevant for “phase 2 and beyond” than the current prototyping:
How does the Data Vault differ from iRODS? Perhaps the policy model from iRODS could be useful, or iRODS could serve as a back-end. There was a comment that iRODS may be more useful where the researcher’s workflow is known and can be encoded into the system (e.g. it’s deeply involved in the day-to-day active data).
Archivematica (being explored by a project in York) can handle many preservation activities but has a specialist user interface which is not suitable for researchers to use directly. Perhaps a Data Vault could be used to ingest data and hand it over to Archivematica for preservation.
How would a Data Vault handle sensitive data? Would it need to be certified? What if the “back-end” was using a certified storage system – would that ease the burden at all? I mentioned that perhaps both a “general” and a locked-down “sensitive” instance of the software could be run in parallel.
How could a Data Vault handle a dataset that is changing over time? Perhaps snapshots could be captured periodically – would this use a lot of storage space?
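On the storage-space question, a rough back-of-the-envelope comparison of full versus incremental snapshots is easy to sketch. This is purely illustrative arithmetic, not a DataVault feature; the function and its parameters are my own assumptions:

```python
def snapshot_storage_gb(dataset_gb, snapshots, change_fraction, incremental):
    """Rough storage estimate for periodic snapshots of a changing dataset.
    - incremental=False: every snapshot is a full copy.
    - incremental=True: the first snapshot is full; later ones store only
      the fraction of the data that changed since the previous snapshot.
    """
    if not incremental:
        return dataset_gb * snapshots
    return dataset_gb + dataset_gb * change_fraction * (snapshots - 1)

# 100 GB dataset, monthly snapshots for a year, ~5% churn per month:
full = snapshot_storage_gb(100, 12, 0.05, incremental=False)  # 1200 GB
inc  = snapshot_storage_gb(100, 12, 0.05, incremental=True)   # ~155 GB
```

So even naive delta-based snapshots can cut the storage cost by a large factor when only a small fraction of the data changes between captures.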
Could data be ingested from instruments automatically? I think this is an interesting one because the researcher will presumably want to access the data on active storage too (e.g. just ingesting into the vault isn’t particularly useful since you’d then need to pull it back out to actually work with the data, but you may want to have a frozen copy of the raw data too).
How could a Data Vault handle complex data, e.g. from a database or an object store? In the simple case a user could export their data (e.g. in a backup format) and store that data (similar to how they might back up a database to a USB drive). Does it make sense for a vault to try to understand complex data?
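The “export first, then deposit” approach can be sketched end-to-end with a small SQLite example. The paths and table here are invented for illustration; the point is just that a database becomes an ordinary file (a plain-text SQL dump in a tarball) that any vault could accept:

```python
import os
import sqlite3
import tarfile
import tempfile

# Create a throwaway database standing in for a researcher's data.
db_path = os.path.join(tempfile.mkdtemp(), "research.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE readings (id INTEGER, value REAL)")
conn.execute("INSERT INTO readings VALUES (1, 3.14)")
conn.commit()

# Export to a plain-text SQL dump -- a portable, preservation-friendly form.
dump_path = db_path + ".sql"
with open(dump_path, "w") as f:
    for line in conn.iterdump():
        f.write(line + "\n")
conn.close()

# Package the dump; this tarball is what would be handed to the vault.
archive_path = db_path + ".tar.gz"
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(dump_path, arcname="research.db.sql")
```

The vault never needs to understand the database itself; restoring is the reverse: retrieve the tarball, extract the dump, and replay the SQL.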
Here are some examples of “Active” and “Archive” systems which might be useful targets for integration:
The development model we chose for the Data Vault was to get us all in a room (Robin, Tom, Claire, Mary, Stuart) and to collaboratively develop the proof of concept system over a few days. We were kindly hosted by the University of Manchester IT services in their Sackville Street building.
We started by looking at the skeleton framework that Tom and Robin had worked on, and then assigned areas of code to each person to write. For example, work was required on the user interface that the user sees, the broker in the middle that manages the system, and the backend workers that perform the archiving.
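The three-part split described above (user interface, broker, workers) can be sketched in miniature with a job queue. This is a toy model of the pattern, not the DataVault’s actual code (which is Java); the names and the queue-based hand-off are assumptions for illustration:

```python
import queue
import threading

# The broker in the middle is modelled as a simple job queue: the UI side
# submits deposit jobs, and a background worker performs the "archiving".
jobs = queue.Queue()
results = []

def worker():
    while True:
        job = jobs.get()
        if job is None:        # sentinel value: shut the worker down
            break
        results.append(f"archived {job}")  # stand-in for the real archive step
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# The "user interface" side only ever talks to the broker queue.
jobs.put("dataset-42")
jobs.join()                    # wait until the worker has finished the job

jobs.put(None)                 # stop the worker
t.join()
```

Keeping the archiving behind a broker like this means slow storage operations never block the interface, and more workers can be added as load grows.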
All of the code is stored openly on GitHub, and is open source with an MIT license: