Development model

The Data Vault project is a collaboration between the University of Edinburgh and the University of Manchester. Most of the funding from the Data Spring programme has been allocated to software development effort at both partners, with a small proportion set aside for travel costs.

The intention of the three-month project is to create a proof-of-concept Data Vault system.  Months one and two covered the scoping, use case, requirements, and design phases.  Month three will be spent developing the software.

Three of us (Robin, Claire, Stuart) are due to travel to Manchester University next week to start writing the code with their project staff (Tom and Mary).  We’re taking the approach of a ‘hackathon’ – all of us coding together in the same room!  This should allow rapid development as communication will be easy.

 

Month 2 project meeting

The Data Vault project is three months long, and is a collaboration between the universities of Edinburgh and Manchester.  Due to the short nature of the project, we have decided to hold monthly meetings.  The first of these was held in Manchester University Library in April.

We held the second project meeting on Tuesday 5th May in Edinburgh University Library.  Storage and architecture were among the main focus points of the meeting, so we were lucky that experts in these areas attended from both universities.

The agenda for the meeting was:

  • Overview and introductions for architecture/infrastructure attendees
  • Review of the last month
  • Use cases and workflows
  • Filesystem / transfer security (user credentials) – not in POC
  • Dealing with large files / large archives (split bags?) – configurable per backend
  • Relationship with PURE (metadata harvesting) – not in POC
  • Prototype planning
  • Plans for the next month

Agreed actions from the meeting were:

  • Define the requirements of the data vault to a level that can be implemented:
    • Define ‘broker’, ‘storage’, and ‘archive’ APIs;
    • Define database structure for metadata / search index;
    • Define security requirements (Shibboleth / CAS / CoSign);
    • Select technologies for web user interface and broker;
    • Set up test infrastructure for month 3;
    • Architecture diagrams;
    • Test cases for the APIs / test data sets;
    • User interface wireframes (associated with use cases);
  • Consult within our local institutions, and more widely via the project blog, to ensure the use cases are valid.

Data Vault meeting 2

Data Vault storage considerations

The Data Vault service is a system that joins up two sets of storage: fast, expensive, high quality active research storage, and slow, cheap, archival-quality long term storage.  The Data Vault service will manage the transfer of data from one to the other, and back again if the data needs to be retrieved.
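To make the two-way transfer a little more concrete, here is a minimal sketch of the idea. It is not the Data Vault design: it assumes both tiers are simply directories on a local filesystem, and the function names (`transfer_verified`, `archive`, `retrieve`) are hypothetical. The point it illustrates is that a copy in either direction is only trusted once its checksum matches the source.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_verified(source: Path, destination_dir: Path) -> Path:
    """Copy a file into destination_dir and confirm the copy is identical."""
    source = Path(source)
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / source.name
    shutil.copy2(source, target)
    if file_digest(source) != file_digest(target):
        # Do not leave a corrupt copy behind.
        target.unlink()
        raise IOError(f"Checksum mismatch while copying {source}")
    return target


def archive(active_file: Path, archive_dir: Path) -> Path:
    """Hypothetical helper: move data from active storage to the archive tier."""
    return transfer_verified(active_file, archive_dir)


def retrieve(archived_file: Path, active_dir: Path) -> Path:
    """Hypothetical helper: bring archived data back to active storage."""
    return transfer_verified(archived_file, active_dir)
```

A real archive backend (tape, object storage, and so on) would replace the plain file copy, but the verify-after-copy step would still apply.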

Tom Higgins at Manchester University Library has been considering two aspects of the storage requirements for the Data Vault.  These have been compiled in openly editable documents.  If you have views on these, please join in the conversation by commenting on the documents!

Describing and Packaging data

The concept of the Data Vault service is to take research data that is no longer being actively used, and to archive it in long-term archival storage.  To facilitate this, two processes need to take place as the data is prepared for storage in the vault:

Description

Metadata needs to be provided with any package that is being archived.  This means that the data can be found and understood, and that any compliance requirements (for example rights or retention) can be met.  Metadata needs to be applied at different levels, for example to the complete vault or container for a project, to deposits made into that vault, and to individual files.
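One way to picture these three levels is as nested records. The sketch below is only an illustration: the class names and every field (`rights`, `review_date`, `checksum` and so on) are assumptions made for the example, not an agreed metadata schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class FileRecord:
    """File-level metadata (hypothetical fields)."""
    path: str
    size_bytes: int
    checksum: str


@dataclass
class Deposit:
    """Deposit-level metadata: one set of files added to a vault."""
    description: str
    deposited_on: date
    files: List[FileRecord] = field(default_factory=list)


@dataclass
class Vault:
    """Vault-level metadata: the container for a whole project."""
    project_title: str
    owner: str
    rights: str
    review_date: date
    deposits: List[Deposit] = field(default_factory=list)

    def total_size_bytes(self) -> int:
        """Sum the size of every file in every deposit."""
        return sum(f.size_bytes for d in self.deposits for f in d.files)

    def find_files(self, term: str) -> List[str]:
        """Return the paths of files whose path contains the search term."""
        return [f.path for d in self.deposits for f in d.files if term in f.path]
```

Keeping rights and retention information at the vault level means it is recorded once per project, while checksums naturally live with each file.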

Packaging

Rather than copying large structures of files into the archival storage, it has been decided to compile them into single packages.  This means that only one file per package needs to be stored, and the packages can have extra information included, such as checksums of the files contained in the package and a copy of the metadata.  BagIt seems to be the obvious choice for this, and there are many BagIt libraries available in different programming languages.
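As a rough illustration of the packaging idea, the sketch below uses only the Python standard library to bundle a directory into a single tar file together with a checksum manifest and a copy of the metadata. It loosely follows the BagIt layout (a `data/` directory plus a `manifest-sha256.txt`), but it is not a complete or validated bag; a real implementation would more likely use one of the existing BagIt libraries. The function names and the `metadata.json` file name are assumptions made for the example.

```python
import hashlib
import io
import json
import tarfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _add_bytes(archive: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Add an in-memory file to the tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))


def package_directory(source_dir: Path, package_path: Path, metadata: dict) -> dict:
    """Bundle source_dir into one tar file with a manifest and metadata copy.

    Returns a mapping of relative file path to SHA-256 checksum.
    """
    source_dir = Path(source_dir)
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    checksums = {p.relative_to(source_dir).as_posix(): sha256_of(p) for p in files}
    # One "<checksum>  <path>" line per payload file.
    manifest = "".join(
        f"{digest}  data/{name}\n" for name, digest in checksums.items()
    )
    with tarfile.open(package_path, "w") as archive:
        for p in files:
            archive.add(p, arcname=f"data/{p.relative_to(source_dir).as_posix()}")
        _add_bytes(archive, "manifest-sha256.txt", manifest.encode("utf-8"))
        _add_bytes(
            archive, "metadata.json", json.dumps(metadata, indent=2).encode("utf-8")
        )
    return checksums
```

Because the manifest travels inside the package, the files can be re-verified when the package is later retrieved from the archive.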

As with the evolving project plan, two openly editable documents have been created to discuss these two issues.  Please contribute if you have thoughts on either of them!