Data Vault hackathon

The development model we chose for the Data Vault was to get us all in a room (Robin, Tom, Claire, Mary, Stuart) and collaboratively develop the proof-of-concept system over a few days. We were kindly hosted by the University of Manchester IT services in their Sackville Street building.

We started by looking at the skeleton framework that Tom and Robin had worked on, and then assigned areas of code to each person to write. For example, work was required on the user interface, the broker in the middle that manages the system, and the backend workers that perform the archiving.
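To illustrate how those three parts might fit together, here is a minimal sketch rather than the project's actual code. It assumes the broker hands jobs to the workers through an in-memory queue; the names `Broker`, `worker`, and `submit`, and the example paths, are all hypothetical.

```python
import queue
import threading


def worker(jobs, results):
    """Backend worker: takes archive jobs off the queue until told to stop."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel value: no more work for this worker
            break
        # A real worker would package and copy the data; this sketch only
        # records that the job was handled.
        results.append(f"archived {job}")


class Broker:
    """Sits between the user interface and the backend workers (hypothetical)."""

    def __init__(self, n_workers=2):
        self.jobs = queue.Queue()
        self.results = []
        self.threads = [
            threading.Thread(target=worker, args=(self.jobs, self.results))
            for _ in range(n_workers)
        ]
        for thread in self.threads:
            thread.start()

    def submit(self, path):
        """Called by the user interface to request that a path is archived."""
        self.jobs.put(path)

    def shutdown(self):
        """Send one stop sentinel per worker, then wait for them to finish."""
        for _ in self.threads:
            self.jobs.put(None)
        for thread in self.threads:
            thread.join()


broker = Broker()
for path in ["/research/project-a", "/research/project-b", "/research/project-c"]:
    broker.submit(path)
broker.shutdown()
print(sorted(broker.results))
```

The user interface only ever talks to the broker, so workers can be added or replaced without the interface changing.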

All of the code is stored openly on GitHub, and is open source under an MIT license.

Work is now continuing after the hackathon to complete a few remaining areas of code before the next Jisc Data Spring programme meeting, where we can share the system with others.

Development model

The Data Vault project is a collaboration between the University of Edinburgh and the University of Manchester. The majority of the funding from the Data Spring programme has been allocated to paying for software development effort at both partners, along with a small proportion to pay for travel costs.

The intention of the three-month project is to create a proof-of-concept Data Vault system. Months one and two covered the scoping, use case, requirements, and design phases. Month three will be spent developing the software.

Three of us (Robin, Claire, Stuart) are due to travel to Manchester University next week to start writing the code with their project staff (Tom and Mary). We’re taking the approach of a ‘hackathon’: all of us coding collaboratively together! This should allow rapid development as communication will be easy.

 

Month 2 project meeting

The Data Vault project is three months long, and is a collaboration between the universities of Edinburgh and Manchester.  Due to the short nature of the project, we have decided to hold monthly meetings.  The first of these was held in Manchester University Library in April.

We held the second project meeting on Tuesday 5th May in Edinburgh University Library. One of the main focus points of the meeting was storage and architecture. We were therefore lucky that experts in these areas attended from both universities.

The agenda for the meeting was:

  • Overview and introductions for architecture/infrastructure attendees
  • Review of the last month
  • Use cases and workflows
  • Filesystem / transfer security (user credentials) – not in POC
  • Dealing with large files / large archives (split bags?) – configurable per backend
  • Relationship with PURE (metadata harvesting) – not in POC
  • Prototype planning
  • Plans for the next month

Agreed actions from the meeting were:

  • Define the requirements of the data vault to a level that can be implemented:
    • Define ‘broker’, ‘storage’, and ‘archive’ APIs;
    • Define database structure for metadata / search index;
    • Define security requirements (Shibboleth / CAS / CoSign);
    • Select technologies for web user interface and broker;
    • Set up test infrastructure for month 3;
    • Architecture diagrams;
    • Test cases for the APIs / test data sets;
    • User interface wireframes (associated with use cases);
  • Consult within our local institutions, and more widely via the project blog, to ensure the use cases are valid.

Data Vault storage considerations

The Data Vault service is a system that joins up two sets of storage: fast, expensive, high quality active research storage, and slow, cheap, archival-quality long term storage.  The Data Vault service will manage the transfer of data from one to the other, and back again if the data needs to be retrieved.

Tom Higgins at Manchester University Library has been considering two aspects of the storage requirements for the Data Vault.  These have been compiled in openly editable documents.  If you have views on these, please join in the conversation by commenting on the documents!

Describing and Packaging data

The concept of the Data Vault service is to take research data that is no longer being actively used and archive it in long-term archival storage. To facilitate this, two processes need to take place as the data is prepared for storage in the vault:

Description

Metadata needs to be provided with any package that is being archived. This means that the data can be found and understood, and that any compliance requirements (for example rights or retention) can be met. Metadata needs to be applied at different levels: for example, to the complete vault or container for a project, to deposits made into that vault, and to individual files.
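One way to picture the three levels is as nested records, sketched below. This is an illustration only, not an agreed design: the class names, the metadata keys, and the rule that a more specific level overrides a more general one are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """An individual file, with its own optional metadata."""
    path: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Deposit:
    """A set of files deposited into a vault at one time."""
    note: str
    files: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class Vault:
    """The top-level container for a project."""
    name: str
    deposits: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def effective_metadata(vault, deposit, file_record):
    """Merge the three levels; more specific levels override general ones.

    The override rule is an assumption for this sketch, not a project decision.
    """
    merged = dict(vault.metadata)
    merged.update(deposit.metadata)
    merged.update(file_record.metadata)
    return merged


# Hypothetical example values.
raw = FileRecord("survey/raw.csv", {"rights": "restricted"})
deposit = Deposit("First deposit", [raw], {"retention_years": 10})
vault = Vault("Survey project", [deposit],
              {"rights": "open", "retention_years": 5})

merged = effective_metadata(vault, deposit, raw)
print(merged)
```

Here the file keeps its own rights statement and the deposit's retention period, while anything not set lower down would fall back to the vault's values.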

Packaging

Rather than copying large structures of files into the archival storage, it has been decided to compile them into single packages. This means that only single files need to be stored, and the packages can have extra information included, such as checksums of the files contained in the package and a copy of the metadata. BagIt seems to be the obvious choice for this, and there are many BagIt libraries available in different programming languages.
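As a rough illustration of the idea, the sketch below builds a simplified BagIt-style package using only the Python standard library. It is not one of the existing BagIt libraries, which a real implementation would probably use instead. The function names `make_bag` and `verify_bag`, the sample file, and the choice of a tar file as the single stored package are assumptions made for the example.

```python
import hashlib
import shutil
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir, bag_dir, metadata):
    """Copy source_dir into a BagIt-style layout, then tar it into one file."""
    bag_dir = Path(bag_dir)
    payload = bag_dir / "data"
    shutil.copytree(source_dir, payload)  # the payload lives under data/
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8")
    # One manifest line per payload file: checksum, then the relative path.
    lines = [f"{sha256_of(f)}  {f.relative_to(bag_dir).as_posix()}\n"
             for f in sorted(payload.rglob("*")) if f.is_file()]
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    # A copy of the descriptive metadata travels inside the package.
    info = "".join(f"{key}: {value}\n" for key, value in metadata.items())
    (bag_dir / "bag-info.txt").write_text(info, encoding="utf-8")
    archive = bag_dir.parent / (bag_dir.name + ".tar")
    with tarfile.open(archive, "w") as tar:
        tar.add(bag_dir, arcname=bag_dir.name)
    return archive


def verify_bag(bag_dir):
    """Recompute every checksum in the manifest; True only if all match."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split("  ", 1)
        if sha256_of(bag_dir / rel_path) != expected:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "project"
    source.mkdir()
    (source / "results.csv").write_text("a,b\n1,2\n", encoding="utf-8")
    bag = Path(tmp) / "bag"
    archive = make_bag(source, bag, {"External-Description": "Example deposit"})
    with tarfile.open(archive) as tar:
        names = tar.getnames()
    ok_before = verify_bag(bag)
    # Simulate corruption of a payload file after packaging.
    (bag / "data" / "results.csv").write_text("tampered", encoding="utf-8")
    ok_after = verify_bag(bag)

print(ok_before, ok_after)
```

Because the checksums are stored inside the package, corruption can be detected on retrieval without consulting any other system.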

As with the evolving project plan, two openly editable documents have been created to discuss these two issues. Please contribute if you have thoughts on either of them!

Data Vault project kickoff meeting

Last week, members of the Data Vault project got together for the kickoff meeting. Hosted at the University of Manchester Library, we discussed the project plan and the milestones for the three-month project, agreed terminology for parts of the system, and started to assign tasks to project members for the first month.

Being only three months long, the project is being run in three one-month chunks. These are defined as follows:

  1. Month 1: Define and Investigate: This phase will allow us to agree what the Data Vault should do, and how it does it. Specifically, it will look at:
    1. What are the use cases for the Data Vault
    2. How do we describe the system (create overview diagrams)
    3. How should the data be packaged (metadata + data) for long term archival storage
    4. Develop example workflows for how the Data Vault could be used in the research process
    5. Examine the capabilities of archival storage systems to ensure they can support the proposed Data Vault
  2. Month 2: Requirements and Design: This phase will create the requirements specification and initial design of the system:
    1. Define the requirements specification
    2. Use the requirement specification to design the Data Vault system
  3. Month 3: Develop a Proof of Concept: This phase will seek to develop a minimal proof of concept that demonstrates the concept of the Data Vault:
    1. Deliver a working proof of concept that can describe and archive some data, and then retrieve it

At the end of month three, we will prepare for the second Jisc Data Spring sandpit workshop where we will seek to extend the project to take the prototype and develop it into a full system.

All of this is being documented in the project plan, which is a ‘living document’ that is constantly evolving as the project progresses.  The plan is online as a Google Document:

Look out for further blog posts during the month as we undertake the definitions and investigations!

‘Develop a Data Vault’ project funded!

The ‘Develop a Data Vault’ proposal submitted to the Jisc #DataSpring funding call has been funded! Jointly submitted by the Universities of Edinburgh and Manchester, the project aims to develop a Data Vault system that can be used to allow the description and long-term storage of important research data.

Further details of the funding programme can be found at http://www.jisc.ac.uk/rd/projects/research-data-spring

Watch out for further blog posts as the project progresses!