The Data Vault project is a three-month collaboration between the universities of Edinburgh and Manchester. Because the project is so short, we have decided to hold monthly meetings. The first of these was held in Manchester University Library in April.
We held the second project meeting on Tuesday 5th May in Edinburgh University Library. One of the main focuses of the meeting was storage and architecture, so we were lucky that experts in these areas attended from both universities.
The agenda for the meeting was:
- Overview and introductions for architecture/infrastructure attendees
- Review of the last month
- User cases and workflows
- Filesystem / transfer security (user credentials) – not in POC
- Dealing with large files / large archives (split bags?) configurable per backend
- Relationship with PURE (metadata harvesting) – not in POC
- Prototype planning
- Plans for the next month
Agreed actions from the meeting were:
- Define the requirements of the Data Vault to a level that can be implemented:
  - Define ‘broker’, ‘storage’, and ‘archive’ APIs;
  - Define the database structure for the metadata / search index;
  - Define the security requirements (Shibboleth / CAS / CoSign);
- Select technologies for web user interface and broker;
- Set up test infrastructure for month 3;
- Architecture diagrams;
- Test cases for the APIs / test data sets;
- User interface wireframes (associated with use cases);
- Consult within our local institutions, and more widely via the project blog, to ensure the use cases are valid.
The Data Vault service is a system that joins up two sets of storage: fast, expensive, high quality active research storage, and slow, cheap, archival-quality long term storage. The Data Vault service will manage the transfer of data from one to the other, and back again if the data needs to be retrieved.
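The broker that manages these transfers has not yet been designed, so purely as an illustrative sketch of the two-tier idea (class and method names here are assumptions, not the project's actual API), it might look something like this:

```python
import shutil
from pathlib import Path

# Hypothetical sketch only: the real broker/storage/archive APIs are still
# to be defined in month 2. This just illustrates the two-tier concept of
# moving datasets between active and archival storage, and back again.
class Broker:
    """Moves datasets between active research storage and archival storage."""

    def __init__(self, active_root: Path, archive_root: Path):
        self.active_root = Path(active_root)
        self.archive_root = Path(archive_root)

    def archive(self, name: str) -> Path:
        """Copy a dataset from active storage into the archive."""
        src = self.active_root / name
        dst = self.archive_root / name
        shutil.copytree(src, dst)
        return dst

    def retrieve(self, name: str) -> Path:
        """Copy an archived dataset back into active storage."""
        src = self.archive_root / name
        dst = self.active_root / name
        shutil.copytree(src, dst)
        return dst
```

In a real system the two tiers would be different storage backends behind the storage and archive APIs, rather than two local directories.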
Tom Higgins at Manchester University Library has been considering two aspects of the storage requirements for the Data Vault. These have been compiled in openly editable documents. If you have views on these, please join in the conversation by commenting on the documents!
The concept of the Data Vault service is to take research data that is no longer being actively used, and to archive it in long-term archival storage. To facilitate this, two processes need to take place as the data is prepared for storage in the vault:
Metadata needs to be provided with any package that is being archived. This means that the data can be found and understood, and any compliance requirements (for example rights or retention) met correctly. Metadata needs to be applied at different levels, for example to the complete vault or container for a project, to deposits made into that vault, and to individual files.
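The metadata schema is still under discussion in the openly editable documents, but as a purely illustrative sketch of the three levels (all field names here are assumptions), a deposit's metadata might be structured like this:

```python
# Illustrative only: the real metadata schema is being discussed in the
# openly editable documents; every field name below is an assumption.
deposit_metadata = {
    "vault": {  # metadata for the whole project vault/container
        "project": "Example Project",
        "owner": "pi@example.ac.uk",
        "retention_until": "2025-05-05",
    },
    "deposit": {  # metadata for one deposit made into the vault
        "title": "Survey results, wave 1",
        "deposited": "2015-05-05",
        "rights": "CC-BY-4.0",
    },
    "files": [  # per-file metadata
        {"path": "results.csv", "description": "Raw survey responses"},
    ],
}
```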
Rather than copying large structures of files into the archival storage, it has been decided to compile them into single packages. This means that only single files need to be stored, and extra information can be included in each package, such as checksums of the files it contains and a copy of the metadata. BagIt seems the obvious choice for this, and BagIt libraries are available in many programming languages.
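A real implementation would use one of the existing BagIt libraries, but as a minimal sketch of the idea (payload files plus a checksum manifest, packaged into a single archive file), the core of the approach looks like this:

```python
import hashlib
import zipfile
from pathlib import Path

def make_bag(source_dir: Path, bag_path: Path) -> None:
    """Package a directory BagIt-style into one zip file: payload files
    under data/, a manifest of MD5 checksums, and a bagit.txt declaration.
    (A sketch of the idea only; a production system would use an existing
    BagIt library rather than this hand-rolled version.)"""
    source_dir = Path(source_dir)
    manifest_lines = []
    with zipfile.ZipFile(bag_path, "w") as bag:
        for f in sorted(source_dir.rglob("*")):
            if f.is_file():
                rel = f.relative_to(source_dir)
                data = f.read_bytes()
                # Payload files live under data/, as in the BagIt layout
                bag.writestr(f"data/{rel}", data)
                manifest_lines.append(
                    f"{hashlib.md5(data).hexdigest()}  data/{rel}"
                )
        bag.writestr("manifest-md5.txt", "\n".join(manifest_lines) + "\n")
        bag.writestr(
            "bagit.txt",
            "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        )
```

The checksums in the manifest let the archive verify the payload on retrieval, which is exactly the extra information that storing a single package makes easy to carry along.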
As with the evolving project plan, two openly editable documents have been created to discuss these two issues. Please contribute if you have thoughts about these two issues!
- Metadata investigation:
- Packaging investigation:
Last week, members of the Data Vault project got together for the kickoff meeting. Hosted at the University of Manchester Library, the meeting allowed us to discuss the project plan and milestones for the three-month project, agree terminology for parts of the system, and start assigning tasks to project members for the first month.
Being only three months long, the project is being run in three one-month chunks. These are defined as follows:
- Month 1: Define and Investigate: This phase will allow us to agree what the Data Vault should do, and how it does it. Specifically, it will look at:
  - What are the use cases for the Data Vault
  - How do we describe the system (create overview diagrams)
  - How should the data be packaged (metadata + data) for long-term archival storage
  - Develop example workflows for how the Data Vault could be used in the research process
  - Examine the capabilities of archival storage systems to ensure they can support the proposed Data Vault
- Month 2: Requirements and Design: This phase will create the requirements specification and initial design of the system:
  - Define the requirements specification
  - Use the requirements specification to design the Data Vault system
- Month 3: Develop a Proof of Concept: This phase will seek to develop a minimal proof of concept that demonstrates the idea of the Data Vault:
  - Deliver a working proof of concept that can describe and archive some data, and then retrieve it
At the end of month three, we will prepare for the second Jisc Data Spring sandpit workshop, where we will seek to extend the project, taking the prototype and developing it into a full system.
All of this is being documented in the project plan, which is a ‘living document’ that is constantly evolving as the project progresses. The plan is online as a Google Document:
Look out for further blog posts during the month as we undertake the definitions and investigations!