Thinking about a Data Vault



[Reposted from https://libraryblogs.is.ed.ac.uk/blog/2013/12/20/thinking-about-a-data-vault/]

In a recent blog post, we looked at the four quadrants of research data curation systems.  This categorised systems that manage or describe research data assets by whether their primary role is to store metadata or data, and whether the information is for private or public use.  Four systems were then placed into these quadrants.  In another blog post we then began to investigate the requirements of a Data Asset Register in more detail.

[Image: the four quadrants of research data curation systems]

This blog post will look at the requirements and characteristics of a Data Vault, and how this component fits into the data curation system landscape.

What?

The first aspect to consider is what exactly a Data Vault is.  For the purposes of this blog post, we’ll simply consider it to be a safe, private store of data that is only accessible by the data creator or their representative.  For simplicity, it could be considered very similar to a safety deposit box within a bank vault.  However, beyond the basic concept, this analogy starts to break down quite quickly, as we’ll discuss later.

Why?

There are different use cases where a Data Vault would be useful.  A few are described here:

  • A paper has been published, and according to the research funder’s rules, the data underlying the paper must be made available upon request.  It is therefore important to store a date-stamped golden-copy of the data associated with the paper.  Even if the author’s own copy of the data is subsequently modified, the data at the point of publication is still available.
  • Data containing personal information, perhaps medical records, needs to be stored securely; the data is ‘complete’ and unlikely to change, yet hasn’t reached the point where it should be deleted.
  • Data analysis of a data set has been completed, and the research finished.  The data may need to be accessed again, but is unlikely to change, so needn’t be stored in the researcher’s active data store.  An example might be a set of completed crystallography analyses which, whilst still useful, will not need to be re-analysed.
  • Data is subject to retention rules and must be kept securely for a given period of time.

How?

Clearly the storage characteristics of a Data Vault differ from those of an open data repository or an active data filestore for working data.  The following is a list of some of the characteristics a Data Vault will need, or could make use of:

  • Write-once file system: The file system should only allow each file to be written once.  Deleting a file should require extra effort, so that files cannot be removed easily or accidentally.
  • Versioning:  If the same file needs to be stored again, then rather than overwriting the existing file, it should be stored alongside it as a new version.  This should be an automatic function (a minimal deposit sketch illustrating this follows the list).
  • File security: Only the data owner or their delegate can access the data.
  • Storage security: The Data Vault should only be accessible through the local university network, not the wider Internet.  This reduces the vectors of attack, which is important given the potential sensitivity of the data contained within the Data Vault.
  • Additional security: Encrypt the data, either via key management by the depositors, or within the storage system itself?
  • Upload and access:  Options include via a web interface (issues with very large files), special shared folders, dedicated upload facilities (e.g. GridFTP), or an API for integration with automated workflows.
  • Integration: How would the Data Vault integrate with the Data Asset Register?  Could the register be the main user interface for accessing the Data Vault?
  • Description:  What level of description, or metadata, is required for data sets stored in the Data Vault, to ensure that they can be found and understood in the future?
  • Assurance:  Facilities to ensure that the file uploaded by the researcher is intact and correct when it reaches the vault, and periodic checks to ensure that it has not subsequently become corrupted (see the fixity-checking sketch after this list).  What about more active preservation functions, including file format migration to keep files up to date (e.g. converting Word 95 documents to Word 2013 format)?
  • Speed:  Can the file system be much slower, perhaps a Hierarchical Storage Management (HSM) system that stores frequently accessed data on disk, but relegates older or less frequently accessed data to slower storage mechanisms such as tape?  Access might then be slow (it takes a few minutes for the data to be automatically retrieved from the tape) but the cost of the service is much lower.
  • Allocation: How much allocation should each person be given, or should it be unrestricted so as to encourage use?  What about costing for additional space?  Costing may be hard, because if the data is to be kept in perpetuity, then whole-life costing will be needed.  If the allocation is free, how do we stop it being used for routine backups rather than for golden-copy data?
  • Who:  Who is allowed access to the Data Vault to store data?
  • Review periods:  How to remind data owners what data they have in the Data Vault so that they can review their holdings, and remove unneeded data?
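
As a rough illustration of the write-once and versioning behaviour described in the list above, here is a minimal Python sketch.  The vault location, directory layout and function name are illustrative assumptions, not part of any existing service.

```python
import shutil
from pathlib import Path

VAULT_ROOT = Path("/vault")  # hypothetical mount point for the Data Vault


def deposit(source: Path, owner: str, dataset: str) -> Path:
    """Copy a file into the vault without ever overwriting an earlier copy.

    Repeated deposits of the same dataset are stored alongside previous
    copies as new, numbered versions (version-1, version-2, ...).
    """
    dataset_dir = VAULT_ROOT / owner / dataset
    dataset_dir.mkdir(parents=True, exist_ok=True)

    # Work out the next free version number rather than overwriting.
    next_version = len(list(dataset_dir.glob("version-*"))) + 1
    target = dataset_dir / f"version-{next_version}" / source.name
    target.parent.mkdir()

    shutil.copy2(source, target)

    # Make the stored copy read-only to approximate write-once behaviour.
    target.chmod(0o440)
    return target
```

In practice the write-once guarantee would come from the storage system itself rather than from file permissions, but the versioning logic would look broadly similar.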
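
The assurance point could be addressed by recording a checksum for each file at deposit time and re-verifying it on a schedule.  The sketch below is one possible approach, assuming SHA-256 checksums kept in a simple JSON manifest; the file and function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so that very large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(stored_file: Path, manifest: Path) -> None:
    """Record the checksum of a newly deposited file in the manifest."""
    entries = json.loads(manifest.read_text()) if manifest.exists() else {}
    entries[str(stored_file)] = sha256(stored_file)
    manifest.write_text(json.dumps(entries, indent=2))


def audit_fixity(manifest: Path) -> list[str]:
    """Re-hash every file in the manifest and report any missing or changed files."""
    entries = json.loads(manifest.read_text())
    return [
        path for path, expected in entries.items()
        if not Path(path).exists() or sha256(Path(path)) != expected
    ]
```

A scheduled job could run audit_fixity periodically and alert the vault administrators, and the same checksum could be compared against one calculated on the researcher’s machine to confirm that the upload arrived intact.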

Data Vault

Feedback on these issues and discussion points is very welcome!  We will keep this blog updated as these services develop.

Image available from http://dx.doi.org/10.6084/m9.figshare.873617

Tony Weir, Head of Unix Section, IT Infrastructure
Stuart Lewis, Head of Research and Learning Services, Library & University Collections.

Welcome to the new Research Data Management Service Coordinator: Kerry Miller

We have great pleasure in welcoming a new member of staff to the research data management programme.  Kerry Miller has joined us in the role of Research Data Management Service Coordinator.

Kerry Miller

Kerry is featured in the latest BITS magazine, sharing details of her new role:

What’s your background?
I’ve undertaken research for various organisations, in industry and charity sectors, including what is now GlaxoSmithKline and Cancer Research UK, as well as the Ministry of Defence and the British Council. I then joined the Digital Curation Centre (DCC) in 2011 as an Institutional Support Officer. This involved working with Higher Education institutions across the UK to help them improve their Research Data Management policy and practice, in response to Research Councils UK and other similar requirements.

Tell us about the new position.
My new post, RDM Service Co-ordinator, is a newly created role, aiming to bring together and co-ordinate all the different aspects of the research data management work that’s currently being undertaken throughout the University: lots of infrastructure improvements, and new tools and support for researchers. There are things like DataShare, which has been active for a while now, but which we’re promoting so that more researchers are aware of it and know when to use it. There are also a few more services that are still in the design phases. You can read all about the RDM work that is going on via the RDM Blog: datablog.is.ed.ac.uk

What particularly excites you about the new role?
The work we do at the DCC is in many ways quite theoretical; we go out and talk to institutions about what they ought to be doing, what they need to do to meet requirements, and that sort of thing, but this new role will be going from talking the talk to walking the walk; I’ve got to actually do what I’ve been telling people at other institutions to do! It’s quite scary but also quite exciting; just to see whether or not I can actually turn that into a real, successful service.

Where exactly will you be based?
I’m within the Research and Learning Section of Library & University Collections, on the lower ground floor of the Main Library. There is a huge number of people involved in the area, but the RDM team itself is small and there aren’t that many people full-time at the moment. RDM is part of a lot of people’s jobs – people like Stuart Lewis and John Scally from the library side, Tony Weir in IT Infrastructure and Robin Rice in the Data Library, but I’ll be one of the few people for whom it’s a full-time, dedicated role.

What do you enjoy doing outside work?
I watch a lot of films, and do a lot of cooking and baking. I’ve been doing a recipe a week from The Great British Bake Off, with greater or lesser success. I often use my office colleagues as a waste disposal system!

New research data storage

The latest BITS magazine for University of Edinburgh staff (Issue 8, Autumn/ Winter 2013) contains a lead article on new data storage facilities that Information Services have recently procured and will be making available to researchers for their research data management.

“The arrival of the RDM storage and its imminent roll out is an exciting step in the development of our new set of services under the Research Data Management banner. Ensuring that the service we deploy is fair, useful and transparent are key principles for the IS team.” – John Scally

 

Information Services is very pleased to announce that our new Research Data Storage hardware has been safely delivered.

Following a competitive procurement process, a range of suppliers were selected to provide the various parts of the infrastructure, including Dell, NetApp, Brocade and Cisco. The bulk of the order was assembled over the summer in China and shipped to the King’s Buildings campus at the end of August. Since then, IT Infrastructure staff have been installing, testing and preparing the storage for roll-out.

How good is the storage?
Information Services recognises the importance of the University’s research data and has procured enterprise-class storage infrastructure to underpin the programme of Research Data services. The infrastructure ranges from the highest class of flash storage (delivering 375,000 IO operations per second) to 1.6PB (1 Petabyte = 1,024 Terabytes) of bulk storage arrays. The data in the Research Data Management (RDM) file-store is automatically replicated to an off-site disaster recovery facility and also backed up with a 60-day retention period, with 10 days of file history visible online.

Who qualifies for an allocation?
Every active researcher in the University! This is an agreement between the University and the researcher to provide quality active data storage, service support and long term curation for researchers. This is for all researchers, not just Principal Investigators or those in receipt of external grants to fund research.

When do I get my allocation?
We are planning to roll out to early adopter Schools and institutes in late November this year. This is dependent on all of the quality checks and performance testing on the system being completed successfully; however, confidence is high that the deadline will be met.
The early adopters for the initial service roll-out are: the School of GeoSciences; the School of Philosophy, Psychology and Language Sciences; and the Centre for Population Health Sciences. Phased roll-out to all areas of the University will follow.

How much free allocation will I receive?
The University has committed 0.5TB (500GB) of high quality storage with guaranteed backup and resilience to every active researcher. The important principle at work is that the 0.5TB is for the individual researcher to use primarily to store their active research data. This ensures that they can work in a high quality and resilient environment and, hopefully, move valuable data from potentially unstable local drives. Research groups and Schools will be encouraged to pool their allocations in order to facilitate shared data management and collaboration.

This formula was developed in close consultation with College and School representatives; however, there will be discipline differences in how much storage is required and individual need will not be uniform. A degree of flexibility will be built into the allocation model and roll-out, though if researchers go over their 0.5TB free allocation they will have to pay.

Why is the University doing this?
The storage roll-out is one component of a suite of existing and planned services known as our Research Data Management Initiative. An awareness-raising campaign accompanies the storage allocation to Schools, units and individuals to encourage best practice in research data management planning and sharing.

Research Data Management support services:
www.ed.ac.uk/is/data-management

University’s Research Data Management Policy:
www.ed.ac.uk/is/research-data-policy

BITS magazine (Issue 8, Autumn/ Winter 2013)
http://www.ed.ac.uk/schools-departments/information-services/about/news/edinburgh-bits