IDCC 2014 – take home thoughts

A few weeks ago I attended the 9th International Digital Curation Conference in San Francisco.  The conference was spread over four days, with two days for workshops, and two for the main conference.  The conference was jointly run by the Digital Curation Centre and the California Digital Library.  Unsurprisingly it was an excellent conference with much debate and discussion about the evolving needs for digital curation of research data.


The main points I took home from the conference were:

Science is changing: Atul Butte gave an inspiring keynote containing an overview of the ways in which his own work is changing.  In particular, he explained how it is now possible to ‘outsource’ parts of the scientific process.  The first example was the ability to visit a web site and buy tissue samples for specific diseases; these samples were previously used for medical tests and would once have been discarded, but are now anonymised and collected.  Secondly, it is now possible to order mouse trials, again via a web site.  Both services allow routine activities to be performed more quickly and cheaply.

Big Data: This phrase is often used and means different things to different people.  A nice definition given by Jane Hunter was that curation of big data is hard because of its volume, velocity, variety and veracity.  She followed this up with some good examples of data being used effectively.

Skills need to be taught: There were several sessions about the role of Information Schools in educating a new breed of information professionals with the skills required to handle the growing requirements of analysing and curating data.  This growth is visible in the spread of job titles such as data engineer, data analyst, data steward and data journalist.  It was proposed that library degrees should include more technical skills, such as programming and data formats.

The data paper: There was much discussion about the concept of a ‘data paper’ – a short journal paper that describes a data set.  It was seen as an important element in raising the profile of the creation of re-usable data sets.  Such papers would be citable and trackable in the same way as journal papers, and could therefore contribute to esteem indicators.  There was a mix of traditional and new publishers with varying business models for achieving this.  One point that stood out for me was that publishers were not proposing to archive the data, only the associated data paper.  The archiving would need to take place elsewhere.

Tools are improving: I attended a workshop about Data Management in the Cloud, facilitated by Microsoft Research, which included a demo of some of the latest features of Excel.  Many of the new features serve two camps, business intelligence and research data analysis, and are equally useful and powerful for both.  Whichever perspective you come from, tools such as Excel are now much more than a spreadsheet for adding up numbers: they can import, manipulate, and display data in many new and powerful ways.

I was also able to present a poster that contains some of the evolving thoughts about data curation systems at the University of Edinburgh: http://dx.doi.org/10.6084/m9.figshare.902835

In his closing reflection on the conference, Clifford Lynch said that we need to understand how much progress we are making with data curation.  It will be interesting to see what progress has been made and what new issues are being discussed at next year's conference, which will be held much closer to home in London.

Stuart Lewis
Head of Research and Learning Services
Library & University Collections, Information Services

SCAPE workshop (notes, part 2)

Drawing on my notes from the SCAPE (Scalable Preservation Environments) workshop I attended last year, here is a summary of the presentation delivered by Peter May (BL) and the activities that were introduced during the hands-on training session.

Peter May (British Library)

To contextualise the activities and the tools we used during the workshop, Peter May presented a case study from the British Library (BL). The BL is a legal deposit archive that, among many other resources, archives newspapers. Newspapers are among the most brittle items in the archive: the paper is fragile and prone to disintegrate even under optimal preservation conditions (a humidity- and light-controlled environment). With support from Jisc (2007, 2009) and through their current partnership with brightsolid, the BL has been digitising this part of the collection, at a current rate of about 8000 scans per day.

BL’s main concern is how to ensure long term preservation of, and access to, the newspaper collection, and how to make digitisation processes cost effective (larger files require more storage space, so reducing the storage needed per file means more objects can be digitised). As part of the digitisation projects, BL had to reflect on:

  • How would the end-user want to engage with the digitised objects?
  • What file format would suit all those potential uses?
  • How will the collection be displayed online?
  • How to ensure smooth network access to the collection?

As an end-user, you might want to browse thumbnails in the newspaper collection, or you might want to zoom in and read through the text. In order to have the flexibility to display images at different resolutions when required, the BL has to scan the newspapers at high resolution. JPEG2000 has proved to be the most flexible format for displaying images at different resolutions (thumbnails, whole images, image tiles). The BL investigated how to migrate from TIFF to JPEG2000 format to enable this flexibility in access, as well as to reduce the size of files, and thereby the cost of storage and preservation management. A JPEG2000 file is normally half the size of a TIFF file.

At this stage, the SCAPE workflow comes into play. In order to ensure that it was safe to delete the original TIFF files after the migration into JPEG2000, the BL team needed to make quality checks across all the millions of files they were migrating.

For the SCAPE work at the BL, Peter May and the team tested a number of tools and created a workflow to migrate files and perform quality checks. For the migration process they tested codecs such as Kakadu and OpenJPEG, and for checking the integrity of the resulting JPEG2000 files and their compliance with institutional policies and preservation needs, they used jpylyzer. Other tools, such as Matchbox (for image feature analysis and duplicate identification) and ExifTool (an image metadata extractor that can be used to establish a file's provenance and, after migration, to compare metadata), were also tested within the SCAPE project at the BL. To verify the success of the migration process, the BL developed in-house code to compare, at scale, the outputs of the above-mentioned tools.
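The before/after metadata comparison can be sketched as follows. This is illustrative only, not the BL's actual code: it assumes metadata for each file has already been extracted into plain dictionaries (for example from ExifTool's JSON output), and the tag names checked here are a hypothetical selection.

```python
# Sketch of a post-migration metadata comparison, in the spirit of the
# in-house checks described above. Assumes metadata has already been
# extracted (e.g. with `exiftool -json`) into dictionaries.

CHECKED_TAGS = ["ImageWidth", "ImageHeight", "BitsPerSample"]  # hypothetical selection

def compare_metadata(before: dict, after: dict, tags=CHECKED_TAGS) -> list:
    """Return a list of (tag, before_value, after_value) mismatches."""
    mismatches = []
    for tag in tags:
        if before.get(tag) != after.get(tag):
            mismatches.append((tag, before.get(tag), after.get(tag)))
    return mismatches

# Toy example: metadata from a TIFF and from its JPEG2000 migration.
tiff_meta = {"ImageWidth": 5000, "ImageHeight": 7000, "BitsPerSample": 8}
jp2_meta  = {"ImageWidth": 5000, "ImageHeight": 7000, "BitsPerSample": 8}
print(compare_metadata(tiff_meta, jp2_meta))  # [] means no mismatches
```

Run at scale, a check like this would be applied to every migrated file, with any non-empty mismatch list flagging a file for investigation before the original is deleted.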

Peter May’s presentation slides can be found on the Open Planets Foundation wiki.

Hands-on training session

After Peter May’s presentation, the SCAPE workshop team guided us through an activity in which we checked if original TIFFs had migrated to JPEG2000s successfully. For this we used the TIFF compare command (tiffcmp). We first migrated from TIFF to JPEG2000 and then converted JPEG2000 back into TIFF. In both migrations we used tiffcmp to check (bit by bit) if the file had been corrupted (bitstream comparison to check fixity), and if the compression and decompression processes were reliable.
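The round-trip check can be illustrated with a small sketch. Note this is a stand-in: tiffcmp compares TIFF files field by field, whereas here a plain checksum comparison plays the same fixity-checking role, and the "migration" is simulated by a file copy rather than a real TIFF-to-JPEG2000-and-back conversion.

```python
# Toy illustration of the round-trip fixity check: compare a file before
# and after a (simulated) lossless round trip. The checksum comparison
# stands in for tiffcmp's bit-by-bit comparison.
import hashlib
import os
import shutil
import tempfile

def sha256_of(path: str) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Create a fake "scan" and simulate the round trip with a plain copy
# (stand-in for TIFF -> JPEG2000 -> TIFF with a lossless codec).
original = tempfile.NamedTemporaryFile(delete=False, suffix=".tif")
original.write(b"\x49\x49\x2a\x00" + b"pixel data" * 100)  # fake TIFF bytes
original.close()
roundtrip = original.name + ".roundtrip"
shutil.copy(original.name, roundtrip)

match = sha256_of(original.name) == sha256_of(roundtrip)
print(match)  # True: the round trip preserved the bitstream

os.remove(original.name)
os.remove(roundtrip)
```

If the checksums differed, that would indicate the compression and decompression steps were not reliable, and the original should not be deleted.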

The intention of the exercise was to show the migration process at small scale. However, when digital preservation tasks (migration, compression, validation, metadata extraction, comparison) have to be applied to thousands of files, a single processor would take a long time to run them, and for that reason parallelisation is a good idea. SCAPE has been working on parallelisation: dividing tasks across computational nodes to deal with large volumes of data at once.
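The divide-and-conquer idea can be shown in miniature with a worker pool on a single machine. This is a sketch, not SCAPE's implementation (which distributes work across cluster nodes): the file names and the per-file task here are invented, with checksumming standing in for any preservation task.

```python
# Illustrative sketch: fan a per-file preservation task (here, computing
# a checksum as a fixity check) out over a pool of workers, the same
# divide-and-conquer pattern SCAPE applies across computational nodes.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def fixity_check(name_and_bytes):
    """Stand-in preservation task: compute a checksum for one file."""
    name, data = name_and_bytes
    return name, hashlib.sha256(data).hexdigest()

# Fake "files": in practice these would be paths read from disk.
files = [(f"page_{i:04d}.tif", b"scan data %d" % i) for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(fixity_check, files))

print(len(results))  # one checksum per file
```

For genuinely CPU-bound tasks on one machine a process pool would be used instead of threads; at SCAPE's scale the same map-the-task-over-the-files pattern is spread over many nodes.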

SCAPE uses the Taverna workbench to create tailored workflows. You do not strictly need Taverna to run a preservation workflow, because many of the tools it incorporates can also be used standalone, such as FITS, a File Information Tool Set that “identifies, validates, and extracts technical metadata for various file formats”. However, Taverna offers a good solution for digital preservation workflows because you can build a single workflow that includes all the tools you need, choosing different tools at different stages depending on the preservation requirements of your data.

Related links:

http://openplanetsfoundation.org/
http://wiki.opf-labs.org/display/SP/Home
http://www.myExperiment.org/