IDCC 2014 – take home thoughts

A few weeks ago I attended the 9th International Digital Curation Conference in San Francisco.  The conference was spread over four days, with two days for workshops, and two for the main conference.  The conference was jointly run by the Digital Curation Centre and the California Digital Library.  Unsurprisingly it was an excellent conference with much debate and discussion about the evolving needs for digital curation of research data.


The main points I took home from the conference were:

Science is changing: Atul Butte gave an inspiring keynote containing an overview of the ways in which his own work is changing.  In particular, he explained how it is now possible to ‘outsource’ parts of the scientific process.  The first example was the ability to visit a web site and buy tissue samples for specific diseases; these samples were previously used for medical tests and would once have been discarded, but are now anonymised and collected.  Secondly, it is now possible to order mouse trials, again via a web site.  Both services allow routine activities to be performed more quickly and cheaply.

Big Data: This phrase is often used and means different things to different people.  A nice definition given by Jane Hunter was that curation of big data is hard because of its volume, velocity, variety and veracity.  She followed this up with some good examples of data being used effectively.

Skills need to be taught: There were several sessions about the role of Information Schools in educating a new breed of information professionals with the skills required to handle the growing requirements of analysing and curating data.  This growth is visible in the spread of job titles such as data engineer, data analyst, data steward and data journalist.  It was proposed that library degrees should include more technical skills, such as programming and data formats.

The data paper: There was much discussion about the concept of a ‘data paper’ – a short journal paper that describes a data set.  It was seen as an important element in raising the profile of the creation of re-usable data sets.  Such papers would be citable and trackable in the same way as journal papers, and could therefore contribute to esteem indicators.  There was a mix of traditional and new publishers with varying business models for achieving this.  One point that stood out for me was that publishers were not proposing to archive the data, only the associated data paper.  The archiving would need to take place elsewhere.

Tools are improving: I attended a workshop about Data Management in the Cloud, facilitated by Microsoft Research, which included a demo of some of the latest features of Excel.  Many of the new features serve two camps, business intelligence and research data analysis, and are equally useful and powerful for both.  Whichever perspective you come from, tools such as Excel are now much more than a spreadsheet for adding up numbers: they can import, manipulate, and display data in many new and powerful ways.

I was also able to present a poster that contains some of the evolving thoughts about data curation systems at the University of Edinburgh: http://dx.doi.org/10.6084/m9.figshare.902835

In his closing reflection on the conference, Clifford Lynch said that we need to understand how much progress we are making with data curation.  It will be interesting to see what progress has been made and what new issues are being discussed at next year's conference, which will be held much closer to home in London.

Stuart Lewis
Head of Research and Learning Services
Library & University Collections, Information Services

SCAPE workshop (notes, part 2)

Drawing on my notes from the SCAPE (Scalable Preservation Environments) workshop I attended last year, here is a summary of the presentation delivered by Peter May (BL) and the activities that were introduced during the hands-on training session.

Peter May (British Library)

To contextualise the activities and the tools we used during the workshop, Peter May presented a case study from the British Library (BL). The BL is a legal deposit archive that, among many other resources, archives newspapers. Newspapers are among the most brittle items in the archive: the paper is fragile and prone to disintegrate even under optimal preservation conditions (a humidity- and light-controlled environment). With support from Jisc (2007, 2009) and through their current partnership with brightsolid, the BL has been digitising this part of the collection, at a current rate of about 8000 scans per day.

BL’s main concern is how to ensure long term preservation of, and access to, the newspaper collection, and how to make digitisation processes cost effective (larger files require more storage space, so reducing the storage needed per file means more objects can be digitised). As part of the digitisation projects, BL had to reflect on:

  • How would the end-user want to engage with the digitised objects?
  • What file format would suit all those potential uses?
  • How will the collection be displayed online?
  • How to ensure smooth network access to the collection?

As an end-user, you might want to browse thumbnails in the newspaper collection, or you might want to zoom in and read through the text. In order to have the flexibility to display images at different resolutions when required, the BL has to scan the newspapers at high resolution. JPEG2000 has proved to be the most flexible format for displaying images at different resolutions (thumbnails, whole images, image tiles). The BL investigated how to migrate from TIFF to JPEG2000 format to enable this flexibility in access, as well as to reduce the size of files, and thereby the cost of storage and preservation management. A JPEG2000 file is normally half the size of a TIFF file.

At this stage, the SCAPE workflow comes into play. In order to ensure that it was safe to delete the original TIFF files after the migration into JPEG2000, the BL team needed to make quality checks across all the millions of files they were migrating.

For the SCAPE work at the BL, Peter May and the team tested a number of tools and created a workflow to migrate files and perform quality checks. For the migration process they tested codecs such as Kakadu and OpenJPEG, and for checking the integrity of the resulting JPEG2000 files and their compliance with institutional policies and preservation needs, they used jpylyzer. Other tools, such as Matchbox (for image feature analysis and duplicate identification) and ExifTool (an image metadata extractor that can be used to establish a file's provenance and, after migration, to compare metadata), were also tested within the SCAPE project at the BL. To verify the success of the migration process, the BL developed in-house code to compare, at scale, the outputs of the above-mentioned tools.
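The before/after metadata comparison can be sketched as follows. This is illustrative only, not the BL's actual code: it assumes metadata for each file has already been extracted into plain dictionaries (for example from ExifTool's JSON output), and the tag names checked here are a hypothetical selection.

```python
# Sketch of a post-migration metadata comparison, in the spirit of the
# in-house checks described above. Assumes metadata has already been
# extracted (e.g. with `exiftool -json`) into dictionaries.

CHECKED_TAGS = ["ImageWidth", "ImageHeight", "BitsPerSample"]  # hypothetical selection

def compare_metadata(before: dict, after: dict, tags=CHECKED_TAGS) -> list:
    """Return a list of (tag, before_value, after_value) mismatches."""
    mismatches = []
    for tag in tags:
        if before.get(tag) != after.get(tag):
            mismatches.append((tag, before.get(tag), after.get(tag)))
    return mismatches

# Toy example: metadata from a TIFF and from its JPEG2000 migration.
tiff_meta = {"ImageWidth": 5000, "ImageHeight": 7000, "BitsPerSample": 8}
jp2_meta  = {"ImageWidth": 5000, "ImageHeight": 7000, "BitsPerSample": 8}
print(compare_metadata(tiff_meta, jp2_meta))  # [] means no mismatches
```

Run at scale, a check like this would be applied to every migrated file, with any non-empty mismatch list flagging a file for investigation before the original is deleted.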

Peter May’s presentation slides can be found on the Open Planets Foundation wiki.

Hands-on training session

After Peter May’s presentation, the SCAPE workshop team guided us through an activity in which we checked if original TIFFs had migrated to JPEG2000s successfully. For this we used the TIFF compare command (tiffcmp). We first migrated from TIFF to JPEG2000 and then converted JPEG2000 back into TIFF. In both migrations we used tiffcmp to check (bit by bit) if the file had been corrupted (bitstream comparison to check fixity), and if the compression and decompression processes were reliable.
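The round-trip check can be illustrated with a small sketch. Note this is a stand-in: tiffcmp compares TIFF files field by field, whereas here a plain checksum comparison plays the same fixity-checking role, and the "migration" is simulated by a file copy rather than a real TIFF-to-JPEG2000-and-back conversion.

```python
# Toy illustration of the round-trip fixity check: compare a file before
# and after a (simulated) lossless round trip. The checksum comparison
# stands in for tiffcmp's bit-by-bit comparison.
import hashlib
import os
import shutil
import tempfile

def sha256_of(path: str) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Create a fake "scan" and simulate the round trip with a plain copy
# (stand-in for TIFF -> JPEG2000 -> TIFF with a lossless codec).
original = tempfile.NamedTemporaryFile(delete=False, suffix=".tif")
original.write(b"\x49\x49\x2a\x00" + b"pixel data" * 100)  # fake TIFF bytes
original.close()
roundtrip = original.name + ".roundtrip"
shutil.copy(original.name, roundtrip)

match = sha256_of(original.name) == sha256_of(roundtrip)
print(match)  # True: the round trip preserved the bitstream

os.remove(original.name)
os.remove(roundtrip)
```

If the checksums differed, that would indicate the compression and decompression steps were not reliable, and the original should not be deleted.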

The intention of the exercise was to show the migration process at small scale. However, when digital preservation tasks (migration, compression, validation, metadata extraction, comparison) have to be applied to thousands of files, a single processor would take a long time to run them, and for that reason parallelisation is a good idea. SCAPE has been working on parallelisation: dividing tasks across computational nodes to deal with large volumes of data at once.
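The divide-and-conquer idea can be shown in miniature with a worker pool on a single machine. This is a sketch, not SCAPE's implementation (which distributes work across cluster nodes): the file names and the per-file task here are invented, with checksumming standing in for any preservation task.

```python
# Illustrative sketch: fan a per-file preservation task (here, computing
# a checksum as a fixity check) out over a pool of workers, the same
# divide-and-conquer pattern SCAPE applies across computational nodes.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def fixity_check(name_and_bytes):
    """Stand-in preservation task: compute a checksum for one file."""
    name, data = name_and_bytes
    return name, hashlib.sha256(data).hexdigest()

# Fake "files": in practice these would be paths read from disk.
files = [(f"page_{i:04d}.tif", b"scan data %d" % i) for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(fixity_check, files))

print(len(results))  # one checksum per file
```

For genuinely CPU-bound tasks on one machine a process pool would be used instead of threads; at SCAPE's scale the same map-the-task-over-the-files pattern is spread over many nodes.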

SCAPE uses the Taverna workbench to create tailored workflows. You do not strictly need Taverna to run a preservation workflow, because many of the tools it incorporates can also be used standalone, such as FITS, a File Information Tool Set that “identifies, validates, and extracts technical metadata for various file formats”. However, Taverna offers a good solution for digital preservation workflows because you can build a single workflow that includes all the tools you need, choosing different tools at different stages depending on the preservation requirements of your data.

Related links:

http://openplanetsfoundation.org/
http://wiki.opf-labs.org/display/SP/Home
http://www.myExperiment.org/