IDCC 2014 – take home thoughts

A few weeks ago I attended the 9th International Digital Curation Conference in San Francisco.  The conference was spread over four days, with two days for workshops, and two for the main conference.  The conference was jointly run by the Digital Curation Centre and the California Digital Library.  Unsurprisingly it was an excellent conference with much debate and discussion about the evolving needs for digital curation of research data.

The main points I took home from the conference were:

Science is changing: Atul Butte gave an inspiring keynote that included an overview of the ways in which his own work is changing.  In particular, he explained how it is now possible to ‘outsource’ parts of the scientific process.  The first example was the ability to visit a web site and buy tissue samples for specific diseases; these samples were previously used for medical tests and then discarded, but are now anonymised and collected.  Secondly, it is now possible to order mouse trials, again via a web site.  These services allow routine activities to be performed more quickly and cheaply.

Big Data: This phrase is often used and means different things to different people.  A nice definition given by Jane Hunter was that curation of big data is hard because of its volume, velocity, variety and veracity.  She followed this up with some good examples of data being used effectively.

Skills need to be taught: There were several sessions about the role of Information Schools in educating a new breed of information professionals with the skills required to handle the growing demands of analysing and curating data.  This growth is reflected in the proliferation of job titles such as data engineer / analyst / steward / journalist.  It was proposed that library degrees should include more technical skills, such as programming and data formats.

The Data paper: There was much discussion about the concept of a ‘Data Paper’ – a short journal paper that describes a data set.  It was seen as an important element in raising the profile of the creation of re-usable data sets.  Such papers would be citable and trackable in the same ways as journal papers, and could therefore contribute to esteem indicators.  There was a mix of traditional and new publishers with varying business models for achieving this.  One point that stood out for me was that publishers were not proposing to archive the data, only the associated data paper.  The archiving would need to take place elsewhere.

Tools are improving: I attended a workshop about Data Management in the Cloud, facilitated by Microsoft Research.  They gave a demo of some of the latest features of Excel.  Many of the new features fall into one of two camps, but are equally useful and powerful in both: whether you are looking at data from the perspective of business intelligence or of research data analysis, tools such as Excel are now much more than a spreadsheet for adding up numbers.  They can import, manipulate, and display data in many new and powerful ways.

I was also able to present a poster that contains some of the evolving thoughts about data curation systems at the University of Edinburgh: http://dx.doi.org/10.6084/m9.figshare.902835

In his closing reflections on the conference, Clifford Lynch said that we need to understand how much progress we are making with data curation.  It will be interesting to see what progress has been made, and what new issues are being discussed, at next year’s conference, which will be held much closer to home in London.

Stuart Lewis
Head of Research and Learning Services
Library & University Collections, Information Services

How can you improve your data management skills?

A range of research data management (RDM) training, in the form of half-day courses and seminars, has been created to help you with data management issues and is now available for booking on the MyEd booking system:

  • Research Data Management Programme at the University of Edinburgh
  • Good practice in research data management
  • Creating a data management plan for your grant application
  • Handling data using SPSS (based on the MANTRA module)
  • Handling data with ArcGIS (based on the MANTRA module)

These courses and seminars aim to equip researchers, postgraduate research students and research support staff with a grounded understanding of data management issues and data handling.

If you manage research data, provide support for research, or are interested in finding out more about efficient and effective ways of managing your research data, these courses are for you.

For detailed information about these courses please go to: http://www.ed.ac.uk/schools-departments/information-services/research-support/data-management/rdm-training

We are also happy to arrange tailored sessions for researchers and research support staff in aspects of research data management from planning through to depositing.  Please contact us at IS.Helpline@ed.ac.uk if you would like to arrange a training session.

Cuna Ekmekcioglu
Senior Research Data Officer
Library & University Collections, IS

SCAPE workshop (notes, part 2)

Drawing on my notes from the SCAPE (Scalable Preservation Environments) workshop I attended last year, here is a summary of the presentation delivered by Peter May (BL) and the activities that were introduced during the hands-on training session.

Peter May (British Library)

To contextualise the activities and the tools we used during the workshop, Peter May presented a case study from the British Library (BL). The BL is a legal deposit archive that, among many other resources, archives newspapers. Newspapers are among the most brittle items in the archive: the paper is very sensitive and prone to disintegrate even under optimal preservation conditions (a humidity- and light-controlled environment). With support from Jisc (2007, 2009) and through their current partnership with brightsolid, the BL has been digitising this part of the collection, at a current rate of about 8,000 scans per day.

The BL’s main concern is how to ensure long-term preservation of, and access to, the newspaper collection, and how to make the digitisation process cost effective (larger files require more storage space, so less storage needed per file means more objects can be digitised). As part of the digitisation projects, the BL had to reflect on:

  • How would the end-user want to engage with the digitised objects?
  • What file format would suit all those potential uses?
  • How will the collection be displayed online?
  • How to ensure smooth network access to the collection?

As an end-user, you might want to browse thumbnails in the newspaper collection, or you might want to zoom in and read through the text. In order to have the flexibility to display images at different resolutions when required, the BL has to scan the newspapers at high resolution. JPEG2000 has proved to be the most flexible format for displaying images at different resolutions (thumbnails, whole images, image tiles). The BL investigated how to migrate from TIFF to JPEG2000 format to enable this flexibility in access, as well as to reduce the size of files, and thereby the cost of storage and preservation management. A JPEG2000 file is normally half the size of a TIFF file.

At this stage, the SCAPE workflow comes into play. In order to ensure that it was safe to delete the original TIFF files after the migration into JPEG2000, the BL team needed to make quality checks across all the millions of files they were migrating.

For the SCAPE work at the BL, Peter May and the team tested a number of tools and created a workflow to migrate files and perform quality checks. For the migration process they tested codecs such as Kakadu and OpenJPEG, and for validating the JPEG2000 files and checking that the format complied with institutional policies and preservation needs they used jpylyzer. Other tools were also tested within the SCAPE project at the BL, such as Matchbox (for image feature analysis and duplicate identification) and ExifTool (an image metadata extractor that can be used to find out details about the provenance of a file and, later on, to compare metadata after migration). To verify the success of the migration process, the BL developed in-house code to compare, at scale, the different outputs of the above-mentioned tools.
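To give a concrete sense of what such a migrate-and-check step might look like, here is a minimal Python sketch. It is my own illustration rather than the BL’s code, and it assumes that opj_compress (OpenJPEG, built with TIFF support), jpylyzer and exiftool are installed and on the PATH.

    import json
    import subprocess
    from pathlib import Path

    def migrate_and_check(tiff_path: str) -> bool:
        """Migrate one TIFF to JPEG2000, validate it, and compare basic metadata."""
        jp2_path = str(Path(tiff_path).with_suffix(".jp2"))

        # 1. Migrate TIFF -> JPEG2000 (kdu_compress from Kakadu could be swapped in here).
        subprocess.run(["opj_compress", "-i", tiff_path, "-o", jp2_path], check=True)

        # 2. Validate the JPEG2000 with jpylyzer; its XML report contains an
        #    isValid element set to True for a well-formed, valid file.
        report = subprocess.run(["jpylyzer", jp2_path],
                                capture_output=True, text=True, check=True).stdout
        if "<isValid" not in report or ">True<" not in report:
            return False

        # 3. Extract metadata from both files with exiftool (-j = JSON output)
        #    and compare a couple of basic properties after migration.
        def metadata(path):
            out = subprocess.run(["exiftool", "-j", path],
                                 capture_output=True, text=True, check=True).stdout
            return json.loads(out)[0]

        src, dst = metadata(tiff_path), metadata(jp2_path)
        return all(src.get(k) == dst.get(k) for k in ("ImageWidth", "ImageHeight"))

In practice the BL’s own comparison code would check far more than image dimensions; the sketch only shows how the individual tools can be chained into one quality-assured migration step.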

Peter May’s presentation slides can be found on the Open Planets Foundation wiki.

Hands-on training session

After Peter May’s presentation, the SCAPE workshop team guided us through an activity in which we checked whether original TIFFs had migrated to JPEG2000 successfully. For this we used the TIFF compare command (tiffcmp). We first migrated from TIFF to JPEG2000 and then converted the JPEG2000 back into TIFF, and then used tiffcmp to check (bit by bit) that the file had not been corrupted (a bitstream comparison to check fixity), and therefore that the compression and decompression processes were reliable.
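For reference, here is a minimal Python sketch of that round trip, again my own illustration and assuming opj_compress, opj_decompress and tiffcmp are installed. The comparison can only match bit for bit when the JPEG2000 compression is lossless (the opj_compress default).

    import subprocess

    def round_trip_ok(original_tiff: str) -> bool:
        """TIFF -> JPEG2000 -> TIFF, then compare the result with the original."""
        jp2, restored = "migrated.jp2", "restored.tif"
        # Migrate to JPEG2000 (lossless by default) and then decompress back to TIFF.
        subprocess.run(["opj_compress", "-i", original_tiff, "-o", jp2], check=True)
        subprocess.run(["opj_decompress", "-i", jp2, "-o", restored], check=True)
        # tiffcmp compares the two TIFFs and exits with status 0 when they match.
        return subprocess.run(["tiffcmp", original_tiff, restored]).returncode == 0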

The intention of the exercise was to show the migration process at a small scale. However, when digital preservation tasks (migration, compression, validation, metadata extraction, comparison) have to be applied to thousands of files, a single processor would take a very long time to run them, and for that reason parallelisation is a good idea. SCAPE has been working on parallelisation and on how to divide tasks across computational nodes to deal with large volumes of data all at once.
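As a small-scale illustration of the same idea (my own sketch, not SCAPE code), the round_trip_ok function from the sketch above could be spread across local CPU cores with Python’s multiprocessing module; a cluster framework applies the same pattern across many machines rather than many cores.

    from multiprocessing import Pool
    from pathlib import Path

    def check_collection(folder: str) -> dict:
        """Run round_trip_ok (from the sketch above) over every TIFF in a folder."""
        tiffs = [str(p) for p in Path(folder).glob("*.tif")]
        with Pool() as pool:                    # one worker per CPU core by default
            results = pool.map(round_trip_ok, tiffs)
        return dict(zip(tiffs, results))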

SCAPE uses the Taverna workbench to create tailored workflows. You do not need Taverna to run a preservation workflow, because many of the tools that can be incorporated into Taverna can also be used standalone – for example FITS, the File Information Tool Set, which “identifies, validates, and extracts technical metadata for various file formats”. However, Taverna offers a good solution for digital preservation workflows, since you can create a workflow that includes all the tools you need. The ideal use of Taverna in digital preservation is to choose different tools at different stages of the workflow, depending on the digital preservation requirements of your data.
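As an example of standalone use (a sketch, assuming FITS is installed and its fits.sh launcher is on the PATH), FITS can simply be called from a script and returns an XML report:

    import subprocess

    def fits_report(path: str) -> str:
        """Return the FITS XML report for a single file."""
        result = subprocess.run(["fits.sh", "-i", path],
                                capture_output=True, text=True, check=True)
        return result.stdout  # XML combining format identification, validation and metadata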

Related links:

http://openplanetsfoundation.org/
http://wiki.opf-labs.org/display/SP/Home
http://www.myExperiment.org/

SCAPE workshop (notes, part 1)

The aim of the SCAPE workshop was to show participants how to cope with large data volumes by using automation and preservation tools developed and combined as part of the SCAPE project. During the workshop, we were introduced to the Taverna workbench, a workflow engine that we installed via a Linux virtual machine on our laptops.

It has taken me a while to sort out my notes from the workshop, which was facilitated by Rainer Schmidt (Austrian Institute of Technology, AIT), Dave Tarrant (Open Planets Foundation, OPF), Roman Graf (AIT), Matthias Rella (AIT), Sven Schlarb (Austrian National Library, ONB), Carl Wilson (OPF), Peter May (British Library, BL), and Donal Fellows (University of Manchester, UNIMAN), but here they are. The workshop (September 2013) started with a demonstration of scalable data migration processes, for which the facilitators used a number of Raspberry Pis as a computer cluster (purely as a proof of concept).

Rainer Schmidt (AIT)

Here is a summary of the presentation delivered by Rainer Schmidt (AIT), who explained the framework of the SCAPE project (an FP7 project involving 16 organisations from 8 countries). The SCAPE project focuses on planning, managing and preserving digital resources using the concept of scalability. Computer clusters can manage data loads and distribute preservation tasks that cannot be managed in desktop environments. Some of the automated, distributed tasks they have been investigating are metadata extraction, file format migration, bit checking, quality assurance, etc.

During the workshop, the facilitators showed scenarios created and developed as part of the SCAPE project, which had served as a test bed for identifying the best use of different technologies in the preservation workflow. The hands-on activities started with a quick demonstration of the SCAPE preservation platform and of how to execute a SCAPE workflow when running it in the virtual machine.

SCAPE uses clusters of commodity hardware to create bigger environments, to make preservation tasks scalable, to distribute the required computing power efficiently, and to minimise errors. The system’s architecture is based on partitions: if a failure occurs, it only affects one machine and the tasks it performs, instead of affecting a bigger group of tasks. The cluster can also be used by a number of people, so an error affects only a specific part of the cluster and thereby only one user.

A disadvantage of distributing tasks across a cluster is that you have to manage the load balancing. If you put a large amount of data into the system, the system distributes the data among the nodes. Once the distributed data sets have been processed, the results are sent to nodes where they are aggregated. You have to use a special framework to deal with this distributed environment, and SCAPE uses such algorithms to find and query the data. The performance of a single CPU is far too limited, so parallel computing is used to process the data and bring the results back together.

The Hadoop framework (open source, from Apache) allows them to deal with the details of scalable preservation environments. Hadoop is a distributed file system and execution platform that allows them to map, reduce and distribute data and applications. The biggest advantage of Hadoop is that you can build applications on top of it, so it is easier to build a robust application (the computation doesn’t break because a node goes down or fails). Hadoop relies on the MapReduce programming model, which is widely used for data-intensive computations: Google, for example, runs MapReduce clusters with thousands of nodes for indexing, ranking, mining web content, and statistical analysis. Hadoop itself is programmed through Java APIs and scripts.
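The MapReduce model itself is easy to illustrate. The sketch below (my own illustration, not SCAPE or Hadoop code) maps each input record to key/value pairs, groups the values by key (the “shuffle”), and reduces each group; Hadoop runs the same two phases distributed across the nodes of a cluster.

    from collections import defaultdict

    def map_phase(record):
        # A record here is "filename<TAB>format"; emit (format, 1) for each file.
        filename, file_format = record.split("\t")
        yield file_format, 1

    def reduce_phase(key, values):
        return key, sum(values)          # e.g. the number of files per format

    def run_mapreduce(records):
        grouped = defaultdict(list)      # the "shuffle" step: group values by key
        for record in records:
            for key, value in map_phase(record):
                grouped[key].append(value)
        return dict(reduce_phase(k, v) for k, v in grouped.items())

    print(run_mapreduce(["a.tif\tTIFF", "b.jp2\tJPEG2000", "c.tif\tTIFF"]))
    # {'TIFF': 2, 'JPEG2000': 1}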

The SCAPE platform brings together the open source technologies Hadoop, Taverna and Fedora; SCAPE is currently using Hadoop version 1.2.0. Taverna allows you to visualise and model tasks and make them repeatable, while Fedora (http://www.fedora-commons.org/) provides the repository technology. The SCAPE platform incorporates the Taverna workflow management system as a workbench (http://www.taverna.org.uk) and Fedora as the core repository environment.

You can write your own applications, but a cost-effective solution is to incorporate existing preservation tools such as JHOVE or METS into the Taverna workflow. Taverna allows you to integrate these tools into the workflow, and supports repository integration (reading data from, and distributing data back into, preservation environments such as Fedora). The Taverna workflow can be run on the desktop as a sandpit environment; however, for long-running workflows you might want to run Taverna over the internet. More information about the SCAPE platform and how Hadoop, Taverna and Fedora are integrated is available at http://www.scape-project.eu/publication/1334/.
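As a sketch of what one such tool step might look like outside Taverna (assuming JHOVE is installed, its jhove launcher is on the PATH, and the JPEG2000-hul module is available), a validation call can be wrapped in a few lines and then incorporated into a workflow as a tool invocation:

    import subprocess

    def jhove_validate_jp2(path: str) -> bool:
        """Validate a JPEG2000 file with JHOVE and report whether it is well-formed and valid."""
        result = subprocess.run(["jhove", "-m", "JPEG2000-hul", "-h", "XML", path],
                                capture_output=True, text=True, check=True)
        # JHOVE's XML report includes a status such as "Well-Formed and valid".
        return "Well-Formed and valid" in result.stdout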

Setting up a preservation platform like this also entails a series of challenges. Some of the obstacles you might encounter are mismatches in the parallelisation (differences between the desktop and cluster environments). Workflows that work for repositories might not work for web archiving, because they use different distributed environments; to avoid mismatches, use a cluster that is built around specific workflow needs. Institutions may be wary of keeping a cluster in-house, while on the other hand they may be reluctant to transfer big datasets over the internet.

Related links:

http://openplanetsfoundation.org/
http://wiki.opf-labs.org/display/SP/Home
http://www.myExperiment.org/