SCAPE workshop (notes, part 1)

The aim of the SCAPE workshop was to show participants how to cope with large data volumes by using automation and preservation tools developed and combined as part of the SCAPE project. During the workshop, we were introduced to Taverna workbench, a workflow engine that we installed in a virtual machine (Linux) on our laptops.

It has taken me a while to sort out my notes from the workshop, but here they are. The workshop was facilitated by Rainer Schmidt (Austrian Institute of Technology, AIT), Dave Tarrant (Open Planets Foundation, OPF), Roman Graf (AIT), Matthias Rella (AIT), Sven Schlarb (Austrian National Library, ONB), Carl Wilson (OPF), Peter May (British Library, BL), and Donal Fellows (University of Manchester, UNIMAN). It took place in September 2013 and started with a demonstration of scalable data migration processes, for which the facilitators used a number of Raspberry Pis as a computer cluster (only as a proof of concept).

Rainer Schmidt (AIT)

Here is a summary of the presentation delivered by Rainer Schmidt (AIT), who explained the SCAPE project framework (an FP7 project involving 16 organizations from 8 countries). The SCAPE project focuses on planning, managing and preserving digital resources using the concept of scalability. Computer clusters can manage data loads and distribute preservation tasks that cannot be managed in desktop environments. The automated distributed tasks they have been investigating include extraction of metadata, file format migration, bit checking and quality assurance.

During the workshop, facilitators showed scenarios created and developed as part of the SCAPE project, which had served as a test bed to identify the best use of different technologies in the preservation workflow. The hands-on activities started with a quick demonstration of the SCAPE preservation platform and how to execute a SCAPE workflow in the virtual machine.

SCAPE uses clusters of commodity hardware to build bigger environments that make preservation tasks scalable, distribute the required computing power efficiently, and minimise errors. The system’s architecture is based on partitions. If a failure occurs, it only affects one machine and the tasks that it performs, instead of affecting a bigger group of tasks. The cluster can also be used by a number of people, so an error affects only a specific part of the cluster and thereby only one user.
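To make the idea of partition-based failure isolation more concrete, here is a minimal sketch in Python. It is not SCAPE code: the function names, the thread pool standing in for cluster nodes, and the rule that a file whose name starts with "corrupt" makes a task fail are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items round-robin into n_parts lists (one per hypothetical node)."""
    return [items[i::n_parts] for i in range(n_parts)]


def process_partition(files):
    """Stand-in for a preservation task; fails on a file named 'corrupt...'."""
    results = []
    for name in files:
        if name.startswith("corrupt"):
            raise ValueError(f"cannot process {name}")
        results.append(name.upper())
    return results


def run_cluster(files, n_nodes=3):
    """Run each partition on its own worker and keep failures separate."""
    done, failed = [], []
    parts = partition(files, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        futures = [pool.submit(process_partition, p) for p in parts]
        for node, future in enumerate(futures):
            try:
                done.extend(future.result())
            except ValueError:
                # Only this node's partition is lost; the others still finish.
                failed.append(node)
    return done, failed


files = ["a.tif", "b.tif", "corrupt.tif", "d.tif", "e.tif", "f.tif"]
done, failed = run_cluster(files)
print(done, failed)
```

In this sketch the failing partition loses all of its tasks (including the healthy file that shared it), while the other two partitions complete normally, which mirrors the point that a failure affects one machine and the tasks it performs.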

A disadvantage of distributing tasks in a cluster is that you have to manage the load balancing. If you put loads of data into the system, the system distributes the data among the nodes. Once the distributed data sets have been processed, the results are sent to nodes where the results are aggregated. You have to use a special framework to deal with the distribution environment. SCAPE uses algorithms to find and query the data. The performance of a single CPU is far too small, so they use parallel computing to bring all the data back together.

The Hadoop framework (open source, from Apache) allows them to deal with the details of scalable preservation environments. Hadoop is a distributed file system and execution platform that allows them to map, reduce and distribute data and applications. The biggest advantage of Hadoop is that you can build applications on top of it, so it is easier to build a robust application (the computation doesn’t break because a node goes down or fails). Hadoop relies on the MapReduce programming model, which is widely used for data-intensive computations. Google hosts MapReduce clusters with thousands of nodes used for indexing, ranking, mining web content, and statistical analysis. MapReduce jobs can be written using Java APIs and scripts.
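The MapReduce model mentioned above can be illustrated with a small, single-machine sketch in plain Python. This does not use Hadoop or any of its APIs; the function names and the example task (counting files per extension, as a rough stand-in for a format survey) are assumptions chosen for illustration. On a real cluster the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (key, 1) pair per record; the key is the file extension."""
    name = record.rsplit("/", 1)[-1]
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else "unknown"
    yield ext, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: aggregate the values for one key."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence over an iterable of file paths."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


paths = ["/data/a.tif", "/data/b.TIF", "/data/c.pdf", "/data/README"]
print(map_reduce(paths))
```

Because each record is mapped independently and each key is reduced independently, both steps can be spread across many nodes, which is what makes the model suit large collections.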

The SCAPE platform brings the open source technologies Hadoop, Taverna and Fedora together. SCAPE is currently using Hadoop version 1.2.0. Taverna allows you to visualise and model tasks, make them repeatable, and use repository technology such as Fedora (http://www.fedora-commons.org/). The SCAPE platform incorporates the Taverna workflow management system as a workbench (http://www.taverna.org.uk) and Fedora technology as the core repository environment.

You can write your own applications, but a cost-effective solution is to incorporate existing preservation tools and standards such as JHOVE or METS into the Taverna workflow. Taverna allows you to integrate these tools in the workflow, and supports repository integration (read from and distribute data back into preservation environments such as Fedora). The Taverna workflow (sandpit environment) can be run on the desktop; however, for running long workflows you might want to run Taverna on the internet. More information about the SCAPE platform and how Hadoop, Taverna and Fedora are integrated is available at http://www.scape-project.eu/publication/1334/.

Setting up a preservation platform like this also entails a series of challenges. Some of the obstacles you might encounter are mismatches in the parallelisation (difference between desktop and cluster environment). Workflows that work for repositories might not work for web archiving, because they use different distributed environments. To avoid mismatches use a cluster that is centred on specific workflow needs. Keeping the cluster in-house is something of which institutions are wary, while on the other hand they may be reluctant about transferring big datasets over the internet.

Related links:

http://openplanetsfoundation.org/
http://wiki.opf-labs.org/display/SP/Home
http://www.myExperiment.org/

Come work with us – Data Library Assistant post

Data Library Assistant

EDINA and Data Library, Information Services

£25,759 - £29,837 per year
Full Time, Fixed Term: 36 months
Ref: 022330

The Data Library is working with others in Information Services to enhance and develop services to deliver the University’s Research Data Management programme. To this end the Data Library requires a member of the team to help us offer online and direct support for research data management planning and data curation, and to help raise awareness and provide training to staff and student researchers.

The Data Library hosts Edinburgh DataShare, a research data repository for members of the University along with a data catalogue and a suite of research data support web pages within the University website. This is an excellent opportunity for a graduate to apply their research skills to a growing service area.

You will be a university graduate or have suitable relevant experience. You will be enthusiastic about new forms of scholarly communication such as open access publishing and open data, and working with open source software. You will be able to engage with peers in your discipline and help them to understand how good data management and sharing practices can improve their research and impact.

You will have research experience and data analysis skills as well as knowledge of publishing in an academic environment. You will have an understanding of university structures and norms.

Excellent written and verbal communication skills and up-to-date computer/Internet literacy are essential.

There are many advantages to working at the University. Benefits include flexible working, an excellent pension, career prospects and generous holiday provision.

Further details (please enter vacancy code 024399)

Closing Date: 29 January 2014

Contact Person: Ingrid Earp
Contact Number: +44 (0)131 651 1240
Contact Email: i.earp@ed.ac.uk

Standing on the shoulders of giants: Phonetics Recording Archive

The University of Edinburgh’s proud heritage of academic and research achievements is underpinned by the calibre of the outstanding staff that have worked and taught within its walls.

One such individual of note was the late Elizabeth Theodora Uldall, a pioneering phonetician who spent over 30 years at the University. Elizabeth, or Betsy, as she was more commonly known, came to work at the University in 1949, after postings for the British Council both during and after the Second World War. Indeed these postings and her subsequent academic work meant that by the time she came to Edinburgh, she had already worked on five continents.

The primary interest of her research was phonetics, and during her time in Edinburgh she made many valuable contributions to this field through both her research and her teaching. A touching obituary was published in the Scotsman when she sadly passed away in 2004.

The Data Library are very happy to announce that recently, with the co-operation of the Linguistics and English Language department, we were able to gather, for preservation and sharing, some of the recently digitised research outputs of Betsy Uldall, David Abercrombie and other distinguished researchers into The University of Edinburgh Phonetics Recording Archive, mid-late 1900s collection on DataShare.

This collection contains five items of phonetic and linguistic research, including the research outputs and recordings from Betsy Uldall.

Although DataShare was not available for University staff at the time of Betsy Uldall’s retirement in 1983, it would seem that right up until the end, she remained conscious of her responsibilities and the value of her work to other researchers:

“Betsy Uldall, spoke to me before she died asking for this archive to be preserved, and with your help it will be preserved and accessible to people who can use it. – Many many thanks”

We are of course very happy to have played a part in meeting her request, and that her research data is now available to all who wish to study and build upon it.

David Girdwood
EDINA & Data Library

Training subject librarians in RDM

I have just completed running the MANTRA course for librarians http://datalib.edina.ac.uk/mantra/libtraining.html with my team of 8 subject librarians at Stirling University.  A member of the Research Office attended one session and the team manager for Library Content Manager also attended some of the sessions.

We started the librarians training kit on 29 May 2013 and our last session was in December, so the course has actually changed (and improved) whilst we were undertaking it.

I think we found it beneficial to set time aside as a team to look at this issue and take our time over it!  We enjoyed lots of lively discussions.  I am Chair of Stirling’s RDM Task Force and knew that we, as librarians, would be expected to have the skills to help researchers manage their research data.  It was great to know that there was already a training package in existence for librarians.

Everybody really liked the panda film in the last section.  They suggested using that style more often.  Some of my staff thought the videos were too long or too slow.

As the facilitator I found that the instructions were sometimes not clear but by the end I figured out that I just needed to look at the manual.  I think it was really useful at the beginning to have real researchers talking about the issues.

I feel more confident that my team are no longer fearful of RDM enquiries.

Thank you for a fantastic resource and I will continue recommending it to researchers.

Lisa Haddow

Team Manager:  Library Liaison and Development

University of Stirling