SCAPE workshop (notes, part 1)

The aim of the SCAPE workshop was to show participants how to cope with large data volumes by using automation and preservation tools developed and combined as part of the SCAPE project. During the workshop, we were introduced to the Taverna workbench, a workflow engine that we installed with a virtual machine (Linux) on our laptops.

It has taken me a while to sort out my notes from the workshop, which was facilitated by Rainer Schmidt (Austrian Institute of Technology, AIT), Dave Tarrant (Open Planets Foundation, OPF), Roman Graf (AIT), Matthias Rella (AIT), Sven Schlarb (Austrian National Library, ONB), Carl Wilson (OPF), Peter May (British Library, BL), and Donal Fellows (University of Manchester, UNIMAN), but here they are. The workshop (September 2013) started with a demonstration of scalable data migration processes, for which the facilitators used a number of Raspberry Pis as a computer cluster (as a proof of concept only).

Rainer Schmidt (AIT)

Here is a summary of the presentation delivered by Rainer Schmidt (AIT), who explained the SCAPE project framework (an FP7 project with 16 organizations from 8 countries). The SCAPE project focuses on planning, managing and preserving digital resources using the concept of scalability. Computer clusters can manage data loads and distribute preservation tasks that cannot be handled in desktop environments. Some of the automated, distributed tasks they have been investigating are metadata extraction, file format migration, bit-level checking and quality assurance.

During the workshop, facilitators showed scenarios created and developed as part of the SCAPE project, which had served as a test bed to identify the best use of different technologies in the preservation workflow. The hands-on activities started with a quick demonstration of the SCAPE preservation platform and how to execute a SCAPE workflow when running it in the virtual machine.

SCAPE uses clusters of commodity hardware to create bigger environments that make preservation tasks scalable, to distribute the required computing power efficiently, and to minimise errors. The system architecture is based on partitioning: if a failure occurs, it only affects one machine and the tasks it performs, rather than a bigger group of tasks. The cluster can also be shared by a number of people, so an error affects only a specific part of the cluster and thereby only one user.

A disadvantage of distributing tasks in a cluster is that you have to manage the load balancing. If you put large amounts of data into the system, the system distributes the data among the nodes. Once the distributed data sets have been processed, the results are sent to nodes where they are aggregated. You have to use a special framework to deal with the distributed environment. SCAPE uses algorithms to find and query the data. A single CPU is far too slow for these data volumes, so parallel computing is used to process the data and bring the results back together.
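As a rough local illustration of this split-process-aggregate pattern (a minimal sketch of my own, not SCAPE code), the following Java fragment divides a list of files among worker threads and then combines the partial results; a cluster framework does the same thing across machines rather than threads:

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of the distribute-then-aggregate pattern: each worker
// processes a subset of files (here it simply measures file sizes) and the
// partial results are combined at the end.
public class SplitProcessAggregate {
    public static void main(String[] args) throws Exception {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path p : dir) {
                if (Files.isRegularFile(p)) files.add(p);
            }
        }

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> partials = new ArrayList<>();
        for (Path p : files) {
            partials.add(pool.submit(() -> Files.size(p)));  // "map" step: process each item
        }

        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get();                                // "reduce" step: aggregate results
        }
        pool.shutdown();

        System.out.println("Processed " + files.size() + " files, " + total + " bytes in total");
    }
}
```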

The Hadoop framework (open source, Apache) allows them to deal with the details of scalable preservation environments. Hadoop is a distributed file system and execution platform that allows them to map, reduce and distribute data and applications. The biggest advantage of Hadoop is that you can build applications on top of it, so it is easier to build a robust application (the computation doesn’t break because a node goes down or fails). Hadoop relies on the MapReduce programming model, which is widely used for data-intensive computations; Google, for example, runs MapReduce clusters with thousands of nodes for indexing, ranking, mining web content and statistical analysis. Hadoop jobs can be written against its Java APIs or as scripts.
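To make the MapReduce model concrete, here is a minimal Hadoop job sketch (my own illustration, not from the workshop materials) using Hadoop's mapreduce Java API as available in the 1.x line. It assumes, purely as an example, a tab-separated identification report with lines of the form `path<TAB>mimetype` and counts how many files of each MIME type occur:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The mapper emits (mimetype, 1) for every input line; Hadoop groups the
// pairs by key and the reducer sums the counts for each MIME type.
public class MimeTypeCount {

    public static class MimeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text mime = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length == 2) {
                mime.set(fields[1]);
                context.write(mime, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "mime type count");
        job.setJarByClass(MimeTypeCount.class);
        job.setMapperClass(MimeMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar and submitted with `hadoop jar`, a job like this lets the framework distribute the map tasks across the cluster nodes and aggregate the counts in the reduce phase.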

The SCAPE platform brings the open source technologies Hadoop, Taverna and Fedora together. SCAPE is currently using Hadoop version 1.2.0. Taverna allows you to visualise and model tasks and make them repeatable, while Fedora (http://www.fedora-commons.org/) provides the repository technology. The SCAPE platform incorporates the Taverna workflow management system as a workbench (http://www.taverna.org.uk) and Fedora as the core repository environment.

You can write your own applications, but a cost-effective solution is to incorporate existing preservation tools such as JHOVE or METS into the Taverna workflow. Taverna allows you to integrate these tools in the workflow, and supports repository integration (reading from and distributing data back into preservation environments such as Fedora). The Taverna workbench (a sandpit environment) can be run on the desktop; however, for long-running workflows you might want to run Taverna on a server over the internet. More information about the SCAPE platform and how Hadoop, Taverna and Fedora are integrated is available at http://www.scape-project.eu/publication/1334/.
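As an illustration of wrapping a command-line preservation tool so it can be called as a single workflow step, here is a minimal sketch (my own, not a SCAPE wrapper) that assumes a `jhove` launcher is available on the PATH; the tool name and invocation are placeholders and any other characterisation tool could be swapped in:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Wraps a command-line characterisation tool so a workflow engine can call it
// as one step and pass the captured report on to the next step.
public class ToolWrapper {

    public static String characterise(String filePath) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("jhove", filePath);  // assumed launcher on PATH
        pb.redirectErrorStream(true);                               // merge stderr into stdout
        Process process = pb.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tool exited with code " + exitCode);
        }
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(characterise(args[0]));
    }
}
```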

Setting up a preservation platform like this also entails a series of challenges. One obstacle you might encounter is a mismatch in parallelisation (the difference between the desktop and the cluster environment). Workflows that work for repositories might not work for web archiving, because they use different distributed environments; to avoid such mismatches, use a cluster that is centred on the needs of a specific workflow. Institutions are also wary of keeping a cluster in-house, while on the other hand they may be reluctant to transfer big datasets over the internet.

Related links:

http://openplanetsfoundation.org/
http://wiki.opf-labs.org/display/SP/Home
http://www.myExperiment.org/

New research data storage

The latest BITS magazine for University of Edinburgh staff (Issue 8, Autumn/Winter 2013) contains a lead article on new data storage facilities that Information Services have recently procured and will be making available to researchers for their research data management.

“The arrival of the RDM storage and its imminent roll out is an exciting step in the development of our new set of services under the Research Data Management banner. Ensuring that the service we deploy is fair, useful and transparent are key principles for the IS team.” John Scally

Information Services is very pleased to announce that our new Research Data Storage hardware has been safely delivered.

Following a competitive procurement process, a range of suppliers was selected to provide the various parts of the infrastructure, including Dell, NetApp, Brocade and Cisco. The bulk of the order was assembled over the summer in China and shipped to the King’s Buildings campus at the end of August. Since then, IT Infrastructure staff have been installing, testing and preparing the storage for roll-out.

How good is the storage?
Information Services recognises the importance of the University’s research data and has procured enterprise-class storage infrastructure to underpin the programme of Research Data services. The infrastructure ranges from the highest class of flash storage (delivering 375,000 IO operations per second) to 1.6PB (1 Petabyte = 1,024 Terabytes) of bulk storage arrays. The data in the Research Data Management (RDM) file-store is automatically replicated to an off-site disaster facility and also backed up with a 60-day retention period, with 10 days of file history visible online.

Who qualifies for an allocation?
Every active researcher in the University! This is an agreement between the University and the researcher to provide quality active data storage, service support and long term curation for researchers. This is for all researchers, not just Principal Investigators or those in receipt of external grants to fund research.

When do I get my allocation?
We are planning to roll out to early adopter Schools and institutes in late November this year. This is dependent on all of the quality checks and performance testing on the system being completed successfully; however, confidence is high that the deadline will be met.
The early adopters for the initial service roll-out are: the School of GeoSciences, the School of Philosophy, Psychology and Language Sciences, and the Centre for Population Health Sciences. Phased roll-out to all areas of the University will follow.

How much free allocation will I receive?
The University has committed 0.5TB (500GB) of high quality storage with guaranteed backup and resilience to every active researcher. The important principle at work is that the 0.5TB is for the individual researcher to use primarily to store their active research data. This ensures that they can work in a high quality and resilient environment and, hopefully, move valuable data from potentially unstable local drives. Research groups and Schools will be encouraged to pool their allocations in order to facilitate shared data management and collaboration.

This formula was developed in close consultation with College and School representatives; however, there will be discipline differences in how much storage is required, and individual need will not be uniform. A degree of flexibility will be built into the allocation model and roll-out, though researchers who exceed their 0.5TB free allocation will have to pay.

Why is the University doing this?
The storage roll-out is one component of a suite of existing and planned services known as our Research Data Management Initiative. An awareness-raising campaign accompanies the storage allocation to Schools, units and individuals to encourage best practice in research data management planning and sharing.

Research Data Management support services:
www.ed.ac.uk/is/data-management

University’s Research Data Management Policy:
www.ed.ac.uk/is/research-data-policy

BITS magazine (Issue 8, Autumn/Winter 2013)
http://www.ed.ac.uk/schools-departments/information-services/about/news/edinburgh-bits