Publishing Data Workflows

[Guest post from Angus Whyte, Digital Curation Centre]

In the first week of March the 7th Plenary session of the Research Data Alliance got underway in Tokyo. Plenary sessions are the fulcrum of RDA activity, when its many Working Groups and Interest Groups try to get as much leverage as they can out of the previous 6 months of voluntary activity, which is usually coordinated through crackly conference calls.

The Digital Curation Centre (DCC) and others in Edinburgh contribute to a few of these groups, one being the Working Group (WG) on Publishing Data Workflows. Like all such groups it has a fixed time span and agreed deliverables. This WG completes its run at the Tokyo plenary, so there’s no better time to reflect on why DCC has been involved in it, how we’ve worked with others in Edinburgh and what outcomes it’s had.

DCC takes an active part in groups where we see a direct mutual benefit, for example by finding content for our guidance publications. In this case we have a How-to guide planned on ‘workflows for data preservation and publication’. The Publishing Data Workflows WG has taken some initial steps towards a reference model for data publishing, so it has been a great opportunity to track the emerging consensus on best practice, not to mention examples we can use.

One of those examples was close to hand, and DataShare’s workflow and checklist for deposit is identified in the report alongside workflows from other participating repositories and data centres. That report is now available on Zenodo. [1]

In our mini-case studies, the WG found no hard and fast boundaries between ‘data publishing’ and what any repository does when making data publicly accessible. It’s rather a question of how much additional linking and contextualisation is in place to increase data visibility, assure the data quality, and facilitate its reuse. Here’s the working definition we settled on in that report:

Research data publishing is the release of research data, associated metadata, accompanying documentation, and software code (in cases where the raw data have been processed or manipulated) for re-use and analysis in such a manner that they can be discovered on the Web and referred to in a unique and persistent way.

The ‘key components’ of data publishing are illustrated in this diagram produced by Claire C. Austin.

Data publishing components. Source: Claire C. Austin et al [1]

As the Figure implies, a variety of workflows are needed to build and join up the components. They include those ‘upstream’ around the data collection and analysis, ‘midstream’ workflows around data deposit, packaging and ingest to a repository, and ‘downstream’ to link to other systems. These downstream links could be to third-party preservation systems, publisher platforms, metadata harvesting and citation tracking systems.

The WG recently began some follow-up work to our report that looks ‘upstream’ to consider how the intent to publish data is changing research workflows. Links to third-party systems can also be relevant in these upstream workflows. It has long been an ambition of RDM to capture as much as possible of the metadata and context, as early and as easily as possible. That has been referred to variously as ‘sheer curation’ [2] and ‘publication at source’ [3]. So we gathered further examples, aiming to illustrate some of the ways that repositories are connecting with these upstream workflows.

Electronic lab notebooks (ELNs) can offer one route towards fly-on-the-wall recording of the research process, so the collaboration between Research Space and University of Edinburgh is very relevant to the WG. As noted previously on these pages [4], [5], the RSpace ELN has been integrated with DataShare so researchers can deposit directly into it. So we appreciated the contribution Rory Macneil (Research Space) and Pauline Ward (UoE Data Library) made to describe that workflow, one of around half a dozen gathered at the end of the year.

The examples the WG collected each show how one or more of the recommendations in our report can be implemented. There are five of these short, to-the-point recommendations:

  1. Start small, building modular, open source and shareable components
  2. Implement core components of the reference model according to the needs of the stakeholder
  3. Follow standards that facilitate interoperability and permit extensions
  4. Facilitate data citation, e.g. through use of digital object PIDs, data/article linkages, researcher PIDs
  5. Document roles, workflows and services

The RSpace-DataShare integration example illustrates how institutions can follow these recommendations by collaborating with partners. RSpace is not open source, but the collaboration does use open standards that facilitate interoperability, namely METS and SWORD, to package up lab books and deposit them for open data sharing. DataShare facilitates data citation, and the workflows for depositing from RSpace are documented, based on DataShare’s existing checklist for depositors. The workflow integrating RSpace with DataShare is shown below:

RSpace-DataShare Workflows
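
To make the packaging step more concrete, here is a minimal Python sketch of what bundling lab-book files with a METS manifest might look like. It is an illustration only, not the actual RSpace or DataShare code: the function names, the `mets.xml` file name and the sample files are assumptions, and the manifest is deliberately bare (a real METS profile would normally carry descriptive metadata as well). The SWORD deposit itself, which sends such a package to the repository over HTTP, is not shown.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def build_mets_manifest(file_names):
    """Return a bare-bones METS document (bytes) listing the packaged files.

    Hypothetical helper: only a fileSec and a single-level structMap are written.
    """
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)
    mets = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "CONTENT"})
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div")
    for index, name in enumerate(file_names, start=1):
        file_id = f"file-{index}"
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": file_id})
        # FLocat points at the file's path inside the package.
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": name},
        )
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return ET.tostring(mets, encoding="utf-8", xml_declaration=True)


def build_package(files):
    """Zip the payload files together with a mets.xml manifest.

    `files` maps a file name inside the package to its content as bytes.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("mets.xml", build_mets_manifest(sorted(files)))
        for name, content in sorted(files.items()):
            archive.writestr(name, content)
    return buffer.getvalue()


# Example with made-up lab-book content.
package = build_package({
    "notebook.html": b"<html><body>Experiment notes</body></html>",
    "data.csv": b"sample,value\nA,1\n",
})
```

The resulting bytes are what a deposit client would hand to the repository; keeping the manifest and payload in one zip file is what lets the receiving side unpack and check the deposit without any further contact with the notebook system.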

For me, one of the most interesting things about this example was learning about the delegation of trust to research groups that can result. If the DataShare curation team can identify an expert user who is planning a large number of data deposits over a period of time, and train them to apply DataShare’s curation standards themselves, that user would be given administrative rights over the relevant Collection in the database, and the curation step for that Collection would be entrusted to them.
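
The delegation rule described above can be sketched as a tiny permission model. This is a hypothetical illustration in Python, not how DataShare is implemented: the class, the function names and the `datashare-curation-team` label are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed label for the central team; not a real DataShare identifier.
CENTRAL_TEAM = "datashare-curation-team"


@dataclass
class Collection:
    name: str
    delegated_curators: set = field(default_factory=set)


def delegate_curation(collection, user, trained):
    """Entrust curation of one Collection to a trained expert user."""
    if not trained:
        raise ValueError(f"{user} must be trained in the curation standards first")
    collection.delegated_curators.add(user)


def curators_for(collection):
    """Return who performs the curation step for deposits in this Collection."""
    # Fall back to the central team when nobody has been delegated.
    return sorted(collection.delegated_curators) or [CENTRAL_TEAM]


lab_collection = Collection("Example research group")
before = curators_for(lab_collection)
delegate_curation(lab_collection, "expert_user", trained=True)
after = curators_for(lab_collection)
```

The point of scoping the delegation to a single Collection is that trust is extended only where the curation team has checked that the standards are understood; every other Collection still goes through the central team.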

As more researchers take up the challenges of data sharing and reuse, institutional data repositories will need to make depositing as straightforward as they can. Delegating responsibilities and the tools to fulfil them has to be the way to go.


[1] Austin, C. et al. (2015). Key components of data publishing: Using current best practices to develop a reference model for data publishing. Available at:

[2] ‘Sheer Curation’ Wikipedia entry. Available at:

[3] Frey, J. et al. (2015). Collection, Curation, Citation at Source: Publication@Source 10 Years On. International Journal of Digital Curation, Vol. 10, No. 2, pp. 1-11.


[4] Macneil, R. (2014) Using an Electronic Lab Notebook to Deposit Data

[5] Macdonald, S. and Macneil, R. (2015). Service Integration to Enhance Research Data Management: RSpace Electronic Laboratory Notebook Case Study. International Journal of Digital Curation, Vol. 10, No. 1, pp. 163-172. doi:10.2218/ijdc.v10i1.354

Angus Whyte is a Senior Institutional Support Officer at the Digital Curation Centre.


Non-standard research outputs

I recently attended (13th May 2014) the one-day ‘Non-standard Research Outputs’ workshop at Nottingham Trent University.

[ 1 ] The day started with Prof Tony Kent and his introduction to some of the issues associated with managing and archiving non-text based research outputs. He posed the question: what uses do we expect these outcomes to have in the future? By trying to answer this question, we can think about the information that needs to be preserved with the output and how to preserve both the output and its documentation. He distinguished three common research outcomes in arts-humanities research contexts:

  • Images. He showed us an image of a research output from a fashion design researcher. The issue with research outputs like this one is that they are not always self-explanatory, and quite often open up the question of what is recorded in the image, and what the research outcome actually is. In this case, the image contained information about a new design for a heel of a shoe, but the research outcome itself, the heel, wasn’t easily identifiable, and without further explanation (description metadata), the record would be rendered unusable in the future.
  • Videos. The example used to explain this type of non-text based research output was a video featuring some of the research of Helen Storey. The video contains information about the project Wonderland and how textiles dissolve in water and water bottles disintegrate. In the video, researchers explain how creativity and materials can be combined to address environmental issues. Videos like this one contain both records of the research outcome in action (exhibition) and information about what the research outcome is and how the project ideas developed. These are very valuable outcomes, but they contain so much information that it’s difficult to untangle what is the outcome and what is information about the outcome.

  • Statements. Drawing from his experience, he referred to researchers in fashion and performance arts to explain this research outcome, but I would say it applies to other researchers in humanities and artistic disciplines as well. The issue with these research outcomes is the complexity of the research problems the researchers are addressing and the difficulty of expressing and describing what their research is about, and how the different elements that compose their research project outcomes interact with each other. How much text do we need to understand non-text-based research outcomes such as images and videos? How important is the description of the overall project to understand the different research outcomes?

Other questions that come to mind when thinking about collecting and archiving non-standard research outputs such as exhibitions are: what elements of the exhibition do we need to capture? Do we capture the pieces exhibited individually or collectively? How can audio/visual documentation convey the spatial arrangements of these pieces and their interrelations? What exactly constitutes the research outputs? Installation plans, cards, posters, dresses, objects, images, print-outs, visualisations, visitors’ comments, etc.? We also discussed how to structure data in a repository for artefacts that go into different exhibitions and installations. How do we define a practice-based research output that has a life of its own? How do we address this temporal element, the progression and growth of the research output? This flowchart might be useful. Shared with permission of James Toon and collaborators.


Sketch from group discussion about artefacts and research practices that are ephemeral. How to capture the artefact as well as spatial information, notes, context, images, etc.

[ 2 ] After these first insights into the complexity of what non-standard research outcomes are, Stephanie Meece from the University of the Arts London (UAL) discussed her experience as institutional manager of the UAL repository. This repository is for research outputs, but they have also set up another repository for research data which is currently not publicly available. The research output repository has thousands of deposits, but the data repository has ingested only one dataset in its first two months of existence. The dataset in question is related to a media-archaeology research project where a number of analogue-based media (tapes) are being digitised. This reinforced my suspicion that researchers in the arts and humanities are ready and keen to deposit final research outputs, but are less inclined to deposit their core data, the primary sources from which their research outputs derive.

The UAL learned a great deal about non-standard research outputs through the KULTUR project, a Jisc-funded project focused on developing repository solutions for the arts. Practice-based research methods engage with theories and practices in a different way than more traditional research methods. In their enquiries about specific metadata for the arts, the KULTUR project identified that metadata fields like ‘collaborators’ were mostly applicable to the arts (see metadata report, p. 25), and that these metadata fields differed from ‘data creator’ or ‘co-author.’ Drawing from this, we should certainly reconsider the metadata fields as well as the wording we use in our repositories to accommodate the needs of researchers in the arts.

Other examples of institutional repositories for the arts shown were VADS (University for the Creative Arts) and RADAR (Glasgow School of Art).

[ 3 ] Afterwards, Bekky Randall made a short presentation in which she explained that non-standard research outputs have a much wider variety of formats than standard text-based outputs. She also explained the importance of getting the researchers to do their own deposits, as they are the ones that know the information required for metadata fields. Once researchers find out what is involved in depositing their research, they will be more aware of what is needed, and get involved earlier with research data management (RDM). This might involve researchers depositing throughout the whole research project instead of at the end when they might have forgotten much of the information related to their files. Increasingly, research funders require data management plans, and there are tools to check what they expect researchers to do in terms of publication and sharing. See SHERPA for more information.

[ 4 ] The presentation slot after lunch is always challenging, but Prof Tom Fisher kept us awake with his insights into non-standard research outcomes. In the arts and humanities it’s sometimes difficult to separate insights from the data. He opened up the question of whether archiving research is mainly for Research Excellence Framework (REF) purposes. His point was to delve into the need to disseminate, access and reuse research outputs in the arts beyond REF. He argued that current artistic practice relates more to the present context (contemporary practice-based research) than to the past. In my opinion, arts and humanities always refer to their context but at the same time look back into the past, and are aware they cannot dismiss the presence of the past. For that reason, it seems relevant to archive current research outputs in the arts, because they will be the resources that arts and humanities researchers might want to use in the future.

He spent some time discussing the Journal for Artistic Research (JAR). This journal was designed taking into account the needs of artistic research (practice-based methodologies and research outcomes in a wide range of media), which do not lend themselves to the linearity of text-based research. The journal is peer-reviewed, and this process is made as transparent as possible by publishing the peer reviews along with the article. Here is an example peer review of an article submitted to JAR by ECA Professor Neil Mulholland.

[ 5 ] Terry Bucknell delivered a quick introduction to figshare. In his presentation he explained the origins of the figshare repository, and how the platform has improved its features to accommodate non-standard research outputs. The platform was originally designed for sharing scientific data, but has expanded its capabilities to appeal to all disciplines. If you have an ORCID account you can now connect it to figshare.

[ 6 ] The last presentation of the day was delivered by Martin Donnelly from the Digital Curation Centre (DCC) who gave a refreshing view into data management for the arts. He pointed out the issue of a scientifically-centred understanding of research data management, and that in order to reach the arts and humanities research community, we might need to change the wording, and change the word ‘data’ for ‘stuff’ when referring to creative research outputs. This reminded me of the paper ‘Making Sense: Talking Data Management with Researchers’ by Catharine Ward et al. (2011) and the Data Curation Profiles that Jane Furness, Academic Support Librarian, created after interviewing two researchers at Edinburgh College of Art, available here.

Quoting from his slides: “RDM is the active management and appraisal of data over all the lifecycle of scholarly research.” In the past, data in the sciences was not curated or taken care of after the publication of articles; now this process has changed and most science researchers already actively manage their data throughout the research project. This could be extended to arts and humanities research. Why wait to do it at the end?

The main argument for RDM and data sharing is transparency. The data is available for scrutiny and replication of findings. Sharing is most important when events cannot be replicated, such as performance or a census survey. In the scientific context ‘data’ stands for evidence, but in the arts and humanities this does not apply in the same way. He then referred to the work of Leigh Garrett, and how data gets reused in the arts. Researchers in the arts reuse research outputs but there is the fear of fraud, because some people might not acknowledge the data sources from which their work derives. To avoid this, there is the tendency to have longer embargoes in humanities and arts than in sciences.

After Martin’s presentation, we called it a day. While waiting for my train at Nottingham Station, I noticed I had forgotten my phone (and the flower sketch picture with it), but luckily Prof Tony Kent came to my rescue, and brought the phone to the station. Thanks to Tony and Off-Peak train tickets, I was able to travel back home on the day.

Rocio von Jungenfeld
Data Library Assistant

IDCC 2014 – take home thoughts

A few weeks ago I attended the 9th International Digital Curation Conference in San Francisco.  The conference was spread over four days, with two days for workshops, and two for the main conference.  The conference was jointly run by the Digital Curation Centre and the California Digital Library.  Unsurprisingly it was an excellent conference with much debate and discussion about the evolving needs for digital curation of research data.

San Francisco

The main points I took home from the conference were:

Science is changing: Atul Butte gave an inspiring keynote that contained an overview of the ways in which his own work is changing.  In particular he explained how it is now possible to ‘outsource’ parts of the scientific process. The first is the ability to visit a web site to buy tissue samples for specific diseases which were previously used for medical tests, but which have now been anonymised and collected rather than being discarded.  Secondly it is also now possible to order mouse trials to be undertaken, again via a web site.  These allow routine activities to be performed more quickly and cheaply.

Big Data: This phrase is often used and means different things to different people.  A nice definition given by Jane Hunter was that curation of big data is hard because of its volume, velocity, variety and veracity.  She followed this up by some good examples where data have been effectively used.

Skills need to be taught: There were several sessions about the role of Information Schools in educating a new breed of information professionals with the skills required to effectively handle the growing requirements of analysing and curating data.  This growth was demonstrated by how we are seeing many more job titles such as data engineer / analyst / steward / journalist.  It was proposed that library degrees should include more technical skills such as programming and data formats.

The Data paper: There was much discussion about the concept of a ‘Data Paper’ – a short journal paper that describes a data set.  It was seen as an important element in raising the profile of the creation of re-usable data sets.  Such papers would be citable and trackable in the same ways as journal papers, and could therefore contribute to esteem indicators.  There was a mix of traditional and new publishers with varying business models for achieving this.  One point that stood out for me was that publishers were not proposing to archive the data, only the associated data paper.  The archiving would need to take place elsewhere.

Tools are improving: I attended a workshop about Data Management in the Cloud, facilitated by Microsoft Research.  They gave a demo of some of the latest features of Excel.  Many of the new features seem to fit neatly into two camps, but are equally useful and powerful in both.  Whether you are looking at data from the perspective of business intelligence or research data analysis, tools such as Excel are now much more than a spreadsheet for adding up numbers.  They can import, manipulate, and display data in many new and powerful ways.

I was also able to present a poster that contains some of the evolving thoughts about data curation systems at the University of Edinburgh:

In his closing reflection on the conference, Clifford Lynch said that we need to understand how much progress we are making with data curation.  It will be interesting to see the progress made and what new issues are being discussed at the conference next year, which will be held much closer to home in London.

Stuart Lewis
Head of Research and Learning Services
Library & University Collections, Information Services

RDM in the arts and humanities

St Anne’s College, Oxford

The tenth Research Data Management Forum (RDMF), organised by DCC, was held in St Anne’s College, University of Oxford on 3 and 4 September. What follows is an account of proceedings, the goals of which were to examine aspects of arts and humanities research data that may require a different kind of handling to that given to other disciplines, and to discuss how needs for support, advocacy, training and infrastructure are being met.

Dave De Roure (Director of the Oxford e-Research Centre) started proceedings as keynote #1. He introduced the concept of the ‘fourth quadrant social machine’ (see Fig. 1), which extends the notion of ‘citizen as scientist’, taking advantage of the emergence of new analytical capabilities, (big) data sources and social networking. He also talked about the ‘End of Theory’, the idea that the data deluge is making scientific method obsolete, and referenced Douglas Kell’s BioEssay ‘Here is the evidence, now what is the hypothesis’, which argues that “data- and technology-driven programmes are not alternatives to hypothesis-led studies in scientific knowledge discovery but are complementary and iterative partners with them.” This, he continued, could have a major influence on how research is conducted, not only in the hard sciences but also in the arts and humanities.

Fig. 1. From ‘Social Machines’ by Dave De Roure on Slideshare

He contended that citizen science initiatives such as Zooniverse (real science online) extend the idea of ‘human as slave to the computer’, while noting that the ‘more we give the computer to do the more we have to keep an eye on what they do.’ De Roure argued that sharing methods is as important as sharing data, harnessing the web as a lens onto society (e.g. Twitterology), as infrastructure (e-science/e-research), and as the artifact of study. He went on to highlight innovative sharing initiatives in the arts and humanities that employ new analytical approaches such as:

  • the HATHI Trust Research Center which ‘enables computational access for nonprofit and educational users to published works in the public domain now and, in the future’ i.e. take your code to the data and get back your results!
  • CLAROS which uses semantic web technologies and image recognition techniques to make the geographically separate scholarly art datasets ‘interoperable’ and accessible to a broad range of users
  • SALAMI (Structural Analysis of Large Amounts of Music Information) which focuses on ‘developing and evaluating practices, frameworks, and tools for the design and construction of worldwide distributed digital music archives and libraries’

De Roure used the phrase ‘from signal to understanding’ to describe the workflow of the ‘social machine’ and went on to describe the commonalities that the arts and humanities community have with other disciplines such as working with multiple data sources (often incomplete and born digital), sharing of new and innovative digital methods, the challenges of resource discovery and publication of new digital artifacts, the importance of provenance, and risks of increasing automation. He also highlighted those differences that digital resources in arts and humanities possess in relation to other disciplines in the age of the ‘social machine’ such as specific content types and their relationship to physical artifacts, curated collections and an ‘infinite archive’ of heterogeneous content, publication as both subject and record of research, and the emphasis on multiple interpretations and critical thinking.

Keynote #2 was delivered by Leigh Garrett (Director of the Virtual Arts Data Service) who opined that little is really known about the ‘state’ of research data in the visual arts. It is both tangible yet intangible (what is data in the visual arts?)! Both physical and digital, heterogeneous and infinite, complex and complicated. He made mention of the Jisc-funded KAPTUR project which aimed to create and pilot a sectoral model of best practice in the management of research data in the visual arts.

KAPTUR schematic drawing

As part of the exercise KAPTUR evaluated 12 technical systems most suited for managing research data in the visual arts including CKAN, Figshare, EPrints, DataFlow. Criteria such as user friendliness, visual engagement, flexibility, hosted solution, licensing, versioning, searching were considered. Figshare was seen as user friendly, visually engaging, intuitive, and flexible; however, the use of CC Zero licences was seen as inappropriate for visual arts research data due to commercial usage clauses. Whilst CKAN appeared most suited, no single solution completely fulfilled all requirements of researchers.

Fig 2. Lucie Rie, Sheet of sketches from Council of Industrial Design correspondence for Festival of Britain, 1951. © Mrs. Yvonne Mayer/Crafts Study Centre.
Available from VADS

Simon Willmoth then gave an institutional perspective from the University of the Arts London. It was interesting that from his experience abstract terms such as research data do not engage researchers, and indeed the term is not in common usage by art and design researchers. His definition in the context of art and design was that research data can be ‘considered anything created, captured or collected as an output of funded research work in its original state’. He also observed that as soon as researchers understand RDM concepts they ask for all of their material (past and present) to be digitised! Echoing earlier presentations regarding the ‘heterogeneous and infinite’ nature of research data in the arts and humanities, Simon indicated that artists and designers normally have their own websites. Some of the content can be regarded as research data, e.g. drawings, storyboards, images, whilst some of it is a particular view dependent upon what it is used for at that instant. He then described the institutional challenges of resourcing (staff, storage, time), researcher engagement, curation (incl. legal, ethical and commercial constraints), infrastructure, and enhanced access. Simon finished with some very interesting quotes from researchers regarding how they perceive their work in relation to RDM e.g.

The work that is made is evidence of the journey to the completed artwork …… it’s kind of a continuous archive of imagery

I try not to throw things out but I often cut things up to use as collage material …. so it’s continual construction and deconstruction. Actually ordering is part of my own creative process, so the whole idea of archiving I think is really interesting

I used to think of a website as something where you display things and now increasingly I see it as a way of it recording the process, so I am using more like a Blog structure. But I am happy to post notes, photographs, drawings, observations and let other people see the process

My sketch books tend to be a mish-mash between logging of rushes notes, detailing each shot, things that I read, things that I hear, books that I’m reading, so it will be a jumble of texts but they’ve all gone in to the making of a piece of work.


Image of Alex from A Clockwork Orange at Stanley Kubrick Archive (UAL)

Simon showcased the Stanley Kubrick Archive based at UAL which contains draft and completed scripts, research materials such as books, magazines and location photographs. It also holds set plans and production documents such as call sheets, shooting schedules, continuity reports and continuity Polaroids. Props, costumes, poster designs, sound tapes and records also feature, alongside publicity press cuttings. He argues that the approach of digital copy and physical artifact accompanied by a web catalogue may be the way forward for RDM in the field of art and design.

Julianne Nyhan (UCL) kicked off day two providing a researcher’s view on arts and humanities data management/sharing with particular emphasis on infrastructure needs and wants. She observed that arts and humanities data are artifacts of human expression, interaction and imagination which tend to be collected rather than generated, and rarely by those who create the object(s), which can be:

  • complex complete with variants, annotations, editorial comment
  • multi-lingual
  • long lasting
  • studied for many purposes

Julianne also reiterated the need to bridge management and sharing on both physical and digital objects as well as more documentation of interpretative processes and interventions (for which she employs a range of note management tools). She went on to say that much of the work done in her own discipline goes beyond disciplinary/academic/institutional boundaries and that the need to retain and appreciate the bespoke should be balanced against the need for standardisation.

In terms of strategic developments Julianne saw much mileage in facilitating more research across linguistic borders (through legal instruments at a national/international level) with resultant access to large multilingual datasets from different cultures to inform comparative and transnational research.

Next up Paul Whitty and Felicity Ford (Oxford Brookes University) provided an overview of RDM practices at the Sonic Art Research Unit (SARU). The use of the internet to advertise work is common amongst SARU researchers, musicians, freelance artists. It was however recognised that SARU lacked any unified web presence such as Ubuweb.

Paul emphasised the need for the internet to be seen as a social space for SARU researchers. At the moment research is disseminated across multiple private websites without consistent links back to the university. Research objects and documentation are split over multiple platforms (e.g. Soundcloud, Vimeo, YouTube). This makes resource discovery difficult except through individual artists’ web spaces. As of March 2014 research disseminated across multiple private websites will be linked to/from the university with data stored on RADAR, the multi-purpose online “resource bank” for Oxford Brookes, thus enabling traceability and impact measurement in terms of the use of digital research assets. As a concluding remark Paul did question whether university IT departments were qualified to provide bespoke website design for arts practitioners and researchers.

James Wilson, Janet McKnight and Sally Rumsey then gave an overview of the University of Oxford approach to RDM in the humanities which utilises different business models (extensible or reducible) for different components e.g. DataFinder, Databank, DataFlow, DHARMa. Findings from earlier projects at Oxford indicated that humanities research data tends not to depreciate over time unlike that for harder sciences, is difficult to define, tends to be compiled from existing sources and not created from scratch, and is often not in an optimal format for analysis. Other findings indicated that humanities researchers are least likely to conduct their research as part of a team, least likely to be externally funded, and least likely to have deposited data in a repository. Conclusions reached were that humanities researchers were amongst the hardest to reach and that training and support is required to encourage cultural change. Janet McKnight from the Digital Humanities Archives for Research Materials (DHARMa) Project then spoke about enabling digital humanities research through effective data preservation, warning that before you impose a workflow in terms of developing systems and processes it would be wise to ‘walk a mile in their [the researcher’s] shoes!’

This was a well-attended and enlightening event, ably organised and chaired by Martin Donnelly (DCC). It offered insight and a wide range of perspectives all of which enhance our understanding of service, of practice, of advocacy, of support in relation to research data management in the arts and humanities.

Slides from all presenters are available from the DCC website.

Stuart Macdonald
EDINA & Data Library