About Robin Rice

Data Librarian and Head, Research Data Support Library & University Collections

How open should your data be?

The RECODE project is looking at open data policy for EU-funded research. I attended a workshop in Sheffield yesterday for a diverse stakeholder group of researchers, funders and data providers. Along with a nice lunch, they delivered their first draft report, in which they synthesised current literature on open research data and presented five case studies of research practice in different disciplines. The format was very interactive with several break-out groups and discussions.

The usual barriers to data sharing were trotted out in different forms. (Forgive my ho-hum tone if this is a newish topic for you – our DISC-UK DataShare project summarised these in its 2007 ‘State-of-the-Art-Review’ and the reasons haven’t really changed since.) The RECODE team ably boiled these down to technical, cultural and economic issues.

The morning’s activity included a small-group discussion about disciplinary differences in motivations for data sharing. One gadfly (not me) questioned the premise of the whole topic. While differences in practice around the treatment of data are undeniable, are the motivations for sharing or not sharing data really different amongst groups of researchers?

This seemed a fair point. Any given obstacle – be it commercial viability, fear of being scooped, errors being found or data being misinterpreted, desire to keep one’s ‘working capital’ for future publication, lack of time to properly prepare the data and documentation required for re-use coupled with lack of perceived academic rewards, lack of infrastructure, or disappearance of key personnel (including postgrads) – is a disincentive for data sharing wherever it crops up.

On the flip-side, motivations to share – making data easily available to one’s colleagues and students, adding to the scholarly record, backing up one’s reported results, desire for others to add value to a treasured dataset, increasing one’s impact and potential citations, passing off the custodianship of a completed dataset to a trusted archive, or mere compliance with a funder’s or publisher’s policy – are reasons that transcend disciplinary boundaries.

“Reciprocal altruism” was a new one to me. I’m not sure I believe it exists. I’ve seen more than one study showing that researchers (also teachers, where open educational resources are concerned) crave open access to other people’s ‘stuff’ whether or not they feel obliged to share their own (and more don’t than do).

An afternoon discussion focused on how open data needed to be, to be considered open. This was an amusing diversion from the topic we were given by the organisers. The UK Data Archive funded by ESRC, while a bulwark in the patchy architecture of data preservation and dissemination, does not make any of its collections available without a registration procedure that not only asks you who you are, but what you intend to do with the data. If the data are non-sensitive in nature, how necessary is this? Does the fact that the data owner would like to collect this information warrant collecting it?

A recent consensus on a new jiscmail list, data-publication, was that this sort of ‘red tape’ routinely placed in the way of data access was an affront to academic freedom. Would you agree? Would your answer depend on whether you were the user or the owner?

Edinburgh DataShare has so far resisted the temptation to require user registration for any data deposited with us, because the service was established to be an open data repository for the use of University depositors and for re-use by other researchers as well as the public (which, in most cases, paid for the research). We offer our depositors normal website download statistics, and provide a suggested citation for each dataset to encourage proper attribution. We encourage use of an open data licence which requires attribution of the data creator. Depositors who do not wish to use an open licence are free to provide their own rights statement.

The ODC-attribution licence that we offer by default is compatible with the Budapest Open Access Initiative (BOAI), but is one step less open than “CC0” (pronounced CC-zero) where rights to the data are waived in the interest of complete freedom for data re-users. Some argue that data – as opposed to publications – should be made completely open in this way to allow pooling of numerous datasets for analysis and machine-processing.

For example, Professor Carol Goble has just written in her blog that “BioMed Central’s adoption of the Creative Commons CC0 waiver opens up the way that data published in their journals can be used, so that it can be freely mined, analysed, and reused.”

While I agree BioMed Central’s decision is good news and that CC0 licences may be the state of the art for open data, as a repository manager I have yet to meet an academic who does not wish to be attributed for data collected by the ‘sweat of the brow’, to use a phrase from copyright case law. It is slightly easier for me to persuade researchers to share their data openly with the reassurance that an open-attribution licence brings than to persuade them to waive their rights to be attributed.

The University Research Data Management Policy asserts, “Research data of future historical interest, and all research data that represent records of the University, including data that substantiate research findings, will be offered and assessed for deposit and retention in an appropriate national or international data service or domain repository, or a University repository.”

In practice, it has been acknowledged that this would be difficult to enforce for ‘legacy’ research data, but from now on researchers embarking on a new research project are expected to create a data management plan in which the short- and long-term management of the data are considered before they are collected: “All new research proposals… must include research data management plans or protocols that explicitly address data capture, management, integrity, confidentiality, retention, sharing and publication.”

How open will you make your next dataset?

New RDM post

Research Data Management Service Coordinator

 https://www.vacancies.ed.ac.uk/pls/corehrrecruit/erq_jobspec_version_4.jobspec?p_id=018035

The University of Edinburgh seeks to appoint a Research Data Management Service Coordinator to spearhead the development of a compelling user-shaped Research Data Management service for the University of Edinburgh academic community. The University is at the forefront of the evolving research data management domain, and this post will help build a sustainable service to ensure that researchers are able to store and manage their data in a seamless and secure fashion, enabling them to easily manage, manipulate, share and preserve their data either at Edinburgh or in a trusted repository elsewhere.

The key aspects of the role are Programme and Project Management skills, Requirements Gathering and Analysis, Communications, and Advocacy. This post is suitable for applicants who wish to work at the forefront of Research Data Management practice, and who are willing to take responsibility for the coordination of the Research Data Management Service of a prestigious research institution. Candidates will likely possess a Research, Library, or IT background, but will excel in a mixed environment.

This post is fixed term for 3 years.

For further information, please contact Stuart Lewis, Head of Research and Learning Services, Deputy Director of Library and University Collections, 0131 651 5205 (stuart.lewis@ed.ac.uk).

Salary: £30,242 – £36,298 per annum

Closing date: 6th September at 5pm.

We anticipate that interviews will be held in the week commencing 16th September 2013.

Further details: https://www.vacancies.ed.ac.uk/pls/corehrrecruit/erq_jobspec_version_4.jobspec?p_id=018035

[The University reserves the right to vary the candidate information or make no appointment at all. Neither in part, nor in whole does this information form part of any contract between the University and any individual.]

Green light for research data storage and management plans

As published in IS BITS Magazine, Issue 7, Summer 2013 –

We have received the go-ahead to proceed with our plans to establish a service for the secure storage, management, sharing and preservation of research data in the University. This will build on the enabling work that has been done over the past few years, which has included: introducing an institutional policy, agreeing an 18-month roadmap, running training and awareness raising sessions, and establishing a governance structure. This has involved the formation of an RDSM Steering Group, with university-wide representation, which is chaired by Professor Peter Clarke (Physics), as well as an RDSM Implementation Group, which is chaired by Dr John Scally (Library and University Collections).

Due to the delay in receiving the go-ahead, the initial tasks over the coming weeks will be to revise the original timescales for service conception and delivery, commence procurement for the high-capacity storage and run comparisons with progress made in other institutions worldwide.

A fuller report will be provided for the next edition of BITS.

– John Scally, Director, Library & University Collections, and Chair, IS Research Data Management Implementation Group

Making research ‘really reproducible’

Listening to Victoria Stodden, Assistant Professor of Statistics at Columbia University, give the keynote speech at the recent Open Repositories conference in lovely Prince Edward Island, Canada, I realised we have some way to go on the path towards her idealistic vision of how to “perfect the scholarly record.”

Charlottetown, Prince Edward Island

As an institutional data repository manager (for Edinburgh DataShare) I often listen and talk to users about the reasons for sharing and not sharing research data. One reason for sharing, well-known to users of the UK Data Archive (now known as the UK Data Service), is that a dataset is very rich and can be used for multiple purposes beyond those for which it was created, for example, the British Social Attitudes Survey.

Another reason for sharing data, increasingly being driven by funder and publisher policies, is to allow replication of published results, or in the case of negative results which are never published, to avoid duplication of effort and wasting public monies.

It is the second reason on which Stodden focused, and not just for research data but also for the code that is run on the data to produce the scientific results. It is for this reason she and colleagues have set up the web service, Run My Code. These single-purpose datasets do not normally get added to collections within data archives and data centres, as their re-use value is very limited. Stodden’s message to the audience of institutional repository managers and developers was that the duty of preserving these artefacts of the scientific record should fall to us.

Why should underlying code and data be published and preserved along with articles as part of the scholarly record? Stodden argues, because computation is becoming central to scientific research. We’ve all heard arguments behind the “data deluge”. But Stodden persuasively focuses on the evolution of the scientific record itself, arguing that Reproducible Research is not new. It has its roots in Skepticism – developed by Robert Boyle and the Royal Society of the 1660s. Fundamentally, it’s about “the ubiquity of error: The central motivation of the scientific method is to root out error.”

In her keynote she developed this theme by expanding on the three branches of science.

  • Branch 1: Deductive. This was about maths and formal logic, and the proof as the main product of scientific endeavor.
  • Branch 2: Empirical. Statistical analysis of controlled experiments – hypothesis testing, structured communication of methods and protocols. Peer reviewed articles became the norm.
  • Branch 3: Computational. This is at an immature stage, in part because we have not developed the means to rigorously test assertions from this branch.

Stodden is scathing in her criticism of the way computational science is currently practiced, consisting of “breezy demos” at conferences that can’t be challenged or “poked at.” She argues passionately for the need to facilitate reproducibility – the ability to regenerate published results.

What is needed to achieve openness in science? Stodden argued for the need for deposit and curation of versioned data and code, with a link to the published article, and permanence of the link. This is indeed within the territory of the repository community.

Moreover, to have sharable products at the end of a research project, one needs to plan to share from the outset. It’s very difficult to reproduce the steps to create the results as an afterthought.

I couldn’t agree more with this last assertion. Since we set up Edinburgh DataShare we have spoken to a number of researchers about ‘legacy’ datasets that they would like to make publicly available but cannot, whether because of the nature of the consent forms, the format of the material, or the lack of adequate documentation. The easiest way is to plan to share. Our Research Data Management pages have information on how to do this, including use of the Digital Curation Centre’s tool, DMPOnline.

– Robin Rice, Data Librarian