Data and ethics

As an academic support person, I was surprised to find myself invited onto a roundtable about ‘The Ethics of Data-Intensive Research’. Although as a data librarian I’m certainly qualified to talk about data, I was less sure of myself on the ethics front – after all, I’m not the one who has to get my research past an Ethics Review Board or a research funder.

The event was held last Friday at the University of Edinburgh as part of the project Archives Now: Scotland’s National Collections and the Digital Humanities, a knowledge exchange project funded by the Royal Society of Edinburgh. This event attracted attendees across Scotland and had as its focus “Working With Data“.

I figured I couldn’t go wrong with a joke about fellow ‘data people’ with an image from flickr that we use in our online training course, MANTRA.


‘Binary’ by Xerones on Flickr (CC-BY-NC)

Appropriately, about half the people in the room chuckled.

So after introducing myself and my relevant hats, I revisited the quotations I had supplied on request for the organiser, Lisa Otty, who had put together a discussion paper for the roundtable.

“Publishing articles without making the data available is scientific malpractice.”

This quote is attributed to Geoffrey Boulton, Chair of the Royal Society of Edinburgh task force which published Science as an Open Enterprise in 2012. I have heard him say it, if only to say it isn’t his quote. The report itself makes a couple of references to things that have been said that are similar, but are just not as pithy for a quote. But the point is: how relevant is this assertion for scholarship that is outside of the sciences, such as the Humanities? Is data sharing an ethical necessity when the result of research is an expressive work that does not require reproducibility to be valid?

I gave Research Data MANTRA’s definition of research data, in order to reflect on how well it applies to the Humanities:

Research data are collected, observed, or created, for the purposes of analysis to produce and validate original research results.

When we invented this definition, it seemed quite apt for separating ‘stuff’ that is generated in the course of research from stuff that is the object of research; an operational definition, if you will. For example, a set of email messages may just be a set of correspondences; or it may be the basis of a research project if studied. It all depends on the context.

But recently we have become uneasy with this definition when engaging with certain communities, such as the Edinburgh College of Art. They have a lot of digital ‘stuff’ – inputs and outputs of research, but they don’t like to call it data, which has a clinical feel to it, and doesn’t seem to recognise creative endeavour. Is the same true for the Humanities, I wondered? Alas, the audience declined to pursue it in the Q&A, so I still wonder.

“The coolest thing to do with your data will be thought of by someone else.”                          – Rufus Pollock, Cambridge University and Open Knowledge Foundation, 2008

My second quote attempted to illustrate the unease felt by academics about the pressure to share their data, and why the altruistic argument about open data doesn’t tend to win people over, in my experience. I asked people to consider how it made them feel, but perhaps I should have tried it with a show of hands to find out their answers.

Information Wants to Be Free

Quote by John Perry Barlow, image by Robin Rice

I swiftly moved on to talk about open data licensing, the choices we’ve made for Edinburgh DataShare, and whether offering different ‘flavours’ of open licence are important when many people still don’t understand what open licences are about. Again I used an image from MANTRA (above) to point out that the main consideration for depositors should be whether or not to make their data openly available on the internet – regardless of licence.

By putting their outputs ‘in the wild’ academics are necessarily giving up control over how they are used; some users will be ‘unethical’; they will not understand or comply with the terms of use. And we as repository administrators are not in a position to police mis-use for our depositors. Nevertheless, since academic users tend to understand and comply with scholarly norms about citing and giving attribution, those new to data sharing should not be unduly alarmed about the statement illustrated above. (And DataShare provides a ‘suggested citation’ for every data item that helps the user comply with the attribution requirements.)

Since no overview of data and ethics would be complete without consideration given to confidentiality obligations of researchers towards their human subjects, I included a very short video clip from MANTRA, of Professor John MacInnes speaking about caring for data that contain personally identifying information or personal attributes.

For me the most challenging aspect of the roundtable and indeed the day, was the contribution by Dr Anouk Lang about working with data from social media. As an ethical researcher one cannot assume that consent is unnecessary when working with data streams (such as twitter) that are open to public viewing. For one thing, people may not expect views of their posts outside of their own circles – they treat it as a personal communication medium. For another they may assume that what they say is ethereal and will soon be forgotten and unavailable. A show of hands indicated only some of the audience had heard of the Twitter Developers and API, or Storify, which can capture tweets and other objects in a more permanent web page, illustrating her point.

While this whole area may be more common for social researchers – witness the Economic and Social Research Council’s funding of a Big Data Network over several years which includes social media data – Anouk’s work on digital culture proves Humanities researchers cannot escape “the plethora of ethics, privacy and risk issues surrounding the use (and reuse) of social media data.” (Communication on ESRC Big Data Network Phase 3.)

Robin Rice
Data Librarian

Science as an open enterprise – Prof. Geoffrey Boulton

As part of Open Access Week, the Data Library and Scholarly Communications teams in IS hosted a lecture by emeritus Professor Geoffrey Boulton drawing upon his study for the Royal Society: Science as an Open Enterprise (Boulton, et al 2012). The session was introduced by Robin Rice who is the University of Edinburgh Data Librarian.  Robin pointed out that the University of Edinburgh was not just active, but was a leader in research data management having been the first UK institution to have a formal research data management policy.  Looking at who attended the event, perhaps unsurprisingly the majority were from the University of Edinburgh.  Encouragingly, there was roughly a 50:50 split between those actively involved in research and those in support roles.  I say encouragingly as it was later stated that often policies get high-level buy in from institutions but have little impact on those actually doing the research. Perhaps more on that later.

For those that don’t know Prof. Boulton, he is a geologist and glaciologist and has been actively involved in scientific research for over 40 years.  He is used to working with big things (mountains, ice sheets) over timescales measured in millions of years rather than seconds and notes that  while humanity is interesting it will probably be short lived!

Arguably the way we have done science over the last three hundred years has been effective. Science furthers knowledge.  Boulton’s introduction made it clear that he wanted to talk about the processes of science and how they are affected by the gathering, manipulation and analysis of huge amounts of data: the implications, the changes in processes, and why evenness matters in the process of science. This was going to involve a bit of a history lesson, so let’s go back to the start.

Open is not a new concept

Geoffrey Boulton talking about the origins of peer review

“Open is not a new concept”

Open has been a buzzword for a few years now.  Sir Tim Berners-Lee and Prof. Nigel Shadbolt have made great progress in opening up core datasets to the public.  But for science, is open a new concept? Boulton thinks not. Instead he reckons that openness is at the foundations of science but has somehow got a bit lost recently.  Journals originated as a vehicle to disseminate knowledge and trigger discussion of theories.  Boulton  gave a brief history of the origins of journals pointing out that Henry Oldenburg is credited with founding the peer review process with the Philosophical Transactions of the Royal Society.  The journal allowed scientists to share their thoughts and promote discussion.  Oldenburg’s insistence that the Transactions be published in the vernacular rather than Latin was significant as it made science more accessible.  Sound familiar?

Digital data – threat or opportunity? 

We are having the same discussions today, but they are based around technology and, perhaps in some cases, driven by money. The journal publishing model has changed considerably since Oldenburg and it was not the focus of the talk so let us concentrate on the data.  Data are now largely digital.  Journals themselves are also generally digital.  The sheer volume of data we now collect makes it difficult to include the data with a publication. So should data go into a repository?  Yes, and some journals encourage this but few mandate it.  Indeed, many of the funding councils state clearly that research output should be deposited in a repository but don’t seem to enforce this.

Replicability – the cornerstone of the scientific method

Image of Geoffrey Boulton during his talk

Geoffrey Boulton, mid-talk.

Having other independent scientists replicate and validate your findings adds credence to them. Why would you as a professional scientist not want others to confirm that you are correct?  It seems quite simple but it is not the norm.  Boulton pointed us to a recent paper in Nature (Nature v483 n7391) which attempted to replicate the results of a number of studies in cancer research. The team found that they could only replicate 6, around 11%, of the studies.  So the other 81% were fabricating their results?  No, there are a number of reasons why the team could not replicate all the studies.  The methodology may not have been adequately explained leading to slightly different techniques being used, the base data may have been unobtainable and so on but the effect is the same. Most of the previous work that the team looked at is uncorroborated science.  Are we to trust their findings?  Science is supposed to be self-correcting.  You find something, publish, others read it, replicate and corroborate or pose an alternative, old theories are discounted (Science 101 time: “Null Hypothosis“) and our collective knowledge is furthered.  Boulton suggests that, to a large degree, this is not happening. Science is not being corroborated. We have forgotten the process on which our profession is based. Quoting Jim Gray:

“when you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful. We are embarrassed by our data.”

Moving forward (or backwards) towards open science

What do we need to do to support, to do to advise, to ensure materials are available for our students, for our researchers to ensure they can be confident about sharing their data?  The University of Edinburgh does reasonably well but we still, like most institutions, have things to do.

Geoffrey looked at some of the benefits of open science and while I am sure we all already know what these are, it is useful to have some high profile examples that we can all aspire to following.

  1. Rapid response – some scientific research is reactive. This is especially true in research into epidemiology and infectious diseases.  An outbreak occurs, it is unfamiliar and we need to understand it as quickly as possible to limit its effects. During an e-coli outbreak in Hamburg local scientists were struggling to identify the source. They analysed the strain and released the genome under an open licence. Within a week they had a dozen reports from 4 continents. This helped to identify the source of the outbreak and ultimately saved lives.(Rohde et al 2011)
  2. Crowd-sourcing – mathematical research is unfathomable to many.  Mathematicians are looking for solutions to problems. Working in isolation or small research clusters is the norm, but is it effective?  Tim Gowers (University of Cambridge) decided to break with convention and post the “problems” he was working on to his blog.  The result; 32 days – 27 people – 800 substantive contributions. 800 substantive contributions!  I am sure that Tim also fostered some new research collaborations from his 27 respondents.
  3. Change the social dynamic of science – “We are scientists, you wouldn’t understand” is not exactly a helpful stance to adopt.  “We are scientists and we need your help,” now that’s much better!  The rise of the app has seen a new arm of science emerge, “citizen science”. The crowd, or sometimes the informed crowd, is a powerful thing. With a carefully designed app you can collect a lot of data from a lot of places over a short period. Projects such as ASHtag and LeafWatch are just two examples where the crowd has been usefully deployed to help collect data for scientists.  Actually, this has been going on for some time in different forms, do you remember the SETI@Home screensaver?  It’s still going, 3 million users worldwide processing data for scientists since 1999.
  4. Openness and transparency – no one wants another “Climategate“.  In fact Climategate need not have happened at all. Much of the data was already publicly available and the scientists had done nothing wrong. Their lack of openness was seen as an admission that they had something to hide and this was used to damaging effect by the climate sceptics.
  5. Fraud – open data is crucial as it shines the light on science and the scientific technique and helps prevent fraud.

What value if not intelligent?

However, Boulton’s closing comments made the point that openness has little value if it is not “intelligent” so this means it is:

  • accessible (can it be found?)
  • intelligible (can you make sense of it?)
  • assessable (can you rationally look at the data objectively?)
  • re-usable (has sufficient metadata to describe how is was created?)

I would agree with Boulton’s criteria but would personally modify the accessible entry. In my opinion data is not open if it is buried in a PDF document. OK, I may be able to find it, but getting the data into a usable format still takes considerable effort, and in some cases, skill.  The data should be ready to use.

Of course, not every dataset can be made open.  Many contain sensitive data that needs to be guarded as it could perhaps identify an individual.  There are also considerations to do with safety and security that may prevent data becoming open.  In such cases, perhaps the metadata could be open and identify the data custodian.

Questions and Discussion

One of the first questions from the floor focused on the fuzzy boundaries of openness and the questioner was worried that scientist could, and would, hide behind the “legitimate commercial interest” since all data had value and research was important within a university’s business model.  Boulton agreed but suggested that the publishers could do more and force authors to make their data open. Since we are, in part, judged by our publication record you would have to comply and publish your data.  Monetising the data would then have to be a separate thing. He alluded to the pharmaceutical industry, long perceived to be driven by money but which has recently moved to be more open.

The second question followed on from this asking if anything could be learned from the licences used for software such as the GNU and the Apache Licence.  Boulton stated that the government is currently looking at how to licence publicly-funded research.  What is being considered at the EU level may be slightly regressive and based on EU lobbying from commercial organisations. There is a lot going on in this area at the moment so keep your eyes and ears open.

The final point from the session sought clarification of The University of Edinburgh research data management policy.  Item nine states

“Research data of future historical interest, and all research data that represent records of the University, including data that substantiate research findings, will be offered and assessed for deposit and retention in an appropriate national or international data service or domain repository, or a University repository.”

But how do we know what is important, or what will be deemed significant in the future? Boulton agreed that this was almost impossible.  We cannot archive all data and inevitably some important “stuff” will be lost – but that has always been the case.

View of the audience for Geoffrey Boulton's talk as part of Open Access Week at UoE

The audience for Geoffrey Boulton’s talk as part of Open Access Week at UoE

My Final Thoughts on Geoffrey’s Talk

An interesting talk.  There was nothing earth-shattering or new in it, but a good review of the argument for openness in science from someone who actually has the attention of those who need to recognise the importance of the issue and take action on it.  But instead of just being a top down talk, there was certainly a bottom up message.  Why wait for a mandate from a research council or a university? There are advantages to be had from being open with your data and these benefits are potentially bigger for the early adopters.

I will leave you with an aside from Boulton on libraries…

“Libraries do the wrong thing, employ the wrong people.”

For good reasons we’ve been centralising libraries. But perhaps we have to reverse that. Publications are increasingly online but soon it will be the data that we seek and tomorrow’s librarians should be skilled data analysts who understand data and data manipulation.  Discuss.

Some links and further reading:

Addy Pope

Research and Geodata team, EDINA



How open should your data be?

The RECODE project is looking at open data policy for EU-funded research. I attended a workshop in Sheffield yesterday for a diverse stakeholder group of researchers, funders and data providers. Along with a nice lunch, they delivered their first draft report, in which they synthesised current literature on open research data and presented five case studies of research practice in different disciplines. The format was very interactive with several break-out groups and discussions.

The usual barriers to data sharing were trotted out in different forms. (Forgive my ho-hum tone if this is a newish topic for you – our DISC-UK DataShare project summarised these in its 2007 ‘State-of-the-Art-Review’ and the reasons haven’t really changed since.) The RECODE team ably boiled these down to technical, cultural and economic issues.

The morning’s activity included a small-group discussion about disciplinary differences in motivations for data sharing. One gadfly (not me) questioned the premise of the whole topic. While differences in practice around treatment of data is undeniable, are the motivations for sharing or not sharing data really different amongst groups of researchers?

This seemed a fair point. For any given obstacle – be it commercial viability, fear of being scooped, errors being found or data being misinterpreted, desire to keep one’s ‘working capital’ for future publication, lack of time to properly prepare the data and documentation required for re-use coupled with lack of perceived academic rewards, lack of infrastructure, or disappearance of key personnel (including postgrads) – these are all disincentives for data sharing wherever they crop up.

On the flip-side, motivations to share – making data easily available to one’s colleagues and students, adding to the scholarly record, backing up one’s reported results, desire for others to add value to a treasured dataset, increasing one’s impact and potential citations, passing off the custodianship of a completed dataset to a trusted archive, or mere compliance with a funder’s or publisher’s policy are reasons that transcend disciplinary boundaries.

“Reciprocal altruism” was a new one to me. I’m not sure I believe it exists. I’ve seen more than one study showing that researchers (also teachers, where open educational resources are concerned) crave open access to other people’s ‘stuff’ whether or not they feel obliged to share their own (and more don’t than do).

An afternoon discussion focused on how open data needed to be, to be considered open. This was an amusing diversion from the topic we were given by the organisers. The UK Data Archive funded by ESRC, while a bulwark in the patchy architecture of data preservation and dissemination, does not make any of its collections available without a registration procedure that not only asks you who you are, but what you intend to do with the data. If the data are non-sensitive in nature, how necessary is this? Does the fact that the data owner would like to collect this information warrant collecting it?

A recent consensus on a new jiscmail list, data-publication, was that this sort of ‘red tape’ routinely placed in the way of data access was an affront to academic freedom. Would you agree? Would your answer depend on whether you were the user or the owner?

Edinburgh DataShare has so far resisted the temptation to require user registration for any data deposited with us, because the service was established to be an open data repository for the use of University depositors and for re-use by other researchers as well as the public (which, in most cases paid for the research). We offer our depositors normal website download statistics, and provide a suggested citation to each dataset to encourage proper attribution. We encourage use of an open data licence which requires attribution of the data creator. For depositors who do not wish to use an open licence they are free to provide their own rights statement.

The ODC-attribution licence that we offer by default is compatible with the Budapest Open Access Initiative (BOAI), but is one step less open than “CC0″ (pronounced CC-zero) where rights to the data are waived in the interest of complete freedom for data re-users. Some argue that data – as opposed to publications – should be made completely open in this way to allow pooling of numerous datasets for analysis and machine-processing.

For example, Professor Carol Goble has just written in her blog that “BioMed Central’s adoption of the Creative Commons CC0 waiver opens up the way that data published in their journals can be used, so that it can be freely mined, analysed, and reused.”

While I agree BioMed Central’s decision is good news and that CC0 licences may be the state of the art for open data, as a repository manager I have yet to meet an academic who does not wish to be attributed for data collected by the ‘sweat of the brow’ to use a phrase from copyright case law. It is slightly easier for me to persuade researchers to share their data openly with the reassurance that an open-attribution licence brings than to persuade them to waive their rights to be attributed.

The University Research Data Management Policy asserts, “Research data of future historical interest, and all research data that represent records of the University, including data that substantiate research findings, will be offered and assessed for deposit and retention in an appropriate national or international data service or domain repository, or a University repository.”

In practice, it has been acknowledged that this would be difficult to enforce for ‘legacy’ research data, but from now on researchers embarking on a new research project are expected to create a data management plan in which the short and long term management of the data are considered before they are collected: “All new research proposals… must include research data management plans or protocols that explicitly address data capture, management, integrity, confidentiality, retention, sharing and publication.

How open will you make your next dataset? open data button