Open up! On the scientific and public benefits of data sharing

Research published a year ago in the journal Current Biology found that 80 percent of original scientific data obtained through publicly funded research is lost within two decades of publication. The study, based on 516 randomly selected journal articles that purported to make their associated data available, found that the odds of locating the original data for these papers fell by 17 percent for every year after publication, and concluded that “Policies mandating data archiving at publication are clearly needed” (http://dx.doi.org/10.1016/j.cub.2013.11.014).
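To get a feel for how quickly a 17 percent annual decline compounds, here is a rough back-of-envelope sketch. It is my own illustration, not a calculation from the paper, and it simply applies the 17 percent figure multiplicatively to the odds each year:

```python
# Back-of-envelope illustration (mine, not the paper's): if the odds of
# finding a paper's data fall by 17% for every year since publication,
# the fraction of the original odds remaining after t years is 0.83 ** t.
for years in (5, 10, 20):
    print(f"{years:2d} years after publication: "
          f"{0.83 ** years:.1%} of the original odds remain")
# Roughly 39% after 5 years, 16% after 10 and 2% after 20: a modest annual
# decline compounds into near-total loss over two decades.
```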

In this post I’ll touch on three different initiatives aimed at strengthening policies requiring publicly funded data, whether produced by government or academics, to be made open. First, a report published last month by the Research Data Alliance Europe, “The Data Harvest: How sharing research data can yield knowledge, jobs and growth.” Second, a report by an EU-funded research project called RECODE on “Policy Recommendations for Open Access to Research Data”, released last week at their conference in Athens. Third, the upcoming publication of Scotland’s Open Data Strategy, pre-released to attendees of an Open Data and PSI Directive Awareness Raising Workshop held on Monday in Edinburgh.

Experienced so close together in time (I read the Data Harvest report on the plane back from Athens, in between the two meetings), these discrete recommendations, policies and reports have me just about believing not only that 2015 will bring a new world of interactions in which much more research becomes a collaborative and integrative endeavour, playing out the idea of ‘Science 2.0’ or ‘Open Science’, but also that the long-promised ‘knowledge economy’ is actually coalescing, based on new products and services derived from the wealth of (open) data being created and made available.

‘The initial investment is scientific, but the ultimate return is economic and social’

John Wood, currently the Co-Chair of the global Research Data Alliance (RDA) as well as Chair of RDA-Europe, set out the case, in his introduction to the Data Harvest report and from the podium at the RECODE conference, that the new European commissioners and parliamentarians must, first of all, not get in the way and, second, almost literally ‘plan the harvest’ for the economic benefits that the significant public investments in data, research and technical infrastructure are bringing.

The report’s irrepressible argument goes, “Just as the World Wide Web, with all its associated technologies and communications standards, evolved from a scientific network to an economic powerhouse, so we believe the storing, sharing and re-use of scientific data on a massive scale will stimulate great new sources of wealth.” The analogy is certainly helped by the fact that the WWW was invented at a research institute (CERN), by a researcher, for researchers. The web, connecting 2 billion people, contributed more to global GDP than energy or agriculture, according to a 2011 McKinsey report. The report doesn’t shy away from reminding us, and the politicians it targets, that it is the USA rather than Europe that has grabbed the lion’s share of the economic benefit from the invention of the Web (via Internet giants Google, Amazon, eBay, etc.), and that we would be foolish to let this happen again.

This may be a ruse to convince politicians to continue to pour investment into research and data infrastructure, but if so it is a compelling one. Still, the purpose of the RDA, with its 3,000 members from 96 countries, is to further global scientific data sharing, not economies. The report documents what it considers to be a step-change in the nature of scientific endeavour, in discipline after discipline. The report, the successor to the 2010 report “Riding the Wave: How Europe can gain from the rising tide of scientific data” (also produced under Wood’s chairmanship), celebrates rather than fears the well-documented data deluge, stating:

“But when data volumes rise so high, something strange and marvellous happens: the nature of science changes.”

The report gives examples of successful European collaborative data projects, mainly but not exclusively in the sciences, such as the following:

  • Lifewatch – monitors Europe’s wetlands, providing a single point to collect information on migratory birds. Datasets created help to assess the impact of climate change and agricultural practices on biodiversity
  • Pharmacog – partnership of academic institutions and pharmaceutical companies to find promising compounds for Alzheimer’s research to avoid expensive late-stage failures of drugs in development.
  • Human Brain Project – multidisciplinary initiative to collect and store data in a standardised and systematic way to facilitate modelling.
  • Clarin – integrating archival information from across Europe to make it discoverable and usable through a single portal regardless of language.

The benefits of open data, the report claims, extend to three main groups:

  • to citizens, who will benefit indirectly from new products and services and also be empowered to participate in civic society and scientific endeavour (e.g. citizen science);
  • to entrepreneurs, who can innovate based on new information that no one organisation has the money or expertise to exploit alone;
  • to researchers, for whom the free exchange of data will open up new research and career opportunities, allow crossing of boundaries of disciplines, institutions, countries, and languages, and whose status in society will be enhanced.

‘Open by Default’

If the Data Harvest report lays out the argument for funding open data and open science, the RECODE policy recommendations focus on what stakeholders can do to make it a reality. RECODE is fundamentally a research project, and has been producing outputs such as disciplinary case studies in physics, health, bioengineering, environment and archaeology. The researchers have examined what they consider to be four grand challenges for data sharing:

  • Stakeholder values and ecosystems: the road towards open access is not perceived in the same way by those funding, creating, disseminating, curating and using data.
  • Legal and ethical concerns: unintended secondary uses, misappropriation and commercialization of research data, unequal distribution of scientific results and impacts on academic freedom.
  • Infrastructure and technology challenges: heterogeneity and interoperability; accessibility and discoverability; preservation and curation; quality and assessability; security.
  • Institutional challenges: financial support, evaluating and maintaining the quality, value and trustworthiness of research data, training and awareness-raising on opportunities and limitations of open data.

RECODE gives overarching recommendations as well as stakeholder-specific ones, including a ‘practical guide for developing policies’ with checklists for the four major stakeholder groups: funders, data managers, research institutions and publishers.

‘Open Changes Everything’

The Scottish Government event was a pre-release of the open data strategy, which, following public consultation, is in its final draft and awaiting ministerial approval. The speakers made it clear that Scotland wants to be a leader in this area and to drive the culture change needed to achieve it. The policy is driven in part by the G8 countries’ “Open Data Charter”, which commits governments to act by the end of 2015 on a set of five basic principles (for instance, that public data should be open to all “by default” rather than only in special cases), and is supported by UK initiatives such as the government-funded Open Data Institute and the grassroots Open Knowledge Foundation.


Improved governance (and public services) and ‘unleashing’ innovation in the economy are the two main themes of both the G8 charter and the Scottish strategy. It was not lost on the bureaucrats devising the strategy that public sector organisations have as much to gain from better availability of government data as the public and businesses do.

The thorny issue of personal data is not overlooked in the strategy, and a number of important strides have been taken in Scotland recently by government and University of Edinburgh academics, both in understanding the public’s attitudes and in devising governance arrangements for important uses of personal data, such as linking patient records with other government records for research.

According to Jane Morgan from the Digital Public Services Division of the Scottish Government, the goal is for citizens to feel ownership of their own data, while opening up “trustworthy uses of data for public benefit.”

Tabitha Stringer, whose title might be properly translated as ‘policy wonk’ for open data, reiterated the three main reasons for the government to embrace open data:

  • Transparency, accountability, supporting civic engagement
  • Designing and delivering public services (and increasingly digital services)
  • Basis for innovation, supporting the economy via growth of products & services

‘Digital first’

The remainder of the day focused on the new EU Public Sector Information (PSI) Directive and how it is being ‘transposed’ into UK legislation, a process to be completed this year. In short, the Freedom of Information and other legislation is being built upon to require government agencies to produce not just publication schemes but also asset lists with particular titles. The effect of this, and the reason for the awareness-raising workshop, is that every government agency is to become a data publisher, and must learn how to manage its data not just for its own use but for public ‘re-users’. Also, for the first time, academic libraries and other ‘cultural organisations’ are included in the rules where there is a ‘public task’ in their mission.

‘Digital first’ refers to the charging rules, under which only marginal costs (not full cost recovery) may be passed on to re-users; where information is digital, the marginal cost is expected to be zero, so the vast majority of data will be made freely available.

Robin Rice
EDINA and Data Library

 

 

Using an electronic lab notebook to deposit data into Edinburgh DataShare

This is a heads-up about a ‘coming attraction’. For the past several months a group at Research Space has been working with the DataShare team, including Robin Rice and George Hamilton, to make it possible to deposit research data from our new RSpace electronic lab notebook into DataShare.

I gave the first public preview of this integration last month in a presentation called ‘Electronic lab notebooks and data repositories: complementary responses to the scientific data problem’, to a session on Research Data and Electronic Lab Notebooks at the American Chemical Society conference in Dallas.

When the RSpace ELN becomes available to researchers at Edinburgh later this spring, users will be able to make deposits to DataShare directly from RSpace, using a simple interface we have built into the notebook. The whole process takes only a few clicks, and starts with selecting the records to be deposited into DataShare and clicking on the DataShare button, as illustrated in the screenshot below.

[Screenshot: the RSpace workspace with the DataShare button highlighted]

You are then asked to enter some information about the deposit:

[Screenshot: the DataShare deposit dialogue with details filled in]

After confirming a few details about the deposit, the data is deposited directly into DataShare, and information about the deposit appears there.
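The integration is described above purely in point-and-click terms; the mechanism underneath is not covered here. Purely as an illustration of what a programmatic deposit into a DSpace-based repository such as DataShare can look like, here is a minimal sketch using the SWORD v2 protocol, which DSpace supports. The endpoint URL, file name and credentials are hypothetical, and the actual RSpace integration may work quite differently.

```python
# Illustrative sketch only: a minimal SWORD v2 deposit of a zipped dataset
# into a DSpace-based repository such as DataShare. The endpoint, file name
# and credentials are hypothetical; the real RSpace integration may differ.
import requests

COLLECTION_URL = "https://datashare.example.ac.uk/swordv2/collection/12345"  # hypothetical

with open("experiment_records.zip", "rb") as payload:
    response = requests.post(
        COLLECTION_URL,
        data=payload,
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "attachment; filename=experiment_records.zip",
            "Packaging": "http://purl.org/net/sword/package/SimpleZip",
            "In-Progress": "false",  # finalise the deposit rather than leave it in progress
        },
        auth=("depositor", "secret"),  # hypothetical credentials
    )

response.raise_for_status()
# A successful deposit returns a receipt; its Location header points at the new item.
print("Deposit receipt at:", response.headers.get("Location"))
```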

[Screenshot: the deposited item viewed in DataShare]

We will provide details about how to sign up for an RSpace account in a future post later in the spring. In the meantime, I’d like to thank Robin and George for working with us at RSpace on this exciting project. As far as we know this is the first time an electronic lab notebook has ever been integrated with an institutional data repository, so this is a pioneering and very exciting experiment! We hope to use it as a model for similar integrations with other institutional and domain-specific repositories.

Rory MacNeil
Chief Executive, Research Space

Science as an open enterprise – Prof. Geoffrey Boulton

As part of Open Access Week, the Data Library and Scholarly Communications teams in IS hosted a lecture by Emeritus Professor Geoffrey Boulton, drawing upon his study for the Royal Society, Science as an Open Enterprise (Boulton et al. 2012). The session was introduced by Robin Rice, the University of Edinburgh Data Librarian. Robin pointed out that the University of Edinburgh was not just active but a leader in research data management, having been the first UK institution to have a formal research data management policy. Looking at who attended the event, perhaps unsurprisingly the majority were from the University of Edinburgh. Encouragingly, there was roughly a 50:50 split between those actively involved in research and those in support roles. I say encouragingly because it was later stated that policies often get high-level buy-in from institutions but have little impact on those actually doing the research. Perhaps more on that later.

For those who don’t know Prof. Boulton, he is a geologist and glaciologist and has been actively involved in scientific research for over 40 years. He is used to working with big things (mountains, ice sheets) over timescales measured in millions of years rather than seconds, and notes that, while humanity is interesting, it will probably be short-lived!

Arguably, the way we have done science over the last three hundred years has been effective: science furthers knowledge. Boulton’s introduction made it clear that he wanted to talk about the processes of science and how they are affected by the gathering, manipulation and analysis of huge amounts of data: the implications, the changes in processes, and why openness matters in the process of science. This was going to involve a bit of a history lesson, so let’s go back to the start.

Open is not a new concept

Geoffrey Boulton talking about the origins of peer review

“Open is not a new concept”

Open has been a buzzword for a few years now. Sir Tim Berners-Lee and Prof. Nigel Shadbolt have made great progress in opening up core datasets to the public. But for science, is open a new concept? Boulton thinks not. Instead, he reckons that openness is at the foundations of science but has somehow got a bit lost recently. Journals originated as a vehicle to disseminate knowledge and trigger discussion of theories. Boulton gave a brief history of the origins of journals, pointing out that Henry Oldenburg is credited with founding the peer review process with the Philosophical Transactions of the Royal Society. The journal allowed scientists to share their thoughts and promote discussion. Oldenburg’s insistence that the Transactions be published in the vernacular rather than Latin was significant, as it made science more accessible. Sound familiar?

Digital data – threat or opportunity? 

We are having the same discussions today, but they are based around technology and, perhaps in some cases, driven by money. The journal publishing model has changed considerably since Oldenburg’s day, and it was not the focus of the talk, so let us concentrate on the data. Data are now largely digital. Journals themselves are also generally digital. The sheer volume of data we now collect makes it difficult to include the data with a publication. So should data go into a repository? Yes, and some journals encourage this, but few mandate it. Indeed, many of the funding councils state clearly that research outputs should be deposited in a repository but don’t seem to enforce this.

Replicability – the cornerstone of the scientific method

Geoffrey Boulton, mid-talk.

Having other independent scientists replicate and validate your findings adds credence to them. Why would you, as a professional scientist, not want others to confirm that you are correct? It seems quite simple, but it is not the norm. Boulton pointed us to a recent paper in Nature (v483, n7391) which attempted to replicate the results of a number of studies in cancer research. The team found that they could only replicate 6 of the studies, around 11%. So were the other 89% fabricating their results? No, there are a number of reasons why the team could not replicate all the studies. The methodology may not have been adequately explained, leading to slightly different techniques being used, the base data may have been unobtainable, and so on, but the effect is the same: most of the previous work that the team looked at is uncorroborated science. Are we to trust their findings? Science is supposed to be self-correcting. You find something, publish, others read it, replicate and corroborate or pose an alternative, old theories are discounted (Science 101: the null hypothesis) and our collective knowledge is furthered. Boulton suggests that, to a large degree, this is not happening. Science is not being corroborated. We have forgotten the process on which our profession is based. Quoting Jim Gray:

“when you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful. We are embarrassed by our data.”

Moving forward (or backwards) towards open science

What do we need to do to support and advise our students and researchers, and to ensure materials are available to them, so that they can be confident about sharing their data? The University of Edinburgh does reasonably well, but like most institutions we still have things to do.

Geoffrey looked at some of the benefits of open science and, while I am sure we all already know what these are, it is useful to have some high-profile examples that we can all aspire to follow.

  1. Rapid response – some scientific research is reactive. This is especially true of research into epidemiology and infectious diseases. An outbreak occurs, it is unfamiliar, and we need to understand it as quickly as possible to limit its effects. During an E. coli outbreak in Hamburg, local scientists were struggling to identify the source. They analysed the strain and released the genome under an open licence. Within a week they had a dozen reports from four continents. This helped to identify the source of the outbreak and ultimately saved lives (Rohde et al. 2011).
  2. Crowd-sourcing – mathematical research is unfathomable to many. Mathematicians are looking for solutions to problems. Working in isolation or in small research clusters is the norm, but is it effective? Tim Gowers (University of Cambridge) decided to break with convention and post the “problems” he was working on to his blog. The result: 32 days, 27 people, 800 substantive contributions. 800 substantive contributions! I am sure that Tim also fostered some new research collaborations among his 27 respondents.
  3. Change the social dynamic of science – “We are scientists, you wouldn’t understand” is not exactly a helpful stance to adopt. “We are scientists and we need your help,” now that’s much better! The rise of the app has seen a new arm of science emerge: “citizen science”. The crowd, or sometimes the informed crowd, is a powerful thing. With a carefully designed app you can collect a lot of data from a lot of places over a short period. ASHtag and LeafWatch are just two examples of projects where the crowd has been usefully deployed to help collect data for scientists. Actually, this has been going on for some time in different forms: do you remember the SETI@home screensaver? It’s still going, with 3 million users worldwide processing data for scientists since 1999.
  4. Openness and transparency – no one wants another “Climategate“.  In fact Climategate need not have happened at all. Much of the data was already publicly available and the scientists had done nothing wrong. Their lack of openness was seen as an admission that they had something to hide and this was used to damaging effect by the climate sceptics.
  5. Fraud – open data is crucial because it shines a light on science and the scientific process, and helps prevent fraud.

What value if not intelligent?

However, Boulton’s closing comments made the point that openness has little value if it is not “intelligent”, meaning that the data are:

  • accessible (can it be found?)
  • intelligible (can you make sense of it?)
  • assessable (can the data be looked at rationally and objectively?)
  • re-usable (is there sufficient metadata to describe how the data were created? a small illustrative example follows this list)
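As an aside, here is my own minimal illustration, not something from the talk, of the kind of descriptive metadata that makes a deposited dataset re-usable. The field names loosely follow Dublin Core and DataCite conventions, and the dataset itself is invented:

```python
# Illustrative only: minimal descriptive metadata accompanying a shared dataset.
# Field names loosely follow Dublin Core / DataCite; the dataset is invented.
dataset_record = {
    "title": "Glacier mass-balance observations, 2010-2012",
    "creator": "Example Research Group, University of Example",
    "description": "Monthly stake and snow-pit measurements from three glaciers.",
    "methodology": "Field protocol and instrument calibration described in README.txt",
    "format": "CSV (UTF-8), one file per glacier",
    "licence": "CC BY 4.0",
    "identifier": "https://doi.org/10.5072/example",  # 10.5072 is the DataCite test prefix
    "date_created": "2012-11-30",
}
```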

I would agree with Boulton’s criteria but would personally modify the ‘accessible’ entry. In my opinion, data is not open if it is buried in a PDF document. OK, I may be able to find it, but getting the data into a usable format still takes considerable effort and, in some cases, skill. The data should be ready to use.
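To make that concrete, here is a small sketch of the difference in effort. The file names and the pdfplumber library are my own illustrative choices, not anything discussed in the talk: a table shared as CSV is one call away from analysis, while the same table buried in a PDF needs an extra scraping step that may or may not recover it cleanly.

```python
import pandas as pd    # for the ready-to-use case
import pdfplumber      # one of several third-party PDF-scraping libraries

# Data shared as CSV: ready to use in a single call.
results = pd.read_csv("results.csv")

# The same table buried in a PDF: extraction needs an extra library and extra
# steps, and may silently mangle headers, units or numbers.
with pdfplumber.open("results.pdf") as pdf:
    rows = pdf.pages[0].extract_table()   # assumes the table sits on page 1
scraped = pd.DataFrame(rows[1:], columns=rows[0])
# ...and every value is still a string, so type conversion and checking remain.
```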

Of course, not every dataset can be made open.  Many contain sensitive data that needs to be guarded as it could perhaps identify an individual.  There are also considerations to do with safety and security that may prevent data becoming open.  In such cases, perhaps the metadata could be open and identify the data custodian.

Questions and Discussion

One of the first questions from the floor focused on the fuzzy boundaries of openness: the questioner was worried that scientists could, and would, hide behind “legitimate commercial interest”, since all data has value and research is important within a university’s business model. Boulton agreed but suggested that publishers could do more and force authors to make their data open. Since we are, in part, judged by our publication record, authors would have to comply and publish their data. Monetising the data would then have to be a separate matter. He alluded to the pharmaceutical industry, long perceived to be driven by money but which has recently moved to be more open.

The second question followed on from this, asking if anything could be learned from software licences such as the GNU and Apache licences. Boulton stated that the government is currently looking at how to license publicly funded research. What is being considered at the EU level may be slightly regressive and based on lobbying from commercial organisations. There is a lot going on in this area at the moment, so keep your eyes and ears open.

The final point from the session sought clarification of the University of Edinburgh’s research data management policy. Item nine states:

“Research data of future historical interest, and all research data that represent records of the University, including data that substantiate research findings, will be offered and assessed for deposit and retention in an appropriate national or international data service or domain repository, or a University repository.”

But how do we know what is important, or what will be deemed significant in the future? Boulton agreed that this was almost impossible.  We cannot archive all data and inevitably some important “stuff” will be lost – but that has always been the case.

The audience for Geoffrey Boulton’s talk as part of Open Access Week at UoE

My Final Thoughts on Geoffrey’s Talk

An interesting talk. There was nothing earth-shattering or new in it, but it was a good review of the argument for openness in science from someone who actually has the attention of those who need to recognise the importance of the issue and take action on it. But instead of just being a top-down talk, there was certainly a bottom-up message. Why wait for a mandate from a research council or a university? There are advantages to be had from being open with your data, and these benefits are potentially bigger for early adopters.

I will leave you with an aside from Boulton on libraries…

“Libraries do the wrong thing, employ the wrong people.”

For good reasons we’ve been centralising libraries. But perhaps we have to reverse that. Publications are increasingly online but soon it will be the data that we seek and tomorrow’s librarians should be skilled data analysts who understand data and data manipulation.  Discuss.

Addy Pope

Research and Geodata team, EDINA