Electronic Laboratory Notebooks – help or hindrance to academic research?

On 30 October 2013 the University of Edinburgh (UoE) organised what I believe to be the first University-wide meeting on Electronic Lab Notebooks (ELNs), giving a number of Principal Investigators (PIs) and others the opportunity to provide feedback on their user experiences.  This was an excellent opportunity to discuss and inform what the UoE can do to help its researchers, and whether one ‘solution’ could be implemented across the UoE or whether a more bespoke, discipline-specific approach would be required.

Lab Notes by S.S.K. – Flickr

Good research and good research data management (RDM) stem from the ability of researchers to accurately record, find, retrieve and store the information from their research endeavours.  For many, but by no means all, this will initially be done by recording their outputs on the humble piece of paper, albeit one contained within a hardbound notebook (to ensure an accurate chronological record of the work) and supplemented liberally with printouts, photographs, x-rays, etc., along with reminders of where to look for the electronic data relevant to the day’s work (ideally at least).

Presentations from University researchers

Slides from these presentations are available to UoE members via the wiki.

The event kicked off with a live demonstration from a member of the School of Physics & Astronomy, who described his positive experiences with the Livescribe system.  The demonstration impressively showed the functions of the electronic pen, which allows its user to record their writing stroke by stroke, pass this on to others as either a movie or a document, and store the output electronically.  Some disadvantages were noted, such as the physical size of the pen, the reliance on WiFi for certain features and the fact that, to date, only certain iOS 7 devices are supported (although this list will grow in 2014).  Clearly, this device has had a positive effect on both the presenter’s research and teaching duties.  However, the Livescribe pen does not in itself address how to store these digital files.

The remainder of the presentations from the academic researchers were from the fields of life science, although their experiences were quite diverse.  This helpfully provided a good set-up for a healthy discussion, on both ELNs and indeed the wider aspects of RDM at the UoE.

Of the active researchers who presented, two were PIs from the School of Molecular, Genetic and Population Health Sciences and one was a postdoctoral researcher from the School of Biological Sciences.  All three had prior experience in using previous versions of ELNs, and had sought an ELN to address a range of similar issues with paper laboratory notebooks.

Merits and pitfalls of electronic notebooks

I have chosen not to provide feedback on the specific ELNs trialled; the software discussed was Evernote, eCAT and Accelrys.  As the UoE does not recommend or discourage the use of any particular ELN to date, I won’t either.

In all cases these electronic systems were purchased for help with key areas:

Motivation/Benefits

  • Searchable data resource
  • Safe archive
  • Sharing data
  • Copy and paste functions
  • Functionality for reviewing lab members’ progress
  • Ability to organise by experiments (not just chronologically)
  • One system to store reagents/freezer contents with experimental data

And in general, key problematic issues raised with these systems were:

Barriers/Problems

  • Need for reliable internet access
  • Hardware integration into lab environment
  • Required more time to document and import data
  • Poor user interface/experience
  • Copy and paste functions (although time-saving, may increase errors as data are not reviewed)
  • Administration time by PI is required
  • PhD students and postdocs (when given the choice*) preferred to use paper notebooks

*It was mooted that no choice should be given.

Infrastructure

A common theme with the use of ELNs was that of the hardware, and the reliance on WiFi.  Clearly when working at the bench with reagents that are potentially hazardous (chemicals, radiation, etc.) or with biologicals that you don’t wish to contaminate (primary cell cultures, for instance), the hardware used should not be moved between such locations and ‘dry areas’ such as your office.  A number of groups have attempted to solve this problem by using tablets that sync to both the “cloud” and their office computers, but this is of course dependent on WiFi.  Without WiFi, you might unexpectedly find yourself with no access to any of your data or protocols, which leads to real problems if you are in the middle of an experiment.  Tablets also require an outlay of money and provide a tempting means of distraction to group members (both of which may be frowned upon by many PIs).  This cost was identified as a potential problem for larger groups, where multiple tablets would be required.

Research Data Management & Electronic Laboratory Notebooks

From an RDM perspective the subsequent discussions raised a number of interesting issues.  Firstly, as a number of these ELN services utilise the “cloud” for storage, it was clear that many researchers, PIs included, were unaware of what was expected from them by both their funding councils and the UoE.

Secure Cloud Computing by FutUndBeidl – Flickr

The Data Protection Act 1998 sets out how organisations may use personal data, and the Records Management Section’s guidance on ‘Taking sensitive information and personal data outside the University’s secure computing environment’ details the UoE position on this matter; essentially, all sensitive or personal information leaving the UoE should be encrypted.  This guidance would seem not to have reached a significant proportion of the researchers yet.

ELN? – not for academic research!

Whilst the first two presentations were broadly supportive of ELNs, the third researcher’s presentation was distinctly negative, and he provided his interpretation of the use of an ELN in an academic setting.  Although broadly speaking this presentation was on one product, it was made clear that his opinions were not based on one ‘software product’ alone.  In this case the PI has since abandoned the ELN (after four years of use and requiring his lab members to use it), citing reasons of practicality: it took too long to document the results (paper is always quicker), there is no standard for writing up documentation online**, and the data have effectively been stored twice.

He was also of the strong opinion that the use of ELNs:

“were not going to improve your research quality – it’s for those who want to spend time making their data look pretty.”

And –

“it is not for academic research, but more suited for service labs and industry.”

These would seem to be viewpoints that cannot easily be addressed.

The role of the PI

**Of course this is also true for paper versions, with the National Postdoctoral Association (USA) noting in their toolkit section on ‘Data Acquisition, Management, Sharing and Ownership’ that, with the multinational approach to research:

“many [postdocs] may prefer to keep their notes in their native language instead of English. Postdoc supervisors need to take this into consideration and establish guidelines for the extent to which record keeping must be generally accessible.”

The role of the PI cannot be overlooked in this process and to date, even if a paper notebook is utilised, there is often no standard to observe.

The next generation of ELNs

Despite these concerns ResearchSpace Ltd are poised to release the next generation of an ELN, with an enterprise release of their popular eCAT ELN, to be called RSpace.  The RSpace team seem confident that they are both aware and capable of addressing these various user requirements and it will certainly be interesting to see how they get on.  Certainly they provided clear evidence of improved user interfaces, enhanced tools, knowledge of University policy, with the prospect of integration into the existing UoE digital infrastructure, such as the data repository, Edinburgh DataShare.

Researcher engagement

Importantly whilst this programme identified concerns and benefits with the various software systems available, it also highlighted issues with the UoE dissemination of RDM knowledge to the research community, and so perhaps fittingly the last word will be from the chair:

“The University has a lot of useful information on this area of data management; please look at the research support pages!”

So the fundamental question remains, what is the best way to engage researchers in RDM and how can we best address this need at all levels?

David Girdwood
EDINA & Data Library

Science as an open enterprise – Prof. Geoffrey Boulton

As part of Open Access Week, the Data Library and Scholarly Communications teams in IS hosted a lecture by emeritus Professor Geoffrey Boulton drawing upon his study for the Royal Society: Science as an Open Enterprise (Boulton, et al 2012). The session was introduced by Robin Rice, who is the University of Edinburgh Data Librarian.  Robin pointed out that the University of Edinburgh was not just active, but was a leader in research data management, having been the first UK institution to have a formal research data management policy.  Looking at who attended the event, perhaps unsurprisingly the majority were from the University of Edinburgh.  Encouragingly, there was roughly a 50:50 split between those actively involved in research and those in support roles.  I say encouragingly as it was later stated that policies often get high-level buy-in from institutions but have little impact on those actually doing the research. Perhaps more on that later.

For those that don’t know Prof. Boulton, he is a geologist and glaciologist and has been actively involved in scientific research for over 40 years.  He is used to working with big things (mountains, ice sheets) over timescales measured in millions of years rather than seconds and notes that  while humanity is interesting it will probably be short lived!

Arguably the way we have done science over the last three hundred years has been effective. Science furthers knowledge.  Boulton’s introduction made it clear that he wanted to talk about the processes of science and how they are affected by the gathering, manipulation and analysis of huge amounts of data: the implications, the changes in processes, and why openness matters in the process of science. This was going to involve a bit of a history lesson, so let’s go back to the start.

Open is not a new concept

Geoffrey Boulton talking about the origins of peer review

Open has been a buzzword for a few years now.  Sir Tim Berners-Lee and Prof. Nigel Shadbolt have made great progress in opening up core datasets to the public.  But for science, is open a new concept? Boulton thinks not. Instead he reckons that openness is at the foundations of science but has somehow got a bit lost recently.  Journals originated as a vehicle to disseminate knowledge and trigger discussion of theories.  Boulton  gave a brief history of the origins of journals pointing out that Henry Oldenburg is credited with founding the peer review process with the Philosophical Transactions of the Royal Society.  The journal allowed scientists to share their thoughts and promote discussion.  Oldenburg’s insistence that the Transactions be published in the vernacular rather than Latin was significant as it made science more accessible.  Sound familiar?

Digital data – threat or opportunity? 

We are having the same discussions today, but they are based around technology and, perhaps in some cases, driven by money. The journal publishing model has changed considerably since Oldenburg and it was not the focus of the talk so let us concentrate on the data.  Data are now largely digital.  Journals themselves are also generally digital.  The sheer volume of data we now collect makes it difficult to include the data with a publication. So should data go into a repository?  Yes, and some journals encourage this but few mandate it.  Indeed, many of the funding councils state clearly that research output should be deposited in a repository but don’t seem to enforce this.

Replicability – the cornerstone of the scientific method

Geoffrey Boulton, mid-talk.

Having other independent scientists replicate and validate your findings adds credence to them. Why would you as a professional scientist not want others to confirm that you are correct?  It seems quite simple but it is not the norm.  Boulton pointed us to a recent paper in Nature (Nature v483 n7391) which attempted to replicate the results of a number of studies in cancer research. The team found that they could only replicate 6, around 11%, of the studies.  So the other 89% were fabricating their results?  No, there are a number of reasons why the team could not replicate all the studies: the methodology may not have been adequately explained, leading to slightly different techniques being used; the base data may have been unobtainable; and so on. But the effect is the same. Most of the previous work that the team looked at is uncorroborated science.  Are we to trust their findings?  Science is supposed to be self-correcting.  You find something, publish, others read it, replicate and corroborate or pose an alternative, old theories are discounted (Science 101 time: “Null Hypothesis”) and our collective knowledge is furthered.  Boulton suggests that, to a large degree, this is not happening. Science is not being corroborated. We have forgotten the process on which our profession is based. Quoting Jim Gray:

“when you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful. We are embarrassed by our data.”

Moving forward (or backwards) towards open science

What do we need to do to support and advise our students and researchers, and to ensure materials are available to them, so that they can be confident about sharing their data?  The University of Edinburgh does reasonably well but we still, like most institutions, have things to do.

Geoffrey looked at some of the benefits of open science and, while I am sure we all already know what these are, it is useful to have some high profile examples that we can all aspire to follow.

  1. Rapid response – some scientific research is reactive. This is especially true in research into epidemiology and infectious diseases.  An outbreak occurs, it is unfamiliar and we need to understand it as quickly as possible to limit its effects. During an E. coli outbreak in Hamburg, local scientists were struggling to identify the source. They analysed the strain and released the genome under an open licence. Within a week they had a dozen reports from 4 continents. This helped to identify the source of the outbreak and ultimately saved lives (Rohde et al 2011).
  2. Crowd-sourcing – mathematical research is unfathomable to many.  Mathematicians are looking for solutions to problems. Working in isolation or small research clusters is the norm, but is it effective?  Tim Gowers (University of Cambridge) decided to break with convention and post the “problems” he was working on to his blog.  The result: 32 days – 27 people – 800 substantive contributions. 800 substantive contributions!  I am sure that Tim also fostered some new research collaborations from his 27 respondents.
  3. Change the social dynamic of science – “We are scientists, you wouldn’t understand” is not exactly a helpful stance to adopt.  “We are scientists and we need your help,” now that’s much better!  The rise of the app has seen a new arm of science emerge, “citizen science”. The crowd, or sometimes the informed crowd, is a powerful thing. With a carefully designed app you can collect a lot of data from a lot of places over a short period. Projects such as ASHtag and LeafWatch are just two examples where the crowd has been usefully deployed to help collect data for scientists.  Actually, this has been going on for some time in different forms; do you remember the SETI@Home screensaver?  It’s still going, with 3 million users worldwide processing data for scientists since 1999.
  4. Openness and transparency – no one wants another “Climategate“.  In fact Climategate need not have happened at all. Much of the data was already publicly available and the scientists had done nothing wrong. Their lack of openness was seen as an admission that they had something to hide and this was used to damaging effect by the climate sceptics.
  5. Fraud – open data is crucial as it shines the light on science and the scientific technique and helps prevent fraud.

What value if not intelligent?

However, Boulton’s closing comments made the point that openness has little value if it is not “intelligent” so this means it is:

  • accessible (can it be found?)
  • intelligible (can you make sense of it?)
  • assessable (can you rationally look at the data objectively?)
  • re-usable (does it have sufficient metadata to describe how it was created?)

I would agree with Boulton’s criteria but would personally modify the accessible entry. In my opinion data is not open if it is buried in a PDF document. OK, I may be able to find it, but getting the data into a usable format still takes considerable effort, and in some cases, skill.  The data should be ready to use.

Of course, not every dataset can be made open.  Many contain sensitive data that needs to be guarded as it could perhaps identify an individual.  There are also considerations to do with safety and security that may prevent data becoming open.  In such cases, perhaps the metadata could be open and identify the data custodian.

Questions and Discussion

One of the first questions from the floor focused on the fuzzy boundaries of openness: the questioner was worried that scientists could, and would, hide behind the “legitimate commercial interest” since all data had value and research was important within a university’s business model.  Boulton agreed but suggested that the publishers could do more and force authors to make their data open. Since we are, in part, judged by our publication record you would have to comply and publish your data.  Monetising the data would then have to be a separate thing. He alluded to the pharmaceutical industry, long perceived to be driven by money but which has recently moved to be more open.

The second question followed on from this, asking if anything could be learned from the licences used for software, such as the GNU licences and the Apache Licence.  Boulton stated that the government is currently looking at how to license publicly-funded research.  What is being considered at the EU level may be slightly regressive and based on EU lobbying from commercial organisations. There is a lot going on in this area at the moment so keep your eyes and ears open.

The final point from the session sought clarification of the University of Edinburgh research data management policy.  Item nine states:

“Research data of future historical interest, and all research data that represent records of the University, including data that substantiate research findings, will be offered and assessed for deposit and retention in an appropriate national or international data service or domain repository, or a University repository.”

But how do we know what is important, or what will be deemed significant in the future? Boulton agreed that this was almost impossible.  We cannot archive all data and inevitably some important “stuff” will be lost – but that has always been the case.

The audience for Geoffrey Boulton’s talk as part of Open Access Week at UoE

My Final Thoughts on Geoffrey’s Talk

An interesting talk.  There was nothing earth-shattering or new in it, but a good review of the argument for openness in science from someone who actually has the attention of those who need to recognise the importance of the issue and take action on it.  But instead of just being a top down talk, there was certainly a bottom up message.  Why wait for a mandate from a research council or a university? There are advantages to be had from being open with your data and these benefits are potentially bigger for the early adopters.

I will leave you with an aside from Boulton on libraries…

“Libraries do the wrong thing, employ the wrong people.”

For good reasons we’ve been centralising libraries. But perhaps we have to reverse that. Publications are increasingly online but soon it will be the data that we seek and tomorrow’s librarians should be skilled data analysts who understand data and data manipulation.  Discuss.

Addy Pope
Research and Geodata team, EDINA

RDM in the arts and humanities

The tenth Research Data Management Forum (RDMF), organised by the DCC, was held at St Anne’s College, University of Oxford, on 3 and 4 September. What follows is an account of proceedings, the goals of which were to examine aspects of arts and humanities research data that may require a different kind of handling to that given to other disciplines, and to discuss how needs for support, advocacy, training and infrastructure are being met.

Dave De Roure (Director of the Oxford e-Research Centre) started proceedings as keynote #1. He introduced the concept of the ‘fourth quadrant social machine’ (see Fig. 1), which extends the notion of ‘citizen as scientist’ by taking advantage of the emergence of new analytical capabilities, (big) data sources and social networking. He also talked about the ‘End of Theory’, the idea that the data deluge is making the scientific method obsolete, and referenced Douglas Kell’s BioEssays article ‘Here is the evidence, now what is the hypothesis’, which argues that “data- and technology-driven programmes are not alternatives to hypothesis-led studies in scientific knowledge discovery but are complementary and iterative partners with them.” This, he continued, could have a major influence on how research is conducted not only in the hard sciences but also in the arts and humanities.

Fig. 1 – From ‘Social Machines’ by Dave De Roure on Slideshare (http://www.slideshare.net/davidderoure/social-machinesgss)

He contends that citizen science initiatives such as Zooniverse (real science online) extend the idea of ‘human as slave to the computer’, while cautioning that the ‘more we give the computer to do the more we have to keep an eye on what they do.’ De Roure argued that sharing methods is as important as sharing data, and spoke of harnessing the web as a lens onto society (e.g. Twitterology), as infrastructure (e-science/e-research), and as the artifact of study. He went on to highlight innovative sharing initiatives in the arts and humanities that employ new analytical approaches such as:

  • the HathiTrust Research Center, which ‘enables computational access for nonprofit and educational users to published works in the public domain now and, in the future’, i.e. take your code to the data and get back your results!
  • CLAROS, which uses semantic web technologies and image recognition techniques to make geographically separate scholarly art datasets ‘interoperable’ and accessible to a broad range of users
  • SALAMI (Structural Analysis of Large Amounts of Music Information) which focuses on ‘developing and evaluating practices, frameworks, and tools for the design and construction of worldwide distributed digital music archives and libraries’

De Roure used the phrase ‘from signal to understanding’ to describe the workflow of the ‘social machine’ and went on to describe the commonalities that the arts and humanities community has with other disciplines, such as working with multiple data sources (often incomplete and born digital), sharing of new and innovative digital methods, the challenges of resource discovery and publication of new digital artifacts, the importance of provenance, and the risks of increasing automation. He also highlighted the ways in which digital resources in the arts and humanities differ from those of other disciplines in the age of the ‘social machine’, such as specific content types and their relationship to physical artifacts, curated collections and an ‘infinite archive’ of heterogeneous content, publication as both subject and record of research, and the emphasis on multiple interpretations and critical thinking.

Keynote #2 was delivered by Leigh Garrett (Director of the Visual Arts Data Service), who opined that little is really known about the ‘state’ of research data in the visual arts. It is both tangible and intangible (what is data in the visual arts?), both physical and digital, heterogeneous and infinite, complex and complicated. He mentioned the Jisc-funded KAPTUR project, which aimed to create and pilot a sectoral model of best practice in the management of research data in the visual arts.

KAPTUR schematic drawing

As part of the exercise KAPTUR evaluated 12 technical systems most suited for managing research data in the visual arts, including CKAN, Figshare, EPrints and DataFlow. Criteria such as user friendliness, visual engagement, flexibility, hosted solution, licensing, versioning and searching were considered. Figshare was seen as user friendly, visually engaging, intuitive and flexible; however, the use of CC Zero licences was seen as inappropriate for visual arts research data due to commercial usage clauses. Whilst CKAN appeared most suited, no single solution completely fulfilled all requirements of researchers.

Fig 2. Lucie Rie, Sheet of sketches from Council of Industrial Design correspondence for Festival of Britain, 1951. © Mrs. Yvonne Mayer/Crafts Study Centre.
Available from VADS

Simon Willmoth then gave an institutional perspective from the University of the Arts London. It was interesting that, in his experience, abstract terms such as ‘research data’ do not engage researchers, and indeed the term is not in common usage among art and design researchers. His definition in the context of art and design was that research data can be ‘considered anything created, captured or collected as an output of funded research work in its original state’. He also observed that as soon as researchers understand RDM concepts they ask for all of their material (past and present) to be digitised! Echoing earlier presentations regarding the ‘heterogeneous and infinite’ nature of research data in the arts and humanities, Simon indicated that artists and designers normally have their own websites. Some of the content can be regarded as research data (e.g. drawings, storyboards, images), whilst some of it is a particular view dependent upon what it is used for at that instant. He then described the institutional challenges of resourcing (staff, storage, time), researcher engagement, curation (incl. legal, ethical and commercial constraints), infrastructure, and enhanced access. Simon finished with some very interesting quotes from researchers regarding how they perceive their work in relation to RDM, e.g.

The work that is made is evidence of the journey to the completed artwork …… it’s kind of a continuous archive of imagery

I try not to throw things out but I often cut things up to use as collage material …. so it’s continual construction and deconstruction. Actually ordering is part of my own creative process, so the whole idea of archiving I think is really interesting

I used to think of a website as something where you display things and now increasingly I see it as a way of it recording the process, so I am using more like a Blog structure. But I am happy to post notes, photographs, drawings, observations and let other people see the process

My sketch books tend to be a mish-mash between logging of rushes notes, detailing each shot, things that I read, things that I hear, books that I’m reading, so it will be a jumble of texts but they’ve all gone in to the making of a piece of work.

Image of Alex from A Clockwork Orange at Stanley Kubrick Archive (UAL)

Simon showcased the Stanley Kubrick Archive based at UAL which contains draft and completed scripts, research materials such as books, magazines and location photographs. It also holds set plans and production documents such as call sheets, shooting schedules, continuity reports and continuity Polaroids. Props, costumes, poster designs, sound tapes and records also feature, alongside publicity press cuttings. He argues that the approach of digital copy and physical artifact accompanied by a web catalogue may be the way forward for RDM in the field of art and design.

Julianne Nyhan (UCL) kicked off day two, providing a researcher’s view on arts and humanities data management/sharing with particular emphasis on infrastructure needs and wants. She observed that arts and humanities data are artifacts of human expression, interaction and imagination, which tend to be collected rather than generated, and rarely by those who create the object(s). These can be:

  • complex complete with variants, annotations, editorial comment
  • multi-lingual
  • long lasting
  • studied for many purposes

Julianne also reiterated the need to bridge management and sharing of both physical and digital objects, as well as the need for more documentation of interpretative processes and interventions (for which she employs a range of note management tools). She went on to say that much of the work done in her own discipline goes beyond disciplinary/academic/institutional boundaries and that the need to retain and appreciate the bespoke should be balanced against the need for standardisation.

In terms of strategic developments Julianne saw much mileage in facilitating more research across linguistic borders (through legal instruments at a national/international level) with resultant access to large multilingual datasets from different cultures to inform comparative and transnational research.

Next up Paul Whitty and Felicity Ford (Oxford Brookes University) provided an overview of RDM practices at the Sonic Art Research Unit (SARU). The use of the internet to advertise work is common amongst SARU researchers, musicians and freelance artists. It was however recognised that there was no unified web presence along the lines of Ubuweb.

Paul emphasised the need for the internet to be seen as a social space for SARU researchers. At the moment research is disseminated across multiple private websites without consistent links back to the university. Research objects and documentation are split over multiple platforms (e.g. Soundcloud, Vimeo, YouTube). This makes resource discovery difficult except through individual artists’ web spaces. As of March 2014 research disseminated across multiple private websites will be linked to/from the university with data stored on RADAR, the multi-purpose online “resource bank” for Oxford Brookes, thus enabling traceability and impact measurement in terms of the use of digital research assets. As a concluding remark Paul did question whether university IT departments were qualified to provide bespoke website design for arts practitioners and researchers.

James Wilson, Janet McKnight and Sally Rumsey then gave an overview of the University of Oxford approach to RDM in the humanities, which utilises different business models (extensible or reducible) for different components, e.g. DataFinder, Databank, DataFlow, DHARMa. Findings from earlier projects at Oxford indicated that humanities research data tends not to depreciate over time (unlike data in the harder sciences), is difficult to define, tends to be compiled from existing sources rather than created from scratch, and is often not in an optimal format for analysis. Other findings indicated that humanities researchers are the least likely to conduct their research as part of a team, to be externally funded, or to have deposited data in a repository. Conclusions reached were that humanities researchers were amongst the hardest to reach and that training and support are required to encourage cultural change. Janet McKnight from the Digital Humanities Archives for Research Materials (DHARMa) Project then spoke about enabling digital humanities research through effective data preservation, warning that before you impose a workflow in terms of developing systems and processes it would be wise to ‘walk a mile in their [the researcher’s] shoes!’

This was a well-attended and enlightening event, ably organised and chaired by Martin Donnelly (DCC). It offered insight and a wide range of perspectives all of which enhance our understanding of service, of practice, of advocacy, of support in relation to research data management in the arts and humanities.

Slides from all presenters are available from the DCC website.

Stuart Macdonald
EDINA & Data Library

Making research ‘really reproducible’

Listening to Victoria Stodden, Assistant Professor of Statistics at Columbia University, give the keynote speech at the recent Open Repositories conference in lovely Prince Edward Island, Canada, I realised we have some way to go on the path towards her idealistic vision of how to “perfect the scholarly record.”

Victoria Park, Charlottetown

Charlottetown, Prince Edward Island

As an institutional data repository manager (for Edinburgh DataShare) I often listen and talk to users about the reasons for sharing and not sharing research data. One reason, well-known to users of the UK Data Archive (now known as the UK Data Service), is if the dataset is very rich and can be used for multiple purposes beyond those for which it was created, for example, the British Social Attitudes Survey.

Another reason for sharing data, increasingly being driven by funder and publisher policies, is to allow replication of published results, or in the case of negative results which are never published, to avoid duplication of effort and wasting public monies.

It is the second reason on which Stodden focused, and not just for research data but also for the code that is run on the data to produce the scientific results. It is for this reason she and colleagues have set up the web service, Run My Code. These single-purpose datasets do not normally get added to collections within data archives and data centres, as their re-use value is very limited. Stodden’s message to the audience of institutional repository managers and developers was that the duty of preserving these artefacts of the scientific record should fall to us.

Why should underlying code and data be published and preserved along with articles as part of the scholarly record? Stodden argues, because computation is becoming central to scientific research. We’ve all heard arguments behind the “data deluge”. But Stodden persuasively focuses on the evolution of the scientific record itself, arguing that Reproducible Research is not new. It has its roots in Skepticism – developed by Robert Boyle and the Royal Society of the 1660s. Fundamentally, it’s about “the ubiquity of error: The central motivation of the scientific method is to root out error.”

In her keynote she developed this theme by expanding on the three branches of science.

  • Branch 1: Deductive. This was about maths and formal logic, and the proof as the main product of scientific endeavor.
  • Branch 2: Empirical. Statistical analysis of controlled experiments – hypothesis testing, structured communication of methods and protocols. Peer reviewed articles became the norm.
  • Branch 3: Computational. This is at an immature stage, in part because we have not developed the means to rigorously test assertions from this branch.

Stodden is scathing in her criticism of the way computational science is currently practiced, consisting of “breezy demos” at conferences that can’t be challenged or “poked at.” She argues passionately for the need to facilitate reproducibility – the ability to regenerate published results.

What is needed to achieve openness in science? Stodden argued for the need for deposit and curation of versioned data and code, with a link to the published article, and permanence of the link. This is indeed within the territory of the repository community.

Moreover, to have sharable products at the end of a research project, one needs to plan to share from the outset. It’s very difficult to reproduce the steps to create the results as an afterthought.

I couldn’t agree more with this last assertion. Since we set up Edinburgh DataShare we have spoken to a number of researchers about their ‘legacy’ datasets. Although they would like to make these publicly available, they cannot, either because of the nature of the consent forms, the format of the material, or the lack of adequate documentation. The easiest way is to plan to share. Our Research Data Management pages have information on how to do this, including use of the Digital Curation Centre’s tool, DMPOnline.

– Robin Rice, Data Librarian