RDM & Cornell University

I have been fortunate to be given the opportunity to take up a secondment at the Cornell Institute for Social and Economic Research (CISER) as Data Services Librarian. The primary tasks of the role are to:

  • Modernise the CISER data archive and, if possible, begin the implementation. Tasks include: introducing persistent identifiers (DOIs) for all archival datasets via EZID (a hedged sketch follows this list); investigating metadata mapping of archival datasets (DDI, DC, MARCXML); streamlining data catalogue functionality (result sorting, relevance searching, subject classification); and assisting in scoping a data repository solution for social science data assets generated by Cornell researchers.
  • Actively participate in the Research Data Management Services Group at Cornell, assisting researchers with their RDM plans and contributing to the group’s work.
  • Actively consult with researchers about social science datasets and other data outreach activities.
  • Co-ordinate and collate assessment statements in order to gain the Data Seal of Approval for the CISER data archive.
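
A hedged illustration of the DOI task in the first bullet: EZID mints identifiers over a simple HTTP API that accepts ANVL-formatted metadata. The sketch below uses Python's requests library with EZID's documented test shoulder, placeholder credentials, and an invented landing-page URL (none of these are actual CISER values), just to show the shape of the workflow; in practice the metadata would be drawn from the archive's existing DDI records.

```python
# Minimal sketch of minting a DOI for an archival dataset via the EZID API.
# The account, shoulder, and landing-page URL below are placeholders, not real
# CISER values.
import requests

EZID_BASE = "https://ezid.cdlib.org"
SHOULDER = "doi:10.5072/FK2"        # EZID's test shoulder, not a production one
AUTH = ("apitest", "apitest")       # placeholder credentials


def anvl(metadata):
    """Serialise a metadata dict into EZID's ANVL request format."""
    return "".join(f"{key}: {value}\n" for key, value in metadata.items())


metadata = {
    "_target": "https://example.org/catalogue/study-1234",  # hypothetical landing page
    "datacite.creator": "Doe, Jane",
    "datacite.title": "Example Social Survey, 1998 [dataset]",
    "datacite.publisher": "Cornell Institute for Social and Economic Research",
    "datacite.publicationyear": "2013",
    "datacite.resourcetype": "Dataset",
}

response = requests.post(
    f"{EZID_BASE}/shoulder/{SHOULDER}",
    data=anvl(metadata).encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=AUTH,
)
print(response.status_code, response.text)  # e.g. "201 success: doi:10.5072/FK2..."
```

EZID's test shoulder is intended for experimentation only; the archive itself would need its own shoulder and account.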

Last Friday I gave my first presentation on the CISER data archive, along with other CISER colleagues (they covered restricted-use datasets at the Cornell Restricted Access Data Centre, the CISER Statistical Consultancy Service, and ICPSR), at a Policy Analysis and Management (PAM) workshop for graduate students. This was held at the Survey Research Institute (https://www.sri.cornell.edu/sri/), where much discussion centred on survey non-response and mechanisms to counter this increasingly common phenomenon.

On Tuesday of this week I presented on the University of Edinburgh RDM Roadmap at the monthly meeting of the Research Data Management Service Group (RDMSG – http://data.research.cornell.edu). This was followed by two presentations yesterday: one at a Demography Pro-seminar (for graduate students) on campus, and later one at a Cornell University Library Data Discussion Group meeting in the Mann Library, set up to introduce the CISER Data Services Librarian to a range of subject librarians, principally in the social sciences. In each case the Edinburgh RDM Roadmap was received with great enthusiasm and engendered much discussion, in particular about the centralised and inclusive approach adopted by Edinburgh. Follow-up discussions and meetings are being planned, including the potential use of MANTRA and the RDM Toolkit for Librarians as materials to raise the profile of RDM at Cornell.

As an aside, at a CISER team meeting the subject of password protection was raised (in some instances, passwords for CISER resources are changed very regularly for security purposes), along with issues surrounding the inappropriate recording of passwords. A site licence for a password management software package was seen as a possible solution to both user disgruntlement and possible security breaches. As a thought, this might be worth considering as part of the Active Data Infrastructure tool suite.

Stuart Macdonald
Associate Data Librarian, UoE / Visiting CISER Data Services Librarian

Science as an open enterprise – Prof. Geoffrey Boulton

As part of Open Access Week, the Data Library and Scholarly Communications teams in IS hosted a lecture by emeritus Professor Geoffrey Boulton drawing upon his study for the Royal Society, Science as an Open Enterprise (Boulton et al. 2012). The session was introduced by Robin Rice, the University of Edinburgh Data Librarian.  Robin pointed out that the University of Edinburgh was not just active but a leader in research data management, having been the first UK institution to adopt a formal research data management policy.  Looking at who attended the event, perhaps unsurprisingly the majority were from the University of Edinburgh.  Encouragingly, there was roughly a 50:50 split between those actively involved in research and those in support roles.  I say encouragingly because it was later stated that policies often get high-level buy-in from institutions but have little impact on those actually doing the research. Perhaps more on that later.

For those who don’t know Prof. Boulton, he is a geologist and glaciologist who has been actively involved in scientific research for over 40 years.  He is used to working with big things (mountains, ice sheets) over timescales measured in millions of years rather than seconds, and notes that while humanity is interesting, it will probably be short-lived!

Arguably, the way we have done science over the last three hundred years has been effective: science furthers knowledge.  Boulton’s introduction made it clear that he wanted to talk about the processes of science and how they are affected by the gathering, manipulation and analysis of huge amounts of data: the implications, the changes in processes, and why openness matters in the process of science. This was going to involve a bit of a history lesson, so let’s go back to the start.

Open is not a new concept

Geoffrey Boulton talking about the origins of peer review

Open has been a buzzword for a few years now.  Sir Tim Berners-Lee and Prof. Nigel Shadbolt have made great progress in opening up core datasets to the public.  But for science, is open a new concept? Boulton thinks not. Instead he reckons that openness is at the foundations of science but has somehow got a bit lost recently.  Journals originated as a vehicle to disseminate knowledge and trigger discussion of theories.  Boulton  gave a brief history of the origins of journals pointing out that Henry Oldenburg is credited with founding the peer review process with the Philosophical Transactions of the Royal Society.  The journal allowed scientists to share their thoughts and promote discussion.  Oldenburg’s insistence that the Transactions be published in the vernacular rather than Latin was significant as it made science more accessible.  Sound familiar?

Digital data – threat or opportunity? 

We are having the same discussions today, but they are based around technology and, perhaps in some cases, driven by money. The journal publishing model has changed considerably since Oldenburg, but it was not the focus of the talk, so let us concentrate on the data.  Data are now largely digital.  Journals themselves are also generally digital.  The sheer volume of data we now collect makes it difficult to include the data with a publication. So should data go into a repository?  Yes, and some journals encourage this, but few mandate it.  Indeed, many of the funding councils state clearly that research output should be deposited in a repository but don’t seem to enforce this.

Replicability – the cornerstone of the scientific method

Geoffrey Boulton, mid-talk.

Having other independent scientists replicate and validate your findings adds credence to them. Why would you, as a professional scientist, not want others to confirm that you are correct?  It seems quite simple but it is not the norm.  Boulton pointed us to a recent paper in Nature (Nature v483 n7391) which attempted to replicate the results of a number of studies in cancer research. The team found that they could only replicate 6 of the studies, around 11%.  So were the other 89% fabricating their results?  No, there are a number of reasons why the team could not replicate all the studies.  The methodology may not have been adequately explained, leading to slightly different techniques being used; the base data may have been unobtainable; and so on. But the effect is the same: most of the previous work that the team looked at is uncorroborated science.  Are we to trust their findings?  Science is supposed to be self-correcting.  You find something, publish, others read it, replicate and corroborate or pose an alternative, old theories are discounted (Science 101 time: the “Null Hypothesis“) and our collective knowledge is furthered.  Boulton suggests that, to a large degree, this is not happening. Science is not being corroborated. We have forgotten the process on which our profession is based. Quoting Jim Gray:

“when you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful. We are embarrassed by our data.”

Moving forward (or backwards) towards open science

What do we need to do to support and advise our students and researchers, and to ensure the right materials are available, so that they can be confident about sharing their data?  The University of Edinburgh does reasonably well but we still, like most institutions, have things to do.

Geoffrey looked at some of the benefits of open science and, while I am sure we all already know what these are, it is useful to have some high-profile examples that we can all aspire to follow.

  1. Rapid response – some scientific research is reactive. This is especially true of research into epidemiology and infectious diseases.  An outbreak occurs, it is unfamiliar, and we need to understand it as quickly as possible to limit its effects. During an E. coli outbreak centred on Hamburg, local scientists were struggling to identify the source. They analysed the strain and released the genome under an open licence. Within a week they had a dozen reports from four continents. This helped to identify the source of the outbreak and ultimately saved lives (Rohde et al. 2011).
  2. Crowd-sourcing – mathematical research is unfathomable to many.  Mathematicians are looking for solutions to problems. Working in isolation or in small research clusters is the norm, but is it effective?  Tim Gowers (University of Cambridge) decided to break with convention and post the “problems” he was working on to his blog.  The result: 32 days, 27 people, 800 substantive contributions. 800 substantive contributions!  I am sure that Tim also fostered some new research collaborations from his 27 respondents.
  3. Change the social dynamic of science – “We are scientists, you wouldn’t understand” is not exactly a helpful stance to adopt.  “We are scientists and we need your help,” now that’s much better!  The rise of the app has seen a new arm of science emerge, “citizen science”. The crowd, or sometimes the informed crowd, is a powerful thing. With a carefully designed app you can collect a lot of data from a lot of places over a short period. Projects such as ASHtag and LeafWatch are just two examples where the crowd has been usefully deployed to help collect data for scientists.  Actually, this has been going on for some time in different forms, do you remember the SETI@Home screensaver?  It’s still going, 3 million users worldwide processing data for scientists since 1999.
  4. Openness and transparency – no one wants another “Climategate“.  In fact Climategate need not have happened at all. Much of the data was already publicly available and the scientists had done nothing wrong. Their lack of openness was seen as an admission that they had something to hide and this was used to damaging effect by the climate sceptics.
  5. Fraud – open data is crucial as it shines the light on science and the scientific technique and helps prevent fraud.

What value if not intelligent?

However, Boulton’s closing comments made the point that openness has little value if it is not “intelligent”, meaning that it is:

  • accessible (can it be found?)
  • intelligible (can you make sense of it?)
  • assessable (can you judge the data objectively?)
  • re-usable (is there sufficient metadata to describe how it was created?)

I would agree with Boulton’s criteria but would personally modify the accessible entry. In my opinion data is not open if it is buried in a PDF document. OK, I may be able to find it, but getting the data into a usable format still takes considerable effort, and in some cases, skill.  The data should be ready to use.

Of course, not every dataset can be made open.  Many contain sensitive data that needs to be guarded as it could perhaps identify an individual.  There are also considerations to do with safety and security that may prevent data becoming open.  In such cases, perhaps the metadata could be open and identify the data custodian.
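
As a hedged, invented illustration (not something from Boulton’s talk), a minimal machine-readable metadata record for such a restricted dataset might look like the sketch below: the descriptive fields, method and custodian contact are open even though the data themselves are not. All field names and values are placeholders.

```python
# Illustrative only: a minimal open metadata record for a closed dataset.
# Field names and values are invented for the example.
import json

record = {
    "title": "Household Interview Study (restricted)",
    "creator": "Example Research Group, University of Example",
    "description": "Interview transcripts; contain personal data and cannot be shared openly.",
    "method": "Semi-structured interviews, coded and anonymised",   # intelligible / assessable
    "created": "2013-06-01",
    "licence": "Metadata: open. Data: access by application only.",
    "access": {
        "status": "restricted",
        "custodian": "data-custodian@example.ac.uk",                # who to ask
        "conditions": "Approved researchers; data sharing agreement required",
    },
}

print(json.dumps(record, indent=2))
```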

Questions and Discussion

One of the first questions from the floor focused on the fuzzy boundaries of openness; the questioner was worried that scientists could, and would, hide behind “legitimate commercial interest”, since all data had value and research was important within a university’s business model.  Boulton agreed but suggested that publishers could do more and force authors to make their data open. Since we are, in part, judged by our publication record, you would have to comply and publish your data.  Monetising the data would then have to be a separate matter. He alluded to the pharmaceutical industry, long perceived to be driven by money but which has recently moved to be more open.

The second question followed on from this, asking if anything could be learned from software licences such as the GNU and Apache licences.  Boulton stated that the government is currently looking at how to licence publicly-funded research.  What is being considered at the EU level may be slightly regressive, shaped by lobbying from commercial organisations. There is a lot going on in this area at the moment, so keep your eyes and ears open.

The final point from the session sought clarification of the University of Edinburgh’s research data management policy.  Item nine states:

“Research data of future historical interest, and all research data that represent records of the University, including data that substantiate research findings, will be offered and assessed for deposit and retention in an appropriate national or international data service or domain repository, or a University repository.”

But how do we know what is important, or what will be deemed significant in the future? Boulton agreed that this was almost impossible.  We cannot archive all data and inevitably some important “stuff” will be lost – but that has always been the case.

The audience for Geoffrey Boulton’s talk as part of Open Access Week at UoE

My Final Thoughts on Geoffrey’s Talk

An interesting talk.  There was nothing earth-shattering or new in it, but it was a good review of the argument for openness in science from someone who actually has the attention of those who need to recognise the importance of the issue and take action on it.  But instead of just being a top-down talk, there was certainly a bottom-up message.  Why wait for a mandate from a research council or a university? There are advantages to be had from being open with your data, and these benefits are potentially bigger for the early adopters.

I will leave you with an aside from Boulton on libraries…

“Libraries do the wrong thing, employ the wrong people.”

For good reasons we’ve been centralising libraries. But perhaps we have to reverse that. Publications are increasingly online but soon it will be the data that we seek and tomorrow’s librarians should be skilled data analysts who understand data and data manipulation.  Discuss.


Addy Pope

Research and Geodata team, EDINA


RDM reflection – finishing the data life cycle

Research Data Management and I were a chance acquaintance: I was asked to stand in for one of the steering group members despite having only tenuous qualifications for the role. That said, I quickly realised that it is an important and complex initiative, and one in which our University is leading.

Progressing with RDM in the University is not straightforward but it is essential.

This reflection could go off on many tracks but it will concentrate on one – finishing the data life cycle.

If we consider in a very simplistic way the funding of a researcher, it might look like this:

The point at which data should transfer to Data Stewardship may coincide with higher priorities for the researcher.

A big hurdle that RDM has to cross is the final point of data transition. The data manager wants to see data moved into Data Stewardship.  The researcher’s priorities are publication and next grant application. The result:

Data will not flow easily from stage 2 (Active Data Management) to stage 3 (Data Stewardship).

Of course, a researcher and a data manager may look at the above diagram and say it is wrong. They will see solutions. And when they do, this reflection will have succeeded in communicating what it needed to say.

James Jarvis, Senior Computing Officer

IS User Services Division

Issues for research software preservation

[Reposted from https://libraryblogs.is.ed.ac.uk/blog/2013/09/18/research-software-preservation/]


Twelve years ago I was working as a research assistant on an EPSRC-funded project.  My primary role was to write software that allowed vehicle wiring to be analysed and faults to be identified early in the design process, typically during the drafting stage within Computer Aided Design (CAD) tools.  As with all product design, the earlier that potential faults can be identified, the cheaper it is to eliminate them.

Life moves on, and in the intervening years I’ve moved between six jobs and worked in three different universities.  Part of my role now includes overseeing areas of the University’s Research Data Management service.  In this work, one area that gets raised from time to time is the issue of preserving software.  Preserving data is talked about more often, but the software that created the data can be important too, particularly if the data ever need to be recreated, or if the software is required to interrogate or visualise the data.  The rest of this blog post takes a look at some of the important areas that should be thought about when writing software for research purposes.

In their paper ‘A Framework for Software Preservation’ (doi:10.2218/ijdc.v5i1.145), Matthews et al. describe four aspects of software preservation:

1. Storage: is the software stored somewhere?

2. Retrieval: can the software be retrieved from wherever it is stored?

3. Reconstruction: can the software be reconstructed (executed)?

4. Replay: when executed, does the software produce the same results as it did originally?

Storage:

Storage of source code is perhaps one of the easier aspects to tackle; however, there are a multitude of issues and options.  The first step, and this is just good software development practice, is documentation about the software.  In some ways this is no different to lab notebooks or experiment records that help explain what was created, why it was created, and how it was created.  This includes everything from basic practices such as comments in the code and meaningful variable names, through to design documentation and user manuals.  The second step, which again is just good software development practice, is to store code in a source code management system such as git, mercurial, SVN, CVS, or, going back a few years, RCS or SCCS.  A third step is to store the code on a supported and maintained platform, perhaps a departmental or institutional file store.
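
As a hedged illustration of that first step, the short routine below shows the kind of in-code documentation that helps later on: a module docstring recording purpose and units, meaningful variable names, and comments. The function and figures are invented for this post and are not taken from the wiring-analysis project.

```python
"""Illustrative only: the kind of in-code documentation that aids preservation.

Purpose : estimate the voltage drop along a two-way copper cable run
          (an invented example, not code from the original project).
Author  : A. Researcher, 2013
Units   : length in metres, cross-sectional area in mm^2, current in amps
"""

# Resistivity of copper at roughly 20 degrees C, in ohm mm^2 per metre.
COPPER_RESISTIVITY = 0.0172


def voltage_drop(length_m: float, area_mm2: float, current_a: float) -> float:
    """Return the voltage drop in volts over a two-way cable run."""
    resistance_ohm = 2 * length_m * COPPER_RESISTIVITY / area_mm2
    return resistance_ohm * current_a


if __name__ == "__main__":
    # Meaningful names and explicit units make the calculation auditable later.
    print(voltage_drop(length_m=5.0, area_mm2=2.5, current_a=10.0))
```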

However, depending on the language used, it may be prudent to store more than just the code and documentation.  If the code is written in a well-known language such as Java, C, or Perl, then the chances are that you’ll be OK.  However, there can be complexities related to code libraries.  Take the example of a piece of software written in Java and using the Maven build system.  Maven helps by allowing dependencies to be downloaded at build time, rather than storing them locally.  This gives benefits such as ensuring new versions are used, but what if that particular Maven repository is no longer available in five years’ time? I may be in the situation where I can’t rebuild my code because I don’t have access to the dependencies.
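
The same concern applies beyond the Java and Maven world. As a hedged sketch in Python (standing in for whatever build system a project actually uses), one lightweight safeguard is to snapshot the exact versions of every installed dependency alongside the source code, so that the build can be reproduced even if the original package repository disappears; the output file name here is just an example.

```python
# Illustrative sketch: record the exact versions of installed dependencies
# next to the source code, as a minimal guard against a package repository
# disappearing. (In the Maven world, the equivalent is keeping a local copy of
# the resolved dependencies rather than relying on remote download at build time.)
from importlib.metadata import distributions
from pathlib import Path


def snapshot_dependencies(output_file: str = "DEPENDENCIES.txt") -> None:
    """Write 'name==version' for every installed distribution to a file."""
    lines = sorted(
        f"{dist.metadata['Name']}=={dist.version}" for dist in distributions()
    )
    Path(output_file).write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    snapshot_dependencies()
    print("Dependency snapshot written to DEPENDENCIES.txt")
```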

Retrieval:

If good and appropriate storage is used, then retrieval should also be straightforward.  However, if nothing else, time and change can be an enemy.  Firstly, is there sufficient information easily available to describe to someone else, or to act as a reminder to yourself, what to access and where it is? Very often filestore permissions are used to limit who can access the storage.  If access is granted (if it wasn’t held already), then it is important to know where to look.  Using extra systems such as source code control can be a blessing and a curse: you may end up having to ask a friendly sysadmin to install an SCCS client to access your old code repository!

Reconstruction:

You’ve stored your code, you’ve retrieved it, but can it be reconstructed?  Again this will often come down to how well you stored the software and its dependencies in the first place.  In some instances, perhaps where specialist programming languages or environments had to be used, these may have been stored too.  But can a programming tool written for Windows 95 still be used today?  Maybe – it might be possible to build such a machine if you can’t find one, or to download a virtual machine running Windows 95.  This raises another consideration of what to store: you may wish to store virtual machine images of either the development environment or the execution environment, to make it easier to fire up and run the code at a later date.  However, there are no doubt issues here with choosing a virtual machine format that will still be accessible in twenty years’ time, and, in line with normal preservation practice, storing a virtual machine in no way removes the need to store raw textual source code that can easily be read by any text editor in the future.
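
A lightweight complement to storing a whole virtual machine image is to record, in plain text, what the execution environment actually was when the code last ran. The sketch below is illustrative only, capturing a few obvious details with Python’s standard library.

```python
# Illustrative sketch: capture basic details of the execution environment in
# plain text, to accompany archived source code. A plain-text record like this
# stays readable even if the virtual machine format does not.
import json
import platform
import sys
from datetime import date

environment = {
    "recorded": date.today().isoformat(),
    "operating_system": platform.platform(),
    "machine": platform.machine(),
    "python_version": sys.version,
    "notes": "Add compilers, libraries and hardware details as appropriate.",
}

with open("ENVIRONMENT.json", "w", encoding="utf-8") as f:
    json.dump(environment, f, indent=2)

print(json.dumps(environment, indent=2))
```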

Replay:

Assuming you now have your original code in an executable format, you can look forward to being able to replay it and get data in and out of it.  That is, of course, as long as you have also preserved the data!

To recap, here are a few things to think about:

– As with many areas of Research Data Management, planning is essential.  Subsequent retrieval, reconstruction, and replay are only possible if the right information is stored in the right way originally, so you need a plan reminding you what to store.

– Consider carefully what to store, and what else might be needed to recompile or execute the code in the future.

– Think about where to store the code, and where it will most likely be accessible in the future.

– Remember to store dependencies that might be quite normal today but might not be so easily found in the future.

– Popular programming languages may be easier to execute in the future than niche languages.

– Even if you are storing complete environments as virtual machines, remember that these may be impenetrable in the future, whereas plain text source code will always be accessible.

So, back to the project I was working on twelve years ago.  How did I do?

– Storage: The code was stored on departmental filestore. Shamefully, I have to admit that no source code control system was used; the three programmers on the project just merged their code periodically.

– Retrieval: I don’t know!  It was stored on departmental filestore, so after I moved from that department to another it became inaccessible to me. I presume the filestore has been maintained by the department, but was my area kept after I left, or was it deleted automatically?

– Reconstruction: The software was written in Java and Perl, so should be relatively easy to rebuild.

– Replay: I can’t remember how much documentation we wrote to explain how to run the code, how to read and write data, or what format the data files had to be in.  Twelve years on, I’m not sure I could!

Final grading: Room for improvement!

Stuart Lewis, Head of Research and Learning Services, Library & University Collections.