Dealing With Data 2018: Summary reflections

The annual Dealing With Data conference has become a staple of the University’s data-interest calendar. In this post, Martin Donnelly of the Research Data Service gives his reflections on this year’s event, which was held in the Playfair Library last week.

One of the main goals of open data and Open Science is reproducibility, and our excellent keynote speaker, Dr Emily Sena, highlighted the problem of translating research findings into real-world clinical interventions that can be relied upon to actually help humans. Other participants echoed related challenges over the course of the day, including the relative scarcity of reported negative results – an effect of policy, and of well-established and probably outdated reward/recognition structures. Emily also gave us a useful slide on obstacles, which I will certainly want to revisit: examples cited included a lack of rigour in grant awards, and a lack of incentives for doing anything different from the status quo. Indeed, Emily described some of what she called the “perverse incentives” associated with scholarship, such as publication, funding and promotion, which can draw researchers’ attention away from the quality of their work and its benefits to society.

However, Emily reminded us that the power to effect change does not lie solely with funders, governments and those at the highest levels. The journal of which she is Editor-in-Chief (BMJ Open Science) has a policy commitment to publish sound science regardless of positive or negative results, and we all have a part to play in seeking to counter publication bias.

Photo-collage of several speakers at the event

A collage of the event speakers, courtesy Robin Rice (CC-BY)

In terms of other challenges, Catriona Keerie talked about the problem of transferring and processing inconsistent file formats between health boards, causing me to wonder whether it was a question of open vs closed formats, and how such a situation might have been averted – e.g. via planning, training (and awareness raising, as Roxanne Guildford noted), adherence to the 5-star Open Data scheme (where the third star is awarded for using open formats), or something else. Emily earlier noted a confusion about which tools are useful – and this is a role for those of us who provide tools, and for people like myself and my colleague, Digital Research Services Lead Facilitator Lisa Otty, who seek to match researchers with the best tools for their needs. Catriona also reminded us that data workflow and governance are iterative processes: we should always be fine-tuning these, and responding to new and changing needs.

Another theme of the first morning session was the question of achieving balances and trade-offs in protecting data while keeping it useful. A question from the floor noted the importance of recording and justifying how these balancing decisions are made. David Perry and Chris Tuck both highlighted the need to strike a balance, for example, between usability/convenience and data security. Chris spoke about dual testing of data: is it anonymous? / is it useful? Ideally it will be both, but that may not always be possible.

This theme of data privacy balanced against openness was taken up in Simon Chapple’s presentation on the Internet of Things. I particularly liked the section on office temperature profiles, which was very relevant to those of us who spend a lot of time in Argyle House where – as in the Playfair Library – ambient conditions can leave something to be desired. I think Simon’s slides used the phrase “Unusual extremes of temperatures in micro-locations.” Many of us know from bitter experience what he meant!

There is of course a spectrum of openness, just as there are grades of abstraction from the thing we are observing or measuring and the data that represents it. Bert Remijsen’s demonstration showed that access to sound recordings, which compared with transcription and phonetic renderings are much closer to the data source (what Kant would call the thing-in-itself (das Ding an sich) as opposed to the phenomenon, the thing as it appears to an observer) is hugely beneficial to linguistic scholarship. Reducing such layers of separation or removal is both a subsidiary benefit of, and a rationale for, openness.

What it boils down to is the old storytelling adage: “Show, don’t tell.” And as Ros Attenborough pointed out, openness in science isn’t new – it’s just a new term, and a formalisation of something intrinsic to Science: transparency, reproducibility, and scepticism. We achieve this by providing access to our workings and the evidence behind publications, and by joining these things up – as Ewan McAndrew described, linked data is key (this is the fifth star in the aforementioned 5-star Open Data scheme). Open Science, and all its various constituent parts, supports this goal, which is after all one of the goals of research and of scholarship. The presentations showed that openness is good for Science; our shared challenge now is to make it good for scientists and other kinds of researchers. Because, as Peter Bankhead says, Open Source can be transformative – Open Data and Open Science can be transformative. I fear that we don’t emphasise these opportunities enough, and we should seek to provide compelling evidence for them via real-world examples. Opportunities like the annual Dealing With Data event make a very welcome contribution in this regard.

PDFs of the presentations are now available in the Edinburgh Research Archive (ERA). Videos from the day are published on MediaHopper.

Martin Donnelly
Research Data Support Manager
Library and University Collections
University of Edinburgh

New video: the benefits of RDM training

A big part of the role of the Research Data Service is to provide a mixture of online and (general/tailored) in-person training courses on Research Data Management (RDM) to all University research staff and students.

In this video, PhD student Lis talks about her experiences of accessing both our online training and attending some of our face-to-face courses. Lis emphasises how valuable both of these can be to new PhD candidates, who may well be applying RDM good practice for the first time in their career.

[youtube]https://youtu.be/ycCiXoJw1MY[/youtube]

It is interesting to see Lis reflect on how these training opportunities made her think about how she handles data on a daily basis, bringing a realisation that much of her data was sensitive and therefore needed to be safeguarded in an appropriate manner.

Our range of regularly scheduled face-to-face training courses is run through both Digital Skills and the Institute for Academic Development – these are open to all research staff and students. We also create and provide bespoke training courses for schools and research groups based on their specific needs. Online training is delivered via MANTRA and the Research Data Management MOOC, which we developed in collaboration with the University of North Carolina.

In the video Lis also discusses her experiences of using some RDS tools and services, such as DataStore for storing and backing up her research data to prevent data loss, and contacting our team for timely support in writing a Data Management Plan for her project.

If you would like to learn more about any of the things Lis mentions in her interview, visit the RDS website. To discuss bespoke training for your school or research centre/group, please contact us via data-support@ed.ac.uk.

Kerry Miller
Research Data Support Officer
Library and University Collections
The University of Edinburgh

New team members, new team!

Time has passed, so inevitably we have said goodbye to some and hello to others on the Research Data Support team. Amongst other changes, all of us are now based together in Library & University Collections – organisationally, that is, while remaining located in Argyle House with the rest of the Research Data Service providers such as IT Infrastructure. (For an interview with the newest team member there, David Fergusson, Head of Research Services, see this month’s issue of BITS.)

So two teams have come together under Research Data Support as part of Library Research Support, headed by Dominic Tate in L&UC. Those of us leaving EDINA and Data Library look back on a rich legacy dating back to the early 1980s when the Data Library was set up as a specialist function within computing services. We are happy to become ‘mainstreamed’ within the Library going forward, as research data support becomes an essential function of academic librarianship all over the world*. Of course we will continue to collaborate with EDINA for software engineering requirements and new projects.

Introducing –

Jennifer Daub has worked in a range of research roles, from lab-based parasite genomics at the University of Edinburgh to bioinformatics at the Wellcome Trust Sanger Institute. Prior to joining the team, Jennifer provided data management support to users of clinical trials management software across the UK, and is experienced in managing sensitive data.

As Research Data Service Assistant, Jennifer has joined veterans Pauline Ward and Bob Sanders in assisting users with DataShare and Data Library as well as the newer DataVault and Data Safe Haven functions, and additionally providing general support and training along with the rest of the team.

Catherine Clarissa is doing her PhD in Nursing Studies at the University of Edinburgh. Her study is looking at patients’ and staff experiences of early mobilisation during the course of mechanical ventilation in an Intensive Care Unit. She has a solid knowledge of good practice in Research Data Management, which she has expanded by taking University training courses and by developing a Data Management Plan for her own research.

As Project Officer she is working closely with project manager Pauline Ward on the Video Case Studies project, funded by the IS Innovation Fund over the next few months. We have invited her to post to the blog about the project soon!

Last but not least, Martin Donnelly will be joining us from the Digital Curation Centre, where he has spent the last decade helping research institutions raise their data management capabilities via a mixture of paid consultancy and pro bono assistance. He has a longstanding involvement in data management planning and policy, and interests in training, advocacy, holistic approaches to managing research outputs, and arts and humanities data.

Before joining Edinburgh in 2008, Martin worked at the University of Glasgow, where he was involved in European cultural heritage and digital preservation projects, and the pre-merger Edinburgh College of Art where he coordinated quality and accreditation processes. He has acted as an expert reviewer for European Commission data management plans on multiple occasions, and is a Fellow of the Software Sustainability Institute.

We look forward to Martin joining the team next month, when he will take responsibility as Research Data Support Manager, providing expertise and line management support to the team as well as senior-level support to the service owner, Robin Rice, and to the Data Safe Haven Manager, Cuna Ekmekcioglu – who recently shifted her role from leading on training and outreach. Kerry Miller, Research Data Support Officer, is actively picking up those duties and making new contacts throughout the University to find new avenues for the team’s outreach and training delivery.

*The past and present rise of data librarianship within academic libraries is traced in the first chapter of The Data Librarian’s Handbook, by Robin Rice and John Southall.

Robin Rice
Data Librarian and Head, Research Data Support
Library & University Collections

DataShare 3.0: The ‘Download Release’ means deposits up to 100 GB

With the DataShare 3.0 release, completed on 6 October 2017, the data repository can manage data items of 100 GB. This means a single dataset of up to 100 GB can be cited with a single DOI, viewed at a single URL, and downloaded through the browser with a single click of our big red “Download all files” button. We’re not saying the system cannot handle datasets larger than this, but 100 GB is what we’ve tested for, and can offer with confidence. Joining up the DSpace asset store with our managed filestore space (DataStore) is what makes this milestone release possible.

How to deposit up to 100 GB

In practice, what this means for users is:

– You can still upload up to 20 GB of data files as part of a single deposit via our web submission form.

– For sets of files over 20 GB, depositors may contact the Research Data Service team on data-support@ed.ac.uk to arrange a batch import. The key improvement in this step is that all the files can be in a single deposit, displayed together on one page with their descriptive metadata, rather than split up into five separate deposits.

Users of DataShare can now also benefit from MD5 integrity checking

The MD5 checksum of every file in DataShare is displayed (on the Full Item view), including historic deposits. This allows users downloading files to check their integrity.

For example, suppose I download Professor Richard Ribchester’s fluorescence microscopy of the neuromuscular junction from http://datashare.is.ed.ac.uk/handle/10283/2749. N.B. the “Download all files” button in this release works differently from before: one of the differences users will see is that the zip file it downloads is now named with the two numbers from the deposit’s handle identifier, separated by an underscore instead of a forward slash. So I’ve downloaded the file “DS_10283_2749.zip”.

I want to ensure there was no glitch in the download – I want to know the file I’ve downloaded is identical to the one in the repository. So, I do the following:

  • Click on “Show full item record”.
  • Scroll down to the red button labelled “Download all files”, where I see “zip file MD5 Checksum: a77048c58a46347499827ce6fe855127” (see screenshot). I copy the checksum (highlighted in yellow).

    screenshot from DataShare showing where the MD5 checksum hash of the zip file is displayed

    DataShare displays MD5 checksum hash

  • On my PC, I generate the MD5 checksum hash of the downloaded copy, and then I check that the hash on DataShare matches. There are a number of free tools available for this task: I could use the Windows command line, or I could use an MD5 utility such as the free “MD5 and SHA Checksum Utility”. In the case of the Checksum Utility, I do this as follows:
    • I paste the hash I copied from DataShare into the desktop utility (ignoring the fact the program confusingly displays the checksum hashes all in upper case).
    • I click the “Verify” button.

In this case they are identical – I have a match. I’ve confirmed the integrity of the file I downloaded.

Screenshot showing result of MD5 match

The MD5 checksum hashes match each other.
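For those who prefer to script the comparison rather than use a desktop utility, the same check can be done in a few lines of Python. This is just an illustrative sketch: the file name and checksum in the comments are the ones from the example above, and you would substitute your own.

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading in chunks
    so that large downloads don't need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage, with the file and checksum from the example above:
#   actual = md5_of_file("DS_10283_2749.zip")
#   assert actual.lower() == "a77048c58a46347499827ce6fe855127"
```

Comparing in lower case sidesteps the upper-/lower-case display quirk mentioned above, since MD5 hex digests are case-insensitive.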

More confidence in request-a-copy for embargoed files

Another improvement we’ve made is to give depositors confidence in the request-a-copy feature. If the files in your deposit are under temporary embargo, they will not be available for users to download directly. However, users can send you a request for the files through DataShare, which you’ll receive via email. If you then agree to the request using the form and the “Send” button in DataShare, the system will attempt to email the files to the user. However, as we all know, some files are too large for email servers.

If the email server refuses to send the message because the attachment is too large, DataShare 3.0 will immediately display a “File too large” error in the browser, allowing you to make alternative arrangements to get those files to the user. Otherwise, the system moves on to offer you the chance to change the permissions on the file to open access. So, if you see no error after clicking “Send”, you can have peace of mind that the files have been sent successfully.

Pauline Ward, Research Data Service Assistant
EDINA and Data Library