“Archiving Your Data” – new videos from the Research Data Service

In three new videos released today, researchers from the University of Edinburgh talk about why and how they archive their research data, and the ways in which they make their data openly available using the support, tools and resources provided by the University’s Research Data Service.

Professor Richard Baldock from the MRC Human Genetics Unit explains how he’s been able to preserve important research data relating to developmental biology – and make it available for the long term using Edinburgh DataShare – in a way that was not possible by other means, owing to the large volume of histology data produced.


Dr Marc Metzger from the School of GeoSciences explains how he saves himself time by making his climate mapping research data openly available so that others can download it for themselves, rather than having to send out copies himself in response to requests. This approach represents best practice: making the data openly available is also more convenient for users, removing a potential barrier to the re-use of the data.

Professor Miles Glendinning from Edinburgh College of Art talks about how his architectural photographs of social housing are becoming more discoverable as a result of being shared on Edinburgh DataShare. And Robin Rice, the University’s Data Librarian, discusses the difference between the open (DataShare) and restricted (DataVault) archiving options provided by the Research Data Service.

For more details about Edinburgh’s Research Data Service, including the DataShare and DataVault systems, see:

https://www.ed.ac.uk/is/research-data-service

Pauline Ward
Research Data Service Assistant
Library and University Collections
University of Edinburgh

Managing data: photographs in research

In collaboration with Scholarly Communications, the Data Library participated in the workshop “Data: photographs in research” as part of a series of workshops organised by Dr Tom Allbeson and Dr Ella Chmielewska for the pilot project “Fostering Photographic Research at CHSS” supported by the College of Humanities and Social Science (CHSS) Challenge Investment Fund.

In our research support roles, Theo Andrew and I addressed issues associated with finding and using photographs from repositories, archives and collections, and the challenges of re-using photographs in research publications. Workshop attendees came from a wide range of disciplines and were at different stages in their research careers.

First, I gave a brief introduction to terminology and research data basics, and gave a tour of media platforms and digital repositories such as Jisc Media Hub, VADS, Wellcome Trust, Europeana, Live Art Archive, Flickr Commons, and the Library of Congress Prints & Photographs Online Catalog (Muybridge http://hdl.loc.gov/loc.pnp/cph.3a45870).

Eadweard Muybridge. 1878. The Horse in motion. Photograph.

From the Library of Congress Prints and Photographs Online Catalog

Then, Theo presented key concepts of copyright and licensing, which opened up an extensive discussion on what researchers have to consider when re-using photographs and what institutional support they expect to have. Some workshop attendees shared their experience of re-using photographs from collections and archives, and discussed the challenges they face with online publications.

The last presentation, tackling the basics of managing photographic research data, was not delivered due to time constraints. It was aimed at researchers who produce photographic materials; however, advice on best RDM practice is relevant to any researcher, regardless of whether they are producing primary data or reusing secondary data. There may be another opportunity to present the remaining slides to CHSS researchers at a future workshop.


Digital Scholarship Day of Ideas: Data

This year’s ‘Digital Scholarship Day of Ideas’ (14th May) focused on ‘data’ and what data means for the humanities and social sciences. This post summarises the presentation by Prof Annette Markham, the first speaker of the day. She started her presentation with an illustration of Alice in Wonderland, and then posed the question: what does data mean anyway?

Markham then explained how she had quit her job as a professor in order to enquire into the methods used in different disciplines. Since then, she has thought a lot about methods and methodologies, and has run many workshops on the theme of ‘data’. In her view, we need to be careful when using the term ‘data’ because, although we think we are talking about the same thing, we have different understandings of what the term actually means. So we need to critically interrogate the word and reflect upon the methodologies.

Markham talked about the need to look at ‘methods’ sideways, from above and from below. We need to collate as many insights into these methods as possible; we might then understand what ‘data’ means for different disciplines. Sometimes methods are tied to funding, which can be an issue in the current climate, because innovative data collection procedures that are not suitable for archiving are less valuable to funders. The issue is that not all research can be added to digital archives. For an ethnographer, a coffee stain in a fieldwork notebook has meaning, but this subtle meaning cannot be archived or be meaningful to others unless digitised and clearly documented.

Drawing on Gregory Bateson’s Steps to an Ecology of Mind (1972), she asked us to think about ‘frames’ and how these draw our attention to what is inside while dismissing what lies outside. If you change the frame with which you look, you change what you see. She showed, and suggested using, different kinds of frames: traditional picture frames, structures like the sphere, molecular structures. Different structures afford different ways of understanding, and convey the themes and ideas that are embedded within them.

Images: an empty picture frame; a sphere; the 3D structure of azithromycin.

As another example, she showed an image of McArthur’s Universal Corrective Map of the World to illustrate how our understanding of our environment changes when information is presented and structured in a different and unexpected way.

  • What happens when we change the frame?
  • How does the structure shape the information and affect the way we engage with it?

Satellite image of McArthur’s Austral-centric view of the world [Public domain]

1. How do we frame culture and experiences in the 21st Century? How has our concept of society changed since the internet?
Continuing the discussion on frames, she spoke about how the internet has brought about a significant frame shift. This new frame has influenced the way we interact with media and data. To illustrate this, she showed ‘City of News’ (Sparacino et al., 1997), an MIT project by Sparacino, Pentland, Davenport, Hlavac and Obelnicki that addressed this frame shift caused by the internet. The project presented a 3D information browsing system in which virtual buildings were the spaces where information was stored and retrieved. Through this example, Markham emphasised how our interaction with information and the methods we use for looking at social culture are changing, and so are the visual-technical frames we use to enquire into the world.

2. How do we frame objects and processes of enquiry?
She argued that this framing of objects and processes hasn’t changed enough. If we were to draw a picture or map of what research is and how the data in any research project are structured, we would end up with a multi-dimensional mass of connected blobs and lines rather than a neatly composed two-dimensional picture frame (research looks more like a molecular structure than a rectangular frame). However, we still associate qualitative research with traditional ethnographic methods, and we see quite linear, “neat and tidy” methods as legitimate. There is a need to look at new methods of collecting and analysing research ‘data’ if we are to enquire into socio-cultural changes.

3. How do we frame what counts as proper legitimate enquiry?
In order to change the frame, we have to involve the research community. The frame shift can happen, even if slowly, when established research methods are reinvented. Markham used 1960s feminist scholars as an example, for they approached their research using a frame that was previously inconceivable. This new methodological approach was based on situated knowledge production and embodied understanding, which challenged the way in which scientific research methods had been operating (for more on the subject, see Haraway 1988). But in the last decade at least, we have seen an upsurge of scientific research methods – evidence-based, problem-solving approaches – dominating funders’ and the media’s understanding of research.

So, what is DATA?
‘Data’ is often an easy term to toss around, as it stands for unspecified stuff. Ultimately, ‘data’ is “a lot of highly specific but unspecified stuff” that we use to make sense of the world around us, of a phenomenon. The term ‘data’ is arguably quite a powerful rhetorical word in the humanities and social sciences, in that it shapes what we see and what we think.

The term data comes from the Latin verb dare, to give. In light of this, ‘data’ is something that is already given in the argument – pre-analytical and pre-semantic. Facts and arguments might have theoretical underpinnings, but data is devoid of any theoretical value. Data is everywhere. Markham, referring to Daniel Rosenberg’s paper ‘Data before the fact’, pointed out that facts can be proved wrong, and then they are no longer facts; data, however, is always data, even when proven wrong. In the 1980s, she was trained not to use the term ‘data’; as she was told:

“we do not use it, we collect material, artifacts, notes, information…”

Data is conceived as something that is discrete, identifiable, disconnected. The issue, she said, was that ‘data’ poorly represents a conversation (gesture and embodiment) and the emergence of meaning from non-verbal information, because when we extract things from their context and use them as stand-alone ‘data’, we lose a wealth of information.

Markham then showed two ads (Samsung Galaxy SII and Global Pulse) to illustrate her concerns about life becoming datafied. She referenced Kate Crawford’s perspective on “big data fundamentalism”: not all human experiences can be reduced to big data, to digital signals, to data points. We have to trouble the idea of thinking about “humans (and their data) as data”. We don’t understand data as it is happening, and “data has never been raw”; data is always filtered, transformed. We need to use our strong and robust methods of enquiry, and these need not place data at centre stage; enquiry may instead be about understanding the phenomenon of what we have made, this thing called data. We have to remember that that is possible.

Data functions very powerfully as a term, and from a methodological perspective it creates a very particular frame. It warrants careful consideration, especially in an era where the predominant framework tells us that data is the really important part of research.

References

  • Image of Alice in Wonderland after original illustration, by Danny Pig (CC BY-SA 2.0)
  • Sparacino, Flavia, A. Pentland, G. Davenport, M. Hlavac and M. Obelnicki (1997). ‘City of News’ in Proceedings of Ars Electronica Festival, Linz, Austria, 8–13 Sep.
  • Bateson, Gregory (1972). Steps to an Ecology of Mind: Collected Essays in Anthropology, Psychiatry, Evolution, and Epistemology. Aylesbury: Intertext.
  • Frame by Hubert Robert [Public domain], via Wikimedia Commons
  • Sphere by anonymous (CC0 1.0) [Public domain]
  • Image of the 3D structure of azithromycin (CC BY-SA 3.0)
  • Map by Poulpy, from work by jimht[at]shaw[dot]ca, modified by Rodrigocd, from Image Earthmap1000x500compac.jpg [Public domain], via Wikimedia Commons
  • Rosenberg, Daniel (2013). ‘Data before the fact’ in Lisa Gitelman (ed.) “Raw Data” Is an Oxymoron. Cambridge, Mass.: MIT Press, pp. 15–40.


Rocio von Jungenfeld
Data Library Assistant

Science as an open enterprise – Prof. Geoffrey Boulton

As part of Open Access Week, the Data Library and Scholarly Communications teams in IS hosted a lecture by emeritus Professor Geoffrey Boulton drawing upon his study for the Royal Society, Science as an Open Enterprise (Boulton et al., 2012). The session was introduced by Robin Rice, the University of Edinburgh's Data Librarian. Robin pointed out that the University of Edinburgh was not just active but a leader in research data management, having been the first UK institution to adopt a formal research data management policy. Looking at who attended the event, perhaps unsurprisingly the majority were from the University of Edinburgh. Encouragingly, there was roughly a 50:50 split between those actively involved in research and those in support roles. I say encouragingly because it was later stated that policies often get high-level buy-in from institutions but have little impact on those actually doing the research. Perhaps more on that later.

For those who don’t know Prof. Boulton, he is a geologist and glaciologist and has been actively involved in scientific research for over 40 years. He is used to working with big things (mountains, ice sheets) over timescales measured in millions of years rather than seconds, and notes that while humanity is interesting, it will probably be short-lived!

Arguably, the way we have done science over the last three hundred years has been effective: science furthers knowledge. Boulton’s introduction made it clear that he wanted to talk about the processes of science and how they are affected by the gathering, manipulation and analysis of huge amounts of data: the implications, the changes in processes, and why openness matters in the process of science. This was going to involve a bit of a history lesson, so let’s go back to the start.

Open is not a new concept

Geoffrey Boulton talking about the origins of peer review

“Open is not a new concept”

Open has been a buzzword for a few years now. Sir Tim Berners-Lee and Prof. Nigel Shadbolt have made great progress in opening up core datasets to the public. But for science, is open a new concept? Boulton thinks not. Instead, he reckons that openness is at the foundations of science but has somehow got a bit lost recently. Journals originated as a vehicle to disseminate knowledge and trigger discussion of theories. Boulton gave a brief history of the origins of journals, pointing out that Henry Oldenburg is credited with founding the peer review process with the Philosophical Transactions of the Royal Society. The journal allowed scientists to share their thoughts and promote discussion. Oldenburg’s insistence that the Transactions be published in the vernacular rather than Latin was significant, as it made science more accessible. Sound familiar?

Digital data – threat or opportunity? 

We are having the same discussions today, but they are based around technology and, perhaps in some cases, driven by money. The journal publishing model has changed considerably since Oldenburg, and it was not the focus of the talk, so let us concentrate on the data. Data are now largely digital. Journals themselves are also generally digital. The sheer volume of data we now collect makes it difficult to include the data with a publication. So should data go into a repository? Yes, and some journals encourage this but few mandate it. Indeed, many of the funding councils state clearly that research output should be deposited in a repository but don’t seem to enforce this.

Replicability – the cornerstone of the scientific method


Geoffrey Boulton, mid-talk.

Having other independent scientists replicate and validate your findings adds credence to them. Why would you, as a professional scientist, not want others to confirm that you are correct? It seems quite simple, but it is not the norm. Boulton pointed us to a recent paper in Nature (Nature v483 n7391) which attempted to replicate the results of a number of studies in cancer research. The team found that they could only replicate 6, around 11%, of the studies. So were the other 89% fabricating their results? No, there are a number of reasons why the team could not replicate all the studies: the methodology may not have been adequately explained, leading to slightly different techniques being used; the base data may have been unobtainable; and so on. But the effect is the same: most of the previous work that the team looked at is uncorroborated science. Are we to trust its findings? Science is supposed to be self-correcting. You find something, publish, others read it, replicate and corroborate or pose an alternative, old theories are discounted (Science 101 time: the “null hypothesis”) and our collective knowledge is furthered. Boulton suggests that, to a large degree, this is not happening. Science is not being corroborated. We have forgotten the process on which our profession is based. Quoting Jim Gray:

“when you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful. We are embarrassed by our data.”

Moving forward (or backwards) towards open science

What do we need to do to support and advise our students and researchers, and to ensure materials are available to them, so that they can be confident about sharing their data? The University of Edinburgh does reasonably well but, like most institutions, we still have things to do.

Geoffrey looked at some of the benefits of open science, and while I am sure we all already know what these are, it is useful to have some high-profile examples that we can all aspire to following.

  1. Rapid response – some scientific research is reactive. This is especially true of research into epidemiology and infectious diseases. An outbreak occurs, it is unfamiliar, and we need to understand it as quickly as possible to limit its effects. During an E. coli outbreak in Hamburg, local scientists were struggling to identify the source. They analysed the strain and released the genome under an open licence. Within a week they had a dozen reports from four continents. This helped to identify the source of the outbreak and ultimately saved lives. (Rohde et al., 2011)
  2. Crowd-sourcing – mathematical research is unfathomable to many. Mathematicians are looking for solutions to problems. Working in isolation or in small research clusters is the norm, but is it effective? Tim Gowers (University of Cambridge) decided to break with convention and post the “problems” he was working on to his blog. The result: 32 days, 27 people, 800 substantive contributions. 800 substantive contributions! I am sure that Tim also fostered some new research collaborations among his 27 respondents.
  3. Change the social dynamic of science – “We are scientists, you wouldn’t understand” is not exactly a helpful stance to adopt. “We are scientists and we need your help” – now that’s much better! The rise of the app has seen a new arm of science emerge: “citizen science”. The crowd, or sometimes the informed crowd, is a powerful thing. With a carefully designed app you can collect a lot of data from a lot of places over a short period. Projects such as ASHtag and LeafWatch are just two examples where the crowd has been usefully deployed to help collect data for scientists. Actually, this has been going on for some time in different forms – do you remember the SETI@Home screensaver? It’s still going, with 3 million users worldwide processing data for scientists since 1999.
  4. Openness and transparency – no one wants another “Climategate”. In fact, Climategate need not have happened at all. Much of the data was already publicly available and the scientists had done nothing wrong. Their lack of openness was seen as an admission that they had something to hide, and this was used to damaging effect by climate sceptics.
  5. Fraud – open data is crucial as it shines the light on science and the scientific technique and helps prevent fraud.

What value if not intelligent?

However, Boulton’s closing comments made the point that openness has little value if it is not “intelligent”, meaning that the data must be:

  • accessible (can it be found?)
  • intelligible (can you make sense of it?)
  • assessable (can you examine the data objectively?)
  • re-usable (does it have sufficient metadata to describe how it was created?)

I would agree with Boulton’s criteria but would personally tighten the accessible entry. In my opinion, data is not open if it is buried in a PDF document. OK, I may be able to find it, but getting the data into a usable format still takes considerable effort and, in some cases, skill. The data should be ready to use.
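To make that concrete, here is a minimal sketch in Python of the difference in effort; the file names and URL are hypothetical, and the PDF route shown uses the pdfplumber library as one possible extraction tool, not a recipe from the talk.

```python
# A minimal sketch (not from the talk) contrasting data that is ready
# to use with the same table buried in a PDF. File names and the URL
# are hypothetical.
import pandas as pd
import pdfplumber

# Published as CSV: one line, and the column structure survives intact.
df_csv = pd.read_csv("https://example.org/rainfall.csv")  # hypothetical URL

# Buried in a PDF: page-by-page table detection, then manual clean-up
# of headers and number formats before any analysis can start.
rows = []
with pdfplumber.open("rainfall_report.pdf") as pdf:  # hypothetical file
    for page in pdf.pages:
        table = page.extract_table()  # list of rows, or None if no table found
        if table:
            rows.extend(table)

# Assume the first extracted row is the header; real PDFs often need far
# more work than this (merged cells, rows split across pages, footnotes).
df_pdf = pd.DataFrame(rows[1:], columns=rows[0])
```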

Of course, not every dataset can be made open. Many contain sensitive data that needs to be guarded, as it could perhaps identify an individual. There are also safety and security considerations that may prevent data becoming open. In such cases, perhaps the metadata could be open and identify the data custodian.
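As a rough illustration of that suggestion, a metadata-only record might look something like the sketch below; the field names and values are purely illustrative, loosely modelled on Dublin Core/DataCite-style elements rather than the schema of any particular repository.

```python
# Purely illustrative sketch of a metadata-only record for a restricted
# dataset: the description is open and discoverable, while the data
# itself stays with a named custodian. All names and values are
# hypothetical.
restricted_dataset_record = {
    "title": "Linked health survey responses, 2010-2013",
    "description": "Survey microdata containing potentially disclosive "
                   "personal information; the data files are not public.",
    "creator": "Example Research Group, University of Example",
    "access_rights": "restricted",
    "access_conditions": "Approved researchers only, under a data use "
                         "agreement; apply to the data custodian.",
    "custodian_contact": "data-custodian@example.ac.uk",  # hypothetical
    "date_created": "2013-06-30",
    "keywords": ["health survey", "microdata", "restricted access"],
}
```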

Questions and Discussion

One of the first questions from the floor focused on the fuzzy boundaries of openness: the questioner was worried that scientists could, and would, hide behind “legitimate commercial interest”, since all data has value and research is important within a university’s business model. Boulton agreed, but suggested that publishers could do more and force authors to make their data open. Since we are, in part, judged by our publication record, you would have to comply and publish your data; monetising the data would then have to be a separate matter. He alluded to the pharmaceutical industry, long perceived to be driven by money but which has recently moved to be more open.

The second question followed on from this, asking if anything could be learned from the licences used for software, such as the GNU and Apache licences. Boulton stated that the government is currently looking at how to license publicly-funded research. What is being considered at the EU level may be slightly regressive, shaped by lobbying from commercial organisations. There is a lot going on in this area at the moment, so keep your eyes and ears open.

The final point from the session sought clarification of the University of Edinburgh’s research data management policy. Item nine states:

“Research data of future historical interest, and all research data that represent records of the University, including data that substantiate research findings, will be offered and assessed for deposit and retention in an appropriate national or international data service or domain repository, or a University repository.”

But how do we know what is important, or what will be deemed significant in the future? Boulton agreed that this was almost impossible.  We cannot archive all data and inevitably some important “stuff” will be lost – but that has always been the case.


The audience for Geoffrey Boulton’s talk as part of Open Access Week at UoE

My Final Thoughts on Geoffrey’s Talk

An interesting talk. There was nothing earth-shattering or new in it, but it was a good review of the argument for openness in science from someone who actually has the attention of those who need to recognise the importance of the issue and take action on it. But instead of just being a top-down talk, there was certainly a bottom-up message. Why wait for a mandate from a research council or a university? There are advantages to be had from being open with your data, and these benefits are potentially bigger for the early adopters.

I will leave you with an aside from Boulton on libraries…

“Libraries do the wrong thing, employ the wrong people.”

For good reasons we’ve been centralising libraries, but perhaps we have to reverse that. Publications are increasingly online, but soon it will be the data that we seek, and tomorrow’s librarians should be skilled data analysts who understand data and data manipulation. Discuss.


Addy Pope

Research and Geodata team, EDINA