Edinburgh Research Data Blog

Tag Archives: metadata

DataShare Spotlight: The Hiberlink project, and anticipating change in a data repository

Posted on 20 March 2025 by Evelyn Williams

While looking through DataShare for datasets about research practice, I found the Hiberlink project dataset [1], which perfectly illustrates some of the challenges the changing digital research landscape poses to good research practice.

The Hiberlink project investigated ‘reference rot’, which happens when a publication includes links that no longer contain the information the authors originally cited. The Hiberlink researchers found one in five articles in their sample contained at least one link which didn’t resolve to the originally referenced content [2].

This project garnered lots of press coverage around 2015-16, and is just as relevant 10 years on, still resurfacing on Twitter every so often. Sometimes the coverage takes a negative bent, from those who view the reproducibility crisis as proof that academia is a waste of tax money and human potential. More often though, it comes from people advocating better, more open research practices.

Richard Tobin, who deposited the Hiberlink dataset, did a good job of completing the metadata and helpfully included a link to the project’s website, hiberlink.org. However, if you follow this link it appears to have expired and been purchased by an unknown party selling “virtual data rooms”. None of us are safe.

Screenshot of the Hiberlink website having been taken over by a company selling something called virtual data rooms.

Unknown author, “What Is a Virtual Data Room? A Guide to Secure Online Document Management,” Hiberlink (https://hiberlink.org : 5 February 2025); archived at Wayback Machine (https://web.archive.org/) > https://hiberlink.org > 19 March; citing a capture dated 15 March 2025.

Why links die and why it matters

Change is a given in research projects nowadays. The path of enquiry shifts, researchers leave the university and others join, and new tools emerge for carrying out research and sharing findings. Researchers share their work on blogs, project websites, and a seemingly endless chain of social media platforms, moving with the digital tides.

But digital repositories, which are often modelled after physical archives and aim to take a static snapshot of research, can struggle to represent research projects in flux. Hegarty’s article [3] looks at how nascent web archives adopted archiving frameworks from the field of librarianship.

Traceability is a core tenet of open research, but traceability can be difficult online. Digital and web-based resources often evolve over time, unlike traditionally published articles. URLs in publications or metadata may expire, or the content on the linked page may radically change. Changing content makes the research process less transparent and makes research findings harder to verify and reproduce.

Academic staff are often encouraged to have a strong online presence. Sharing research online is considered an important part of doing ‘open research’, and a strong online presence is seen as important for academics’ personal career development. But most of us don’t often think much about the web of links created during a research project. These links can drift and degrade as researchers leave an institution or move on to new projects, as a pot of funding runs dry, or as reminders to renew a domain registration end up in a spam folder.

The case of web domains is especially interesting, as they can be bought by anyone. Old links to defunct external web pages may lend credibility to questionable groups or dodgy businesses that claim a domain once a researcher’s registration expires.

Why links pose a challenge for data management

While depositing a research dataset in an open access repository seems like a natural conclusion to a project, you might be horrified to learn that the work doesn’t stop here. In data management plans we think about how we can share, store and preserve our data for the future. We don’t usually consider how all the peripheral stuff around our data might also need to be preserved and maintained.

We usually recommend that instead of using a live URL to link to a project website, researchers cite archived copies of the web pages and use persistent identifiers (like DOIs) where possible. The authors of the Hiberlink paper encourage researchers to reference archived versions of web pages, such as those held by the non-profit Internet Archive. But as could have been the case with the Hiberlink project, linking to an archived copy isn’t ideal if a project website is still being actively updated and you want to direct people to the most recent version.
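To make those two recommendations concrete, here is a minimal Python sketch that builds a DOI link and a link to a dated Wayback Machine capture. The function names are made up for illustration; the URL patterns are simply the ones visible in references [1] and [5] at the end of this post.

```python
from datetime import datetime


def doi_url(doi: str) -> str:
    """Return the resolver link for a DOI such as '10.7488/ds/230'."""
    return "https://doi.org/" + doi.strip()


def wayback_url(live_url: str, captured: datetime) -> str:
    """Return a link to the Wayback Machine capture taken at a given time.

    The capture time is written as 14 digits (YYYYMMDDhhmmss),
    followed by the original live URL.
    """
    stamp = captured.strftime("%Y%m%d%H%M%S")
    return "https://web.archive.org/web/" + stamp + "/" + live_url


print(doi_url("10.7488/ds/230"))
print(wayback_url("http://hiberlink.org/", datetime(2022, 3, 6, 17, 42, 47)))
```

Citing the dated capture alongside the live URL lets readers see both what the page said when you cited it and what it says now.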

What can we do?

So, beyond using persistent identifiers, how can we make sure that a repository doesn’t contain a bunch of dead links? The obvious answer is that researchers should periodically review all their items in the research repositories at every university they’ve worked at, and update the metadata… forever? As is often the case with good research practice, the solutions require an ongoing investment of time that researchers simply don’t have.
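Part of that periodic review could in principle be automated. The sketch below is only an illustration, not something DataShare does: it assumes you stored a fingerprint (here a SHA-256 hash) of each linked page when you deposited, and all function names are hypothetical. It separates links that no longer resolve ("dead") from links that resolve but serve different content ("changed"). The second case is the hiberlink.org situation, where the link still works but the content belongs to someone else.

```python
import hashlib
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple

# A fetcher takes a URL and returns (HTTP status, body); status 0 means "no response".
Fetcher = Callable[[str], Tuple[int, bytes]]


def fetch_with_urllib(url: str, timeout: float = 10.0) -> Tuple[int, bytes]:
    """Fetch a URL with the standard library, turning errors into status codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:  # server answered with an error status
        return err.code, b""
    except (OSError, ValueError):  # DNS failure, timeout, malformed URL
        return 0, b""


def fingerprint(body: bytes) -> str:
    """Hash the page body so later checks can detect any change."""
    return hashlib.sha256(body).hexdigest()


def check_link(url: str,
               known_fingerprint: Optional[str] = None,
               fetch: Fetcher = fetch_with_urllib) -> str:
    """Classify a link as 'dead', 'changed' or 'ok'."""
    status, body = fetch(url)
    if status == 0 or status >= 400:
        return "dead"
    if known_fingerprint is not None and fingerprint(body) != known_fingerprint:
        return "changed"
    return "ok"
```

An exact hash is a crude test, since any edit to the page, however small, counts as a change. A real check would need a gentler comparison, and someone would still have to decide what to do with the results.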

Maybe as data archivists, we have to accept that metadata will decay, and that messy metadata is better than invisible research data.

Flow chart showing the data management process, with 'Update metadata' stuck inside an endless loop.

The metadata update hell loop

How can you avoid reference rot?

For help figuring out how you can preserve your websites, see the university’s resources on web archiving [4]. Dr Alice Austin runs regular training sessions on web archiving, which I recently attended and found very useful. And if you’d like advice on linking all your archived research outputs together in a way that ensures they’ll be useful in years to come, get in touch with us!

Learn more about the Hiberlink project

You can read more about these issues in the archived versions of hiberlink.org from before March 2022 in the Internet Archive [5].

If you’d like to read more about reference rot, the authors of this project wrote two great papers on this phenomenon:

Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot [2] looks at reference rot in over 3.5 million STEM articles from 1997-2012 and finds that one in five contains a link which doesn’t resolve to the originally referenced content.

No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving [6] explores the use of machine learning to identify links at risk of rotting.

While writing this blog post I spoke with Richard Tobin, one of the researchers on the Hiberlink project, and he had this to say: “Neither of us is working on related research now, but it would be interesting if someone were to take the data and see how much further the links have decayed.”

 

Links

[1] Tobin, Richard; Grover, Claire; Zhou, Ke. (2015). Hiberlink project data, 1997-2012 [dataset]. University of Edinburgh. https://doi.org/10.7488/ds/230.

[2] Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, Zhou K, et al. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. https://doi.org/10.1371/journal.pone.0115253

[3] Hegarty, K. (2022). The invention of the archived web: tracing the influence of library frameworks on web archiving infrastructure. Internet Histories, 6(4), 432–451. https://doi.org/10.1080/24701475.2022.2103988

[4] The university’s web archiving service: https://library.ed.ac.uk/heritage-collections/collections-and-search/archives/digital-archives-and-preservation/web-archiving

[5] Archived version of Hiberlink project website. https://web.archive.org/web/20220306174247/http://hiberlink.org/

[6] Ke Zhou, Claire Grover, Martin Klein, and Richard Tobin. (2015). No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’15). Association for Computing Machinery, New York, NY, USA, 233–236. https://doi.org/10.1145/2756406.2756940

 

Posted in digital preservation | Tagged archiving data, DataShare, digital preservation, metadata, reference rot, research integrity

Thinking about Research Data Asset Registers

Posted on 12 December 2013 by slewis23

[Reposted from https://libraryblogs.is.ed.ac.uk/blog/2013/12/12/thinking-about-research-data-asset-registers/]

In my last blog post, I looked at the four quadrants of research data curation systems.  This categorised systems that manage or describe research data assets by whether their primary role is to store metadata or data, and whether the information is for private or public use.  Four systems were then put into these quadrants.

quadrants3

The University of Edinburgh already has two active services from this diagram: PURE, our Current Research Information System and DataShare, our open data repository.

This blog post will start to unpack some of the requirements for a Data Asset Register.

The first aspect to cover is its name.  What should it be called?  Traditionally systems like this, which only hold metadata records that either just describe, or describe and point to other resources, are known as registers, catalogues, directories, indexes, or inventories.

The University already has a ‘Data Catalogue’, maintained by the Data Library.  However this list has a different purpose: to hold details of external data.  Oxford University chose not to use a name of this kind, and instead named their service after the verb ‘find’: DataFinder.  Whilst there may be some brand or service name applied to the system we create at the University of Edinburgh, for now its working title is ‘Data Asset Register’, as one of its main functions will be to allow data creators to ‘register’ their data assets by describing them and, if the data is published online, linking to the data.

But what should the Data Asset Register provide?  The following diagram shows some early thoughts:

dar-thoughts

The diagram splits this up into three broad areas:

  • Description – what the asset register should describe
  • Functions – the functions needed to allow data asset description
  • Services – the value-added services that will add benefit to people who register their data

Description

The core purpose of the system is to describe data.  This is split into two categories: being able to describe single items or data assets, and describing collections of data assets.  Many data assets are created on their own, for example a population health longitudinal study.  As such, this should be described on its own.  In contrast, some data are created in large sets, where it isn’t necessarily useful to describe every part of that set on its own.  In this case, the collection as a whole can be described. A good example of this is the Research Data Australia service from the Australian National Data Service.

We’ll need to decide how to describe the data.  A likely initial candidate will be the DataCite Metadata Schema, but we may find this needs to be extended to cover extra elements relevant to the University or the discipline of the data asset being described.  There will also be requirements coming from a possible UK research data registry, development of which is being led by the Digital Curation Centre.
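As a rough illustration of what a minimal description might hold, the Python sketch below models one record. The field names are shorthand for the properties that the current DataCite schema treats as mandatory (identifier, creator, title, publisher, publication year and resource type); they are not the schema's own serialisation, and the class and function names are hypothetical. The example values are taken from a dataset record in DataShare.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DataAssetDescription:
    identifier: str        # e.g. a DOI
    creators: List[str]
    title: str
    publisher: str
    publication_year: int
    resource_type: str


def missing_properties(record: DataAssetDescription) -> List[str]:
    """List the fields that are empty, so a form can prompt for them."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


example = DataAssetDescription(
    identifier="10.7488/ds/230",
    creators=["Tobin, Richard", "Grover, Claire", "Zhou, Ke"],
    title="Hiberlink project data, 1997-2012",
    publisher="University of Edinburgh",
    publication_year=2015,
    resource_type="Dataset",
)
print(missing_properties(example))
```

Any extra University or discipline-specific elements could then be added as optional fields without disturbing the core set.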

Functions

In order to enable data asset description, a register will need certain functions.  So far three have been identified:

  1. CRUD: Create / Read / Update / Delete are the basic functions required when manipulating data.  The system should allow records of research data to be created, read later, updated, and if needed, deleted.
  2. User Interface (UI): In order to enable CRUD functionality, a user interface will be required.  To be useful, this will need to provide search and display functionality, for example using faceted search and browse.
  3. Log: Some funders have requirements to keep data for certain lengths of time, or for periods of time that must be reset each time a data set is accessed.  For this reason each access of a data asset must be logged by the system.  An example is from the EPSRC:

“Research organisations will ensure that EPSRC-funded research data is securely preserved for a minimum of 10-years from the date that any researcher ‘privileged access’ period expires or, if others have accessed the data, from last date on which access to the data was requested by a third party;”
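The CRUD and Log functions, together with the retention rule quoted above, can be sketched in a few lines of Python. This is an in-memory illustration only: the class and method names are hypothetical, and starting the 10-year clock from whichever is later, the end of privileged access or the last logged third-party request, is one reading of the EPSRC wording.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List

RETENTION_YEARS = 10  # minimum period from the EPSRC expectation


def add_years(d: date, years: int) -> date:
    """Add whole years, mapping 29 February to 28 February when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


@dataclass
class AssetRecord:
    title: str
    privileged_access_ends: date
    access_log: List[date] = field(default_factory=list)


class DataAssetRegister:
    def __init__(self) -> None:
        self._records: Dict[str, AssetRecord] = {}

    def create(self, asset_id: str, record: AssetRecord) -> None:
        if asset_id in self._records:
            raise KeyError("record already exists: " + asset_id)
        self._records[asset_id] = record

    def read(self, asset_id: str) -> AssetRecord:
        return self._records[asset_id]

    def update(self, asset_id: str, **changes) -> None:
        record = self._records[asset_id]
        for name, value in changes.items():
            if not hasattr(record, name):
                raise AttributeError(name)
            setattr(record, name, value)

    def delete(self, asset_id: str) -> None:
        del self._records[asset_id]

    def log_access_request(self, asset_id: str, when: date) -> None:
        """Record a third-party access request, which resets the retention clock."""
        self._records[asset_id].access_log.append(when)

    def retain_until(self, asset_id: str) -> date:
        record = self._records[asset_id]
        start = max([record.privileged_access_ends, *record.access_log])
        return add_years(start, RETENTION_YEARS)
```

Because every access request moves the retention date forward, the log is not just an audit trail: it is an input to deciding when data may eventually be removed.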

It may also be that the Data Asset Register can be a front-end for our Data Vault too – more about that in another blog post!

Services

Extra value-added services are required in order to make the Data Asset Register useful to people.  Our initial thoughts about these services include the following:

  • Identify: The ability to assign identifiers to data assets.  Some of these identifiers will need to be persistent.
    • DOI: DataCite allows DOIs to be assigned to data assets, in the same way that DOIs are assigned to journal articles.  This allows the assets to be persistently identified over time even if they move between systems, and also allows them to be cited using a well-known identifier.
    • TinyURL: Short URLs such as those provided by TinyURL or bitly are useful for giving objects easy web identifiers.  For example it might be nice to be able to issue URLs such as http://data.ed.ac.uk/abcd.
    • Other: Are there any other identifier systems that we should consider using?
  • Discover: It is important that the data records held in the Data Asset Register are searchable and can be indexed by external services.  This may be by national, international, or discipline-based data aggregators, or by normal web search engines.
  • Share: Whilst often the data assets will be described online but kept offline by the researcher, they may wish to share the data.  The Data Asset Register may need to facilitate this in a number of ways:
    • Deposit:  If the data is held in the Data Vault, along with a description in the Data Asset Register, then using a deposit protocol such as SWORD it would be possible to deposit the data into the institutional data repository, or into an external repository.  The Data Asset Register can then record the identifier for the hosted data set.
    • Redirect:  Where the data is hosted online elsewhere, the Data Asset Register could automatically redirect users.  For example visiting http://data.ed.ac.uk/abcd could redirect a visitor directly to the repository, rather than showing them just the data asset record description.  If the data is not shared openly, then contact details can be provided of the data owner.
    • RCUK: Some funders, such as the RCUK members (Research Councils UK) require funded journal papers to include “a statement on how the underlying research materials – such as data, samples or models – can be accessed”.  The data asset register could facilitate this by automatically writing statements such as “Details about accessing the data referenced in this paper may be found at http://data.ed.ac.uk/abcd”
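The Redirect and RCUK ideas above can be sketched briefly in Python. Everything here is illustrative: the class and function names are hypothetical, and http://data.ed.ac.uk/ is only the example address used in this post, not a live service.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BASE_URL = "http://data.ed.ac.uk/"  # example base address from this post


@dataclass
class RegisteredAsset:
    short_id: str
    hosted_url: Optional[str] = None     # set when the data is online elsewhere
    owner_contact: Optional[str] = None  # shown when the data is not shared openly


def resolve(asset: RegisteredAsset) -> Tuple[str, str]:
    """Decide what a visitor to the short URL should get."""
    if asset.hosted_url:
        return ("redirect", asset.hosted_url)
    return ("contact", asset.owner_contact or "contact details not recorded")


def access_statement(asset: RegisteredAsset) -> str:
    """Write the data access sentence for inclusion in a journal paper."""
    return ("Details about accessing the data referenced in this paper "
            "may be found at " + BASE_URL + asset.short_id)


open_asset = RegisteredAsset("abcd", hosted_url="https://doi.org/10.7488/ds/230")
print(resolve(open_asset))
print(access_statement(open_asset))
```

The same short identifier serves both purposes: it is what the visitor follows and what the paper cites, so the statement stays valid even if the hosted location later changes.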

It is very early days in our thinking about what features a Data Asset Register should offer, and like many components of a modern research data management infrastructure, there are very few existing examples to look at.  Our thoughts will be refined over the coming months so that we can start looking at implementation options.  Is there an existing system that can do all of this for us, or is it better to build something new, either alone or with collaborators?

Images available from http://dx.doi.org/10.6084/m9.figshare.873617

Stuart Lewis, Head of Research and Learning Services, Library & University Collections.

Posted in Data stewardship, Roadmap progress | Tagged data asset register, interoperability, metadata, open data, private data

The four quadrants of research data curation systems

Posted on 6 December 2013 by slewis23

[Reposted from https://libraryblogs.is.ed.ac.uk/blog/2013/12/06/the-four-quadrants-of-research-data-curation-systems/]

The University of Edinburgh, like many other universities, is currently undertaking extensive work to build infrastructure that supports and enables good practice in the area of Research Data Management.  This infrastructure ranges from large-scale research storage facilities to data management planning tools.

One aspect of Research Data Management highlighted in the University’s RDM Roadmap is ‘Data stewardship: tools and services to aid in the description, deposit, and continuity of access to completed research data outputs.’

To help describe how these systems fit together yet how they differ from each other, I use a model with two axes to differentiate what they hold, and who can access them.  The first axis distinguishes systems that hold only metadata from those that hold files (typically with some level of metadata), while the second distinguishes private systems from public systems.

quadrants1

Research information and data management and associated systems aren’t a new phenomenon. We have been offering services in these areas for some time.  To demonstrate this, we have two existing systems that provide services in two of the areas:

  • PURE is our Current Research Information System (CRIS).  It is a private system for the University to record the research outputs it generates.  It therefore falls into the metadata / private quadrant. (It can hold files, and has a public interface, but this is primarily for Open Access publications rather than research data).
  • DataShare is our open research data repository.  It holds and curates data (and associated metadata) for public consumption on behalf of the data creators.  It therefore falls into the data / public quadrant.

quadrants2

What about the other two quadrants?  Are there systems or infrastructure needed to fill these?  Is there a case where we need a public store of metadata about research data, or a private store of finished data sets?

The rest of this blog post will argue that there is a need for these, and will describe two pieces of infrastructure that could fill them.  Further blog posts will be written that start to unpick the requirements of these systems in more depth.

Public Metadata: Not only is it good practice for a research institution to know what research data it is creating, some research funders require us to do so.  In addition, the University’s RDM policy requires:

“Any data which is retained elsewhere, for example in an international data service or domain repository should be registered with the University.”

The following is an extract from the EPSRC’s expectations for research data management:

 “Research organisations will ensure that appropriately structured metadata describing the research data they hold is published (normally within 12 months of the data being generated) and made freely accessible on the internet; in each case the metadata must be sufficient to allow others to understand what research data exists, why, when and how it was generated, and how to access it. Where the research data referred to in the metadata is a digital object it is expected that the metadata will include use of a robust digital object identifier (For example as available through the DataCite organisation – http://datacite.org/).”

This need can be fulfilled by the creation of a Data Asset Register.

Private Data: Whilst some data will be suitable for public sharing, for various reasons some will not, or will need to have access controlled by the data creator.  Therefore there is a need for a safe place for keeping data that will be kept secure, both in terms of access and change.  Once lodged/archived there, files should only be accessible by the data creator or data manager, and it should not be possible to change files, but only to create newer versions or to remove/delete them.

This need can be fulfilled by the creation of a Data Vault.

quadrants3

Systems however do not live in isolation, and become more powerful, more useful, and more likely to be used if they are able to integrate with each other.  With the ever-growing number of ‘systems’ provided by a large research-intensive university, the last thing that a research data management programme wants to do is to introduce further systems that need to be fed with duplicate information.  This means that some or all of the components will need to be integrated together.

There are three obvious integrations between these systems, as shown below:

quadrants4

First, because PURE is the master system for holding data and relationships about research outputs (THIS grant, funded THAT piece of equipment, which was used to create THIS data set, that was described in THESE journal articles), records of data sets need to exist within it.  However if some or all of these are being created in the Data Asset Register, then they will need to be pushed into PURE.  Equally if some data are being registered directly in PURE, it will be useful to pull this out of PURE and into the Data Asset Register.

Secondly, because the Data Asset Register may become the main user interface for entering details of data sets, it could also be the main administrative user interface for uploading files into the Data Vault.  If that is the case, then the Data Asset Register and the Vault will need to be integrated.

Finally, for instances where metadata is held in the Data Asset Register, corresponding files are held in the Data Vault, and the data owner decides to make the data openly available, then the Data Asset Register should be able to deposit these as a new item in the Data Repository.
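The quadrant model and the integrations described above can be written down as a small data structure. The Python sketch below is only a restatement of this post for illustration: the type and function names are hypothetical, and it assumes that the ‘Data Repository’ in the final integration is DataShare.

```python
from enum import Enum
from typing import List, NamedTuple, Tuple


class Holds(Enum):
    METADATA = "metadata"
    DATA = "data"


class Access(Enum):
    PRIVATE = "private"
    PUBLIC = "public"


class System(NamedTuple):
    name: str
    holds: Holds
    access: Access


SYSTEMS = [
    System("PURE", Holds.METADATA, Access.PRIVATE),
    System("Data Asset Register", Holds.METADATA, Access.PUBLIC),
    System("Data Vault", Holds.DATA, Access.PRIVATE),
    System("DataShare", Holds.DATA, Access.PUBLIC),
]

# (source system, target system, what flows between them)
INTEGRATIONS: List[Tuple[str, str, str]] = [
    ("Data Asset Register", "PURE", "push data set records"),
    ("PURE", "Data Asset Register", "pull data set records"),
    ("Data Asset Register", "Data Vault", "upload files"),
    ("Data Asset Register", "DataShare", "deposit openly available data"),
]


def quadrant(holds: Holds, access: Access) -> str:
    """Return the name of the system that occupies a quadrant."""
    return next(s.name for s in SYSTEMS if s.holds == holds and s.access == access)


def integrations_for(name: str) -> List[Tuple[str, str, str]]:
    """Return every integration that a system takes part in."""
    return [link for link in INTEGRATIONS if name in link[:2]]
```

Laid out like this, it is easy to see that every integration involves the Data Asset Register, which is why it is a natural candidate for the main user interface.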

The next challenge will be to describe the requirements for the Data Vault and Data Asset Register.  We have some early thoughts about this, and will share these in future blog posts.

Images available from http://dx.doi.org/10.6084/m9.figshare.873617

Related blog posts:

  • Thinking about Research Data Asset Registers
  • Thinking about a Data Vault

 

Stuart Lewis, Head of Research and Learning Services, Library & University Collections.

Posted in Data stewardship, Roadmap progress | Tagged data asset register, data curation, DataVault, metadata, private data

Creative Commons Licence

Licensed under a Creative Commons Attribution 4.0 International Licence.

Banner photo by Rene Böhmer on Unsplash

The University of Edinburgh is a charitable body, registered in Scotland, Reg. Number SC005336, VAT Reg. Number GB 592 9507 00.

This blog is managed by The Digital Library, on behalf of the Research Data Service at the University of Edinburgh.
