Graveyards and ghosts in web archiving

October 1969 was a busy month. Monty Python’s Flying Circus aired for the first time; Steve McQueen, Trey Parker and PJ Harvey were born; and on a dark, dark night (or about 10.30pm on the 29th), a 21-year-old UCLA student called Charley Kline started to transmit a message to the Stanford Research Institute using the Advanced Research Projects Agency Network. He meant to send the word ‘LOGIN’ – but the receiving system crashed at ‘LO’. And thus, the internet was born.

Well, kind of. It would take another 20 years for Tim Berners-Lee’s World Wide Web to be fully developed, but 29th October marks the first day that an electronic message was sent over a network. Each year this date is commemorated as World Internet Day, and as this time of year also sees us mark Halloween and World Digital Preservation Day on November 2nd, there couldn’t be a better time to share some of the web archiving work we’ve been doing to preserve the University of Edinburgh’s historical web content using tombstone pages. That’s right – spooky internet preservation!

There are lots of ghoulish metaphors that surround digital content. We talk about dead or dying platforms (there’s a graveyard, for example, of services that have been ‘killed off’ by Google); the term ‘digital zombie’ is used to describe someone unable to detach themselves from their online world at a cost to their IRL self; and there is a growing field of study exploring the concept of ‘digital afterlives’ – the potential for us to live on through our digital traces on social media long after we have died.

Perhaps archivists are professionally predisposed to morbid thoughts about what will survive of us into the future – after all, heritage work always requires imagining a time when we are long gone and subsequent generations are trying to understand what happened, how, and why – and web archives are no different.

What is web archiving, and why are we doing it?

Web archiving is the process of creating reliable copies of web-based content for long-term preservation. Information on the web disappears at a rapid pace as content is changed, updated, and removed, and once something has been removed there isn’t always a reliable way of getting it back. How often have you clicked on a link only to reach a dreaded ‘404’ page? A 2015 study by the UK Web Archive reported that in just two years, 40% of websites collected in the national web archive had disappeared.

The University of Edinburgh has been publishing content on the internet since the late 1990s, but we can’t keep all of our content online for all time. There are lots of reasons why it’s important that the University regularly audits its web estate. Firstly, it’s just good digital housekeeping to regularly review which pages are currently online. Hosting a blog, for example, can take up a lot of space on our servers, as we have to store each image, video, and audio file, and this storage all builds up.

Secondly, there’s an environmental impact to consider: storing and serving up the images, files, and code necessary to render a website uses precious resources (did you know the internet in all its forms currently produces the same level of carbon emissions as the aviation industry?). Removing content that is no longer needed allows us to host only what we need and helps us reduce our carbon impact.

Thirdly, we don’t want outdated or incorrect information to remain on the live web. There’s nothing more frustrating than thinking you’ve finally found a page that answers your question – only to find out that it’s years old and the information is out of date. We want our webpages to provide students, staff, and the wider community with the most accurate and up-to-date information possible about our activities, so that means taking down things that are no longer applicable.

But a lot of these older websites and webpages are a valuable record of the University’s past activities, and we don’t want to lose them completely! So how can we continue to provide access to older content without it taking up much-needed resources and space on the University’s servers?

Here lies old content, gone but not forgotten

A gravestone with the words 'here lies old content, GBNF' inscribed on it
This is where ‘tombstoning’ comes in. A tombstone is exactly what it sounds like: it’s a way of marking the place where ‘dead’ content used to ‘live’. A tombstone page replaces the content and points to the archived copy instead.
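
For the technically curious, here is a minimal sketch of what generating a tombstone page could look like. It is an illustration only, not the University's actual implementation: the function name `build_tombstone_page`, the wording on the page, and the example.org URLs are all hypothetical.

```python
from html import escape


def build_tombstone_page(title: str, original_url: str, archive_url: str) -> str:
    """Return a minimal HTML page that marks where old content used to live.

    Hypothetical helper for illustration: the page states that the content
    has been retired and links to its archived copy.
    """
    # Escape every value so titles or URLs containing &, <, > or quotes
    # cannot break the markup.
    safe_title = escape(title)
    safe_original = escape(original_url)
    safe_archive = escape(archive_url)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '  <meta charset="utf-8">\n'
        f"  <title>Archived: {safe_title}</title>\n"
        "</head>\n"
        "<body>\n"
        f"  <h1>{safe_title}</h1>\n"
        f"  <p>This content, formerly at {safe_original}, is no longer maintained.</p>\n"
        f'  <p><a href="{safe_archive}">View the archived copy</a></p>\n'
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Placeholder values; a real tombstone would point at an actual archive capture.
    print(
        build_tombstone_page(
            "Old project news",
            "https://www.example.org/old-news",
            "https://archive.example.org/capture/old-news",
        )
    )
```

The tombstone itself is tiny: a title, a sentence explaining that the content has been retired, and a link to the archived copy, which is what lets it take up so much less room than the page it replaces.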

Tombstoning is a big part of the web archiving programme we’ve been developing over the last few months, and you may have already spotted some tombstone pages on the University’s website – like on the Centre for the History of the Book’s former site or the Languages, Literature and Cultures news page – and more will be coming in the next few months as we work to extend the web archiving programme to other areas of the web estate.

Ghosts in the machine

But what about the archived pages that we’re actually linking to? Continuing our spo0o0o0oky metaphor, these archive captures are our ghosts. All archives are ghosts – an echo or impression left behind by past activity – but this is especially true of web archives. When we archive a page, we’re not just taking a screenshot or saving the HTML. Instead, our goal is to make a copy of the site that will allow us to reproduce how the site functioned when it was live. In this way, web archives are ghosts in the machine that we can resurrect (recreate within a replayer) and through them, we can communicate with the past.

Does that make a replayer an Ouija board? Can you be haunted by the ghosts of content past that you thought had been removed? Can this metaphor possibly be stretched any further?

A planchette pointed at the address bar of a new browser tab

**EDIT: The UK Web Archive has sadly been affected by the massive tech outage suffered by the British Library. Reports that this has been caused by ghouls and gremlins in the system have yet to be confirmed…**