Tag Archives: Collections

IIIF Conference, Washington, May 2018

Washington Monument & Reflecting Pool
Washington Monument & Reflecting Pool

We (Joe Marshall (Head of Special Collections) and Scott Renton (Library Digital Development)) visited Washington DC for the IIIF Conference from 21st-25th May. This was a great opportunity for L&UC, not only to visit the Library of Congress- the mecca of our industry in some ways- but also to come back with a wealth of knowledge which we could use to inform how we operate.

Edinburgh gave two papers- the two of us delivering a talk on Special Collections discovery at the Library and how IIIF could make it all more comprehensible (including the Mahabharata Scroll), and Scott spoke with Terry Brady of Georgetown University showing how IIIF has improved our respective repository workflows.

From a purely practical level, it was great to meet face to face with colleagues from across the world- we have a very real example of a problem solved with Drake from LUNA, which we hope to be able to show very soon. It was also interesting to see how the API specs are developing- the presentation API will be enhanced with AV in version 3, and we can already see some use cases with which to try this out; search and discovery are APIs we’ve done nothing with, but these will help the ability to search within and across items, which is essential to our estate of systems, and 3D, while not having an API of its own, is also being addressed by IIIF, and it was fascinating to see the work that Universal Viewer and Sketchfab (which the DIU use) are doing to accommodate it.

The community groups are growing too, and we hope to increase our involvement with some of the less technical areas- Manuscripts, Museums, and the newly-proposed Archives group in the near future.

Among a wealth of great presentations, we’ve each identified one as our favourite:

Scott: Chifumi Nishioka – Kyoto University, Kiyonori Nagasaki – The University of Tokyo: Visualizing which parts of IIIF images are looked by users

This fascinating talk highlighted IIIF’s ability to work out which parts of an image, when zoomed in, are most popular. Often this is done by installing special tools such as eyetrackers, but the nature of IIIF- where the region is displayed as part of the URL- the same information can be visualised by interrogating Apache access logs. Chifumi and Kiyonori have been able to generate heatmaps of the most interesting regions on an item, and the code can be re-used if the logs can be supplied.

Joe: Kyle Rimkus – University of Illinois at Urbana-Champaign, Christopher J. Prom – University of Illinois at Urbana-Champaign: A Research Interface for Digital Records Using the IIIF Protocol

This talk showed the potential of IIIF in the context of digital preservation, providing large-scale public access to born-digital archive records without having to create exhaustive item-level metadata.  The IIIF world is encouraging this kind of blue-sky thinking which is going to challenge many of our traditional professional assumptions and allow us to be more creative with collections projects.

It was a terrific trip, which has filled us with enthusiasm for pushing on with IIIF beyond its already significant place in our set-up.

Joe Marshall & Scott Renton

Library Of Congress Exhibition
Library Of Congress Exhibition

Automated item data extraction from old documents

Overview

The Problem

We have a collection of historic papers from the Scottish Court of Session. These are collected into cases and bound together in large volumes, with no catalogue or item data other than a shelfmark. If you wish to find a particular case within the collection, you are restricted to a manual, physical search of likely volumes (if you’re lucky you might get an index at the start!).

Volumes of Session Papers in the Signet Library, Edinburgh
Volumes of Session Papers in the Signet Library, Edinburgh

The Aim

I am hoping to use computer vision techniques, OCR, and intelligent text analysis to automatically extract and parse case-level data in order to create an indexed, searchable digital resource for these items. The Digital Imaging Unit have digitised a small selection of the papers, which we will use as a pilot to assess the viability of the above aim.

Stage One – Image preparation

Using Python and OpenCV to extract text blocks

I am indebted to Dan Vanderkam‘s work in this area, especially his blog post ‘Finding blocks of text in an image using Python, OpenCV and numpy’ upon which this work is largely based.

The items in the Scottish Session Papers collection differ from the images that Dan was processing, being images of older works, which were printed with a letterpress rather than being typewritten.

The Session Papers images are lacking a delineating border, backing paper, and other features that were used to ease the image processing. In addition, the amount, density and layout of text items is incredibly varied across the corpus, further complicating the task.

The initial task is to find a crop of the image to pass to the OCR engine. We want to give it as much text as possible in as few pixels as possible!

Due to the nature of the images, there is often a small amount of text from the opposite page visible (John’s blog explains why) and so to save some hassle later, we’re going to start by cropping 50px from each horizontal side of the image, hopefully eliminating these bits of page overspill.

A cropped version of the page
A cropped version of the page

Now that we have the base image to work on, I’ve started with the simple steps of converting it to grayscale, and then applying an inverted binary threshold, turning everything above ~75% gray to white, and everything else to black. The inversion is to ease visual understanding of the process. You can view full size versions by clicking each image.

A grayscale version of the page
Grayscale
75% Threshold
75% Threshold

The ideal outcome is that we eliminate smudges and speckles, leaving only the clear printed letters. This entailed some experimenting with the threshold level, as you can see in the image above, a lot of speckling remains. Dropping the threshold to only leave pixels above ~60% gray was a large improvement, and to ~45% even more so:

60% Threshold
60% Threshold
45% Threshold
45% Threshold

At a threshold of 45%, some of the letters are also beginning to fade, but this should not be an issue, as we have successfully eliminated almost all the noise, which was the aim here.

We’re still left with a large block at the top, which was the black backing behind the edge of the original image. To eliminate this, I experimented with several approaches:

  • Also crop 50px from the top and bottom of the images – unfortunately this had too much “collateral damage” as a large amount of the images have text within this region.
  • Dynamic cropping based on removing any segments touching the top and bottom of the image – this was a more effective approach but the logic for determining the crop became a bit convoluted.
  • Using Dan’s technique of  applying Canny edge detection and then use a rank filter to remove ~1px edges – this was the most successful approach, although it still had some issues when the text had a non-standard layout.

I settled on the Canny/Rank filter approach to produce these results:

Result of Canny edge finder
Result of Canny edge finder
With rank filter
With rank filter

Next up, we want to find a set of masks that covers the remaining white pixels on the page. This is achieved by repeatedly dilating the image, until only a few connected components remain:

You can see here that the “faded” letters from the thresholding above have enough presence to be captured by the dilation process. These white blocks now give us a pretty good record of where the text is on the page, so we now move onto cropping the image.

Dan’s blog has a good explanation of solving the Subset Sum problem for a dilated image, so I will apply his technique (start with the largest white block, and add more if they improve the amount of white pixels at a favourable increase in total area size, with some tweaking to the exact ratio):

With final bounding
With final bounding

So finally, we apply this crop to the original image:

Final cropped version
Final cropped version

As you can see, we’ve now managed to accurately crop out the text from the image, helping to significantly reduce the work of the OCR engine.

My final modified version of Dan’s code can be found here: https://github.com/mbennett-uoe/sp-experiments/blob/master/sp_crop.py

In my next blog post, I’ll start to look at some OCR approaches and also go through some of the outliers and problem images and how I will look to tackle this.

Comments and questions are more than welcome 🙂

Mike Bennett – Digital Scholarship Developer

 

Library Digital Development investigate IIIF

Quick caveat: this post is a partner to the one Claire Knowles has written about our signing up to the IIIF Consortium, so the explanation of the acronym will not be explained here!

The Library Digital Development team decided to investigate the standard due to its appearance at every Cultural Heritage-related conference we’d attended in 2015, and we thought it would be apposite to update everyone with our progress.

First things first: we have managed to make some progress on displaying IIIF formatting to show what it does. Essentially, the standard allows us to display a remotely-served image on a web page, with our choice of size, rotation, mirroring and cropped section without needing to write CSS, HTML, or use Photoshop to manipulate the image; everything is done through the URL. The Digilib IIIF Server was very simple to get up and running (for those that are interested, it is distributed as a Java webapp that runs under Apache Tomcat), so here it is in action, using the standard IIIF URI syntax of [http://[server domain]/[webapp location]/[specific image identifier]/[region]/[size]/[mirror][rotation]/[quality].[format]]!

The URL for the following (image 0070025c.jpg/jp2) would be:

[domain]/0070025/full/full/0/default.jpg

Poster

This URL is saying, “give me image 0070025 (in this case an Art Collection poster), at full resolution, uncropped, unmirrored and unrotated: the standard image”.

[domain]/0070025/300,50,350,200/200,200/!236/default.jpg

posterbit

This URL says, “give me the same image, but this time show me co-ordinates 300px in from the left, 50 down from the top, to 350 in from the left, to 200 down from the top (of the original); return it at a resolution of 200px x 200px, rotate it at an angle of 236 degrees, and mirror it”.

The server software is only one part of the IIIF Image API: the viewer is very important too. There are a number of different viewers around which will serve up high-resolution zooming of IIIF images, and we tried integrating OpenSeaDragon with our Iconics collection to see how it could look when everything is up and running (this is not actually using IIIF interaction at this time, rather Microsoft DeepZoom surrogates, but it shows our intention). We cannot show you the test site, unfortunately, but our plan is that all our collections.ed.ac.uk sites, such as Art and Mimed, which have a link to the LUNA image platform, can have that replaced with an embedded high-res image like this. At that point, we will be able to hide the LUNA collection from the main site, thus saving us from having to maintain metadata in two places.

deepzoom

We have also met, as Claire says, the National Library’s technical department to see how they are doing with IIIF. They have implemented rather a lot using Klokan’s IIIFServer and we have investigated using this, with its integrated viewer on both Windows and Docker. We have only done this locally, so cannot show it here, but it is even easier to set up and configure than Digilib. Here’s a screenshot, to show we’re not lying.

eyes

Our plan to implement the IIIF Image API involves LUNA though. We already pay them for support and have a good working relationship with them. They are introducing IIIF in their next release so we intend to use that as a IIIF Server. It makes sense- we use LUNA for all our image management, it saves us having to build new systems, and because the software generates JP2K zoomable images, we don’t need to buy anything to do that (this process is not open, no matter how Open Source the main IIIF software may be!). We expect this to be available in the next month or so, and the above investigation has been really useful, as the experience with other servers will allow us to push back to LUNA to say “we think you need to implement this!”. Here’s a quick prospective screenshot of how to pick up a IIIF URL from the LUNA interface.

IIIFMenu

We still need to investigate more viewers (for practical use) and servers (for investigation), and we need to find out more about the Presentation API, annotations etc., but we feel we are making good progress nonetheless.

Scott Renton, Digital Developer

Bridging Gaps at the British Museum

IMG_1790The overwhelming setting of the British Museum played host to this year’s Museums Computer Group “Museums and the Web” Conference, and as usual, a big turnout from museums institutions all over the UK came, bursting with ideas and enthusiasm. The theme (“Bridging Gaps and Making Connections”) was intended to encourage thought about identifying creative spaces between physical museums collections and digital developments, where such spaces are perhaps too big, and how they can be exploited. As usual, there was far too much interesting content to cover fully in a blogpost- everything was thought-provoking, but I’ve picked out a few highlights.

Two projects highlighted collaboration between museums, which can be creatively explosive, and immediately improve engagement. Russell Dornan at The Wellcome Institute showed us #MuseumInstaSwap, where museums paired off and filled their social media feeds with the other museum’s content. Raphael Chanay at MuseoMix, meanwhile, arguably took this a step further by getting multiple institutions to bring their objects to a neutral location (Iron Bridge in Shropshire, Derby Silk Mill), and forming teams to build creative prototypes out of them across the digital and physical spaces. Could our museums collections be exploited in similar ways? Who could we partner up with?

I like to think that our “digital and physical” teams in L&UC collaborate very effectively. Keynote speaker John Coburn from TWAM (Tyne and Wear Archives and Museums) spoke of the importance of this intra-institution collaboration. You will (almost) never find a project that is run entirely from within the digital or physical sphere (Fiona Talbott from the HLF confirmed this- 510 of 512 recent bids had digital outputs relating to physical content), and the ability of the digital area and the content providers to communicate and work together is key. One very good example of this was the Tributaries app, built with sound artists, the history team, archives and so on, to put together an immersive audio experience of lost Tyneside voices from World War I. He also spoke of their TNT (Try New Things) initiative (also creatively explosive!) where staff sign up to do innovation with the collections, effectively in their spare time. With the Innovation Fund encouraging creativity, how do we work this into our daily lives? Can we? If not, how do we incentivise people to do it outwith their spare time? One of the gloomier observations of the day was that, with austerity, there is less and less money in the sector, which is likely to get worse after next month’s spending review. This austerity can breed creativity, though, and it’s good for digital, because people need to ‘work smarter’.

Another really interesting project is going on at the Tate, where they are combining their content with the Khan Academy learning platform. Rebecca Sinker and colleagues showed us how content can be levered and resurrected through a series of video tutorials around the content (be they archival, technical, biographical etc). Pushing the collaborative textual content from the comments area on the tutorials through to social media allows further engagement and new perspectives on the museum objects. Speaking personally, I have had little exposure to our VLE, but I’m quite sure that developing an interface between it and our collections sites could be highly beneficial.

That’s all the tip of the iceberg, though, so take a look at the programme link at the top to find out about lots of other interesting projects.

Outside of the lecture theatre, I had some really interesting conversations with people who have exactly the same problems as ourselves: building image management workflows, incorporating technological enhancements to content-driven websites, and thinking about beacon technology (the sponsors, Beacontent, deserver top marks for the name at least). Additionally, a tour of The Samsung Digital Discovery Centre– where state of the art technology meets British Museum content to improve the experience for children, teenagers, and families- was highly informative.

Scott Renton, Digital Developer