
Automated item data extraction from old documents

Overview

The Problem

We have a collection of historic papers from the Scottish Court of Session. These are collected into cases and bound together in large volumes, with no catalogue or item data other than a shelfmark. If you wish to find a particular case within the collection, you are restricted to a manual, physical search of likely volumes (if you’re lucky you might get an index at the start!).

Volumes of Session Papers in the Signet Library, Edinburgh

The Aim

I am hoping to use computer vision techniques, OCR, and intelligent text analysis to automatically extract and parse case-level data in order to create an indexed, searchable digital resource for these items. The Digital Imaging Unit have digitised a small selection of the papers, which we will use as a pilot to assess the viability of the above aim.

Stage One – Image preparation

Using Python and OpenCV to extract text blocks

I am indebted to Dan Vanderkam’s work in this area, especially his blog post ‘Finding blocks of text in an image using Python, OpenCV and numpy’, upon which this work is largely based.

The items in the Scottish Session Papers collection differ from the images that Dan was processing: they are images of older works, printed with a letterpress rather than typewritten.

The Session Papers images lack the delineating border, backing paper, and other features that Dan relied on to ease the image processing. In addition, the amount, density and layout of text varies enormously across the corpus, further complicating the task.

The initial task is to find a crop of the image to pass to the OCR engine. We want to give it as much text as possible in as few pixels as possible!

Due to the nature of the images, a small amount of text from the opposite page is often visible (Dan’s blog explains why), so to save some hassle later we’re going to start by cropping 50px from each horizontal side of the image, hopefully eliminating these bits of page overspill.
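
For anyone following along, the side-crop itself is just a couple of lines of NumPy slicing. This is only a sketch – the filenames and variable names are placeholders rather than the ones in the actual script:

# Sketch of the initial side-crop; filenames and variable names are placeholders.
import cv2
img = cv2.imread("session_paper.jpg")      # load the digitised page
h, w = img.shape[:2]
MARGIN = 50                                # pixels to trim from each horizontal side
# Trim the left and right edges to drop any overspill from the facing page.
cropped = img[:, MARGIN:w - MARGIN]
cv2.imwrite("session_paper_cropped.jpg", cropped)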

A cropped version of the page

Now that we have the base image to work on, I’ve started with the simple steps of converting it to grayscale and then applying an inverted binary threshold, turning everything above ~75% gray to white and everything else to black. The inversion is to ease visual understanding of the process.
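
In OpenCV terms this step looks roughly like the following sketch (the threshold value shown is illustrative – the experimentation with different levels is described below):

# Sketch of the grayscale conversion and inverted binary threshold.
import cv2
img = cv2.imread("session_paper_cropped.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# THRESH_BINARY_INV maps pixels at or below the threshold to white (255)
# and pixels above it to black, so the dark printed letters come out white.
THRESHOLD = 190                            # illustrative value on the 0-255 scale
_, binary = cv2.threshold(gray, THRESHOLD, 255, cv2.THRESH_BINARY_INV)
cv2.imwrite("session_paper_threshold.jpg", binary)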

Grayscale
75% Threshold

The ideal outcome is that we eliminate smudges and speckles, leaving only the clear printed letters. This took some experimenting with the threshold level: as you can see in the image above, a lot of speckling remains at 75%. Dropping the threshold to only keep pixels above ~60% gray was a large improvement, and ~45% even more so:

60% Threshold
45% Threshold

At a threshold of 45%, some of the letters are also beginning to fade, but this should not be an issue, as we have successfully eliminated almost all the noise, which was the aim here.

We’re still left with a large block at the top, which was the black backing behind the edge of the original image. To eliminate this, I experimented with several approaches:

  • Also crop 50px from the top and bottom of the images – unfortunately this had too much “collateral damage”, as a large number of the images have text within this region.
  • Dynamic cropping based on removing any segments touching the top and bottom of the image – this was a more effective approach but the logic for determining the crop became a bit convoluted.
  • Using Dan’s technique of applying Canny edge detection and then using a rank filter to remove ~1px edges (sketched below) – this was the most successful approach, although it still had some issues when the text had a non-standard layout.
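
Roughly, the Canny/rank filter step follows Dan’s recipe. The filter sizes and rank below are illustrative, and the real script may apply the filter at a slightly different point in the pipeline:

# Sketch of Canny edge detection plus a rank filter to suppress ~1px border edges.
import cv2
import numpy as np
from scipy.ndimage import rank_filter
gray = cv2.imread("session_paper_cropped.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 100, 200)
# With a long, thin footprint an edge pixel only survives if several of its
# neighbours along the row/column are also edges, which removes thin borders.
maxed_rows = rank_filter(edges, -4, size=(1, 20))
maxed_cols = rank_filter(edges, -4, size=(20, 1))
deborder = np.minimum(np.minimum(edges, maxed_rows), maxed_cols)
cv2.imwrite("session_paper_deborder.jpg", deborder)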

I settled on the Canny/Rank filter approach to produce these results:

Result of Canny edge finder
With rank filter

Next up, we want to find a set of masks that covers the remaining white pixels on the page. This is achieved by repeatedly dilating the image, until only a few connected components remain:

You can see here that the “faded” letters from the thresholding above have enough presence to be captured by the dilation process. These white blocks now give us a pretty good record of where the text is on the page, so we now move on to cropping the image.
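
A minimal sketch of that dilation loop, assuming the de-bordered edge image from the previous step (the kernel size and the stopping threshold are guesses rather than the values actually used):

# Sketch: dilate repeatedly until only a handful of connected components remain.
import cv2
import numpy as np
edges = cv2.imread("session_paper_deborder.jpg", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((3, 3), np.uint8)
dilated = edges.copy()
for _ in range(50):                        # safety cap on the number of dilations
    n_labels, _ = cv2.connectedComponents(dilated)
    if n_labels - 1 <= 16:                 # label 0 is the background
        break
    dilated = cv2.dilate(dilated, kernel, iterations=1)
cv2.imwrite("session_paper_dilated.jpg", dilated)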

Dan’s blog has a good explanation of solving the Subset Sum problem for a dilated image, so I will apply his technique: start with the largest white block, and add further blocks only if they increase the number of white pixels covered at a favourable cost in total crop area, with some tweaking to the exact ratio:

With final bounding
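
A rough sketch of that greedy selection, reloading the intermediate images from the earlier steps (OpenCV 4’s findContours signature is assumed, and the 0.05 acceptance ratio is purely illustrative):

# Sketch of greedily growing a crop box from the dilated text blocks.
import cv2
import numpy as np
dilated = cv2.imread("session_paper_dilated.jpg", cv2.IMREAD_GRAYSCALE)
binary = cv2.imread("session_paper_threshold.jpg", cv2.IMREAD_GRAYSCALE)
original = cv2.imread("session_paper_cropped.jpg")  # coordinates are relative to the side-cropped page
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
def box_stats(contour):
    # Bounding box of a block plus the number of text (white) pixels inside it.
    x, y, w, h = cv2.boundingRect(contour)
    white = int(np.count_nonzero(binary[y:y + h, x:x + w]))
    return (x, y, x + w, y + h), white
blocks = sorted((box_stats(c) for c in contours), key=lambda b: -b[1])
# Start from the block with the most text pixels, then merge in further blocks
# only when the extra text justifies the growth in total crop area.
(x0, y0, x1, y1), _ = blocks[0]
for (bx0, by0, bx1, by1), white in blocks[1:]:
    nx0, ny0, nx1, ny1 = min(x0, bx0), min(y0, by0), max(x1, bx1), max(y1, by1)
    area_gain = (nx1 - nx0) * (ny1 - ny0) - (x1 - x0) * (y1 - y0)
    if white / max(area_gain, 1) > 0.05:   # tweakable acceptance ratio
        x0, y0, x1, y1 = nx0, ny0, nx1, ny1
final = original[y0:y1, x0:x1]             # crop applied to the original page
cv2.imwrite("session_paper_final.jpg", final)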

So finally, we apply this crop to the original image:

Final cropped version

As you can see, we’ve now managed to accurately crop out the text from the image, helping to significantly reduce the work of the OCR engine.

My final modified version of Dan’s code can be found here: https://github.com/mbennett-uoe/sp-experiments/blob/master/sp_crop.py

In my next blog post, I’ll start to look at some OCR approaches and also go through some of the outliers and problem images and how I will look to tackle this.

Comments and questions are more than welcome 🙂

Mike Bennett – Digital Scholarship Developer

 

‘Innovation’: the Emperor’s new clothes?

Scott and I travelled down to Cambridge last week to speak at the Museum Computer Group’s Spring Meeting, ‘Innovation’: the Emperor’s new clothes? It was a very informative day that began with Peter Pavement of SurfaceImpression giving us a history of digital innovation in museums, including the first audio guides and the Senster, the first robotic sculpture to be controlled by a computer.

First Museum Audio Guides (from Loic Tallon, Flickr)

Peter discussing the Hype Cycle – where would you place new technological innovations?

The Hype Cycle

Sejul Malde, Culture 24, followed on from Peter. He discussed using existing assets and content, as well as small ‘process-focused’ innovation rather than innovation through giant leaps. His emphasis on creating a rhythm for change made me reflect on how short sprints enabled us to get Collections.ed online. (Looking at our GitHub commit history highlights sprint deadlines.)

Scott and I then discussed the work we have been doing at Edinburgh to get our collections online through Collections.ed. This has been an iterative process: starting with four online collections launched in May 2014, we now have eight collections online following the recent launch of our Iconics collection. We have also recently made a first import into Collections.ed of 776 unique crowdsourced tags, obtained through our Library Labs Metadata Games and those entered into Tiltfactor’s metadata games.

The tags can been seen online in these two examples:
Charles Darwin’s Class Card
Bond M., White House in Warm Perthshire Valley

The slides from our presentation are available on ERA http://hdl.handle.net/1842/10415 and have a film theme running through them.

The new Iconics home page (I think it is my favourite so far):


In the afternoon Lizzie Edwards, Samsung Digital Discovery Centre, British Museum, led a practical session where we had to think about how we could use new technologies in museums. Jessica Suess, Oxford University Museums, spoke about their ‘Innovation Fund’ programme and how it had led to new ways of working and new collaborations with colleagues. She mentioned one project using iPads as art sketchbooks (http://www.ashmolean.org/education/dsketchbooks/), which was also showcased in a lightning talk.

Lightning talks and a Q&A session with HLF and Nesta finished off the day; you can find out more from Liz Hide’s Storify of the day: https://storify.com/TheMuseumOfLiz/the-emperor-s-new-clothes

Claire Knowles and Scott Renton, Library Digital Development Team

Google Glass, gaming and Gallimaufry

Last Monday was no typical day at the office: after an early start at the Imperial War Museum exploring its First World War exhibition with Google Glass, I finished the day trying to escape the British Library before the lights were switched off! In between, I was involved in the launch of a new initiative to make our images available through Tiltfactor’s Metadatagames crowdsourcing platform.

Google Glass at the IWM


The Imperial War Museum ran an experiment to see how its First World War Galleries could be enhanced with the use of Google Glass and invited heritage professionals to try out the technology. Information Services at the University of Edinburgh have recently acquired a few sets of Google Glass and announced a competition to see how students could use it to improve their learning, so I was keen to see how it could be used in a heritage setting. The concept was actually very simple: a Glass ‘tour’ had been uploaded to the device and, whenever the wearer approached one of several beacons installed throughout the exhibition, additional relevant content was fed to their Glass screen. For example, one of the exhibits was an early tank – when I came within range, a short 1916 propaganda film appeared on my screen describing how the new invention would “bring an end to the war”.

I felt the museum did a good job of providing enough additional content through the Glass to complement existing exhibits without overwhelming the user with too much additional information. The device was surprisingly comfortable and the screen wasn’t overly intrusive. This experiment showed that Google Glass can work in a museum setting: there is definitely scope for using it in one of the Library’s exhibition spaces to provide another dimension to showcasing our collections.

Digital Conversations at the British Library


There have been some fantastic initiatives recently in using heritage content as inspiration for video games – this event, part of the British Library’s Digital Conversations series, brought heritage professionals and games designers together for the formal launch of the 2015 ‘Off the Map’ competition for students to design games inspired by the BL’s collections. The theme for the competition, ‘Alice’s Adventures off the Map’, relates to next year’s 150th anniversary of the publication of Alice’s Adventures in Wonderland. The Library provides asset packs for games designers and facilitates access to original collections; the designers use these materials to create exciting and innovative computer games. Previous winners have included an underwater adventure through the long demolished, but now digitally restored, Fonthill Abbey, and a fully immersive 3D version of London from before the Great Fire of 1666.

There were also some really interesting talks at the event about the launch of the National Videogame Arcade in Nottingham, a discussion about how the V&A’s designer in residence built a successful mobile app using items from the museum’s collections, and a demonstration of how the British Museum used Minecraft to engage users with the building and its collections. The range of ideas on display gave food for thought – how can the University take inspiration from initiatives such as these to enhance access to and use of our own collections?

Gallimaufry games


Crowdsourcing is definitely one way we can do this! We’ve been working on creating a fun metadata tagging game to encourage games enthusiasts, and those with an interest in our collections, to ‘say what they see’ and tag our images. We took inspiration from Tiltfactor’s Metadatagames platform, and on Monday we uploaded around 2,500 images from our Gallimaufry collection to their site. You can now play addictive games such as ‘Zen Tag’, ‘Stupid Robot’ and ‘Guess What’ using a diverse range of images from our own collections!