Tag Archives: technology

Introducing Whiiif – Full text searching across image-based collections

July 3, 2019 | Uncategorized | API, Collections, Development, digital scholarship, IIIF, images, library labs, OCR, open source, technology, whiiif | Mike Bennett

Background

For historical collections digitisation projects inside the library, we are increasingly looking to provide OCR transcriptions of the documents alongside the digital images. In many cases, this can enhance the usability of the images significantly. For example, the volumes in the Session Papers Project are large and often without index, making locating specific text inside difficult (see the previous blog post about this challenge here: http://libraryblogs.is.ed.ac.uk/librarylabs/2017/06/23/automated-item-data-extraction-from-old-manuscripts/) unless one is in possession of copious amounts of free time and dedication.

Our implementation of IIIF as the primary delivery method for digital images at Edinburgh (some highlights of our IIIF collections at the bottom of Scott’s blog here: http://libraryblogs.is.ed.ac.uk/librarylabs/2018/12/13/edinburgh-hosts-international-iiif-event/) opens up a route not just to providing the OCR text alongside the images, but also to enabling native searching within the volume images inside an IIIF viewer such as UniversalViewer or Mirador.

IIIF Search

Currently, searching is usually performed before and outside of the viewing experience, with the chosen result then loaded in a viewer. Searching within a volume therefore offers different possibilities for the end-user during their journeys across the collections. This is achieved by using a service that is capable of providing the IIIF Search API.

So far, not many such services exist in the open source world, with the IIIF Awesome list having just one entry under Content Search Services: NCSU Libraries’ Ocracoke project, which is a Rails-based full workflow solution that can also process and OCR the documents prior to serving them via IIIF. Whilst other institutions do provide IIIF Search on their holdings, these implementations can be an integral part of their digital delivery stack and not easily separable for release, internal-only projects, instances of Ocracoke, etc.

As the OCR here at Edinburgh falls under a different part of our workflow (of which more in a future blogpost) and we are primarily working with PHP and Python, I decided to implement a simple Python service capable of supporting the Search API. The project is written using Flask, a lightweight Python web framework, and backed by Apache Solr to provide the text searching. A simple service needed a simple name, and so Whiiif (Word Highlighting for IIIF) was born.

Whiiif v1

Initially, I adopted the model used by Ocracoke: indexing the text of each whole page in Solr, and using an array of word->co-ordinate mappings for each page image. When a search is made in Solr, each document is returned using the native Highlighting feature of Solr, which returns a fragment of text, with the matching words bracketed by <em> tags.

The word-co-ordinate mappings for each page are extracted from the ALTO-XML generated by the Tesseract OCR process and stored in Solr as JSON, alongside the raw text. Producing the IIIF Search API response then becomes a case of extracting the matched words from the Solr highlight result, popping the co-ordinates for each word, and generating the response JSON for the client. The initial version of Whiiif using this approach can be found on the Whiiif Github repo at commit af8a903.
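The v1 flow described above can be sketched roughly as follows. This is an illustration, not the code at commit af8a903: the function names, the sample ALTO fragment and the annotation `@id` pattern are all assumptions, and only the response shape follows the IIIF Search API 1.0 annotation list.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict, deque

# Tiny hand-written ALTO sample (hypothetical); real Tesseract output is far larger.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Session" HPOS="120" VPOS="80" WIDTH="210" HEIGHT="42"/>
    <String CONTENT="Papers" HPOS="350" VPOS="80" WIDTH="190" HEIGHT="42"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def word_boxes(alto_xml):
    """Map each lower-cased word to a queue of its (x, y, w, h) boxes, in page order."""
    boxes = defaultdict(deque)
    for el in ET.fromstring(alto_xml).iter():
        if el.tag.rsplit("}", 1)[-1] == "String":  # ignore the namespace prefix
            box = tuple(int(float(el.get(k))) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            boxes[el.get("CONTENT").lower()].append(box)
    return boxes


def annotation_list(search_uri, canvas_uri, snippet, boxes):
    """Turn one Solr highlight snippet into a Search API 1.0 annotation list."""
    resources = []
    for word in re.findall(r"<em>(.*?)</em>", snippet):
        queue = boxes.get(word.lower())
        if not queue:
            continue  # no co-ordinates recorded for this highlighted token
        x, y, w, h = queue.popleft()  # "pop" the next box recorded for this word
        resources.append({
            "@id": f"{canvas_uri}/annotation/{len(resources)}",  # hypothetical id scheme
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": word},
            "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
        })
    return {
        "@context": "http://iiif.io/api/search/1/context.json",
        "@id": search_uri,
        "@type": "sc:AnnotationList",
        "resources": resources,
    }
```

Calling `annotation_list(...)` with the snippet `"the <em>Papers</em> of"` and the boxes from `word_boxes(ALTO_SAMPLE)` yields a single annotation targeting the canvas region for "Papers".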

This version was deployed and during testing raised issues when handling some of the documents in our collection:

Text-dense images, such as the Session Papers volumes, tended to have multiple instances of individual words on a page. Whilst this was not a problem for single-term searches (all instances would be found and the co-ordinates loaded), it caused a problem for phrase-based searching, where a word from the phrase could appear elsewhere on the page, before the match for the phrase, leading to incorrect word co-ordinates being retrieved from the array.
Some words were modified by Solr’s language processing during the ingest process, meaning matches were being returned for which the corresponding co-ordinates (generated from the pre-processed, raw text) could not be found.
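To make the first problem concrete, here is a small sketch of how a per-word lookup goes wrong for phrases, next to a position-aware lookup that needs the whole page's word order. The page data and both function names are invented for illustration.

```python
# A toy page: (word, (x, y, w, h)) pairs in reading order (hypothetical values).
PAGE = [
    ("the", (10, 10, 30, 12)),
    ("court", (45, 10, 50, 12)),
    ("heard", (100, 10, 55, 12)),
    ("the", (160, 10, 30, 12)),
    ("lord", (195, 10, 40, 12)),
    ("advocate", (240, 10, 80, 12)),
]


def naive_boxes(page, phrase):
    """Per-word lookup: take the first box recorded for each word in the phrase."""
    first = {}
    for word, box in page:
        first.setdefault(word, box)
    return [first[w] for w in phrase.split()]


def positional_boxes(page, phrase):
    """Find the phrase as a consecutive run of words and return those boxes."""
    words = phrase.split()
    tokens = [w for w, _ in page]
    for start in range(len(tokens) - len(words) + 1):
        if tokens[start:start + len(words)] == words:
            return [box for _, box in page[start:start + len(words)]]
    return []
```

For the phrase "the lord", `naive_boxes` highlights the first "the" on the page rather than the one next to "lord", whereas `positional_boxes` returns the adjacent pair. The second approach is only possible when the full page text is available in order.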

There were approaches to solving these problems, such as forcing Solr to return the entire page text via the Highlighter, so that the correct instance of repeated words could be ascertained. However, this led to a significant increase in the processing time required to generate the response for each hit, as well as greater resource requirements for Solr, and I decided to try a different approach.

Whiiif v2

For the second iteration of Whiiif, I decided to investigate how feasible it would be to have Solr return the matching fragment from the ALTO-XML document, which would mean having the co-ordinates for the hits already in the Solr response. This ran into difficulties, as Solr is designed for working with text and will not easily index or search an XML document; in fact, it usually strips all the XML data (which we wanted to preserve) by using the HTMLStripCharFilter during the indexing process. Even with the filtering removed from the processing chain, basic abilities such as phrase searching were lost, because the text being searched took the form “word<xml fragment>word<xml fragment>word<xml fragment>…”, and false hits appeared for words that are part of the ALTO-XML format itself, such as “page”, “line”, “word”, etc.

Via the IIIF Slack, I was pointed towards the work of Johannes Baiter and the MDZ Digital Library team at the Bavarian State Library, who are developing a Solr plugin to resolve these various issues (available at https://github.com/dbmdz/solr-ocrhighlighting). I reworked the Solr controller for Whiiif to use the functionality of this plugin, keeping the work already done to provide IIIF Search API responses.

Following a couple of weeks of testing, and some very useful collaborative bug-fixing work with Johannes (primarily fixing some regexp bugs and improving the handling of ALTO files) and thanks to his speedy implementation of a feature request, Whiiif v2 was moved into internal production for some in-development project websites.

I then implemented a secondary feature in Whiiif: the ability to search across a collection as a whole, and have document hits, with snippets of page images returned (complete with visual highlighting), to complement the existing “Search Within” functionality of the IIIF Search API. This feature is also powered by the OCR Highlighting plugin, but returns a custom JSON format (although similar to the IIIF Search response format), allowing the front end controller of a collections site to customise the display of results to fit each individual site design.
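As an illustration of how a front end might request a page-image snippet for a hit, the sketch below builds an IIIF Image API 2.x region URL around a bounding box. This is an assumption about one possible approach, not a description of Whiiif's custom JSON format; the function name, the padding value and the example base URL are hypothetical.

```python
def snippet_url(image_base, box, padding=20):
    """Build an IIIF Image API 2.x URL for a padded region around a hit.

    image_base is the image service URL for the page (no trailing slash);
    box is (x, y, w, h) in full-resolution pixel co-ordinates.
    """
    x, y, w, h = box
    x0 = max(0, x - padding)  # clamp so the region never starts off-page
    y0 = max(0, y - padding)
    width = w + (x - x0) + padding
    height = h + (y - y0) + padding
    # {region}/{size}/{rotation}/{quality}.{format}
    return f"{image_base}/{x0},{y0},{width},{height}/full/0/default.jpg"
```

For example, `snippet_url("https://example.org/iiif/page1", (100, 200, 300, 40))` requests the region `80,180,340,80`. The right and bottom edges are not clamped here, since that needs the page dimensions.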

The version of Whiiif with these capabilities is currently available on the “withplugin” branch of the Whiiif github repo, although this is still in heavy development and will become the master branch in the future when it is a bit tidier!

Next Steps

The next steps with the Whiiif experiment are to prepare a formal release of Whiiif v2, with updated documentation, install instructions and full unit-test coverage. Keep an eye out here or on the GitHub repo for news. In the meantime, please feel free to clone the repo and experiment. Issues and PRs are always welcome, and you can also contact me on the IIIF Slack (as mbennett) or via email: mike.bennett@ed.ac.uk.

I’d love to hear from anyone playing around with Whiiif, or suggestions for other features. Experimental support for the “hits” property of IIIF Search v1 will arrive shortly, along with some updates to make use of the latest features of the Solr plugin.

Until next time 🙂
Mike

Automated item data extraction from old documents

June 23, 2017 | Uncategorized | automation, Collections, image processing, library labs, OCR, scottish session papers, technology | Mike Bennett

Overview

The Problem

We have a collection of historic papers from the Scottish Court of Session. These are collected into cases and bound together in large volumes, with no catalogue or item data other than a shelfmark. If you wish to find a particular case within the collection, you are restricted to a manual, physical search of likely volumes (if you’re lucky you might get an index at the start!).

Volumes of Session Papers in the Signet Library, Edinburgh

The Aim

I am hoping to use computer vision techniques, OCR, and intelligent text analysis to automatically extract and parse case-level data in order to create an indexed, searchable digital resource for these items. The Digital Imaging Unit have digitised a small selection of the papers, which we will use as a pilot to assess the viability of the above aim.

Stage One – Image preparation

Using Python and OpenCV to extract text blocks

I am indebted to Dan Vanderkam’s work in this area, especially his blog post ‘Finding blocks of text in an image using Python, OpenCV and numpy’, upon which this work is largely based.

The items in the Scottish Session Papers collection differ from the images that Dan was processing, being images of older works, which were printed with a letterpress rather than being typewritten.

The Session Papers images are lacking a delineating border, backing paper, and other features that were used to ease the image processing. In addition, the amount, density and layout of text items is incredibly varied across the corpus, further complicating the task.

The initial task is to find a crop of the image to pass to the OCR engine. We want to give it as much text as possible in as few pixels as possible!

Due to the nature of the images, there is often a small amount of text from the opposite page visible (John’s blog explains why) and so to save some hassle later, we’re going to start by cropping 50px from each horizontal side of the image, hopefully eliminating these bits of page overspill.

Now that we have the base image to work on, I’ve started with the simple steps of converting it to grayscale, and then applying an inverted binary threshold, turning everything above ~75% gray to white, and everything else to black. The inversion is to ease visual understanding of the process.
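A minimal sketch of the side crop, grayscale conversion and inverted threshold. It uses plain NumPy so that it runs without OpenCV; with OpenCV, `cv2.threshold` with the `cv2.THRESH_BINARY_INV` flag does the same thresholding job. The function names are illustrative, and reading "~75% gray" as a cutoff at 75% of the 0 to 255 intensity range is an assumption.

```python
import numpy as np


def crop_sides(image, margin=50):
    """Drop `margin` pixels from the left and right edges of the image."""
    return image[:, margin:image.shape[1] - margin]


def to_gray(rgb):
    """Convert an RGB uint8 image to grayscale using the usual luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.rint(rgb[..., :3] @ weights).astype(np.uint8)


def inverted_threshold(gray, level=0.75):
    """Inverted binary threshold: pixels at or below the cutoff become white."""
    cutoff = int(round(level * 255))  # assumption: level is a fraction of full intensity
    return np.where(gray <= cutoff, 255, 0).astype(np.uint8)
```

Lowering `level` (for example to 0.60 or 0.45) keeps only progressively darker pixels, so light speckles drop out before the printed letters do.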

A grayscale version of the page — Grayscale

The ideal outcome is that we eliminate smudges and speckles, leaving only the clear printed letters. This entailed some experimenting with the threshold level: as you can see in the image above, a lot of speckling remains. Dropping the threshold to only leave pixels above ~60% gray was a large improvement, and to ~45% even more so:

At a threshold of 45%, some of the letters are also beginning to fade, but this should not be an issue, as we have successfully eliminated almost all the noise, which was the aim here.

We’re still left with a large block at the top, which was the black backing behind the edge of the original image. To eliminate this, I experimented with several approaches:

Also crop 50px from the top and bottom of the images – unfortunately this had too much “collateral damage” as a large number of the images have text within this region.
Dynamic cropping based on removing any segments touching the top and bottom of the image – this was a more effective approach but the logic for determining the crop became a bit convoluted.
Using Dan’s technique of applying Canny edge detection and then using a rank filter to remove ~1px edges – this was the most successful approach, although it still had some issues when the text had a non-standard layout.

I settled on the Canny/Rank filter approach to produce these results:

Next up, we want to find a set of masks that covers the remaining white pixels on the page. This is achieved by repeatedly dilating the image, until only a few connected components remain:
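The "dilate until only a few components remain" loop can be sketched as below. This is a NumPy-only illustration on a boolean mask, not the code in sp_crop.py. On real page images one would normally use `cv2.dilate` and `cv2.connectedComponents` instead, and the stopping values here are assumptions.

```python
from collections import deque

import numpy as np


def dilate(mask):
    """One dilation step with a 3x3 cross: a pixel turns on if it or a 4-neighbour is on."""
    p = np.pad(mask, 1)  # pad with False so the edges need no special-casing
    return p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]


def count_components(mask):
    """Count 4-connected groups of True pixels with a breadth-first flood fill."""
    seen = np.zeros(mask.shape, dtype=bool)
    height, width = mask.shape
    count = 0
    for y in range(height):
        for x in range(width):
            if mask[y, x] and not seen[y, x]:
                count += 1
                seen[y, x] = True
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < height and 0 <= nx < width and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
    return count


def dilate_until(mask, max_components=16, max_steps=50):
    """Dilate repeatedly until at most `max_components` blobs remain (or we give up)."""
    steps = 0
    while count_components(mask) > max_components and steps < max_steps:
        mask = dilate(mask)
        steps += 1
    return mask, steps
```

The `max_steps` guard is there so that a noisy image cannot keep the loop running indefinitely.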

You can see here that the “faded” letters from the thresholding above have enough presence to be captured by the dilation process. These white blocks now give us a pretty good record of where the text is on the page, so we now move onto cropping the image.

Dan’s blog has a good explanation of solving the Subset Sum problem for a dilated image, so I will apply his technique (start with the largest white block, and add more if they improve the amount of white pixels at a favourable increase in total area size, with some tweaking to the exact ratio):
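A simplified sketch of the greedy selection described above: start from the block with the most white pixels and add further blocks only when the gain in white pixels is favourable compared with the growth in crop area. The `Block` type, the `min_gain_ratio` value and the exact acceptance test are assumptions for illustration; the scoring in Dan's code and in sp_crop.py differs in detail.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """Bounding box of one dilated component plus the white pixels it contains."""
    x1: int
    y1: int
    x2: int
    y2: int
    white: int


def area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def choose_crop(blocks, min_gain_ratio=0.5):
    """Greedily grow a crop box from the largest block (hypothetical ratio test)."""
    ordered = sorted(blocks, key=lambda b: b.white, reverse=True)
    crop = tuple(ordered[0][:4])
    covered = ordered[0].white
    for block in ordered[1:]:
        candidate = union(crop, block[:4])
        white_gain = block.white / covered
        area_gain = (area(candidate) - area(crop)) / area(crop)
        # Accept blocks that cost no extra area, or whose relative gain in
        # white pixels is large enough compared with the relative area growth.
        if area_gain == 0 or white_gain / area_gain >= min_gain_ratio:
            crop = candidate
            covered += block.white
    return crop
```

With a main text block, a footer block just below it and a distant speck, this accepts the footer and rejects the speck, because the speck would nearly double the crop area for a negligible gain in text pixels.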

So finally, we apply this crop to the original image:

As you can see, we’ve now managed to accurately crop out the text from the image, helping to significantly reduce the work of the OCR engine.

My final modified version of Dan’s code can be found here: https://github.com/mbennett-uoe/sp-experiments/blob/master/sp_crop.py

In my next blog post, I’ll start to look at some OCR approaches and also go through some of the outliers and problem images and how I will look to tackle them.

Comments and questions are more than welcome 🙂

Mike Bennett – Digital Scholarship Developer

CSV Mind-Blow(ve)n!

May 9, 2016 | Uncategorized | Conferences, crowdsourcing, csv, Development, Digital, images, Metadata Games, open source, spreadsheets, technology | Scott Renton

PSV Eindhoven - Eredivisie champions — Never waste a football-related pun. It’s been a good week for both PSV and CSV.

I was lucky enough to have a paper accepted to Csv,conf,2* in Berlin on the 3rd-4th of May, which was great to do, but it also got me through the door to see loads of great things going on in data and its surrounding technology. Yes, there was heavy mention made of CSV and spreadsheets; in fact, at times it was akin to an AA meeting, with people guiltily admitting their love of Excel. This left me feeling, quite worryingly, vindicated in a lot of the things I do.

As is always the case with any conference review blogpost, it’s not viable to list every link or ruminate on the message of every talk, so I’ll just home in on a few highlights. The talks (available in slide or video form) are appearing over at Lanyrd.com, and they’ll give a lot more depth to what was spoken about.

My own reason for being there, as far as my talk was concerned, was to look at better ways of processing workflow, enriching data, and improving engagement with our collections. Afterwards, I had some interesting conversations: I was alerted to the tool NeuralTalk2 by Maciej Gryka of rainforestqa.com, a company that specialises in cleaning data using mechanical turks and crowdsourced test-cases. NeuralTalk, though, is a captioning tool, which will attempt to visually recognise what your image is “of”. I’m sure it fails as much as it succeeds, but, as I pointed out, we’ve not really used this kind of tech to enhance our metadata, so there’d be no harm in running some of our images through and seeing what it comes up with. Another chat, with a lady from UC Santa Cruz, made it clear that we are quite liberal with our approach to crowdsourced data. Where we have generally decided it’s fine to surface as long as it is properly marked as such, they are proceeding rather slowly, due to a particularly strict metadata librarian.

The keynotes were deliberately intended to cover a range of disciplines that might be new to most people at this highly eclectic conference. As a result, there were interesting talks on technology and activism (including visualisations of the Ebola crisis and police brutality); ethics in technology and workflows to give your consent without clicking on unreadable terms and conditions (do you know what SmartBins are taking from you as you pass?); dealing with messy spreadsheets (the Enron crisis showed this institution to manage them terribly); and open data with neuroscience (a lot of mouse brains in action).

Some other tools that we could be looking at:

Zegami – great for exploring large banks of images and spotting patterns across them. Can it work with IIIF, I wonder?

OpenRefine – a tool that we perhaps should have been using for some time to rationalise spreadsheet data, which could certainly save lots of time. Our former colleague, Richard Jones of cottagelabs, is a great advocate of these kinds of tools, as his talk made clear.

DataBaker – created in collaboration between the Office for National Statistics and ScraperWiki. This Python application can convert any ‘pretty’ spreadsheet into usable source data in CSV.

CSV Rinse and Repeat – built in Paris by Mathieu Jacomy of the Paris MediaLab, this is a JavaScript tool which intends to cut down the distance between data, coding and visualisation. Basically you take your data, spot patterns, recode to surface interesting things, and generate a visualisation out the end, in an iterative process.

Wikipedia Googlesheets – I am not sure if we would have a use for this specifically, but it was fascinating to see a plugin which serves up spreadsheet formulae coded in Google Apps JavaScript, which can then be used to interrogate any Wikipedia pages. Particularly of note is the ability to combine category pages and pageviews, to see if real-time events are influenced by Wikipedia and vice-versa. Developed by Thomas Steiner at Google.

Finally, here are three interesting observations, which certainly struck a chord with me:

It is now deemed quite acceptable, as a symptom of rapid development, perhaps, that the CSV is used as the master dataset; perhaps the file-based database’s day is not over. I certainly found out about some interesting applications built in this way.
I heard no-one but myself talk about Excel macros, at a spreadsheet conference, no less! It is far more fashionable these days to read your data as CSV, and code against it using JavaScript, R, or Python. I clearly need to get out of the 1990s.
EVERYONE suffers from problems with diacritics, glyphs, badly formatted data and what happens when you import a CSV into a spreadsheet tool that tries to be too clever. It is not just me.

All in all, an excellent couple of days, which have filled me with ideas for improvements for existing workflows. Hopefully the likes of the DIU will reap some benefits!

Berlin Dom und Fernsehturm — Insert standard caption regarding “old content meets new technology to surface it”!

Scott Renton, Digital Developer

*The commas are intentional, by the way!

Open Repositories 2015

June 19, 2015 | Uncategorized | crowdsourcing, DSpace, gamification, Metadata Games, Open Repositories, technology | cknowles

Last week I was very fortunate to attend the Open Repositories Conference in Indianapolis, where I was co-authoring a presentation on metadata games with Tiltfactor, Dartmouth College, and also co-chairing the new Developer Track and Ideas Challenge at the conference with Adam Field from EPrints Solutions, University of Southampton.

The conference hotel in downtown Indianapolis

It was a very packed few days, as the conference has moved from five to four days with more tracks running simultaneously. This unfortunately meant I didn’t get to see all the presentations I wanted to see, but all the papers are online and the main conference room was recorded. The keynote speakers were Kaitlin Thaney (Mozilla Science Lab) and Anurag Acharya (Google Scholar) who both gave very interesting and very different presentations. Kaitlin managed to hold the whole room’s attention despite having to talk via Skype, due to problems with flights. Their slides are available here and the videos should be online soon. Anurag’s talk was very specific on how repositories need to be designed to work with Google Scholar and has already created lots of debate over the use of PDF cover pages.

The Developer Track was new for OR15; it was designed, along with the Ideas Challenge, to stop the developers who attend the conference being torn between writing code for a competition and attending conference sessions. There were lots of demos and it was great to see no one having to apologise for screens of code, XML and terminal windows. I found the sessions really practical and have a list of things to try now I am back in the office, starting with Hardy Pottinger’s Vagrant-DSpace.

Hardy demoing Vagrant-DSpace

The Ideas Challenge was designed to be less time-intensive than the previously run Developer Challenge but still encourage people to discuss issues they would like to resolve, meet new people, and have a fun session where audience participation was encouraged. Adam and I created an example challenge and solution based on the Sound of Music (Ideas Challenge slides). We had 9 entries to the challenge this year and, thanks to Adam’s scoresheet, there were no long deliberations and a clear winner was identified after our judges handed in their scores. Congratulations to the winners (see the blog post about the challenge winners).

Slide from the Winning Entry – Email Deposit

My presentation on Crowdsourcing Metadata (session slides) was in front of a packed room, helped by my talk being sandwiched between Rob Sanderson from Stanford University and Simeon Warner from Cornell University. I got some great questions and hopefully passed on the message that engagement is just as important as developing the tools. My co-authors at Tiltfactor launched their new OCR verification games last week; http://beanstalkgame.org is very addictive (you have been warned).

Tiltfactor’s Beanstalk game

There was a lot of discussion during Monday’s DSpace DCAT and Committer meeting and also during the DSpace Interest Group about the ambitious work planned for DSpace 7 in 2016. The University of Edinburgh manages DSpace repositories on behalf of the Scottish Digital Library Consortium and Stuart Lewis is a DSpace Steering Committee member. I hope that these plans get the support they require within the DSpace community to make them happen.

Hardy Pottinger has written a blogpost about the conference, and his ‘stuff to check out later’ list is similar to mine, so I won’t repeat it, although I would add Vagrant and implementations of IIIF. Both have been on my todo list for a while, but I really, really want to find time to play with them now.

Thanks to Sukie at Tiltfactor and Adam for lots of Skype calls over the past few months, and to the Open Repositories organisers for a great week.

‘Innovation’: the Emperor’s new clothes?

May 20, 2015 | Uncategorized | crowdsourcing, gamification, images, Museums, technology | cknowles

Scott and I travelled down to Cambridge last week to speak at the Museum Computer Group’s Spring Meeting, ‘Innovation’: the Emperor’s new clothes? It was a very informative day that began with Peter Pavement, SurfaceImpression, giving us a history of digital innovation in museums, including the first audio guides and the Senster, which was the first robotic sculpture to be controlled by a computer.

First Museum Audio Guides from Loic Tallon Flickr

Peter discussing the Hype Cycle, where would you place new technological innovations?

Sejul Malde, Culture 24, followed on from Peter. He discussed using existing assets and content, as well as small ‘process focused’ innovation rather than innovation through giant leaps. His emphasis on creating a rhythm for change made me reflect on how short sprints enabled us to get Collections.ed online. (Looking at our Github commit history highlights sprint deadlines.)

Scott and I then discussed the work we have been doing at Edinburgh to get our collections online through Collections.ed. This has been an iterative process: starting off with four online collections launched in May 2014, we now have eight collections online following the recent launch of our Iconics collection. We have also recently made a first import into Collections.ed of 776 unique crowdsourced tags we have obtained through Library Labs Metadata Games and those entered into Tiltfactor’s metadata games.

The tags can be seen online in these two examples:
Charles Darwin’s Class Card
Bond M., White House in Warm Perthshire Valley

The slides from our presentation are available on ERA http://hdl.handle.net/1842/10415 and have a film theme running through them.

The new Iconics home page (I think it is my favourite so far):

In the afternoon Lizzie Edwards, Samsung Digital Discovery Centre, British Museum, led a practical session where we had to think about how we could use new technologies in museums. Jessica Suess, Oxford University Museums, spoke about their ‘Innovation Fund’ programme and how it had led to new ways of working and new collaborations with colleagues. She mentioned one project using iPads as Art Sketchbooks http://www.ashmolean.org/education/dsketchbooks/ which was also showcased in a lightning talk.

Lightning talks and a Q&A session with HLF and Nesta finished off the day; you can find out more from Liz Hide’s storify of the day: https://storify.com/TheMuseumOfLiz/the-emperor-s-new-clothes

Claire Knowles and Scott Renton, Library Digital Development Team

Google Glass, gaming and Gallimaufry

December 16, 2014 | Uncategorized | gamification, Google Glass, technology | Gavin Willshaw

Last Monday was no typical day at the office: after an early start at the Imperial War Museum exploring its First World War exhibition with Google Glass, I finished the day trying to escape the British Library before the lights were switched off! In between, I was involved in the launch of a new initiative to make our images available through Tiltfactor’s Metadatagames crowdsourcing platform.

Google Glass at the IWM

The Imperial War Museum ran an experiment to see how its First World War Galleries could be enhanced with the use of Google Glass and invited heritage professionals to try out the technology. Information Services at the University of Edinburgh have recently acquired a few sets of Google Glass and announced a competition to see how students could use it to improve their learning, so I was keen to see how it could be used in a heritage setting. The concept was actually very simple: a Glass ‘tour’ had been uploaded to the device and, whenever a wearer approached one of several beacons installed throughout the exhibition, the user was fed additional relevant content onto their Glass screen. For example, one of the exhibits was an early tank – when I came within range, a short 1916 propaganda film appeared on my screen describing how the new invention would “bring an end to the war”.

I felt the museum did a good job of providing enough additional content through the Glass to complement existing exhibits without overwhelming the user with too much additional information. The device was surprisingly comfortable and the screen wasn’t overly intrusive. This experiment showed that Google Glass can work in a museum setting: there is definitely scope for using it in one of the Library’s exhibition spaces to provide another dimension to showcasing our collections.

Digital Conversations at the British Library

There have been some fantastic initiatives recently in using heritage content as inspiration for video games – this event, part of the British Library’s Digital Conversations series, brought heritage professionals and games designers together for the formal launch of the 2015 ‘Off the Map’ competition for students to design games inspired by the BL’s collections. The theme for the competition, ‘Alice’s Adventures off the Map’, relates to next year’s 150th anniversary of the publication of Alice’s Adventures in Wonderland. The Library provides asset packs for games designers and facilitates access to original collections; the designers use these materials to create exciting and innovative computer games. Previous winners have included an underwater adventure through the long demolished, but now digitally restored, Fonthill Abbey, and a fully immersive 3D version of London from before the Great Fire of 1666.

There were also some really interesting talks at the event about the launch of the National Videogame Arcade in Nottingham, a discussion about how the V&A’s designer in residence built a successful mobile app using items from the museum’s collections, and a demonstration of how the British Museum used Minecraft to engage users with the building and its collections. The range of ideas on display gave food for thought – how can the University take inspiration from initiatives such as these to enhance access to and use of our own collections?

Gallimaufry games

Crowdsourcing is definitely one way we can do this! We’ve been working on creating a fun metadata tagging game to encourage games enthusiasts, and those with an interest in our collections, to ‘say what they see’ and tag our images. We took inspiration from Tiltfactor’s Metadatagames platform, and on Monday we uploaded around 2,500 images from our Gallimaufry collection to their site. You can now play addictive games such as ‘Zen Tag’, ‘Stupid Robot’ and ‘Guess What’ using a diverse range of images from our own collections!

University of Edinburgh Library Labs Blog