Tag Archives: library labs

Introducing Whiiif – Full text searching across image-based collections

Background

For historical collections digitisation projects inside the library, we are increasingly looking to provide OCR transcriptions of the documents alongside the digital images. In many cases, this can enhance the usability of the images significantly. For example, the volumes in the Session Papers Project are large and often without index, making locating specific text inside difficult (see the previous blog post about this challenge here: http://libraryblogs.is.ed.ac.uk/librarylabs/2017/06/23/automated-item-data-extraction-from-old-manuscripts/) unless one is in possession of copious amounts of free time and dedication.

Our implementation of IIIF as the primary delivery method for digital images at Edinburgh (some highlights of our IIIF collections at the bottom of Scott’s blog here: http://libraryblogs.is.ed.ac.uk/librarylabs/2018/12/13/edinburgh-hosts-international-iiif-event/) opens up a vector for not just providing the OCR text alongside the images, but also to enable native searching within the volume images inside an IIIF viewer such as UniversalViewer or Mirador.

IIIF Search

Currently searching is usually performed before and outside of the viewing experience, with the chosen result then loaded in a viewer. Searching within a volume therefore offers different possibilities for the end-user during their journies across the collections. This is achieved by using a service that is capable of providing the IIIF Search API.

So far, not many such services exist in the open source world, with the IIIF Awesome list having just one entry under Content Search Services: NCSU Libraries’ Ocracoke project, which is a Rails-based full workflow solution that can also process and OCR the documents prior to serving them via IIIF. Whilst other institutions do provide IIIF Search on their holdings, these implementations can be an integral part of their digital delivery stack and not easily seperable for release, internal only projects, instances of Ocracoke, etc.

As the OCR here at Edinburgh falls under a different part of our workflow (of which more in a future blogpost) and we are primarily working with PHP and Python, I decided to implement a simple Python service capable of supporting the Search API. The project is written using Flask, a lightweight Python web framework and backed by Apache Solr to provide the text-searching. A simple service needed a simple name, and so Whiiif (Word Highlighting for IIIF) was born.

Whiiif v1

Initially, I adopted the model used by Ocracoke: indexing the text of each whole page in Solr, and using an array of word->co-ordinate mappings for each page image. When a search is made in Solr, each document is returned using the native Highlighting feature of Solr, which returns a fragment of text, with the matching words bracketed by <em> tags.

The word-co-ordinate mappings for each page are extracted from the ALTO-XML generated by the Tesseract OCR process and stored in Solr as JSON, alongside the raw text. Producing the IIIF Search API response then becomes a case of extracting the matched words from the Solr highlight result, popping the co-ordinates for each word, and generating the response JSON for the client. The initial version of Whiiif using this approach can be found on the Whiiif Github repo at commit af8a903.

This version was deployed and during testing raised issues when handling some of the documents in our collection:

  • Text-dense images, such as the Session Papers volumes, tended to have multiple instances of individual words on a page. Whilst this was not a problem for single-term searches (all instances would be found and the co-ordinates loaded), it caused a problem for phrase-based searching, where a word from the phrase could appear elsewhere on the page, before the match for the phrase, leading to incorrect word co-ordinates being retrieved from the array.
  • Some words were modified by Solr’s language processing during the ingest process, meaning matches were being returned for which the corresponding co-ordinates (generated from the pre-processed, raw text) could not be found.

There were approaches to solving these problems, such as forcing Solr to return the entire page text via the Highlighter, so that the correct instance of repeated words could be ascertained. However, this led to a significant increase in the processing time required to generate the response for each hit, as well as greater resource requirements for Solr and I decided to try a different approach.

Whiiif v2

For the second iteration of Whiiif, I decided to investigate how feasible it would be to have Solr return the matching fragment from the ALTO-XML document, which would mean having the co-ordinates for the hits already in the Solr response. This ran into difficulties, as Solr is designed for working with text, and will not easily index or search an XML document, in fact usually stripping all the XML data (that we wanted to try and preserve) by using the HTMLStripCharFilter during the indexing process. Even with the filtering removed from the processing chain, basic abilities such as phrase searching were lost due to the format of the text being searched being “word<xml fragment>word<xml fragment>word<xml fragment>…”, and false hits for words appearing inside the ALTO-XML format such as “page”, “line”, “word”, etc.

Via the IIIF Slack, I was pointed towards the work of Johannes Baiter and the MDZ Digital Library team at the Bavarian State Library, who are developing a Solr plugin to resolve these various issues (available at https://github.com/dbmdz/solr-ocrhighlighting). I reworked the Solr controller for Whiiif to use the functionality of this plugin, keeping the work already done to provide IIIF Search API responses.

Following a couple of weeks of testing, and some very useful collaborative bug-fixing work with Johannes (primarily fixing some regexp bugs and improving the handling of ALTO files) and thanks to his speedy implementation of a feature request, Whiiif v2 was moved into internal production for some in-development project websites.

I then implemented a secondary feature in Whiiif: the ability to search across a collection as a whole, and have document hits, with snippets of page images returned (complete with visual highlighting), to complement the existing “Search Within” functionality of the IIIF Search API. This feature is also powered by the OCR Highlighting plugin, but returns a custom JSON format (although similar to the IIIF Search response format), allowing the front end controller of a collections site to customise the display of results to fit each individual site design.

The version of Whiiif with these capabilities is currently available on the “withplugin” branch of the Whiiif github repo, although this is still in heavy development and will become the master branch in the future when it is a bit tidier!

Next Steps

The next steps with the Whiiif experiment are to prepare a formal release of Whiiif v2, with updated documentation, install instructions and full unit-test coverage, keep an eye out here or on the github repo for news. In the meantime, please feel free to clone the repo and experiment. Issues and PRs always welcome and you can also contact me on the IIIF slack (as mbennett) or via email: mike.bennett@ed.ac.uk.

I’d love to hear from anyone playing around with Whiiif, or suggestions for other features. Experimental support for the “hits” property of IIIF Search v1 will arrive shortly, along with some updates to make use of the latest features of the Solr plugin.

Until next time ๐Ÿ™‚
Mike

Automated item data extraction from old documents

Overview

The Problem

We have a collection of historic papers from the Scottish Court of Session. These are collected into cases and bound together in large volumes, with no catalogue or item data other than a shelfmark. If you wish to find a particular case within the collection, you are restricted to a manual, physical search of likely volumes (if you’re lucky you might get an index at the start!).

Volumes of Session Papers in the Signet Library, Edinburgh
Volumes of Session Papers in the Signet Library, Edinburgh

The Aim

I am hoping to use computer vision techniques, OCR, and intelligent text analysis to automatically extract and parse case-level data in order to create an indexed, searchable digital resource for these items. The Digital Imaging Unit have digitised a small selection of the papers, which we will use as a pilot to assess the viability of the above aim.

Stage One – Image preparation

Using Python and OpenCV to extract text blocks

I am indebted to Dan Vanderkam‘s work in this area, especially his blog post ‘Finding blocks of text in an image using Python, OpenCV and numpy’ upon which this work is largely based.

The items in the Scottish Session Papers collection differ from the images that Dan was processing, being images of older works, which were printed with a letterpress rather than being typewritten.

The Session Papers images are lacking a delineating border, backing paper, and other features that were used to ease the image processing. In addition, the amount, density and layout of text items is incredibly varied across the corpus, further complicating the task.

The initial task is to find a crop of the image to pass to the OCR engine. We want to give it as much text as possible in as few pixels as possible!

Due to the nature of the images, there is often a small amount of text from the opposite page visible (John’s blog explains why) and so to save some hassle later, we’re going to start by cropping 50px from each horizontal side of the image, hopefully eliminating these bits of page overspill.

A cropped version of the page
A cropped version of the page

Now that we have the base image to work on, I’ve started with the simple steps of converting it to grayscale, and then applying an inverted binary threshold, turning everything above ~75% gray to white, and everything else to black. The inversion is to ease visual understanding of the process. You can view full size versions by clicking each image.

A grayscale version of the page
Grayscale
75% Threshold
75% Threshold

The ideal outcome is that we eliminate smudges and speckles, leaving only the clear printed letters. This entailed some experimenting with the threshold level, as you can see in the image above, a lot of speckling remains. Dropping the threshold to only leave pixels above ~60% gray was a large improvement, and to ~45% even more so:

60% Threshold
60% Threshold
45% Threshold
45% Threshold

At a threshold of 45%, some of the letters are also beginning to fade, but this should not be an issue, as we have successfully eliminated almost all the noise, which was the aim here.

We’re still left with a large block at the top, which was the black backing behind the edge of the original image. To eliminate this, I experimented with several approaches:

  • Also crop 50px from the top and bottom of the images – unfortunately this had too much “collateral damage” as a large amount of the images have text within this region.
  • Dynamic cropping based on removing any segments touching the top and bottom of the image – this was a more effective approach but the logic for determining the crop became a bit convoluted.
  • Using Dan’s technique ofย  applying Canny edge detection and then use a rank filter to remove ~1px edges – this was the most successful approach, although it still had some issues when the text had a non-standard layout.

I settled on the Canny/Rank filter approach to produce these results:

Result of Canny edge finder
Result of Canny edge finder
With rank filter
With rank filter

Next up, we want to find a set of masks that covers the remaining white pixels on the page. This is achieved by repeatedly dilating the image, until only a few connected components remain:

You can see here that the “faded” letters from the thresholding above have enough presence to be captured by the dilation process. These white blocks now give us a pretty good record of where the text is on the page, so we now move onto cropping the image.

Dan’s blog has a good explanation of solving the Subset Sum problem for a dilated image, so I will apply his technique (start with the largest white block, and add more if they improve the amount of white pixels at a favourable increase in total area size, with some tweaking to the exact ratio):

With final bounding
With final bounding

So finally, we apply this crop to the original image:

Final cropped version
Final cropped version

As you can see, we’ve now managed to accurately crop out the text from the image, helping to significantly reduce the work of the OCR engine.

My final modified version of Dan’s code can be found here: https://github.com/mbennett-uoe/sp-experiments/blob/master/sp_crop.py

In my next blog post, I’ll start to look at some OCR approaches and also go through some of the outliers and problem images and how I will look to tackle this.

Comments and questions are more than welcome ๐Ÿ™‚

Mike Bennett – Digital Scholarship Developer

 

Library Digital Development investigate IIIF

Quick caveat: this post is a partner to the one Claire Knowles has written about our signing up to the IIIF Consortium, so the explanation of the acronym will not be explained here!

The Library Digital Development team decided to investigate the standard due to its appearance at every Cultural Heritage-related conference we’d attended in 2015, and we thought it would be apposite to update everyone with our progress.

First things first: we have managed to make some progress on displaying IIIF formatting to show what it does. Essentially, the standard allows us to display a remotely-served image on a web page, with our choice of size, rotation, mirroring and cropped section without needing to write CSS, HTML, or use Photoshop to manipulate the image; everything is done through the URL. The Digilib IIIF Server was very simple to get up and running (for those that are interested, it is distributed as a Java webapp that runs under Apache Tomcat), so here it is in action, using the standard IIIF URI syntax of [http://[server domain]/[webapp location]/[specific image identifier]/[region]/[size]/[mirror][rotation]/[quality].[format]]!

The URL for the following (image 0070025c.jpg/jp2) would be:

[domain]/0070025/full/full/0/default.jpg

Poster

This URL is saying, “give me image 0070025 (in this case an Art Collection poster), at full resolution, uncropped, unmirrored and unrotated: the standard image”.

[domain]/0070025/300,50,350,200/200,200/!236/default.jpg

posterbit

This URL says, “give me the same image, but this time show me co-ordinates 300px in from the left, 50 down from the top, to 350 in from the left, to 200 down from the top (of the original); return it at a resolution of 200px x 200px, rotate it at an angle of 236 degrees, and mirror it”.

The server software is only one part of the IIIF Image API: the viewer is very important too. There are a number of different viewers around which will serve up high-resolution zooming of IIIF images, and we tried integrating OpenSeaDragon with our Iconics collection to see how it could look when everything is up and running (this is not actually using IIIF interaction at this time, rather Microsoft DeepZoom surrogates, but it shows our intention). We cannot show you the test site, unfortunately, but our plan is that all our collections.ed.ac.uk sites, such as Art and Mimed, which have a link to the LUNA image platform, can have that replaced with an embedded high-res image like this. At that point, we will be able to hide the LUNA collection from the main site, thus saving us from having to maintain metadata in two places.

deepzoom

We have also met, as Claire says, the National Library’s technical department to see how they are doing with IIIF. They have implemented rather a lot using Klokan’s IIIFServer and we have investigated using this, with its integrated viewer on both Windows and Docker. We have only done this locally, so cannot show it here, but it is even easier to set up and configure than Digilib. Here’s a screenshot, to show we’re not lying.

eyes

Our plan to implement the IIIF Image API involves LUNA though. We already pay them for support and have a good working relationship with them. They are introducing IIIF in their next release so we intend to use that as a IIIF Server. It makes sense- we use LUNA for all our image management, it saves us having to build new systems, and because the software generates JP2K zoomable images, we don’t need to buy anything to do that (this process is not open, no matter how Open Source the main IIIF software may be!). We expect this to be available in the next month or so, and the above investigation has been really useful, as the experience with other servers will allow us to push back to LUNA to say “we think you need to implement this!”. Here’s a quick prospective screenshot of how to pick up a IIIF URL from the LUNA interface.

IIIFMenu

We still need to investigate more viewers (for practical use) and servers (for investigation), and we need to find out more about the Presentation API, annotations etc., but we feel we are making good progress nonetheless.

Scott Renton, Digital Developer