Tag Archives: Development

Introducing Whiiif – Full-text searching across image-based collections

Background

For historical collections digitisation projects inside the library, we are increasingly looking to provide OCR transcriptions of the documents alongside the digital images. In many cases, this can significantly enhance the usability of the images. For example, the volumes in the Session Papers Project are large and often lack an index, making it difficult to locate specific text inside them (see the previous blog post about this challenge here: http://libraryblogs.is.ed.ac.uk/librarylabs/2017/06/23/automated-item-data-extraction-from-old-manuscripts/) unless one is in possession of copious amounts of free time and dedication.

Our implementation of IIIF as the primary delivery method for digital images at Edinburgh (some highlights of our IIIF collections appear at the bottom of Scott’s blog here: http://libraryblogs.is.ed.ac.uk/librarylabs/2018/12/13/edinburgh-hosts-international-iiif-event/) opens up a vector not just for providing the OCR text alongside the images, but also for enabling native searching within the volume images inside an IIIF viewer such as UniversalViewer or Mirador.

IIIF Search

Currently, searching is usually performed before and outside of the viewing experience, with the chosen result then loaded in a viewer. Searching within a volume therefore offers different possibilities for end-users during their journeys across the collections. This is achieved by using a service that implements the IIIF Search API.
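
To make that concrete, here is roughly what the exchange looks like: the viewer issues a plain HTTP GET against the search service declared in the manifest, and gets back an annotation list whose annotations point at regions of the page canvases. The endpoint and identifiers below are made up for illustration, and the response is trimmed to the essential fields of a v1 Search API reply:

    GET https://example.org/iiif/book1/search?q=edinburgh

    {
      "@context": "http://iiif.io/api/search/1/context.json",
      "@id": "https://example.org/iiif/book1/search?q=edinburgh",
      "@type": "sc:AnnotationList",
      "resources": [{
        "@id": "https://example.org/iiif/book1/annotation/hit-1",
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": { "@type": "cnt:ContentAsText", "chars": "Edinburgh" },
        "on": "https://example.org/iiif/book1/canvas/p17#xywh=1024,512,280,64"
      }]
    }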

So far, not many such services exist in the open-source world: the IIIF Awesome list has just one entry under Content Search Services, NCSU Libraries’ Ocracoke project, a Rails-based full-workflow solution that can also process and OCR documents prior to serving them via IIIF. Whilst other institutions do provide IIIF Search on their holdings, these implementations tend to be integral parts of their digital delivery stacks that are not easily separable for release, internal-only projects, or instances of Ocracoke.

As the OCR here at Edinburgh falls under a different part of our workflow (of which more in a future blog post) and we primarily work with PHP and Python, I decided to implement a simple Python service capable of supporting the Search API. The project is written using Flask, a lightweight Python web framework, and backed by Apache Solr, which provides the text searching. A simple service needed a simple name, and so Whiiif (Word Highlighting for IIIF) was born.
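
To give a feel for the shape of the service, here is a minimal sketch of such a Flask endpoint. The route, Solr core and field names are illustrative rather than Whiiif’s actual ones, and the translation step is stubbed out (it is sketched in the next section):

    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    SOLR = "http://localhost:8983/solr/whiiif/select"  # illustrative core name

    def build_annotation_list(query, solr_response):
        # Stub: turning Solr highlights into IIIF annotations is
        # sketched in the next section of this post.
        return {"@type": "sc:AnnotationList", "resources": []}

    @app.route("/<identifier>/search")
    def search(identifier):
        query = request.args.get("q", "")
        params = {"q": query, "fq": f"id:{identifier}", "hl": "true", "wt": "json"}
        solr_response = requests.get(SOLR, params=params).json()
        return jsonify(build_annotation_list(query, solr_response))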

Whiiif v1

Initially, I adopted the model used by Ocracoke: indexing the whole text of each page in Solr, alongside an array of word-to-co-ordinate mappings for each page image. When a search is made, each matching document is returned using Solr’s native highlighting feature, which provides a fragment of text with the matching words bracketed by <em> tags.

The word-co-ordinate mappings for each page are extracted from the ALTO-XML generated by the Tesseract OCR process and stored in Solr as JSON, alongside the raw text. Producing the IIIF Search API response then becomes a case of extracting the matched words from the Solr highlight result, popping the co-ordinates for each word, and generating the response JSON for the client. The initial version of Whiiif using this approach can be found in the Whiiif GitHub repo at commit af8a903.
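
In outline, that translation step looked something like the following (a simplified reconstruction for illustration, not the code as committed; the field names and co-ordinate format are assumptions):

    import json
    import re

    def hits_to_annotations(highlight, word_coords_json, canvas_uri):
        # Walk the <em>-wrapped matches in a Solr highlight fragment and pop
        # the first remaining co-ordinate entry for each matched word.
        coords = json.loads(word_coords_json)  # [{"word": ..., "x": ..., "y": ..., "w": ..., "h": ...}, ...]
        annotations = []
        for match in re.findall(r"<em>(.*?)</em>", highlight):
            idx = next((i for i, c in enumerate(coords) if c["word"] == match), None)
            if idx is None:
                continue  # no co-ordinates found, e.g. the word was altered by Solr
            c = coords.pop(idx)
            annotations.append({
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {"@type": "cnt:ContentAsText", "chars": match},
                "on": f"{canvas_uri}#xywh={c['x']},{c['y']},{c['w']},{c['h']}",
            })
        return annotations

Note that the “pop the first remaining entry” assumption is precisely where the phrase-search problem described below crept in.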

This version was deployed, but testing raised issues when handling some of the documents in our collection:

  • Text-dense images, such as the Session Papers volumes, tended to have multiple instances of individual words on a page. Whilst this was not a problem for single-term searches (all instances would be found and their co-ordinates loaded), it caused a problem for phrase searches, where a word from the phrase could appear elsewhere on the page before the match for the phrase, leading to incorrect word co-ordinates being retrieved from the array.
  • Some words were modified by Solr’s language processing during ingest, meaning matches were returned for which the corresponding co-ordinates (generated from the raw, unprocessed text) could not be found.

There were approaches to solving these problems, such as forcing Solr to return the entire page text via the highlighter, so that the correct instance of a repeated word could be ascertained. However, this led to a significant increase in the processing time required to generate the response for each hit, as well as greater resource requirements for Solr, so I decided to try a different approach.

Whiiif v2

For the second iteration of Whiiif, I decided to investigate how feasible it would be to have Solr return the matching fragment of the ALTO-XML document itself, which would mean the co-ordinates for the hits were already in the Solr response. This ran into difficulties: Solr is designed for working with text and will not easily index or search an XML document; in fact, it usually strips all the XML data (which we wanted to preserve) using the HTMLStripCharFilter during indexing. Even with that filtering removed from the processing chain, basic abilities such as phrase searching were lost, because the text being searched took the form “word<xml fragment>word<xml fragment>word<xml fragment>…”, and there were false hits for words that appear inside the ALTO-XML markup itself, such as “page”, “line” and “word”.

Via the IIIF Slack, I was pointed towards the work of Johannes Baiter and the MDZ Digital Library team at the Bavarian State Library, who are developing a Solr plugin to resolve these issues (available at https://github.com/dbmdz/solr-ocrhighlighting). I reworked the Solr controller in Whiiif to use the plugin’s functionality, keeping the work already done to provide IIIF Search API responses.
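
With the plugin in place, a query can return pixel co-ordinates lifted straight from the ALTO. As a rough sketch (the parameter and response-key names such as hl.ocr.fl, ocrHighlighting and ulx/uly/lrx/lry follow my reading of the plugin’s documentation at the time, and may differ in current releases):

    import requests

    params = {
        "q": 'ocr_text:"session papers"',
        "hl": "true",
        "hl.ocr.fl": "ocr_text",   # assumed plugin parameter
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/whiiif/select", params=params).json()

    for doc_id, fields in resp.get("ocrHighlighting", {}).items():
        for snippet in fields["ocr_text"]["snippets"]:
            # Each snippet already carries co-ordinates from the ALTO,
            # so no separate word-to-co-ordinate lookup is needed.
            for region in snippet["regions"]:
                print(doc_id, region["ulx"], region["uly"], region["lrx"], region["lry"])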

Following a couple of weeks of testing, some very useful collaborative bug-fixing work with Johannes (primarily fixing some regexp bugs and improving the handling of ALTO files), and his speedy implementation of a feature request, Whiiif v2 was moved into internal production for some in-development project websites.

I then implemented a secondary feature in Whiiif: the ability to search across a collection as a whole and receive document hits with snippets of the page images (complete with visual highlighting), complementing the existing “search within” functionality of the IIIF Search API. This feature is also powered by the OCR highlighting plugin, but returns a custom JSON format (similar to the IIIF Search response format) that allows the front-end controller of a collections site to customise the display of results to fit each individual site design.

The version of Whiiif with these capabilities is currently available on the “withplugin” branch of the Whiiif GitHub repo, although it is still in heavy development and will become the master branch once it is a bit tidier!

Next Steps

The next steps for the Whiiif experiment are to prepare a formal release of Whiiif v2, with updated documentation, install instructions and full unit-test coverage; keep an eye out here or on the GitHub repo for news. In the meantime, please feel free to clone the repo and experiment. Issues and PRs are always welcome, and you can also contact me on the IIIF Slack (as mbennett) or via email: mike.bennett@ed.ac.uk.

I’d love to hear from anyone playing around with Whiiif, or to receive suggestions for other features. Experimental support for the “hits” property of IIIF Search v1 will arrive shortly, along with some updates to make use of the latest features of the Solr plugin.

Until next time 🙂
Mike

Edinburgh hosts international IIIF event

Jointly with the National Library of Scotland, the University hosted the annual IIIF Showcase and Working Meeting from December 3-6. As consortial members, both institutions saw it as a good opportunity to raise their profiles within this fast-growing community, and for delegates from all over the world to see Edinburgh in winter while making the most of face-to-face discussions about recent developments and the future direction of the framework.

The Showcase took place in the Royal Society of Edinburgh. This reasonably light-touch session offered an introduction to the concepts and tools, and gave the host institutions a chance to talk about what they’ve produced so far. It was also IIIF’s new managing director Josh Hadro’s first week in the job: a great way for him to see the community in action! In the afternoon, delegates repaired to the NLS and Main Library for breakout sessions in key content areas (Archives, Museums, Digital Scholarship) as well as deeply technical and ‘getting started’ sessions. To finish, everyone made for St Cecilia’s Hall for a round-up of the day; this was an appropriate setting, as we’ve employed IIIF in the museum’s corresponding collections site.

The Working Meeting ran over the succeeding three days, in the ECCI and Main Library. This was a smaller undertaking than the Showcase, but it still attracted 70 delegates. There were some really meaty discussions about the direction of the framework: cookbooks and use cases; updates to the Mirador viewer; enhancing the APIs and registries (including more work on authentication and various types of search); and the amazing potential of 3D and AV (e.g. subtitle support, or musical notation displayed as a piece plays), which is something we at the University are well placed to start work on. There were also discussions about the direction of the community and outreach group, a session led by our (until very recently) very own Claire Knowles, now Assistant Director at Leeds University Library. The first meeting of the Technical Review Committee, which rubber-stamps the specs, also took place at the event, in the huge Dining Room at Teviot.

With increasing engagement across the industry, IIIF’s future looks very bright indeed.

Thanks to everyone who helped out over the week, with a particularly big round of applause for IIIF’s Technical Co-ordinator Glen Robson, who is well known to many people in the Library from his previous incarnation as Development Manager at the National Library of Wales.

To (self-indulgently) end the post, here is a little hi-res illustration of the work that we have done at Edinburgh with IIIF.

This is heavily annotated! If you click the speech bubbles, you will turn on annotations, some of which link out to relevant websites (links have a dotted line under the text). Also, the Mirador viewer does comparison very well, so if you

  • click the four-square icon in the top left
  • select ‘Add Slot Right’
  • click ‘Add Item’
  • double click the manifest (‘IIIF Highlights…’)
  • select the right image

…you can see the previous version of this picture to see where improvements were made. All of this will go better if you make it full-screen!

IIIF Conference, Washington, May 2018

Washington Monument & Reflecting Pool

We (Joe Marshall (Head of Special Collections) and Scott Renton (Library Digital Development)) visited Washington DC for the IIIF Conference from 21st-25th May. This was a great opportunity for L&UC, not only to visit the Library of Congress (the mecca of our industry, in some ways) but also to come back with a wealth of knowledge to inform how we operate.

Edinburgh gave two papers: the two of us delivered a talk on Special Collections discovery at the Library and how IIIF could make it all more comprehensible (including the Mahabharata Scroll), and Scott spoke with Terry Brady of Georgetown University, showing how IIIF has improved our respective repository workflows.

On a purely practical level, it was great to meet face to face with colleagues from across the world: we have a very real example of a problem solved with Drake from LUNA, which we hope to be able to show very soon. It was also interesting to see how the API specs are developing. The Presentation API will be enhanced with AV in version 3, and we can already see some use cases with which to try this out. Search and discovery are APIs we’ve done nothing with, but they will help with the ability to search within and across items, which is essential to our estate of systems. 3D, while not having an API of its own, is also being addressed by IIIF, and it was fascinating to see the work that Universal Viewer and Sketchfab (which the DIU use) are doing to accommodate it.

The community groups are growing too, and we hope to increase our involvement with some of the less technical areas (Manuscripts, Museums, and the newly proposed Archives group) in the near future.

Among a wealth of great presentations, we’ve each identified one as our favourite:

Scott: Chifumi Nishioka – Kyoto University, Kiyonori Nagasaki – The University of Tokyo: Visualizing which parts of IIIF images are looked by users

This fascinating talk highlighted IIIF’s ability to reveal which parts of an image, when zoomed in, are most popular. Often this is done by installing special tools such as eye trackers, but because of the nature of IIIF, where the requested region is part of the URL, the same information can be visualised by interrogating Apache access logs. Chifumi and Kiyonori have been able to generate heatmaps of the most interesting regions of an item, and the code can be re-used if the logs can be supplied.
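
The underlying trick is simple enough to sketch: because a zoomed-in request embeds the region in the Image API URL, a pass over the access log is all that is needed. A rough illustration (not their code), assuming a standard Apache log and the canonical /{identifier}/{region}/{size}/{rotation}/{quality}.{format} URL layout:

    import re
    from collections import Counter

    # Count requested IIIF regions (x,y,w,h only, ignoring "full" requests)
    REGION = re.compile(r'"GET /iiif/([^/]+)/(\d+,\d+,\d+,\d+)/[^ ]+ HTTP')

    counts = Counter()
    with open("access.log") as log:
        for line in log:
            m = REGION.search(line)
            if m:
                counts[m.groups()] += 1

    # The per-region counts can then be binned over the image plane
    # to draw a heatmap of the most-viewed areas.
    for (identifier, region), n in counts.most_common(10):
        print(identifier, region, n)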

Joe: Kyle Rimkus – University of Illinois at Urbana-Champaign, Christopher J. Prom – University of Illinois at Urbana-Champaign: A Research Interface for Digital Records Using the IIIF Protocol

This talk showed the potential of IIIF in the context of digital preservation, providing large-scale public access to born-digital archive records without having to create exhaustive item-level metadata. The IIIF world is encouraging this kind of blue-sky thinking, which is going to challenge many of our traditional professional assumptions and allow us to be more creative with collections projects.

It was a terrific trip, which has filled us with enthusiasm for pushing on with IIIF beyond its already significant place in our set-up.

Joe Marshall & Scott Renton

Library Of Congress Exhibition

Library Digital Development investigate IIIF

Quick caveat: this post is a partner to the one Claire Knowles has written about our signing up to the IIIF Consortium, so the acronym will not be explained here!

The Library Digital Development team decided to investigate the standard after seeing it at every Cultural Heritage-related conference we attended in 2015, and we thought it would be apposite to update everyone on our progress.

First things first: we have managed to make some progress on displaying IIIF-formatted images to show what the standard does. Essentially, it allows us to display a remotely-served image on a web page, at our choice of size, rotation, mirroring and cropped section, without needing to write CSS or HTML or use Photoshop to manipulate the image; everything is done through the URL. The Digilib IIIF server was very simple to get up and running (for those who are interested, it is distributed as a Java webapp that runs under Apache Tomcat), so here it is in action, using the standard IIIF URI syntax of http://[server domain]/[webapp location]/[image identifier]/[region]/[size]/[mirror][rotation]/[quality].[format]

The URL for the following image (0070025c.jpg/jp2) would be:

[domain]/0070025/full/full/0/default.jpg

The full poster

This URL is saying, “give me image 0070025 (in this case an Art Collection poster), at full resolution, uncropped, unmirrored and unrotated: the standard image”.

[domain]/0070025/300,50,350,200/200,200/!236/default.jpg

The cropped, rotated and mirrored region of the poster

This URL says, “give me the same image, but this time show me the 350px-wide by 200px-high region whose top-left corner is 300px in from the left and 50px down from the top of the original (the region parameter is x,y,width,height); return it at a resolution of 200px x 200px, rotate it by 236 degrees, and mirror it”.
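
For anyone wanting to compose such URLs programmatically, a tiny helper covers the whole syntax (the base URL here is a placeholder, not one of our real endpoints):

    def iiif_url(base, identifier, region="full", size="full",
                 rotation=0, mirror=False, quality="default", fmt="jpg"):
        # Compose an IIIF Image API URL; "!" before the rotation means mirrored
        rot = ("!" if mirror else "") + str(rotation)
        return f"{base}/{identifier}/{region}/{size}/{rot}/{quality}.{fmt}"

    # The two example requests above:
    print(iiif_url("https://images.example.org/iiif", "0070025"))
    print(iiif_url("https://images.example.org/iiif", "0070025",
                   region="300,50,350,200", size="200,200",
                   rotation=236, mirror=True))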

The server software is only one part of the IIIF Image API story: the viewer is very important too. There are a number of different viewers around that will serve up high-resolution zooming of IIIF images, and we tried integrating OpenSeadragon with our Iconics collection to see how it could look when everything is up and running (this is not actually using IIIF interaction at this time, rather Microsoft DeepZoom surrogates, but it shows our intention). We cannot show you the test site, unfortunately, but our plan is that all our collections.ed.ac.uk sites, such as Art and Mimed, which currently link out to the LUNA image platform, can have that link replaced with an embedded high-res image like this. At that point, we will be able to hide the LUNA collection from the main site, saving us from having to maintain metadata in two places.

Deep-zoom test with the Iconics collection

We have also met, as Claire says, with the National Library’s technical department to see how they are getting on with IIIF. They have implemented rather a lot using Klokan’s IIIFServer, and we have investigated using it, with its integrated viewer, on both Windows and Docker. We have only done this locally, so cannot show it here, but it is even easier to set up and configure than Digilib. Here’s a screenshot, to show we’re not lying.

Klokan’s IIIFServer integrated viewer

Our plan to implement the IIIF Image API involves LUNA, though. We already pay them for support and have a good working relationship with them. They are introducing IIIF in their next release, so we intend to use that as a IIIF server. It makes sense: we use LUNA for all our image management, it saves us having to build new systems, and because the software generates JP2K zoomable images, we don’t need to buy anything else to do that (this process is not open, no matter how open source the main IIIF software may be!). We expect this to be available in the next month or so, and the above investigation has been really useful, as the experience with other servers will allow us to push back to LUNA and say “we think you need to implement this!”. Here’s a quick prospective screenshot of how to pick up a IIIF URL from the LUNA interface.

Picking up a IIIF URL from the LUNA interface

We still need to investigate more viewers (for practical use) and servers (for investigation), and we need to find out more about the Presentation API, annotations etc., but we feel we are making good progress nonetheless.

Scott Renton, Digital Developer

IIIF – International Image Interoperability Framework

The next big thing
Inspirational quote on the side of a University of Ghent building, St. Pietersnieuwstraat 33.

The adoption of IIIF (International Image Interoperability Framework) for digitised images has been gaining momentum over the past few years. Serving images via IIIF allows users to rotate, zoom, crop, and compare images from different institutions side by side. Scott and I attended the IIIF conference in Ghent earlier this month to learn more about the framework, so we can decide how to move forward with adopting IIIF for our images at the University of Edinburgh.

On the Monday, we attended a technical meeting at the University of Ghent Library; this session really helped us to understand the architecture of the two IIIF APIs (Image and Presentation) and to speak to others who have implemented IIIF at their institutions.

The main event was on Tuesday at the beautiful Ghent Opera House, where there were lots of short presentations about different use cases for IIIF adoption and the different applications that have been developed. If you are interested in adopting IIIF at your institution, I recommend looking at Glen Robson’s slides on how the National Library of Wales has implemented IIIF. I can see myself coming back to these slides again and again, along with those on the two APIs.

Whilst we were in Ghent, there was a timely update from LUNA Imaging, whose application we use as an imaging repository, on their plans to support IIIF.

Thanks to everyone we met in Ghent who was willing to share their experiences of implementing IIIF, and to the organisers for a great event in a beautiful city (and for our stickers).

IIIF Meeting in Ghent Opera House

If you want to keep up to date with IIIF developments, please join the Google Group: iiif-discuss@googlegroups.com

Claire Knowles and Scott Renton

Library Digital Development Team

Bridging Gaps at the British Museum

The overwhelming setting of the British Museum played host to this year’s Museums Computer Group “Museums and the Web” Conference, and as usual, a big turnout from museum institutions all over the UK came, bursting with ideas and enthusiasm. The theme (“Bridging Gaps and Making Connections”) was intended to encourage thought about identifying the creative spaces between physical museum collections and digital developments, where such gaps are perhaps too big, and how they can be exploited. As usual, there was far too much interesting content to cover fully in a blog post; everything was thought-provoking, but I’ve picked out a few highlights.

Two projects highlighted collaboration between museums, which can be creatively explosive and immediately improve engagement. Russell Dornan at The Wellcome Institute showed us #MuseumInstaSwap, where museums paired off and filled their social media feeds with each other’s content. Raphael Chanay of MuseoMix, meanwhile, arguably took this a step further by getting multiple institutions to bring their objects to a neutral location (Ironbridge in Shropshire, Derby Silk Mill) and forming teams to build creative prototypes from them across the digital and physical spaces. Could our museum collections be exploited in similar ways? Who could we partner up with?

I like to think that our “digital and physical” teams in L&UC collaborate very effectively. Keynote speaker John Coburn from TWAM (Tyne and Wear Archives and Museums) spoke of the importance of this intra-institution collaboration. You will (almost) never find a project that is run entirely from within the digital or the physical sphere (Fiona Talbott from the HLF confirmed this: 510 of 512 recent bids had digital outputs relating to physical content), and the ability of the digital area and the content providers to communicate and work together is key. One very good example of this was the Tributaries app, built with sound artists, the history team, archives and so on, to put together an immersive audio experience of lost Tyneside voices from World War I. He also spoke of their TNT (Try New Things) initiative (also creatively explosive!), where staff sign up to do innovation with the collections, effectively in their spare time. With the Innovation Fund encouraging creativity, how do we work this into our daily lives? Can we? If not, how do we incentivise people to do it outwith their spare time? One of the gloomier observations of the day was that, with austerity, there is less and less money in the sector, which is likely to get worse after next month’s spending review. This austerity can breed creativity, though, and it’s good for digital, because people need to ‘work smarter’.

Another really interesting project is going on at the Tate, where they are combining their content with the Khan Academy learning platform. Rebecca Sinker and colleagues showed us how content can be leveraged and resurrected through a series of video tutorials around it (be it archival, technical, biographical, etc.). Pushing the collaborative textual content from the comments area of the tutorials through to social media allows further engagement and new perspectives on the museum objects. Speaking personally, I have had little exposure to our VLE, but I’m quite sure that developing an interface between it and our collections sites could be highly beneficial.

That’s all the tip of the iceberg, though, so take a look at the programme link at the top to find out about lots of other interesting projects.

Outside the lecture theatre, I had some really interesting conversations with people who have exactly the same problems as ourselves: building image management workflows, incorporating technological enhancements into content-driven websites, and thinking about beacon technology (the sponsors, Beacontent, deserve top marks for the name at least). Additionally, a tour of The Samsung Digital Discovery Centre, where state-of-the-art technology meets British Museum content to improve the experience for children, teenagers, and families, was highly informative.

Scott Renton, Digital Developer

ArchivesSpace at the University of Edinburgh – the techie side

Introducing ArchivesSpace for researchers and public users, as well as the administrative side for our Archives Team within the Centre for Research Collections, has been an ongoing project for the last 18 months. It has taken us a while to get the service live for a number of reasons, and we have learnt lots along the way.

ArchivesSpace is free, open-source software and is easy to set up using Jetty and MySQL; however, some of our requirements have meant getting to grips with the underlying set-up and APIs of the system. We have also joined ArchivesSpace as paid members, as this gives us additional support through documentation and mailing lists.

Import of authority controls
We had an existing MySQL database containing thousands of authority terms collected by the Archives Team. It was very important for us to keep these and import them into our ArchivesSpace instance. We imported the subjects using the ArchivesSpace API; learning how to use the API was made easier by the Hudson Molonglo YouTube videos. We wrote simple PHP scripts to connect to the ArchivesSpace backend and import the subjects and agents from MySQL exports of our existing authority terms. After some trial and error, we imported 9,275 subjects and 13,703 agents into ArchivesSpace.
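
Our scripts were PHP, but the pattern is the same in any language: log in to the backend for a session token, then POST each record. A minimal Python equivalent (host, credentials and the example record are illustrative):

    import requests

    BACKEND = "http://localhost:8089"  # default ArchivesSpace backend port

    # Authenticate to get a session token
    login = requests.post(f"{BACKEND}/users/admin/login",
                          params={"password": "admin"}).json()
    headers = {"X-ArchivesSpace-Session": login["session"]}

    # Post one subject; our real scripts looped over the MySQL exports
    subject = {
        "source": "local",
        "vocabulary": "/vocabularies/1",
        "terms": [{"term": "Anatomy", "term_type": "topical",
                   "vocabulary": "/vocabularies/1"}],
    }
    print(requests.post(f"{BACKEND}/subjects", json=subject, headers=headers).json())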

For a while, the authorities were not linking with the resources migrated into ArchivesSpace by the Archives Team via the EAD importer. To enable the authorities to link, we had to make modifications to the EAD importer in the plugins; the changes are available to view in our GitHub code repository. We also made changes to the importer to give us a greater understanding of why EAD imports were failing. The reasons why EAD failed to import changed as new versions of ArchivesSpace were released, and the EAD importer is quite strict. The Archives Team migrated 16,836 resources (including components) for launch on 9th June.

API for other things
We have also used the API to run through all the resources imported from EAD and publish them. By default, they were not all published, and a lot of the notes and details of the resources were hidden from the public interface, so being able to script the publishing was a great time saver.
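
The publishing pass was a loop along these lines (sketched in Python, continuing from the login example above; the repository id is illustrative and our actual scripts differed):

    repo = f"{BACKEND}/repositories/2"
    ids = requests.get(f"{repo}/resources",
                       params={"all_ids": "true"}, headers=headers).json()

    for rid in ids:
        resource = requests.get(f"{repo}/resources/{rid}", headers=headers).json()
        if not resource.get("publish"):
            resource["publish"] = True  # then post the updated record back
            requests.post(f"{repo}/resources/{rid}", json=resource, headers=headers)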

Tomcat set-up
We decided to run ArchivesSpace under Tomcat, as it is a web server we have a lot of experience with. However, ArchivesSpace runs easily under Jetty, and running it under Tomcat has caused us some headaches, due to URL issues and the fact that the Tomcat installation script adds a lot of files to Tomcat, not just the web apps.

Customisation
We have customised the user interface for the administrative and public front ends of ArchivesSpace. These changes were made within the local plugin. The look and feel has been made to fit in with our other services, such as collections.ed, and the colour scheme of the University. This was relatively straightforward, as the ArchivesSpace UI is based on Twitter Bootstrap. Unfortunately, the public UI images displayed when running under Jetty but not under Tomcat; after some copying of files, the images appeared.

The Public ArchivesSpace Portal: http://archives.collections.ed.ac.uk

Early Adopters
It has taken longer than we had initially hoped to launch ArchivesSpace, for a number of reasons. Primarily, as early adopters of the software, we hit issues that we did not foresee when the initial version was made available. The ArchivesSpace members’ mailing list is very active; as it is a new system, there are lots of shared questions from those getting to grips with it and working through their implementations. ArchivesSpace, particularly Chris Fitzpatrick, has helped steer us in the right direction and shared code. The migration of EAD has been a huge task, undertaken by Deputy Archives Manager Grant Buttars; it has been great to work with him and to gain a greater understanding of the EAD format when resolving issues with failing imports.

We still have lots to do with the system to leverage its full functionality and fully showcase our amazing archive collections through links to http://collections.ed.ac.uk and our image repository. So watch this space.

This post follows on from Grant’s post http://libraryblogs.is.ed.ac.uk/edinburghuniversityarchives/2015/06/22/implementing-archivesspace/

Claire Knowles
Library Digital Development Team