Edinburgh Research Explorer Statistics: June 2019

Posted in Green OA, Open Access

Introducing Whiiif – Full text searching across image-based collections

Background

For historical collections digitisation projects inside the library, we are increasingly looking to provide OCR transcriptions of the documents alongside the digital images. In many cases, this can enhance the usability of the images significantly. For example, the volumes in the Session Papers Project are large and often lack an index, which makes locating specific text inside them difficult unless one is in possession of copious amounts of free time and dedication (see the previous blog post about this challenge here: https://libraryblogs.is.ed.ac.uk/librarylabs/2017/06/23/automated-item-data-extraction-from-old-manuscripts/).

Our implementation of IIIF as the primary delivery method for digital images at Edinburgh (some highlights of our IIIF collections are at the bottom of Scott’s blog here: https://libraryblogs.is.ed.ac.uk/librarylabs/2018/12/13/edinburgh-hosts-international-iiif-event/) opens up a route not just to providing the OCR text alongside the images, but also to enabling native searching within the volume images inside an IIIF viewer such as UniversalViewer or Mirador.

IIIF Search

Currently, searching is usually performed before and outside of the viewing experience, with the chosen result then loaded in a viewer. Searching within a volume therefore offers different possibilities for end-users during their journeys across the collections. This is achieved by using a service that is capable of serving the IIIF Search API.

So far, not many such services exist in the open source world, with the IIIF Awesome list having just one entry under Content Search Services: NCSU Libraries’ Ocracoke project, which is a Rails-based full workflow solution that can also process and OCR the documents prior to serving them via IIIF. Whilst other institutions do provide IIIF Search on their holdings, these implementations may be an integral part of their digital delivery stack and not easily separable for release, internal-only projects, instances of Ocracoke, and so on.

As the OCR here at Edinburgh falls under a different part of our workflow (of which more in a future blog post) and we are primarily working with PHP and Python, I decided to implement a simple Python service capable of supporting the Search API. The project is written using Flask, a lightweight Python web framework, and is backed by Apache Solr to provide the text searching. A simple service needed a simple name, and so Whiiif (Word Highlighting for IIIF) was born.

Whiiif v1

Initially, I adopted the model used by Ocracoke: indexing the text of each whole page in Solr, and using an array of word->co-ordinate mappings for each page image. When a search is made in Solr, each document is returned using the native Highlighting feature of Solr, which returns a fragment of text, with the matching words bracketed by <em> tags.
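As a rough illustration of the highlighting step, the matched words can be pulled out of a highlight fragment with a regular expression. This is only a sketch: the fragment below is made up, and the function name highlighted_terms is hypothetical rather than taken from the Whiiif code.

```python
import re

# Made-up highlight fragment, shaped like the <em>-bracketed text described above.
snippet = "the said <em>petition</em> was heard before the <em>Lords</em> of Session"


def highlighted_terms(fragment):
    """Return the words wrapped in <em> tags, in order of appearance."""
    return re.findall(r"<em>(.*?)</em>", fragment)


terms = highlighted_terms(snippet)
print(terms)
```

Each extracted word can then be used as a key into the page's word->co-ordinate array.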

The word-co-ordinate mappings for each page are extracted from the ALTO-XML generated by the Tesseract OCR process and stored in Solr as JSON, alongside the raw text. Producing the IIIF Search API response then becomes a case of extracting the matched words from the Solr highlight result, popping the co-ordinates for each word, and generating the response JSON for the client. The initial version of Whiiif using this approach can be found on the Whiiif GitHub repo at commit af8a903.
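The pipeline described above can be sketched roughly as follows. This is not the Whiiif source: the ALTO sample is hand-written, the URIs and the annotation id scheme are placeholders, the function names are hypothetical, and the sketch assumes the ALTO co-ordinates are already in image pixels. The response shape follows the basic annotation list form of the IIIF Search API v1.

```python
import json
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO sample; real Tesseract output carries far more structure.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Session" HPOS="100" VPOS="200" WIDTH="90" HEIGHT="30"/>
    <String CONTENT="Papers" HPOS="200" VPOS="200" WIDTH="80" HEIGHT="30"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def word_boxes(alto_xml):
    """Map each lower-cased word to a list of [x, y, w, h] boxes, in page order."""
    boxes = {}
    for element in ET.fromstring(alto_xml).iter():
        # Compare on the local tag name so any ALTO namespace version works.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        box = [int(float(element.get(key)))
               for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")]
        boxes.setdefault(element.get("CONTENT").lower(), []).append(box)
    return boxes


def search_response(search_uri, canvas_uri, matched_words, boxes):
    """Build an annotation list with one annotation per matched word occurrence."""
    resources = []
    for word in matched_words:
        for x, y, w, h in boxes.get(word.lower(), []):
            resources.append({
                "@id": f"{search_uri}#hit-{len(resources)}",  # hypothetical id scheme
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {"@type": "cnt:ContentAsText", "chars": word},
                "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
            })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": search_uri,
        "@type": "sc:AnnotationList",
        "resources": resources,
    }


boxes = word_boxes(ALTO_SAMPLE)
stored = json.dumps(boxes)  # the JSON string that would sit beside the page text
response = search_response(
    "https://example.org/search/vol1?q=papers",  # placeholder service URL
    "https://example.org/iiif/vol1/canvas/1",    # placeholder canvas URI
    ["Papers"],
    json.loads(stored),
)
print(json.dumps(response, indent=2))
```

The xywh fragment on each annotation's "on" value is what lets a viewer such as UniversalViewer or Mirador draw the highlight box on the page image.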

This version was deployed and, during testing, raised issues when handling some of the documents in our collection:

  • Text-dense images, such as the Session Papers volumes, tended to have multiple instances of individual words on a page. Whilst this was not a problem for single-term searches (all instances would be found and the co-ordinates loaded), it caused a problem for phrase-based searching, where a word from the phrase could appear elsewhere on the page, before the match for the phrase, leading to incorrect word co-ordinates being retrieved from the array.
  • Some words were modified by Solr’s language processing during the ingest process, meaning matches were being returned for which the corresponding co-ordinates (generated from the pre-processed, raw text) could not be found.
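The first problem can be reproduced in a few lines. This is an illustrative sketch rather than code from Whiiif: the page data is invented and both function names are hypothetical. The naive lookup takes the first box recorded for each word, while the sequence lookup only accepts boxes from a run of consecutive words matching the whole phrase.

```python
# Invented page: (word, (x, y, w, h)) pairs in reading order.
page_words = [
    ("the", (10, 10, 30, 12)),
    ("petition", (45, 10, 70, 12)),
    ("of", (10, 30, 18, 12)),
    ("the", (32, 30, 30, 12)),
    ("court", (66, 30, 50, 12)),
]


def naive_lookup(phrase, words):
    """Take the first box recorded for each word, ignoring where it sits on the page."""
    first_box = {}
    for text, box in words:
        first_box.setdefault(text, box)
    return [first_box[term] for term in phrase.split()]


def sequence_lookup(phrase, words):
    """Return boxes only for the first run of consecutive words matching the phrase."""
    terms = phrase.split()
    texts = [text for text, _ in words]
    for start in range(len(texts) - len(terms) + 1):
        if texts[start:start + len(terms)] == terms:
            return [box for _, box in words[start:start + len(terms)]]
    return []


naive = naive_lookup("of the court", page_words)
exact = sequence_lookup("of the court", page_words)
print(naive[1])  # box for the earlier "the" on the first line
print(exact[1])  # box for the "the" that is actually part of the phrase
```

The sequence approach needs the full ordered word list for the page at query time, which is the extra cost that comes with having Solr return the entire page text.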

There were approaches to solving these problems, such as forcing Solr to return the entire page text via the Highlighter, so that the correct instance of repeated words could be ascertained. However, this led to a significant increase in the processing time required to generate the response for each hit, as well as greater resource requirements for Solr, so I decided to try a different approach.

Whiiif v2

For the second iteration of Whiiif, I decided to investigate how feasible it would be to have Solr return the matching fragment from the ALTO-XML document, which would mean having the co-ordinates for the hits already in the Solr response. This ran into difficulties: Solr is designed for working with text and will not easily index or search an XML document. In fact, it usually strips all the XML data (which we wanted to preserve) by using the HTMLStripCharFilter during the indexing process. Even with the filtering removed from the processing chain, basic abilities such as phrase searching were lost because the text being searched took the form “word<xml fragment>word<xml fragment>word<xml fragment>…”, and there were false hits for words that appear in the ALTO-XML markup itself, such as “page”, “line”, “word”, etc.

Via the IIIF Slack, I was pointed towards the work of Johannes Baiter and the MDZ Digital Library team at the Bavarian State Library, who are developing a Solr plugin to resolve these various issues (available at https://github.com/dbmdz/solr-ocrhighlighting). I reworked the Solr controller for Whiiif to use the functionality of this plugin, keeping the work already done to provide IIIF Search API responses.

Following a couple of weeks of testing, and some very useful collaborative bug-fixing work with Johannes (primarily fixing some regexp bugs and improving the handling of ALTO files) and thanks to his speedy implementation of a feature request, Whiiif v2 was moved into internal production for some in-development project websites.

I then implemented a secondary feature in Whiiif: the ability to search across a collection as a whole, and have document hits, with snippets of page images returned (complete with visual highlighting), to complement the existing “Search Within” functionality of the IIIF Search API. This feature is also powered by the OCR Highlighting plugin, but returns a custom JSON format (although similar to the IIIF Search response format), allowing the front end controller of a collections site to customise the display of results to fit each individual site design.

The version of Whiiif with these capabilities is currently available on the “withplugin” branch of the Whiiif GitHub repo. It is still in heavy development and will become the master branch in the future, when it is a bit tidier!

Next Steps

The next steps with the Whiiif experiment are to prepare a formal release of Whiiif v2, with updated documentation, install instructions and full unit-test coverage. Keep an eye out here or on the GitHub repo for news. In the meantime, please feel free to clone the repo and experiment. Issues and PRs are always welcome, and you can also contact me on the IIIF Slack (as mbennett) or via email: mike.bennett@ed.ac.uk.

I’d love to hear from anyone playing around with Whiiif, or suggestions for other features. Experimental support for the “hits” property of IIIF Search v1 will arrive shortly, along with some updates to make use of the latest features of the Solr plugin.

Until next time 🙂
Mike

Posted in Uncategorized

New E-Journal – North American Journal of Celtic Studies

The Library now subscribes to the North American Journal of Celtic Studies.

The North American Journal of Celtic Studies (NAJCS) is devoted to the study of all of the disciplines that fall under the purview of the field of Celtic studies, including, but not limited to, archeology, art, folklore, history, law, linguistics, literature, manuscript studies, mythology, and politics. Contributions are welcome for all time periods from the ancient world to the present.

Access this journal via DiscoverEd or our e-journals A-Z list.

Posted in New e-resources

Scientific Analysis of Heritage Collections using XRF – Employ.ed Internship 2019

This week’s blog post comes from Cameron Perumal who recently began a 10-week Employ.ed internship in the Conservation Studio at the CRC… 

Two weeks into my Employ.ed internship, I have already learned so much about conservation and X-ray Fluorescence (XRF) spectrometry! I am currently an undergraduate Astrophysics student, and my internship involves working with Emily Hick, the Special Collections Conservator, to research ways in which XRF can help us understand more about the collections. I’ll also be doing outreach to increase awareness of XRF and how it can be used in conservation to improve the condition and understanding of the collections held by the University of Edinburgh.

By the end of my first week, I had started my radiation training, seen the XRF in action being used by another intern, Despoina, to analyse pigments of a painting on the soundboard of a harpsichord, and been able to see the various (frankly, quite beautiful) collections stored by the University.

Intern Despoina using the new XRF machine to analyse the pigments used on the soundboard paintings of harpsichords made by the Ruckers family


Posted in Internships

Library & University Collections Journal Club 2019-20

Thank you to everyone who joined us for our Library & University Collections Journal Club meetings in 2019. We’ve now planned a further programme of dates for the 2019/20 academic year.

Come and join us to talk about a recently published article each month from the field of Library & Information Science. We’re aiming to keep informed about practitioner research, and reflect on how theory relates to our practice. This is a great way for staff to develop their knowledge of the wider professional context for their continuing professional development.

This is an informal session open to staff from across Information Services Group as well as interns, volunteers and students on library & information related courses. We meet on the first Wednesday of each month, 12-1pm, alternately at Argyle House and the Digital Scholarship Centre, Main Library.  You can see the articles proposed for discussion on our Journal Club Resource List, and you can suggest articles to discuss each month.  Please come ready with your questions, comments and complaints about the article of the month.

Date               Location
04 September 2019  Argyle House Meeting Room 8
02 October 2019    Digital Scholarship Centre, Main Library
06 November 2019   Argyle House Meeting Room 8
04 December 2019   Digital Scholarship Centre, Main Library
08 January 2020    Argyle House Meeting Room 8
05 February 2020   Digital Scholarship Centre, Main Library
04 March 2020      Argyle House Meeting Room 8
01 April 2020      Digital Scholarship Centre, Main Library
06 May 2020        Argyle House Meeting Room 8
03 June 2020       Digital Scholarship Centre, Main Library
Posted in Library, Staff development

Baltic, Books and Solidarity: Gdańsk University of Technology (GUT) International Staff Week

Gdańsk University of Technology

I was delighted to be able to participate in the 4th International Staff Week at the Biblioteki Politechniki Gdańskiej recently. I work as the Senior Photographer for Edinburgh University’s Library and University Collections, so when I saw that the programme included a visit to the Pomeranian Digital Library it looked like a great opportunity. Additionally, this was the home institution of one of the delegates on our own Knowledge Exchange Week in 2018, allowing further development of previous Erasmus links.


Posted in Library, LLC general, News

Edinburgh Research Archive Statistics: May 2019

Posted in Collections, Library, Open Access

Edinburgh Research Explorer Statistics: May 2019

Posted in Green OA, Open Access

‘This Single Song of Two’: Centenary of the Marriage of Edwin and Willa Muir

7 June 2019 marks the centenary of the marriage of Edwin and Willa Muir, one of Scottish literature’s great creative partnerships. Acclaimed in their own right as poet and novelist respectively, they worked together as a translating team to bring the novels and stories of Franz Kafka to an English-speaking audience.

Edinburgh University holds a number of remarkable documents, bearing witness to their long and exceptionally close union.

Posted in Personal Papers, Scottish Literary Collections, Uncategorized

Normandy landings: through our digital primary sources

On this day, 6 June, 75 years ago, the Normandy landings took place. This was part of a major combined naval, air and land assault on German-occupied France by Allied forces, codenamed Operation ‘Overlord’. The D-Day landings saw around 150,000 Allied troops land on French soil, but it was just the start of a much longer operation to liberate France. In this week’s blog post I have pulled together just a small selection of our digital library resources that will help you explore the Normandy landings, the events leading up to them and the aftermath. You can also use many of these to find out more about the many other events happening around this time that contributed to the end of the Second World War.

D-Day For the Second Front, ‘Illustrated London News’, Saturday 10 June 1944, pp. 644-645. From Illustrated London News Archive.

What did the papers say?

Operation Overlord was top secret, so it wasn’t until the 6th June that news of the invasion began to filter through. Reports of the Normandy landings do appear in some late editions of newspapers from that day, but the landings are mostly covered in issues published the next day, 7th June, or on the next publication date.

Front page of the ‘Daily Express’, Wednesday 7 June 1944. From UK Press Online.

The Library subscribes to a large number of digitised newspaper archives that will allow you to see what events were being reported on at the time and how they were being reported. Read full text articles, compare how different newspapers were covering the same issues and stories and track coverage of Operation Overlord from the Normandy landings onwards.

Posted in Library, Online resource, Primary sources
