
Introducing Whiiif – Full text searching across image-based collections

Background

For historical collections digitisation projects inside the library, we are increasingly looking to provide OCR transcriptions of the documents alongside the digital images. In many cases, this can significantly enhance the usability of the images. For example, the volumes in the Session Papers Project are large and often lack an index, making it difficult to locate specific text within them (see the previous blog post about this challenge here: http://libraryblogs.is.ed.ac.uk/librarylabs/2017/06/23/automated-item-data-extraction-from-old-manuscripts/) unless one is in possession of copious amounts of free time and dedication.

Our implementation of IIIF as the primary delivery method for digital images at Edinburgh (some highlights of our IIIF collections are at the bottom of Scott’s blog here: http://libraryblogs.is.ed.ac.uk/librarylabs/2018/12/13/edinburgh-hosts-international-iiif-event/) opens up a vector not just for providing the OCR text alongside the images, but also for enabling native searching within the volume images inside an IIIF viewer such as UniversalViewer or Mirador.

IIIF Search

Currently, searching is usually performed before and outside of the viewing experience, with the chosen result then loaded in a viewer. Searching within a volume therefore offers different possibilities for the end-user during their journeys across the collections. This is achieved by using a service that provides the IIIF Search API.
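To make that concrete, here is a rough sketch of the shape of a Search API exchange: a viewer sends the user’s query to the search service advertised in the manifest and gets back an annotation list whose entries point at regions of canvases. All URLs and the matched term below are invented for illustration.

```python
# A viewer queries the search service advertised in the manifest, e.g.:
#   GET https://example.org/iiif/volume-1/search?q=advocate
# and receives a IIIF Content Search 1.0 style response along these lines:
example_response = {
    "@context": "http://iiif.io/api/search/1/context.json",
    "@id": "https://example.org/iiif/volume-1/search?q=advocate",
    "@type": "sc:AnnotationList",
    "resources": [
        {
            "@id": "https://example.org/iiif/volume-1/annotation/match-1",
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": "advocate"},
            # the canvas URI plus the pixel region of the matched word,
            # which is what the viewer uses to draw the highlight box
            "on": "https://example.org/iiif/volume-1/canvas/p42#xywh=210,388,164,32",
        }
    ],
}
```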

So far, not many such services exist in the open-source world, with the IIIF Awesome list having just one entry under Content Search Services: NCSU Libraries’ Ocracoke project, a Rails-based full-workflow solution that can also process and OCR the documents prior to serving them via IIIF. Whilst other institutions do provide IIIF Search on their holdings, these implementations tend to be integral parts of their digital delivery stacks and not easily separable for release, internal-only projects, instances of Ocracoke, and so on.

As the OCR here at Edinburgh falls under a different part of our workflow (of which more in a future blog post) and we are primarily working with PHP and Python, I decided to implement a simple Python service capable of supporting the Search API. The project is written using Flask, a lightweight Python web framework, and backed by Apache Solr to provide the text searching. A simple service needed a simple name, and so Whiiif (Word Highlighting for IIIF) was born.

Whiiif v1

Initially, I adopted the model used by Ocracoke: indexing the text of each whole page in Solr, and keeping an array of word->co-ordinate mappings for each page image. When a search is made, matches are returned using Solr’s native Highlighting feature, which provides a fragment of text with the matching words bracketed by <em> tags.

The word-co-ordinate mappings for each page are extracted from the ALTO-XML generated by the Tesseract OCR process and stored in Solr as JSON, alongside the raw text. Producing the IIIF Search API response then becomes a case of extracting the matched words from the Solr highlight result, popping the co-ordinates for each word, and generating the response JSON for the client. The initial version of Whiiif using this approach can be found on the Whiiif Github repo at commit af8a903.
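As an illustration of that ingest step, the sketch below pulls word-level co-ordinates out of an ALTO file using only the standard library. The element and attribute names (String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) are standard ALTO, but the namespace handling and the output shape are simplified assumptions rather than Whiiif’s actual code.

```python
import json
import xml.etree.ElementTree as ET

def alto_words(alto_path):
    """Return a list of {word, x, y, w, h} dicts from an ALTO-XML file."""
    words = []
    for el in ET.parse(alto_path).iter():
        # ALTO files are namespaced, so compare on the local tag name only.
        if el.tag.rsplit("}", 1)[-1] == "String":
            words.append({
                "word": el.get("CONTENT"),
                "x": int(float(el.get("HPOS"))),
                "y": int(float(el.get("VPOS"))),
                "w": int(float(el.get("WIDTH"))),
                "h": int(float(el.get("HEIGHT"))),
            })
    return words

# Stored alongside the raw page text in the Solr document, e.g.:
# doc["word_coords"] = json.dumps(alto_words("page_0042.xml"))
```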

This first version was deployed, and during testing it raised issues when handling some of the documents in our collection:

  • Text-dense images, such as the Session Papers volumes, tended to have multiple instances of individual words on a page. Whilst this was not a problem for single-term searches (all instances would be found and the co-ordinates loaded), it caused a problem for phrase-based searching, where a word from the phrase could appear elsewhere on the page, before the match for the phrase, leading to incorrect word co-ordinates being retrieved from the array.
  • Some words were modified by Solr’s language processing during the ingest process, meaning matches were being returned for which the corresponding co-ordinates (generated from the pre-processed, raw text) could not be found.

There were approaches to solving these problems, such as forcing Solr to return the entire page text via the Highlighter so that the correct instance of repeated words could be ascertained. However, this led to a significant increase in the processing time required to generate the response for each hit, as well as greater resource requirements for Solr, so I decided to try a different approach.

Whiiif v2

For the second iteration of Whiiif, I decided to investigate how feasible it would be to have Solr return the matching fragment from the ALTO-XML document itself, which would mean the co-ordinates for the hits were already in the Solr response. This ran into difficulties: Solr is designed for working with text and will not easily index or search an XML document, usually stripping out all the XML data (which we wanted to preserve) with the HTMLStripCharFilter during the indexing process. Even with that filtering removed from the processing chain, basic abilities such as phrase searching were lost, because the text being searched took the form “word<xml fragment>word<xml fragment>word<xml fragment>…”, and there were false hits for words that appear in the ALTO-XML markup itself, such as “page”, “line” and “word”.

Via the IIIF Slack, I was pointed towards the work of Johannes Baiter and the MDZ Digital Library team at the Bavarian State Library, who are developing a Solr plugin to resolve these various issues (available at https://github.com/dbmdz/solr-ocrhighlighting). I reworked the Solr controller for Whiiif to use the functionality of this plugin, keeping the work already done to provide IIIF Search API responses.

Following a couple of weeks of testing, some very useful collaborative bug-fixing work with Johannes (primarily fixing some regexp bugs and improving the handling of ALTO files), and his speedy implementation of a feature request, Whiiif v2 was moved into internal production for some in-development project websites.

I then implemented a secondary feature in Whiiif: the ability to search across a collection as a whole and have document hits returned with snippets of the page images (complete with visual highlighting), to complement the existing “Search Within” functionality of the IIIF Search API. This feature is also powered by the OCR Highlighting plugin, but returns a custom JSON format (similar to the IIIF Search response format), allowing the front-end controller of a collections site to customise the display of results to fit each individual site design.

The version of Whiiif with these capabilities is currently available on the “withplugin” branch of the Whiiif GitHub repo, although it is still in heavy development and will become the master branch once it is a bit tidier!

Next Steps

The next steps for the Whiiif experiment are to prepare a formal release of Whiiif v2, with updated documentation, install instructions and full unit-test coverage; keep an eye out here or on the GitHub repo for news. In the meantime, please feel free to clone the repo and experiment. Issues and PRs are always welcome, and you can also contact me on the IIIF Slack (as mbennett) or via email: mike.bennett@ed.ac.uk.

I’d love to hear from anyone playing around with Whiiif, or any suggestions for other features. Experimental support for the “hits” property of IIIF Search v1 will arrive shortly, along with some updates to make use of the latest features of the Solr plugin.

Until next time 🙂
Mike

Edinburgh hosts international IIIF event

Jointly with the National Library of Scotland, the University hosted the annual IIIF Showcase and Working Meeting from December 3-6. As consortial members, both institutions saw it as a good opportunity to raise their profile within this fast-growing community, and for delegates from all over the world to see Edinburgh in winter while making the most of face-to-face discussions about recent developments and the future direction of the framework.

The Showcase took place in the Royal Society of Edinburgh, and this reasonably light-touch session offered an introduction to the concepts and tools, and gave the host institutions a chance to talk about what they’ve produced so far. It was also IIIF’s new managing director Josh Hadro’s first week in the job: a great way for him to see the community in action! In the afternoon, delegates repaired to the NLS and Main Library for breakout sessions in key content areas (Archives, Museums, Digital Scholarship) as well as deeply technical and ‘getting started’ sessions. To finish, everyone made for St Cecilia’s Hall for a round-up of the day; this was an appropriate setting, as we’ve employed IIIF in the museum’s corresponding collections site.

The Working Group meeting ran over the succeeding three days, in the ECCI and Main Library. This was a smaller undertaking than the Showcase, but it still attracted 70 delegates. There were some really meaty discussions about the direction of the framework: cookbooks and use cases; updates to the Mirador viewer; enhancing the APIs and registries (including more work on authentication and various types of search); and the amazing potential of 3D and AV (e.g. subtitle support, or musical notation written out as a piece plays), which is something we at the University are well placed to start work on. There were also discussions about the direction of the community and outreach group; this session was led by our (until very recently) very own Claire Knowles, now Assistant Director at Leeds University Library. The first meeting of the Technical Review Committee, which rubber-stamps the specs, also took place at the event, in the huge Dining Room at Teviot.

With increasing engagement across the industry, IIIF’s future looks very bright indeed.

Thanks to everyone who helped out over the week, with a particularly big round of applause for IIIF’s Technical Co-ordinator Glen Robson, who is well known to many people in the Library from his previous incarnation as Development Manager at the National Library of Wales.

To (self-indulgently) end the post, here is a little hi-res illustration of the work that we have done at Edinburgh with IIIF.

This is heavily annotated! If you click the speech bubbles, you will turn on annotations, some of which link out to relevant websites (links have a dotted line under the text). Also, the Mirador viewer does comparison very well, so if you

  • click the four-square icon in the top left
  • select ‘Add Slot Right’
  • click ‘Add Item’
  • double click the manifest (‘IIIF Highlights…’)
  • select the right image

…you can see the previous version of this picture to see where improvements were made. All of this will go better if you make it full-screen!

Marking up collections sites with Schema.org – Blog 2

This blog was written by Nandini Tyagi, who was working with Holly Coulson on this project.

This blog follows the blog post written by Holly Coulson, my fellow intern on the Library Metadata internship with the Digital Development team at Argyle House. For the benefit of those reading the second blog first, I’ll quickly recap what the project was about. The Library Digital Development team have built a number of data-driven websites over the past four years to surface the collections content; the sites cover a range of disciplines, from Art to Collections Level Descriptions, Musical Instruments to Online Exhibitions. While they work with metadata standards (Spectrum and Dublin Core) and accepted retrieval frameworks (SOLR indices), they are not particularly rich in a semantic sense, and they do not benefit from advances in browser ‘awareness’; specifically, they do not use linked data. The goal of the project was to embed schema.org metadata within the online records of items within Library and University Collections, to bring an aspect of linked data into the sites’ functionality and, as a result, increase the sites’ discoverability through generic search engines (Google etc.) and, more locally, the University’s own website search (Funnelback).

In this blog, I’ll share my experience from the implementation perspective, i.e. how the documented mappings were translated into action and what the key findings from this project are. Before I dive into the implementation, it is important to know what Schema is and how it benefits us. Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond. The Schema.org vocabulary can be used with many different encodings, including RDFa, Microdata and JSON-LD. These vocabularies cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model.

I’ll explain the need for Schema through an example. Most webmasters are familiar with HTML tags on their pages. Usually, HTML tags tell the browser how to display the information included in the tag. For example, <h1>Titanic</h1> tells the browser to display the text string “Titanic” in a heading 1 format. However, the HTML tag doesn’t give any information about what that text string means: “Titanic” could refer to the hugely successful movie, or it could refer to the name of a ship, and this can make it more difficult for search engines to intelligently display relevant content to a user. In our collections, for example, when search engines crawl the pages they will not understand whether a certain title refers to a painting, a sculpture or an instrument. Using Schema, we can help search engines make sense of the information on the webpage, so that our websites can be listed at a higher rank for relevant queries.
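As a sketch of what such markup looks like in practice, the snippet below builds a JSON-LD block (one of the encodings mentioned above) for a collection item using the CreativeWork type. The record values and field names are invented for illustration rather than taken from a real collections record.

```python
import json

# A hypothetical collection record, as it might come out of a site's metadata store.
record = {
    "title": "Anatomical Figure of a Horse (ecorche)",
    "creator": "Unknown",
    "collection": "Torrie Collection",
    "description": "Bronze anatomical figure of a horse.",
}

# The same record expressed as schema.org JSON-LD.
json_ld = {
    "@context": "https://schema.org",
    "@type": "CreativeWork",
    "name": record["title"],
    "creator": record["creator"],
    "isPartOf": record["collection"],
    "description": record["description"],
}

# Embedded in the page as: <script type="application/ld+json"> ... </script>
print(json.dumps(json_ld, indent=2))
```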

Before: No sense of linked data. Search engines would read the title and never understand that these pages refer to items that are a part of creative work collections.

Anatomical Figure of a Horse (ecorche), part of Torrie Collection (before)

Keeping this in mind, we started mapping the schema specification in the record files of the website. Certain fields, such as ‘Inscriptions’ and ‘Provenance’, could not be mapped because Schema does not yet support a specification for them. We are documenting all such findings and plan to suggest additions to Schema.org.

Implementation of mapping in the configuration file of Art collections

This was followed by implementing changes directly in the record files of the collection. There were a lot of challenges, such as marking up images, video and audio. For images especially, some websites used the IIIF format and others used bitstreams, which required careful decisions about how to mark up each site. With a lot of help and guidance from Scott we were able to resolve these issues, and the end result of the efforts was absolutely rewarding.

After: the websites have been marked up using Schema’s CreativeWork class. Now, search engines can see that these items belong to a creative work category such as art, sculpture or instrument, and that they carry other details such as the name of the creator, the name of the collection and a description. These changes are bound to increase the discoverability of the collections websites.

Anatomical Figure of a Horse (ecorche), part of Torrie Collection (after)

I was very fortunate that Google Analytics and Search Engine Optimisation (SEO) training sessions were held at Argyle House during my internship, and I was able to get insights from them. They illuminated a whole new direction and lent a different viewpoint. The SEO workshops in particular gave ideas about optimizing the overall content of the websites and making St Cecilia’s more discoverable. I realized that the benefits of SEO are seen in the form of increased traffic, return on investment (ROI), cost effectiveness, increased site usability and brand awareness. These tools, combined with schema, can add significant value to library services everywhere. It is a source of immense pride that our university is among the very few that have employed schema for their collections. We are confident that schema will make the collections more accessible not only to students but also to people worldwide who want to discover the jewels held in these collections. The 13 collections websites that we worked on during our internship are nearing completion from an implementation perspective and will soon go live with Schema markup.

To conclude, my internship has been an amazing experience, both in terms of meaningful strides towards marking up the collections websites and the fun, supportive work culture. Having never worked on the web development side before, I got the opportunity to understand first-hand the intricacies involved in anticipating the needs of users and delivering an information-rich website experience. As the project culminates in July, I am happy to have learned and contributed much more than I imagined I would in the course of my internship.

Nandini Tyagi (Library Digital Development)

Thanks very much to both Nandini and Holly for their sterling work on this project. We’ve implemented schema in 3 sites already (look at https://collections.ed.ac.uk/art for an example), and we have another 6 in GitHub ready to be released. The interns covered a wealth of data, and we think we’re in a position to advise 1) LTW that this material can now be better used in the University website’s search and 2) prospective developers on how to apply this concept to their sites.

Scott Renton (Library Digital Development)

Marking up collections sites with Schema.org

This blog was written by Holly Coulson, who is working with Nandini Tyagi  on this project.

For the last three months, I have been undertaking a Library Metadata internship with the Digital Development team over in Argyle House. Having the opportunity to see behind the scenes into the more technical side of the library, I gained far more programming experience than I ever imagined I would, and worked hands-on with the collections websites themselves.

Most people’s access to the University of Edinburgh’s holdings is primarily through DiscoverEd, but the Collections websites exist as a complete archive of all of the materials that the University holds, from historical alumni to part of the St Cecilia’s Hall musical instrument collection. These sites are currently going through an upgrade, with new interfaces and front pages.

My role in all of this, along with my fellow intern Nandini, was to analyse and improve how the behind-the-scenes metadata works in the grander scheme of things. As the technological world continually moves towards linked data and the interconnectivity of the Internet, developers are consciously having to update their websites to include more structured data and mark-up. That was our job.

Our primary aim was to implement Schema.org in the collections.ed.ac.uk spaces. Schema, an open-access structured data vocabulary, allows major search engines, such as Google and Yahoo, to pull data and understand what it represents. For collections, you can label the title as ‘schema:name’ and even individual ID numbers as ‘schema:Identifier’. This allows for far more reliable searching, and allows websites to work with a standardised system that can be parsed by search engines. The ultimate aim is to optimise searches, both in the ed.ac.uk domain and in the larger search engines.

Our first major task was research. As two interns, both doing our Masters, we had never heard of Schema.org, let alone seen it implemented. We analysed various collections sites around the world and found that their use of Schema was minimal. Even large sites, such as the Louvre or the National Gallery, didn’t include any schema within their record pages.

With minimal examples to go on, we decided to just start mapping our metadata fields to the schema vocabulary, to get a handle on how Schema.org worked in relation to collections. Some mappings were very straightforward, such as title, author and date. The basic metadata was relatively easy to map, and this allowed us to quickly work through the 11 sites that we were focusing on. There were, however, more challenging aspects to the mapping process, which took far longer to figure out. Schema is rather limited when it comes to collection-specific detail. While bib.schema, an extension that exists specifically for bibliographic information, does help, there is little scope for more specific collection parameters. There were many debates over whether ‘bib.Collection’ or ‘isPartOf’ worked better for describing a collection, and whether it was viable to have 4 separate ‘description’ fields for different variations of abstracts, item descriptions, and other general information.

The initial mappings for /art, showing relatively logical and simple fields
More complicated mappings in /stcecilias, with a lot of repetition and similar fields, and ones that were simply not possible, or weren’t required.

There were other, more specific, fields that threw up particular problems. The ‘dimensions’ field is always a combined height and width, whereas Schema.org only deals with individual values: a separate height and a separate width. It wasn’t until we’d mapped everything that we realised this, and we had to rethink our mapping. There were also many considerations about whether Schema.org was actually required for every piece of metadata. Would linking specific cataloguing and preservation descriptions actually be useful? How often would users actually search for these items? Would using schema.org actually help these fields? We continually had to consider how users actually searched and explored the websites, and whether adding the schema would aid search results. By using a sample of Google Analytics data, we were able to narrow down what should actually be included in the mapping. We ended up with 11 tables, for 11 websites (see above), offering a precise mapping that we could implement straight away.

The next stage was figuring out how to build the schema into the record pages themselves. Much of the Collections website runs on PHP, which takes the information directly from the metadata files and places it in a table when you click on an individual item. Schema.org, in its simplest form, is written in HTML, but it would be impossible to go through every single record manually and mark it up. Instead, we had to work with the record configuration files so that variables could be tested for schema: if a field had a schema definition, the schema.org tag was printed around it as an automated process. This was further complicated by the filters that are used, meaning several sets of code were often required to formulate all the information on the page. As someone who had never worked with PHP before, it was definitely a learning curve. Aided by various courses on search engine optimisation and Google Analytics, we became increasingly confident in our work.
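The production sites do this in PHP inside the record configuration files, but the principle can be sketched in a few lines of Python: a field-to-property mapping drives the automated wrapping, and unmapped fields are printed exactly as before. The field names and the mapping below are illustrative, not the real configuration.

```python
# Illustrative field -> schema.org property mapping, of the kind captured in our mapping tables.
SCHEMA_MAP = {
    "title": "name",
    "maker": "creator",
    "collection": "isPartOf",
    "description": "description",
}

def render_field(field, value):
    """Wrap a record value in a microdata itemprop span if the field is mapped."""
    prop = SCHEMA_MAP.get(field)
    if prop is None:
        return value  # unmapped fields are rendered as before
    return f'<span itemprop="{prop}">{value}</span>'

# render_field("title", "Anatomical Figure of a Horse")
# -> '<span itemprop="name">Anatomical Figure of a Horse</span>'
```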

Our first successes were uploading both the /art and /mimed collections schema onto the test servers, receiving only two minimal errors. This proved that our code worked and that we were returning positive results. By using a handy Chrome plugin, we were able to see whether the code was actually offering readable Schema.org that linked all of our data together.

The plugin, OpenLink Structured Data Sniffer, showing the schema and details attributed to the individual record. In this case, Sir Henry Raeburn’s painting of John Robison.

As we come to the final few weeks of our internship, we’ve learnt far more about linked data and search engine optimisation than we imagined. Being able to directly work with the collections websites gave us a greater understanding of how the library works overall. I am personally in the MSc Book History and Material Culture programme, a very hands-on, physical programme, and exploring the technical and digital side of collections has been an amazingly rewarding experience that has aided my studies. Some of our schema.org coding will be rolled out to go live before we finish, and we have realised the possibilities that structured data can bring to library services. We hope we have allowed one more person to find the artwork or musical instrument they were looking for. While time will tell how effective schema.org is in the search results pages, we are confident that we have helped the collections become even more accessible and searchable.

Holly Coulson (Library Digital Development)

IIIF Conference, Washington, May 2018

Washington Monument & Reflecting Pool

We (Joe Marshall (Head of Special Collections) and Scott Renton (Library Digital Development)) visited Washington DC for the IIIF Conference from 21st to 25th May. This was a great opportunity for L&UC, not only to visit the Library of Congress (the mecca of our industry in some ways) but also to come back with a wealth of knowledge which we can use to inform how we operate.

Edinburgh gave two papers: the two of us delivered a talk on Special Collections discovery at the Library and how IIIF could make it all more comprehensible (including the Mahabharata Scroll), and Scott spoke with Terry Brady of Georgetown University, showing how IIIF has improved our respective repository workflows.

On a purely practical level, it was great to meet face to face with colleagues from across the world: we have a very real example of a problem solved with Drake from LUNA, which we hope to be able to show very soon. It was also interesting to see how the API specs are developing. The Presentation API will be enhanced with AV in version 3, and we can already see some use cases with which to try this out; Search and Discovery are APIs we’ve done nothing with, but they will help with searching within and across items, which is essential to our estate of systems; and 3D, while not having an API of its own, is also being addressed by IIIF, and it was fascinating to see the work that Universal Viewer and Sketchfab (which the DIU use) are doing to accommodate it.

The community groups are growing too, and we hope to increase our involvement with some of the less technical areas (Manuscripts, Museums, and the newly proposed Archives group) in the near future.

Among a wealth of great presentations, we’ve each identified one as our favourite:

Scott: Chifumi Nishioka – Kyoto University, Kiyonori Nagasaki – The University of Tokyo: Visualizing which parts of IIIF images are looked by users

This fascinating talk highlighted IIIF’s ability to reveal which parts of an image are most popular when zoomed in. Often this is done by installing special tools such as eye trackers, but because of the nature of IIIF, where the requested region appears as part of the URL, the same information can be visualised by interrogating Apache access logs. Chifumi and Kiyonori have been able to generate heatmaps of the most interesting regions of an item, and the code can be re-used if the logs can be supplied.
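As a rough illustration of the idea (not Chifumi and Kiyonori’s actual code), the IIIF Image API puts the requested region directly in the URL path, so a crude popularity count can be built from nothing more than the access logs. The path prefix and log format below are assumptions.

```python
import re
from collections import Counter

# IIIF Image API requests look like:
#   /iiif/<identifier>/<x>,<y>,<w>,<h>/<size>/<rotation>/default.jpg
# so the zoomed-in regions users asked for can be read straight out of the log.
REGION_RE = re.compile(r"GET /iiif/([^/]+)/(\d+),(\d+),(\d+),(\d+)/")

def region_counts(log_path):
    """Count how often each (identifier, x, y, w, h) region was requested."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            m = REGION_RE.search(line)
            if m:
                ident, x, y, w, h = m.groups()
                counts[(ident, int(x), int(y), int(w), int(h))] += 1
    return counts  # these counts can then be rendered as a heatmap over the image
```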

Joe: Kyle Rimkus – University of Illinois at Urbana-Champaign, Christopher J. Prom – University of Illinois at Urbana-Champaign: A Research Interface for Digital Records Using the IIIF Protocol

This talk showed the potential of IIIF in the context of digital preservation, providing large-scale public access to born-digital archive records without having to create exhaustive item-level metadata. The IIIF world is encouraging this kind of blue-sky thinking, which is going to challenge many of our traditional professional assumptions and allow us to be more creative with collections projects.

It was a terrific trip, which has filled us with enthusiasm for pushing on with IIIF beyond its already significant place in our set-up.

Joe Marshall & Scott Renton

Library Of Congress Exhibition

Museums sites go IIIF

The main portal into Library and University Collections’ Special Collections content, collections.ed.ac.uk, is changing. A design overhaul which will improve discovery both logically and aesthetically is coming very soon, but in advance, we’ve implemented an important element of functionality, namely the IIIF approach to images.

Two sites have been affected to that end: Art (https://collections.ed.ac.uk/art – 2859 IIIF images in 2433 manifests across 4715 items) and Musical Instruments (https://collections.ed.ac.uk/mimed – 8070 IIIF images in 4097 manifests across 5105 items) now feature direct IIIF thumbnails, embedded image zooming and manifest availability. A third site, the St Cecilia’s Hall collection (https://collections.ed.ac.uk/stcecilias), already had the first two elements, but manifests for its items are now available to the user.

What does this all mean? To take each element in turn:

Direct IIIF thumbnails

The search results pages on the site no longer directly reference images on the collections.ed servers, but bring in a LUNA URL using the IIIF Image API format, which offers the user flexibility on size, region, rotation and quality.
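For anyone unfamiliar with the Image API, the URL pattern is {server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}, so a thumbnail is just one particular choice of those parameters. The helper below is a generic sketch with a made-up server and identifier, not our actual LUNA configuration.

```python
def iiif_image_url(server, identifier, region="full", size="!200,200",
                   rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF Image API URL; region, size, rotation and quality all live in the path."""
    return f"{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# A thumbnail bounded to 200px (hypothetical server and identifier):
print(iiif_image_url("https://images.example.org/iiif", "art-0042"))
# -> https://images.example.org/iiif/art-0042/full/!200,200/0/default.jpg

# The same image cropped to a region and served at full size:
print(iiif_image_url("https://images.example.org/iiif", "art-0042",
                     region="400,600,1200,800", size="full"))
```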

Embedded image zooming

Using IIIF images served from the LUNA server and the OpenSeadragon viewer, images can now be zoomed directly on the page, where previously we needed an additional link out to the LUNA repository.

Manifest availability

Based on the images attached to the record in the Vernon CMS, we have built IIIF manifests and made them available, one per object. A manifest is a set of presentation instructions for rendering a set of images according to curatorial choice, and manifests can be dropped into standard IIIF viewers. We have created a button to present them in Universal Viewer (UV), and will be adding another to bring in Mirador in due course.
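To give a feel for what one of these manifests contains, here is a heavily trimmed sketch of a Presentation API 2.x manifest with a single canvas. All identifiers are invented, and real manifests carry considerably more descriptive metadata.

```python
manifest = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://collections.example.org/iiif/object-1/manifest",
    "@type": "sc:Manifest",
    "label": "Example object",
    "sequences": [{
        "@type": "sc:Sequence",
        "canvases": [{
            "@id": "https://collections.example.org/iiif/object-1/canvas/1",
            "@type": "sc:Canvas",
            "label": "Image 1",
            "width": 4000,
            "height": 3000,
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": "https://collections.example.org/iiif/object-1/canvas/1",
                "resource": {
                    "@id": "https://images.example.org/iiif/img-1/full/full/0/default.jpg",
                    "@type": "dctypes:Image",
                    "service": {
                        "@context": "http://iiif.io/api/image/2/context.json",
                        "@id": "https://images.example.org/iiif/img-1",
                        "profile": "http://iiif.io/api/image/2/level1.json",
                    },
                },
            }],
        }],
    }],
}
```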

Watch this space for more development on these sites in the very near future. The look-and-feel will change significantly, but the task will be made easier with IIIF as a foundation.

Scott Renton, Digital Development

Automated item data extraction from old documents

Overview

The Problem

We have a collection of historic papers from the Scottish Court of Session. These are collected into cases and bound together in large volumes, with no catalogue or item data other than a shelfmark. If you wish to find a particular case within the collection, you are restricted to a manual, physical search of likely volumes (if you’re lucky you might get an index at the start!).

Volumes of Session Papers in the Signet Library, Edinburgh

The Aim

I am hoping to use computer vision techniques, OCR, and intelligent text analysis to automatically extract and parse case-level data in order to create an indexed, searchable digital resource for these items. The Digital Imaging Unit have digitised a small selection of the papers, which we will use as a pilot to assess the viability of the above aim.

Stage One – Image preparation

Using Python and OpenCV to extract text blocks

I am indebted to Dan Vanderkam‘s work in this area, especially his blog post ‘Finding blocks of text in an image using Python, OpenCV and numpy’, upon which this work is largely based.

The items in the Scottish Session Papers collection differ from the images that Dan was processing, being images of older works, which were printed with a letterpress rather than being typewritten.

The Session Papers images lack a delineating border, backing paper, and the other features that were used to ease the image processing. In addition, the amount, density and layout of text varies enormously across the corpus, further complicating the task.

The initial task is to find a crop of the image to pass to the OCR engine. We want to give it as much text as possible in as few pixels as possible!

Due to the nature of the images, there is often a small amount of text from the opposite page visible (John’s blog explains why) and so to save some hassle later, we’re going to start by cropping 50px from each horizontal side of the image, hopefully eliminating these bits of page overspill.

A cropped version of the page

Now that we have the base image to work on, I’ve started with the simple steps of converting it to grayscale, and then applying an inverted binary threshold, turning everything above ~75% gray to white, and everything else to black. The inversion is to ease visual understanding of the process. You can view full size versions by clicking each image.

A grayscale version of the page
75% Threshold

The ideal outcome is that we eliminate smudges and speckles, leaving only the clear printed letters. This entailed some experimentation with the threshold level: as you can see in the image above, a lot of speckling remains. Dropping the threshold to only keep pixels above ~60% gray was a large improvement, and dropping it to ~45% even more so:

60% Threshold
45% Threshold

At a threshold of 45%, some of the letters are also beginning to fade, but this should not be an issue, as we have successfully eliminated almost all the noise, which was the aim here.
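In OpenCV terms, the steps so far look roughly like the sketch below. This is an illustration of the approach described above rather than the final code linked at the end of the post, and the threshold value is my reading of the ~45% level arrived at by experiment (45% of the 0-255 range).

```python
import cv2

img = cv2.imread("page_0001.jpg")

# Trim 50px from each horizontal side to drop the opposite-page overspill.
img = img[:, 50:-50]

# Grayscale, then an inverted binary threshold: pixels at or below ~45% of
# full white become white in the output, everything brighter becomes black,
# leaving (mostly) just the printed letters.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binarised = cv2.threshold(gray, int(0.45 * 255), 255, cv2.THRESH_BINARY_INV)

cv2.imwrite("page_0001_threshold.png", binarised)
```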

We’re still left with a large block at the top, which was the black backing behind the edge of the original image. To eliminate this, I experimented with several approaches:

  • Also cropping 50px from the top and bottom of the images – unfortunately this had too much “collateral damage”, as a large number of the images have text within this region.
  • Dynamic cropping based on removing any segments touching the top and bottom of the image – this was a more effective approach, but the logic for determining the crop became a bit convoluted.
  • Using Dan’s technique of applying Canny edge detection and then using a rank filter to remove ~1px edges – this was the most successful approach, although it still had some issues when the text had a non-standard layout.

I settled on the Canny/Rank filter approach to produce these results:

Result of Canny edge finder
With rank filter
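A sketch of that step, closely following the rank-filter idea in Dan’s post (the Canny thresholds, window sizes and rank below are illustrative, not necessarily the values used on the Session Papers):

```python
import cv2
import numpy as np
from scipy.ndimage import rank_filter

page = cv2.imread("page_0001_threshold.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(page, 100, 200)

# A ~1px-thick straight border has only a single edge pixel in a window running
# perpendicular to it, so the 4th-largest value in that window is zero; taking
# the minimum of the original edges and both filtered versions removes thin
# straight borders while leaving the denser, blobby letter edges intact.
maxed_rows = rank_filter(edges, -4, size=(1, 20))
maxed_cols = rank_filter(edges, -4, size=(20, 1))
debordered = np.minimum(np.minimum(edges, maxed_rows), maxed_cols)
```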

Next up, we want to find a set of masks that covers the remaining white pixels on the page. This is achieved by repeatedly dilating the image, until only a few connected components remain:

You can see here that the “faded” letters from the thresholding above have enough presence to be captured by the dilation process. These white blocks now give us a pretty good record of where the text is on the page, so we can move on to cropping the image.
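A minimal sketch of the dilation loop, assuming a 3x3 kernel and a stopping point of at most 16 blocks (both numbers are guesses for illustration):

```python
import cv2
import numpy as np

def dilate_until_few_components(binary, max_components=16, max_iterations=25):
    """Repeatedly dilate a binary image until only a few white blocks remain."""
    kernel = np.ones((3, 3), np.uint8)
    dilated = binary.copy()
    for _ in range(max_iterations):
        # connectedComponents counts the background as component 0.
        n_components, _ = cv2.connectedComponents(dilated)
        if n_components - 1 <= max_components:
            break
        # Each dilation merges nearby letters and words into larger blocks.
        dilated = cv2.dilate(dilated, kernel, iterations=1)
    return dilated
```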

Dan’s blog has a good explanation of solving the Subset Sum problem for a dilated image, so I will apply his technique (start with the largest white block, and add more blocks if they improve the number of covered white pixels at a favourable increase in total area, with some tweaking to the exact ratio):

With final bounding
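A rough sketch of that greedy selection, using contour bounding boxes as the white blocks; the weighting between white pixels gained and area added is an illustrative guess rather than Dan’s tuned ratio (and it uses the OpenCV 4.x findContours return signature):

```python
import cv2

def find_text_crop(dilated, precision_weight=0.25):
    """Greedily grow a bounding box that covers most white pixels cheaply."""
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    blocks = sorted(contours, key=cv2.contourArea, reverse=True)
    if not blocks:
        return None

    # Start from the largest block's bounding box...
    x, y, w, h = cv2.boundingRect(blocks[0])
    crop = (x, y, x + w, y + h)
    for c in blocks[1:]:
        bx, by, bw, bh = cv2.boundingRect(c)
        new_crop = (min(crop[0], bx), min(crop[1], by),
                    max(crop[2], bx + bw), max(crop[3], by + bh))
        # ...and only take in another block if the white pixels it adds
        # outweigh the growth in total crop area.
        gained_white = cv2.contourArea(c)
        grown_area = ((new_crop[2] - new_crop[0]) * (new_crop[3] - new_crop[1])
                      - (crop[2] - crop[0]) * (crop[3] - crop[1]))
        if gained_white > precision_weight * grown_area:
            crop = new_crop
    return crop

# x1, y1, x2, y2 = find_text_crop(dilated)
# final = original_image[y1:y2, x1:x2]
```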

So finally, we apply this crop to the original image:

Final cropped version

As you can see, we’ve now managed to accurately crop out the text from the image, helping to significantly reduce the work of the OCR engine.

My final modified version of Dan’s code can be found here: https://github.com/mbennett-uoe/sp-experiments/blob/master/sp_crop.py

In my next blog post, I’ll start to look at some OCR approaches, and also go through some of the outliers and problem images and how I plan to tackle them.

Comments and questions are more than welcome 🙂

Mike Bennett – Digital Scholarship Developer

 

IIIF Technical Workshop and Showcase March 2017

Improving Access to Image Collections

On 16th and 17th March, the University of Edinburgh and the National Library of Scotland will be hosting two International Image Interoperability Framework (IIIF) events.

IIIF Showcase

The IIIF Showcase brings together developers and early adopters to explain the background and value of IIIF, its growing community, and the potential of the Framework and the innovative ways in which it can be used to present digital image collections. There will be presentations from Edinburgh University Library, National Library of Scotland, National Library of Wales, Durham University, University College Dublin, The Bodleian Library, Digirati, Cogapp and others.

Logistics

  • Registration: Registration is free but capacity is limited.
  • Date: Friday, March 17, 2017
  • Location: National Library of Scotland (NLS) Boardroom on George IV Bridge (see map)
  • Audience: Individuals and institutional representatives interested in learning more about IIIF
  • Code of Conduct: The IIIF Code of Conduct applies to all IIIF events and related activities.
  • Social Media: Tweets about the event should use #iiif and @iiif_io.

IIIF Technical Workshop

The IIIF Technical Workshop unconference, hosted by the University of Edinburgh at Argyle House, will bring together colleagues who have implemented IIIF services, are developing the Framework and associated tools, or are working on community initiatives. The workshop will provide opportunities to discuss implementations, issues, initiatives and developments, as well as the forthcoming annual IIIF conference in June.

Logistics

  • Registration: Registration is free but capacity is limited.
  • Date: Thursday, March 16, 2017
  • Location: University of Edinburgh Argyle House (see map)
  • Audience: Developers already working with IIIF or considering an implementation
  • Code of Conduct: The IIIF Code of Conduct applies to all IIIF events and related activities.
  • Social Media: Tweets about the event should use #iiif and @iiif_io.