{"id":321,"date":"2017-06-23T15:18:19","date_gmt":"2017-06-23T15:18:19","guid":{"rendered":"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/?p=321"},"modified":"2017-06-26T16:13:18","modified_gmt":"2017-06-26T16:13:18","slug":"automated-item-data-extraction-from-old-manuscripts","status":"publish","type":"post","link":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/2017\/06\/23\/automated-item-data-extraction-from-old-manuscripts\/","title":{"rendered":"Automated item data extraction from old documents"},"content":{"rendered":"<h2>Overview<\/h2>\n<h3>The Problem<\/h3>\n<p>We have a collection of historic papers from the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Court_of_Session\">Scottish Court of Session<\/a>. These are collected into cases and bound together in large volumes, with no catalogue or item data other than a shelfmark. If you wish to find a particular case within the collection, you are restricted to a manual, physical search of likely volumes (if you&#8217;re lucky you might get an index at the start!).<\/p>\n<figure id=\"attachment_325\" aria-describedby=\"caption-attachment-325\" style=\"width: 604px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-325 size-large\" src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/SignetLib-1024x768.jpg\" alt=\"Volumes of Session Papers in the Signet Library, Edinburgh\" width=\"604\" height=\"453\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/SignetLib-1024x768.jpg 1024w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/SignetLib-300x225.jpg 300w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/SignetLib-768x576.jpg 768w\" sizes=\"(max-width: 604px) 100vw, 604px\" \/><figcaption id=\"caption-attachment-325\" class=\"wp-caption-text\">Volumes of Session Papers in the Signet Library, Edinburgh<\/figcaption><\/figure>\n<h3>The Aim<\/h3>\n<p>I am hoping to use computer vision techniques, 
OCR, and intelligent text analysis to automatically extract and parse case-level data in order to create an indexed, searchable digital resource for these items. The <a href=\"http:\/\/libraryblogs.is.ed.ac.uk\/diu\/\">Digital Imaging Unit<\/a> have <a href=\"http:\/\/libraryblogs.is.ed.ac.uk\/diu\/2017\/03\/03\/scottish-court-of-session-papers-digitisation-pilot\/#more-1729\">digitised a small selection of the papers<\/a>, which we will use as a pilot to assess the viability of the above aim.<\/p>\n<h2>Stage One &#8211; Image preparation<\/h2>\n<h3>Using Python and OpenCV to extract text blocks<\/h3>\n<p>I am indebted to <a href=\"http:\/\/www.danvk.org\">Dan Vanderkam<\/a>&#8217;s work in this area, especially his blog post <a href=\"http:\/\/www.danvk.org\/2015\/01\/07\/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html\">&#8216;Finding blocks of text in an image using Python, OpenCV and numpy&#8217;<\/a>, upon which this work is largely based.<\/p>\n<p>The items in the Scottish Session Papers collection differ from the images that Dan was processing: they are older works, printed with a letterpress rather than typewritten.<\/p>\n<p>The Session Papers images lack the delineating border, backing paper, and other features that eased Dan&#8217;s image processing. In addition, the amount, density, and layout of text vary enormously across the corpus, further complicating the task.<\/p>\n<p>The initial task is to find a crop of the image to pass to the OCR engine. 
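As a concrete illustration (the array sizes here are made up, and this is a sketch rather than the project's actual code), cropping in NumPy is just array slicing; a fixed 50-pixel trim from each side edge, of the kind used later to discard page overspill, looks like this:

```python
import numpy as np

# A stand-in for a scanned page: rows x cols x BGR channels.
# (Sizes are illustrative, not the real scan resolution.)
page = np.zeros((1200, 900, 3), dtype=np.uint8)

# Cropping is just array slicing: trim 50 px from the left and
# right edges of the image.
trimmed = page[:, 50:-50]

print(trimmed.shape)  # (1200, 800, 3)
```

OpenCV loads images as NumPy arrays, so the same slicing applies directly to the output of cv2.imread.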
We want to give it as much text as possible in as few pixels as possible!<\/p>\n<div id='gallery-1' class='gallery galleryid-321 gallery-columns-3 gallery-size-medium'><figure class='gallery-item'>\n\t\t\t<div class='gallery-icon portrait'>\n\t\t\t\t<a href='https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/2017\/06\/23\/automated-item-data-extraction-from-old-manuscripts\/pages3\/'><img width=\"234\" height=\"300\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/pages3-234x300.jpg\" class=\"attachment-medium size-medium\" alt=\"Example pages from the collection\" decoding=\"async\" loading=\"lazy\" aria-describedby=\"gallery-1-328\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/pages3-234x300.jpg 234w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/pages3.jpg 359w\" sizes=\"(max-width: 234px) 100vw, 234px\" \/><\/a>\n\t\t\t<\/div>\n\t\t\t\t<figcaption class='wp-caption-text gallery-caption' id='gallery-1-328'>\n\t\t\t\tExample pages from the collection\n\t\t\t\t<\/figcaption><\/figure><figure class='gallery-item'>\n\t\t\t<div class='gallery-icon portrait'>\n\t\t\t\t<a href='https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/2017\/06\/23\/automated-item-data-extraction-from-old-manuscripts\/pages2\/'><img width=\"230\" height=\"300\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/pages2-230x300.jpg\" class=\"attachment-medium size-medium\" alt=\"Example pages from the collection\" decoding=\"async\" loading=\"lazy\" aria-describedby=\"gallery-1-327\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/pages2-230x300.jpg 230w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/pages2.jpg 354w\" sizes=\"(max-width: 230px) 100vw, 230px\" \/><\/a>\n\t\t\t<\/div>\n\t\t\t\t<figcaption class='wp-caption-text gallery-caption' id='gallery-1-327'>\n\t\t\t\tExample pages from the collection\n\t\t\t\t<\/figcaption><\/figure><figure 
class='gallery-item'>\n\t\t\t<div class='gallery-icon portrait'>\n\t\t\t\t<a href='https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/2017\/06\/23\/automated-item-data-extraction-from-old-manuscripts\/pages1\/'><img width=\"227\" height=\"300\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/pages1-227x300.jpg\" class=\"attachment-medium size-medium\" alt=\"Example pages from the collection\" decoding=\"async\" loading=\"lazy\" aria-describedby=\"gallery-1-326\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/pages1-227x300.jpg 227w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/pages1.jpg 349w\" sizes=\"(max-width: 227px) 100vw, 227px\" \/><\/a>\n\t\t\t<\/div>\n\t\t\t\t<figcaption class='wp-caption-text gallery-caption' id='gallery-1-326'>\n\t\t\t\tExample pages from the collection\n\t\t\t\t<\/figcaption><\/figure>\n\t\t<\/div>\n\n<p>Due to the nature of the images, there is often a small amount of text from the opposite page visible (<a href=\"http:\/\/libraryblogs.is.ed.ac.uk\/diu\/2017\/03\/03\/scottish-court-of-session-papers-digitisation-pilot\">John&#8217;s blog explains why<\/a>) and so to save some hassle later, we&#8217;re going to start by cropping 50px from each horizontal side of the image, hopefully eliminating these bits of page overspill.<\/p>\n<figure id=\"attachment_329\" aria-describedby=\"caption-attachment-329\" style=\"width: 604px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-329 size-large\" src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/cropped-709x1024.png\" alt=\"A cropped version of the page\" width=\"604\" height=\"872\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/cropped-709x1024.png 709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/cropped-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/cropped-768x1110.png 
768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/cropped.png 1063w\" sizes=\"(max-width: 604px) 100vw, 604px\" \/><figcaption id=\"caption-attachment-329\" class=\"wp-caption-text\">A cropped version of the page<\/figcaption><\/figure>\n<p>Now that we have the base image to work on, I&#8217;ve started with the simple steps of converting it to grayscale, and then applying an inverted binary threshold, turning everything above ~75% gray to white, and everything else to black. The inversion is to ease visual understanding of the process. You can view full size versions by clicking each image.<\/p>\n<figure id=\"attachment_331\" aria-describedby=\"caption-attachment-331\" style=\"width: 208px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/gray-1.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-331 size-medium\" src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/gray-1-208x300.png\" alt=\"A grayscale version of the page\" width=\"208\" height=\"300\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/gray-1-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/gray-1-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/gray-1-709x1024.png 709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/gray-1.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a><figcaption id=\"caption-attachment-331\" class=\"wp-caption-text\">Grayscale<\/figcaption><\/figure>\n<figure id=\"attachment_332\" aria-describedby=\"caption-attachment-332\" style=\"width: 208px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin180.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-332 size-medium\" 
src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin180-208x300.png\" alt=\"75% Threshold\" width=\"208\" height=\"300\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin180-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin180-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin180-709x1024.png 709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin180.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a><figcaption id=\"caption-attachment-332\" class=\"wp-caption-text\">75% Threshold<\/figcaption><\/figure>\n<p>The ideal outcome is that we eliminate smudges and speckles, leaving only the clear printed letters. This entailed some experimenting with the threshold level; as you can see in the image above, a lot of speckling remains. Dropping the threshold to only leave pixels above ~60% gray was a large improvement, and to ~45% even more so:<\/p>\n<figure id=\"attachment_334\" aria-describedby=\"caption-attachment-334\" style=\"width: 208px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin150.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-334 size-medium\" src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin150-208x300.png\" alt=\"60% Threshold\" width=\"208\" height=\"300\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin150-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin150-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin150-709x1024.png 709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin150.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a><figcaption id=\"caption-attachment-334\" class=\"wp-caption-text\">60% 
Threshold<\/figcaption><\/figure>\n<figure id=\"attachment_333\" aria-describedby=\"caption-attachment-333\" style=\"width: 208px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin120.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-333 size-medium\" src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin120-208x300.png\" alt=\"45% Threshold\" width=\"208\" height=\"300\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin120-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin120-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin120-709x1024.png 709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bin120.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a><figcaption id=\"caption-attachment-333\" class=\"wp-caption-text\">45% Threshold<\/figcaption><\/figure>\n<p>At a threshold of 45%, some of the letters are also beginning to fade, but this should not be an issue, as we have successfully eliminated almost all the noise, which was the aim here.<\/p>\n<p>We&#8217;re still left with a large block at the top, which was the black backing behind the edge of the original image. 
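For reference, the grayscale-and-threshold step can be sketched in a few lines. This is a minimal sketch using a synthetic NumPy array so it runs standalone; in OpenCV the equivalent call is cv2.threshold with the THRESH_BINARY_INV flag, and ~45% of the 0&#8211;255 range corresponds to a cutoff of roughly 115:

```python
import numpy as np

# Synthetic grayscale page: light paper (230) with a dark "ink" patch (30).
gray = np.full((100, 100), 230, dtype=np.uint8)
gray[40:60, 40:60] = 30

# Inverted binary threshold: pixels at or below the cutoff become white
# (foreground/text), everything lighter becomes black (background).
# Equivalent to cv2.threshold(gray, 115, 255, cv2.THRESH_BINARY_INV).
cutoff = 115
binary = np.where(gray <= cutoff, 255, 0).astype(np.uint8)

print(binary[50, 50], binary[0, 0])  # 255 0
```

Lowering the cutoff keeps only the darkest pixels, which is why the speckling disappears as the threshold drops.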
To eliminate this black block, I experimented with several approaches:<\/p>\n<ul>\n<li>Also cropping 50px from the top and bottom of the images &#8211; unfortunately this caused too much &#8220;collateral damage&#8221;, as many of the images have text within this region.<\/li>\n<li>Dynamic cropping based on removing any segments touching the top and bottom of the image &#8211; this was a more effective approach, but the logic for determining the crop became a bit convoluted.<\/li>\n<li>Using Dan&#8217;s technique of applying Canny edge detection and then using a rank filter to remove ~1px edges &#8211; this was the most successful approach, although it still had some issues when the text had a non-standard layout.<\/li>\n<\/ul>\n<p>I settled on the Canny\/Rank filter approach to produce these results:<\/p>\n<figure id=\"attachment_337\" aria-describedby=\"caption-attachment-337\" style=\"width: 208px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/canny.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-337 size-medium\" src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/canny-208x300.png\" alt=\"Result of Canny edge finder\" width=\"208\" height=\"300\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/canny-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/canny-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/canny-709x1024.png 709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/canny.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a><figcaption id=\"caption-attachment-337\" class=\"wp-caption-text\">Result of Canny edge finder<\/figcaption><\/figure>\n<figure id=\"attachment_343\" aria-describedby=\"caption-attachment-343\" style=\"width: 208px\" class=\"wp-caption alignnone\"><a 
href=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/rank-1.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-343 size-medium\" src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/rank-1-208x300.png\" alt=\"With rank filter\" width=\"208\" height=\"300\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/rank-1-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/rank-1-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/rank-1-709x1024.png 709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/rank-1.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a><figcaption id=\"caption-attachment-343\" class=\"wp-caption-text\">With rank filter<\/figcaption><\/figure>\n<p>Next up, we want to find a set of masks that covers the remaining white pixels on the page. This is achieved by repeatedly dilating the image, until only a few connected components remain:<\/p>\n<div id='gallery-2' class='gallery galleryid-321 gallery-columns-3 gallery-size-medium'><figure class='gallery-item'>\n\t\t\t<div class='gallery-icon portrait'>\n\t\t\t\t<a href='https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/2017\/06\/23\/automated-item-data-extraction-from-old-manuscripts\/dil1\/'><img width=\"208\" height=\"300\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil1-208x300.png\" class=\"attachment-medium size-medium\" alt=\"First round of dilation\" decoding=\"async\" loading=\"lazy\" aria-describedby=\"gallery-2-338\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil1-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil1-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil1-709x1024.png 709w, 
https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil1.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a>\n\t\t\t<\/div>\n\t\t\t\t<figcaption class='wp-caption-text gallery-caption' id='gallery-2-338'>\n\t\t\t\tFirst round of dilation\n\t\t\t\t<\/figcaption><\/figure><figure class='gallery-item'>\n\t\t\t<div class='gallery-icon portrait'>\n\t\t\t\t<a href='https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/2017\/06\/23\/automated-item-data-extraction-from-old-manuscripts\/dil2\/'><img width=\"208\" height=\"300\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil2-208x300.png\" class=\"attachment-medium size-medium\" alt=\"Fifth round of dilation\" decoding=\"async\" loading=\"lazy\" aria-describedby=\"gallery-2-339\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil2-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil2-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil2-709x1024.png 709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil2.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a>\n\t\t\t<\/div>\n\t\t\t\t<figcaption class='wp-caption-text gallery-caption' id='gallery-2-339'>\n\t\t\t\tFifth round of dilation\n\t\t\t\t<\/figcaption><\/figure><figure class='gallery-item'>\n\t\t\t<div class='gallery-icon portrait'>\n\t\t\t\t<a href='https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/2017\/06\/23\/automated-item-data-extraction-from-old-manuscripts\/dil3\/'><img width=\"208\" height=\"300\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil3-208x300.png\" class=\"attachment-medium size-medium\" alt=\"Twelfth round of dilation\" decoding=\"async\" loading=\"lazy\" aria-describedby=\"gallery-2-340\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil3-208x300.png 208w, 
https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil3-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil3-709x1024.png 709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/dil3.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a>\n\t\t\t<\/div>\n\t\t\t\t<figcaption class='wp-caption-text gallery-caption' id='gallery-2-340'>\n\t\t\t\tTwelfth round of dilation\n\t\t\t\t<\/figcaption><\/figure>\n\t\t<\/div>\n\n<p>You can see here that the &#8220;faded&#8221; letters from the thresholding above have enough presence to be captured by the dilation process. These white blocks now give us a pretty good record of where the text is on the page, so we now move on to cropping the image.<\/p>\n<p><a href=\"http:\/\/www.danvk.org\/2015\/01\/07\/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html\">Dan&#8217;s blog<\/a> has a good explanation of solving the Subset Sum problem for a dilated image, so I will apply his technique (start with the largest white block, and add further blocks only when the extra white pixels they cover justify the increase in total crop area, with some tweaking to the exact ratio):<\/p>\n<figure id=\"attachment_344\" aria-describedby=\"caption-attachment-344\" style=\"width: 208px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bounding2-1.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-344 size-medium\" src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bounding2-1-208x300.png\" alt=\"With final bounding\" width=\"208\" height=\"300\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bounding2-1-208x300.png 208w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bounding2-1-768x1110.png 768w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bounding2-1-709x1024.png 
709w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/bounding2-1.png 1063w\" sizes=\"(max-width: 208px) 100vw, 208px\" \/><\/a><figcaption id=\"caption-attachment-344\" class=\"wp-caption-text\">With final bounding<\/figcaption><\/figure>\n<p>So finally, we apply this crop to the original image:<\/p>\n<figure id=\"attachment_345\" aria-describedby=\"caption-attachment-345\" style=\"width: 771px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-345\" src=\"http:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/0133011c.crop_.png\" alt=\"Final cropped version\" width=\"771\" height=\"592\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/0133011c.crop_.png 771w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/0133011c.crop_-300x230.png 300w, https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/files\/2017\/06\/0133011c.crop_-768x590.png 768w\" sizes=\"(max-width: 771px) 100vw, 771px\" \/><figcaption id=\"caption-attachment-345\" class=\"wp-caption-text\">Final cropped version<\/figcaption><\/figure>\n<p>As you can see, we&#8217;ve now managed to accurately crop out the text from the image, helping to significantly reduce the work of the OCR engine.<\/p>\n<p>My final modified version of <a href=\"https:\/\/github.com\/danvk\/oldnyc\/blob\/master\/ocr\/tess\/crop_morphology.py\">Dan&#8217;s code<\/a> can be found here: <a href=\"https:\/\/github.com\/mbennett-uoe\/sp-experiments\/blob\/master\/sp_crop.py\">https:\/\/github.com\/mbennett-uoe\/sp-experiments\/blob\/master\/sp_crop.py<\/a><\/p>\n<p>In my next blog post, I&#8217;ll start to look at some OCR approaches, and also go through some of the outliers and problem images and how I plan to tackle them.<\/p>\n<p>Comments and questions are more than welcome \ud83d\ude42<\/p>\n<p>Mike Bennett &#8211; Digital Scholarship 
Developer<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Overview The Problem We have a collection of historic papers from the Scottish Court of Session. These are collected into cases and bound together in large volumes, with no catalogue or item data other than a shelfmark. If you wish to find a particular case within the collection, you are restricted to a manual, physical &hellip; <a href=\"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/2017\/06\/23\/automated-item-data-extraction-from-old-manuscripts\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Automated item data extraction from old documents<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":137,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[52,27,53,22,51,54,9],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp\/v2\/posts\/321"}],"collection":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp\/v2\/users\/137"}],"replies":[{"embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp\/v2\/comments?post=321"}],"version-history":[{"count":15,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp\/v2\/posts\/321\/revisions"}],"predecessor-version":[{"id":358,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp\/v2\/posts\/321\/revisions\/358"}],"wp:attachment":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp\/v2\/media?parent=321"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp
\/v2\/categories?post=321"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/librarylabs\/wp-json\/wp\/v2\/tags?post=321"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}