{"id":3025,"date":"2023-08-30T13:35:48","date_gmt":"2023-08-30T13:35:48","guid":{"rendered":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/?p=3025"},"modified":"2023-08-30T13:35:48","modified_gmt":"2023-08-30T13:35:48","slug":"image-to-text-exploring-text-extraction-processes-in-luc","status":"publish","type":"post","link":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/2023\/08\/30\/image-to-text-exploring-text-extraction-processes-in-luc\/","title":{"rendered":"Image to Text: Exploring Text Extraction Processes in L&amp;UC"},"content":{"rendered":"<p><span data-contrast=\"auto\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-3035\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/files\/2023\/08\/0018585c.jpg\" alt=\"Row of books, bound in old, cracked brown leather resting on a black background with spines facing outwards.\" width=\"1726\" height=\"1415\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/files\/2023\/08\/0018585c.jpg 1726w, https:\/\/libraryblogs.is.ed.ac.uk\/diu\/files\/2023\/08\/0018585c-500x410.jpg 500w, https:\/\/libraryblogs.is.ed.ac.uk\/diu\/files\/2023\/08\/0018585c-1024x839.jpg 1024w, https:\/\/libraryblogs.is.ed.ac.uk\/diu\/files\/2023\/08\/0018585c-768x630.jpg 768w, https:\/\/libraryblogs.is.ed.ac.uk\/diu\/files\/2023\/08\/0018585c-1536x1259.jpg 1536w\" sizes=\"(max-width: 1726px) 100vw, 1726px\" \/>Since April I have been an intern with the University of Edinburgh\u2019s Cultural Heritage Digitisation Service (CHDS) and the Centre for Data, Culture and Society (CDCS), looking into text extraction processes at the University, both in library practice and thinking about how this is taught within digital scholarship. Throughout the internship I have had the opportunity to do both independent research and discussions with staff across the Library and University Collections (L&amp;UC) to get a more in-depth understanding of text recognition processes. 
<\/span><span data-ccp-props=\"{&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><!--more--><\/p>\n<p><span data-contrast=\"auto\">Text extraction is commonly done through a process called Optical Character Recognition (OCR), in which software analyses an image of text, recognises the writing and creates a text output. In older technology this was done on a character-by-character level, but many OCR engines now incorporate deep learning and natural language processing models for more advanced text recognition. This is valuable for cultural heritage as it provides wider access to items in the collection. The generated text can be presented in a range of ways, such as searchable PDFs, downloadable text datasets, or side-by-side displays of document images and their text. OCR technology has been widely used by libraries and archives for the past twenty years, and it has been developing to produce higher quality text with fewer errors. OCR has reasonably good success rates with good-quality printed texts because print is standardised. Handwriting lacks this standardisation, so new technologies have been developed to target handwritten text recognition. 
For more general information on text extraction processes, see the section on <\/span><a href=\"https:\/\/www.cdcs.ed.ac.uk\/training\/training-pathways\/managing-digitised-documents-pathway\/text-extraction-preparation\"><span data-contrast=\"none\">text extraction<\/span><\/a><span data-contrast=\"auto\"> in the CDCS Training Pathway \u2018Managing Digitised Documents\u2019.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">In the past, L&amp;UC has used various software and approaches for different projects, depending on the materials and resources available. My task has been to evaluate these approaches, look at what other academic and research libraries have been using, and make recommendations for the best options moving forward. Many printed books are available as searchable PDFs through the Library and University Collections\u2019 <\/span><a href=\"https:\/\/openbooks.is.ed.ac.uk\/\"><span data-contrast=\"none\">Open Books<\/span><\/a><span data-contrast=\"auto\"> site, which relies on a text underlay to each PDF created through OCR with <\/span><a href=\"https:\/\/pdf.abbyy.com\/finereader-pdf\/?utm_term=abbyy%20finereader&amp;utm_campaign=(FR)+UK+-+FineReader+Search+-+Brand&amp;utm_source=adwords&amp;utm_medium=ppc&amp;hsa_acc=5771177125&amp;hsa_cam=86711114&amp;hsa_grp=4111648034&amp;hsa_ad=379585595230&amp;hsa_src=g&amp;hsa_tgt=kwd-14557921&amp;hsa_kw=abbyy%20finereader&amp;hsa_mt=b&amp;hsa_net=adwords&amp;hsa_ver=3&amp;gclid=CjwKCAjwzo2mBhAUEiwAf7wjkrLMpleuz8QzUnwk8Wc4L7deK61VyJ06CwIL_l4zrhEMhcDz-g72GBoCQZcQAvD_BwE\"><span data-contrast=\"none\">ABBYY FineReader<\/span><\/a><span data-contrast=\"auto\">. 
ABBYY FineReader is commonly used in cultural heritage institutions as its OCR is easily automated within the digitisation process and it provides a reasonably good level of text recognition, although there are still errors as OCR is never 100% accurate.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">My research identified ABBYY FineReader as one of the more reliable automated paid services, able to process large volumes of text within the digitisation process, in several languages, and with a reasonable degree of accuracy on some materials. However, there are other viable options, including the open-source OCR engine <\/span><a href=\"https:\/\/github.com\/tesseract-ocr\"><span data-contrast=\"none\">Tesseract<\/span><\/a><span data-contrast=\"auto\">. Tesseract requires a more hands-on approach and a better understanding of programming than automated options, but it can provide greater flexibility and has been found to perform well on a range of materials (Olson &amp; Berry, 2021). <\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><span data-contrast=\"auto\">Resources such as staff hours, skill levels and training, as well as departmental and project budgets, are key factors in determining which text extraction software should be used. 
How the text output will be stored and presented by L&amp;UC is also an important consideration, as we need to be able to support the use of the text documents created, whether this is through searchable PDFs, text and image presentation alongside each other, or other options. <\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Due to the variety of materials and projects that CHDS covers, there is currently no one-size-fits-all option that will cover all bases for text recognition. However, the technology is developing all the time, with more options available to cover even more languages, easier adaptation for tricky-to-scan materials and improvements in text quality. I hope the work done during my internship on OCR approaches in L&amp;UC and the wider sector will be useful in directing how we approach text extraction and the presentation of our text materials, especially as we make the move to our new digital collections platform. <\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">As part of the internship, I have designed a workshop and learning resource on hands-on text extraction for the <\/span><a href=\"https:\/\/www.cdcs.ed.ac.uk\/\"><span data-contrast=\"none\">CDCS<\/span><\/a><span data-contrast=\"auto\">. 
More details for the workshop to follow via their <\/span><a href=\"https:\/\/www.cdcs.ed.ac.uk\/events\"><span data-contrast=\"none\">events<\/span><\/a><span data-contrast=\"auto\"> page as we move into the 2023\/24 academic year.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">This internship with CHDS and CDCS has given me an opportunity to work with two wonderful teams across the university. It has been highly valuable in gaining industry experience with the Library and University Collections and investigating current trends in text extraction technology use across the cultural heritage sector more broadly. My work with CDCS in creating teaching and training resources in the form of workshops and asynchronous learning resources has also been highly valuable and I hope to bring these research and critical pedagogy skills to my personal research and wider teaching at the university. 
<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><i><span data-contrast=\"auto\">By Ash Charlton, Optical Character Recognition Intern and second-year PhD student based in History, Classics and Archaeology.<\/span><\/i><span data-ccp-props=\"{&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">A further blog from Ash about this internship can be found here<a href=\"https:\/\/blogs.ed.ac.uk\/cdcs\/2023\/08\/14\/from-image-to-text-experience-of-an-optical-character-recognition-intern\/\"> https:\/\/blogs.ed.ac.uk\/cdcs\/2023\/08\/14\/from-image-to-text-experience-of-an-optical-character-recognition-intern\/\u00a0\u00a0<\/a><\/span><\/p>\n<p><span data-contrast=\"auto\">References<\/span><span data-ccp-props=\"{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335551550&quot;:1,&quot;335551620&quot;:1,&quot;335559685&quot;:0,&quot;335559737&quot;:0,&quot;335559738&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">Centre for Data, Culture &amp; Society. 
\u2018Managing Digitised Documents Training Pathway\u2019. (2022), <\/span><a href=\"https:\/\/www.cdcs.ed.ac.uk\/training\/training-pathways\/managing-digitised-documents-pathway.\"><span data-contrast=\"none\">https:\/\/www.cdcs.ed.ac.uk\/training\/training-pathways\/managing-digitised-documents-pathway.<\/span><\/a><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559685&quot;:720,&quot;335559739&quot;:0,&quot;335559740&quot;:240,&quot;335559991&quot;:720}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">Olson, Leanne, and Veronica Berry. \u2018Digitization Decisions: Comparing OCR Software for Librarian and Archivist Use\u2019. <\/span><i><span data-contrast=\"none\">The Code4Lib Journal<\/span><\/i><span data-contrast=\"none\">, no. 52 (22 September 2021), <\/span><a href=\"https:\/\/journal.code4lib.org\/articles\/16132\"><span data-contrast=\"none\">https:\/\/journal.code4lib.org\/articles\/16132<\/span><\/a><span data-contrast=\"none\">.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559685&quot;:720,&quot;335559739&quot;:0,&quot;335559740&quot;:240,&quot;335559991&quot;:720}\">\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since April I have been an intern with the University of Edinburgh\u2019s Cultural Heritage Digitisation Service (CHDS) and the Centre for Data, Culture and Society (CDCS), looking into text extraction<\/p>\n<div class=\"more-link-wrapper\"><a class=\"more-link\" href=\"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/2023\/08\/30\/image-to-text-exploring-text-extraction-processes-in-luc\/\">Read More<span class=\"screen-reader-text\">Image to Text: Exploring Text Extraction Processes in 
L&amp;UC<\/span><\/a><\/div>\n","protected":false},"author":24,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false}}},"categories":[133,87],"tags":[173,103,138,191,192],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/posts\/3025"}],"collection":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/users\/24"}],"replies":[{"embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/comments?post=3025"}],"version-history":[{"count":8,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/posts\/3025\/revisions"}],"predecessor-version":[{"id":3037,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/posts\/3025\/revisions\/3037"}],"wp:attachment":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/media?parent=3025"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/categories?post=3025"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/diu\/wp-json\/wp\/v2\/tags?post=3025"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}