February 2020 Journal Club: AI and Metadata

Photo credit Gavin Tracy, free to use at https://www.pexels.com/photo/artificial-intelligence-cgi-computers-low-poly-1162566/

What do cataloguing and cars have in common? One answer is that developments in artificial intelligence are changing both of them. Our February Journal Club discussed three recent articles about metadata and AI, and what these meant for us.

Our first article, from The Economist (1) in 2019, looked at self-driving cars, an AI development which is highly visible in the news. We talked about the differences between automation and AI. The ordinary cars we drive are already automated and reliant on technology, but they are not making decisions about where to drive. This is a key issue: people don’t like the idea of machines making decisions. People also accept that humans are fallible but feel that machines shouldn’t be, so they hold machines to a higher standard than they hold other people. Until people’s expectations and perceptions of AI change, its impact will be limited. For example, the evidence is that Google's self-driving cars are very safe, but self-driving cars still have safety drivers because human legislators make it a requirement.

Our second article, from Elucidate (2), proposed that AI was already a game changer in digital academic publishing, content enrichment and knowledge identification. It described a project that uses a mathematical algorithm to analyse a corpus of texts, identifying descriptive “significant phrases” in order to create clusters of concepts and identify semantic relationships. That is, the algorithm does the task previously undertaken by humans creating classification schemes, vocabularies, taxonomies and ontologies. Is a project like this geared towards removing humans from the equation? And what about people’s expectations here: would articles analysed in this way still be considered to have gone through quality peer review?

We discussed how AI’s capability to learn all the time has advantages and disadvantages. Good AI uses every new bit of data to improve how things work. However, existing data can (and will) contain biases, and machines will learn those biases. When Amazon used AI for recruitment and selection processes, machine learning specialists uncovered problems because the machines had learned how to be sexist and racist.

What’s in a name? Our third article, by Ackermann and Reitz (3), was about homonym detection for authors with the same name. Cataloguers are familiar with the problems of name authority control, including the synonym problem, where people don’t consistently refer to themselves in the same way. Article-level metadata is even more problematic because it doesn’t get the same level of bibliometric analysis as printed books do.

This article looked at a machine learning approach to this problem, trying to resolve records which didn’t have a "golden" author identifier. Algorithms can match points of similarity between authors with the same name, e.g. this John Smith and this John Smith have twelve co-authors that overlap in the last year. We were impressed with the approach they took to solve the problem, which aimed to reduce the amount of human intervention required without removing it entirely. It’s important to distinguish problems that AI can solve from problems where AI can do the donkey work to help a human solve them. The algorithm also used key words in titles and the journals in which people published to improve matching accuracy. The project had a 74% success rate for uncovering people with similar author profiles in the database.
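To make the idea of "match points" concrete, here is a minimal Python sketch. It is not the method from the paper: the profile fields (`coauthors`, `keywords`, `venues`), the equal weighting and the two thresholds are all assumptions chosen for illustration. It scores how much two same-name profiles overlap and sends the uncertain middle ground to a human.

```python
def jaccard(a, b):
    """Share of items two collections have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(profile_a, profile_b):
    """Average overlap across co-authors, title key words and journals.

    The three field names and the equal weighting are hypothetical.
    """
    fields = ("coauthors", "keywords", "venues")
    return sum(jaccard(profile_a[f], profile_b[f]) for f in fields) / len(fields)


def triage(profile_a, profile_b, same=0.6, different=0.1):
    """Sort a pair of same-name profiles into one of three outcomes.

    The thresholds are illustrative guesses, not values from the paper.
    """
    score = similarity(profile_a, profile_b)
    if score >= same:
        return "probably the same person"
    if score <= different:
        return "probably different people"
    return "needs human review"


# Two hypothetical records for "John Smith"
smith_1 = {
    "coauthors": {"A. Jones", "B. Lee", "C. Patel"},
    "keywords": {"metadata", "cataloguing", "authority"},
    "venues": {"Journal of Documentation"},
}
smith_2 = {
    "coauthors": {"A. Jones", "B. Lee", "D. Novak"},
    "keywords": {"metadata", "linked", "data"},
    "venues": {"Journal of Documentation"},
}

print(triage(smith_1, smith_2))
```

The middle band is the point of the design: the algorithm does the donkey work on the clear cases and leaves the ambiguous ones for a person.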

There were limitations: people who have fewer than two publications or co-authors did not provide enough data for the algorithm. The algorithm needs enough data points to make a statistically significant calculation. There was a bias towards people who have a greater footprint in the publishing world, but then again these are the people for whom funding decisions may be made. The paper didn’t mention existing identification schemes like ORCID, which was surprising. We thought it was strange that time period wasn’t used as a search parameter, as there could be people with the same name publishing two decades apart. It’s important to remember, though, that even data produced by human cataloguers is not infallible, because human beings make mistakes too. Errors are corrected all the time.
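The two checks discussed here, a minimum amount of data and a time-period comparison, could be sketched as simple pre-filters. This illustrates the journal club's suggestion and is not something the paper does. The field names, the reading of "fewer than two" as applying to both publications and co-authors, and the 20-year cut-off are assumptions.

```python
def has_enough_data(profile, min_items=2):
    """True if a profile has at least min_items publications and co-authors."""
    return (len(profile["publications"]) >= min_items
            and len(profile["coauthors"]) >= min_items)


def year_gap(years_a, years_b):
    """Years separating two publishing periods; 0 if the periods overlap."""
    later_start = max(min(years_a), min(years_b))
    earlier_end = min(max(years_a), max(years_b))
    return max(0, later_start - earlier_end)


def plausibly_same_era(years_a, years_b, max_gap=20):
    """Hypothetical rule: a gap over max_gap years suggests different people."""
    return year_gap(years_a, years_b) <= max_gap


# Two hypothetical same-name authors, active in different decades
print(year_gap([1995, 1998], [2019, 2020]))
print(plausibly_same_era([1995, 1998], [2019, 2020]))
```

A long gap is only a hint, since careers can be long or interrupted, so a check like this would be better used to flag records for review than to split them automatically.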

We talked about using this approach with our collection of digitised PhD theses, using an algorithm to discover if the library has other holdings by thesis authors. When you’re cataloguing books you can create a lot of information about people, like a mini Wikipedia page, to create match points with bibliographic metadata.

Finally, it was suggested that Google has a unique data identifier for everybody. But can they link it back to what people have published?

(1) Traffic, jammed. (2019, Oct 12). The Economist, 433, 16-16,18.
(2) Horrocks, G. (2019) ‘The impact of artificial intelligence: brave new world?’, Elucidate, 16, pp. 4-9.
(3) Ackermann, M.R. & Reitz, F., 2018. Homonym Detection in Curated Bibliographies: Learning from dblp’s Experience. Available at: arXiv:1806.06017 [cs.DL]

cloverodgers
