I was lucky enough to have a paper accepted to Csv,conf,2* in Berlin on the 3rd-4th of May. Giving the talk was great in itself, but it also got me through the door to see loads of great things going on in data and its surrounding technology. Yes, there was heavy mention of CSV and spreadsheets; in fact, at times it was akin to an AA meeting, with people guiltily admitting their love of Excel. This left me feeling, quite worryingly, vindicated in a lot of the things I do.
As is always the case with any conference review blogpost, it’s not viable to list every link or ruminate on the message of every talk, so I’ll just home in on a few highlights. The talks (available in slide or video form) are appearing over at Lanyrd.com, and they’ll give a lot more depth to what was spoken about.
My own reason for being there, as far as my talk was concerned, was to look at better ways of processing workflow, enriching data, and improving engagement with our collections. Afterwards, I had some interesting conversations. I was alerted to the tool NeuralTalk2 by Maciej Gryka of rainforestqa.com, a company that specialises in cleaning data using Mechanical Turk workers and crowdsourced test cases. NeuralTalk2, though, is a captioning tool, which attempts to recognise visually what your image is “of”. I’m sure it fails as much as it succeeds, but, as I pointed out, we’ve not really used this kind of technology to enhance our metadata, so there would be no harm in running some of our images through it and seeing what it comes up with. Another chat, with a lady from UC Santa Cruz, made it clear that we are quite liberal in our approach to crowdsourced data: where we have generally decided it’s fine to surface as long as it is properly marked as such, they are proceeding rather slowly, due to a particularly strict metadata librarian.
The keynotes were deliberately chosen to cover a range of disciplines that might be new to most people at this highly eclectic conference. As a result, there were interesting talks on technology and activism (including visualisations of the Ebola crisis and of police brutality); on ethics in technology, and on workflows for giving consent without clicking through unreadable terms and conditions (do you know what SmartBins are taking from you as you pass?); on dealing with messy spreadsheets (the Enron corpus showed just how badly that institution managed them); and on open data in neuroscience (a lot of mouse brains in action).
Some other tools that we could be looking at:
Zegami – great for exploring large banks of images and spotting patterns across them. Can it work with IIIF, I wonder?
OpenRefine – a tool that we perhaps should have been using for some time to rationalise spreadsheet data, which could certainly save lots of time. Our former colleague, Richard Jones of cottagelabs, is a great advocate of these kinds of tools, as his talk made clear.
DataBaker – created in collaboration between the Office for National Statistics and ScraperWiki. This Python application can convert any ‘pretty’ spreadsheet into usable source data in CSV.
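To give a flavour of the kind of transformation DataBaker automates (this is my own minimal sketch, not DataBaker’s actual API; the grid layout and field names are invented for the example), flattening a ‘pretty’ cross-tabulated spreadsheet into tidy CSV rows might look like:

```python
import csv
import io

# A toy "pretty" spreadsheet: a title row, a year header running across the
# columns, and region labels down the side. The layout is invented here.
pretty = [
    ["Widget sales report", "",     ""],
    ["",                    "2014", "2015"],
    ["North",               "10",   "12"],
    ["South",               "7",    "9"],
]

def flatten(grid):
    """Turn the cross-tabulated grid into tidy (region, year, value) rows."""
    years = grid[1][1:]          # the header row holds the years
    rows = []
    for line in grid[2:]:        # each remaining row is one region
        region, values = line[0], line[1:]
        for year, value in zip(years, values):
            rows.append({"region": region, "year": year, "value": value})
    return rows

# Write the tidy rows out as machine-usable CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["region", "year", "value"])
writer.writeheader()
writer.writerows(flatten(pretty))
print(buf.getvalue())
```

The real tool handles far messier layouts (merged cells, footnotes, multiple header levels), but the principle is the same: pull the structure out of the presentation and emit one observation per row.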
Finally, here are a couple of interesting observations, which certainly struck a chord with me:
- It now seems quite acceptable, perhaps as a symptom of rapid development, for a CSV to be used as the master dataset; the file-based database’s day may not be over. I certainly found out about some interesting applications built in this way.
- EVERYONE suffers from problems with diacritics, glyphs, badly formatted data and what happens when you import a CSV into a spreadsheet tool that tries to be too clever. It is not just me.
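That diacritics problem usually comes down to bytes being decoded with the wrong encoding somewhere in the pipeline. A minimal Python illustration of the mojibake everyone was complaining about (the sample record is invented), and the fix of being explicit about encodings:

```python
import csv
import io

# A UTF-8 encoded CSV record containing diacritics, as it might arrive on disk.
raw = "title,creator\nThèse sur l'art,Müller\n".encode("utf-8")

# Decode with the wrong codec (Latin-1, a common spreadsheet-tool default) and
# the accented characters are silently mangled rather than raising an error.
mangled = raw.decode("latin-1")
print(mangled.splitlines()[1])  # ThÃ¨se sur l'art,MÃ¼ller

# Decode with the correct codec before parsing and the characters survive.
rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
print(rows[1][1])  # Müller
```

The nasty part is that the wrong decode succeeds without complaint, so the damage only shows up later, once the garbled values are already in your spreadsheet or database.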
All in all, an excellent couple of days, which have filled me with ideas for improving existing workflows. Hopefully the likes of the DIU will reap some benefits!
Scott Renton, Digital Developer
*The commas are intentional, by the way!