Open data repository – file size analysis

The University of Edinburgh’s open data sharing repository, DataShare, has been running since 2009. During this time, over 125 items of research data have been published online. This blog post provides a quick overview of the number, extent, and distribution of file sizes and file types held in the repository.

First, some high-level statistics (as at March 2014):

  • Number of items: 125
  • Total number of files: 1946
  • Mean number of files per item: 16
  • Total disk space used by files: 76GB (0.074TB)
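As a quick sanity check, the derived figures in the list above can be reproduced from the raw counts. This is only a minimal sketch: the function name `summarize` is hypothetical, and it assumes binary units (1TB = 1024GB), which is what the 0.074TB figure implies.

```python
def summarize(n_items, n_files, total_gb):
    """Return (mean files per item, total size in TB) from raw totals."""
    # Mean number of files per item, rounded to the nearest whole file.
    mean_files = round(n_files / n_items)
    # Assumption: binary units, i.e. 1 TB = 1024 GB.
    total_tb = round(total_gb / 1024, 3)
    return mean_files, total_tb


# Figures quoted in the post (March 2014).
print(summarize(125, 1946, 76))  # → (16, 0.074)
```

The unrounded mean is about 15.6 files per item, so the quoted value of 16 is a rounded figure.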

DataShare uses the open source DSpace repository platform. As well as storing the raw data files that are uploaded, it creates derivative files such as thumbnails of images, and plain text versions of documents such as PDF or Word files, which are then used for full-text indexing. Of the files held within DataShare, about 80% are original files and 20% are derived files (including, for example, licence attachments).

[Figure: file types]

When planning capacity for a repository, it is useful to look at the likely size of the files that may be uploaded. With research data, the assumption is often that files will be quite large. Sometimes this is true, but the next graph shows the actual distribution of files by file size. The largest proportion of files are under one tenth of a megabyte (100KB). Ignoring these small files, the sizes follow an approximately normal distribution peaking at about 100MB. The largest files are close to 2GB, but there are very few of these.

[Figure: file sizes]
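For anyone wanting to produce a similar breakdown for their own repository, the following is a minimal sketch of grouping file sizes into order-of-magnitude bands. The bucket boundaries, the names `size_bucket` and `size_histogram`, and the sample sizes are all assumptions made for illustration. They are not the method or the data used to draw the graph above.

```python
import bisect
from collections import Counter

# Bucket boundaries in bytes. Assumption: decimal units, so 100KB = 100,000 bytes.
EDGES = [100_000, 1_000_000, 10_000_000, 100_000_000, 1_000_000_000]
LABELS = ["<100KB", "100KB-1MB", "1MB-10MB", "10MB-100MB", "100MB-1GB", ">=1GB"]


def size_bucket(size_bytes):
    """Return the label of the size band that a file of this size falls into."""
    # bisect_right counts how many boundaries are <= size_bytes,
    # which is exactly the index of the matching label.
    return LABELS[bisect.bisect_right(EDGES, size_bytes)]


def size_histogram(sizes):
    """Count files per size band, keeping empty bands so the output is complete."""
    counts = Counter(size_bucket(s) for s in sizes)
    return {label: counts.get(label, 0) for label in LABELS}


# Hypothetical file sizes in bytes, not real DataShare data.
sample = [2_000, 50_000, 150_000_000, 1_900_000_000]
for label, count in size_histogram(sample).items():
    print(f"{label:>11}: {count}")
```

Using bands that grow by a factor of ten keeps the many very small files and the few very large ones visible in the same chart.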

Finally, it is interesting to look at the file formats stored in the repository. Unsurprisingly, the most common format is plain text, followed by Wave Audio files (from the Dinka Songs collection). Other common file formats include XML files, ZIP files, and JPEG images.

[Figure: file formats]
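A similar tally of formats can be approximated from file names alone. The sketch below counts files by extension, which is an assumption made for simplicity: it is not necessarily how the repository itself identifies formats, and the function name `count_formats` and the sample file names are hypothetical.

```python
from collections import Counter
from pathlib import PurePath


def count_formats(filenames):
    """Return (extension, count) pairs, most common first."""
    counts = Counter()
    for name in filenames:
        # Lower-case the extension so ".TXT" and ".txt" are counted together.
        suffix = PurePath(name).suffix.lower()
        # Files without an extension are grouped under a placeholder label.
        counts[suffix or "(none)"] += 1
    return counts.most_common()


# Hypothetical file names, not real DataShare data.
sample = ["notes.txt", "readme.TXT", "song01.wav", "LICENSE"]
print(count_formats(sample))
# → [('.txt', 2), ('.wav', 1), ('(none)', 1)]
```

Extensions are only a rough guide, since a file can be misnamed or have no extension at all.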

Stuart Lewis
Head of Research and Learning Services, Library & University Collections

Data provided by the DataShare team.
